initial: bootstrap from BukidBountyApp base

Jonathan Sykes
2026-06-06 18:43:00 +08:00
commit eb4a5731fb
5674 changed files with 160857 additions and 0 deletions

# CDN Publisher Microservice — Implementation Plan
A separate webserver (Docker container, deployed independently from BukidBountyApp) that owns the lifecycle of pushing public `file_list` rows to jsDelivr-fronted GitHub repos and reporting the resulting CDN URL back to the main app.
**Status:** planning + execution. New repo: **`cdn-relay`** (binary/CLI also named `cdn-relay`; the plan's prose still refers to the *role* as "CDN Publisher microservice").
## Language / runtime
**Go 1.23+.** Picked for: static single-binary Docker images (`FROM scratch` ~15 MB), millisecond cold-starts, low memory floor (the publish loop is mostly IO + git shell-out), goroutine-friendly concurrent fetching of `/bytes`, and mature libraries (`go-git` or shelling to `git`, `google/go-github`, `chi` router). Faster than Node/Python for the CPU-bound bits (sha256 streaming, multipart parsing) without the build complexity of Rust.
---
## Goals
1. Decouple GitHub push operations from the main Hyperf request lifecycle (single-writer, slow, network-bound).
2. Centralize repo-rotation logic (track repo sizes, allocate new repos when full) so the main app stays ignorant of CDN topology.
3. Provide a backfill mode for one-shot publishing of large historical batches without touching the live request path.
4. Provide a **synchronous "simple upload" mode**: any authorized client `POST`s a file, the service dedupes by sha256 against its own `publish_log`, and either returns the existing CDN URL or publishes-and-returns in a single request. This makes the service usable directly (CLI, third-party integrations) without going through the BukidBountyApp `file_list` flow.
## Non-goals
- Not a webhook responder. The main app does **not** push events; the microservice **pulls** work on its own schedule (or on operator-triggered runs).
- Not a media transformer. No resizing, transcoding, or compression.
- Not a private-asset gateway. Anything published is public, forever.
---
## Architecture
```
┌────────────────────┐   1. GET unpublished     ┌───────────────────────┐
│ BukidBountyApp     │◄───────────────────────► │ CDN Publisher         │
│ (main Hyperf app)  │   2. fetch bytes         │ microservice          │
│                    │                          │ (Go)                  │
│ Postgres           │                          │                       │
│  + file_content    │                          │ Local clones of       │
│  + file_list       │                          │ cdn repos             │
└────────────────────┘                          └────────────┬──────────┘
          ▲                                                  │
          │ 4. POST /internal/cdn/published                  │ 3. git push
          │    { hashkey, cdn_url }                          ▼
          │                                     ┌───────────────────────┐
          └──────────────────────────────────── │ GitHub                │
                                                │ (private org/account) │
                                                │ bb-cdn-7f3a9e2c       │
                                                │ bb-cdn-1a8b3f0d       │
                                                │ …                     │
                                                └────────────┬──────────┘
                                                     jsDelivr CDN edge
```
### Why pull, not webhook
- Push-based (webhook) requires the main app to retry, queue, and authenticate to the microservice. That's contention the user explicitly wants to avoid.
- Pull-based: microservice runs on a cron tick (e.g. every 30s) and asks "give me up to N unpublished rows." The main app stays a dumb data store. Failures are self-recovering — next tick re-asks.
---
## Data contract (main app side)
Already in place after this conversation:
- `file_list.is_public` — boolean, default false. Only `is_public = 1 AND cdn_url IS NULL` rows are eligible.
- `file_list.cdn_url` — full jsDelivr URL written back on success.
- `file_list.file_type` — used for path organization in the CDN repo (e.g. `app_logo/<filehash>.png` vs `profile_photo/<filehash>.jpg`).
- `file_content.filehash` — sha256 of bytes; used as the file's content-addressed name in the CDN repo.
- `file_content.mimetype` — drives extension selection.
- `file_content.size_in_bytes` — used for the per-file size cap and repo-size accounting.
### New endpoints on the main app
Both protected by an `Authorization: Bearer <CDN_SERVICE_TOKEN>` header (token in `.env`, validated by middleware). No user session.
**`GET /internal/cdn/pending?limit=50`**
Returns rows ready for publish:
```json
{
  "items": [
    {
      "hashkey": "<filelist hashkey>",
      "filehash": "<sha256>",
      "mimetype": "image/png",
      "size_in_bytes": 142883,
      "file_type": "app_logo",
      "filename": "app_logo_1715260000.png"
    }
  ]
}
```
Filter: `is_public = 1 AND cdn_url IS NULL AND size_in_bytes <= ?max_size`. Order by `id ASC` (FIFO, deterministic for resume). The microservice can call this in a tight loop until `items` comes back empty (`[]`).
**`GET /internal/cdn/bytes/{filelist_hashkey}`**
Streams the raw bytes. Same auth. Reuses the existing `viewFilebyFileListHash` plumbing but bypasses the CDN-redirect short-circuit (since the microservice is what populates `cdn_url` in the first place — it must always read from the DB).
**`POST /internal/cdn/published`**
Body:
```json
{
  "hashkey": "<filelist hashkey>",
  "cdn_url": "https://cdn.jsdelivr.net/gh/<owner>/bb-cdn-7f3a9e2c@<commit-sha>/profile_photo/<filehash>.jpg"
}
```
Updates `file_list.cdn_url` for that row. Idempotent — if `cdn_url` already set, return 200 without overwriting (or overwrite if newer; pick one and stick to it).
**`POST /internal/cdn/failed`** *(optional, v2)*
Body: `{ hashkey, error }`. Logs the failure for operator visibility. The row stays eligible for retry next tick.
---
## Simple upload mode (synchronous, client-facing)
A second surface exposed by the same binary. Distinct from the BukidBountyApp pull loop — this is for direct clients (curl, scripts, third-party services) that want a one-shot "give me a CDN URL for this file" call.
### Endpoints
All require a valid token (see Auth section below) via `Authorization: Bearer <token>`.
**`POST /v1/upload`** — multipart/form-data
- field `file` (required): the file bytes
- field `file_type` (optional): folder name, default `misc`
- field `mimetype` (optional): override; otherwise sniffed from bytes + filename
Flow:
1. Stream body to a temp file while computing sha256.
2. Lookup `publish_log` by `filehash`. If found and `status='reported'` (or `'pushed'`) and `cdn_url IS NOT NULL` → return the existing URL immediately (no GitHub work).
3. Otherwise: same publish path as the polling loop — write to active repo's clone at `{file_type}/{sha256}.{ext}`, commit, push, record in `publish_log`, return URL.
4. Response:
```json
{ "cdn_url": "https://cdn.jsdelivr.net/gh/...", "filehash": "<sha256>", "deduped": true|false, "size_bytes": 12345 }
```
**`GET /v1/lookup/{sha256}`** — cheap dedup probe without uploading. Returns the existing `cdn_url` or 404.
**`GET /v1/docs`** — renders the API guide (HTML or markdown). **Only served when a valid token is presented** — unauthenticated callers get 401, never the docs. This keeps the surface unindexable.
**`GET /v1/health`** — unauthenticated, returns `{"ok": true}` for orchestrator probes.
### Concurrency note
Simple-mode uploads share the same active repo and the same single-writer `git push` lock as the polling loop. A simple-mode request that arrives mid-batch waits for the lock (typically <1s; bounded by `repo_max_bytes / batch_size` git ops). High-throughput callers would benefit from queueing many uploads and then issuing one `git push`, but that's a v2 optimization; v1 commits per request when not batchable.
---
## Authentication & token management
The service has its own token store, separate from the `CDN_SERVICE_TOKEN` used for main-app ↔ microservice traffic (that one is a single shared secret in env). Tokens here are user-facing: issued, expirable, revocable, IP-scoped.
### Schema
```sql
CREATE TABLE api_tokens (
    id            INTEGER PRIMARY KEY,
    token_hash    TEXT NOT NULL UNIQUE,   -- sha256 of the raw token; raw shown once at creation
    name          TEXT NOT NULL,          -- human label, e.g. "ci-pipeline"
    scopes        TEXT NOT NULL,          -- csv: "upload,lookup,docs" or "admin"
    ip_allow      TEXT,                   -- csv of CIDRs; null = any
    ip_deny       TEXT,                   -- csv of CIDRs; evaluated before allow
    expires_at    TIMESTAMPTZ,            -- null = never
    created_at    TIMESTAMPTZ DEFAULT now(),
    last_used_at  TIMESTAMPTZ,
    revoked_at    TIMESTAMPTZ
);

CREATE TABLE api_token_audit (
    id        INTEGER PRIMARY KEY,
    token_id  INTEGER REFERENCES api_tokens(id),
    ip        TEXT NOT NULL,
    path      TEXT NOT NULL,
    status    INTEGER NOT NULL,
    ts        TIMESTAMPTZ DEFAULT now()
);
```
### Validation pipeline (every request)
1. Extract bearer token → sha256 → look up `api_tokens` by `token_hash`.
2. Reject if: not found, `revoked_at IS NOT NULL`, `expires_at < now()`, or scope doesn't cover the route.
3. Resolve client IP. Trust `X-Forwarded-For` only when `TRUSTED_PROXIES` env lists the immediate peer; otherwise use the socket address. (Prevents spoofing the IP check.)
4. If `ip_deny` matches → 403.
5. If `ip_allow` is set and doesn't match → 403.
6. Update `last_used_at`, write `api_token_audit` row, proceed.
### Admin endpoints (scope = `admin`)
Bootstrap admin token is generated on first boot and printed to stdout once (operator must capture it). Subsequent admin tokens issued via:
- **`POST /v1/admin/tokens`**: body `{ name, scopes, ip_allow?, ip_deny?, ttl_hours? }`. Response includes the **raw token once** (never retrievable again) and the token id.
- **`GET /v1/admin/tokens`**: list (no raw values, just metadata + last-used).
- **`POST /v1/admin/tokens/{id}/revoke`**: sets `revoked_at = now()`.
- **`GET /v1/admin/audit?token_id=...&limit=...`**: recent usage.
CLI shortcuts (same binary): `cdn-relay token create --name=X --scopes=upload --ttl=720h --ip-allow=1.2.3.0/24`, `cdn-relay token revoke <id>`, `cdn-relay token list`. Useful for ops when the HTTP surface itself is locked down.
### Storage of raw tokens
Never. We store `sha256(token)` only. If lost, revoke and reissue.
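A minimal sketch of issuing a token under this rule. The 32-byte length and hex encoding are assumptions (the plan does not fix a token format); `newToken` is a hypothetical name.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// newToken returns a random raw token (shown to the operator once)
// and the sha256 hex digest that would be stored in api_tokens.token_hash.
func newToken() (raw string, hash string, err error) {
	buf := make([]byte, 32) // assumed length: 256 bits of entropy
	if _, err = rand.Read(buf); err != nil {
		return "", "", err
	}
	raw = hex.EncodeToString(buf)
	sum := sha256.Sum256([]byte(raw))
	return raw, hex.EncodeToString(sum[:]), nil
}

func main() {
	raw, hash, err := newToken()
	if err != nil {
		panic(err)
	}
	fmt.Println("show once:", raw)
	fmt.Println("store:    ", hash)
}
```

Because only the digest is persisted, a database leak does not expose usable bearer tokens.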
---
## Microservice internals
### State (its own database, e.g. SQLite or Postgres)
```sql
CREATE TABLE cdn_repos (
    id                INTEGER PRIMARY KEY,
    name              TEXT NOT NULL UNIQUE,         -- "bb-cdn-7f3a9e2c"
    github_owner      TEXT NOT NULL,
    local_clone_path  TEXT NOT NULL,                -- where it's checked out on disk
    size_used_bytes   BIGINT NOT NULL DEFAULT 0,
    is_active         BOOLEAN NOT NULL DEFAULT 0,   -- the current write target
    is_full           BOOLEAN NOT NULL DEFAULT 0,
    created_at        TIMESTAMPTZ DEFAULT now(),
    retired_at        TIMESTAMPTZ
);

CREATE TABLE publish_log (
    id                INTEGER PRIMARY KEY,
    filelist_hashkey  TEXT NOT NULL,
    filehash          TEXT NOT NULL,
    cdn_repo_id       INTEGER REFERENCES cdn_repos(id),
    commit_sha        TEXT,
    cdn_url           TEXT,
    status            TEXT NOT NULL,                -- "pending" | "pushed" | "reported" | "failed"
    attempts          INTEGER NOT NULL DEFAULT 0,
    last_error        TEXT,
    created_at        TIMESTAMPTZ DEFAULT now(),
    updated_at        TIMESTAMPTZ DEFAULT now(),
    UNIQUE(filelist_hashkey)
);
```
`cdn_repos.size_used_bytes` is the source of truth for rotation. Recomputed by a periodic `du -sb` of the local clone; updated incrementally after each push.
### Configuration
```toml
# config.toml
main_app_base_url = "https://bukidbounty.example.com"
main_app_token = "<env: CDN_SERVICE_TOKEN>"
github_owner = "<env: GH_OWNER>"
github_token = "<env: GH_TOKEN>"
poll_interval_sec = 30
batch_size = 50
per_file_max_bytes = 50_000_000 # 50 MB hard cap
repo_max_bytes = 800_000_000 # 800 MB rotation threshold
repo_name_prefix = "bb-cdn-"
clone_root = "/var/lib/cdn-relay/repos"
```
`github_owner` is intentionally not committed. The repo name pattern `bb-cdn-<random8hex>` is generated at rotation time so existing repos can't be enumerated by guessing.
### Repo rotation algorithm
```
on each batch flush:
    active = select * from cdn_repos where is_active = 1 limit 1
    if active is null OR active.size_used_bytes >= repo_max_bytes:
        if active: mark active.is_active = 0, is_full = 1, retired_at = now()
        new_name = repo_name_prefix + random_hex(8)
        create_github_repo(new_name)   # via GitHub API, public, empty
        git_clone(new_name, clone_root/new_name)
        insert cdn_repos (name=new_name, is_active=1, …)
        active = the new row
    return active
```
The retired repo's existing `cdn_url`s never need updating, because they already encode the repo name and a frozen commit SHA.
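The two pure decisions in the rotation algorithm (when to rotate, and how to name the new repo) can be sketched in Go as follows. `Repo`, `needsRotation` and `newRepoName` are illustrative names; the GitHub API call and the clone step are deliberately left out.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// Repo holds the cdn_repos columns relevant to rotation.
type Repo struct {
	Name          string
	SizeUsedBytes int64
}

// needsRotation mirrors the pseudocode condition:
// no active repo yet, or the active one has reached the threshold.
func needsRotation(active *Repo, repoMaxBytes int64) bool {
	return active == nil || active.SizeUsedBytes >= repoMaxBytes
}

// newRepoName returns prefix + 8 random hex characters (4 random bytes),
// e.g. "bb-cdn-7f3a9e2c".
func newRepoName(prefix string) (string, error) {
	buf := make([]byte, 4)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return prefix + hex.EncodeToString(buf), nil
}

func main() {
	active := &Repo{Name: "bb-cdn-7f3a9e2c", SizeUsedBytes: 800_000_000}
	if needsRotation(active, 800_000_000) {
		name, err := newRepoName("bb-cdn-")
		if err != nil {
			panic(err)
		}
		fmt.Println("rotate to", name)
	}
}
```

Using `crypto/rand` rather than a seeded PRNG keeps the names unguessable, which is the point of the random suffix.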
### Publish loop (per tick)
```
1. resp = GET {main_app}/internal/cdn/pending?limit=batch_size
2. for each item in resp.items:
       if publish_log row already exists for hashkey: skip
       insert publish_log (status=pending)
3. group items by active repo (rotating mid-batch if size cap hit)
4. for each item:
       bytes = GET {main_app}/internal/cdn/bytes/{hashkey}   # streamed
       ext   = mimetype_to_ext(item.mimetype)
       path  = "{file_type}/{filehash}.{ext}"                # file_type used as folder
       write bytes to active.local_clone_path/path
       stage with `git add`
5. once batch staged:
       commit = git commit -m "publish batch <timestamp>"
       git push origin main
       sha = <commit sha>
6. for each item in batch:
       cdn_url = "https://cdn.jsdelivr.net/gh/{owner}/{repo}@{sha}/{path}"
       update publish_log set status=pushed, commit_sha, cdn_url
       POST {main_app}/internal/cdn/published { hashkey, cdn_url }
       update publish_log set status=reported
7. update active.size_used_bytes (incremental sum + occasional du reconciliation)
```
Steps 2-7 run inside a single advisory lock (`flock` or DB lock) so two ticks can't collide. Single-writer is the cheapest correctness guarantee.
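A small sketch of the path and URL construction from steps 4 and 6, plus a single-writer guard. Note the guard here is an in-process `sync.Mutex`, swapped in for illustration; it is not the `flock` or DB lock the plan names and would not protect against a second process. The helper names are assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// objectPath builds "{file_type}/{filehash}.{ext}", defaulting the folder to "misc".
func objectPath(fileType, filehash, ext string) string {
	if fileType == "" {
		fileType = "misc"
	}
	return fileType + "/" + filehash + "." + ext
}

// cdnURL pins the jsDelivr URL to a specific commit SHA.
func cdnURL(owner, repo, sha, path string) string {
	return fmt.Sprintf("https://cdn.jsdelivr.net/gh/%s/%s@%s/%s", owner, repo, sha, path)
}

var publishMu sync.Mutex

// runTick runs work only if no other tick holds the lock.
// It returns false when the tick was skipped.
func runTick(work func()) bool {
	if !publishMu.TryLock() {
		return false
	}
	defer publishMu.Unlock()
	work()
	return true
}

func main() {
	ran := runTick(func() {
		p := objectPath("app_logo", "ab12", "png")
		fmt.Println(cdnURL("owner", "bb-cdn-7f3a9e2c", "deadbeef", p))
	})
	fmt.Println("tick ran:", ran)
}
```

Skipping an overlapping tick is safe here because the next tick re-asks `/pending` for anything left unpublished.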
### Failure modes
| Failure | Recovery |
| --- | --- |
| Main app `/pending` 5xx | Skip tick, retry next |
| `/bytes` 404 | Mark `publish_log.status=failed`, continue batch (file was deleted between listing and fetch) |
| `git push` rejected | Roll back local commit (`git reset --hard HEAD~1`), mark batch failed, retry next tick |
| `/published` 5xx | Row stays in `publish_log.status=pushed`; reconciler re-POSTs on next tick (using `commit_sha` + `cdn_url` from log) |
| Microservice crash mid-batch | On boot, find `publish_log.status=pending` rows and decide whether the commit happened: compare `git log -1 --oneline` against the last known sha. If a new commit exists with our staged paths, mark pushed and report; else reset and retry |
### Backfill mode
Same code path. Just an operator command: `cdn-relay backfill --limit=10000` that bypasses the polling sleep and runs `/pending` requests until the response is empty. No new logic.
### Per-file size cap
Already enforced via the `size_in_bytes <= ?max_size` filter in `/pending`: the main app never offers oversized rows. The microservice can also double-check before write.
### Mime → extension table
Keep this in the microservice (not the main app), since the main app already has its own extension map for the local fallback path. They will drift; that's fine. Worst case is a `.bin` extension and jsDelivr serves `application/octet-stream`: defensive, not catastrophic, and easy to fix later.
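A minimal sketch of such a table with the `.bin` fallback. The specific entries are assumptions chosen for illustration; the plan does not list the mimetypes to support.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// mimeExt is an illustrative subset; extend as new types appear.
var mimeExt = map[string]string{
	"image/png":       "png",
	"image/jpeg":      "jpg",
	"image/gif":       "gif",
	"image/webp":      "webp",
	"image/svg+xml":   "svg",
	"application/pdf": "pdf",
}

// mimetypeToExt maps a mimetype to a file extension, falling back to "bin".
func mimetypeToExt(mimetype string) string {
	// ParseMediaType lowercases the type and drops parameters such as "; charset=utf-8".
	mt, _, err := mime.ParseMediaType(mimetype)
	if err != nil {
		mt = strings.ToLower(strings.TrimSpace(mimetype))
	}
	if ext, ok := mimeExt[mt]; ok {
		return ext
	}
	return "bin"
}

func main() {
	for _, m := range []string{"image/png", "IMAGE/JPEG; q=1", "application/x-unknown"} {
		fmt.Println(m, "->", mimetypeToExt(m))
	}
}
```

An explicit map is used instead of `mime.ExtensionsByType` so the output does not depend on the host's MIME database.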
---
## Local-machine v0 (before the microservice exists)
The `is_public`, `file_type`, `cdn_url`, and `resolvedUrl()` plumbing in this conversation is enough to support a **manual** publish workflow today:
```bash
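# Placeholders: fill these in for the row being published
HASHKEY='...'            # file_list.hashkey
FILEHASH='<filehash>'    # file_content.filehash (sha256)
OWNER='<owner>'          # GitHub owner of the cdn repo
REPO='bb-cdn-7f3a9e2c'
# 1. Hand-edit DB to flip is_public on a known row (boolean column, so use true rather than 1)
psql -c "UPDATE file_list SET is_public = true WHERE hashkey = '${HASHKEY}';"
# 2. Manually copy the bytes to a local cdn repo clone, commit, push, capture the commit sha
mkdir -p ~/cdn-repos/"${REPO}"/app_logo
cp "./tmp/${FILEHASH}.png" ~/cdn-repos/"${REPO}"/app_logo/"${FILEHASH}.png"
cd ~/cdn-repos/"${REPO}"
git add . && git commit -m "manual" && git push
SHA=$(git rev-parse HEAD)
# 3. Hand-write the cdn_url back
psql -c "UPDATE file_list SET cdn_url = 'https://cdn.jsdelivr.net/gh/${OWNER}/${REPO}@${SHA}/app_logo/${FILEHASH}.png' WHERE hashkey = '${HASHKEY}';"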
```
Tedious but proves the redirect path end-to-end before committing to the microservice build.
A small artisan command (`php artisan cdn:publish-manual <filelist_hashkey> <cdn_url>`) could wrap step 3 to avoid raw SQL. Easy to add later, out of scope for this plan.
---
## Open decisions for the microservice conversation
1. ~~**Language/runtime**~~: **Decided — Go 1.23+.**
2. **Hosting**: Docker Compose alongside main app (easiest), or separate Dokploy/Hetzner box. Needs persistent volume for repo clones.
3. **Mimetype-to-folder rules**: `file_type` defaults to `misc/` when null (both polling and simple-upload modes).
4. **Commit batching**: one commit per `/pending` batch for the polling loop; one commit per request for simple-upload mode (v1). Revisit if push rate becomes a bottleneck.
5. **Repo creation**: dedicated GitHub machine user with a PAT scoped to `repo`. Token stored in env, never in DB.
6. **Public visibility check**: refuse to mark a repo `is_active` if GitHub API reports it as private.
7. **Rate limiting**: simple-upload mode needs per-token rate limits (e.g. `60/min`, `10MB/s`), implemented as an in-memory token bucket keyed by `token_id`. v1 uses a single global default; per-token overrides are v2.
---
## Summary of what already exists in BukidBountyApp to support this
- `file_list.cdn_url` (migration `2026_05_09_120000_add_cdn_url_to_file_list.php`)
- `file_list.is_public` (default false) and `file_list.file_type` (migration `2026_05_09_120100_add_is_public_and_file_type_to_file_list.php`)
- `FileList::resolvedUrl()`: prefers CDN URL when set, otherwise local route
- `FilesMainController::viewFilebyFileListHash`: 302-redirects to CDN URL when set, so all existing `<img :src="'/RequestData/File/' + hash">` references benefit transparently
- `FilesMainController::generateURLforFileListHash`: returns CDN URL when set in DB
- `FilesMainController::uploadFileList` accepts a `?string $file_type` parameter; every existing caller sets one explicitly (or null for the generic `UploadFilefromRequest` endpoint)
Still missing (to be built when the microservice is built):
- `/internal/cdn/pending`, `/internal/cdn/bytes/{hash}`, `/internal/cdn/published` endpoints + bearer middleware
- Management UI for flipping `is_public` and assigning `file_type` to existing rows
- The microservice itself