# CDN Publisher Microservice: Implementation Plan

A separate webserver (Docker container, deployed independently from BukidBountyApp) that owns the lifecycle of pushing public `file_list` rows to jsDelivr-fronted GitHub repos and reporting the resulting CDN URL back to the main app.

**Status:** planning + execution. New repo: **`cdn-relay`** (binary/CLI also named `cdn-relay`; the plan's prose still refers to the *role* as "CDN Publisher microservice").

## Language / runtime

**Go 1.23+.** Picked for: static single-binary Docker images (`FROM scratch` ~15 MB), millisecond cold-starts, low memory floor (the publish loop is mostly IO + git shell-out), goroutine-friendly concurrent fetching of `/bytes`, and mature libraries (`go-git` or shelling to `git`, `google/go-github`, `chi` router). Faster than Node/Python for the CPU-bound bits (sha256 streaming, multipart parsing) without the build complexity of Rust.

---

## Goals

1. Decouple GitHub push operations from the main Hyperf request lifecycle (single-writer, slow, network-bound).
2. Centralize repo-rotation logic (track repo sizes, allocate new repos when full) so the main app stays ignorant of CDN topology.
3. Provide a backfill mode for one-shot publishing of large historical batches without touching the live request path.
4. Provide a **synchronous "simple upload" mode**: any authorized client `POST`s a file, the service dedupes by sha256 against its own `publish_log`, and either returns the existing CDN URL or publishes-and-returns in a single request. This makes the service usable directly (CLI, third-party integrations) without going through the BukidBountyApp `file_list` flow.

## Non-goals

- Not a webhook responder. The main app does **not** push events; the microservice **pulls** work on its own schedule (or on operator-triggered runs).
- Not a media transformer. No resizing, transcoding, or compression.
- Not a private-asset gateway. Anything published is public, forever.
---

## Architecture

```
┌────────────────────┐  1. GET unpublished      ┌───────────────────────┐
│  BukidBountyApp    │◄───────────────────────► │  CDN Publisher        │
│  (main Hyperf app) │  2. fetch bytes          │  microservice         │
│                    │                          │  (Go)                 │
│  Postgres          │                          │                       │
│   + file_content   │                          │  Local clones of      │
│   + file_list      │                          │  cdn repos            │
└────────────────────┘                          └────────────┬──────────┘
          ▲                                                  │
          │ 4. POST /internal/cdn/published                  │ 3. git push
          │    { hashkey, cdn_url }                          ▼
          │                                     ┌───────────────────────┐
          └──────────────────────────────────── │  GitHub               │
                                                │  (private org/account)│
                                                │   bb-cdn-7f3a9e2c     │
                                                │   bb-cdn-1a8b3f0d     │
                                                │   …                   │
                                                └────────────┬──────────┘
                                                             │
                                                             ▼
                                                     jsDelivr CDN edge
```

### Why pull, not webhook

- Push-based (webhook) requires the main app to retry, queue, and authenticate to the microservice. That's contention the user explicitly wants to avoid.
- Pull-based: microservice runs on a cron tick (e.g. every 30s) and asks "give me up to N unpublished rows." The main app stays a dumb data store. Failures are self-recovering: the next tick simply re-asks.

---

## Data contract (main app side)

Already in place after this conversation:

- `file_list.is_public`: boolean, default false. Only `is_public = true AND cdn_url IS NULL` rows are eligible.
- `file_list.cdn_url`: full jsDelivr URL written back on success.
- `file_list.file_type`: used for path organization in the CDN repo (e.g. `app_logo/{filehash}.png` vs `profile_photo/{filehash}.jpg`).
- `file_content.filehash`: sha256 of bytes; used as the file's content-addressed name in the CDN repo.
- `file_content.mimetype`: drives extension selection.
- `file_content.size_in_bytes`: used for the per-file size cap and repo-size accounting.

### New endpoints on the main app

All of the endpoints below are protected by an `Authorization: Bearer <token>` header (token in `.env`, validated by middleware). No user session.
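As a rough illustration of that bearer-token contract, the sketch below shows how the Go side might attach the shared token to a main-app request, plus a constant-time check of the kind the receiving middleware would need. The helper names `newMainAppRequest` and `bearerMatches` are hypothetical, and the real validation lives in the main app's PHP middleware, so the second function only mirrors that logic for illustration.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"fmt"
	"net/http"
	"strings"
)

// newMainAppRequest (hypothetical helper) builds a body-less request to the
// main app and attaches the shared bearer token from config.
func newMainAppRequest(method, baseURL, path, token string) (*http.Request, error) {
	req, err := http.NewRequest(method, strings.TrimRight(baseURL, "/")+path, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	return req, nil
}

// bearerMatches (hypothetical helper) reports whether an Authorization header
// carries the expected token. Both sides are hashed first so the comparison
// runs in constant time regardless of token length.
func bearerMatches(header, expected string) bool {
	got, ok := strings.CutPrefix(header, "Bearer ")
	if !ok || got == "" {
		return false
	}
	a := sha256.Sum256([]byte(got))
	b := sha256.Sum256([]byte(expected))
	return subtle.ConstantTimeCompare(a[:], b[:]) == 1
}

func main() {
	req, err := newMainAppRequest(http.MethodGet,
		"https://bukidbounty.example.com", "/internal/cdn/pending?limit=50", "s3cret")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.URL.String())
	fmt.Println(bearerMatches(req.Header.Get("Authorization"), "s3cret"))
}
```

The token value `s3cret` is a placeholder; in the plan it comes from `main_app_token` in the microservice config.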
**`GET /internal/cdn/pending?limit=50`**

Returns rows ready for publish:

```json
{
  "items": [
    {
      "hashkey": "",
      "filehash": "",
      "mimetype": "image/png",
      "size_in_bytes": 142883,
      "file_type": "app_logo",
      "filename": "app_logo_1715260000.png"
    }
  ]
}
```

Filter: `is_public = true AND cdn_url IS NULL AND size_in_bytes <= ?max_size`. Order by `id ASC` (FIFO, deterministic for resume). The microservice can call this in a tight loop until `items` comes back empty.

**`GET /internal/cdn/bytes/{filelist_hashkey}`**

Streams the raw bytes. Same auth. Reuses the existing `viewFilebyFileListHash` plumbing but bypasses the CDN-redirect short-circuit (the microservice is what populates `cdn_url` in the first place, so it must always read from the DB).

**`POST /internal/cdn/published`**

Body:

```json
{
  "hashkey": "",
  "cdn_url": "https://cdn.jsdelivr.net/gh/{owner}/bb-cdn-7f3a9e2c@{sha}/profile_photo/{filehash}.jpg"
}
```

Updates `file_list.cdn_url` for that row. Idempotent: if `cdn_url` is already set, return 200 without overwriting (or overwrite if newer; pick one and stick to it).

**`POST /internal/cdn/failed`** *(optional, v2)*

Body: `{ hashkey, error }`. Logs the failure for operator visibility. The row stays eligible for retry next tick.

---

## Simple upload mode (synchronous, client-facing)

A second surface exposed by the same binary. Distinct from the BukidBountyApp pull loop: this is for direct clients (curl, scripts, third-party services) that want a one-shot "give me a CDN URL for this file" call.

### Endpoints

All except `/v1/health` require a valid token (see Auth section below) via `Authorization: Bearer <token>`.

**`POST /v1/upload`** (multipart/form-data)

- field `file` (required): the file bytes
- field `file_type` (optional): folder name, default `misc`
- field `mimetype` (optional): override; otherwise sniffed from bytes + filename

Flow:

1. Stream body to a temp file while computing sha256.
2. Look up `publish_log` by `filehash`. If found and `status='reported'` (or `'pushed'`) and `cdn_url IS NOT NULL`, return the existing URL immediately (no GitHub work).
3. Otherwise: same publish path as the polling loop. Write to the active repo's clone at `{file_type}/{sha256}.{ext}`, commit, push, record in `publish_log`, return URL.
4. Response:

   ```json
   {
     "cdn_url": "https://cdn.jsdelivr.net/gh/...",
     "filehash": "",
     "deduped": true|false,
     "size_bytes": 12345
   }
   ```

**`GET /v1/lookup/{sha256}`**: cheap dedup probe without uploading. Returns the existing `cdn_url` or 404.

**`GET /v1/docs`**: renders the API guide (HTML or markdown). **Only served when a valid token is presented**; unauthenticated callers get 401, never the docs. This keeps the surface unindexable.

**`GET /v1/health`**: unauthenticated, returns `{"ok": true}` for orchestrator probes.

### Concurrency note

Simple-mode uploads share the same active repo and the same single-writer `git push` lock as the polling loop. A simple-mode request that arrives mid-batch waits for the lock (typically <1s; bounded by `repo_max_bytes / batch_size` git ops). For high-throughput callers, prefer queueing many uploads then issuing one `git push`, but that's a v2 optimization; v1 commits per request when not batchable.

---

## Authentication & token management

The service has its own token store, separate from the `CDN_SERVICE_TOKEN` used for main-app↔microservice traffic (that one is a single shared secret in env). Tokens here are user-facing: issued, expirable, revocable, IP-scoped.

### Schema

```sql
-- Defaults use FALSE / CURRENT_TIMESTAMP so the DDL runs on both SQLite and Postgres.
CREATE TABLE api_tokens (
    id            INTEGER PRIMARY KEY,
    token_hash    TEXT NOT NULL UNIQUE,  -- sha256 of the raw token; raw shown once at creation
    name          TEXT NOT NULL,         -- human label, e.g. "ci-pipeline"
    scopes        TEXT NOT NULL,         -- csv: "upload,lookup,docs" or "admin"
    ip_allow      TEXT,                  -- csv of CIDRs; null = any
    ip_deny       TEXT,                  -- csv of CIDRs; evaluated before allow
    expires_at    TIMESTAMPTZ,           -- null = never
    created_at    TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
    last_used_at  TIMESTAMPTZ,
    revoked_at    TIMESTAMPTZ
);

CREATE TABLE api_token_audit (
    id        INTEGER PRIMARY KEY,
    token_id  INTEGER REFERENCES api_tokens(id),
    ip        TEXT NOT NULL,
    path      TEXT NOT NULL,
    status    INTEGER NOT NULL,
    ts        TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP
);
```

### Validation pipeline (every request)

1. Extract bearer token → sha256 → lookup `api_tokens` by `token_hash`.
2. Reject if: not found, `revoked_at IS NOT NULL`, `expires_at < now()`, or scope doesn't cover the route.
3. Resolve client IP. Trust `X-Forwarded-For` only when `TRUSTED_PROXIES` env lists the immediate peer; otherwise use the socket address. (Prevents spoofing the IP check.)
4. If `ip_deny` matches → 403.
5. If `ip_allow` is set and doesn't match → 403.
6. Update `last_used_at`, write `api_token_audit` row, proceed.

### Admin endpoints (scope = `admin`)

Bootstrap admin token is generated on first boot and printed to stdout once (operator must capture it). Subsequent admin tokens are issued via:

- **`POST /v1/admin/tokens`**: body `{ name, scopes, ip_allow?, ip_deny?, ttl_hours? }`. Response includes the **raw token once** (never retrievable again) and the token id.
- **`GET /v1/admin/tokens`**: list (no raw values, just metadata + last-used).
- **`POST /v1/admin/tokens/{id}/revoke`**: sets `revoked_at = now()`.
- **`GET /v1/admin/audit?token_id=...&limit=...`**: recent usage.

CLI shortcuts (same binary): `cdn-relay token create --name=X --scopes=upload --ttl=720h --ip-allow=1.2.3.0/24`, `cdn-relay token revoke <id>`, `cdn-relay token list`. Useful for ops when the HTTP surface itself is locked down.

### Storage of raw tokens

Never. We store `sha256(token)` only. If lost, revoke and reissue.

---

## Microservice internals

### State (its own database, e.g. SQLite or Postgres)

```sql
-- Defaults use FALSE / CURRENT_TIMESTAMP so the DDL runs on both SQLite and Postgres.
CREATE TABLE cdn_repos (
    id                INTEGER PRIMARY KEY,
    name              TEXT NOT NULL UNIQUE,            -- "bb-cdn-7f3a9e2c"
    github_owner      TEXT NOT NULL,
    local_clone_path  TEXT NOT NULL,                   -- where it's checked out on disk
    size_used_bytes   BIGINT NOT NULL DEFAULT 0,
    is_active         BOOLEAN NOT NULL DEFAULT FALSE,  -- the current write target
    is_full           BOOLEAN NOT NULL DEFAULT FALSE,
    created_at        TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
    retired_at        TIMESTAMPTZ
);

CREATE TABLE publish_log (
    id                INTEGER PRIMARY KEY,
    filelist_hashkey  TEXT NOT NULL,
    filehash          TEXT NOT NULL,
    cdn_repo_id       INTEGER REFERENCES cdn_repos(id),
    commit_sha        TEXT,
    cdn_url           TEXT,
    status            TEXT NOT NULL,  -- "pending" | "pushed" | "reported" | "failed"
    attempts          INTEGER NOT NULL DEFAULT 0,
    last_error        TEXT,
    created_at        TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
    updated_at        TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(filelist_hashkey)
);
```

`cdn_repos.size_used_bytes` is the source of truth for rotation. It is updated incrementally after each push and reconciled by a periodic `du -sb` of the local clone.

### Configuration

```toml
# config.toml
main_app_base_url  = "https://bukidbounty.example.com"
main_app_token     = ""
github_owner       = ""
github_token       = ""
poll_interval_sec  = 30
batch_size         = 50
per_file_max_bytes = 50_000_000   # 50 MB hard cap
repo_max_bytes     = 800_000_000  # 800 MB rotation threshold
repo_name_prefix   = "bb-cdn-"
clone_root         = "/var/lib/cdn-relay/repos"
```

`github_owner` is intentionally not committed. The repo name (`repo_name_prefix` plus a random hex suffix, e.g. `bb-cdn-7f3a9e2c`) is generated at rotation time so existing repos can't be enumerated by guessing.
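A minimal sketch of that name generation, assuming the suffix is 8 hex characters as in the example repo names. The function name `newRepoName` and the validation regexp are hypothetical; only `crypto/rand` and `encoding/hex` from the standard library are used.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"regexp"
)

// repoNameRe describes the expected shape of a generated name (assumed:
// the "bb-cdn-" prefix followed by exactly 8 lowercase hex characters).
var repoNameRe = regexp.MustCompile(`^bb-cdn-[0-9a-f]{8}$`)

// newRepoName (hypothetical helper) returns prefix plus hexChars random hex
// characters drawn from the OS CSPRNG, so names are not predictable.
func newRepoName(prefix string, hexChars int) (string, error) {
	if hexChars <= 0 || hexChars%2 != 0 {
		return "", fmt.Errorf("hexChars must be a positive even number, got %d", hexChars)
	}
	buf := make([]byte, hexChars/2) // two hex characters per random byte
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return prefix + hex.EncodeToString(buf), nil
}

func main() {
	name, err := newRepoName("bb-cdn-", 8)
	if err != nil {
		panic(err)
	}
	fmt.Println(name, repoNameRe.MatchString(name))
}
```

Eight hex characters give 32 bits of randomness; if that turns out to be too guessable, the suffix length is the only knob that needs to change.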
### Repo rotation algorithm

```
on each batch flush:
    active = select * from cdn_repos where is_active = true limit 1
    if active is null OR active.size_used_bytes >= repo_max_bytes:
        if active: mark active.is_active = false, is_full = true, retired_at = now()
        new_name = repo_name_prefix + random_hex(8)
        create_github_repo(new_name)              # via GitHub API, public, empty
        git_clone(new_name, clone_root/new_name)
        insert cdn_repos (name=new_name, is_active=true, …)
        active = the new row
    return active
```

The retired repo's existing `cdn_url`s never need updating, because they already encode the repo name and a frozen commit SHA.

### Publish loop (per tick)

```
1. resp = GET {main_app}/internal/cdn/pending?limit=batch_size
2. for each item in resp.items:
       if publish_log row already exists for hashkey: skip
       insert publish_log (status=pending)
3. group items by active repo (rotating mid-batch if size cap hit)
4. for each item:
       bytes = GET {main_app}/internal/cdn/bytes/{hashkey}   # streamed
       ext   = mimetype_to_ext(item.mimetype)
       path  = "{file_type}/{filehash}.{ext}"                # file_type used as folder
       write bytes to active.local_clone_path/path
       stage with `git add`
5. once batch staged:
       commit = git commit -m "publish batch "
       git push origin main
       sha = git rev-parse HEAD
6. for each item in batch:
       cdn_url = "https://cdn.jsdelivr.net/gh/{owner}/{repo}@{sha}/{path}"
       update publish_log set status=pushed, commit_sha, cdn_url
       POST {main_app}/internal/cdn/published { hashkey, cdn_url }
       update publish_log set status=reported
7. update active.size_used_bytes (incremental sum + occasional du reconciliation)
```

Steps 2–7 run inside a single advisory lock (`flock` or DB lock) so two ticks can't collide. Single-writer is the cheapest correctness guarantee.
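Steps 4 and 6 of the publish loop mostly come down to two pure string functions: deriving the in-repo path and building the pinned jsDelivr URL. The sketch below is one way to write them, under stated assumptions: the function names (`extFor`, `repoPath`, `cdnURL`) and the small mime table are illustrative only, the `misc` folder and `.bin` extension are the fallbacks this plan describes, and the owner, repo, and sha values in `main` are made up.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// extByMime is a deliberately tiny, illustrative mime → extension table.
var extByMime = map[string]string{
	"image/png":     "png",
	"image/jpeg":    "jpg",
	"image/webp":    "webp",
	"image/svg+xml": "svg",
}

// extFor maps a mimetype to an extension, ignoring parameters such as
// "; charset=utf-8" and falling back to "bin" for unknown types.
func extFor(mimetype string) string {
	base := strings.ToLower(strings.TrimSpace(strings.SplitN(mimetype, ";", 2)[0]))
	if ext, ok := extByMime[base]; ok {
		return ext
	}
	return "bin"
}

// repoPath builds "{file_type}/{filehash}.{ext}", using "misc" when the
// file type is empty.
func repoPath(fileType, filehash, mimetype string) string {
	if fileType == "" {
		fileType = "misc"
	}
	return path.Join(fileType, filehash+"."+extFor(mimetype))
}

// cdnURL builds the commit-pinned jsDelivr URL for a path in a GitHub repo.
func cdnURL(owner, repo, sha, repoPath string) string {
	return fmt.Sprintf("https://cdn.jsdelivr.net/gh/%s/%s@%s/%s", owner, repo, sha, repoPath)
}

func main() {
	p := repoPath("app_logo", "abc123", "image/png")
	fmt.Println(cdnURL("example-owner", "bb-cdn-7f3a9e2c", "deadbeef", p))
}
```

Keeping these free of IO makes them easy to unit-test and lets the simple-upload path reuse them unchanged.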
### Failure modes

| Failure | Recovery |
| --- | --- |
| Main app `/pending` 5xx | Skip tick, retry next |
| `/bytes` 404 | Mark `publish_log.status=failed`, continue batch (file was deleted between listing and fetch) |
| `git push` rejected | Roll back local commit (`git reset --hard HEAD~1`), mark batch failed, retry next tick |
| `/published` 5xx | Row stays in `publish_log.status=pushed`; reconciler re-POSTs on next tick (using `commit_sha` + `cdn_url` from log) |
| Microservice crash mid-batch | On boot, find `publish_log.status=pending` rows and decide whether the commit happened: compare `git log -1 --oneline` against the last known sha. If a new commit exists with our staged paths, mark pushed and report; else reset and retry |

### Backfill mode

Same code path. Just an operator command: `cdn-relay backfill --limit=10000` that bypasses the polling sleep and runs `/pending` requests until the response is empty. No new logic.

### Per-file size cap

Already enforced via the `size_in_bytes <= ?max_size` filter in `/pending`, so the main app never offers oversized rows. The microservice can also double-check before write.

### Mime → extension table

Keep this in the microservice (not the main app), since the main app already has its own extension map for the local fallback path. They will drift; that's fine. Worst case is a `.bin` extension and jsDelivr serves `application/octet-stream`: defensive, not catastrophic, and easy to fix later.

---

## Local-machine v0 (before the microservice exists)

The `is_public`, `file_type`, `cdn_url`, and `resolvedUrl()` plumbing in this conversation is enough to support a **manual** publish workflow today:

```bash
# Hand-edit DB to flip is_public to true on a known row
psql -c "UPDATE file_list SET is_public = true WHERE hashkey = '...';"

# Manually copy the bytes to a local cdn repo clone, commit, push, capture the commit sha
cp ./tmp/.png ~/cdn-repos/bb-cdn-7f3a9e2c/app_logo/.png
cd ~/cdn-repos/bb-cdn-7f3a9e2c
git add . && git commit -m "manual" && git push
SHA=$(git rev-parse HEAD)

# Hand-write the cdn_url back
psql -c "UPDATE file_list SET cdn_url = 'https://cdn.jsdelivr.net/gh//bb-cdn-7f3a9e2c@${SHA}/app_logo/.png' WHERE hashkey = '...';"
```

Tedious but proves the redirect path end-to-end before committing to the microservice build. A small artisan command (`php artisan cdn:publish-manual`) could wrap step 3 to avoid raw SQL. Easy to add later, out of scope for this plan.

---

## Open decisions for the microservice conversation

1. ~~**Language/runtime**~~: **Decided: Go 1.23+.**
2. **Hosting**: Docker Compose alongside main app (easiest), or separate Dokploy/Hetzner box. Needs persistent volume for repo clones.
3. **Mimetype-to-folder rules**: `file_type` defaults to `misc/` when null (both polling and simple-upload modes).
4. **Commit batching**: one commit per `/pending` batch for the polling loop; one commit per request for simple-upload mode (v1). Revisit if push rate becomes a bottleneck.
5. **Repo creation**: dedicated GitHub machine user with a PAT scoped to `repo`. Token stored in env, never in DB.
6. **Public visibility check**: refuse to mark a repo `is_active` if GitHub API reports it as private.
7. **Rate limiting**: simple-upload mode needs per-token rate limits (e.g. `60/min`, `10MB/s`), implemented as an in-memory token bucket keyed by `token_id`. v1 uses a single global default; per-token overrides are v2.
---

## Summary of what already exists in BukidBountyApp to support this

- `file_list.cdn_url` (migration `2026_05_09_120000_add_cdn_url_to_file_list.php`)
- `file_list.is_public` (default false) and `file_list.file_type` (migration `2026_05_09_120100_add_is_public_and_file_type_to_file_list.php`)
- `FileList::resolvedUrl()`: prefers CDN URL when set, otherwise local route
- `FilesMainController::viewFilebyFileListHash`: 302-redirects to CDN URL when set, so all existing references benefit transparently
- `FilesMainController::generateURLforFileListHash`: returns CDN URL when set in DB
- `FilesMainController::uploadFileList`: accepts a `?string $file_type` parameter; every existing caller sets one explicitly (or null for the generic `UploadFilefromRequest` endpoint)

Still missing (to be built when the microservice is built):

- `/internal/cdn/pending`, `/internal/cdn/bytes/{hash}`, `/internal/cdn/published` endpoints + bearer middleware
- Management UI for flipping `is_public` and assigning `file_type` to existing rows
- The microservice itself