CDN Publisher Microservice — Implementation Plan
A separate webserver (Docker container, deployed independently from BukidBountyApp) that owns the lifecycle of pushing public file_list rows to jsDelivr-fronted GitHub repos and reporting the resulting CDN URL back to the main app.
Status: planning + execution. New repo: cdn-relay (binary/CLI also named cdn-relay; the plan's prose still refers to the role as "CDN Publisher microservice").
Language / runtime
Go 1.23+. Picked for: static single-binary Docker images (FROM scratch ~15 MB), millisecond cold-starts, low memory floor (the publish loop is mostly IO + git shell-out), goroutine-friendly concurrent fetching of /bytes, and mature libraries (go-git or shelling to git, google/go-github, chi router). Faster than Node/Python for the CPU-bound bits (sha256 streaming, multipart parsing) without the build complexity of Rust.
Goals
- Decouple GitHub push operations from the main Hyperf request lifecycle (single-writer, slow, network-bound).
- Centralize repo-rotation logic (track repo sizes, allocate new repos when full) so the main app stays ignorant of CDN topology.
- Provide a backfill mode for one-shot publishing of large historical batches without touching the live request path.
- Provide a synchronous "simple upload" mode: any authorized client POSTs a file, the service dedupes by sha256 against its own publish_log, and either returns the existing CDN URL or publishes-and-returns in a single request. This makes the service usable directly (CLI, third-party integrations) without going through the BukidBountyApp file_list flow.
Non-goals
- Not a webhook responder. The main app does not push events; the microservice pulls work on its own schedule (or on operator-triggered runs).
- Not a media transformer. No resizing, transcoding, or compression.
- Not a private-asset gateway. Anything published is public, forever.
Architecture
┌────────────────────┐   1. GET unpublished     ┌───────────────────────┐
│  BukidBountyApp    │◄───────────────────────► │  CDN Publisher        │
│  (main Hyperf app) │   2. fetch bytes         │  microservice         │
│                    │                          │  (Go)                 │
│  Postgres          │                          │                       │
│   + file_content   │                          │  Local clones of      │
│   + file_list      │                          │  cdn repos            │
└────────────────────┘                          └────────────┬──────────┘
          ▲                                                  │
          │  4. POST /internal/cdn/published                 │ 3. git push
          │     { hashkey, cdn_url }                         ▼
          │                                     ┌───────────────────────┐
          └──────────────────────────────────── │  GitHub               │
                                                │  (private org/account)│
                                                │  bb-cdn-7f3a9e2c      │
                                                │  bb-cdn-1a8b3f0d      │
                                                │  …                    │
                                                └────────────┬──────────┘
                                                             │
                                                             ▼
                                                     jsDelivr CDN edge
Why pull, not webhook
- Push-based (webhook) requires the main app to retry, queue, and authenticate to the microservice. That's contention the user explicitly wants to avoid.
- Pull-based: microservice runs on a cron tick (e.g. every 30s) and asks "give me up to N unpublished rows." The main app stays a dumb data store. Failures are self-recovering — next tick re-asks.
Data contract (main app side)
Already in place after this conversation:
- file_list.is_public: boolean, default false. Only is_public = 1 AND cdn_url IS NULL rows are eligible.
- file_list.cdn_url: full jsDelivr URL written back on success.
- file_list.file_type: used for path organization in the CDN repo (e.g. app_logo/<filehash>.png vs profile_photo/<filehash>.jpg).
- file_content.filehash: sha256 of bytes; used as the file's content-addressed name in the CDN repo.
- file_content.mimetype: drives extension selection.
- file_content.size_in_bytes: used for the per-file size cap and repo-size accounting.
New endpoints on the main app
Both protected by an Authorization: Bearer <CDN_SERVICE_TOKEN> header (token in .env, validated by middleware). No user session.
GET /internal/cdn/pending?limit=50
Returns rows ready for publish:
{
"items": [
{
"hashkey": "<filelist hashkey>",
"filehash": "<sha256>",
"mimetype": "image/png",
"size_in_bytes": 142883,
"file_type": "app_logo",
"filename": "app_logo_1715260000.png"
}
]
}
Filter: is_public = 1 AND cdn_url IS NULL AND size_in_bytes <= ?max_size. Order by id ASC (FIFO, deterministic for resume). The microservice can call this in a tight loop until the items array comes back empty ([]).
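On the microservice side, the response above could be decoded into plain structs. This is a minimal sketch; the type and function names (PendingItem, PendingResponse, decodePending) are assumptions for illustration, and only the JSON field names come from the contract above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PendingItem mirrors one entry of the "items" array (hypothetical type name).
type PendingItem struct {
	Hashkey     string `json:"hashkey"`
	Filehash    string `json:"filehash"`
	Mimetype    string `json:"mimetype"`
	SizeInBytes int64  `json:"size_in_bytes"`
	FileType    string `json:"file_type"`
	Filename    string `json:"filename"`
}

// PendingResponse is the envelope returned by GET /internal/cdn/pending.
type PendingResponse struct {
	Items []PendingItem `json:"items"`
}

// decodePending parses a /pending response body.
func decodePending(body []byte) (PendingResponse, error) {
	var resp PendingResponse
	err := json.Unmarshal(body, &resp)
	return resp, err
}

func main() {
	body := []byte(`{"items":[{"hashkey":"abc","filehash":"def","mimetype":"image/png",` +
		`"size_in_bytes":142883,"file_type":"app_logo","filename":"app_logo_1715260000.png"}]}`)
	resp, err := decodePending(body)
	if err != nil {
		panic(err)
	}
	for _, it := range resp.Items {
		fmt.Println(it.Hashkey, it.FileType, it.SizeInBytes)
	}
}
```

An empty items array decodes to a zero-length slice, which is the loop's stop condition.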
GET /internal/cdn/bytes/{filelist_hashkey}
Streams the raw bytes. Same auth. Reuses the existing viewFilebyFileListHash plumbing but bypasses the CDN-redirect short-circuit (since the microservice is what populates cdn_url in the first place — it must always read from the DB).
POST /internal/cdn/published
Body:
{
"hashkey": "<filelist hashkey>",
"cdn_url": "https://cdn.jsdelivr.net/gh/<owner>/bb-cdn-7f3a9e2c@<commit-sha>/profile_photo/<filehash>.jpg"
}
Updates file_list.cdn_url for that row. Idempotent — if cdn_url already set, return 200 without overwriting (or overwrite if newer; pick one and stick to it).
POST /internal/cdn/failed (optional, v2)
Body: { hashkey, error }. Logs the failure for operator visibility. The row stays eligible for retry next tick.
Simple upload mode (synchronous, client-facing)
A second surface exposed by the same binary. Distinct from the BukidBountyApp pull loop — this is for direct clients (curl, scripts, third-party services) that want a one-shot "give me a CDN URL for this file" call.
Endpoints
All require a valid token (see Auth section below) via Authorization: Bearer <token>.
POST /v1/upload — multipart/form-data
- field file (required): the file bytes
- field file_type (optional): folder name, default misc
- field mimetype (optional): override; otherwise sniffed from bytes + filename
Flow:
- Stream body to a temp file while computing sha256.
- Look up publish_log by filehash. If found and status='reported' (or 'pushed') and cdn_url IS NOT NULL → return the existing URL immediately (no GitHub work).
- Otherwise: same publish path as the polling loop (write to the active repo's clone at {file_type}/{sha256}.{ext}, commit, push, record in publish_log, return URL).
- Response:
{ "cdn_url": "https://cdn.jsdelivr.net/gh/...", "filehash": "<sha256>", "deduped": true|false, "size_bytes": 12345 }
GET /v1/lookup/{sha256} — cheap dedup probe without uploading. Returns the existing cdn_url or 404.
GET /v1/docs — renders the API guide (HTML or markdown). Only served when a valid token is presented — unauthenticated callers get 401, never the docs. This keeps the surface unindexable.
GET /v1/health — unauthenticated, returns {"ok": true} for orchestrator probes.
Concurrency note
Simple-mode uploads share the same active repo and the same single-writer git push lock as the polling loop. A simple-mode request that arrives mid-batch waits for the lock (typically <1s; bounded by repo_max_bytes / batch_size git ops). For high-throughput callers, prefer queueing many uploads then issuing one git push — but that's a v2 optimization; v1 commits per request when not batchable.
Authentication & token management
The service has its own token store (separate from the CDN_SERVICE_TOKEN used for main-app↔microservice traffic — that one is a single shared secret in env). Tokens here are user-facing: issued, expirable, revocable, IP-scoped.
Schema
CREATE TABLE api_tokens (
  id            INTEGER PRIMARY KEY,
  token_hash    TEXT NOT NULL UNIQUE,   -- sha256 of the raw token; raw shown once at creation
  name          TEXT NOT NULL,          -- human label, e.g. "ci-pipeline"
  scopes        TEXT NOT NULL,          -- csv: "upload,lookup,docs" or "admin"
  ip_allow      TEXT,                   -- csv of CIDRs; null = any
  ip_deny       TEXT,                   -- csv of CIDRs; evaluated before allow
  expires_at    TIMESTAMPTZ,            -- null = never
  created_at    TIMESTAMPTZ DEFAULT now(),
  last_used_at  TIMESTAMPTZ,
  revoked_at    TIMESTAMPTZ
);
CREATE TABLE api_token_audit (
  id        INTEGER PRIMARY KEY,
  token_id  INTEGER REFERENCES api_tokens(id),
  ip        TEXT NOT NULL,
  path      TEXT NOT NULL,
  status    INTEGER NOT NULL,
  ts        TIMESTAMPTZ DEFAULT now()
);
Validation pipeline (every request)
- Extract bearer token → sha256 → look up api_tokens by token_hash.
- Reject if: not found, revoked_at IS NOT NULL, expires_at < now(), or scope doesn't cover the route.
- Resolve client IP. Trust X-Forwarded-For only when TRUSTED_PROXIES env lists the immediate peer; otherwise use the socket address. (Prevents spoofing the IP check.)
- If ip_deny matches → 403.
- If ip_allow is set and doesn't match → 403.
- Update last_used_at, write api_token_audit row, proceed.
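The deny-then-allow IP check in the pipeline above could look roughly like this, using the standard net/netip package. The function names (matchesAny, ipAllowed) are hypothetical. Skipping malformed CIDR entries is an assumption made to keep the sketch short; a real service would more likely reject them when the token is created.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// matchesAny reports whether ip falls inside any CIDR of a comma-separated list.
// Malformed entries are skipped (assumption, see lead-in).
func matchesAny(ip netip.Addr, csv string) bool {
	for _, part := range strings.Split(csv, ",") {
		prefix, err := netip.ParsePrefix(strings.TrimSpace(part))
		if err != nil {
			continue
		}
		if prefix.Contains(ip) {
			return true
		}
	}
	return false
}

// ipAllowed applies ip_deny first, then ip_allow.
// An empty allow list means any address is accepted.
func ipAllowed(rawIP, allowCSV, denyCSV string) bool {
	ip, err := netip.ParseAddr(rawIP)
	if err != nil {
		return false
	}
	ip = ip.Unmap() // treat IPv4-mapped IPv6 addresses as plain IPv4
	if denyCSV != "" && matchesAny(ip, denyCSV) {
		return false
	}
	if allowCSV == "" {
		return true
	}
	return matchesAny(ip, allowCSV)
}

func main() {
	fmt.Println(ipAllowed("1.2.3.4", "1.2.3.0/24", ""))
	fmt.Println(ipAllowed("1.2.3.4", "1.2.3.0/24", "1.2.3.4/32"))
}
```

A false result maps to the 403 responses in the list above.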
Admin endpoints (scope = admin)
Bootstrap admin token is generated on first boot and printed to stdout once (operator must capture it). Subsequent admin tokens issued via:
- POST /v1/admin/tokens: body { name, scopes, ip_allow?, ip_deny?, ttl_hours? }. Response includes the raw token once (never retrievable again) and the token id.
- GET /v1/admin/tokens: list (no raw values, just metadata + last-used).
- POST /v1/admin/tokens/{id}/revoke: sets revoked_at = now().
- GET /v1/admin/audit?token_id=...&limit=...: recent usage.
CLI shortcuts (same binary): cdn-relay token create --name=X --scopes=upload --ttl=720h --ip-allow=1.2.3.0/24, cdn-relay token revoke <id>, cdn-relay token list. Useful for ops when the HTTP surface itself is locked down.
Storage of raw tokens
Never. We store sha256(token) only. If lost, revoke and reissue.
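As an illustration of "store the hash only", token issuance might be sketched as follows. The 32-byte token length, the hex encoding, and the function names (newToken, hashToken) are assumptions; the plan only specifies that sha256(token) is what gets stored.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashToken returns the hex sha256 of a raw token; this is the value
// that would be written to api_tokens.token_hash.
func hashToken(raw string) string {
	sum := sha256.Sum256([]byte(raw))
	return hex.EncodeToString(sum[:])
}

// newToken generates a random token (32 bytes, hex-encoded: an assumption)
// and returns both the raw value, shown to the operator once, and its hash.
func newToken() (raw, hash string, err error) {
	buf := make([]byte, 32)
	if _, err = rand.Read(buf); err != nil {
		return "", "", err
	}
	raw = hex.EncodeToString(buf)
	return raw, hashToken(raw), nil
}

func main() {
	raw, hash, err := newToken()
	if err != nil {
		panic(err)
	}
	fmt.Println("show once:", raw)
	fmt.Println("store:    ", hash)
}
```

On each request the same hashToken function is applied to the presented bearer token before the lookup, so the raw value never needs to be persisted.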
Microservice internals
State (its own database, e.g. SQLite or Postgres)
CREATE TABLE cdn_repos (
  id                INTEGER PRIMARY KEY,
  name              TEXT NOT NULL UNIQUE,            -- "bb-cdn-7f3a9e2c"
  github_owner      TEXT NOT NULL,
  local_clone_path  TEXT NOT NULL,                   -- where it's checked out on disk
  size_used_bytes   BIGINT NOT NULL DEFAULT 0,
  is_active         BOOLEAN NOT NULL DEFAULT FALSE,  -- the current write target
  is_full           BOOLEAN NOT NULL DEFAULT FALSE,
  created_at        TIMESTAMPTZ DEFAULT now(),
  retired_at        TIMESTAMPTZ
);
CREATE TABLE publish_log (
  id                INTEGER PRIMARY KEY,
  filelist_hashkey  TEXT NOT NULL,
  filehash          TEXT NOT NULL,
  cdn_repo_id       INTEGER REFERENCES cdn_repos(id),
  commit_sha        TEXT,
  cdn_url           TEXT,
  status            TEXT NOT NULL,                   -- "pending" | "pushed" | "reported" | "failed"
  attempts          INTEGER NOT NULL DEFAULT 0,
  last_error        TEXT,
  created_at        TIMESTAMPTZ DEFAULT now(),
  updated_at        TIMESTAMPTZ DEFAULT now(),
  UNIQUE(filelist_hashkey)
);
cdn_repos.size_used_bytes is the source of truth for rotation. Recomputed by a periodic du -sb of the local clone; updated incrementally after each push.
Configuration
# config.toml
main_app_base_url = "https://bukidbounty.example.com"
main_app_token = "<env: CDN_SERVICE_TOKEN>"
github_owner = "<env: GH_OWNER>"
github_token = "<env: GH_TOKEN>"
poll_interval_sec = 30
batch_size = 50
per_file_max_bytes = 50_000_000 # 50 MB hard cap
repo_max_bytes = 800_000_000 # 800 MB rotation threshold
repo_name_prefix = "bb-cdn-"
clone_root = "/var/lib/cdn-relay/repos"
github_owner is intentionally not committed. The repo name pattern bb-cdn-<random8hex> is generated at rotation time so existing repos can't be enumerated by guessing.
Repo rotation algorithm
on each batch flush:
    active = select * from cdn_repos where is_active = 1 limit 1
    if active is null OR active.size_used_bytes >= repo_max_bytes:
        if active: mark active.is_active = 0, is_full = 1, retired_at = now()
        new_name = repo_name_prefix + random_hex(8)
        create_github_repo(new_name)    # via GitHub API, public, empty
        git_clone(new_name, clone_root/new_name)
        insert cdn_repos (name=new_name, is_active=1, …)
        active = the new row
    return active
The retired repo's existing cdn_urls never need updating — they already encode the repo name and a frozen commit SHA.
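The two pure pieces of the rotation algorithm, the threshold check and the name generation, can be sketched without any GitHub or database calls. The function names are hypothetical, and reading random_hex(8) as "8 hex characters" (4 random bytes) is an assumption based on the bb-cdn-<random8hex> pattern and the example name bb-cdn-7f3a9e2c.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// needsRotation reports whether a new repo must be allocated: either no
// active repo exists or the active one has reached the rotation threshold.
func needsRotation(hasActive bool, sizeUsedBytes, repoMaxBytes int64) bool {
	return !hasActive || sizeUsedBytes >= repoMaxBytes
}

// newRepoName returns prefix followed by 8 random hex characters.
func newRepoName(prefix string) (string, error) {
	buf := make([]byte, 4) // 4 bytes encode to 8 hex characters
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return prefix + hex.EncodeToString(buf), nil
}

func main() {
	if needsRotation(true, 800_000_000, 800_000_000) {
		name, err := newRepoName("bb-cdn-")
		if err != nil {
			panic(err)
		}
		fmt.Println("rotate to", name)
	}
}
```

Using crypto/rand rather than math/rand keeps the names unpredictable, which matches the stated goal of making repos hard to enumerate by guessing.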
Publish loop (per tick)
1. resp = GET {main_app}/internal/cdn/pending?limit=batch_size
2. for each item in resp.items:
     if publish_log row already exists for hashkey: skip
     insert publish_log (status=pending)
3. group items by active repo (rotating mid-batch if size cap hit)
4. for each item:
     bytes = GET {main_app}/internal/cdn/bytes/{hashkey}    # streamed
     ext = mimetype_to_ext(item.mimetype)
     path = "{file_type}/{filehash}.{ext}"                  # file_type used as folder
     write bytes to active.local_clone_path/path
     stage with `git add`
5. once batch staged:
     commit = git commit -m "publish batch <timestamp>"
     git push origin main
     sha = <commit sha>
6. for each item in batch:
     cdn_url = "https://cdn.jsdelivr.net/gh/{owner}/{repo}@{sha}/{path}"
     update publish_log set status=pushed, commit_sha, cdn_url
     POST {main_app}/internal/cdn/published { hashkey, cdn_url }
     update publish_log set status=reported
7. update active.size_used_bytes (incremental sum + occasional du reconciliation)
Steps 2–7 run inside a single advisory lock (flock or DB lock) so two ticks can't collide. Single-writer is the cheapest correctness guarantee.
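The path and URL construction from steps 4 and 6 is small enough to show directly. This is a sketch with a hypothetical function name (cdnURL); the misc fallback for an empty file_type follows the default described elsewhere in this plan.

```go
package main

import "fmt"

// cdnURL builds the commit-pinned jsDelivr URL for one published file.
// The repo path is "{file_type}/{filehash}.{ext}".
func cdnURL(owner, repo, commitSHA, fileType, filehash, ext string) string {
	if fileType == "" {
		fileType = "misc" // default folder when file_type is null
	}
	path := fileType + "/" + filehash + "." + ext
	return "https://cdn.jsdelivr.net/gh/" + owner + "/" + repo + "@" + commitSHA + "/" + path
}

func main() {
	// All argument values here are placeholders.
	fmt.Println(cdnURL("example-owner", "bb-cdn-7f3a9e2c", "0123abcd", "profile_photo", "deadbeef", "jpg"))
}
```

Because the commit SHA is part of the URL, the result stays valid after the repo is retired.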
Failure modes
| Failure | Recovery |
|---|---|
| Main app /pending 5xx | Skip tick, retry next |
| /bytes 404 | Mark publish_log.failed, continue batch (file was deleted between listing and fetch) |
| git push rejected | Roll back local commit (git reset --hard HEAD~1), mark batch failed, retry next tick |
| /published 5xx | Row stays in publish_log.status=pushed; reconciler re-POSTs on next tick (using commit_sha + cdn_url from log) |
| Microservice crash mid-batch | On boot, find publish_log.status=pending rows, decide: did the commit happen? `git log --oneline |
Backfill mode
Same code path. Just an operator command: cdn-relay backfill --limit=10000 that bypasses the polling sleep and runs /pending requests until the response is empty. No new logic.
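The backfill driver could be little more than a loop around the existing fetch and publish steps. In this sketch both steps are passed in as functions so the loop can be shown in isolation; the signature, the string item type, and the reading of --limit as a cap on the total number of items are all assumptions.

```go
package main

import "fmt"

// backfill pulls batches until the source returns nothing or limit items
// have been handled. fetch and publish stand in for the /pending call and
// the publish path of the polling loop (hypothetical signatures).
func backfill(fetch func(n int) ([]string, error), publish func([]string) error, batchSize, limit int) (int, error) {
	done := 0
	for done < limit {
		n := batchSize
		if remaining := limit - done; remaining < n {
			n = remaining
		}
		items, err := fetch(n)
		if err != nil {
			return done, err
		}
		if len(items) == 0 {
			return done, nil // nothing left to publish
		}
		if err := publish(items); err != nil {
			return done, err
		}
		done += len(items)
	}
	return done, nil
}

func main() {
	queue := []string{"a", "b", "c"}
	fetch := func(n int) ([]string, error) {
		if n > len(queue) {
			n = len(queue)
		}
		return queue[:n], nil
	}
	publish := func(items []string) error {
		queue = queue[len(items):]
		return nil
	}
	done, err := backfill(fetch, publish, 2, 10000)
	fmt.Println(done, err)
}
```

Stopping on the first publish error, rather than retrying, keeps the sketch from looping forever on rows that remain eligible after a failure.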
Per-file size cap
Already enforced via the size_in_bytes <= ?max_size filter in /pending — the main app never offers oversized rows. Microservice can also double-check before write.
Mime → extension table
Keep this in the microservice (not the main app), since the main app already has its own extension map for the local fallback path. They will drift; that's fine. Worst case is a .bin extension and jsDelivr serves application/octet-stream — defensive, not catastrophic, and easy to fix later.
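A minimal version of such a table, with the .bin fallback, might look like this. The specific entries and the function name extForMime are illustrative assumptions; the real list would be driven by whatever mimetypes the main app actually stores.

```go
package main

import (
	"fmt"
	"strings"
)

// mimeToExt is an illustrative subset; extend as needed.
var mimeToExt = map[string]string{
	"image/png":       "png",
	"image/jpeg":      "jpg",
	"image/gif":       "gif",
	"image/webp":      "webp",
	"image/svg+xml":   "svg",
	"application/pdf": "pdf",
}

// extForMime maps a mimetype to a file extension, falling back to "bin".
// Parameters such as "; charset=utf-8" and letter case are ignored.
func extForMime(mimetype string) string {
	base, _, _ := strings.Cut(mimetype, ";")
	base = strings.ToLower(strings.TrimSpace(base))
	if ext, ok := mimeToExt[base]; ok {
		return ext
	}
	return "bin"
}

func main() {
	fmt.Println(extForMime("image/png"))
	fmt.Println(extForMime("application/x-unknown"))
}
```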
Local-machine v0 (before the microservice exists)
The is_public, file_type, cdn_url, and resolvedUrl() plumbing in this conversation is enough to support a manual publish workflow today:
# Hand-edit DB to flip is_public=1 on a known row
psql -c "UPDATE file_list SET is_public = 1 WHERE hashkey = '...';"
# Manually copy the bytes to a local cdn repo clone, commit, push, capture the commit sha
cp ./tmp/<filehash>.png ~/cdn-repos/bb-cdn-7f3a9e2c/app_logo/<filehash>.png
cd ~/cdn-repos/bb-cdn-7f3a9e2c
git add . && git commit -m "manual" && git push
SHA=$(git rev-parse HEAD)
# Hand-write the cdn_url back
psql -c "UPDATE file_list SET cdn_url = 'https://cdn.jsdelivr.net/gh/<owner>/bb-cdn-7f3a9e2c@${SHA}/app_logo/<filehash>.png' WHERE hashkey = '...';"
Tedious but proves the redirect path end-to-end before committing to the microservice build.
A small artisan command (php artisan cdn:publish-manual <filelist_hashkey> <cdn_url>) could wrap step 3 to avoid raw SQL — easy to add later, out of scope for this plan.
Open decisions for the microservice conversation
- Language/runtime: decided, Go 1.23+.
- Hosting: Docker Compose alongside main app (easiest), or separate Dokploy/Hetzner box. Needs persistent volume for repo clones.
- Mimetype-to-folder rules: file_type defaults to misc/ when null (both polling and simple-upload modes).
- Commit batching: one commit per /pending batch for the polling loop; one commit per request for simple-upload mode (v1). Revisit if push rate becomes a bottleneck.
- Repo creation: dedicated GitHub machine user with a PAT scoped to repo. Token stored in env, never in DB.
- Public visibility check: refuse to mark a repo is_active if GitHub API reports it as private.
- Rate limiting: simple-upload mode needs per-token rate limits (e.g. 60/min, 10 MB/s), implemented as an in-memory token bucket keyed by token_id. v1 uses a single global default; per-token overrides are v2.
Summary of what already exists in BukidBountyApp to support this
- file_list.cdn_url (migration 2026_05_09_120000_add_cdn_url_to_file_list.php)
- file_list.is_public (default false) and file_list.file_type (migration 2026_05_09_120100_add_is_public_and_file_type_to_file_list.php)
- FileList::resolvedUrl(): prefers CDN URL when set, otherwise local route
- FilesMainController::viewFilebyFileListHash: 302-redirects to CDN URL when set, so all existing <img :src="'/RequestData/File/' + hash"> references benefit transparently
- FilesMainController::generateURLforFileListHash: returns CDN URL when set in DB
- FilesMainController::uploadFileList: accepts a ?string $file_type parameter; every existing caller sets one explicitly (or null for the generic UploadFilefromRequest endpoint)
Still missing (to be built when the microservice is built):
- /internal/cdn/pending, /internal/cdn/bytes/{hash}, /internal/cdn/published endpoints + bearer middleware
- Management UI for flipping is_public and assigning file_type to existing rows
- The microservice itself