BarangaySystem/docs/tasks/cdn-microservice-plan.md
2026-06-06 18:43:00 +08:00


CDN Publisher Microservice — Implementation Plan

A separate webserver (Docker container, deployed independently from BukidBountyApp) that owns the lifecycle of pushing public file_list rows to jsDelivr-fronted GitHub repos and reporting the resulting CDN URL back to the main app.

Status: planning + execution. New repo: cdn-relay (binary/CLI also named cdn-relay; the plan's prose still refers to the role as "CDN Publisher microservice").

Language / runtime

Go 1.23+. Picked for: static single-binary Docker images (FROM scratch ~15 MB), millisecond cold-starts, low memory floor (the publish loop is mostly IO + git shell-out), goroutine-friendly concurrent fetching of /bytes, and mature libraries (go-git or shelling to git, google/go-github, chi router). Faster than Node/Python for the CPU-bound bits (sha256 streaming, multipart parsing) without the build complexity of Rust.


Goals

  1. Decouple GitHub push operations from the main Hyperf request lifecycle (single-writer, slow, network-bound).
  2. Centralize repo-rotation logic (track repo sizes, allocate new repos when full) so the main app stays ignorant of CDN topology.
  3. Provide a backfill mode for one-shot publishing of large historical batches without touching the live request path.
  4. Provide a synchronous "simple upload" mode: any authorized client POSTs a file, the service dedupes by sha256 against its own publish_log, and either returns the existing CDN URL or publishes-and-returns in a single request. This makes the service usable directly (CLI, third-party integrations) without going through the BukidBountyApp file_list flow.

Non-goals

  • Not a webhook responder. The main app does not push events; the microservice pulls work on its own schedule (or on operator-triggered runs).
  • Not a media transformer. No resizing, transcoding, or compression.
  • Not a private-asset gateway. Anything published is public, forever.

Architecture

┌────────────────────┐    1. GET unpublished    ┌───────────────────────┐
│  BukidBountyApp    │◄───────────────────────► │  CDN Publisher        │
│  (main Hyperf app) │    2. fetch bytes         │  microservice         │
│                    │                          │  (Go 1.23+)           │
│  Postgres          │                          │                       │
│  + file_content    │                          │  Local clones of      │
│  + file_list       │                          │  cdn repos            │
└────────────────────┘                          └────────────┬──────────┘
         ▲                                                   │
         │ 4. POST /internal/cdn/published                   │ 3. git push
         │    { hashkey, cdn_url }                           ▼
         │                                       ┌───────────────────────┐
         └────────────────────────────────────── │  GitHub               │
                                                 │  (private org/account)│
                                                 │  bb-cdn-7f3a9e2c      │
                                                 │  bb-cdn-1a8b3f0d      │
                                                 │  …                    │
                                                 └────────────┬──────────┘
                                                              │
                                                              ▼
                                                 jsDelivr CDN edge

Why pull, not webhook

  • Push-based (webhook) requires the main app to retry, queue, and authenticate to the microservice. That's contention the user explicitly wants to avoid.
  • Pull-based: microservice runs on a cron tick (e.g. every 30s) and asks "give me up to N unpublished rows." The main app stays a dumb data store. Failures are self-recovering — next tick re-asks.

Data contract (main app side)

Already in place after this conversation:

  • file_list.is_public — boolean, default false. Only is_public = 1 AND cdn_url IS NULL rows are eligible.
  • file_list.cdn_url — full jsDelivr URL written back on success.
  • file_list.file_type — used for path organization in the CDN repo (e.g. app_logo/<filehash>.png vs profile_photo/<filehash>.jpg).
  • file_content.filehash — sha256 of bytes; used as the file's content-addressed name in the CDN repo.
  • file_content.mimetype — drives extension selection.
  • file_content.size_in_bytes — used for the per-file size cap and repo-size accounting.

New endpoints on the main app

Both protected by an Authorization: Bearer <CDN_SERVICE_TOKEN> header (token in .env, validated by middleware). No user session.

GET /internal/cdn/pending?limit=50 Returns rows ready for publish:

{
  "items": [
    {
      "hashkey": "<filelist hashkey>",
      "filehash": "<sha256>",
      "mimetype": "image/png",
      "size_in_bytes": 142883,
      "file_type": "app_logo",
      "filename": "app_logo_1715260000.png"
    }
  ]
}

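Filter: is_public = 1 AND cdn_url IS NULL AND size_in_bytes <= ?max_size. Order by id ASC (FIFO, deterministic for resume). The microservice can call this in a tight loop until items comes back empty. Rows it has already picked up keep appearing until their cdn_url is written back, so the publish loop skips any hashkey that already has a publish_log row.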
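A minimal sketch of how the Go side might decode this response. The type and function names (PendingItem, PendingResponse, decodePending) are illustrative and not part of the plan; the JSON field names are taken from the example above. The size check is the optional client-side double-check, not a replacement for the server-side filter.

```go
package main

import "encoding/json"

// PendingItem mirrors one element of "items" in the /internal/cdn/pending response.
// Hypothetical type name; the field names follow the JSON example in this plan.
type PendingItem struct {
	Hashkey     string `json:"hashkey"`
	Filehash    string `json:"filehash"`
	Mimetype    string `json:"mimetype"`
	SizeInBytes int64  `json:"size_in_bytes"`
	FileType    string `json:"file_type"` // a JSON null decodes to "" here
	Filename    string `json:"filename"`
}

// PendingResponse is the envelope returned by the endpoint.
type PendingResponse struct {
	Items []PendingItem `json:"items"`
}

// decodePending parses a /pending body and drops any item larger than maxBytes.
func decodePending(body []byte, maxBytes int64) ([]PendingItem, error) {
	var resp PendingResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	kept := make([]PendingItem, 0, len(resp.Items))
	for _, it := range resp.Items {
		if it.SizeInBytes <= maxBytes {
			kept = append(kept, it)
		}
	}
	return kept, nil
}
```

An empty result from decodePending is the signal to stop looping until the next tick.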

GET /internal/cdn/bytes/{filelist_hashkey} Streams the raw bytes. Same auth. Reuses the existing viewFilebyFileListHash plumbing but bypasses the CDN-redirect short-circuit (since the microservice is what populates cdn_url in the first place — it must always read from the DB).

POST /internal/cdn/published Body:

{
  "hashkey": "<filelist hashkey>",
  "cdn_url": "https://cdn.jsdelivr.net/gh/<owner>/bb-cdn-7f3a9e2c@<commit-sha>/profile_photo/<filehash>.jpg"
}

Updates file_list.cdn_url for that row. Idempotent — if cdn_url already set, return 200 without overwriting (or overwrite if newer; pick one and stick to it).

POST /internal/cdn/failed (optional, v2) Body: { hashkey, error }. Logs the failure for operator visibility. The row stays eligible for retry next tick.



Simple upload mode (synchronous, client-facing)

A second surface exposed by the same binary. Distinct from the BukidBountyApp pull loop — this is for direct clients (curl, scripts, third-party services) that want a one-shot "give me a CDN URL for this file" call.

Endpoints

All require a valid token (see Auth section below) via Authorization: Bearer <token>.

POST /v1/upload — multipart/form-data

  • field file (required): the file bytes
  • field file_type (optional): folder name, default misc
  • field mimetype (optional): override; otherwise sniffed from bytes + filename

Flow:

  1. Stream body to a temp file while computing sha256.
  2. Lookup publish_log by filehash. If found and status='reported' (or 'pushed') and cdn_url IS NOT NULL → return the existing URL immediately (no GitHub work).
  3. Otherwise: same publish path as the polling loop — write to active repo's clone at {file_type}/{sha256}.{ext}, commit, push, record in publish_log, return URL.
  4. Response:
{ "cdn_url": "https://cdn.jsdelivr.net/gh/...", "filehash": "<sha256>", "deduped": true|false, "size_bytes": 12345 }

GET /v1/lookup/{sha256} — cheap dedup probe without uploading. Returns the existing cdn_url or 404.

GET /v1/docs — renders the API guide (HTML or markdown). Only served when a valid token is presented — unauthenticated callers get 401, never the docs. This keeps the surface unindexable.

GET /v1/health — unauthenticated, returns {"ok": true} for orchestrator probes.

Concurrency note

Simple-mode uploads share the same active repo and the same single-writer git push lock as the polling loop. A simple-mode request that arrives mid-batch waits for the lock (typically <1s; bounded by repo_max_bytes / batch_size git ops). For high-throughput callers, prefer queueing many uploads then issuing one git push — but that's a v2 optimization; v1 commits per request when not batchable.


Authentication & token management

The service has its own token store (separate from the CDN_SERVICE_TOKEN used for main-app↔microservice traffic — that one is a single shared secret in env). Tokens here are user-facing: issued, expirable, revocable, IP-scoped.

Schema

CREATE TABLE api_tokens (
  id INTEGER PRIMARY KEY,
  token_hash TEXT NOT NULL UNIQUE,       -- sha256 of the raw token; raw shown once at creation
  name TEXT NOT NULL,                    -- human label, e.g. "ci-pipeline"
  scopes TEXT NOT NULL,                  -- csv: "upload,lookup,docs" or "admin"
  ip_allow TEXT,                         -- csv of CIDRs; null = any
  ip_deny TEXT,                          -- csv of CIDRs; evaluated before allow
  expires_at TIMESTAMPTZ,                -- null = never
  created_at TIMESTAMPTZ DEFAULT now(),
  last_used_at TIMESTAMPTZ,
  revoked_at TIMESTAMPTZ
);

CREATE TABLE api_token_audit (
  id INTEGER PRIMARY KEY,
  token_id INTEGER REFERENCES api_tokens(id),
  ip TEXT NOT NULL,
  path TEXT NOT NULL,
  status INTEGER NOT NULL,
  ts TIMESTAMPTZ DEFAULT now()
);

Validation pipeline (every request)

  1. Extract bearer token → sha256 → lookup api_tokens by token_hash.
  2. Reject if: not found, revoked_at IS NOT NULL, expires_at < now(), or scope doesn't cover the route.
  3. Resolve client IP. Trust X-Forwarded-For only when TRUSTED_PROXIES env lists the immediate peer; otherwise use the socket address. (Prevents spoofing the IP check.)
  4. If ip_deny matches → 403.
  5. If ip_allow is set and doesn't match → 403.
  6. Update last_used_at, write api_token_audit row, proceed.
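Steps 3 to 5 could look something like the sketch below. It assumes TRUSTED_PROXIES is a CSV of CIDRs (the plan does not fix its format) and that the right-most X-Forwarded-For entry is the one appended by the trusted proxy. The function names are illustrative. Malformed CIDR entries are skipped here; a real implementation would more likely reject them when the token is created.

```go
package main

import (
	"net/netip"
	"strings"
)

// matchesAny reports whether ip falls inside any CIDR in a CSV list.
// Entries that fail to parse are ignored in this sketch.
func matchesAny(ip netip.Addr, csv string) bool {
	for _, part := range strings.Split(csv, ",") {
		p, err := netip.ParsePrefix(strings.TrimSpace(part))
		if err == nil && p.Contains(ip) {
			return true
		}
	}
	return false
}

// clientIP resolves the caller's address (step 3). The X-Forwarded-For header
// is honored only when the socket peer is inside trustedCSV.
func clientIP(remoteAddr, xff, trustedCSV string) string {
	ap, err := netip.ParseAddrPort(remoteAddr)
	if err != nil {
		return ""
	}
	peer := ap.Addr().Unmap()
	if xff == "" || !matchesAny(peer, trustedCSV) {
		return peer.String()
	}
	parts := strings.Split(xff, ",")
	return strings.TrimSpace(parts[len(parts)-1])
}

// ipAllowed applies steps 4 and 5: deny rules first, then the allow list if set.
func ipAllowed(ipStr, allowCSV, denyCSV string) bool {
	ip, err := netip.ParseAddr(ipStr)
	if err != nil {
		return false
	}
	ip = ip.Unmap() // treat IPv4-mapped IPv6 addresses as plain IPv4
	if denyCSV != "" && matchesAny(ip, denyCSV) {
		return false
	}
	if allowCSV != "" && !matchesAny(ip, allowCSV) {
		return false
	}
	return true
}
```

A false result from ipAllowed maps to the 403 in steps 4 and 5; token lookup and scope checks are not covered here.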

Admin endpoints (scope = admin)

Bootstrap admin token is generated on first boot and printed to stdout once (operator must capture it). Subsequent admin tokens issued via:

  • POST /v1/admin/tokens — body: { name, scopes, ip_allow?, ip_deny?, ttl_hours? }. Response includes the raw token once (never retrievable again) and the token id.
  • GET /v1/admin/tokens — list (no raw values, just metadata + last-used).
  • POST /v1/admin/tokens/{id}/revoke — sets revoked_at = now().
  • GET /v1/admin/audit?token_id=...&limit=... — recent usage.

CLI shortcuts (same binary): cdn-relay token create --name=X --scopes=upload --ttl=720h --ip-allow=1.2.3.0/24, cdn-relay token revoke <id>, cdn-relay token list. Useful for ops when the HTTP surface itself is locked down.

Storage of raw tokens

Never. We store sha256(token) only. If lost, revoke and reissue.
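As an illustration of the issue-once, store-the-hash approach, a sketch under the assumption that raw tokens are 32 random bytes rendered as hex (the plan does not specify a token format). The names newToken and hashToken are hypothetical.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
)

// hashToken returns the value that would be stored in api_tokens.token_hash.
func hashToken(raw string) string {
	sum := sha256.Sum256([]byte(raw))
	return hex.EncodeToString(sum[:])
}

// newToken generates a raw token (shown to the operator once) and its hash
// (the only part persisted). The 32-byte length is an assumption.
func newToken() (raw, hash string, err error) {
	buf := make([]byte, 32)
	if _, err = rand.Read(buf); err != nil {
		return "", "", err
	}
	raw = hex.EncodeToString(buf)
	return raw, hashToken(raw), nil
}
```

On each request the validation pipeline would call hashToken on the presented bearer value and look the result up by token_hash.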


Microservice internals

State (its own database, e.g. SQLite or Postgres)

CREATE TABLE cdn_repos (
  id INTEGER PRIMARY KEY,
  name TEXT NOT NULL UNIQUE,         -- "bb-cdn-7f3a9e2c"
  github_owner TEXT NOT NULL,
  local_clone_path TEXT NOT NULL,    -- where it's checked out on disk
  size_used_bytes BIGINT NOT NULL DEFAULT 0,
  is_active BOOLEAN NOT NULL DEFAULT FALSE, -- the current write target
  is_full BOOLEAN NOT NULL DEFAULT FALSE,
  created_at TIMESTAMPTZ DEFAULT now(),
  retired_at TIMESTAMPTZ
);

CREATE TABLE publish_log (
  id INTEGER PRIMARY KEY,
  filelist_hashkey TEXT NOT NULL,
  filehash TEXT NOT NULL,
  cdn_repo_id INTEGER REFERENCES cdn_repos(id),
  commit_sha TEXT,
  cdn_url TEXT,
  status TEXT NOT NULL,              -- "pending" | "pushed" | "reported" | "failed"
  attempts INTEGER NOT NULL DEFAULT 0,
  last_error TEXT,
  created_at TIMESTAMPTZ DEFAULT now(),
  updated_at TIMESTAMPTZ DEFAULT now(),
  UNIQUE(filelist_hashkey)
);

cdn_repos.size_used_bytes is the source of truth for rotation. Recomputed by a periodic du -sb of the local clone; updated incrementally after each push.

Configuration

# config.toml
main_app_base_url   = "https://bukidbounty.example.com"
main_app_token      = "<env: CDN_SERVICE_TOKEN>"
github_owner        = "<env: GH_OWNER>"
github_token        = "<env: GH_TOKEN>"
poll_interval_sec   = 30
batch_size          = 50
per_file_max_bytes  = 50_000_000          # 50 MB hard cap
repo_max_bytes      = 800_000_000         # 800 MB rotation threshold
repo_name_prefix    = "bb-cdn-"
clone_root          = "/var/lib/cdn-relay/repos"

github_owner is intentionally not committed. The repo name pattern bb-cdn-<random8hex> is generated at rotation time so existing repos can't be enumerated by guessing.

Repo rotation algorithm

on each batch flush:
  active = select * from cdn_repos where is_active = 1 limit 1
  if active is null OR active.size_used_bytes >= repo_max_bytes:
    if active: mark active.is_active = 0, is_full = 1, retired_at = now()
    new_name = repo_name_prefix + random_hex(8)
    create_github_repo(new_name)        # via GitHub API, public, empty
    git_clone(new_name, clone_root/new_name)
    insert cdn_repos (name=new_name, is_active=1, …)
    active = the new row
  return active

The retired repo's existing cdn_urls never need updating — they already encode the repo name and a frozen commit SHA.
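The decision and naming parts of the rotation algorithm can be sketched in Go as below. The Repo struct is a trimmed, hypothetical stand-in for a cdn_repos row, and the GitHub API call and git clone are deliberately left out.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
)

// Repo is a hypothetical in-memory view of a cdn_repos row.
type Repo struct {
	Name          string
	SizeUsedBytes int64
}

// needsRotation mirrors the pseudocode condition: rotate when there is no
// active repo or the active one has reached repo_max_bytes.
func needsRotation(active *Repo, repoMaxBytes int64) bool {
	return active == nil || active.SizeUsedBytes >= repoMaxBytes
}

// newRepoName returns prefix plus 8 random hex characters (4 random bytes),
// matching names such as "bb-cdn-7f3a9e2c".
func newRepoName(prefix string) (string, error) {
	b := make([]byte, 4)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return prefix + hex.EncodeToString(b), nil
}
```

Using crypto/rand rather than math/rand keeps the suffix unpredictable, which is the point of the non-enumerable naming scheme.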

Publish loop (per tick)

1. resp = GET {main_app}/internal/cdn/pending?limit=batch_size
2. for each item in resp.items:
     if publish_log row already exists for hashkey: skip
     insert publish_log (status=pending)
3. group items by active repo (rotating mid-batch if size cap hit)
4. for each item:
     bytes = GET {main_app}/internal/cdn/bytes/{hashkey}   # streamed
     ext   = mimetype_to_ext(item.mimetype)
     path  = "{file_type}/{filehash}.{ext}"   # file_type used as folder
     write bytes to active.local_clone_path/path
     stage with `git add`
5. once batch staged:
     commit = git commit -m "publish batch <timestamp>"
     git push origin main
     sha = <commit sha>
6. for each item in batch:
     cdn_url = "https://cdn.jsdelivr.net/gh/{owner}/{repo}@{sha}/{path}"
     update publish_log set status=pushed, commit_sha, cdn_url
     POST {main_app}/internal/cdn/published { hashkey, cdn_url }
     update publish_log set status=reported
7. update active.size_used_bytes (incremental sum + occasional du reconciliation)

Steps 2-7 run inside a single advisory lock (flock or DB lock) so two ticks can't collide. Single-writer is the cheapest correctness guarantee.
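The path and URL construction from steps 4 and 6 is simple enough to pin down in code. This is a sketch with illustrative function names; the misc default for an empty file_type follows the rule listed under the open decisions.

```go
package main

import "fmt"

// repoPath builds the in-repo path "{file_type}/{filehash}.{ext}".
// An empty fileType falls back to "misc".
func repoPath(fileType, filehash, ext string) string {
	if fileType == "" {
		fileType = "misc"
	}
	return fmt.Sprintf("%s/%s.%s", fileType, filehash, ext)
}

// cdnURL builds the jsDelivr URL pinned to a specific commit SHA.
func cdnURL(owner, repo, commitSHA, path string) string {
	return fmt.Sprintf("https://cdn.jsdelivr.net/gh/%s/%s@%s/%s", owner, repo, commitSHA, path)
}
```

Pinning the URL to the commit SHA rather than a branch name is what lets retired repos keep serving their files unchanged.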

Failure modes

| Failure | Recovery |
|---|---|
| Main app /pending 5xx | Skip tick, retry next |
| /bytes 404 | Mark publish_log.failed, continue batch (file was deleted between listing and fetch) |
| git push rejected | Roll back local commit (git reset --hard HEAD~1), mark batch failed, retry next tick |
| /published 5xx | Row stays in publish_log.status=pushed; reconciler re-POSTs on next tick (using commit_sha + cdn_url from log) |
| Microservice crash mid-batch | On boot, find publish_log.status=pending rows, decide: did the commit happen? `git log --oneline

Backfill mode

Same code path. Just an operator command: cdn-relay backfill --limit=10000 that bypasses the polling sleep and runs /pending requests until the response is empty. No new logic.
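A rough sketch of the backfill driver, assuming the fetch and publish steps are passed in as functions (in the real service these would be the /pending call and the publish loop body). The signature of runBackfill is hypothetical. Note that it relies on published rows dropping out of /pending; a batch that fails without returning an error would be fetched again.

```go
package main

// runBackfill drains pending work in batches until the source is empty or
// limit items have been handled. It returns the number of items processed.
func runBackfill(limit, batchSize int, fetch func(n int) ([]string, error), publish func(hashkeys []string) error) (int, error) {
	done := 0
	for done < limit {
		n := batchSize
		if remaining := limit - done; remaining < n {
			n = remaining
		}
		batch, err := fetch(n)
		if err != nil {
			return done, err
		}
		if len(batch) == 0 {
			return done, nil // nothing left to publish
		}
		if err := publish(batch); err != nil {
			return done, err
		}
		done += len(batch)
	}
	return done, nil
}
```

With --limit=10000 and batch_size = 50 this would issue at most 200 /pending requests back to back, with no sleep between them.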

Per-file size cap

Already enforced via the size_in_bytes <= ?max_size filter in /pending — the main app never offers oversized rows. Microservice can also double-check before write.

Mime → extension table

Keep this in the microservice (not the main app), since the main app already has its own extension map for the local fallback path. They will drift; that's fine. Worst case is a .bin extension and jsDelivr serves application/octet-stream — defensive, not catastrophic, and easy to fix later.
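A minimal version of such a table, with the .bin fallback described above. The specific entries are assumptions for illustration (the plan only mentions png and jpg); mimeToExt and extByMime are hypothetical names.

```go
package main

import "mime"

// extByMime is an illustrative starting table, not an agreed list.
var extByMime = map[string]string{
	"image/png":       "png",
	"image/jpeg":      "jpg",
	"image/gif":       "gif",
	"image/webp":      "webp",
	"image/svg+xml":   "svg",
	"application/pdf": "pdf",
}

// mimeToExt maps a mimetype to a file extension without the leading dot.
// Parameters such as "; charset=utf-8" are ignored, and anything unknown
// or unparseable falls back to "bin".
func mimeToExt(mimetype string) string {
	mt, _, err := mime.ParseMediaType(mimetype)
	if err != nil {
		return "bin"
	}
	if ext, ok := extByMime[mt]; ok {
		return ext
	}
	return "bin"
}
```

mime.ParseMediaType lowercases the media type, so callers do not need to normalize case first.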


Local-machine v0 (before the microservice exists)

The is_public, file_type, cdn_url, and resolvedUrl() plumbing in this conversation is enough to support a manual publish workflow today:

# Hand-edit DB to flip is_public=1 on a known row
psql -c "UPDATE file_list SET is_public = 1 WHERE hashkey = '...';"

# Manually copy the bytes to a local cdn repo clone, commit, push, capture the commit sha
cp ./tmp/<filehash>.png ~/cdn-repos/bb-cdn-7f3a9e2c/app_logo/<filehash>.png
cd ~/cdn-repos/bb-cdn-7f3a9e2c
git add . && git commit -m "manual" && git push
SHA=$(git rev-parse HEAD)

# Hand-write the cdn_url back
psql -c "UPDATE file_list SET cdn_url = 'https://cdn.jsdelivr.net/gh/<owner>/bb-cdn-7f3a9e2c@${SHA}/app_logo/<filehash>.png' WHERE hashkey = '...';"

Tedious but proves the redirect path end-to-end before committing to the microservice build.

A small console command (php bin/hyperf.php cdn:publish-manual <filelist_hashkey> <cdn_url>) could wrap the final write-back step to avoid raw SQL. It is easy to add later and out of scope for this plan.


Open decisions for the microservice conversation

  1. Language/runtime: Decided — Go 1.23+.
  2. Hosting: Docker Compose alongside main app (easiest), or separate Dokploy/Hetzner box. Needs persistent volume for repo clones.
  3. Mimetype-to-folder rules: file_type defaults to misc/ when null (both polling and simple-upload modes).
  4. Commit batching: one commit per /pending batch for the polling loop; one commit per request for simple-upload mode (v1). Revisit if push rate becomes a bottleneck.
  5. Repo creation: dedicated GitHub machine user with a PAT scoped to repo. Token stored in env, never in DB.
  6. Public visibility check: refuse to mark a repo is_active if GitHub API reports it as private.
  7. Rate limiting: simple-upload mode needs per-token rate limits (e.g. 60/min, 10MB/s) — token-bucket in memory keyed by token_id. v1 uses a single global default; per-token overrides v2.

Summary of what already exists in BukidBountyApp to support this

  • file_list.cdn_url (migration 2026_05_09_120000_add_cdn_url_to_file_list.php)
  • file_list.is_public (default false) and file_list.file_type (migration 2026_05_09_120100_add_is_public_and_file_type_to_file_list.php)
  • FileList::resolvedUrl() — prefers CDN URL when set, otherwise local route
  • FilesMainController::viewFilebyFileListHash — 302-redirects to CDN URL when set, so all existing <img :src="'/RequestData/File/' + hash"> references benefit transparently
  • FilesMainController::generateURLforFileListHash — returns CDN URL when set in DB
  • FilesMainController::uploadFileList — accepts a ?string $file_type parameter; every existing caller sets one explicitly (or null for the generic UploadFilefromRequest endpoint)

Still missing (to be built when the microservice is built):

  • /internal/cdn/pending, /internal/cdn/bytes/{hash}, /internal/cdn/published endpoints + bearer middleware
  • Management UI for flipping is_public and assigning file_type to existing rows
  • The microservice itself