BarangaySystem/docs/tasks/cdn-microservice-plan.md
2026-06-06 18:43:00 +08:00


# CDN Publisher Microservice — Implementation Plan
A separate webserver (Docker container, deployed independently from BukidBountyApp) that owns the lifecycle of pushing public `file_list` rows to jsDelivr-fronted GitHub repos and reporting the resulting CDN URL back to the main app.
**Status:** planning + execution. New repo: **`cdn-relay`** (binary/CLI also named `cdn-relay`; the plan's prose still refers to the *role* as "CDN Publisher microservice").
## Language / runtime
**Go 1.23+.** Picked for: static single-binary Docker images (`FROM scratch` ~15 MB), millisecond cold-starts, low memory floor (the publish loop is mostly IO + git shell-out), goroutine-friendly concurrent fetching of `/bytes`, and mature libraries (`go-git` or shelling to `git`, `google/go-github`, `chi` router). Faster than Node/Python for the CPU-bound bits (sha256 streaming, multipart parsing) without the build complexity of Rust.
---
## Goals
1. Decouple GitHub push operations from the main Hyperf request lifecycle (single-writer, slow, network-bound).
2. Centralize repo-rotation logic (track repo sizes, allocate new repos when full) so the main app stays ignorant of CDN topology.
3. Provide a backfill mode for one-shot publishing of large historical batches without touching the live request path.
4. Provide a **synchronous "simple upload" mode**: any authorized client `POST`s a file, the service dedupes by sha256 against its own `publish_log`, and either returns the existing CDN URL or publishes-and-returns in a single request. This makes the service usable directly (CLI, third-party integrations) without going through the BukidBountyApp `file_list` flow.
## Non-goals
- Not a webhook responder. The main app does **not** push events; the microservice **pulls** work on its own schedule (or on operator-triggered runs).
- Not a media transformer. No resizing, transcoding, or compression.
- Not a private-asset gateway. Anything published is public, forever.
---
## Architecture
```
┌────────────────────┐   1. GET unpublished     ┌───────────────────────┐
│  BukidBountyApp    │◄───────────────────────► │  CDN Publisher        │
│  (main Hyperf app) │   2. fetch bytes         │  microservice         │
│                    │                          │  (Go)                 │
│  Postgres          │                          │                       │
│  + file_content    │                          │  Local clones of      │
│  + file_list       │                          │  cdn repos            │
└────────────────────┘                          └────────────┬──────────┘
          ▲                                                  │
          │ 4. POST /internal/cdn/published                  │ 3. git push
          │    { hashkey, cdn_url }                          ▼
          │                                     ┌───────────────────────┐
          └──────────────────────────────────── │  GitHub               │
                                                │  (private org/account)│
                                                │  bb-cdn-7f3a9e2c      │
                                                │  bb-cdn-1a8b3f0d      │
                                                │  …                    │
                                                └────────────┬──────────┘
                                                     jsDelivr CDN edge
```
### Why pull, not webhook
- Push-based (webhook) requires the main app to retry, queue, and authenticate to the microservice. That's contention the user explicitly wants to avoid.
- Pull-based: microservice runs on a cron tick (e.g. every 30s) and asks "give me up to N unpublished rows." The main app stays a dumb data store. Failures are self-recovering — next tick re-asks.
---
## Data contract (main app side)
Already in place after this conversation:
- `file_list.is_public` — boolean, default false. Only `is_public = 1 AND cdn_url IS NULL` rows are eligible.
- `file_list.cdn_url` — full jsDelivr URL written back on success.
- `file_list.file_type` — used for path organization in the CDN repo (e.g. `app_logo/<filehash>.png` vs `profile_photo/<filehash>.jpg`).
- `file_content.filehash` — sha256 of bytes; used as the file's content-addressed name in the CDN repo.
- `file_content.mimetype` — drives extension selection.
- `file_content.size_in_bytes` — used for the per-file size cap and repo-size accounting.
### New endpoints on the main app
Both protected by an `Authorization: Bearer <CDN_SERVICE_TOKEN>` header (token in `.env`, validated by middleware). No user session.
**`GET /internal/cdn/pending?limit=50`**
Returns rows ready for publish:
```json
{
"items": [
{
"hashkey": "<filelist hashkey>",
"filehash": "<sha256>",
"mimetype": "image/png",
"size_in_bytes": 142883,
"file_type": "app_logo",
"filename": "app_logo_1715260000.png"
}
]
}
```
Filter: `is_public = 1 AND cdn_url IS NULL AND size_in_bytes <= ?max_size`. Order by `id ASC` (FIFO, deterministic for resume). The microservice can call this in a tight loop, publishing each batch, until `items` comes back empty.
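As a rough illustration of the consuming side, the sketch below decodes a `/pending` response in Go. The type and function names (`PendingItem`, `parsePending`) are hypothetical, not part of the plan; the `misc` default for a null `file_type` follows the open-decisions list, and the size re-check is the optional double-check mentioned under the per-file size cap.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PendingItem mirrors one element of "items" in the /internal/cdn/pending response.
type PendingItem struct {
	Hashkey     string `json:"hashkey"`
	Filehash    string `json:"filehash"`
	Mimetype    string `json:"mimetype"`
	SizeInBytes int64  `json:"size_in_bytes"`
	FileType    string `json:"file_type"`
	Filename    string `json:"filename"`
}

// parsePending decodes a /pending body. It drops rows that lack a hashkey or
// filehash, re-checks the size cap, and defaults a null file_type to "misc".
func parsePending(body []byte, maxSize int64) ([]PendingItem, error) {
	var resp struct {
		Items []PendingItem `json:"items"`
	}
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, fmt.Errorf("decode pending response: %w", err)
	}
	kept := make([]PendingItem, 0, len(resp.Items))
	for _, it := range resp.Items {
		if it.Hashkey == "" || it.Filehash == "" || it.SizeInBytes > maxSize {
			continue
		}
		if it.FileType == "" {
			it.FileType = "misc"
		}
		kept = append(kept, it)
	}
	return kept, nil
}

func main() {
	body := []byte(`{"items":[{"hashkey":"abc","filehash":"deadbeef","mimetype":"image/png",
		"size_in_bytes":142883,"file_type":"app_logo","filename":"app_logo_1715260000.png"}]}`)
	items, err := parsePending(body, 50_000_000)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(items), items[0].FileType)
}
```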
**`GET /internal/cdn/bytes/{filelist_hashkey}`**
Streams the raw bytes. Same auth. Reuses the existing `viewFilebyFileListHash` plumbing but bypasses the CDN-redirect short-circuit (since the microservice is what populates `cdn_url` in the first place — it must always read from the DB).
**`POST /internal/cdn/published`**
Body:
```json
{
"hashkey": "<filelist hashkey>",
"cdn_url": "https://cdn.jsdelivr.net/gh/<owner>/bb-cdn-7f3a9e2c@<commit-sha>/profile_photo/<filehash>.jpg"
}
```
Updates `file_list.cdn_url` for that row. Idempotent — if `cdn_url` already set, return 200 without overwriting (or overwrite if newer; pick one and stick to it).
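A minimal Go sketch of how the service might assemble the `cdn_url` and the request body for this endpoint. `buildCDNURL` and `publishedReport` are illustrative names; the URL layout is the one shown in the body example above, and the per-segment escaping is an added precaution rather than something the plan specifies.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
	"strings"
)

// publishedReport is the JSON body for POST /internal/cdn/published.
type publishedReport struct {
	Hashkey string `json:"hashkey"`
	CDNURL  string `json:"cdn_url"`
}

// buildCDNURL pins the URL to a commit SHA so it keeps resolving to the same
// bytes even after later commits. Each path segment is escaped on its own so
// the "/" separators survive.
func buildCDNURL(owner, repo, commitSHA, repoPath string) string {
	segs := strings.Split(repoPath, "/")
	for i, s := range segs {
		segs[i] = url.PathEscape(s)
	}
	return fmt.Sprintf("https://cdn.jsdelivr.net/gh/%s/%s@%s/%s",
		owner, repo, commitSHA, strings.Join(segs, "/"))
}

func main() {
	u := buildCDNURL("example-owner", "bb-cdn-7f3a9e2c", "0123abc", "profile_photo/ffee.jpg")
	body, err := json.Marshal(publishedReport{Hashkey: "example-hashkey", CDNURL: u})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```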
**`POST /internal/cdn/failed`** *(optional, v2)*
Body: `{ hashkey, error }`. Logs the failure for operator visibility. The row stays eligible for retry next tick.
---
## Simple upload mode (synchronous, client-facing)
A second surface exposed by the same binary. Distinct from the BukidBountyApp pull loop — this is for direct clients (curl, scripts, third-party services) that want a one-shot "give me a CDN URL for this file" call.
### Endpoints
All require a valid token (see Auth section below) via `Authorization: Bearer <token>`.
**`POST /v1/upload`** — multipart/form-data
- field `file` (required): the file bytes
- field `file_type` (optional): folder name, default `misc`
- field `mimetype` (optional): override; otherwise sniffed from bytes + filename
Flow:
1. Stream body to a temp file while computing sha256.
2. Lookup `publish_log` by `filehash`. If found and `status='reported'` (or `'pushed'`) and `cdn_url IS NOT NULL` → return the existing URL immediately (no GitHub work).
3. Otherwise: same publish path as the polling loop — write to active repo's clone at `{file_type}/{sha256}.{ext}`, commit, push, record in `publish_log`, return URL.
4. Response:
```json
{ "cdn_url": "https://cdn.jsdelivr.net/gh/...", "filehash": "<sha256>", "deduped": true|false, "size_bytes": 12345 }
```
**`GET /v1/lookup/{sha256}`** — cheap dedup probe without uploading. Returns the existing `cdn_url` or 404.
**`GET /v1/docs`** — renders the API guide (HTML or markdown). **Only served when a valid token is presented** — unauthenticated callers get 401, never the docs. This keeps the surface unindexable.
**`GET /v1/health`** — unauthenticated, returns `{"ok": true}` for orchestrator probes.
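Step 1 of the upload flow (stream to a temp file while hashing) could look roughly like the Go sketch below. `stageUpload`, `stagedUpload`, and the `maxBytes` parameter are assumptions for illustration; `chunkReader` is only a stand-in for a request body that arrives in pieces.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
)

// stagedUpload describes a body that has been spooled to disk and hashed.
type stagedUpload struct {
	Path   string
	SHA256 string
	Size   int64
}

// remove deletes the temp file once it has been copied into the repo clone.
func (s stagedUpload) remove() error { return os.Remove(s.Path) }

// stageUpload copies body into a temp file in dir (the OS default when dir is
// empty) and hashes it in the same pass, so the bytes are never held in memory.
func stageUpload(body io.Reader, dir string, maxBytes int64) (stagedUpload, error) {
	tmp, err := os.CreateTemp(dir, "upload-*")
	if err != nil {
		return stagedUpload{}, err
	}
	h := sha256.New()
	// Read one byte past the cap so an oversized body is detectable.
	n, err := io.Copy(io.MultiWriter(tmp, h), io.LimitReader(body, maxBytes+1))
	if cerr := tmp.Close(); err == nil {
		err = cerr
	}
	if err == nil && n > maxBytes {
		err = fmt.Errorf("upload exceeds %d bytes", maxBytes)
	}
	if err != nil {
		os.Remove(tmp.Name())
		return stagedUpload{}, err
	}
	return stagedUpload{Path: tmp.Name(), SHA256: hex.EncodeToString(h.Sum(nil)), Size: n}, nil
}

// chunkReader returns its chunks one Read call at a time.
type chunkReader struct{ chunks [][]byte }

func (c *chunkReader) Read(p []byte) (int, error) {
	if len(c.chunks) == 0 {
		return 0, io.EOF
	}
	n := copy(p, c.chunks[0])
	if n < len(c.chunks[0]) {
		c.chunks[0] = c.chunks[0][n:]
	} else {
		c.chunks = c.chunks[1:]
	}
	return n, nil
}

func main() {
	st, err := stageUpload(&chunkReader{chunks: [][]byte{[]byte("hel"), []byte("lo")}}, "", 50_000_000)
	if err != nil {
		panic(err)
	}
	defer st.remove()
	fmt.Println(st.SHA256, st.Size)
}
```

The resulting `SHA256` value is what step 2 would look up in `publish_log` before doing any GitHub work.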
### Concurrency note
Simple-mode uploads share the same active repo and the same single-writer `git push` lock as the polling loop. A simple-mode request that arrives mid-batch waits for the lock (typically <1s; bounded by `repo_max_bytes / batch_size` git ops). For high-throughput callers, queueing many uploads and then issuing one `git push` would be better, but that's a v2 optimization; v1 commits per request when not batchable.
---
## Authentication & token management
The service has its own token store (separate from the `CDN_SERVICE_TOKEN` used for main-app ↔ microservice traffic, which is a single shared secret in env). Tokens here are user-facing: issued, expirable, revocable, IP-scoped.
### Schema
```sql
CREATE TABLE api_tokens (
id INTEGER PRIMARY KEY,
token_hash TEXT NOT NULL UNIQUE, -- sha256 of the raw token; raw shown once at creation
name TEXT NOT NULL, -- human label, e.g. "ci-pipeline"
scopes TEXT NOT NULL, -- csv: "upload,lookup,docs" or "admin"
ip_allow TEXT, -- csv of CIDRs; null = any
ip_deny TEXT, -- csv of CIDRs; evaluated before allow
expires_at TIMESTAMPTZ, -- null = never
created_at TIMESTAMPTZ DEFAULT now(),
last_used_at TIMESTAMPTZ,
revoked_at TIMESTAMPTZ
);
CREATE TABLE api_token_audit (
id INTEGER PRIMARY KEY,
token_id INTEGER REFERENCES api_tokens(id),
ip TEXT NOT NULL,
path TEXT NOT NULL,
status INTEGER NOT NULL,
ts TIMESTAMPTZ DEFAULT now()
);
```
### Validation pipeline (every request)
1. Extract bearer token → sha256 → lookup `api_tokens` by `token_hash`.
2. Reject if: not found, `revoked_at IS NOT NULL`, `expires_at < now()`, or scope doesn't cover the route.
3. Resolve client IP. Trust `X-Forwarded-For` only when `TRUSTED_PROXIES` env lists the immediate peer; otherwise use the socket address. (Prevents spoofing the IP check.)
4. If `ip_deny` matches → 403.
5. If `ip_allow` is set and doesn't match → 403.
6. Update `last_used_at`, write `api_token_audit` row, proceed.
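Steps 3 to 5 of the pipeline can be sketched in Go with the standard `net/netip` package. The function names are hypothetical. Taking the rightmost `X-Forwarded-For` entry assumes a single trusted proxy hop in front of the service, which the plan does not state.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// parseCIDRs turns a csv column such as ip_allow into prefixes; "" yields none.
func parseCIDRs(csv string) ([]netip.Prefix, error) {
	var out []netip.Prefix
	for _, part := range strings.Split(csv, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		p, err := netip.ParsePrefix(part)
		if err != nil {
			return nil, fmt.Errorf("bad CIDR %q: %w", part, err)
		}
		out = append(out, p.Masked())
	}
	return out, nil
}

func containsAddr(prefixes []netip.Prefix, addr netip.Addr) bool {
	for _, p := range prefixes {
		if p.Contains(addr) {
			return true
		}
	}
	return false
}

// clientIP trusts X-Forwarded-For only when the socket peer is a trusted proxy;
// otherwise it returns the socket address (step 3).
func clientIP(remoteAddr, xff string, trusted []netip.Prefix) (netip.Addr, error) {
	ap, err := netip.ParseAddrPort(remoteAddr)
	if err != nil {
		return netip.Addr{}, err
	}
	peer := ap.Addr().Unmap()
	if xff == "" || !containsAddr(trusted, peer) {
		return peer, nil
	}
	parts := strings.Split(xff, ",")
	fwd, err := netip.ParseAddr(strings.TrimSpace(parts[len(parts)-1]))
	if err != nil {
		return netip.Addr{}, err
	}
	return fwd.Unmap(), nil
}

// ipPermitted evaluates ip_deny first, then ip_allow (steps 4 and 5).
// A false result maps to a 403 response.
func ipPermitted(addr netip.Addr, allowCSV, denyCSV string) (bool, error) {
	deny, err := parseCIDRs(denyCSV)
	if err != nil {
		return false, err
	}
	if containsAddr(deny, addr) {
		return false, nil
	}
	allow, err := parseCIDRs(allowCSV)
	if err != nil {
		return false, err
	}
	return len(allow) == 0 || containsAddr(allow, addr), nil
}

func main() {
	trusted, _ := parseCIDRs("10.0.0.0/8")
	ip, err := clientIP("10.0.0.5:4312", "203.0.113.9", trusted)
	if err != nil {
		panic(err)
	}
	ok, _ := ipPermitted(ip, "203.0.113.0/24", "")
	fmt.Println(ip, ok)
}
```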
### Admin endpoints (scope = `admin`)
Bootstrap admin token is generated on first boot and printed to stdout once (operator must capture it). Subsequent admin tokens issued via:
- **`POST /v1/admin/tokens`**: body `{ name, scopes, ip_allow?, ip_deny?, ttl_hours? }`. Response includes the **raw token once** (never retrievable again) and the token id.
- **`GET /v1/admin/tokens`**: list (no raw values, just metadata + last-used).
- **`POST /v1/admin/tokens/{id}/revoke`**: sets `revoked_at = now()`.
- **`GET /v1/admin/audit?token_id=...&limit=...`**: recent usage.
CLI shortcuts (same binary): `cdn-relay token create --name=X --scopes=upload --ttl=720h --ip-allow=1.2.3.0/24`, `cdn-relay token revoke <id>`, `cdn-relay token list`. Useful for ops when the HTTP surface itself is locked down.
### Storage of raw tokens
Never. We store `sha256(token)` only. If lost, revoke and reissue.
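A short Go sketch of issuing a token and deriving the stored hash. The `cdnr_` prefix and the 32-byte length are assumptions chosen for illustration; only the "store `sha256(token)`, show the raw value once" rule comes from the plan.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashToken returns the hex sha256 that goes into api_tokens.token_hash.
func hashToken(raw string) string {
	sum := sha256.Sum256([]byte(raw))
	return hex.EncodeToString(sum[:])
}

// newToken generates a random raw token plus the hash to persist.
// The raw value is returned to the caller once and never stored.
func newToken() (raw, hash string, err error) {
	buf := make([]byte, 32)
	if _, err = rand.Read(buf); err != nil {
		return "", "", err
	}
	raw = "cdnr_" + hex.EncodeToString(buf) // hypothetical prefix
	return raw, hashToken(raw), nil
}

func main() {
	raw, hash, err := newToken()
	if err != nil {
		panic(err)
	}
	fmt.Println("show once:", raw)
	fmt.Println("store:    ", hash)
}
```

Because the token is high-entropy random data, a plain sha256 lookup by `token_hash` is sufficient here; no password-style slow hash is needed.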
---
## Microservice internals
### State (its own database, e.g. SQLite or Postgres)
```sql
CREATE TABLE cdn_repos (
id INTEGER PRIMARY KEY,
name TEXT NOT NULL UNIQUE, -- "bb-cdn-7f3a9e2c"
github_owner TEXT NOT NULL,
local_clone_path TEXT NOT NULL, -- where it's checked out on disk
size_used_bytes BIGINT NOT NULL DEFAULT 0,
is_active BOOLEAN NOT NULL DEFAULT 0, -- the current write target
is_full BOOLEAN NOT NULL DEFAULT 0,
created_at TIMESTAMPTZ DEFAULT now(),
retired_at TIMESTAMPTZ
);
CREATE TABLE publish_log (
id INTEGER PRIMARY KEY,
filelist_hashkey TEXT NOT NULL,
filehash TEXT NOT NULL,
cdn_repo_id INTEGER REFERENCES cdn_repos(id),
commit_sha TEXT,
cdn_url TEXT,
status TEXT NOT NULL, -- "pending" | "pushed" | "reported" | "failed"
attempts INTEGER NOT NULL DEFAULT 0,
last_error TEXT,
created_at TIMESTAMPTZ DEFAULT now(),
updated_at TIMESTAMPTZ DEFAULT now(),
UNIQUE(filelist_hashkey)
);
```
`cdn_repos.size_used_bytes` is the source of truth for rotation. Recomputed by a periodic `du -sb` of the local clone; updated incrementally after each push.
### Configuration
```toml
# config.toml
main_app_base_url = "https://bukidbounty.example.com"
main_app_token = "<env: CDN_SERVICE_TOKEN>"
github_owner = "<env: GH_OWNER>"
github_token = "<env: GH_TOKEN>"
poll_interval_sec = 30
batch_size = 50
per_file_max_bytes = 50_000_000 # 50 MB hard cap
repo_max_bytes = 800_000_000 # 800 MB rotation threshold
repo_name_prefix = "bb-cdn-"
clone_root = "/var/lib/cdn-relay/repos"
```
`github_owner` is intentionally not committed. The repo name pattern `bb-cdn-<random8hex>` is generated at rotation time so existing repos can't be enumerated by guessing.
### Repo rotation algorithm
```
on each batch flush:
active = select * from cdn_repos where is_active = 1 limit 1
if active is null OR active.size_used_bytes >= repo_max_bytes:
if active: mark active.is_active = 0, is_full = 1, retired_at = now()
new_name = repo_name_prefix + random_hex(8)
create_github_repo(new_name) # via GitHub API, public, empty
git_clone(new_name, clone_root/new_name)
insert cdn_repos (name=new_name, is_active=1, …)
active = the new row
return active
```
The retired repo's existing `cdn_url`s never need updating, because they already encode the repo name and a frozen commit SHA.
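The rotation pseudocode above could translate to Go along these lines. `Repo`, `ensureActive`, and the `create` callback (standing in for the GitHub API call plus `git clone`) are hypothetical. One deliberate difference from the pseudocode: this sketch retires the old repo only after the new one has been created, so a failed creation does not leave the service without a write target.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// Repo is an in-memory view of one cdn_repos row.
type Repo struct {
	Name          string
	SizeUsedBytes int64
	IsActive      bool
	IsFull        bool
}

// needsRotation is true when there is no active repo or it reached the cap.
func needsRotation(active *Repo, repoMaxBytes int64) bool {
	return active == nil || active.SizeUsedBytes >= repoMaxBytes
}

// newRepoName appends 8 random hex characters (4 random bytes) to the prefix.
func newRepoName(prefix string) (string, error) {
	buf := make([]byte, 4)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return prefix + hex.EncodeToString(buf), nil
}

// ensureActive returns the repo to write to, rotating when needed.
func ensureActive(active *Repo, repoMaxBytes int64, prefix string,
	create func(name string) error) (*Repo, error) {
	if !needsRotation(active, repoMaxBytes) {
		return active, nil
	}
	name, err := newRepoName(prefix)
	if err != nil {
		return nil, err
	}
	if err := create(name); err != nil {
		return nil, fmt.Errorf("create repo %s: %w", name, err)
	}
	if active != nil {
		active.IsActive = false
		active.IsFull = true
	}
	return &Repo{Name: name, IsActive: true}, nil
}

func main() {
	old := &Repo{Name: "bb-cdn-7f3a9e2c", SizeUsedBytes: 800_000_000, IsActive: true}
	next, err := ensureActive(old, 800_000_000, "bb-cdn-", func(name string) error {
		fmt.Println("would create and clone", name)
		return nil
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(next.Name, old.IsFull)
}
```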
### Publish loop (per tick)
```
1. resp = GET {main_app}/internal/cdn/pending?limit=batch_size
2. for each item in resp.items:
if publish_log row already exists for hashkey: skip
insert publish_log (status=pending)
3. group items by active repo (rotating mid-batch if size cap hit)
4. for each item:
bytes = GET {main_app}/internal/cdn/bytes/{hashkey} # streamed
ext = mimetype_to_ext(item.mimetype)
path = "{file_type}/{filehash}.{ext}" # file_type used as folder
write bytes to active.local_clone_path/path
stage with `git add`
5. once batch staged:
commit = git commit -m "publish batch <timestamp>"
git push origin main
sha = <commit sha>
6. for each item in batch:
cdn_url = "https://cdn.jsdelivr.net/gh/{owner}/{repo}@{sha}/{path}"
update publish_log set status=pushed, commit_sha, cdn_url
POST {main_app}/internal/cdn/published { hashkey, cdn_url }
update publish_log set status=reported
7. update active.size_used_bytes (incremental sum + occasional du reconciliation)
```
Steps 2 through 7 run inside a single advisory lock (`flock` or DB lock) so two ticks can't collide. Single-writer is the cheapest correctness guarantee.
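Step 3 of the loop ("rotating mid-batch if size cap hit") implies splitting a batch at the point where the active repo reaches `repo_max_bytes`. A minimal Go sketch of that split follows, with hypothetical names; it mirrors the rotation rule above, where the threshold is checked before each write, so the last file written may carry the repo slightly past the threshold.

```go
package main

import "fmt"

// item is the subset of a pending row that matters for size accounting.
type item struct {
	Hashkey string
	Size    int64
}

// splitAtCap returns the leading items that go to the current repo and the
// remainder that must wait for a freshly rotated repo. FIFO order is kept.
func splitAtCap(items []item, usedBytes, repoMaxBytes int64) (fits, rest []item) {
	total := usedBytes
	for i, it := range items {
		if total >= repoMaxBytes {
			return items[:i], items[i:]
		}
		total += it.Size
	}
	return items, nil
}

func main() {
	batch := []item{{"a", 40}, {"b", 70}, {"c", 10}}
	fits, rest := splitAtCap(batch, 0, 100)
	fmt.Println(len(fits), "items to the active repo,", len(rest), "after rotation")
}
```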
### Failure modes
| Failure | Recovery |
| --- | --- |
| Main app `/pending` 5xx | Skip tick, retry next |
| `/bytes` 404 | Mark `publish_log.status=failed`, continue batch (file was deleted between listing and fetch) |
| `git push` rejected | Roll back local commit (`git reset --hard HEAD~1`), mark batch failed, retry next tick |
| `/published` 5xx | Row stays in `publish_log.status=pushed`; reconciler re-POSTs on next tick (using `commit_sha` + `cdn_url` from log) |
| Microservice crash mid-batch | On boot, find `publish_log.status=pending` rows and decide whether the commit happened: compare `git log --oneline \| head -1` against the known last sha. If a new commit exists with our staged paths, mark pushed and report; otherwise reset and retry |
### Backfill mode
Same code path. Just an operator command: `cdn-relay backfill --limit=10000` that bypasses the polling sleep and runs `/pending` requests until the response is empty. No new logic.
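A sketch of that drain loop in Go, assuming a hypothetical `publishBatch` callback that runs one pass of the publish loop for up to `n` rows and reports how many it handled:

```go
package main

import "fmt"

// runBackfill keeps publishing batches until a pass handles zero rows or the
// operator-supplied limit is reached. It returns the number of rows handled.
func runBackfill(limit, batchSize int, publishBatch func(n int) (int, error)) (int, error) {
	total := 0
	for total < limit {
		n := batchSize
		if remaining := limit - total; remaining < n {
			n = remaining
		}
		got, err := publishBatch(n)
		if err != nil {
			return total, fmt.Errorf("backfill stopped after %d rows: %w", total, err)
		}
		if got == 0 {
			break // /pending came back empty
		}
		total += got
	}
	return total, nil
}

func main() {
	pending := 120 // simulated backlog on the main app
	done, err := runBackfill(10000, 50, func(n int) (int, error) {
		if n > pending {
			n = pending
		}
		pending -= n
		return n, nil
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("published", done)
}
```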
### Per-file size cap
Already enforced via the `size_in_bytes <= ?max_size` filter in `/pending`: the main app never offers oversized rows. The microservice can also double-check before write.
### Mime → extension table
Keep this in the microservice (not the main app), since the main app already has its own extension map for the local fallback path. They will drift; that's fine. Worst case is a `.bin` extension and jsDelivr serves `application/octet-stream`: defensive, not catastrophic, and easy to fix later.
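One possible shape for that table in Go, together with the `{file_type}/{filehash}.{ext}` path rule from the publish loop. The specific mimetype entries are illustrative, not an agreed list; the `bin` fallback and the `misc` default come from this plan.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// extByMime is a deliberately small, explicit table (illustrative entries).
var extByMime = map[string]string{
	"image/png":       "png",
	"image/jpeg":      "jpg",
	"image/gif":       "gif",
	"image/webp":      "webp",
	"image/svg+xml":   "svg",
	"application/pdf": "pdf",
}

// mimeToExt strips parameters such as "; charset=..." and falls back to "bin"
// for anything the table does not know.
func mimeToExt(mimetype string) string {
	mediaType, _, err := mime.ParseMediaType(mimetype)
	if err != nil {
		mediaType = strings.ToLower(strings.TrimSpace(mimetype))
	}
	if ext, ok := extByMime[mediaType]; ok {
		return ext
	}
	return "bin"
}

// repoPath builds the content-addressed path inside the CDN repo.
func repoPath(fileType, filehash, mimetype string) string {
	if fileType == "" {
		fileType = "misc"
	}
	return fmt.Sprintf("%s/%s.%s", fileType, filehash, mimeToExt(mimetype))
}

func main() {
	fmt.Println(repoPath("app_logo", "0a1b2c", "image/png"))
	fmt.Println(repoPath("", "0a1b2c", "application/x-unknown"))
}
```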
---
## Local-machine v0 (before the microservice exists)
The `is_public`, `file_type`, `cdn_url`, and `resolvedUrl()` plumbing in this conversation is enough to support a **manual** publish workflow today:
```bash
# Hand-edit DB to flip is_public=1 on a known row
psql -c "UPDATE file_list SET is_public = 1 WHERE hashkey = '...';"
# Manually copy the bytes to a local cdn repo clone, commit, push, capture the commit sha
cp ./tmp/<filehash>.png ~/cdn-repos/bb-cdn-7f3a9e2c/app_logo/<filehash>.png
cd ~/cdn-repos/bb-cdn-7f3a9e2c
git add . && git commit -m "manual" && git push
SHA=$(git rev-parse HEAD)
# Hand-write the cdn_url back
psql -c "UPDATE file_list SET cdn_url = 'https://cdn.jsdelivr.net/gh/<owner>/bb-cdn-7f3a9e2c@${SHA}/app_logo/<filehash>.png' WHERE hashkey = '...';"
```
Tedious but proves the redirect path end-to-end before committing to the microservice build.
A small artisan command (`php artisan cdn:publish-manual <filelist_hashkey> <cdn_url>`) could wrap step 3 to avoid raw SQL; easy to add later, but out of scope for this plan.
---
## Open decisions for the microservice conversation
1. ~~**Language/runtime**~~: **Decided — Go 1.23+.**
2. **Hosting**: Docker Compose alongside main app (easiest), or separate Dokploy/Hetzner box. Needs persistent volume for repo clones.
3. **Mimetype-to-folder rules**: `file_type` defaults to `misc/` when null (both polling and simple-upload modes).
4. **Commit batching**: one commit per `/pending` batch for the polling loop; one commit per request for simple-upload mode (v1). Revisit if push rate becomes a bottleneck.
5. **Repo creation**: dedicated GitHub machine user with a PAT scoped to `repo`. Token stored in env, never in DB.
6. **Public visibility check**: refuse to mark a repo `is_active` if GitHub API reports it as private.
7. **Rate limiting**: simple-upload mode needs per-token rate limits (e.g. `60/min`, `10MB/s`), implemented as an in-memory token bucket keyed by `token_id`. v1 uses a single global default; per-token overrides are v2.
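For item 7, a hand-rolled in-memory token bucket keyed by `token_id` might look like the Go sketch below. All names are hypothetical, it only covers the request-count limit (not the bytes-per-second one), and the clock is passed in as seconds so the behavior is easy to test. The `golang.org/x/time/rate` package would be a reasonable alternative to writing this by hand.

```go
package main

import (
	"fmt"
	"math"
	"sync"
	"time"
)

type bucket struct {
	tokens float64 // requests currently available
	last   float64 // time of the last refill, in seconds
}

// limiter applies one global rate (the v1 default) to every token_id.
type limiter struct {
	mu      sync.Mutex
	rate    float64 // refill rate in requests per second
	burst   float64 // bucket capacity
	buckets map[int64]*bucket
}

func newLimiter(perMinute, burst float64) *limiter {
	return &limiter{rate: perMinute / 60, burst: burst, buckets: make(map[int64]*bucket)}
}

// allowAt refills the caller's bucket for the elapsed time, then spends one
// request if available. A false result would map to an HTTP 429.
func (l *limiter) allowAt(tokenID int64, nowSec float64) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	b, ok := l.buckets[tokenID]
	if !ok {
		b = &bucket{tokens: l.burst, last: nowSec}
		l.buckets[tokenID] = b
	}
	if elapsed := nowSec - b.last; elapsed > 0 {
		b.tokens = math.Min(l.burst, b.tokens+elapsed*l.rate)
		b.last = nowSec
	}
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

// allow is the production entry point using the wall clock.
func (l *limiter) allow(tokenID int64) bool {
	return l.allowAt(tokenID, float64(time.Now().UnixNano())/1e9)
}

func main() {
	l := newLimiter(60, 10)
	fmt.Println(l.allow(1))
}
```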
---
## Summary of what already exists in BukidBountyApp to support this
- `file_list.cdn_url` (migration `2026_05_09_120000_add_cdn_url_to_file_list.php`)
- `file_list.is_public` (default false) and `file_list.file_type` (migration `2026_05_09_120100_add_is_public_and_file_type_to_file_list.php`)
- `FileList::resolvedUrl()`: prefers CDN URL when set, otherwise local route
- `FilesMainController::viewFilebyFileListHash`: 302-redirects to CDN URL when set, so all existing `<img :src="'/RequestData/File/' + hash">` references benefit transparently
- `FilesMainController::generateURLforFileListHash`: returns CDN URL when set in DB
- `FilesMainController::uploadFileList`: accepts a `?string $file_type` parameter; every existing caller sets one explicitly (or null for the generic `UploadFilefromRequest` endpoint)
Still missing (to be built when the microservice is built):
- `/internal/cdn/pending`, `/internal/cdn/bytes/{hash}`, `/internal/cdn/published` endpoints + bearer middleware
- Management UI for flipping `is_public` and assigning `file_type` to existing rows
- The microservice itself