# CDN Publisher Microservice — Implementation Plan

A separate web server (Docker container, deployed independently from BukidBountyApp) that owns the lifecycle of pushing public `file_list` rows to jsDelivr-fronted GitHub repos and reporting the resulting CDN URL back to the main app.

**Status:** planning + execution. New repo: **`cdn-relay`** (binary/CLI also named `cdn-relay`; the plan's prose still refers to the *role* as "CDN Publisher microservice").

## Language / runtime

**Go 1.23+.** Picked for: static single-binary Docker images (`FROM scratch`, ~15 MB), millisecond cold starts, a low memory floor (the publish loop is mostly I/O plus git shell-out), goroutine-friendly concurrent fetching of `/bytes`, and mature libraries (`go-git` or shelling out to `git`, `google/go-github`, the `chi` router). Note that a `FROM scratch` image contains no `git` binary, so it only works with the pure-Go `go-git`; shelling out to `git` needs a base image that ships it. Go is also faster than Node/Python for the CPU-bound bits (sha256 streaming, multipart parsing) without the build complexity of Rust.

---

## Goals

1. Decouple GitHub push operations from the main Hyperf request lifecycle (single-writer, slow, network-bound).
2. Centralize repo-rotation logic (track repo sizes, allocate new repos when full) so the main app stays ignorant of CDN topology.
3. Provide a backfill mode for one-shot publishing of large historical batches without touching the live request path.
4. Provide a **synchronous "simple upload" mode**: any authorized client `POST`s a file, the service dedupes by sha256 against its own `publish_log`, and either returns the existing CDN URL or publishes-and-returns in a single request. This makes the service usable directly (CLI, third-party integrations) without going through the BukidBountyApp `file_list` flow.

## Non-goals

- Not a webhook responder. The main app does **not** push events; the microservice **pulls** work on its own schedule (or on operator-triggered runs).
- Not a media transformer. No resizing, transcoding, or compression.
- Not a private-asset gateway. Anything published is public, forever.

---

## Architecture

```
┌────────────────────┐   1. GET unpublished     ┌───────────────────────┐
│  BukidBountyApp    │◄───────────────────────► │  CDN Publisher        │
│  (main Hyperf app) │   2. fetch bytes         │  microservice         │
│                    │                          │  (Go)                 │
│  Postgres          │                          │                       │
│   + file_content   │                          │  Local clones of      │
│   + file_list      │                          │  cdn repos            │
└────────────────────┘                          └────────────┬──────────┘
          ▲                                                  │
          │  4. POST /internal/cdn/published                 │ 3. git push
          │     { hashkey, cdn_url }                         ▼
          │                                     ┌───────────────────────┐
          └──────────────────────────────────── │  GitHub               │
                                                │  (private org/account)│
                                                │  bb-cdn-7f3a9e2c      │
                                                │  bb-cdn-1a8b3f0d      │
                                                │  …                    │
                                                └────────────┬──────────┘
                                                             │
                                                             ▼
                                                     jsDelivr CDN edge
```

### Why pull, not webhook

- Push-based (webhook): the main app would have to retry, queue, and authenticate to the microservice. That's contention the user explicitly wants to avoid.
- Pull-based: the microservice runs on a timer tick (e.g. every 30s) and asks "give me up to N unpublished rows." The main app stays a dumb data store. Failures are self-recovering: the next tick simply re-asks.
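
The pull model can be sketched in Go roughly as follows. This is a minimal illustration, not the planned implementation: `poll`, `demoPoll`, and the `tick` callback (standing in for one `/pending` round trip) are hypothetical names. The point it shows is that a failed tick is only logged, because the next tick asks again.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// poll invokes tick once per interval until ctx is cancelled and returns how
// many ticks ran. Errors are logged and dropped: the next tick re-asks.
func poll(ctx context.Context, interval time.Duration, tick func(context.Context) error) int {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	runs := 0
	for {
		select {
		case <-ctx.Done():
			return runs
		case <-ticker.C:
			runs++
			if err := tick(ctx); err != nil {
				fmt.Println("tick failed, retrying on next tick:", err)
			}
		}
	}
}

// demoPoll runs the loop briefly with a tick that always fails, showing that
// failures do not stop the loop. The 503 error is made up for the demo.
func demoPoll() int {
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()
	return poll(ctx, 20*time.Millisecond, func(context.Context) error {
		return errors.New("main app returned 503")
	})
}

func main() {
	fmt.Println("ticks run:", demoPoll())
}
```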

---

## Data contract (main app side)

Already in place in the main app:

- `file_list.is_public`: boolean, default false. Only `is_public = true AND cdn_url IS NULL` rows are eligible.
- `file_list.cdn_url`: full jsDelivr URL written back on success.
- `file_list.file_type`: used for path organization in the CDN repo (e.g. `app_logo/<filehash>.png` vs `profile_photo/<filehash>.jpg`).
- `file_content.filehash`: sha256 of bytes; used as the file's content-addressed name in the CDN repo.
- `file_content.mimetype`: drives extension selection.
- `file_content.size_in_bytes`: used for the per-file size cap and repo-size accounting.

### New endpoints on the main app

All protected by an `Authorization: Bearer <CDN_SERVICE_TOKEN>` header (token in `.env`, validated by middleware). No user session.

**`GET /internal/cdn/pending?limit=50`**
Returns rows ready for publish:
```json
{
  "items": [
    {
      "hashkey": "<filelist hashkey>",
      "filehash": "<sha256>",
      "mimetype": "image/png",
      "size_in_bytes": 142883,
      "file_type": "app_logo",
      "filename": "app_logo_1715260000.png"
    }
  ]
}
```
Filter: `is_public = true AND cdn_url IS NULL AND size_in_bytes <= ?max_size`. Order by `id ASC` (FIFO, deterministic for resume). The microservice can call this in a tight loop until `items` comes back empty.
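
On the Go side, the response above could be decoded with plain `encoding/json`. The type and function names here (`PendingItem`, `PendingResponse`, `decodePending`) are illustrative assumptions; only the JSON field names come from the contract.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PendingItem mirrors one element of "items" in the /internal/cdn/pending response.
type PendingItem struct {
	Hashkey     string `json:"hashkey"`
	Filehash    string `json:"filehash"`
	Mimetype    string `json:"mimetype"`
	SizeInBytes int64  `json:"size_in_bytes"`
	FileType    string `json:"file_type"` // a JSON null decodes to ""
	Filename    string `json:"filename"`
}

// PendingResponse is the top-level envelope.
type PendingResponse struct {
	Items []PendingItem `json:"items"`
}

// decodePending parses a response body and returns its items.
func decodePending(body []byte) ([]PendingItem, error) {
	var resp PendingResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, fmt.Errorf("decode pending: %w", err)
	}
	return resp.Items, nil
}

func main() {
	sample := []byte(`{"items":[{"hashkey":"hk1","filehash":"abc","mimetype":"image/png",
		"size_in_bytes":142883,"file_type":"app_logo","filename":"app_logo_1715260000.png"}]}`)
	items, err := decodePending(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(items), items[0].FileType, items[0].SizeInBytes)
}
```

An empty batch decodes to a zero-length `Items` slice, which is the natural stop condition for the tight loop.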

**`GET /internal/cdn/bytes/{filelist_hashkey}`**
Streams the raw bytes. Same auth. Reuses the existing `viewFilebyFileListHash` plumbing but bypasses the CDN-redirect short-circuit (the microservice is what populates `cdn_url` in the first place, so it must always read from the DB).

**`POST /internal/cdn/published`**
Body:
```json
{
  "hashkey": "<filelist hashkey>",
  "cdn_url": "https://cdn.jsdelivr.net/gh/<owner>/bb-cdn-7f3a9e2c@<commit-sha>/profile_photo/<filehash>.jpg"
}
```
Updates `file_list.cdn_url` for that row. Idempotent: if `cdn_url` is already set, return 200 without overwriting (or overwrite if newer; pick one and stick to it).

**`POST /internal/cdn/failed`** *(optional, v2)*
Body: `{ hashkey, error }`. Logs the failure for operator visibility. The row stays eligible for retry next tick.

---

## Simple upload mode (synchronous, client-facing)

A second surface exposed by the same binary. Distinct from the BukidBountyApp pull loop: this is for direct clients (curl, scripts, third-party services) that want a one-shot "give me a CDN URL for this file" call.

### Endpoints

All except `/v1/health` require a valid token (see "Authentication & token management" below) via `Authorization: Bearer <token>`.

**`POST /v1/upload`** (multipart/form-data)
- field `file` (required): the file bytes
- field `file_type` (optional): folder name, default `misc`
- field `mimetype` (optional): override; otherwise sniffed from bytes + filename

Flow:
1. Stream body to a temp file while computing sha256.
2. Look up `publish_log` by `filehash`. If found with `status='reported'` (or `'pushed'`) and `cdn_url IS NOT NULL`, return the existing URL immediately (no GitHub work).
3. Otherwise: same publish path as the polling loop. Write to the active repo's clone at `{file_type}/{sha256}.{ext}`, commit, push, record in `publish_log`, return URL.
4. Response (`deduped` is `true` when step 2 returned an existing URL):
```json
{ "cdn_url": "https://cdn.jsdelivr.net/gh/...", "filehash": "<sha256>", "deduped": false, "size_bytes": 12345 }
```
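
Step 1 of the flow (spool to a temp file while hashing) can be done in a single pass with `io.MultiWriter`. The sketch below is one possible shape, assuming the handler already has the `file` part as an `io.Reader`; `spoolAndHash` and `hashString` are hypothetical helper names.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"strings"
)

// spoolAndHash copies r into a new temp file while feeding the same bytes to
// sha256, so the upload is read exactly once. It returns the temp path, the
// hex digest, and the byte count. The caller owns (and must remove) the file.
func spoolAndHash(r io.Reader) (path, sum string, size int64, err error) {
	tmp, err := os.CreateTemp("", "upload-*")
	if err != nil {
		return "", "", 0, err
	}
	defer tmp.Close()

	h := sha256.New()
	size, err = io.Copy(io.MultiWriter(tmp, h), r)
	if err != nil {
		os.Remove(tmp.Name()) // do not leave partial uploads behind
		return "", "", 0, err
	}
	return tmp.Name(), hex.EncodeToString(h.Sum(nil)), size, nil
}

// hashString is a demo helper: it spools s and cleans up the temp file.
func hashString(s string) (string, int64) {
	path, sum, n, err := spoolAndHash(strings.NewReader(s))
	if err != nil {
		panic(err)
	}
	defer os.Remove(path)
	return sum, n
}

func main() {
	sum, n := hashString("hello")
	fmt.Println(sum, n)
}
```

The digest is what step 2 would look up in `publish_log.filehash` before any git work happens.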

**`GET /v1/lookup/{sha256}`**: cheap dedup probe without uploading. Returns the existing `cdn_url` or 404.

**`GET /v1/docs`**: renders the API guide (HTML or markdown). **Only served when a valid token is presented**: unauthenticated callers get 401, never the docs. This keeps the surface unindexable.

**`GET /v1/health`**: unauthenticated, returns `{"ok": true}` for orchestrator probes.

### Concurrency note

Simple-mode uploads share the same active repo and the same single-writer `git push` lock as the polling loop. A simple-mode request that arrives mid-batch waits for the lock (typically <1s; bounded by the time the in-flight batch needs to finish, i.e. at most `batch_size` fetches plus one commit and push). For high-throughput callers, prefer queueing many uploads and then issuing one `git push`, but that's a v2 optimization; v1 commits per request when not batchable.
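
A minimal sketch of that shared lock, assuming both surfaces run in the same process so an in-memory `sync.Mutex` is enough (a `flock` or DB lock would be needed across processes). `RepoWriter`, `Publish`, and `simulate` are illustrative names, and the `commits` counter merely stands in for "commit and push".

```go
package main

import (
	"fmt"
	"sync"
)

// RepoWriter serialises every operation that touches the active clone.
type RepoWriter struct {
	mu      sync.Mutex
	commits int // stand-in for "git commit + git push"
}

// Publish runs fn while holding the single-writer lock. Both the polling
// loop and a simple-upload request would go through this method.
func (w *RepoWriter) Publish(fn func() error) error {
	w.mu.Lock()
	defer w.mu.Unlock()
	if err := fn(); err != nil {
		return err
	}
	w.commits++
	return nil
}

// simulate runs n concurrent callers and returns the resulting commit count.
func simulate(n int) int {
	var w RepoWriter
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = w.Publish(func() error { return nil })
		}()
	}
	wg.Wait()
	return w.commits
}

func main() {
	fmt.Println("commits:", simulate(20))
}
```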

---

## Authentication & token management

The service has its own token store (separate from the `CDN_SERVICE_TOKEN` used for traffic between the main app and the microservice, which is a single shared secret in env). Tokens here are user-facing: issued, expirable, revocable, IP-scoped.

### Schema

```sql
-- INTEGER PRIMARY KEY auto-assigns ids in SQLite; on Postgres use GENERATED ALWAYS AS IDENTITY.
CREATE TABLE api_tokens (
  id           INTEGER PRIMARY KEY,
  token_hash   TEXT NOT NULL UNIQUE,   -- sha256 of the raw token; raw shown once at creation
  name         TEXT NOT NULL,          -- human label, e.g. "ci-pipeline"
  scopes       TEXT NOT NULL,          -- csv: "upload,lookup,docs" or "admin"
  ip_allow     TEXT,                   -- csv of CIDRs; null = any
  ip_deny      TEXT,                   -- csv of CIDRs; evaluated before allow
  expires_at   TIMESTAMPTZ,            -- null = never
  created_at   TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
  last_used_at TIMESTAMPTZ,
  revoked_at   TIMESTAMPTZ
);

CREATE TABLE api_token_audit (
  id       INTEGER PRIMARY KEY,
  token_id INTEGER REFERENCES api_tokens(id),
  ip       TEXT NOT NULL,
  path     TEXT NOT NULL,
  status   INTEGER NOT NULL,
  ts       TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP
);
```

### Validation pipeline (every request)

1. Extract bearer token → sha256 → look up `api_tokens` by `token_hash`.
2. Reject if: not found, `revoked_at IS NOT NULL`, `expires_at < now()`, or scope doesn't cover the route.
3. Resolve client IP. Trust `X-Forwarded-For` only when the `TRUSTED_PROXIES` env var lists the immediate peer; otherwise use the socket address. (Prevents spoofing the IP check.)
4. If `ip_deny` matches → 403.
5. If `ip_allow` is set and doesn't match → 403.
6. Update `last_used_at`, write `api_token_audit` row, proceed.
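
Steps 3 to 5 can be sketched with the standard `net/netip` package. Several things here are assumptions rather than decisions from this plan: `TRUSTED_PROXIES` is treated as a comma-separated CIDR list, only a single proxy hop is considered (so the rightmost `X-Forwarded-For` entry is used), and `peer` is the socket address with the port already stripped. The function names are illustrative.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// matchesAny reports whether ip falls inside any CIDR of a comma-separated
// list. Unparseable entries are ignored.
func matchesAny(ip netip.Addr, csv string) bool {
	for _, part := range strings.Split(csv, ",") {
		p, err := netip.ParsePrefix(strings.TrimSpace(part))
		if err == nil && p.Contains(ip) {
			return true
		}
	}
	return false
}

// clientIP implements step 3: X-Forwarded-For is honoured only when the
// immediate peer is a trusted proxy; otherwise the socket address wins.
func clientIP(peer, xff, trustedProxies string) string {
	peerAddr, err := netip.ParseAddr(peer)
	if err != nil || xff == "" || !matchesAny(peerAddr, trustedProxies) {
		return peer
	}
	// With one trusted hop, the rightmost entry is the one our proxy appended.
	parts := strings.Split(xff, ",")
	return strings.TrimSpace(parts[len(parts)-1])
}

// ipAllowed implements steps 4 and 5: deny wins, then allow (empty = any).
func ipAllowed(ipStr, allow, deny string) bool {
	ip, err := netip.ParseAddr(ipStr)
	if err != nil {
		return false
	}
	if deny != "" && matchesAny(ip, deny) {
		return false
	}
	return allow == "" || matchesAny(ip, allow)
}

func main() {
	ip := clientIP("10.0.0.1", "6.6.6.6, 1.2.3.4", "10.0.0.0/8")
	fmt.Println(ip, ipAllowed(ip, "1.2.3.0/24", ""))
}
```

Evaluating `ip_deny` first matches the schema comment ("evaluated before allow"), so a deny entry can carve an exception out of a broader allow range.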

### Admin endpoints (scope = `admin`)

A bootstrap admin token is generated on first boot and printed to stdout once (the operator must capture it). Subsequent tokens are issued via:

- **`POST /v1/admin/tokens`**: body `{ name, scopes, ip_allow?, ip_deny?, ttl_hours? }`. The response includes the **raw token once** (never retrievable again) and the token id.
- **`GET /v1/admin/tokens`**: list (no raw values, just metadata + last-used).
- **`POST /v1/admin/tokens/{id}/revoke`**: sets `revoked_at = now()`.
- **`GET /v1/admin/audit?token_id=...&limit=...`**: recent usage.

CLI shortcuts (same binary): `cdn-relay token create --name=X --scopes=upload --ttl=720h --ip-allow=1.2.3.0/24`, `cdn-relay token revoke <id>`, `cdn-relay token list`. Useful for ops when the HTTP surface itself is locked down.

### Storage of raw tokens

Never. We store `sha256(token)` only. If a token is lost, revoke and reissue.
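
Issuing a token under this rule could look like the sketch below. The 32-byte length and hex encoding of the raw token are assumptions, not something this plan fixes, and `newToken` / `hashToken` are hypothetical names. Only the returned hash would be written to `api_tokens.token_hash`.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashToken returns the hex sha256 of a raw token, i.e. the stored form.
func hashToken(raw string) string {
	sum := sha256.Sum256([]byte(raw))
	return hex.EncodeToString(sum[:])
}

// newToken generates a random raw token (shown to the operator once) and the
// hash that goes into the database.
func newToken() (raw, hash string, err error) {
	buf := make([]byte, 32) // 32 random bytes -> 64 hex chars (assumed size)
	if _, err = rand.Read(buf); err != nil {
		return "", "", err
	}
	raw = hex.EncodeToString(buf)
	return raw, hashToken(raw), nil
}

func main() {
	raw, hash, err := newToken()
	if err != nil {
		panic(err)
	}
	fmt.Println("show once:", raw)
	fmt.Println("store:    ", hash)
}
```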

---

## Microservice internals

### State (its own database, e.g. SQLite or Postgres)

```sql
-- FALSE and CURRENT_TIMESTAMP are accepted by both SQLite and Postgres.
CREATE TABLE cdn_repos (
  id               INTEGER PRIMARY KEY,
  name             TEXT NOT NULL UNIQUE,            -- "bb-cdn-7f3a9e2c"
  github_owner     TEXT NOT NULL,
  local_clone_path TEXT NOT NULL,                   -- where it's checked out on disk
  size_used_bytes  BIGINT NOT NULL DEFAULT 0,
  is_active        BOOLEAN NOT NULL DEFAULT FALSE,  -- the current write target
  is_full          BOOLEAN NOT NULL DEFAULT FALSE,
  created_at       TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
  retired_at       TIMESTAMPTZ
);

CREATE TABLE publish_log (
  id               INTEGER PRIMARY KEY,
  filelist_hashkey TEXT NOT NULL,
  filehash         TEXT NOT NULL,
  cdn_repo_id      INTEGER REFERENCES cdn_repos(id),
  commit_sha       TEXT,
  cdn_url          TEXT,
  status           TEXT NOT NULL,                   -- "pending" | "pushed" | "reported" | "failed"
  attempts         INTEGER NOT NULL DEFAULT 0,
  last_error       TEXT,
  created_at       TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
  updated_at       TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
  UNIQUE(filelist_hashkey)
);
```

`cdn_repos.size_used_bytes` is the source of truth for rotation. It is updated incrementally after each push and reconciled by a periodic `du -sb` of the local clone.

### Configuration

```toml
# config.toml
main_app_base_url  = "https://bukidbounty.example.com"
main_app_token     = "<env: CDN_SERVICE_TOKEN>"
github_owner       = "<env: GH_OWNER>"
github_token       = "<env: GH_TOKEN>"
poll_interval_sec  = 30
batch_size         = 50
per_file_max_bytes = 50_000_000    # 50 MB hard cap
repo_max_bytes     = 800_000_000   # 800 MB rotation threshold
repo_name_prefix   = "bb-cdn-"
clone_root         = "/var/lib/cdn-relay/repos"
```

`github_owner` is intentionally not committed. The repo name pattern `bb-cdn-<random8hex>` is generated at rotation time so existing repos can't be enumerated by guessing.
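
Generating such a name is a few lines with `crypto/rand`. In this sketch `newRepoName` is a hypothetical helper; four random bytes give the eight hex characters seen in names like `bb-cdn-7f3a9e2c`.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// newRepoName returns prefix followed by 8 random hex characters.
func newRepoName(prefix string) (string, error) {
	b := make([]byte, 4) // 4 bytes -> 8 hex chars
	if _, err := rand.Read(b); err != nil {
		return "", fmt.Errorf("generate repo name: %w", err)
	}
	return prefix + hex.EncodeToString(b), nil
}

func main() {
	name, err := newRepoName("bb-cdn-")
	if err != nil {
		panic(err)
	}
	fmt.Println(name)
}
```

With only 32 bits of randomness a collision with an existing repo is unlikely but possible, so a real implementation would probably retry when the GitHub API reports the name as taken.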

### Repo rotation algorithm

```
on each batch flush:
  active = select * from cdn_repos where is_active = true limit 1
  if active is null OR active.size_used_bytes >= repo_max_bytes:
    if active: mark active.is_active = false, is_full = true, retired_at = now()
    new_name = repo_name_prefix + random_hex(8)
    create_github_repo(new_name)   # via GitHub API, public, empty
    git_clone(new_name, clone_root/new_name)
    insert cdn_repos (name=new_name, is_active=true, …)
    active = the new row
  return active
```

The retired repo's existing `cdn_url`s never need updating: they already encode the repo name and a frozen commit SHA.
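
The rotation decision can be sketched in Go as below. The `Repo` struct, `ensureActive`, and the `create` hook (standing in for "create the GitHub repo, clone it, insert the `cdn_repos` row") are all hypothetical. One deliberate difference from the pseudocode: this version creates the new repo before retiring the old one, so a failed GitHub call leaves the old repo as the active target.

```go
package main

import "fmt"

// Repo is an in-memory view of one cdn_repos row (subset of columns).
type Repo struct {
	Name          string
	SizeUsedBytes int64
	IsActive      bool
	IsFull        bool
}

// ensureActive returns the repo to write to, rotating when the current one is
// missing or has reached repoMaxBytes.
func ensureActive(active *Repo, repoMaxBytes int64, create func() (*Repo, error)) (*Repo, error) {
	if active != nil && active.SizeUsedBytes < repoMaxBytes {
		return active, nil // still has room
	}
	next, err := create()
	if err != nil {
		return nil, fmt.Errorf("rotate repo: %w", err)
	}
	if active != nil {
		active.IsActive, active.IsFull = false, true // retire the old target
	}
	next.IsActive = true
	return next, nil
}

func main() {
	old := &Repo{Name: "bb-cdn-7f3a9e2c", SizeUsedBytes: 800_000_000, IsActive: true}
	next, err := ensureActive(old, 800_000_000, func() (*Repo, error) {
		return &Repo{Name: "bb-cdn-1a8b3f0d"}, nil
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(next.Name, next.IsActive, old.IsActive, old.IsFull)
}
```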

### Publish loop (per tick)

```
1. resp = GET {main_app}/internal/cdn/pending?limit=batch_size
2. for each item in resp.items:
     if publish_log row exists for hashkey with status pushed or reported: skip
     else upsert publish_log (status=pending, attempts += 1)
3. group items by active repo (rotating mid-batch if size cap hit)
4. for each item:
     bytes = GET {main_app}/internal/cdn/bytes/{hashkey}   # streamed
     ext   = mimetype_to_ext(item.mimetype)
     path  = "{file_type}/{filehash}.{ext}"                # file_type used as folder
     write bytes to active.local_clone_path/path
     stage with `git add`
5. once batch staged:
     commit = git commit -m "publish batch <timestamp>"
     git push origin main
     sha = <commit sha>
6. for each item in batch:
     cdn_url = "https://cdn.jsdelivr.net/gh/{owner}/{repo}@{sha}/{path}"
     update publish_log set status=pushed, commit_sha, cdn_url
     POST {main_app}/internal/cdn/published { hashkey, cdn_url }
     update publish_log set status=reported
7. update active.size_used_bytes (incremental sum + occasional du reconciliation)
```

Steps 2–7 run inside a single advisory lock (`flock` or DB lock) so two ticks can't collide. Single-writer is the cheapest correctness guarantee.
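
The path and URL construction from steps 4 and 6 is simple string assembly. A small sketch, with `objectPath` and `cdnURL` as illustrative names and the `misc` fallback taken from the `file_type` default used elsewhere in this plan:

```go
package main

import "fmt"

// objectPath builds the in-repo path "{file_type}/{filehash}.{ext}",
// falling back to the "misc" folder when file_type is empty.
func objectPath(fileType, filehash, ext string) string {
	if fileType == "" {
		fileType = "misc"
	}
	return fileType + "/" + filehash + "." + ext
}

// cdnURL pins the jsDelivr URL to an exact commit SHA, so the URL keeps
// resolving to the same bytes after later pushes.
func cdnURL(owner, repo, sha, path string) string {
	return fmt.Sprintf("https://cdn.jsdelivr.net/gh/%s/%s@%s/%s", owner, repo, sha, path)
}

func main() {
	p := objectPath("app_logo", "abc123", "png")
	fmt.Println(cdnURL("example-owner", "bb-cdn-7f3a9e2c", "deadbeef", p))
}
```

The owner, SHA, and hash values in `main` are placeholders for the demo only.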

### Failure modes

| Failure | Recovery |
| --- | --- |
| Main app `/pending` 5xx | Skip tick, retry next |
| `/bytes` 404 | Mark `publish_log.status=failed`, continue batch (file was deleted between listing and fetch) |
| `git push` rejected | Roll back local commit (`git reset --hard HEAD~1`), mark batch failed, retry next tick |
| `/published` 5xx | Row stays in `publish_log.status=pushed`; reconciler re-POSTs on next tick (using `commit_sha` + `cdn_url` from log) |
| Microservice crash mid-batch | On boot, find `publish_log.status=pending` rows and decide whether the commit happened by comparing `git rev-parse HEAD` with the last known SHA. If a new commit exists with our staged paths, mark pushed and report; else reset and retry |
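
The crash-recovery row boils down to a two-input decision, sketched below. `decideRecovery` and its inputs are hypothetical: the caller is assumed to have obtained `headSHA` from the local clone and to have checked whether that commit contains the pending rows' paths. The sketch follows the table literally; a fuller version would presumably also confirm that the commit reached the remote.

```go
package main

import "fmt"

// RecoveryAction is what the boot-time reconciler should do with pending rows.
type RecoveryAction int

const (
	ResetAndRetry       RecoveryAction = iota // discard local state, let the next tick redo the batch
	MarkPushedAndReport                       // the commit exists: record it and POST /published
)

// decideRecovery applies the rule from the failure table: a new HEAD that
// contains our staged paths means the commit happened.
func decideRecovery(headSHA, lastKnownSHA string, headContainsPaths bool) RecoveryAction {
	if headSHA != lastKnownSHA && headContainsPaths {
		return MarkPushedAndReport
	}
	return ResetAndRetry
}

func main() {
	fmt.Println(decideRecovery("bbb222", "aaa111", true) == MarkPushedAndReport)
	fmt.Println(decideRecovery("aaa111", "aaa111", false) == ResetAndRetry)
}
```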

### Backfill mode

Same code path. Just an operator command: `cdn-relay backfill --limit=10000` that bypasses the polling sleep and runs `/pending` requests until the response is empty. No new logic.

### Per-file size cap

Already enforced via the `size_in_bytes <= ?max_size` filter in `/pending`, so the main app never offers oversized rows. The microservice can also double-check before writing.

### Mime → extension table

Keep this in the microservice (not the main app), since the main app already has its own extension map for the local fallback path. They will drift; that's fine. The worst case is a `.bin` extension that jsDelivr serves as `application/octet-stream`: degraded, not catastrophic, and easy to fix later.
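
A minimal version of that table might look like this. The listed MIME types are an illustrative starting set rather than the final map, and `extFor` is a hypothetical name; unknown types fall through to `bin`.

```go
package main

import (
	"fmt"
	"strings"
)

// mimeExt maps a MIME type to the file extension used in the CDN repo.
var mimeExt = map[string]string{
	"image/png":       "png",
	"image/jpeg":      "jpg",
	"image/gif":       "gif",
	"image/webp":      "webp",
	"image/svg+xml":   "svg",
	"application/pdf": "pdf",
}

// extFor normalises the MIME type (drops parameters such as "; charset=...",
// lowercases) and returns its extension, or "bin" when the type is unknown.
func extFor(mimetype string) string {
	base, _, _ := strings.Cut(mimetype, ";")
	base = strings.ToLower(strings.TrimSpace(base))
	if ext, ok := mimeExt[base]; ok {
		return ext
	}
	return "bin"
}

func main() {
	fmt.Println(extFor("image/png"), extFor("IMAGE/JPEG; q=1"), extFor("application/x-unknown"))
}
```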

---

## Local-machine v0 (before the microservice exists)

The `is_public`, `file_type`, `cdn_url`, and `resolvedUrl()` plumbing already in the main app is enough to support a **manual** publish workflow today:

```bash
# 1. Hand-edit DB to flip is_public on a known row
psql -c "UPDATE file_list SET is_public = true WHERE hashkey = '...';"

# 2. Manually copy the bytes to a local cdn repo clone, commit, push, capture the commit sha
cp ./tmp/<filehash>.png ~/cdn-repos/bb-cdn-7f3a9e2c/app_logo/<filehash>.png
cd ~/cdn-repos/bb-cdn-7f3a9e2c
git add . && git commit -m "manual" && git push
SHA=$(git rev-parse HEAD)

# 3. Hand-write the cdn_url back
psql -c "UPDATE file_list SET cdn_url = 'https://cdn.jsdelivr.net/gh/<owner>/bb-cdn-7f3a9e2c@${SHA}/app_logo/<filehash>.png' WHERE hashkey = '...';"
```

Tedious but proves the redirect path end-to-end before committing to the microservice build.

A small console command (`php bin/hyperf.php cdn:publish-manual <filelist_hashkey> <cdn_url>`) could wrap step 3 to avoid raw SQL; it is easy to add later and out of scope for this plan.

---

## Open decisions for the microservice conversation

1. ~~**Language/runtime**~~: **Decided: Go 1.23+.**
2. **Hosting**: Docker Compose alongside main app (easiest), or separate Dokploy/Hetzner box. Needs persistent volume for repo clones.
3. **Mimetype-to-folder rules**: `file_type` defaults to `misc/` when null (both polling and simple-upload modes).
4. **Commit batching**: one commit per `/pending` batch for the polling loop; one commit per request for simple-upload mode (v1). Revisit if push rate becomes a bottleneck.
5. **Repo creation**: dedicated GitHub machine user with a PAT scoped to `repo`. Token stored in env, never in DB.
6. **Public visibility check**: refuse to mark a repo `is_active` if GitHub API reports it as private.
7. **Rate limiting**: simple-upload mode needs per-token rate limits (e.g. `60/min`, `10MB/s`), implemented as an in-memory token bucket keyed by `token_id`. v1 uses a single global default; per-token overrides are v2.
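
For the rate-limiting item, an in-memory token bucket keyed by `token_id` could be sketched with the standard library alone. Everything named here (`Limiter`, `NewLimiter`, `Allow`, `demo`) is hypothetical, only the request-count limit is modelled (not the bandwidth limit), and the injectable `now` clock exists so the behaviour can be checked deterministically.

```go
package main

import (
	"fmt"
	"math"
	"sync"
	"time"
)

type bucket struct {
	tokens float64
	last   time.Time
}

// Limiter holds one token bucket per token_id.
type Limiter struct {
	mu      sync.Mutex
	rate    float64 // tokens refilled per second
	burst   float64 // bucket capacity
	buckets map[int64]*bucket
	now     func() time.Time
}

func NewLimiter(perMinute, burst int) *Limiter {
	return &Limiter{
		rate:    float64(perMinute) / 60.0,
		burst:   float64(burst),
		buckets: make(map[int64]*bucket),
		now:     time.Now,
	}
}

// Allow refills the caller's bucket for the elapsed time, then spends one
// token if available.
func (l *Limiter) Allow(tokenID int64) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	now := l.now()
	b, ok := l.buckets[tokenID]
	if !ok {
		b = &bucket{tokens: l.burst, last: now}
		l.buckets[tokenID] = b
	}
	b.tokens = math.Min(l.burst, b.tokens+now.Sub(b.last).Seconds()*l.rate)
	b.last = now
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

// demo drives a 60/min, burst-2 limiter with a fake clock: two requests pass,
// the third is rejected, and one more passes after a second has elapsed.
func demo() []bool {
	clock := time.Unix(0, 0)
	l := NewLimiter(60, 2)
	l.now = func() time.Time { return clock }
	out := []bool{l.Allow(1), l.Allow(1), l.Allow(1)}
	clock = clock.Add(time.Second)
	return append(out, l.Allow(1))
}

func main() {
	fmt.Println(demo())
}
```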

---

## Summary of what already exists in BukidBountyApp to support this

- `file_list.cdn_url` (migration `2026_05_09_120000_add_cdn_url_to_file_list.php`)
- `file_list.is_public` (default false) and `file_list.file_type` (migration `2026_05_09_120100_add_is_public_and_file_type_to_file_list.php`)
- `FileList::resolvedUrl()`: prefers CDN URL when set, otherwise local route
- `FilesMainController::viewFilebyFileListHash`: 302-redirects to CDN URL when set, so all existing `<img :src="'/RequestData/File/' + hash">` references benefit transparently
- `FilesMainController::generateURLforFileListHash`: returns CDN URL when set in DB
- `FilesMainController::uploadFileList`: accepts a `?string $file_type` parameter; every existing caller sets one explicitly (or null for the generic `UploadFilefromRequest` endpoint)

Still missing (to be built when the microservice is built):
- `/internal/cdn/pending`, `/internal/cdn/bytes/{hash}`, `/internal/cdn/published` endpoints + bearer middleware
- Management UI for flipping `is_public` and assigning `file_type` to existing rows
- The microservice itself