Build Tracking
Database build, not vibes.
Execution tracker for the Dig implementation plan. This page is a manual progress snapshot for ingest, transforms, gates, and the next actions needed to move from data loading to retrieval/API work.
Current focus
Phase 5 — Enrichment pipeline live. MusicBrainz crosswalks (1.2M artists, 1.8M releases), relationship edges (423K), label linkouts (53K Bandcamp/Instagram), setlist.fm timeline.
Latest milestone commit
8f986f5
Tests
65 unit + 47 MCP smoke (18 contract, 47 remote)
Live surfaces
app.dig.baby (search UI) + dig-api.fly.dev (REST) + dig-mcp.fly.dev (MCP SSE)
Phase 0A — System foundations: done
Phase 0B — Profiling + normalization: done
Gate A — Passed: closed
Phase 1 — Ingest + canonical transforms: done
Gate B — Closed with caveats: closed
Phase 2 — Retrieval core: done
Gate C — Passed: closed
Phase 3 — REST + MCP alpha: done
Gate D — GO (staging): closed
Phase 4 — Data + search UI: done
Gate E — GO (soft alpha): closed
Phase 5 — Alpha hardening: week 1 done
EN-A — Enrichment schema: done
EN-B — Enrichment API: done
EN-C — MusicBrainz import: done
EN-D — Setlist.fm timeline: done
EN-E — Label linkouts: done
Full artist catalog re-ingest: in progress
Live Build Snapshot
Manual update from overnight runs
Raw entities
24,025,633
All 4 entity types ingested
raw_entities size
81 GB
Main disk pressure driver
DB size (post-transform)
192 GB
After full releases transform + FTS
Search benchmark
0 errors / 96
Run 8: p50 108ms, 7/7 warm SLOs pass
| Process | Status | Progress | Rate / note |
|---|---|---|---|
| Full restore to Fly | done | ~555M rows across 12 tables | pg_restore -j4, ~14h |
| Releases ingest | done | 18,876,362 releases loaded | ~3,958/s |
| Artists transform | done | Complete | 48s |
| Labels transform | done | Complete | 626s (~10 min) |
| Masters transform | done | Complete | 1,822s (~30 min) |
| Releases transform | done | 18,876,362 transformed | Cursor-pagination validated |
| Gate B checklist | closed w/ caveats | 6/6 checked | Partial artists dump caveat |
Full-corpus restore complete. Run 8 benchmark passed (7/7 warm SLOs). Phase 5 Week 1 shipped: search IA upgrade, track-level credits, product telemetry, alpha ops pack. Cover Art Archive integrated: 1.77M crosswalks, Redis cache. Frontend on Fly.io (always-on).
Search Benchmark
Run 9 — Full corpus (18.9M releases), Fly.io production — 46 queries × 2 runs, 0 errors
| Category | p50 | p95 | Warm SLO | Status |
|---|---|---|---|---|
| Release FTS | 111ms | 181ms | p95 < 500ms | pass |
| Common-term | 104ms | 208ms | p99 < 1,000ms | pass |
| Fuzzy | 266ms | 412ms | p95 < 500ms | pass |
| Filtered | 99ms | 337ms | p95 < 500ms | pass |
| Multi-entity | 124ms | 210ms | p95 < 500ms | pass |
| Unicode | 103ms | 129ms | p95 < 200ms | pass |
| Retrieval | 131ms | 194ms | p95 < 200ms | pass |
Run 9 (live, 2026-03-05): 46 queries × 2 runs against full 18.9M-release corpus on Fly.io. 0 errors. All 7 SLO categories pass. Fuzzy search (pg_trgm) is the slowest category at p50 266ms — expected under contention.
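The pass/fail column in the table above comes down to comparing a latency percentile against a threshold. A minimal sketch of that check follows, assuming the nearest-rank percentile definition; the actual benchmark runner may interpolate or aggregate differently, and the names `percentile`, `meetsSlo`, and the sample latencies are hypothetical.

```typescript
// Sketch only: nearest-rank percentile is an assumption, not the runner's confirmed method.
interface Slo {
  percentile: number; // e.g. 95 for p95
  limitMs: number;    // SLO threshold in milliseconds
}

// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  const value = sorted[rank - 1];
  if (value === undefined) throw new Error("percentile out of range");
  return value;
}

// An SLO such as "p95 < 500ms" passes when the percentile is strictly below the limit.
function meetsSlo(samplesMs: number[], slo: Slo): boolean {
  return percentile(samplesMs, slo.percentile) < slo.limitMs;
}

// Hypothetical latencies for one category (not real benchmark data).
const exampleSamples = [120, 180, 266, 240, 412, 300, 275, 150, 266, 390];
console.log(
  percentile(exampleSamples, 50),
  meetsSlo(exampleSamples, { percentile: 95, limitMs: 500 }),
);
```

With only 92 samples per run (46 queries × 2), a p99 figure rests on the single slowest sample under this definition, so small suites make the p99 SLO sensitive to one outlier.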
Dig vs Discogs API
Run 9 (live, 2026-03-05) — Sequential, throttled to Discogs 60 req/min limit
[Chart: latency comparison, Dig (Fly.io) vs Discogs API]
Queries ran sequentially, paced at 1.1s between requests to stay within Discogs's authenticated limit of 60 req/min. Dig was idle during each throttle window, so the latency figures reflect single-connection p50, not burst capacity. Under real concurrent load (see the Concurrent Stress Test section), Dig sustains 56 req/s at c100 with zero application errors. Discogs would rate-limit immediately at that volume.
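The pacing arithmetic can be sketched as follows: a 1.1s gap allows at most 54 requests per minute, which leaves headroom under the 60 req/min limit. The function names `maxRequestsPerMinute` and `runPaced` are hypothetical, and this is an illustration of sequential pacing rather than the comparison script itself.

```typescript
// Sketch only: hypothetical helpers illustrating sequential, paced requests.
const DISCOGS_LIMIT_PER_MIN = 60; // authenticated limit stated in the text
const PACE_MS = 1100;             // 1.1s gap between requests, as stated in the text

// Upper bound on requests per minute for a given gap (ignores request duration,
// which only lowers the real rate further).
function maxRequestsPerMinute(intervalMs: number): number {
  return Math.floor(60000 / intervalMs);
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs tasks one at a time, waiting intervalMs between consecutive tasks.
// The wait function is injectable so the pacing can be tested without real delays.
async function runPaced<T>(
  tasks: Array<() => Promise<T>>,
  intervalMs: number,
  wait: (ms: number) => Promise<void> = sleep,
): Promise<T[]> {
  const results: T[] = [];
  let first = true;
  for (const task of tasks) {
    if (!first) await wait(intervalMs);
    first = false;
    results.push(await task());
  }
  return results;
}

console.log(maxRequestsPerMinute(PACE_MS) <= DISCOGS_LIMIT_PER_MIN);
```

Because each request's own duration adds to the gap, the real rate sits somewhat below 54 req/min.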
Enrichment Pipeline
EN-A through EN-E — complete
- ✓EN-A — Schema: `enrich.*` schema (8 tables) applied local + Fly. Crosswalks, edges, context, linkouts, events.
- ✓EN-B — API: Enrichment endpoints live — relationships, context, timeline, linkouts. Query params: `include_enrichment`, `min_confidence`, `sources`.
- ✓EN-C — MusicBrainz: 1.77M release crosswalks, 1.21M artist crosswalks (200K with Wikidata QIDs), 423K relationship edges (23 edge types).
- ✓EN-D — Setlist.fm: Timeline pipeline live. 1,778 events across 208 artists. API endpoint + frontend display.
- ✓EN-E — Label linkouts: 53,233 label linkouts (34K Bandcamp, 19K Instagram). Verification queue with URL health checks. 6,808 verified.
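As an illustration of the EN-B query params, here is a minimal sketch of building a request URL. Only the parameter names (`include_enrichment`, `min_confidence`, `sources`) and the host come from this page. The `/artists/1` path, the 0 to 1 confidence range, the comma-separated `sources` encoding, and the source names are all assumptions made for the example.

```typescript
// Sketch only: path, value formats, and source names below are assumptions.
interface EnrichmentOptions {
  includeEnrichment?: boolean;
  minConfidence?: number; // assumed to be a score between 0 and 1
  sources?: string[];     // assumed to be sent as a comma-separated list
}

function buildEnrichmentUrl(base: string, path: string, opts: EnrichmentOptions): string {
  const url = new URL(path, base);
  if (opts.includeEnrichment !== undefined) {
    url.searchParams.set("include_enrichment", String(opts.includeEnrichment));
  }
  if (opts.minConfidence !== undefined) {
    if (opts.minConfidence < 0 || opts.minConfidence > 1) {
      throw new RangeError("minConfidence must be between 0 and 1 (assumed range)");
    }
    url.searchParams.set("min_confidence", String(opts.minConfidence));
  }
  if (opts.sources !== undefined && opts.sources.length > 0) {
    url.searchParams.set("sources", opts.sources.join(","));
  }
  return url.toString();
}

// Hypothetical entity path and source names.
console.log(
  buildEnrichmentUrl("https://dig-api.fly.dev", "/artists/1", {
    includeEnrichment: true,
    minConfidence: 0.8,
    sources: ["musicbrainz", "setlistfm"],
  }),
);
```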
Frontend + UX Polish
shipped
- ✓Discogs profile parsing: [aXXX]/[lXXX] refs rendered as clickable links.
- ✓Label linkout display: Bandcamp + Instagram pills with brand SVG icons.
- ✓Related artists: MusicBrainz relationship edges with human-readable labels.
- ✓External URLs: Domain name display instead of raw URLs.
- ✓Nav cleanup: Simplified navigation, reliable back link, clean search on sub-pages.
- ✓Media embeds: YouTube/video embeds on release pages from Discogs video data.
- ✓OG share cards: Dynamic Open Graph + Twitter Card metadata on all entity pages.
- ◐Artist catalog gap: Full artist re-ingest in progress (~9.8M from Discogs dump).
Roadmap & Checklist
Implementation plan execution tracker
Phase 0A / 0B + Gate A
Foundations, profiling, normalization
passed
- ✓System scaffold (monorepo, Fastify, Kysely, migrations, local Postgres/Redis, CI)
- ✓Full profiling for artists/labels/masters + 500k release sample
- ✓Normalization Dictionary v1 + Preserve/Normalize matrix + QA Gate Spec
- ✓Parser fixtures/tests and LEGAL draft completed; Gate A closed
Phase 1 + Gate B
Raw ingest, canonical transforms, QA, idempotency
closed w/ caveats
- ✓Ingest infra tables + catalog schema + indexes + FTS columns
- ✓Full-tree parser and ingest pipeline hardening; 52 tests passing
- ✓Raw ingest complete for all 4 entity types
- ✓Canonical upserts complete for releases, including child fanout tables
- ✓QA/reconciliation report completed and thresholds recalibrated
- ✓Idempotency and restart behavior validated with cursor-based rerun
- ✓FTS vectors populated (all 18,876,362 releases)
- ✓Gate B closed with caveats documented
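The idempotency and cursor-based rerun items above can be illustrated with a small in-memory sketch: batches are fetched strictly after a cursor (keyset pagination), and each row is written with an upsert keyed by id, so re-running from any earlier cursor leaves the target unchanged. The row shape, the `fetchBatch` and `transformAll` names, and the uppercase "transform" are hypothetical stand-ins; the real pipeline runs against Postgres.

```typescript
// Sketch only: an in-memory stand-in for a cursor-paginated, idempotent transform.
interface RawRow {
  id: number;
  payload: string;
}

// Keyset pagination: return up to `limit` rows with id strictly greater than the cursor,
// in ascending id order. In SQL this would be WHERE id > $cursor ORDER BY id LIMIT $limit.
function fetchBatch(rows: RawRow[], afterId: number, limit: number): RawRow[] {
  return rows
    .filter((row) => row.id > afterId)
    .sort((a, b) => a.id - b.id)
    .slice(0, limit);
}

// Processes all rows after startCursor and returns the final cursor.
// Writes are upserts keyed by id, so repeating a batch does not create duplicates.
function transformAll(
  rows: RawRow[],
  target: Map<number, string>,
  startCursor = 0,
  batchSize = 2,
): number {
  let cursor = startCursor;
  for (;;) {
    const batch = fetchBatch(rows, cursor, batchSize);
    const last = batch[batch.length - 1];
    if (last === undefined) return cursor; // empty batch: nothing left to process
    for (const row of batch) {
      target.set(row.id, row.payload.toUpperCase()); // placeholder transform
    }
    cursor = last.id; // persist this value to resume after a restart
  }
}

const demoTarget = new Map<number, string>();
const demoRows: RawRow[] = [
  { id: 2, payload: "b" },
  { id: 1, payload: "a" },
  { id: 3, payload: "c" },
];
console.log(transformAll(demoRows, demoTarget), demoTarget.size);
```

Unlike OFFSET-based paging, the cursor approach does not re-scan skipped rows, which matters at the 18.9M-release scale mentioned above.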
Phase 2
Retrieval core (search + entity retrieval + traversal)
done
- ✓Query envelope + response contracts locked
- ✓Multi-entity FTS search with filters + fuzzy fallback
- ✓Entity retrieval services: artist, label, master, release
- ✓Traversal services: 5 link types
- ✓Benchmark runner: 32-query suite, 8 categories
- ✓Statement timeout enforcement + broad query detection
- ✓Two-path release search rewrite + stop-word fix
- ✓Discogs API comparison: Dig faster in 7/7 categories
- ✓Run 5-6: 0 errors / 96 queries, warm SLOs improving
Phase 3
REST API + MCP public alpha
done
- ✓REST API: two-tier rate limiting, CORS, structured logging
- ✓MCP server: 6 tools, SSE transport, 47 smoke tests passing
- ✓Deployed to Fly.io: dig-api + dig-mcp + Fly Postgres + Upstash Redis
- ✓Run 7: 32 queries, 0 errors, p50 117ms
- ✓Gate D: GO (staging alpha)
- ✓Docs: quickstart, ops runbook, alpha invite, Phase 4 prerequisites
Phase 4
Full data load + human search UI + Gate E
done
- ✓Full releases dataset migration (~555M rows, 12 tables)
- ✓Run 8: 0/96 errors, p50 108ms, 7/7 warm SLOs pass
- ✓Next.js frontend: search + entity pages, CSS Modules, server-side API
- ✓Deployed to Fly.io (always-on), migrated from Vercel
- ✓Master-first search IA, entity pages, URL restructure
- ✓Cover Art Archive: 1.77M crosswalks, cover proxy + Redis cache
- ✓Gate E: GO for soft alpha (5-10 testers)
Phase 5 — Week 1
Alpha hardening, UX depth, instrumentation
in progress
- ✓Day 1 — SLO Baseline: Froze alpha SLO table, load tested c100
- ✓Day 2 — Filtered Query Hardening: Zero 5xx under c100
- ✓Day 3 — Track-Level Credits UX: Per-track credits grouped by role
- ✓Day 4 — Search IA Upgrade: Exact/prefix boost, FK dedup, per-type cap
- ✓Day 5 — Product Instrumentation: 5 event types, structured JSON to Fly logs
- ✓Day 6 — Alpha Ops Pack: Events rate limiting, issue templates, runbook
- ✓Day 7 — UX Polish: Version format/country tags, collapsible aliases
- ◐Soft Alpha: Invites ready, 5 keys issued, monitoring pending
- ·User auth + collections remain post-alpha scope
Data layer: 18.9M releases + 2.5M masters + 2.3M labels + 289K artists (full re-ingest in progress). Discogs CC0 February 2026 dump on Fly.io. Disk: 158GB / 300GB.
Enrichment: 1.77M release crosswalks + 1.21M artist crosswalks + 423K relationship edges + 53K label linkouts + 1.8K setlist events.
Search: Postgres FTS with exact/prefix name boosting, pg_trgm fuzzy, FK-based dedup, per-type result caps. Run 9: 7/7 warm SLOs pass.
Live: app.dig.baby (search UI) + dig-api.fly.dev (REST) + dig-mcp.fly.dev (MCP SSE). Cover art via CAA (1.77M releases). Enrichment API live.
Concurrent Stress Test
2026-03-05 — Live API (Fly.io, shared-cpu-1x), mixed query workload
| Test | Concurrency | Requests | p50 | p95 | p99 | Success | Throughput |
|---|---|---|---|---|---|---|---|
| Warm-up run | 50 | 200 | 255ms | 3,145ms | 3,415ms | 200/200 (100%) | 20.7 req/s |
| Full pressure | 100 | 300 | 426ms | 1,986ms | 2,091ms | 295/300 (98.3%) | 56.4 req/s |
Mixed workload: search (FTS, fuzzy, filtered, cross-entity), entity retrieval, and traversal queries in random rotation. Zero application errors at both pressure levels. The 5 non-success responses at c100 (295/300) are rate-limit rejections (429), which are expected at that concurrency: the limiter triggered correctly after ~300 keyed req/min. Fuzzy search (pg_trgm) dominates the tail latency; retrieval and traversal stay under 250ms p50 at full concurrency.
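To show why a c100 run of 300 requests can trip a ~300 req/min keyed limit, here is a minimal sketch of a per-key limiter. It assumes a fixed-window counter; the page does not say which algorithm the API uses, and `createLimiter` is a hypothetical name. The clock is passed in explicitly so the behaviour is deterministic.

```typescript
// Sketch only: fixed-window counting is an assumption about the limiter's algorithm.
interface WindowState {
  start: number; // timestamp (ms) when the current window opened
  count: number; // requests accepted in the current window
}

// Returns an allow(key, nowMs) function that accepts up to `limit` requests
// per key in each window of `windowMs` milliseconds.
function createLimiter(limit: number, windowMs: number): (key: string, nowMs: number) => boolean {
  const windows = new Map<string, WindowState>();
  return (key: string, nowMs: number): boolean => {
    const state = windows.get(key);
    if (state === undefined || nowMs - state.start >= windowMs) {
      windows.set(key, { start: nowMs, count: 1 }); // open a fresh window
      return true;
    }
    if (state.count >= limit) return false; // caller would respond with HTTP 429
    state.count += 1;
    return true;
  };
}

// ~300 keyed requests per minute, as quoted in the text.
const allow = createLimiter(300, 60000);
let accepted = 0;
for (let i = 0; i < 305; i++) {
  if (allow("alpha-key", 0)) accepted += 1;
}
console.log(accepted);
```

In this sketch, a single key that sends more than 300 requests inside one window gets rejections for the overflow, which matches the kind of small 429 tail seen in the full-pressure row.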