Dig
Early stage. Building in public.
Build Tracking

Database build, not vibes.

Execution tracker for the Dig implementation plan. This page is a manual progress snapshot for ingest, transforms, gates, and the next actions needed to move from data loading to retrieval/API work.

Current focus: Phase 5, enrichment pipeline live. MusicBrainz crosswalks (1.2M artists, 1.8M releases), relationship edges (423K), label linkouts (53K Bandcamp/Instagram), setlist.fm timeline.
Latest milestone commit: 8f986f5
Tests: 65 unit + 47 MCP smoke (18 contract, 47 remote)
Live surfaces: app.dig.baby (search UI) + dig-api.fly.dev (REST) + dig-mcp.fly.dev (MCP SSE)
| Item | Status |
|---|---|
| Phase 0A: System foundations | done |
| Phase 0B: Profiling + normalization | done |
| Gate A: Passed | closed |
| Phase 1: Ingest + canonical transforms | done |
| Gate B: Closed with caveats | closed |
| Phase 2: Retrieval core | done |
| Gate C: Passed | closed |
| Phase 3: REST + MCP alpha | done |
| Gate D: GO (staging) | closed |
| Phase 4: Data + search UI | done |
| Gate E: GO (soft alpha) | closed |
| Phase 5: Alpha hardening | week 1 done |
| EN-A: Enrichment schema | done |
| EN-B: Enrichment API | done |
| EN-C: MusicBrainz import | done |
| EN-D: Setlist.fm timeline | done |
| EN-E: Label linkouts | done |
| Full artist catalog re-ingest | in progress |

Live Build Snapshot

Manual update from overnight runs
| Metric | Value | Note |
|---|---|---|
| Raw entities | 24,025,633 | All 4 entity types ingested |
| raw_entities size | 81 GB | Main disk pressure driver |
| DB size (post-transform) | 192 GB | After full releases transform + FTS |
| Search benchmark | 0 errors / 96 | Run 8: p50 108ms, 7/7 warm SLOs pass |
| Process | Status | Progress | Rate / note |
|---|---|---|---|
| Full restore to Fly | done | ~555M rows across 12 tables | pg_restore -j4, ~14h |
| Releases ingest | done | 18,876,362 releases loaded | ~3,958/s |
| Artists transform | done | Complete | 48s |
| Labels transform | done | Complete | 626s (~10 min) |
| Masters transform | done | Complete | 1,822s (~30 min) |
| Releases transform | done | 18,876,362 transformed | Cursor-pagination validated |
| Gate B checklist | closed w/ caveats | 6/6 checked | Partial artists dump caveat |
Full-corpus restore complete. Run 8 benchmark passed (7/7 warm SLOs). Phase 5 Week 1 shipped: search IA upgrade, track-level credits, product telemetry, alpha ops pack. Cover Art Archive integrated: 1.77M crosswalks, Redis cache. Frontend on Fly.io (always-on).

Search Benchmark

Run 9 — Full corpus (18.9M releases), Fly.io production — 46 queries × 2 runs, 0 errors
| Category | p50 | p95 | Warm SLO | Status |
|---|---|---|---|---|
| Release FTS | 111ms | 181ms | p95 < 500ms | pass |
| Common-term | 104ms | 208ms | p99 < 1,000ms | pass |
| Fuzzy | 266ms | 412ms | p95 < 500ms | pass |
| Filtered | 99ms | 337ms | p95 < 500ms | pass |
| Multi-entity | 124ms | 210ms | p95 < 500ms | pass |
| Unicode | 103ms | 129ms | p95 < 200ms | pass |
| Retrieval | 131ms | 194ms | p95 < 200ms | pass |

Overall p50: 122ms
Overall p95: 412ms
SLOs pass: 7 / 7
Run 9 (live, 2026-03-05): 46 queries × 2 runs against full 18.9M-release corpus on Fly.io. 0 errors. All 7 SLO categories pass. Fuzzy search (pg_trgm) is the slowest category at p50 266ms — expected under contention.
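
To make the SLO columns concrete, here is a small sketch of how a p50/p95 figure could be derived from raw latency samples and compared with a warm SLO. It assumes the nearest-rank percentile method; this page does not say which method the benchmark runner uses. The sample latencies in the usage note are invented for illustration, while `REPORTED_P95_MS` copies the p95 values and limits from the Run 9 table.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with >= pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_slo(samples, pct, limit_ms):
    """True when the pct-th percentile latency is strictly under limit_ms."""
    return percentile(samples, pct) < limit_ms


# (reported p95 in ms, SLO limit in ms) for the p95-based categories in Run 9.
# Common-term is left out: its SLO is on p99, which the table does not report.
REPORTED_P95_MS = {
    "Release FTS": (181, 500),
    "Fuzzy": (412, 500),
    "Filtered": (337, 500),
    "Multi-entity": (210, 500),
    "Unicode": (129, 200),
    "Retrieval": (194, 200),
}
```

For example, with the made-up samples `[95, 101, 99, 337, 120, 98, 97, 110, 105, 100]`, `percentile(samples, 50)` is 100 and `percentile(samples, 95)` is 337, so a single slow request sets the p95 for a ten-sample run. Among the reported figures, Retrieval (194ms against a 200ms limit) has the least headroom.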

Dig vs Discogs API

Run 9 (live, 2026-03-05) — Sequential, throttled to Discogs 60 req/min limit
| Category | Dig (Fly.io) p50 | Discogs API p50 | Result |
|---|---|---|---|
| Release FTS | 111ms | 184ms | Dig 1.7x |
| Common-term | 104ms | 257ms | Dig 2.5x |
| Fuzzy | 266ms | 213ms | Even |
| Filtered | 99ms | 203ms | Dig 2.0x |
| Multi-entity | 124ms | 186ms | Dig 1.5x |
| Unicode | 103ms | 186ms | Dig 1.8x |
| Retrieval | 131ms | 193ms | Dig 1.5x |

Dig p50: 122ms
Discogs p50: 213ms
Categories Dig wins: 6 / 7
Queries ran sequentially, paced at 1.1s between requests to stay within Discogs's authenticated limit of 60 req/min. Dig was idle during each throttle window, so the latency figures reflect single-connection p50, not burst capacity. Under real concurrent load (see the stress test below), Dig handles 56 req/s at c100 with zero application errors. Discogs would rate-limit immediately at that volume.
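
The pacing arithmetic is simple: one request every 1.1s is at most 60 / 1.1 ≈ 54.5 requests per minute, which leaves margin under the 60 req/min limit. A paced sequential runner might look like the sketch below. It is an illustration, not the benchmark script used here: `send` is a hypothetical callable that issues one query, and the sketch assumes the 1.1s spacing is measured start to start.

```python
import time


def max_requests_per_minute(interval_s):
    """Upper bound on request rate when starts are spaced interval_s apart."""
    return 60.0 / interval_s


def paced_run(send, queries, interval_s=1.1,
              clock=time.monotonic, sleep=time.sleep):
    """Call send(query) one at a time, starting requests >= interval_s apart.

    Returns per-query latency in milliseconds. Only the time spent inside
    send() is measured, so the pacing delay never inflates the latency.
    clock and sleep are injectable so the pacing can be tested without waiting.
    """
    latencies_ms = []
    next_start = clock()
    for query in queries:
        wait = next_start - clock()
        if wait > 0:
            sleep(wait)
        started = clock()
        send(query)
        latencies_ms.append((clock() - started) * 1000.0)
        next_start = started + interval_s
    return latencies_ms
```

Measuring inside the loop rather than around it is what keeps the comparison fair: both services are timed per request, and the idle throttle window is excluded from both.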

Enrichment Pipeline

EN-A through EN-E — complete
  • ✓EN-A — Schema: enrich.* schema (8 tables) applied local + Fly. Crosswalks, edges, context, linkouts, events.
  • ✓EN-B — API: Enrichment endpoints live — relationships, context, timeline, linkouts. Query params: include_enrichment, min_confidence, sources.
  • ✓EN-C — MusicBrainz: 1.77M release crosswalks, 1.21M artist crosswalks (200K with Wikidata QIDs), 423K relationship edges (23 edge types).
  • ✓EN-D — Setlist.fm: Timeline pipeline live. 1,778 events across 208 artists. API endpoint + frontend display.
  • ✓EN-E — Label linkouts: 53,233 label linkouts (34K Bandcamp, 19K Instagram). Verification queue with URL health checks. 6,808 verified.
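
As a rough illustration of the EN-B query parameters, the sketch below builds a request URL carrying `include_enrichment`, `min_confidence`, and `sources`. Only the host and the three parameter names come from this page. The `https` scheme, the `/artists/123` path, the `true` encoding for the boolean, the 0 to 1 confidence range, the comma-separated source list, and the source names are all assumptions; check the API docs for the real shapes.

```python
from urllib.parse import urlencode

BASE_URL = "https://dig-api.fly.dev"  # host named on this page; scheme assumed


def enrichment_url(path, include_enrichment=True, min_confidence=None, sources=None):
    """Build a URL with the enrichment query parameters (encodings are assumed)."""
    params = {}
    if include_enrichment:
        params["include_enrichment"] = "true"  # assumed boolean encoding
    if min_confidence is not None:
        if not 0.0 <= min_confidence <= 1.0:  # assumed valid range
            raise ValueError("min_confidence must be between 0 and 1")
        params["min_confidence"] = f"{min_confidence:g}"
    if sources:
        params["sources"] = ",".join(sources)  # assumed comma-separated list
    query = urlencode(params)
    return f"{BASE_URL}{path}?{query}" if query else f"{BASE_URL}{path}"


# Hypothetical path and source names, for illustration only.
example = enrichment_url(
    "/artists/123", min_confidence=0.8, sources=["musicbrainz", "setlistfm"]
)
```

`urlencode` percent-encodes the comma in the source list, so the example URL ends in `sources=musicbrainz%2Csetlistfm`.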

Frontend + UX Polish

shipped
  • ✓Discogs profile parsing: [aXXX]/[lXXX] refs rendered as clickable links.
  • ✓Label linkout display: Bandcamp + Instagram pills with brand SVG icons.
  • ✓Related artists: MusicBrainz relationship edges with human-readable labels.
  • ✓External URLs: Domain name display instead of raw URLs.
  • ✓Nav cleanup: Simplified navigation, reliable back link, clean search on sub-pages.
  • ✓Media embeds: YouTube/video embeds on release pages from Discogs video data.
  • ✓OG share cards: Dynamic Open Graph + Twitter Card metadata on all entity pages.
  • ◐Artist catalog gap: Full artist re-ingest in progress (~9.8M from Discogs dump).

Roadmap & Checklist

Implementation plan execution tracker
Phase 0A / 0B + Gate A
Foundations, profiling, normalization
passed
  • ✓System scaffold (monorepo, Fastify, Kysely, migrations, local Postgres/Redis, CI)
  • ✓Full profiling for artists/labels/masters + 500k release sample
  • ✓Normalization Dictionary v1 + Preserve/Normalize matrix + QA Gate Spec
  • ✓Parser fixtures/tests and LEGAL draft completed; Gate A closed
Phase 1 + Gate B
Raw ingest, canonical transforms, QA, idempotency
closed w/ caveats
  • ✓Ingest infra tables + catalog schema + indexes + FTS columns
  • ✓Full-tree parser and ingest pipeline hardening; 52 tests passing
  • ✓Raw ingest complete for all 4 entity types
  • ✓Canonical upserts complete for releases, including child fanout tables
  • ✓QA/reconciliation report completed and thresholds recalibrated
  • ✓Idempotency and restart behavior validated with cursor-based rerun
  • ✓FTS vectors populated (all 18,876,362 releases)
  • ✓Gate B closed with caveats documented
Phase 2
Retrieval core (search + entity retrieval + traversal)
done
  • ✓Query envelope + response contracts locked
  • ✓Multi-entity FTS search with filters + fuzzy fallback
  • ✓Entity retrieval services: artist, label, master, release
  • ✓Traversal services: 5 link types
  • ✓Benchmark runner: 32-query suite, 8 categories
  • ✓Statement timeout enforcement + broad query detection
  • ✓Two-path release search rewrite + stop-word fix
  • ✓Discogs API comparison: Dig faster in 7/7 categories
  • ✓Run 5-6: 0 errors / 96 queries, warm SLOs improving
Phase 3
REST API + MCP public alpha
done
  • ✓REST API: two-tier rate limiting, CORS, structured logging
  • ✓MCP server: 6 tools, SSE transport, 47 smoke tests passing
  • ✓Deployed to Fly.io: dig-api + dig-mcp + Fly Postgres + Upstash Redis
  • ✓Run 7: 32 queries, 0 errors, p50 117ms
  • ✓Gate D: GO (staging alpha)
  • ✓Docs: quickstart, ops runbook, alpha invite, Phase 4 prerequisites
Phase 4
Full data load + human search UI + Gate E
done
  • ✓Full releases dataset migration (~555M rows, 12 tables)
  • ✓Run 8: 0/96 errors, p50 108ms, 7/7 warm SLOs pass
  • ✓Next.js frontend: search + entity pages, CSS Modules, server-side API
  • ✓Deployed to Fly.io (always-on), migrated from Vercel
  • ✓Master-first search IA, entity pages, URL restructure
  • ✓Cover Art Archive: 1.77M crosswalks, cover proxy + Redis cache
  • ✓Gate E: GO for soft alpha (5-10 testers)
Phase 5 — Week 1
Alpha hardening, UX depth, instrumentation
in progress
  • ✓Day 1 — SLO Baseline: Froze alpha SLO table, load tested c100
  • ✓Day 2 — Filtered Query Hardening: Zero 5xx under c100
  • ✓Day 3 — Track-Level Credits UX: Per-track credits grouped by role
  • ✓Day 4 — Search IA Upgrade: Exact/prefix boost, FK dedup, per-type cap
  • ✓Day 5 — Product Instrumentation: 5 event types, structured JSON to Fly logs
  • ✓Day 6 — Alpha Ops Pack: Events rate limiting, issue templates, runbook
  • ✓Day 7 — UX Polish: Version format/country tags, collapsible aliases
  • ◐Soft Alpha: Invites ready, 5 keys issued, monitoring pending
  • ·User auth + collections remain post-alpha scope
Data layer: 18.9M releases + 2.5M masters + 2.3M labels + 289K artists (full re-ingest in progress). Discogs CC0 February 2026 dump on Fly.io. Disk: 158GB / 300GB.
Enrichment: 1.77M release crosswalks + 1.21M artist crosswalks + 423K relationship edges + 53K label linkouts + 1.8K setlist events.
Search: Postgres FTS with exact/prefix name boosting, pg_trgm fuzzy, FK-based dedup, per-type result caps. Run 9: 7/7 warm SLOs pass.
Live: app.dig.baby (search UI) + dig-api.fly.dev (REST) + dig-mcp.fly.dev (MCP SSE). Cover art via CAA (1.77M releases). Enrichment API live.
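
For readers unfamiliar with the pg_trgm fuzzy path named in the Search summary, the sketch below approximates in plain Python what the extension's `similarity()` computes: each word is lowercased, padded with two leading spaces and one trailing space, split into three-character windows, and the score is the shared trigram count divided by the total distinct trigram count. This is an illustration only. The real work happens inside Postgres with index support, and the 0.3 cutoff shown is pg_trgm's default threshold, not necessarily the value Dig uses.

```python
import re


def trigrams(text):
    """Approximate pg_trgm's trigram set for a string."""
    grams = set()
    # Non-alphanumeric characters act as word separators.
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = f"  {word} "
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def similarity(a, b):
    """Shared trigrams divided by distinct trigrams across both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def is_fuzzy_match(a, b, threshold=0.3):
    """True when similarity exceeds the threshold (0.3 is pg_trgm's default)."""
    return similarity(a, b) > threshold
```

For instance, `similarity("word", "two words")` is 4/11 (about 0.36): the two strings share 4 of their 11 distinct trigrams. A misspelling such as `"aphx twin"` against `"Aphex Twin"` scores 8/13, well above the default cutoff.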

Concurrent Stress Test

2026-03-05 — Live API (Fly.io, shared-cpu-1x), mixed query workload
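| Test | Concurrency | Requests | p50 | p95 | p99 | Success | Throughput |
|---|---|---|---|---|---|---|---|
| Warm-up run | 50 | 200 | 255ms | 3,145ms | 3,415ms | 200/200 (100%) | 20.7 req/s |
| Full pressure | 100 | 300 | 426ms | 1,986ms | 2,091ms | 295/300 (98.3%) | 56.4 req/s |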
Mixed workload: search (FTS, fuzzy, filtered, cross-entity), entity retrieval, and traversal queries in random rotation. Zero application errors at both pressure levels. Rate limiting (429) is expected at c100 — triggered correctly after ~300 keyed req/min. Fuzzy search (pg_trgm) dominates the tail latency; retrieval and traversal stay under 250ms p50 at full concurrency.
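
The 429 behavior described above can be pictured with a per-key fixed-window counter. The sketch below is a generic illustration under stated assumptions, not Dig's limiter: the page only says that limiting is two-tier and triggers after roughly 300 keyed requests per minute, so the fixed-window algorithm, the in-memory storage, and the class and function names are all hypothetical.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """Allow at most `limit` requests per key in each window_s-second window."""

    def __init__(self, limit=300, window_s=60.0):
        self.limit = limit
        self.window_s = window_s
        # (key, window index) -> request count. Stale buckets are never
        # evicted in this sketch; a shared store with key expiry would
        # handle that in a real deployment.
        self._counts = defaultdict(int)

    def allow(self, key, now_s):
        """Record a request at time now_s and report whether it is allowed."""
        bucket = (key, int(now_s // self.window_s))
        if self._counts[bucket] >= self.limit:
            return False
        self._counts[bucket] += 1
        return True


def status_for(limiter, key, now_s):
    """Map the limiter decision to an HTTP status: 200 allowed, 429 limited."""
    return 200 if limiter.allow(key, now_s) else 429
```

With the default limit, the 301st request from one key inside the same minute returns 429 while other keys are unaffected, and the key is served again once the next window starts.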