[dig]: house scene 88-08
Maintenance mode, but still worth landing on.
Catalog data from Discogs (CC0). Cover art from Cover Art Archive. Crosswalk data from MusicBrainz (CC0). Editorial classifications by dig — independent, opinionated, fixable.
Build Tracking

Database build, not vibes.

Execution tracker for the Dig implementation plan. This page is a manual progress snapshot for ingest, transforms, gates, and the next actions needed to move from data loading to retrieval/API work.

Current focus
Phase 5 — Enrichment pipeline live. MusicBrainz crosswalks (1.2M artists, 1.8M releases), relationship edges (423K), label linkouts (53K Bandcamp/Instagram), setlist.fm timeline.
Latest milestone commit
8f986f5
Tests
65 unit + 18 contract (MCP smoke retired with the server)
Live surfaces
app.dig.baby (search UI) + dig-api.fly.dev (REST). MCP server archived.
| Stage | Status |
| --- | --- |
| Phase 0A: System foundations | done |
| Phase 0B: Profiling + normalization | done |
| Gate A: Passed | closed |
| Phase 1: Ingest + canonical transforms | done |
| Gate B: Closed with caveats | closed |
| Phase 2: Retrieval core | done |
| Gate C: Passed | closed |
| Phase 3: REST + MCP alpha | done |
| Gate D: GO (staging) | closed |
| Phase 4: Data + search UI | done |
| Gate E: GO (soft alpha) | closed |
| Phase 5: Alpha hardening | week 1 done |
| EN-A: Enrichment schema | done |
| EN-B: Enrichment API | done |
| EN-C: MusicBrainz import | done |
| EN-D: Setlist.fm timeline | done |
| EN-E: Label linkouts | done |
| Full artist catalog re-ingest | in progress |

Live Build Snapshot

Manual update from overnight runs
| Metric | Value | Note |
| --- | --- | --- |
| Raw entities | 24,025,633 | All 4 entity types ingested |
| raw_entities size | 81 GB | Main disk pressure driver |
| DB size (post-transform) | 192 GB | After full releases transform + FTS |
| Search benchmark | 0 errors / 96 | Run 8: p50 108ms, 7/7 warm SLOs pass |
| Process | Status | Progress | Rate / note |
| --- | --- | --- | --- |
| Full restore to Fly | done | ~555M rows across 12 tables | pg_restore -j4, ~14h |
| Releases ingest | done | 18,876,362 releases loaded | ~3,958/s |
| Artists transform | done | Complete | 48s |
| Labels transform | done | Complete | 626s (~10 min) |
| Masters transform | done | Complete | 1,822s (~30 min) |
| Releases transform | done | 18,876,362 transformed | Cursor-pagination validated |
| Gate B checklist | closed w/ caveats | 6/6 checked | Partial artists dump caveat |
Full-corpus restore complete. Run 8 benchmark passed (7/7 warm SLOs). Phase 5 Week 1 shipped: search IA upgrade, track-level credits, product telemetry, alpha ops pack. Cover Art Archive integrated: 1.77M crosswalks, Redis cache. Frontend on Fly.io (always-on).
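The "cursor-pagination validated" note in the table above refers to resumable, idempotent transforms. A minimal sketch of that pattern follows, assuming keyset pagination over a monotonically increasing id (in SQL terms, `WHERE id > :cursor ORDER BY id LIMIT :batch_size`). The function name, row shape, and in-memory tables are hypothetical stand-ins, not Dig's actual pipeline code.

```python
from typing import Callable


def transform_in_batches(
    rows: list[dict],
    upsert: Callable[[dict], None],
    cursor: int = 0,
    batch_size: int = 1000,
) -> int:
    """Upsert every row with id > cursor, in id order; return the new cursor."""
    pending = sorted((r for r in rows if r["id"] > cursor), key=lambda r: r["id"])
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        for row in batch:
            upsert(row)  # keyed by id, so replaying a row overwrites instead of duplicating
        cursor = batch[-1]["id"]  # checkpoint: a restart resumes after this id
    return cursor


# Hypothetical in-memory stand-ins for the raw and canonical tables.
raw = [{"id": i, "title": f"release {i}"} for i in range(1, 8)]
canonical: dict[int, str] = {}


def upsert(row: dict) -> None:
    canonical[row["id"]] = row["title"].title()


cursor = transform_in_batches(raw[:4], upsert, batch_size=3)  # run that stopped early
cursor = transform_in_batches(raw, upsert, cursor=cursor, batch_size=3)  # resumed run
```

Because the upsert is keyed by id, rerunning from cursor 0 leaves the canonical table unchanged, which is the property a rerun check would look for.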

Search Benchmark

Run 9 — Full corpus (18.9M releases), Fly.io production — 46 queries × 2 runs, 0 errors
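| Category | p50 | p95 | Warm SLO | Status |
| --- | --- | --- | --- | --- |
| Release FTS | 111ms | 181ms | p95 < 500ms | pass |
| Common-term | 104ms | 208ms | p99 < 1,000ms | pass |
| Fuzzy | 266ms | 412ms | p95 < 500ms | pass |
| Filtered | 99ms | 337ms | p95 < 500ms | pass |
| Multi-entity | 124ms | 210ms | p95 < 500ms | pass |
| Unicode | 103ms | 129ms | p95 < 200ms | pass |
| Retrieval | 131ms | 194ms | p95 < 200ms | pass |

Overall p50: 122ms. Overall p95: 412ms. SLOs passing: 7 / 7.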
Run 9 (live, 2026-03-05): 46 queries × 2 runs against full 18.9M-release corpus on Fly.io. 0 errors. All 7 SLO categories pass. Fuzzy search (pg_trgm) is the slowest category at p50 266ms — expected under contention.
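For readers who want to reproduce this kind of pass/fail check, here is a small sketch of a percentile calculation and an SLO comparison. It assumes the nearest-rank percentile method; the page does not say which method the benchmark runner uses, and the sample latencies below are illustrative, not Run 9 measurements.

```python
import math


def percentile(samples_ms, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_slo(samples_ms, pct, limit_ms):
    """True when the chosen percentile is strictly below the limit (e.g. p95 < 500ms)."""
    return percentile(samples_ms, pct) < limit_ms


# Illustrative latencies only (assumption), in milliseconds.
samples = [100, 110, 120, 130, 400]
print(percentile(samples, 50), percentile(samples, 95))
print(meets_slo(samples, 95, 500), meets_slo(samples, 95, 200))
```

With only a handful of samples per category, a single slow request sets the p95, so tail figures from small runs are best read as indicative.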

Dig vs Discogs API

Run 9 (live, 2026-03-05) — Sequential, throttled to Discogs 60 req/min limit
Dig (Fly.io) vs Discogs API, p50 latency per category:

| Category | Dig p50 | Discogs p50 | Result |
| --- | --- | --- | --- |
| Release FTS | 111ms | 184ms | Dig 1.7x |
| Common-term | 104ms | 257ms | Dig 2.5x |
| Fuzzy | 266ms | 213ms | Even |
| Filtered | 99ms | 203ms | Dig 2.0x |
| Multi-entity | 124ms | 186ms | Dig 1.5x |
| Unicode | 103ms | 186ms | Dig 1.8x |
| Retrieval | 131ms | 193ms | Dig 1.5x |

Dig p50: 122ms. Discogs p50: 213ms. Categories Dig wins: 6 / 7.
Queries ran sequentially, paced at 1.1s between requests to stay within Discogs's 60 req/min authenticated limit. Dig was idle during each throttle window, so the latency figures reflect single-connection p50, not burst capacity. Under real concurrent load (see stress test below), Dig sustains 56 req/s at c100 with zero application errors. Discogs would rate-limit immediately at that volume.
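The pacing described above can be sketched as a sequential runner that sleeps between requests and times only the request itself. This is a sketch under stated assumptions: `paced_run`, `fetch`, and the injectable `sleep`/`clock` parameters are hypothetical names, not the actual benchmark runner.

```python
import statistics
import time


def paced_run(queries, fetch, interval_s=1.1, sleep=time.sleep, clock=time.perf_counter):
    """Call fetch(query) one at a time, pausing interval_s between requests.

    Returns per-request latencies in milliseconds; the pause is not counted.
    """
    latencies_ms = []
    for i, query in enumerate(queries):
        if i > 0:
            sleep(interval_s)  # throttle window: the target service sits idle here
        start = clock()
        fetch(query)
        latencies_ms.append((clock() - start) * 1000)
    return latencies_ms


def max_requests_per_minute(interval_s):
    """Upper bound on request rate, reached only if every request took zero time."""
    return 60 / interval_s


# Dry run with a no-op fetch and a recorded (not real) sleep.
pauses = []
latencies = paced_run(["q1", "q2", "q3"], lambda q: None, sleep=pauses.append)
print(statistics.median(latencies), pauses, max_requests_per_minute(1.1))
```

A 1.1s gap caps the rate at about 54.5 requests per minute even if every response were instantaneous, which leaves headroom under a 60 req/min limit.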

Enrichment Pipeline

EN-A through EN-E — complete
  • ✓EN-A — Schema: enrich.* schema (8 tables) applied local + Fly. Crosswalks, edges, context, linkouts, events.
  • ✓EN-B — API: Enrichment endpoints live — relationships, context, timeline, linkouts. Query params: include_enrichment, min_confidence, sources.
  • ✓EN-C — MusicBrainz: 1.77M release crosswalks, 1.21M artist crosswalks (200K with Wikidata QIDs), 423K relationship edges (23 edge types).
  • ✓EN-D — Setlist.fm: Timeline pipeline live. 1,778 events across 208 artists. API endpoint + frontend display.
  • ✓EN-E — Label linkouts: 53,233 label linkouts (34K Bandcamp, 19K Instagram). Verification queue with URL health checks. 6,808 verified.

Frontend + UX Polish

shipped
  • ✓Discogs profile parsing: [aXXX]/[lXXX] refs rendered as clickable links.
  • ✓Label linkout display: Bandcamp + Instagram pills with brand SVG icons.
  • ✓Related artists: MusicBrainz relationship edges with human-readable labels.
  • ✓External URLs: Domain name display instead of raw URLs.
  • ✓Nav cleanup: Simplified navigation, reliable back link, clean search on sub-pages.
  • ✓Media embeds: YouTube/video embeds on release pages from Discogs video data.
  • ✓OG share cards: Dynamic Open Graph + Twitter Card metadata on all entity pages.
  • ◐Artist catalog gap: Full artist re-ingest in progress (~9.8M from Discogs dump).

Roadmap & Checklist

Implementation plan execution tracker
Phase 0A / 0B + Gate A
Foundations, profiling, normalization
passed
  • ✓System scaffold (monorepo, Fastify, Kysely, migrations, local Postgres/Redis, CI)
  • ✓Full profiling for artists/labels/masters + 500k release sample
  • ✓Normalization Dictionary v1 + Preserve/Normalize matrix + QA Gate Spec
  • ✓Parser fixtures/tests and LEGAL draft completed; Gate A closed
Phase 1 + Gate B
Raw ingest, canonical transforms, QA, idempotency
closed w/ caveats
  • ✓Ingest infra tables + catalog schema + indexes + FTS columns
  • ✓Full-tree parser and ingest pipeline hardening; 52 tests passing
  • ✓Raw ingest complete for all 4 entity types
  • ✓Canonical upserts complete for releases, including child fanout tables
  • ✓QA/reconciliation report completed and thresholds recalibrated
  • ✓Idempotency and restart behavior validated with cursor-based rerun
  • ✓FTS vectors populated (all 18,876,362 releases)
  • ✓Gate B closed with caveats documented
Phase 2
Retrieval core (search + entity retrieval + traversal)
done
  • ✓Query envelope + response contracts locked
  • ✓Multi-entity FTS search with filters + fuzzy fallback
  • ✓Entity retrieval services: artist, label, master, release
  • ✓Traversal services: 5 link types
  • ✓Benchmark runner: 32-query suite, 8 categories
  • ✓Statement timeout enforcement + broad query detection
  • ✓Two-path release search rewrite + stop-word fix
  • ✓Discogs API comparison: Dig faster in 7/7 categories
  • ✓Run 5-6: 0 errors / 96 queries, warm SLOs improving
Phase 3
REST API + MCP public alpha
done
  • ✓REST API: two-tier rate limiting, CORS, structured logging
  • ✓MCP server: 6 tools, SSE transport, 47 smoke tests passing
  • ✓Deployed to Fly.io: dig-api + dig-mcp + Fly Postgres + Upstash Redis
  • ✓Run 7: 32 queries, 0 errors, p50 117ms
  • ✓Gate D: GO (staging alpha)
  • ✓Docs: quickstart, ops runbook, alpha invite, Phase 4 prerequisites
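Phase 3 lists two-tier rate limiting, and the stress-test note on this page puts the keyed limit at roughly 300 requests per minute. The sketch below shows one way a tiered limiter could work, using a fixed one-minute window per client. Everything else is an assumption: the tier names, the anonymous limit of 60, and the fixed-window algorithm are illustrative and may not match what the API does.

```python
import time


class FixedWindowLimiter:
    """Per-client request counter that resets every 60 seconds."""

    def __init__(self, limits_per_minute, clock=time.monotonic):
        self.limits = limits_per_minute  # e.g. {"keyed": 300, "anonymous": 60}
        self.clock = clock
        self.windows = {}  # client_id -> (window_start, count)

    def allow(self, client_id, tier):
        """Return True if the request fits in the client's current window."""
        now = self.clock()
        start, count = self.windows.get(client_id, (now, 0))
        if now - start >= 60:
            start, count = now, 0  # window expired: start a fresh one
        if count >= self.limits[tier]:
            self.windows[client_id] = (start, count)
            return False  # the caller would answer with HTTP 429
        self.windows[client_id] = (start, count + 1)
        return True


# Hypothetical tiers; only the ~300/min keyed figure comes from this page.
fake_now = [0.0]
limiter = FixedWindowLimiter({"keyed": 300, "anonymous": 60}, clock=lambda: fake_now[0])
decisions = [limiter.allow("key-1", "keyed") for _ in range(301)]
print(decisions.count(True), decisions.count(False))
```

A fixed window is the simplest option to reason about; a sliding window or token bucket would smooth out bursts at window boundaries.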
Phase 4
Full data load + human search UI + Gate E
done
  • ✓Full releases dataset migration (~555M rows, 12 tables)
  • ✓Run 8: 0/96 errors, p50 108ms, 7/7 warm SLOs pass
  • ✓Next.js frontend: search + entity pages, CSS Modules, server-side API
  • ✓Deployed to Fly.io (always-on), migrated from Vercel
  • ✓Master-first search IA, entity pages, URL restructure
  • ✓Cover Art Archive: 1.77M crosswalks, cover proxy + Redis cache
  • ✓Gate E: GO for soft alpha (5-10 testers)
Phase 5 — Week 1
Alpha hardening, UX depth, instrumentation
in progress
  • ✓Day 1 — SLO Baseline: Froze alpha SLO table, load tested c100
  • ✓Day 2 — Filtered Query Hardening: Zero 5xx under c100
  • ✓Day 3 — Track-Level Credits UX: Per-track credits grouped by role
  • ✓Day 4 — Search IA Upgrade: Exact/prefix boost, FK dedup, per-type cap
  • ✓Day 5 — Product Instrumentation: 5 event types, structured JSON to Fly logs
  • ✓Day 6 — Alpha Ops Pack: Events rate limiting, issue templates, runbook
  • ✓Day 7 — UX Polish: Version format/country tags, collapsible aliases
  • ◐Soft Alpha: Invites ready, 5 keys issued, monitoring pending
  • ·User auth + collections remain post-alpha scope
Data layer: 18.9M releases + 2.5M masters + 2.3M labels + 289K artists (full re-ingest in progress). Discogs CC0 February 2026 dump on Fly.io. Disk: 158GB / 300GB.
Enrichment: 1.77M release crosswalks + 1.21M artist crosswalks + 423K relationship edges + 53K label linkouts + 1.8K setlist events.
Search: Postgres FTS with exact/prefix name boosting, pg_trgm fuzzy, FK-based dedup, per-type result caps. Run 9: 7/7 warm SLOs pass.
Live: app.dig.baby (search UI) + dig-api.fly.dev (REST). MCP server archived (source remains in repo). Cover art via CAA (1.77M releases). Enrichment API live.
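To make the search summary above concrete, here is an in-memory sketch of exact/prefix name boosting, dedup, and per-type caps applied to an already-scored hit list. It is an illustration, not Dig's query code: the boost weights, the hit fields, and the reading of "FK-based dedup" as "keep one release per master_id" are all assumptions.

```python
from collections import defaultdict

EXACT_BOOST = 2.0   # hypothetical weight for an exact name match
PREFIX_BOOST = 1.0  # hypothetical weight for a name that starts with the query


def rank_results(query, hits, per_type_cap=3):
    """Order hits by FTS rank plus name boost, then dedup and cap per entity type."""
    q = query.casefold()

    def score(hit):
        name = hit["name"].casefold()
        if name == q:
            boost = EXACT_BOOST
        elif name.startswith(q):
            boost = PREFIX_BOOST
        else:
            boost = 0.0
        return hit["fts_rank"] + boost

    seen_masters = set()
    per_type = defaultdict(int)
    ranked = []
    for hit in sorted(hits, key=score, reverse=True):
        master_id = hit.get("master_id")
        if master_id is not None:
            if master_id in seen_masters:
                continue  # a better-scoring pressing of this master is already listed
            seen_masters.add(master_id)
        if per_type[hit["type"]] >= per_type_cap:
            continue  # this entity type has used up its slots
        per_type[hit["type"]] += 1
        ranked.append(hit)
    return ranked


hits = [
    {"type": "release", "name": "Strings of Life (Remix)", "fts_rank": 0.9, "master_id": 10},
    {"type": "release", "name": "Strings of Life", "fts_rank": 0.5, "master_id": 10},
    {"type": "artist", "name": "Strings", "fts_rank": 0.4, "master_id": None},
    {"type": "release", "name": "Life Strings", "fts_rank": 0.8, "master_id": 11},
]
print([h["name"] for h in rank_results("strings of life", hits)])
```

In this example the exact-name match outranks a higher raw FTS score, and the second pressing of the same master is dropped.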

Concurrent Stress Test

2026-03-05 — Live API (Fly.io, shared-cpu-1x), mixed query workload
TestConcurrencyRequestsp50p95p99SuccessThroughput
Warm-up run50200255ms3,145ms3,415ms200/200 (100%)20.7 req/s
Full pressure100300426ms1,986ms2,091ms295/300 (98.3%)56.4 req/s
Mixed workload: search (FTS, fuzzy, filtered, cross-entity), entity retrieval, and traversal queries in random rotation. Zero application errors at both pressure levels. Rate limiting (429) is expected at c100 — triggered correctly after ~300 keyed req/min. Fuzzy search (pg_trgm) dominates the tail latency; retrieval and traversal stay under 250ms p50 at full concurrency.
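The table's success and throughput columns can be tallied with a short concurrent runner like the one below. It is a sketch, not the tool used for these runs: `stress_run` and `send` are hypothetical names, and throughput is assumed to be total requests divided by wall-clock time, which the page does not specify.

```python
import time
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def stress_run(send, total_requests, concurrency):
    """Fire total_requests calls to send(i) from a pool of `concurrency` threads.

    send(i) must return an HTTP status code. 429s count against the success
    rate but are tallied separately from 5xx server errors.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        statuses = list(pool.map(send, range(total_requests)))
    elapsed = time.perf_counter() - start
    counts = Counter(statuses)
    ok = sum(n for code, n in counts.items() if 200 <= code < 300)
    return {
        "counts": counts,
        "success_rate": ok / total_requests,
        "server_errors": sum(n for code, n in counts.items() if code >= 500),
        "throughput_rps": total_requests / elapsed,
    }


# Fake endpoint mimicking the c100 row: 295 successes, then rate-limited.
result = stress_run(lambda i: 200 if i < 295 else 429, total_requests=300, concurrency=100)
print(f"{result['success_rate']:.1%}", result["counts"][429], result["server_errors"])
```

Keeping 429s out of the server-error count matches the distinction drawn above between expected rate limiting and application errors.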