Synthetic-Data End-to-End Sweep
- Audit date: 2026-06-08
- Scope: All synthetic / seed / fixture data feeding the v6.2-flexible conversation and matching pipeline
- Epic: #1373
- Driver: #141 (full synthetic-data + e2e review)
- Reviewers: Engineering + external model (Gemini) for adversarial read; clinical items deferred to Dr. Naidu
- Disposition: Findings triaged P0/P1/P2; clinical items deferred (not self-authored)
Purpose
Before flipping the v6.2-flexible canary, sweep every piece of fabricated data the pipeline consumes — persona seeds, provider catalogs, SOP schemas, replay baselines, grader fixtures, graph/vector seeds, and prompt few-shot content — for three failure classes:
- Clinical inaccuracy — wrong codes, implausible values, unsafe defaults that a clinician must correct.
- Data leakage — hardcoded numbers, real-world references, or PROD-derived content embedded where it shouldn't be.
- Engineering drift — fixtures that no longer match the schemas/contracts they're supposed to exercise.
The sweep was the last broad QA pass gating the canary, and it is where the #1376 clinical sign-off checklist came from.
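As a rough illustration of the engineering-drift class listed above, a fixture can be compared against the field names its contract expects. This is a minimal sketch, not the check the sweep actually ran: `find_drift`, the `required` and `allowed` sets, and the sample record are all hypothetical.

```python
def find_drift(fixture, required, allowed):
    """Return (missing, unexpected) field names for one fixture record.

    `required` and `allowed` are hypothetical stand-ins for whatever the
    real schema or contract defines; they are not taken from the pipeline.
    """
    keys = set(fixture)
    missing = sorted(set(required) - keys)       # contract fields the fixture lacks
    unexpected = sorted(keys - set(allowed))     # fixture fields the contract no longer knows
    return missing, unexpected


# Invented example record: one stale field, one missing field.
missing, unexpected = find_drift(
    {"id": 1, "old_field": 2},
    required={"id", "name"},
    allowed={"id", "name"},
)
```

A non-empty result in either list would be a candidate [FIX] finding rather than a clinical one.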
Methodology
The sweep ran as a manifest-first inventory followed by eight themed review bundles, each reviewed independently and then collated into P0/P1/P2 findings:
| Bundle | Surface reviewed |
|---|---|
| Step 0 — Inventory | Manifest of all synthetic data across the repo (seed files, fixtures, baselines) |
| Bundle 1 | Persona + provider + condition + procedure seed data |
| Bundle 2 | SOP interpretation schemas (config/prompts/sops/*.yaml) |
| Bundle 3 | Replay baseline JSONL |
| Bundle 4 | Grader fixtures + expected_scores |
| Bundle 5 | Test fixtures / factories |
| Bundle 6 | Neo4j graph seed data |
| Bundle 7 | Qdrant source / embedding seed |
| Bundle 8 (final) | Prompt few-shot examples + knowledge files |
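The Step 0 inventory could be produced by a small script along the following lines. This is a sketch under assumptions: the directory names in `SYNTHETIC_DIR_NAMES` and the entry fields are hypothetical, since the repo layout and the manifest format are not described here.

```python
import hashlib
from pathlib import Path

# Hypothetical directory names that mark synthetic data; adjust to the real repo.
SYNTHETIC_DIR_NAMES = {"seeds", "fixtures", "baselines"}


def build_manifest(repo_root, dir_names=SYNTHETIC_DIR_NAMES):
    """List every file that sits under a directory named in `dir_names`."""
    root = Path(repo_root)
    entries = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root)
        # Keep the file only if one of its parent directories is a synthetic-data dir.
        if not dir_names.intersection(rel.parts[:-1]):
            continue
        data = path.read_bytes()
        entries.append({
            "path": rel.as_posix(),
            "bytes": len(data),
            "sha256": hashlib.sha256(data).hexdigest(),  # lets later passes detect changes
        })
    return entries
```

Recording a content hash per file is one way to let each review bundle confirm it is looking at the same data the inventory saw.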
Each bundle's findings were tagged by disposition so the right owner acts on each:
| Tag | Meaning | Owner |
|---|---|---|
| [NAIDU] | Clinical call: threshold, code, or safety rule | Dr. Naidu (deferred authoring) |
| [FIX] | Engineering defect: fixture/schema drift, leakage | Engineering |
| [DECISION] | Product/data-owner call | SD |
| [ENG] | Non-clinical engineering follow-up | Engineering |
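The tag-to-owner routing can be pictured as a simple lookup. In this sketch the `OWNERS` mapping follows the disposition table, while the finding record shape (`id`, `tag`, `priority`) and the sample findings are hypothetical.

```python
from collections import defaultdict

# Tag-to-owner mapping as given in the disposition table.
OWNERS = {
    "NAIDU": "Dr. Naidu",
    "FIX": "Engineering",
    "DECISION": "SD",
    "ENG": "Engineering",
}


def route_findings(findings):
    """Group findings by owner, ordering each group P0 first."""
    by_owner = defaultdict(list)
    for finding in findings:
        tag = finding["tag"]
        if tag not in OWNERS:
            # Fail loudly so an untagged or mistyped finding is never silently dropped.
            raise ValueError(f"unknown disposition tag: {tag!r}")
        by_owner[OWNERS[tag]].append(finding)
    for group in by_owner.values():
        group.sort(key=lambda f: f["priority"])  # "P0" < "P1" < "P2" as strings
    return dict(by_owner)


# Invented findings, for illustration only.
routed = route_findings([
    {"id": "a", "tag": "FIX", "priority": "P1"},
    {"id": "b", "tag": "NAIDU", "priority": "P0"},
    {"id": "c", "tag": "ENG", "priority": "P0"},
])
```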
Outcome
Clinical items → #1376
Every [NAIDU]-tagged finding across bundles 1–4 and bundle 6 was consolidated into the single clinical sign-off tracking issue #1376, organized into six themed sections (ICD/CPT coding, screening panels, validity windows/thresholds, safety-gating rules, agent clinical-voice boundaries, plausibility/device sign-offs). These are deferred — engineering does not author the clinical numbers; they wait on Dr. Naidu. See Clinical Sign-off Governance.
The clinician-queue issues feeding #1376: #1363, #1364, #1365, #1366, #1367, #1368, #1370.
Price-policy finding → governed ranges (resolved)
The sweep surfaced a price-policy contradiction: cost numbers were hardcoded and drifting across prompts, few-shot examples, and code. The price-policy decision resolved it in favour of governed indicative ranges with a single source of truth and a runtime guard:
- Runtime price-guard shipped: `price_guard_enabled` (#1374).
- Duplicate hardcoded cost numbers single-sourced into `financial_options.yaml` (#1375).
This gate is cleared. See Price Governance.
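To make the idea of a runtime guard over governed ranges concrete, here is a minimal sketch. It is not the shipped implementation: only the names `price_guard_enabled` and `financial_options.yaml` come from this report. The `check_prices` function, the dollar-amount pattern, the `GOVERNED_RANGES` structure, and the placeholder range are assumptions, and treating `price_guard_enabled` as a boolean switch is likewise assumed.

```python
import re

# Placeholder data. In the real system the ranges would come from
# financial_options.yaml, whose schema is not shown in this report.
GOVERNED_RANGES = {"example_procedure": (1000, 2000)}

# Assumes amounts are written like "$1,500"; the actual currency format is unknown.
AMOUNT_RE = re.compile(r"\$\s?(\d[\d,]*)")


def check_prices(text, item, ranges=GOVERNED_RANGES, price_guard_enabled=True):
    """Return the amounts in `text` that fall outside the governed range for `item`."""
    if not price_guard_enabled:
        return []
    low, high = ranges[item]
    amounts = [int(raw.replace(",", "")) for raw in AMOUNT_RE.findall(text)]
    return [amount for amount in amounts if not low <= amount <= high]


violations = check_prices("Typically $1,500 to $2,500.", "example_procedure")
```

Keeping the ranges in one structure means a prompt, a few-shot example, and the code path can all be checked against the same numbers instead of carrying their own copies.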
Non-clinical engineering items
[FIX] / [DECISION] / [ENG] findings from each bundle are tracked in their own issues (not in #1376), scoped per bundle, and resolved on the normal PR path.
Data-handling note
Anonymized-production patient transcripts (frt_001–004) were used internally only during the sweep to sanity-check realism. They were never sent to any external model (including the Gemini adversarial reviewer), regardless of any "synthetic" labelling. This boundary is non-negotiable: PROD-derived conversational content stays in-house.
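One way to enforce this boundary in tooling is a fail-closed check before anything is sent to an external reviewer. The sketch below is illustrative only: `ReviewItem`, `ProdDataEgressError`, and `select_for_external_review` are hypothetical names, and it assumes provenance is recorded as an explicit flag rather than inferred from a label in the content.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewItem:
    item_id: str
    content: str
    prod_derived: bool  # provenance flag; independent of any "synthetic" label in the content


class ProdDataEgressError(RuntimeError):
    """Raised when a batch bound for an external model contains PROD-derived content."""


def select_for_external_review(items):
    """Return the batch unchanged if it is safe to send; otherwise refuse the whole batch."""
    blocked = [item.item_id for item in items if item.prod_derived]
    if blocked:
        raise ProdDataEgressError(f"PROD-derived items must stay in-house: {blocked}")
    return list(items)
```

Refusing the whole batch, rather than silently filtering, makes an attempted leak visible to whoever assembled it.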