Synthetic-Data End-to-End Sweep

Audit date: 2026-06-08
Scope: All synthetic / seed / fixture data feeding the v6.2-flexible conversation and matching pipeline
Epic: #1373
Driver: #141 (full synthetic-data + e2e review)
Reviewers: Engineering, plus an external model (Gemini) for an adversarial read; clinical items deferred to Dr. Naidu
Disposition: Findings triaged P0/P1/P2; clinical items deferred (not self-authored)

Purpose

Before flipping the v6.2-flexible canary, sweep every piece of fabricated data the pipeline consumes — persona seeds, provider catalogs, SOP schemas, replay baselines, grader fixtures, graph/vector seeds, and prompt few-shot content — for three failure classes:

Clinical inaccuracy — wrong codes, implausible values, unsafe defaults that a clinician must correct.
Data leakage — hardcoded numbers, real-world references, or PROD-derived content embedded where it shouldn't be.
Engineering drift — fixtures that no longer match the schemas/contracts they're supposed to exercise.

The sweep was the last broad QA pass gating the canary, and it produced the #1376 clinical sign-off checklist.
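As an illustration of the engineering-drift class, the check below compares one fixture record against the fields its schema expects. It is a minimal sketch: the function name `find_drift` and the field names in the usage note are hypothetical, and the real schemas and contracts are not described in this page.

```python
def find_drift(fixture: dict, required: set[str], allowed: set[str]) -> dict[str, list[str]]:
    """Report fields a fixture lacks and fields the schema no longer knows.

    `required` and `allowed` stand in for whatever the real schema declares;
    how those sets are derived is an assumption of this sketch.
    """
    keys = set(fixture)
    return {
        "missing": sorted(required - keys),      # schema moved on, fixture did not
        "unexpected": sorted(keys - allowed),    # fixture carries stale fields
    }
```

For example, a fixture with keys `persona_id`, `age`, and `legacy_flag`, checked against a schema that requires `persona_id` and `condition_code` and allows `age`, would report `condition_code` as missing and `legacy_flag` as unexpected.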

Methodology

The sweep ran as a manifest-first inventory followed by eight themed review bundles, each reviewed independently and then collated into P0/P1/P2 findings:

| Bundle | Surface reviewed |
| --- | --- |
| Step 0: Inventory | Manifest of all synthetic data across the repo (seed files, fixtures, baselines) |
| Bundle 1 | Persona + provider + condition + procedure seed data |
| Bundle 2 | SOP interpretation schemas (`config/prompts/sops/*.yaml`) |
| Bundle 3 | Replay baseline JSONL |
| Bundle 4 | Grader fixtures + `expected_scores` |
| Bundle 5 | Test fixtures / factories |
| Bundle 6 | Neo4j graph seed data |
| Bundle 7 | Qdrant source / embedding seed |
| Bundle 8 (final) | Prompt few-shot examples + knowledge files |
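The Step 0 inventory can be pictured as a small script that walks the repo and records each synthetic-data file with the bundle that reviews it. This is a sketch under stated assumptions, not the tooling the sweep used: only the SOP glob comes from the table above, while the fixture glob, the `ManifestEntry` type, and `build_manifest` are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path

# Glob pattern -> review bundle. Only the SOP path is taken from the table;
# the fixture path is an assumed placeholder.
BUNDLE_GLOBS = {
    "config/prompts/sops/*.yaml": "Bundle 2",
    "tests/fixtures/**/*.json": "Bundle 5",  # assumption, not from the source
}


@dataclass(frozen=True)
class ManifestEntry:
    path: str      # path relative to the repo root, POSIX style
    bundle: str    # bundle responsible for reviewing the file
    sha256: str    # content hash, so later edits to the file are detectable


def build_manifest(repo_root: Path, bundle_globs: dict[str, str]) -> list[ManifestEntry]:
    """List every file matched by a bundle glob, with its content hash."""
    entries = []
    for pattern, bundle in bundle_globs.items():
        for file in sorted(repo_root.glob(pattern)):
            if file.is_file():
                digest = hashlib.sha256(file.read_bytes()).hexdigest()
                rel = file.relative_to(repo_root).as_posix()
                entries.append(ManifestEntry(rel, bundle, digest))
    return entries
```

Recording a hash per file is a design choice of the sketch: it lets a reviewer confirm that the file they signed off on is the one that later ships.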

Each bundle's findings were tagged by disposition so the right owner acts on each:

| Tag | Meaning | Owner |
| --- | --- | --- |
| `[NAIDU]` | Clinical call: threshold, code, or safety rule | Dr. Naidu (deferred authoring) |
| `[FIX]` | Engineering defect: fixture/schema drift, leakage | Engineering |
| `[DECISION]` | Product/data-owner call | SD |
| `[ENG]` | Non-clinical engineering follow-up | Engineering |
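The tagging scheme amounts to a lookup from disposition tag to owner. The sketch below shows one way to route collated findings into per-owner queues; the tag and owner strings come from the table, while the `Finding` type and `route_findings` function are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

# Tag -> owner, as listed in the disposition table.
TAG_OWNERS = {
    "NAIDU": "Dr. Naidu",
    "FIX": "Engineering",
    "DECISION": "SD",
    "ENG": "Engineering",
}


@dataclass(frozen=True)
class Finding:
    bundle: int     # bundle number the finding came from
    priority: str   # "P0", "P1", or "P2"
    tag: str        # disposition tag without brackets
    summary: str


def route_findings(findings):
    """Group findings by owner, highest priority first; reject unknown tags."""
    queues = defaultdict(list)
    for finding in findings:
        if finding.tag not in TAG_OWNERS:
            raise ValueError(f"unknown disposition tag: {finding.tag!r}")
        queues[TAG_OWNERS[finding.tag]].append(finding)
    for owner in queues:
        # "P0" < "P1" < "P2" sorts correctly as plain strings.
        queues[owner].sort(key=lambda f: f.priority)
    return dict(queues)
```

Rejecting unknown tags, instead of defaulting them to an owner, keeps an untagged clinical item from silently landing in an engineering queue.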

Outcome

Clinical items → #1376

Every `[NAIDU]`-tagged finding from bundles 1–4 and bundle 6 was consolidated into a single clinical sign-off tracking issue, #1376, organized into six themed sections: ICD/CPT coding, screening panels, validity windows/thresholds, safety-gating rules, agent clinical-voice boundaries, and plausibility/device sign-offs. These items are deferred: engineering does not author the clinical numbers, so they wait on Dr. Naidu. See Clinical Sign-off Governance.

The clinician-queue issues feeding #1376: #1363, #1364, #1365, #1366, #1367, #1368, #1370.

Price-policy finding → governed ranges (resolved)

The sweep surfaced a price-policy contradiction: cost numbers were hardcoded in prompts, few-shot examples, and code, and the copies had drifted apart. The price-policy decision resolved it in favour of governed indicative ranges with a single source of truth and a runtime guard:

Runtime price guard shipped: `price_guard_enabled` (#1374).
Duplicate hardcoded cost numbers single-sourced into `financial_options.yaml` (#1375).

This gate is cleared. See Price Governance.
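To make the idea of a runtime guard over governed ranges concrete, the sketch below flags any quoted price that falls outside the range for an item. Everything specific here is an assumption: the schema of `financial_options.yaml` is not shown in this page, so a plain dict stands in for its parsed contents, the numbers are invented, the `$` symbol is a guess at the currency format, and `price_guard` is a hypothetical name, not the shipped implementation behind `price_guard_enabled`.

```python
import re

# Stand-in for the parsed financial_options.yaml. Keys and numbers are
# invented for illustration only.
GOVERNED_RANGES = {"example_procedure": (1000, 1500)}

# Matches amounts such as "$900" or "$1,200.50"; the "$" is an assumption.
PRICE_PATTERN = re.compile(r"\$\s?(\d[\d,]*(?:\.\d+)?)")


def extract_prices(text):
    """Return every currency amount found in the text as a float."""
    return [float(m.replace(",", "")) for m in PRICE_PATTERN.findall(text)]


def price_guard(text, item, ranges=GOVERNED_RANGES, enabled=True):
    """Return quoted prices outside the governed range for `item`.

    An empty list means the text passes. An unknown `item` raises KeyError,
    so an ungoverned item cannot pass by default.
    """
    if not enabled:
        return []
    low, high = ranges[item]
    return [p for p in extract_prices(text) if not (low <= p <= high)]
```

With the invented range above, a reply quoting "$1,200 to $1,400" passes, while one quoting "$900" is flagged.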

Non-clinical engineering items

`[FIX]`, `[DECISION]`, and `[ENG]` findings from each bundle are tracked in their own per-bundle issues (not in #1376) and resolved on the normal PR path.

Data-handling note

Anonymized production patient transcripts (frt_001–004) were used only internally during the sweep, to sanity-check realism. They were never sent to any external model, including the Gemini adversarial reviewer, regardless of any "synthetic" labelling. This boundary is non-negotiable: PROD-derived conversational content stays in-house.
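A boundary like this is easier to hold when it is enforced in code before any external call. The sketch below blocks a review batch if it contains a PROD-derived item, and it deliberately ignores the item's label. It assumes PROD-derived transcripts can be recognised by an `frt_` id prefix, as the ids above suggest; a real check would more likely rely on provenance metadata. `ReviewItem`, `ExternalSendBlocked`, and `assert_safe_for_external` are hypothetical names.

```python
from dataclasses import dataclass

# Assumption: PROD-derived transcripts share this id prefix.
PROD_DERIVED_PREFIXES = ("frt_",)


@dataclass(frozen=True)
class ReviewItem:
    item_id: str
    label: str      # e.g. "synthetic"; never consulted by the guard
    content: str


class ExternalSendBlocked(Exception):
    """Raised when a batch bound for an external model holds PROD-derived data."""


def assert_safe_for_external(items):
    """Return the batch unchanged, or raise if any item is PROD-derived."""
    blocked = [i.item_id for i in items if i.item_id.startswith(PROD_DERIVED_PREFIXES)]
    if blocked:
        raise ExternalSendBlocked(
            "PROD-derived items must stay in-house: " + ", ".join(blocked)
        )
    return list(items)
```

The guard fails the whole batch instead of filtering it, so a mislabelled transcript is surfaced to the sender, not dropped quietly.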