Synthetic-Data End-to-End Sweep

Audit date: 2026-06-08
Scope: All synthetic / seed / fixture data feeding the v6.2-flexible conversation and matching pipeline
Epic: #1373
Driver: #141 (full synthetic-data + e2e review)
Reviewers: Engineering + external model (Gemini) for adversarial read; clinical items deferred to Dr. Naidu
Disposition: Findings triaged P0/P1/P2; clinical items deferred (not self-authored)


Purpose

Before flipping the v6.2-flexible canary, sweep every piece of fabricated data the pipeline consumes — persona seeds, provider catalogs, SOP schemas, replay baselines, grader fixtures, graph/vector seeds, and prompt few-shot content — for three failure classes:

  1. Clinical inaccuracy — wrong codes, implausible values, unsafe defaults that a clinician must correct.
  2. Data leakage — hardcoded numbers, real-world references, or PROD-derived content embedded where it shouldn't be.
  3. Engineering drift — fixtures that no longer match the schemas/contracts they're supposed to exercise.
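As an illustration of the third class, engineering drift can be detected mechanically by comparing a fixture's keys against the fields its contract declares. The sketch below is a minimal example under stated assumptions: `check_fixture_drift`, `DriftReport`, and the field names in the usage are hypothetical and not taken from the pipeline's actual schemas.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DriftReport:
    missing: frozenset[str]     # required by the contract, absent from the fixture
    unexpected: frozenset[str]  # present in the fixture, unknown to the contract

    @property
    def drifted(self) -> bool:
        return bool(self.missing or self.unexpected)


def check_fixture_drift(
    fixture: dict,
    required: set[str],
    optional: set[str] | None = None,
) -> DriftReport:
    """Compare a fixture's top-level keys with a contract's declared fields."""
    optional = optional or set()
    keys = set(fixture)
    return DriftReport(
        missing=frozenset(required - keys),
        unexpected=frozenset(keys - required - optional),
    )


# Hypothetical persona fixture checked against a hypothetical contract.
report = check_fixture_drift(
    {"persona_id": "p1", "age": 40, "legacy_flag": 1},
    required={"persona_id", "age", "condition"},
)
print(report.drifted, sorted(report.missing), sorted(report.unexpected))
```

This only covers top-level key drift; type or value-range drift would need a real schema validator.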

The sweep was the last broad QA pass gating the canary, and it is where the #1376 clinical sign-off checklist came from.

Methodology

The sweep ran as a manifest-first inventory followed by eight themed review bundles, each reviewed independently and then collated into P0/P1/P2 findings:

| Bundle | Surface reviewed |
| --- | --- |
| Step 0: Inventory | Manifest of all synthetic data across the repo (seed files, fixtures, baselines) |
| Bundle 1 | Persona + provider + condition + procedure seed data |
| Bundle 2 | SOP interpretation schemas (config/prompts/sops/*.yaml) |
| Bundle 3 | Replay baseline JSONL |
| Bundle 4 | Grader fixtures + expected_scores |
| Bundle 5 | Test fixtures / factories |
| Bundle 6 | Neo4j graph seed data |
| Bundle 7 | Qdrant source / embedding seed |
| Bundle 8 (final) | Prompt few-shot examples + knowledge files |
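The manifest-first step (Step 0) can be sketched as a repository walk that buckets every file by the surface it belongs to, so that nothing reaches a review bundle unclassified. This is a rough sketch, not the tooling used in the sweep: only the `config/prompts/sops/*.yaml` path comes from this page, while the other patterns, the surface names, and the function names are assumptions for illustration.

```python
from __future__ import annotations

from fnmatch import fnmatchcase
from pathlib import Path

# (glob pattern, surface) pairs; first match wins.
# Only the SOP pattern is taken from this page; the rest are hypothetical.
SURFACE_PATTERNS = [
    ("config/prompts/sops/*.yaml", "sop_schema"),
    ("tests/fixtures/*", "test_fixture"),   # assumed location
    ("*.jsonl", "replay_baseline"),         # assumed convention
]


def classify(rel_path: str) -> str:
    """Map a repo-relative POSIX path to a review surface."""
    for pattern, surface in SURFACE_PATTERNS:
        if fnmatchcase(rel_path, pattern):
            return surface
    return "unclassified"


def build_manifest(root: Path) -> dict[str, list[str]]:
    """Walk `root` and group every file path by its classified surface."""
    manifest: dict[str, list[str]] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            manifest.setdefault(classify(rel), []).append(rel)
    return manifest
```

Keeping an explicit `unclassified` bucket is a deliberate choice in this sketch: files that match no pattern stay visible in the manifest instead of silently dropping out of review.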

Each bundle's findings were tagged by disposition so the right owner acts on each:

| Tag | Meaning | Owner |
| --- | --- | --- |
| [NAIDU] | Clinical call: threshold, code, or safety rule | Dr. Naidu (deferred authoring) |
| [FIX] | Engineering defect: fixture/schema drift, leakage | Engineering |
| [DECISION] | Product/data-owner call | SD |
| [ENG] | Non-clinical engineering follow-up | Engineering |
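The tag-to-owner routing above amounts to a small lookup followed by a priority sort. A minimal sketch, assuming a hypothetical `Finding` record and `route` helper (neither is part of the sweep tooling); the tag names, owners, and P0/P1/P2 priorities are the ones listed on this page.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass

# Tag -> owner, as listed in the disposition table.
OWNERS = {
    "NAIDU": "Dr. Naidu",
    "FIX": "Engineering",
    "DECISION": "SD",
    "ENG": "Engineering",
}


@dataclass(frozen=True)
class Finding:  # hypothetical record shape
    bundle: int
    tag: str
    priority: str  # "P0", "P1", or "P2"
    summary: str


def route(findings: list[Finding]) -> dict[str, list[Finding]]:
    """Group findings by owner, most urgent first within each queue."""
    queues: dict[str, list[Finding]] = defaultdict(list)
    for finding in findings:
        if finding.tag not in OWNERS:
            raise ValueError(f"unknown disposition tag: {finding.tag!r}")
        queues[OWNERS[finding.tag]].append(finding)
    for queue in queues.values():
        # "P0" < "P1" < "P2" sorts correctly as plain strings.
        queue.sort(key=lambda f: (f.priority, f.bundle))
    return dict(queues)
```

Rejecting unknown tags, instead of defaulting them to an owner, keeps an untagged clinical item from landing in an engineering queue by accident.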

Outcome

Clinical items → #1376

Every [NAIDU]-tagged finding across bundles 1–4 and bundle 6 was consolidated into the single clinical sign-off tracking issue #1376, organized into six themed sections (ICD/CPT coding, screening panels, validity windows/thresholds, safety-gating rules, agent clinical-voice boundaries, plausibility/device sign-offs). These are deferred — engineering does not author the clinical numbers; they wait on Dr. Naidu. See Clinical Sign-off Governance.

The clinician-queue issues feeding #1376: #1363, #1364, #1365, #1366, #1367, #1368, #1370.

Price-policy finding → governed ranges (resolved)

The sweep surfaced a price-policy contradiction: cost numbers were hardcoded and drifting across prompts, few-shot examples, and code. The price-policy decision resolved it in favour of governed indicative ranges with a single source of truth and a runtime guard:

  • Runtime price-guard shipped — price_guard_enabled (#1374).
  • Duplicate hardcoded cost numbers single-sourced into financial_options.yaml (#1375).

This gate is cleared. See Price Governance.
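To illustrate the idea of a runtime guard over governed ranges, the sketch below scans an outgoing reply for quoted amounts and reports any that fall outside the range for the item being discussed. It is not the shipped implementation from #1374: the range values, the item key, the dollar-sign pattern, and the function names are all assumptions, and in practice the ranges would presumably be loaded from `financial_options.yaml` instead of being declared inline.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class PriceRange:
    low: float
    high: float


# Hypothetical governed ranges; stand-in for values single-sourced in
# financial_options.yaml. The numbers here are invented for illustration.
GOVERNED_RANGES = {"consultation": PriceRange(50.0, 150.0)}

# Assumes amounts are written with a dollar sign, e.g. "$120" or "$1,200.50".
_AMOUNT = re.compile(r"\$\s?(\d+(?:,\d{3})*(?:\.\d+)?)")


def find_amounts(text: str) -> list[float]:
    """Extract every dollar amount quoted in `text`."""
    return [float(m.replace(",", "")) for m in _AMOUNT.findall(text)]


def guard_reply(text: str, item: str, *, price_guard_enabled: bool = True) -> list[float]:
    """Return the quoted amounts that violate the governed range for `item`."""
    if not price_guard_enabled:
        return []
    amounts = find_amounts(text)
    governed = GOVERNED_RANGES.get(item)
    if governed is None:
        # No governed range for this item: treat every quoted amount as a violation.
        return amounts
    return [a for a in amounts if not (governed.low <= a <= governed.high)]


print(guard_reply("A visit is usually around $900.", "consultation"))
```

An empty list means the reply passes; what the caller does with a non-empty list (block, rewrite, or log) is left open here.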

Non-clinical engineering items

[FIX] / [DECISION] / [ENG] findings from each bundle are tracked in their own issues (not in #1376), scoped per bundle, and resolved on the normal PR path.

Data-handling note

Anonymized-production patient transcripts (frt_001004) were used during the sweep to sanity-check realism, and only internally. They were never sent to any external model (including the Gemini adversarial reviewer), regardless of any "synthetic" labelling. This boundary is non-negotiable: PROD-derived conversational content stays in-house.
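One way to make this boundary mechanical is to gate every external-review payload on recorded provenance and to ignore free-text labels entirely. The sketch below is hypothetical: `Provenance`, `Artifact`, and `assert_safe_for_external` are illustrative names, not existing pipeline code.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Provenance(Enum):
    SYNTHETIC = "synthetic"
    PROD_DERIVED = "prod_derived"


@dataclass(frozen=True)
class Artifact:  # hypothetical record shape
    name: str
    provenance: Provenance
    label: str = ""  # free-text label; never consulted by the gate


class ExternalSendBlocked(Exception):
    """Raised when a payload for an external model contains non-synthetic data."""


def assert_safe_for_external(artifacts: list[Artifact]) -> list[Artifact]:
    """Return the artifacts unchanged, or raise if any is not purely synthetic."""
    blocked = [a.name for a in artifacts if a.provenance is not Provenance.SYNTHETIC]
    if blocked:
        raise ExternalSendBlocked(f"PROD-derived content stays in-house: {blocked}")
    return list(artifacts)


# A PROD-derived transcript is blocked even when it carries a "synthetic" label.
transcript = Artifact("frt_001004", Provenance.PROD_DERIVED, label="synthetic")
try:
    assert_safe_for_external([transcript])
except ExternalSendBlocked as exc:
    print(exc)
```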