Skip to content

v6.2 Flexible — Phase 2 Prompt Assembly Spec

Status: DRAFT — proposes the v6.2 triage hot-path swap once Phase 1 (docs/specs/v6.2-flexible-phase1-sop-contract-spec.md) is fully landed (PR #1320 + #1321 + #1322 shipped).

Author: Phase 2 dispatch, 2026-06-03.

Audit source: The architecture audit path called out in the dispatch prompt (docs/audits/v6.1-pipeline-2026-06-02/v6_1_architecture_audit.md) does not exist in this checkout — the nearest live audit is docs/audits/v6-eq-2026-05-23/ (Phase 1c assembled prompt + Phase 3e grader audit). This spec cites the v6.1 code paths directly and treats the dispatch-prompt audit references as forward-looking; the actual file paths and line numbers below are verified against HEAD.


1. Goal restated

"Retain good bits of v4.1 and v6.1 without rigid structure. Focus on SOP as overarching guideline. Like in v4.1 patients can start the conversation either with intent or logistics, but the system needs to be flexible to achieve SOP-specified layers through conversation." — SD, /goal directive

Concretely, Phase 2 collapses the multi-segment v6.1 pipeline (stage machine + 5 cached segments + extractor-gated layer state) into a single-LLM-judgment turn where the model sees:

  1. The base voice + safety rules (preserved verbatim from config/prompts/base/conversation_v6.yaml).
  2. A short SOP contract checklist produced by SOPContract.checklist_for_prompt(consolidated_state) (app/services/sop_contract.py:601-694).
  3. A patient context summary derived from consolidated_state (the unified store backfilled in Phase 1 — app/services/consolidated_state_backfill.py:73-214).
  4. Document state (presence, status, findings) so the model never fabricates "extraction failed."
  5. Conversation history tail (last 30 turns — same window as v6.1, app/agents/conversation_prompt.py:514 / :745).

The model decides what to ask next. There is no stages.yaml traversal, no addendum resolver, no signal-gated extractor parallelism gating the prompt. Extractors continue to write under the dual-write shim (app/services/consolidated_state_writer.py:34-73) but become advisory observers — Phase 3 formalizes the observer pattern.


2. Architecture overview

2.1 New module: app/agents/triage_v6_2.py

# app/agents/triage_v6_2.py
"""v6.2-flexible triage turn — single-LLM-call, SOP-contract-guided."""

from __future__ import annotations

from dataclasses import dataclass
from typing import Any

from app.services.sop_contract import SOPContract


@dataclass(frozen=True)
class TurnResult:
    """Return shape for run_triage_turn_v6_2."""
    response_text: str
    llm_meta: dict[str, Any]
    consolidated_state_after: dict[str, Any]
    extractor_observations: dict[str, Any]   # advisory only (Phase 3 promotes)
    sop_contract_id: str
    sop_checklist_snapshot: str               # raw text rendered into prompt
    fallback_reason: str | None               # set on v6.1 fall-through


async def run_triage_turn_v6_2(
    *,
    case: "Case",
    latest_user_message: str,
    conversation_history: list[dict],
    tenant_id: str,
    clerk_user_id: str | None,
    db: "AsyncSession",
    langfuse_handler: object | None = None,
) -> TurnResult:
    """
    v6.2 single-LLM-call pipeline. v4.1-style end-to-end judgment with
    SOP contract guidance.

    Steps:
      1. Load SOPContract — 3-tier resolve from case.procedure_code OR
         case.procedure_name OR _generic fallback (SOPContract.load).
      2. Read consolidated_state from case (Phase 1 dual-write keeps it
         fresh per turn — see consolidated_state_writer.write_consolidated_state).
      3. Compose single prompt:
           - {base_voice_and_safety}            (verbatim from conversation_v6.yaml)
           - {sop_contract_checklist}           (SOPContract.checklist_for_prompt)
           - {patient_context_summary}          (3-5 lines derived from
                                                 consolidated_state.{demographics,
                                                 procedure, medical, financial})
           - {document_state_summary}           (typed list with status + findings)
           - CACHE BOUNDARY (Anthropic ephemeral)
           - {last_30_turns}
           - {latest_user_message}
      4. Single LLM call — claude-haiku via llm_gateway.invoke().
      5. Post-LLM:
           - voice_rules.yaml validator (existing — response_policy.py)
           - SOP scope check  (Phase 5 — stub: pass-through today)
      6. Run extractors as OBSERVERS — same parallel asyncio.gather as
         triage_agent.run_extractors (lines 1250-1390), but their output
         feeds extractor_observations dict only. The next call to
         consolidated_state_writer.write_consolidated_state (triggered
         downstream in the orchestrator phase) reconciles into the case.
      7. Return TurnResult.

    NEVER raises. Any internal failure → fallback_reason set, the caller
    in v6_dispatcher_v6_2 falls back to run_triage_turn (v6.1) for this
    turn. Phase 2 is additive — v6.1 stays intact behind the flag.
    """
    ...

The key design decision: this function is a flat orchestration that mirrors the v4.1 single-call shape (conversation_prompt.invoke_conversation_turn at app/agents/conversation_prompt.py:651-799) rather than the LangGraph StateGraph used in triage_agent.build_triage_graph (line 1448+). The state machine in triage_agent.py was justified by parallel extractor scheduling + PFS routing — but in v6.2 the extractors run as fire-and-forget observers and PFS gating is removed (the SOP contract's missing_for_matching / document_findings_complete replaces the readiness band semantics).

2.2 The new dispatcher: prompt_dispatcher.py (renamed from v6_dispatcher.py)

Today app/agents/v6_dispatcher.py:63-129 is a 2-way fork (v4 ↔ v6.1). Phase 2 adds a 3rd branch:

prompt_dispatcher.py
├── prompt_arch_v6_2_flexible  (new flag) → run_triage_turn_v6_2
├── prompt_arch == "v6"        → compose_v6 (v6.1 path, unchanged)
└── default                    → v4 / v4.1 (unchanged)

Rename rationale: "v6_dispatcher" was already a misnomer post-PR-2 (it's the universal prompt-architecture decision point), and v6.2 will be the third arch it serves. Defer the file rename to PR-2.4 to keep PR-2.1 / PR-2.2 / PR-2.3 minimal-diff.


3. The new prompt template

3.1 Single-segment composition

The template is split into a STABLE prefix (above the cache boundary) and a MUTABLE tail (below). Anthropic ephemeral prompt caching requires byte-identical prefixes across consecutive calls; anything that mutates per turn MUST sit below the boundary or the cache hit rate collapses (see §3.3 cost math). The SOP contract's static schema (procedure_codes, clinical_safety_rules, required_documents) sits ABOVE the boundary because it only changes when Naidu edits the SOP YAML; the DYNAMIC checklist projection (Captured / Still-needed lists) sits BELOW because it grows every turn during active intake.

<base_voice_and_safety>
ROLE: Curaway's clinical intake coordinator.
You are a warm, thorough, AI-powered care navigator (not a doctor).
Never diagnose, never prescribe, never project clinical outcomes.
Forbidden phrases: {forbidden_phrases_block}
First-turn role disclosure (one beat, woven into reply): see §FIRST-TURN.
Emotional acknowledgement: see §VOICE & EMOTIONAL INTELLIGENCE.
Grave-disclosure one-beat: see §GRAVE-DISCLOSURE.
JSON response envelope: {message, extracted_data, detected_comorbidities,
  phase_complete, suggested_next}.

[full base prompt body from conversation_v6.yaml lines 34-1028 — voice rules,
 hard bans, JSON envelope, FIRST-TURN PLATFORM ROLE, GRAVE-DISCLOSURE,
 DISEASE-SPECIFIC QUESTION PRIORITY (A3), one-question-per-turn discipline]
</base_voice_and_safety>

<sop_static_definition>
SOP id: {contract.sop_id}
Procedure codes covered: {contract.procedure_codes | join(", ")}
Required documents schema (types + when_mandatory):
  {for doc_type, rule in contract.required_documents.items():}
  - {doc_type}: {rule.when} ({rule.severity})
Clinical safety rules (active for this SOP):
  {for rule in contract.clinical_safety_rules:}
  - {rule.id}: {rule.description}
</sop_static_definition>

──────────────── CACHE BOUNDARY (ephemeral) ────────────────
<!-- Everything ABOVE this line is byte-stable across turns of a single case  -->
<!-- and across cases sharing the same SOP. Everything BELOW mutates per turn -->
<!-- and is intentionally outside the cached prefix. See §2.3 + §3.3.         -->

<sop_contract_checklist>
{SOPContract.checklist_for_prompt(consolidated_state)}
<!-- Captured: list grows each turn → MUST sit below the boundary -->
</sop_contract_checklist>

<patient_context>
Name: {demographics.name or "—"}
Age: {demographics.age or "—"}
Country: {demographics.country or "—"}
Procedure (current best read): {procedure.name or "—"} {procedure.code} {procedure.side or ""}
Known comorbidities: {medical.conditions or "(none recorded)"}
Funding signal: {financial.funding_source or "(unknown)"}
Budget: {financial.budget_display_text or "(not stated)"}
<!-- Multicurrency: budget_display_text renders native amount + ISO-4217 + USD
     equivalent (e.g. "5 lakh INR (~$6,000 USD)"). Helper formats per
     case.country / currency_service. Resolves C6/C7/C8 audit cluster
     (INR/lakh budget extraction loop) at the prompt level. Never render raw
     USD-only when patient stated INR/AED/etc. See §3.1.1 multicurrency. -->
</patient_context>

<documents>
<!-- This block reads from `consolidated_state.documents`, populated by
     `consolidated_state_writer.write_consolidated_state` from `case.documents`
     on every turn. See §6.1 "Documents wiring contract" for the hard wiring
     requirement. Phase 2 ensures the block exists with status; Phase 4 fills
     the `findings` sub-dict. Empty `findings` is acceptable in Phase 2. -->
{for doc_id, doc in consolidated_state.documents.items():}
- {doc.label or doc_id} (type: {doc.type}, status: {doc.status})
  {if doc.status == "queued":}             waiting to start — findings pending
  {if doc.status == "processing":}         ETA ~{doc.eta_seconds}s — findings pending
  {if doc.status == "complete":}           Findings: {doc.findings | join(", ")}
  {if doc.status == "failed_transient":}   (extraction failed, retrying — ignore for now)
  {if doc.status == "failed_permanent":}   (extraction failed after retries — ask the patient to describe verbally or re-upload)
  {if doc.status == "expired":}            (file expired before processing — ask the patient to re-upload)
{if not consolidated_state.documents:}
(no documents on file)
</documents>

<previous_turn_promises>  <!-- per review S4 — axis 6 promise-honoring -->
{render_previous_assistant_promises(conversation_history)}
<!-- regex-scans the last assistant message for yes/no offers, "shall I ...?",
     "would it help if ...?" patterns; surfaces them so the LLM honors its own
     prior offers. Empty stanza on turn 1. -->
</previous_turn_promises>

Conversation so far (most recent 30 turns):
{render_history_tail(conversation_history, limit=30)}

Patient just said:
{latest_user_message}

Respond per the voice rules above and the SOP contract status below them.
Pick the ONE most relevant next step given conversational momentum. Don't
enumerate. Don't re-ask anything already in <patient_context> or in the
checklist's "Captured:" — it has been answered. If <previous_turn_promises>
is non-empty, honor the offer you made before pivoting.

3.2 Caching strategy

Boundary choice. The prompt is split into ONE cached segment (base voice + safety + SOP static definition, ~4,050 tokens) followed by a mutable tail (checklist projection + patient context + documents + previous-turn promises + history + user message, ~400-2,000 tokens depending on turn depth). The single ephemeral cache_control marker sits at the boundary.

Why the static SOP block goes ABOVE the boundary. The SOP contract has two parts: (a) the STATIC definition (sop_id, procedure_codes, required_documents schema, clinical_safety_rules) which only changes when Naidu edits the YAML, and (b) the DYNAMIC PROJECTION (Captured: / Still-needed: lists rendered against the case's current consolidated_state) which grows every turn. Putting (a) above the boundary preserves cache hits across all turns of a case AND across different cases that resolve to the same SOP. Putting (b) below the boundary keeps the projection fresh without invalidating (a).

Cost math.

Variant Prefix tokens Cache hit rate (30-turn TKR case) Input cost / turn (Haiku, $0.80/M base, $0.08/M cache-hit)
v6.1 today (5 cached segments) ~6,400 ~80% ~$0.0001
v6.2 with checklist ABOVE boundary (rejected design) ~4,200 ~0% (Captured grows each turn) ~$0.0034 — 34× regression
v6.2 with checklist BELOW boundary (current design) ~4,050 ~85% (stable until SOP YAML edits) ~$0.0001 — parity with v6.1

At 100 cases × 30 turns/case the rejected design would have been a $10/mo straight burn — small absolute, but a 34× burn on the most-frequent codepath. The current design recovers v6.1 cache economics while shipping the flexibility win.

Cache invalidation triggers. The cached prefix is invalidated when (a) Naidu/SD edits the SOP YAML (handled by SOPContract.invalidate_cache() — see §11), or (b) conversation_v6.yaml is re-deployed. Both are operator-initiated and infrequent (single-digits per week).

Sub-segmenting (deferred). Splitting the checklist into "Still needed" (slow-mutating) and "Captured" (fast-mutating) sub-blocks could reclaim a small amount of cache by hoisting "Still needed" above the boundary — but only when extraction lands. The added composer complexity isn't worth the ~5% cache win at current volumes. Revisit when token-cost telemetry from PR-2.5 lands.

3.3 What's gone from v6.1

Compared with the v6.1 compose_v6 output (app/services/prompt_loader_v6/composer.py:65-412):

v6.1 segment What it did v6.2 disposition
base_chunk (segment 0) Voice + emotional rules + JSON envelope PRESERVED verbatim — feeds <base_voice_and_safety>
stage_chunk (segment 1) Per-stage goal block from stages.yaml DROPPED — the SOP checklist replaces stage-driven goal narration
sop_chunk (segment 2) build_sop_segment() SOP block REPLACED by SOPContract.checklist_for_prompt() — same intent, simpler API
addendum_1_chunk / addendum_2_chunk (segments 3+4) Pivot-classifier-driven knowledge addendums DROPPED — voice rules already cover the topics the addendums layered in (caregiver, capacity, grave disclosure are in the base)
tail_chunk (segment 5) {patient_context} + {emotional_context} REPLACED by <patient_context> + <documents> derived from consolidated_state

3.4 Token budget

Hard ceiling: 10,000 tokens total prompt (prefix + tail + user message). Composition:

Block Typical Hard cap Truncation strategy
<base_voice_and_safety> 3,800 3,800 Static. If body grows past cap, fail loud in PR review.
<sop_static_definition> 250 400 Static per SOP. If a SOP grows past cap, split it into a sub-SOP.
<sop_contract_checklist> (Captured + Still-needed) 250 600 Cap Captured at 30 entries (rare past intake completion). Truncate oldest entries first.
<patient_context> 100 200 Static fields; no truncation needed.
<documents> 200 800 Cap at 8 documents listed; remainder summarized as "+N more imaging studies on file."
<previous_turn_promises> 60 200 Cap at 3 promises surfaced; oldest dropped first.
render_history_tail(limit=30) 1,500 3,500 If total prompt > 9,500 tokens, drop oldest turns first (keep the last 10 turns minimum).
latest_user_message 100 500 Truncate at 2,000 chars with "…[truncated]" notice.

Caps enforced in triage_v6_2._assemble_v6_2_prompt; ceiling breach increments swallow_metric prompt_arch_v6_2_token_ceiling_hit and triggers history-tail truncation before any other block.

3.5 Concrete BEFORE/AFTER for TKR case 0a6b7e48

Scenario: Turn 1. Patient says "I need a knee replacement."

v6.1 today (composed by compose_v6)

The model receives 5 cached segments totaling ~6,400 tokens of prefix:

  1. Base (~3,800 tokens) — full conversation_v6.yaml lines 34-1028 including FIRST-TURN PLATFORM ROLE DISCLOSURE, VOICE & EMOTIONAL INTELLIGENCE, GRAVE-DISCLOSURE, JSON envelope, DISEASE-SPECIFIC QUESTION PRIORITY.
  2. Stage (~900 tokens) — stages.yaml discovery block including: "case_type / primary_fear / primary_hope / emotional_state.readiness / decision_stage / trigger_event" enumeration plus ROLE CLASSIFICATION ("are you the patient or helping arrange care for someone else?") plus QUESTION BUDGET=2.
  3. SOP (~400 tokens, capped by sop_segment_max_words=400) — build_sop_segment() output for the active layer (mobility_conditioning), listing must_collect: primary_complaint, affected_side, walking_distance, joint_pain_severity, ….
  4. Addendum 1 (~600 tokens) — caregiver_capacity addendum if intent_aware flag is on and pivot classifier returns capacity domain.
  5. Tail (~700 tokens) — patient_context block + emotional_context = "neutral".

Then the variable tail: 30 turns of history (empty on turn 1) + <patient_data> block + the user message.

The model has to reconcile stages.yaml says batch up to 2 questions, SOP says ask must-collect fields in order, addendum says be alert for caregiver framing, base prompt says one beat then one question. Audit case 0a6b7e48 showed the model answering this by emitting 22 issues over the conversation — overlapping requests, re-asking confirmed fields, occasional medical-advice slippage.

v6.2 proposed

The model receives a single cached prefix (~4,050 tokens — <base_voice_and_safety> + <sop_static_definition> only; per §3.2 the dynamic checklist + patient context + documents sit BELOW the cache boundary):

  1. Base (~3,800 tokens) — same conversation_v6.yaml body.
  2. SOP static definition (~250 tokens) — sop_id, procedure_codes, required_documents schema, clinical_safety_rules for the resolved contract. Stable until Naidu edits the YAML.

Below the cache boundary (per-turn, not cached):

  1. SOP checklist projection (~250 tokens) — rendered by SOPContract.checklist_for_prompt:
## Contract Status (TKR)

Still needed:
- procedure_side (mandatory for matching)
- age (mandatory for matching)
- country_of_residence (mandatory for matching)
- funding_source (mandatory for matching)
- key_comorbidities (mandatory for safety)

Optional:
- walking_distance
- preferred_corridors
- timeline_preference

Documents still needed:
- knee_xray (mandatory before booking)
- bloodwork_recent (mandatory before booking)

Active safety rules:
- (none)
  1. Patient context (~80 tokens) — Procedure (current best read): knee replacement / — / —. Comorbidities: (none recorded).
  2. Documents (~30 tokens) — (no documents on file).
  3. Previous-turn promises (0 tokens on turn 1) — empty stanza.

Then the variable tail (history + user message).

Expected response — one beat acknowledgement + ONE question, driven by the LLM's read of conversational momentum:

"Got it — knee replacement. I'm Curaway's AI care coordinator; the surgical call sits with the providers I'll connect you with, but I'll help you build the picture. Quick clarifier: left, right, or both?"

The model picked procedure_side because (a) it's the first "Still needed" mandatory-for-matching field, (b) v4.1-style judgment recognizes you can't sensibly ask about funding before establishing which knee. Crucially, the model did NOT enumerate; it did NOT re-derive the stage; it did NOT need a classifier to tell it which addendum to load.

3.6 SOP safety-text + base-pacing conflict audit

The new <sop_static_definition> block puts SOP clinical_safety_rules[].description text directly into the LLM system prompt. If a SOP author phrases a rule in second-person ("you should …", "you must …") or first-person ("I recommend …"), that phrasing leaks through as second-person clinical instruction — violating CLAUDE.md ground rule 9 (NO medical advice) and the hard ban encoded in tests/test_no_medical_advice.py.

Audit requirement (PR-2.1 acceptance addition). Before PR-2.5 dispatches, audit every SOP's clinical_safety_rules[].description field across all 18 SOPs in config/prompts/sops/. Failing strings:

  • Second-person directives: you should, you must, you need to, you ought to, please ensure you, anything in the regex \byou\b\s+(should|must|need|ought|have to)\b.
  • First-person directives: I recommend, I advise, I suggest.
  • Diagnostic labels stated as fact about THIS patient (vs. the safety pattern).

Resolution. Add tests/prompts/test_sop_safety_phrasings.py::test_validate_sop_safety_descriptions that scans all 18 SOPs' clinical_safety_rules[].description strings and fails on any of the above patterns. CI-mandatory; merges blocked on violations.

Base-body / checklist-foot conflict audit. Separately, audit that the foot instruction "Pick the ONE most relevant next step given conversational momentum. Don't enumerate." does NOT contradict conversation_v6.yaml lines 34-1028 (pacing/budget rules, FIRST-TURN role disclosure, GRAVE-DISCLOSURE one-beat, DISEASE-SPECIFIC QUESTION PRIORITY). Reviewer subagent (code-reviewer + architecture-reviewer) must call out conflicts as a blocking finding on PR-2.1.

Voice rules end-to-end verification. PR-2.1 acceptance: run tests/test_voice_compliance.py against the full assembled v6.2 prompt (not just the base prompt body); the assembled-prompt fixture exercises base + sop_static_definition + checklist + patient_context + documents.


4. Drop stages.yaml usage (v6.2 path only)

stages.yaml is read at these sites (verified via grep):

Site File:line Purpose v6.2 replacement
Composer app/services/prompt_loader_v6/composer.py:148-160 (_load_base_prompt + _load_stage_context + _load_stage_re_offer_turns) Per-stage guidance + records-re-offer turn lookup Not called in v6.2 path. triage_v6_2 skips composer entirely.
YAML cache app/services/prompt_loader_v6/yaml_cache.py:68-110 Caches stages.yaml parse Unused in v6.2 path.
Stage resolver app/services/stage_resolver.py:143 Resolves which of 12 stages the case is in Not called in v6.2. SOP checklist captures completion; LLM judgment handles ordering.
Workflow snapshot app/services/workflow_snapshot_service.py:68 stage_turn_count for re-offer guard Removed for v6.2. Records re-offer becomes a SOP checklist hint: when a "Documents still needed" entry persists across N turns the checklist text can include "(mention upload option)". Implementation deferred to PR-2.4.
Artifact validator app/services/v6_artifact_validator.py:204-291 CI validation that stages.yaml is well-formed Stays — v6.1 path still uses stages.yaml, and the validator must keep guarding that path until Phase 6 cutover.
Case summary app/services/case_summary_service.py:22 UX-only comment referencing stages truth table Comment only — no code change.
Triage agent comment app/agents/triage_agent.py:736 Comment only No code change.
Scripts scripts/create_v6_flags.py:78, scripts/generate_prompt_graph.py:46,129 Doc generation + flag descriptions Update in PR-2.4 to mention v6.2 path bypasses stages.

v6.1 path retains all of the above untouched. Phase 6 (out of scope here) will delete stages.yaml once the v6.2 path is the default and the v6.1 fallback can be retired.


5. Flagsmith integration

Two new flags in config/feature_flags.yaml (and matching Flagsmith env entries — both Prod and Dev per the standing feedback_flagsmith_dual_env rule):

  prompt_arch_v6_2_flexible:
    default: false
    status: active
    description: "v6.2-flexible single-call triage pipeline. When ON, the prompt dispatcher routes through triage_v6_2.run_triage_turn_v6_2 instead of compose_v6. Tenant-scoped + percentage-rollout-capable via Flagsmith identity. Default OFF. See docs/specs/v6.2-flexible-phase2-prompt-assembly-spec.md §5."

  consolidated_state_backfill_complete:
    default: false
    status: active
    description: "Operator-flipped sentinel  set to TRUE only after scripts/backfill_consolidated_state.py finishes across all tenants AND swallow_metrics shows consolidated_state_shim_failure rate at zero for 24h. The v6.2 read path checks this BEFORE serving traffic. If false, falls through to v6.1 silently with metric prompt_arch_v6_2_backfill_gate_fired. Phase 2 gate; deleted once cutover completes."

Flagsmith unreachable behavior. If Flagsmith is unreachable at flag-resolve time, is_feature_enabled falls back to the YAML default (both flags false); the v6.2 path is dormant and v6.1 serves. No degraded mode — Flagsmith outage means "stay on the previous-known-good architecture." This is the SAFE default for both flags; do not invert it.

Flagsmith V2-versioning patch shape. Curaway's Flagsmith envs use V2 versioning. Operator playbook for flipping prompt_arch_v6_2_flexible or consolidated_state_backfill_complete:

  • For BOOLEAN state changes use the env-scoped URL: PATCH /api/v2/environments/{env-key}/featurestates/{feature-state-id}/ (NOT the unscoped /api/v1/features/{id}/featurestates/).
  • For CONFIG-flag VALUE changes (multivariate / segment overrides) use POST /api/v2/environments/{env-int-id}/featurestates/{feature-state-id}/versions/ with the FeatureStateValue dict shape {type, string_value, integer_value, boolean_value}.
  • Run python scripts/sync_flagsmith.py --dry-run BEFORE every flip to confirm the YAML→Flagsmith diff matches the intended change; the existing script encodes the correct V2 PATCH shape.
  • Flip Prod + Dev together per feedback_flagsmith_dual_env.

Dispatcher logic (proposed app/agents/prompt_dispatcher.py):

async def dispatch_prompt_arch_v2(...) -> DispatchResult:
    # 1. PER-CASE STICKINESS (see §5.1) — once the case has a sticky arch,
    #    honor it. Flag flips only affect NEW cases.
    sticky_arch = (case.workflow_state or {}).get("prompt_arch")
    forced_arch = (case.workflow_state or {}).get("force_prompt_arch")
    if forced_arch:
        record_swallow("prompt_arch_sticky_resolved",
                       labels={"arch": forced_arch, "tenant_id": tenant_id or "unknown",
                               "source": "force_override"})
        return _dispatch_arch(forced_arch, ...)
    if sticky_arch:
        record_swallow("prompt_arch_sticky_resolved",
                       labels={"arch": sticky_arch, "tenant_id": tenant_id or "unknown",
                               "source": "sticky"})
        return _dispatch_arch(sticky_arch, ...)

    # 2. NEW case — resolve flags once, write to case.workflow_state.prompt_arch.
    v6_2_on = is_feature_enabled("prompt_arch_v6_2_flexible",
                                 tenant_id=tenant_id, identity=identity)
    backfill_done = is_feature_enabled("consolidated_state_backfill_complete",
                                       tenant_id=tenant_id, identity=identity)

    # 3. v6.2 path requires backfill complete
    if v6_2_on and not backfill_done:
        record_swallow("prompt_arch_v6_2_backfill_gate_fired",
                       labels={"tenant_id": tenant_id or "unknown"})
        resolved = _resolve_v6_or_v4_fallback(tenant_id, identity)
    elif v6_2_on and backfill_done:
        resolved = "v6.2_flexible"
    else:
        resolved = _resolve_v6_or_v4_fallback(tenant_id, identity)

    # 4. Stickify on the case row (one-time write at case creation; idempotent
    #    on re-entry because step 1 short-circuits).
    case.workflow_state = {**(case.workflow_state or {}), "prompt_arch": resolved}
    await case_repo.update_workflow_state(case.id, case.workflow_state)
    record_swallow("prompt_arch_sticky_resolved",
                   labels={"arch": resolved, "tenant_id": tenant_id or "unknown",
                           "source": "first_resolve"})
    return _dispatch_arch(resolved, ...)

The dispatcher remains async-safe + try/except wrapped at the outer boundary, matching v6_dispatcher.dispatch_prompt_arch's contract (app/agents/v6_dispatcher.py:106-129). Any v6.2 path exception → fall back to v6.1 (which itself falls back to v4 on its own internal error). Two safety nets stack.

5.1 Per-case prompt_arch stickiness

Policy. Once a case has been served at least one turn, its prompt_arch is fixed for the remainder of the conversation. Flag flips affect NEW cases only. This guarantees Sindhu's in-flight conversation cannot silently swap from v6.1 to v6.2 (or vice versa) mid-turn, which would change system-instruction shape underneath conversation history and degrade promise-honoring / re-ask discipline.

Mechanism. - At case creation (first dispatcher call for the case), the dispatcher resolves Flagsmith ONCE and writes case.workflow_state.prompt_arch ∈ {"v6.2_flexible", "v6.1", "v4"}. - Every subsequent turn for the same case reads case.workflow_state.prompt_arch and routes directly to that arch — Flagsmith is NOT re-consulted. - New cases created AFTER a flag flip pick up the new arch on their first turn.

Admin override. A case.workflow_state.force_prompt_arch field, settable via the admin SOP/case editor or a one-off SQL update, takes precedence over the sticky value. Use cases: - Force-migrating an in-flight v6.1 case to v6.2 for testing. - Recovery: pinning a case to v6.1 if it triggered a v6.2 regression in production. - Sindhu-canary: explicitly force v6.2_flexible on the 5 fresh canary conversations regardless of percentage rollout.

The override never silently expires; clearing the field reverts to the sticky value.

Observability metric. prompt_arch_sticky_resolved is emitted on every dispatcher call with labels {arch, tenant_id, source} where source ∈ {sticky, force_override, first_resolve}. The Metabase panel prompt_arch_distribution aggregates these to show (a) what percentage of in-flight cases are on each arch, (b) the rate of force_override (which should be near-zero in steady state), and (c) the rollout pace of first_resolve{arch=v6.2_flexible} as the flag's percentage rollout dial increases.

Migration cost. One column write per case at first turn (~negligible — case_repo.update_workflow_state is already called on most turns for intake_complete writes). The workflow_state JSONB column already exists and is the canonical home for orchestration metadata.


6. Read-path swap (the riskiest change)

Phase 2 swaps only the triage hot path to read consolidated_state directly. Everything else stays on the legacy layer_state + ehr_snapshot representations served by the dual-write shim.

Read site Phase 2 action Reads after Phase 2 Why defer
triage_v6_2.run_triage_turn_v6_2 NEW — reads case.consolidated_state consolidated_state This is the whole point of Phase 2.
app/agents/triage_agent.py:run_triage_turn (v6.1) Unchanged layer_state + ehr_snapshot v6.1 must keep working as fallback.
app/services/prompt_loader_v6/composer.py:build_patient_context Unchanged ehr_snapshot via case_service.get_case Same — v6.1 dependency.
Chat router state-for-UI reads (app/routers/chat.py) Unchanged ehr_snapshot Frontend reads stay on the v6.1-stable contract. Phase 2.5 swaps.
Matching engine (app/services/matching/*) Unchanged ehr_snapshot + layer_state Matcher has its own contract + tests; coupling it to v6.2 reads expands blast radius beyond Phase 2 risk budget. Phase 2.5.
Coordinator dashboard (apps/coordinator/*) Unchanged ehr_snapshot Coordinator UI freeze until Phase 6.
Auto-invoke matcher trigger (app/agents/auto_invoke_matcher.py) Modified — reads SOPContract.missing_for_matching(consolidated_state) under v6.2 sticky arch; same gate semantics, different read source consolidated_state (via SOPContract) under v6.2; layer_state.completion under v6.1 See §12 for full lifecycle responsibilities.
apps/admin/* reads (admin dashboards, coordinator portal) Unchanged ehr_snapshot Dual-write shim keeps ehr_snapshot ≤1-turn-behind during v6.2 turns (writer runs post-LLM on extractor merge). No staleness alarm needed — admin reads remain accurate to the prior turn's data, same lag as today.

The dual-write shim (app/services/consolidated_state_writer.py) is what makes this safe. Every extractor merge and every EHR rebuild updates BOTH representations; reads on either side see consistent data. Phase 2 = read-divergence only for the v6.2-flagged hot path.

Gate before flipping the flag for any tenant: the consolidated_state_shim_failure swallow counter (emitted at consolidated_state_writer.py:84-88) must be at zero for 24h sustained.

Empty-state read policy (consolidated_state == {} or partial). v6.2 always serves once the flag and backfill gate are on; missing fields render as and the LLM treats them as "ask next." There is no turn-level fallthrough to v6.1 based on consolidated_state shape — the LLM is the judgment layer for "what's missing." The only fallthrough triggers are (a) prompt_arch_v6_2_backfill_gate_fired (backfill flag false), (b) an exception inside triage_v6_2.run_triage_turn_v6_2 (caught by the dispatcher, fall through with fallback_reason set), or (c) the sticky arch on the case is not v6.2_flexible. Shape-of-data is never a trigger.

6.1 Documents wiring contract — CRITICAL

This is the single biggest gap closure in the spec. The §3.1 <documents> block + §7 enum design fix the D1-D6 hallucination class STRUCTURALLY — but only if consolidated_state.documents is actually populated. Without this section's wiring, the block renders (no documents on file) for every case and the D1-D6 fix doesn't fire in production.

Hard requirement. PR-2.1 (and the subset taken by PR-2.3) MUST land all of the following or the spec's structural D1-D6 claim is operationally false:

  1. _empty_consolidated_state() shape — already includes "documents": {} (verified at app/services/consolidated_state_backfill.py:58). No change needed for the skeleton — but the transform function MUST populate it (see #2).

  2. transform_case_to_consolidated_state(case) must populate documents from the case.documents relation (Document model at app/models/document.py). Required shape per doc_id:

state["documents"][str(doc.id)] = {
    "doc_id": str(doc.id),
    "type": doc.document_type,           # maps to SOP required_documents keys
    "ocr_status": _map_status(doc),      # → enum from §7 (queued/processing/complete/failed_transient/failed_permanent/expired/not_applicable)
    "eta_seconds": _eta_seconds(doc),    # for queued/processing only
    "label": doc.label or doc.filename,
    "findings": {},                       # Phase 4 fills; Phase 2 leaves empty
    "last_seen_turn": doc.created_turn or None,
}
  1. consolidated_state_writer.write_consolidated_state must refresh the documents block on every call. OCR status mutates asynchronously (extractor pipeline), so a writer call mid-conversation MUST re-derive the documents block from case.documents — not preserve a prior turn's value. Add _refresh_documents(state, case) step in the writer; idempotent.

  2. PR-1.3 backfill script — separate sub-task. The existing PR-1.3 backfill iterates cases and runs transform_case_to_consolidated_state. PR-2.1 acceptance includes a sub-task explicitly: re-run scripts/backfill_consolidated_state.py against production once #2 + #3 land, so EXISTING cases (created before the documents wiring) pick up their document state. Without this, existing cases silently regress to empty documents block — the same hallucination failure mode the spec claims to fix. Silent-regression footgun. Block PR-2.5 dispatch until this sub-task is verified complete via a spot-check query: SELECT count(*) FROM cases WHERE jsonb_array_length(consolidated_state->'documents') > 0 vs. join on case_documents — counts must match.

  3. Phase 4 fills findings; Phase 2 just ensures the documents block exists with status. Phase 2 does NOT block on findings being populated — empty findings: {} is the correct Phase 2 shape. The structural fix (LLM sees doc presence + status, can't hallucinate "extraction failed") fires the moment status renders correctly. Phase 4 enriches by populating findings via the vision extractor.

PR-2.3 acceptance addition. "When a case has uploaded documents, the assembled v6.2 prompt's <documents> block must render at least the doc_id, type, and ocr_status for each document. Empty findings dicts are acceptable (Phase 4 fills them). Verified by test_documents_block_renders_for_case_with_uploaded_doc end-to-end fixture in tests/test_triage_v6_2.py."

Replay-harness baseline. The 5-turn replay corpus in tests/replay/fixtures/ MUST include at least one fixture conversation with an uploaded document (in each of queued, processing, complete, failed_permanent statuses). PR-2.5's Sindhu canary baseline must include a fixture where Sindhu uploads a document and verifies the <documents> block populates AND the LLM acknowledges status accurately (no "extraction failed" hallucination).


7. Document interpretation hook (Phase 4 prep)

Per SD's added goal: "SOP framework needs to interpret the reports (images/documents/xrays) — without which some validation checks will not work."

Phase 2's prompt explicitly surfaces document state from consolidated_state.documents. The shape per doc:

{
    "doc_id": "uuid",
    "type": "knee_xray" | "bloodwork_recent" | ,
    # Status enum — exhaustive, prompt template renders one phrasing per value (per review S2 + S3)
    "status": "queued"             # uploaded, not yet picked up by extractor
            | "processing"          # extractor running; eta_seconds populated
            | "complete"            # findings populated; safe to render verbatim
            | "failed_transient"    # one or more retries failed; will retry
            | "failed_permanent"    # out of retries (default 3×); ask patient to re-upload or describe verbally
            | "expired"             # file expired before processing (e.g., R2 presigned URL TTL elapsed); re-upload
            | "not_applicable",     # SOP marked this doc as not required for the current case path
    "eta_seconds": 60,             # present when status in {queued, processing}
    "label": "Left knee X-ray (2026-05)",
    "findings": {                  # populated only when status == "complete"
        "joint_space_mm": 2.1,
        "osteophyte_grade": 3,
        ...
    },
}

The prompt renders this verbatim so the LLM SEES document status. This structurally fixes the #1308 / Sindhu D1-D6 cluster: the model can say "I see your X-ray is still processing — let me ask about your medications meanwhile" instead of fabricating "the extraction failed" because document state was buried in a cache segment the model only half-attended to. The status-specific phrasings rendered in §3.1's <documents> block map 1:1 to the enum above.

Phase 4 will populate the findings dict via the vision extractor pipeline. Phase 2 just needs to:

  1. Backfill consolidated_state.documents from existing case.documents table during consolidated_state_backfill.py execution. The current backfill (consolidated_state_backfill.py:33-65) leaves documents: {} empty — PR-2.1 extends it.
  2. Update consolidated_state_writer.write_consolidated_state to re-sync document status on each call.

Even with empty findings for most docs (Phase 4 fills them), the presence + status signal alone removes the hallucinated-failure failure mode.


8. Extractor demotion preview (Phase 3 prep)

Phase 3 formalizes extractors as observers. Phase 2 doesn't refactor them — they continue running via triage_agent.run_extractors (triage_agent.py:1250-1390) in the v6.1 path, and via a parallel asyncio.gather call inside triage_v6_2.run_triage_turn_v6_2 step 6 — but in v6.2 the prompt is composed and the LLM call is issued before extractor results are consumed.

The transition:

  • Phase 2: LLM-judgment-primary. Extractors still write layer_state (via dual-write shim that auto-mirrors to consolidated_state). Their output during the current turn is captured in TurnResult.extractor_observations for telemetry but does NOT alter the response. Next turn's prompt will reflect the updated consolidated_state.
  • Phase 3: Formalize the observer protocol. Extractors emit deltas to a queue; a post-merge reconciler validates against the LLM's stated extractions (parsed from extracted_data in the JSON envelope) and resolves conflicts via a simple "LLM wins on contradictions, extractor wins on additions" rule.

The key Phase 2 contract: run_triage_turn_v6_2 MUST NOT block on extractor completion. They are fired with asyncio.create_task after the response is published to WS, then awaited in a short cleanup section before the function returns (so any synchronous DB write inside the extractor commits before the request ends). If an extractor times out (>5s), it's abandoned with a swallow_metrics increment.


9. PR chain breakdown

PR-2.1 — Add triage_v6_2.py + tests (no wiring) — ~2 days

  • New module app/agents/triage_v6_2.py containing TurnResult dataclass + run_triage_turn_v6_2 function (signature from §2.1).
  • New module app/agents/prompt_dispatcher.py (placeholder copy of v6_dispatcher.py — rename is PR-2.4).
  • Prompt assembly helper in same module (_assemble_v6_2_prompt(consolidated_state, sop_contract, history, latest_message) -> str).
  • New helper render_previous_assistant_promises(history) in app/agents/triage_v6_2.py that regex-scans the prior assistant message for yes/no offers ("shall I…?", "would it help if…?", "want me to…?"). Surfaces them in the <previous_turn_promises> stanza. Empty stanza on turn 1.
  • Extend consolidated_state_backfill._empty_consolidated_state() (consolidated_state_backfill.py:33-65) to populate documents dict from case.documents table on backfill + writer paths, including all 6 status enum values from §7.
  • SOPContract cache invalidation hook : add SOPContract.invalidate_cache() classmethod that calls SOPContract.all_contracts.cache_clear() and _generic_contract.cache_clear(). Wire it into app/routers/admin_sops.py save handlers (call after any SOP YAML write — see §11). Add unit test test_sopcontract_cache_invalidated_after_admin_save asserting the cache is cleared on the admin save path and the next SOPContract.load(...) re-reads the YAML.
  • Unit tests tests/test_triage_v6_2.py: mock SOPContract.load, mock llm_gateway.invoke, assert prompt content + TurnResult shape + cache-boundary placement (the test asserts <sop_contract_checklist> appears AFTER the boundary marker per §3.2).
  • Replay-harness adapter : existing harness in tests/replay/ is shaped for v6.1's compose_v6 dict return. Extend the adapter to accept BOTH v6.2 TurnResult and v6.1 dict shapes (discriminate on isinstance(result, TurnResult)). Ship the adapter in the same PR as run_triage_turn_v6_2 so the replay corpus immediately exercises the new path.
  • Replay harness extension tests/replay/test_v6_2_parity.py: take the 5-turn fixture corpus already used for v6.1 replay (tests/replay/fixtures/) and assert the v6.2 path produces responses with NO regression on the existing voice / clinical-safety / one-question-per-turn axes (axis 1-5 from .claude/rules/prompts.md).
  • LLM-grader v6.2 stub : add a skip-marker to tests/test_llm_grader_axis_10.py for v6.2-prompted fixtures (filtered via prompt_arch == "v6.2_flexible" metadata) with a pytest.skip("v6.2 grader fixtures land in Phase 2.5 follow-up — see issue TBD"). Note: the grader axes are stage-agnostic and SHOULD still fire on v6.2 (answers §12 Q2); the skip is defensive in case any axis hard-depends on stage_id in trace metadata. If the grader passes on v6.2 fixtures without the skip during PR-2.1 review, remove the skip.
  • Required test coverage additions (per DoD TESTING + review §Part 1):
  • test_doc_status_processing_rendered — case with one document at status="processing"; assert distinct phrasing ("ETA ~Ns — findings pending"), NOT generic placeholder.
  • test_doc_status_failed_permanent_rendered — case with one doc at status="failed_permanent"; assert "please re-upload" / "describe verbally" language.
  • test_serves_on_empty_consolidated_state — case with no procedure, no docs, no state; assert non-broken turn (no exception, response is a valid JSON envelope per parse_v4_response).
  • test_force_prompt_arch_override_path — set case.workflow_state.force_prompt_arch = "v6.2_flexible"; dispatcher MUST route to v6.2 regardless of sticky/Flagsmith.
  • test_invalidate_cache_fan_out_to_triage_v6_2 — admin SOP save handler triggers SOPContract.invalidate_cache(); assert that the next call to _cacheable_prefix in triage_v6_2 re-reads the YAML (NOT a stale cached value). Verified sop_contract.py:649-682 invalidate_cache fan-out shipped.
  • test_documents_block_renders_for_case_with_uploaded_doc — case with case.documents populated; assert <documents> block renders doc_id + type + ocr_status (Gap 1 enforcement, §6.1).
  • test_validate_sop_safety_descriptions — scan all 18 SOPs for second-person / first-person phrasing in clinical_safety_rules[].description; CI-mandatory (§3.6).
  • ADR write-up. Landed as docs/adr/0030-conversation-v6-2-flexible-architecture.md — captures the single-call SOP-contract decision + alternatives considered (B/C/D from architecture audit) + consequences. Required for Tier-3 architectural pivot per CLAUDE.md ground rule 8.
  • mkdocs.yml nav update. Add docs/specs/v6.2-flexible-phase2-prompt-assembly-spec.md to the Architecture section.
  • docs/reference/feature-flags.md update. Both new flags documented (prompt_arch_v6_2_flexible, consolidated_state_backfill_complete). PR-2.2 lands the actual flag changes; PR-2.1 lands the docs entry.
  • Health page (/landscape) addition. New column health metric: consolidated_state populated-rate (% of cases where consolidated_state JSONB is non-empty). Surfaces the backfill state as a first-class production health indicator.
  • Acceptance: all tests green; module imports cleanly; no wiring to production paths; SOPContract.invalidate_cache() wired into admin save handlers and verified by unit test; replay-harness adapter accepts both TurnResult and dict shapes; ADR-0030 landed; mkdocs nav + feature-flags.md updated; health page metric registered.
  • Reviewers: code-reviewer + architecture-reviewer subagents.

PR-2.2 — Flagsmith flags + dispatcher wiring — ~1 day

  • Add the two flags (prompt_arch_v6_2_flexible, consolidated_state_backfill_complete) to config/feature_flags.yaml.
  • Sync to Flagsmith Prod + Dev (both default false; per feedback_flagsmith_dual_env).
  • Update prompt_dispatcher.py to implement the 3-way fork from §5.
  • Update app/agents/conversation_prompt.get_system_prompt to call the new dispatcher when v6.2 is selected; bypass the SystemPromptResult.cache_segments codepath (v6.2 returns a single composed string + a single ephemeral cache marker, not 5 segments).
  • Update app/agents/triage_agent orchestration entry to detect v6.2 and call run_triage_turn_v6_2 instead of run_triage_turn.
  • Add metric prompt_arch_v6_2_dispatch_count{result=served|fell_through|errored} to swallow_metrics.
  • Test: with both flags FALSE, no behavior change vs. trunk (regression net).
  • Test: with prompt_arch_v6_2_flexible=true but consolidated_state_backfill_complete=false, v6.1 still serves + prompt_arch_v6_2_backfill_gate_fired counter increments.
  • Acceptance: flag-off equivalence; flag-on-backfill-off fall-through; no production traffic affected.

PR-2.3 — Read-path swap in triage hot path — ~1 day

  • Confirm consolidated_state_backfill.py extended doc backfill (from PR-2.1) and run scripts/backfill_consolidated_state.py --dry-run to spot-check 10 representative cases (TKR, ACL, oncology, exploratory).
  • Wire triage_v6_2.run_triage_turn_v6_2 to consume case.consolidated_state (already populated by Phase 1's dual-write shim).
  • Documents-block render verification (§6.1). Add test_documents_block_renders_for_case_with_uploaded_doc end-to-end fixture: case has 1 uploaded doc with ocr_status="processing"; assembled prompt's <documents> block must render doc_id + type + ocr_status. Empty findings: {} is acceptable. BLOCKS PR-2.5 until green.
  • Transcript dataclass extension : extend the Transcript dataclass (used by LLM-grader axis-11) with sop_contract_id: str | None field. Plumb from triage_v6_2.TurnResult.sop_contract_id through to Transcript so axis-11 grader (SOP metadata) actually fires on v6.2 fixtures (currently defaults to not_applicable).
  • Equivalence test tests/test_v6_2_read_equivalence.py: take 20 production-shaped fixture cases; assert that SOPContract.checklist_for_prompt(case.consolidated_state) exposes the same captured / still-needed fields as the v6.1 layer_state representation does (via a small adapter that simulates the v6.1 view).
  • Acceptance: equivalence test passes; documents render-test green; Transcript carries sop_contract_id; no flag flips yet.

PR-2.4 — Drop stages.yaml from v6.2 path + rename dispatcher — ✅ shipped (#1327)

  • Rename approach: deprecation-docstring + alias (not file rename). 14 import sites across app/ and tests/ — alias is safer than a mechanical rename. v6_dispatcher.py keeps all symbols; module docstring notes Phase 6 cleanup. prompt_dispatcher.py is the 3-arch canonical module.
  • Confirmed via grep + source-inspection tests: triage_v6_2.py and prompt_dispatcher.py do NOT reference _load_stage_context or _load_stage_re_offer_turns — the v6.2 path was already stages-free from PR-2.1.
  • v6_dispatcher.py docstring updated to document architecture split (v6.1 owns stages.yaml; v6.2 path bypasses it) and DispatchResult unification deferred to Phase 6.
  • DispatchResult unification DEFERRED to Phase 6: v6_dispatcher.DispatchResult (6 fields) wrapped by prompt_dispatcher._v6_1_dispatch at all call sites; unified single dataclass deferred to Phase 6 cleanup after v6.1 sunset.
  • v6.1 path remains unchanged; stages.yaml on disk untouched.
  • Acceptance:from app.agents.v6_dispatcher import dispatch_prompt_arch still works; ✅ no v6.2-path code reads stages.yaml (7 tests confirm); ✅ all 41 existing dispatcher tests pass.

PR-2.4.5 — Split triage_v6_2.py (798 LOC) into a package — ~0.5 day

triage_v6_2.py already exceeds the DoD 500-LOC ceiling at 798 lines. Split into a package, all submodules < 500 LOC each:

  • app/agents/triage_v6_2/__init__.py — re-exports run_triage_turn_v6_2, TurnResultV6_2. Keeps existing import paths working (from app.agents.triage_v6_2 import run_triage_turn_v6_2).
  • app/agents/triage_v6_2/dispatch.pyrun_triage_turn_v6_2, TurnResultV6_2, outer try/except + fallback shim. ~250 LOC.
  • app/agents/triage_v6_2/prompt_builder.py_cacheable_prefix, build_variable_tail, render_documents_block, render_previous_assistant_promises, the _assemble_v6_2_prompt orchestrator. ~300 LOC.
  • app/agents/triage_v6_2/extractor_observers.py_run_extractor_observers + asyncio.gather wiring. ~150 LOC.
  • app/agents/triage_v6_2/llm_error_fallback.py — fallback response shim used when llm_gateway raises. ~80 LOC.

Test references update to from app.agents.triage_v6_2 import ... (no path changes — __init__.py re-exports preserve the public API). Mechanical refactor; reviewer subagent: code-reviewer (Sonnet).

  • Acceptance: all submodules < 500 LOC; wc -l app/agents/triage_v6_2/*.py | awk '{if ($1 > 500) print}' returns empty; existing tests/test_triage_v6_2.py passes without source modification; PR-2.4 dispatcher imports still resolve.

PR-2.5 — Sindhu tenant rollout — ✅ code shipped; operator-driven rollout pending

PR-2.5 ships the rollout infrastructure: consent intent detector ( app/services/consent_intent_detector.py) wired into intake_triage._handle_intake_triage_v6_2, PostHog cohort-funnel event stubs (app/integrations/posthog_events.py), Flagsmith dual-env flip helper (scripts/flip_v6_2_canary.py), comprehensive 11-phase operator runbook (docs/runbook/v6-2-rollout-checklist.md), and the monitoring-query reference (docs/runbook/v6-2-monitoring.md). The operator-driven rollout itself (Flagsmith flip + Sindhu test session + 24h soak) is executed by SD per the runbook AFTER this PR merges — see runbook Phases 0 → 8 for the copy-paste-ready commands.

  • Operator runs scripts/backfill_consolidated_state.py against production. Confirms zero consolidated_state_shim_failure swallow events for 24h. (runbook Phase 1)
  • Operator runs scripts/backfill_vision_interpretation.py --tenant-id <sindhu> for the existing-doc backlog. (runbook Phase 2)
  • Operator runs scripts/flip_v6_2_canary.py --action enable-backfill-complete --execute to flip the global env-default flag in BOTH Flagsmith envs (per feedback_flagsmith_dual_env). (runbook Phase 3)
  • Operator soaks 1h at zero-tenant; verifies shim_failure + dispatcher metrics. (runbook Phase 4)
  • Operator runs scripts/flip_v6_2_canary.py --action enable-sindhu-tenant --tenant-id <sindhu> --execute for the per-tenant identity override in BOTH envs. (runbook Phase 5)
  • SD initiates 5 fresh conversations with Sindhu persona. (runbook Phase 6)
  • Compare via tests/replay/test_v6_2_parity.py extended with the new Sindhu transcripts. (runbook Phase 7)
  • Acceptance: < 3 issues per turn averaged across the 5 conversations (down from 22 on case 0a6b7e48 baseline); no axis-1 (clinical safety) regressions; voice rules pass. Per spec §20, the four numbered measurable criteria must ALL hold for 7 consecutive days before Phase 6 (full cutover) dispatches.
  • Rollback criterion: scripts/flip_v6_2_canary.py --action rollback-all --tenant-id <sindhu> --execute (<30s) if ANY of {voice_rules_test_failure_rate > 1%, llm_meta.fallback_fired rate > 5%, swallow_metrics.consolidated_state_shim_failure > 0, axis-1 grader axis drops > 0.5} crosses threshold sustained for 1h.

Consent intent detection (PR-2.5 code change). Per the matching-gate spec §6 (set_patient_proceed_consent helper), the gate already accepts a consent-bypass for the document-findings check. PR-2.5 wires the actual detector: regex/keyword scan of the patient turn text in intake_triage._handle_intake_triage_v6_2 (before the LLM call), flips consolidated_state.sop_checklist.patient_proceed_consent = True on detection, persists via consolidated_state_writer, and emits v6_2_patient_proceed_consent_detected PostHog event. Negation-guarded (e.g. "not ready to find providers" does NOT trigger). Safety-mandatory data fields remain un-bypassable regardless of consent.

Total chain effort: ~7 working days end-to-end. PR-2.1 carries the bulk of the design risk; PRs 2.2-2.4 are mechanical wiring; PR-2.5 is a controlled rollout with monitoring.


10. Risk model

ID Risk Mitigation
R1 v6.2 produces worse output than v6.1 on some case types Replay harness gates each PR (5-turn fixture corpus). Flagsmith per-tenant rollout + 30-second rollback. Sindhu canary before any broader flip.
R2 consolidated_state missing fields the v6.2 prompt expects Backfill script verified pre-rollout. Missing fields render as in patient_context (defensive). _is_present helper in sop_contract.py:730-739 already handles None/empty-string/empty-list/empty-dict uniformly.
R3 SOP checklist phrasing confuses the LLM (too checklist-y, model enumerates) Prompt iteration on Sindhu tenant before broader rollout. The "pick ONE, don't enumerate" instruction at the prompt foot is load-bearing — tune wording if grader sees enumeration regressions.
R4 Voice rules + SOP scope post-checks need updating Phase 5 deals with this. Phase 2 keeps existing voice_rules.yaml + response_policy.py. No new voice failure modes expected because the base prompt body is byte-identical.
R5 Document interpretation gap (Phase 4 not done) Phase 2 prompt explicitly shows pending / failed / not_applicable doc status. The LLM handles the gap gracefully (it asks the patient about findings verbally or defers). The structural fix (LLM sees state instead of hallucinating) holds without Phase 4 — Phase 4 just enriches findings.
R6 Extractors run as observers but their delayed writes cause stale data on rapid turns The dual-write shim flushes on every merge. Worst case: turn N's extractor delta lands at turn N+1's prompt assembly — same lag as v6.1 today. No regression.
R7 The new prompt_dispatcher.py rename breaks an unforeseen import PR-2.4 keeps a re-export shim in v6_dispatcher.py for one release. CI's import-scanner catches direct symbol-misses.
R8 In-flight conversation silently switches arch mid-flight at flag flip Per-case stickiness (§5.1): case.workflow_state.prompt_arch is written at first turn and ignored by the dispatcher on subsequent turns. Flag flips affect NEW cases only. force_prompt_arch field available for explicit migration.
R9 consolidated_state empty/partial when v6.2 serves; LLM hallucinates fields v6.2 always serves once flag+backfill gates are on; missing fields render and the LLM treats them as "ask next." _is_present helper handles None/empty cases uniformly. No shape-of-data fallthrough trigger — only flag/exception/sticky-arch fallthroughs (see §6 empty-state policy).
R10 SOPContract.load(...) raises and crashes the dispatcher SOPContract.load(...) is contract-bound to NEVER raise (3-tier resolver: explicit code → fuzzy name → _generic fallback). The _generic contract is guaranteed loadable. The dispatcher relies on this; if a refactor breaks it, the outer try/except still falls through to v6.1 and increments prompt_arch_v6_2_sopcontract_load_failure.
R11 Naidu/SD edits SOP YAML in admin UI but v6.2 keeps serving stale contract until Railway restart SOPContract.invalidate_cache() is wired into admin_sops.py save handlers (see §11). Every YAML write triggers cache_clear(); next dispatcher call re-reads. Add swallow_metric sop_contract_cache_invalidated{trigger} for observability.
R12 v6.2 conversations never auto-promote to matching because PFS / intake_complete / auto_invoke_matcher don't fire §12 assigns lifecycle responsibilities explicitly: PFS scorer lives in app/services/pfs_v6_2.py and is called by the chat router post-turn; intake_complete derives from SOPContract.missing_for_matching(consolidated_state) == []; auto_invoke_matcher reads the same. No silent stall mode.
R13 consolidated_state.documents block always empty → D1-D6 structural fix doesn't fire → patient still sees "extraction failed" hallucinations §6.1 makes documents wiring a HARD acceptance criterion: _empty_consolidated_state() shape verified, transform_case_to_consolidated_state MUST populate from case.documents, consolidated_state_writer.write_consolidated_state MUST refresh on every call, PR-2.1 backfill re-run is a separate sub-task that BLOCKS PR-2.5. PR-2.3 acceptance includes end-to-end render test. Replay-harness baseline includes a case with uploaded doc.

10.1 Flagsmith cache TTL impact on Phase 2.5 rollout

feature_flags._CACHE_TTL = 60 seconds, per-process. During the Phase 2.5 tenant flip for Sindhu's tenant:

What happens during the flip window. All Railway worker processes that handled a request in the 60 seconds before the flag flip have prompt_arch_v6_2_flexible = false cached. Any NEW case created on those warm processes during the cache window resolves to "v6" (v6.1) and stickifies that arch permanently (workflow_state.prompt_arch = "v6"). Because per-case stickiness wins on all subsequent turns (§5.1 R8), those cases remain on v6.1 for their entire lifetime — the 30-second rollback guarantee does not help them.

Quantified blast radius. With _CACHE_TTL = 60s and a typical Railway single-container deployment: - Up to 60 seconds of newly-created cases may permanently stickify to v6.1 after the flag flip. - Existing in-flight cases are unaffected (sticky already set before the flip).

Mitigations — choose one before executing Phase 2.5.

Option Effort Trade-offs
A — Admin cache-bust endpoint (recommended) Low Call POST /admin/flags/invalidate-all immediately before + after the Flagsmith tenant flip. Drains the in-process cache on the active Railway replica. Requires a new admin route wired to feature_flags.invalidate_all().
B — Reduce TTL for these two flags Medium Add per-flag TTL override (e.g. _FLAG_TTLS = {"prompt_arch_v6_2_flexible": 10, "consolidated_state_backfill_complete": 10}). Reduces the stale window to 10s at the cost of slightly more Flagsmith calls during the rollout window.
C — SIGHUP restart Low-ops railway restart (or Railway UI restart) immediately before the flip flushes all in-process caches. Downtime: ~5s Railway cold-start. Acceptable for a planned rollout during low-traffic hours.

Rollback guarantee clarification. The spec §9 / R8 claim "30-second rollback" applies to the Flagsmith flag eval only (next un-cached call). It does NOT apply to cases that have already stickified to "v6.2_flexible". Those cases stay on v6.2 after a flag flip-back until explicitly migrated via force_prompt_arch. The "rollback" is therefore: new cases revert within 10–60s (depending on mitigation chosen), existing sticky-v6.2 cases require a migration script or manual force_prompt_arch writes.

Action items for Phase 2.5. - [ ] Wire Option A (/admin/flags/invalidate-all) in the Phase 2.5 PR OR confirm Option C (restart) is acceptable. - [ ] Update the Phase 2.5 runbook to include the cache-bust step as the first action immediately before the Flagsmith flip. - [ ] Add a Metabase query: SELECT prompt_arch, count(*) FROM cases WHERE created_at > <flip_time - 2min> AND created_at < <flip_time + 2min> to verify the stickification window empirically post-flip.


11. SOPContract cache invalidation

SOPContract.all_contracts() and _generic_contract() are @lru_cache(maxsize=1) (per Phase 1 PR-1.2). The 18 SOPs in config/prompts/sops/ are flagged data_source: fabricated_pending_naidu_review_* — Naidu's clinical review IS expected to produce YAML edits, and SD edits SOPs via the admin UI. Without an invalidation hook, edits won't surface until the Railway container restarts.

The hook. Phase 2 adds SOPContract.invalidate_cache() as a classmethod:

class SOPContract:
    @classmethod
    def invalidate_cache(cls) -> None:
        """Clear the lru_cache on SOP loaders. Called by admin save handlers
        after any SOP YAML write, and by file-watcher fallback in dev."""
        cls.all_contracts.cache_clear()
        cls._generic_contract.cache_clear()
        record_swallow("sop_contract_cache_invalidated",
                       labels={"trigger": "admin_save_or_file_change"})

Wiring sites. - app/routers/admin_sops.py: every save handler (POST /admin/sops/{sop_id}, PUT /admin/sops/{sop_id}/yaml, bulk POST /admin/sops/sync) calls SOPContract.invalidate_cache() immediately after the file write succeeds, BEFORE returning the HTTP response. The next conversation turn picks up the change without container restart. - Dev fallback (low priority — implement in PR-2.4 only if dev workflow demands it): a watchdog-based file watcher on config/prompts/sops/ that calls invalidate_cache() on any YAML change. Production doesn't need this because admin saves are the only mutation path.

Acceptance criterion (PR-2.1). Test test_sopcontract_cache_invalidated_after_admin_save: load contract via SOPContract.load(code="0001") → mutate the YAML on disk → assert cached contract still returns old data → call SOPContract.invalidate_cache() → assert SOPContract.load(code="0001") returns new data. End-to-end variant in PR-2.5: Naidu edits a SOP in the admin UI, the next Sindhu conversation turn reflects the edit.

Observability. sop_contract_cache_invalidated{trigger} swallow_metric with trigger ∈ {admin_save, file_watcher, manual}. Spike in this counter outside an active admin session signals a bug (or unexpected file mutation).


12. Conversation lifecycle responsibilities

v6.1 has three code-node responsibilities owned by triage_agent / chat.py: (a) PFS scorer (computes feasibility band), (b) intake_complete computation (workflow_state), (c) auto_invoke_matcher (decides matcher trigger). The single-LLM-call shape of v6.2 doesn't naturally own these; this section pins where they live under the v6.2 arch.

12.1 PFS scorer

Where it lives under v6.2. New standalone module app/services/pfs_v6_2.py. Reads consolidated_state, returns {band, score, drivers} matching the v6.1 PFS contract so the coordinator dashboard and matcher gates keep working.

Trigger. Chat router calls pfs_v6_2.compute(case) POST-turn — after the LLM response is published to WS AND after consolidated_state_writer.write_consolidated_state flushes the turn's extractor deltas. This preserves the v6.1 ordering invariant (PFS reflects the state AFTER the turn, not before).

Behavior parity. PFS inputs (completeness percentages across demographics, procedure, medical, financial, documents) are all readable from consolidated_state without a stages.yaml lookup. The v6.1 compute_pfs_node at triage_agent.py:1393 is the reference implementation; pfs_v6_2.compute re-derives the same band/score from the unified state. Write the result to case.pfs_band + case.pfs_score — same columns the v6.1 path writes, same coordinator dashboard reads.

Decision recorded. Standalone module + chat-router call site. NOT a LangGraph code node (v6.2 has no graph). NOT inlined in triage_v6_2 (the LLM-call boundary is the natural separator).

12.2 intake_complete computation

Source of truth under v6.2. intake_complete := (SOPContract.missing_for_matching(consolidated_state) == []).

Where the flag is set. Chat router computes the boolean post-turn (after consolidated_state_writer flush, before publishing the response envelope) and writes to case.workflow_state.intake_complete. Same column the v6.1 path writes; no schema change.

Why the SOP contract is the new source of truth. v6.1's intake_complete heuristic at chat.py:325-337 was a layer_state-based rollup of mandatory-field completeness. The SOP contract's missing_for_matching method (Phase 1, sop_contract.py) is the DIRECT, declarative version of that rollup — it knows exactly which fields are mandatory-for-matching per SOP. v6.2 retires the heuristic in favor of the explicit contract check.

Edge case: generic SOP. If the SOP resolves to _generic (no procedure identified yet), missing_for_matching always returns a non-empty list (procedure_name itself is missing). Therefore intake_complete stays False. That's correct — generic conversations shouldn't auto-match.

12.3 auto_invoke_matcher

Where it lives. Continues firing from the chat router. Same call site (auto_invoke_matcher.fire(case) post-turn). The hook signature is unchanged.

Read source change. Under v6.2 sticky arch, the matcher gate reads SOPContract.missing_for_matching(case.consolidated_state) == [] AND not in_consent_loop AND case.pfs_band in {"green", "yellow"}. Under v6.1 sticky arch, the existing layer_state.completion gate continues working unchanged. The dispatch hook in auto_invoke_matcher.py:62-96 branches on case.workflow_state.prompt_arch.

Why the same hook works. Both v6.1 and v6.2 produce the same outputs (case.workflow_state.intake_complete, case.pfs_band); only the upstream computation source differs. The matcher itself sees a consistent input contract regardless of arch.

12.4 case.status transitions

intake → matching → consultation transitions are orchestrated by the same chat-router post-turn block that fires auto_invoke_matcher. Under v6.2, the gate is intake_complete=True AND pfs_band ∈ {green, yellow} (same as v6.1); the only change is the upstream source of intake_complete per §12.2. No new transition rule. No silent stall mode — if the v6.2 path is serving and the SOP contract is satisfied, the case promotes the same way it does under v6.1.

12.5 Coordinator dashboard PFS continuity

Because §12.1 keeps writing case.pfs_band + case.pfs_score from the v6.2 path, the coordinator dashboard reads continue working without a schema or query change. No "PFS hidden for v6.2 cases" mode. Decision recorded: option (a) from the review — keep PFS computation alive as a v6.2 sidecar in pfs_v6_2.py.


13. Phase 2 NON-goals

Explicitly out of scope (deferred to later phases):

  • Don't drop stages.yaml from the v6.1 path. Phase 6 only — v6.1 stays alive as fallback.
  • Don't refactor extractors to a formal observer protocol. Phase 3 — Phase 2 just stops gating the response on them.
  • Don't build the vision document interpreter. Phase 4 — Phase 2 surfaces doc state in the prompt so Phase 4 has a place to plug in findings.
  • Don't add a post-LLM SOP scope checker. Phase 5 — Phase 2 keeps the existing voice_rules validator.
  • Don't migrate matcher / coordinator dashboard / chat-UI to consolidated_state reads. Phase 2.5 (matcher) and Phase 6 (UI + dashboard).
  • Don't change the JSON response envelope. Same {message, extracted_data, detected_comorbidities, phase_complete, suggested_next}conversation_v4_parser.parse_v4_response keeps working.

14. Open questions for SD

  1. (RESOLVED — see §3.2) Should the SOP checklist "Captured" section be cached? Answered by the M1 amendment: the checklist projection sits BELOW the cache boundary; only the static SOP definition sits above it. Sub-segmenting the checklist further (caching "Still needed" separately from "Captured") remains a NICE-TO-HAVE for Phase 2.5 once cost telemetry lands.

  2. (PARTIALLY RESOLVED — see PR-2.1 LLM-grader stub) Does the grader still run on v6.2? Yes — grader axes (warmth, clinical safety, friction, pacing, no-repeat) are stage-agnostic. PR-2.1 stamps prompt_arch=v6.2_flexible on every v6.2 turn so grader fixtures can filter by arch. Remaining sub-question: if grader fixtures hard-depend on stage_id in metadata, they need a v6.2-shaped fixture set. Action: file as Phase 2.5 follow-up issue, ride the PR-2.1 skip-marker for v6.2 fixtures until the grader fixture set is produced.

  3. Sindhu testing during rollout — pause her or use her as canary? Strong lean: canary. She's the test case the v6.2 redesign was triggered by; using her account on the v6.2 path lets us validate the fix in the same workflow that surfaced the bug. The risk is small because Flagsmith per-tenant rollback is <30s, AND the per-case stickiness (§5.1) prevents her in-flight conversations from switching arch underfoot.

  4. Rollback criterion threshold. Proposed in PR-2.5: flip flag off if any of {voice_rules_test_failure_rate > 1%, llm_meta.fallback_fired > 5%, shim_failure > 0, axis-1 grader drop > 0.5} crosses for 1h. SD's pick on the specific numbers — current proposal errs on tight (rollback-quick) to keep blast radius small during the first tenant flip.

  5. Sindhu polish items 22dd0c91 + 20f7fffe. Sindhu's earlier audits (cases 20f7fffe + 22dd0c91) surfaced 4 polish items per session memory. These are not enumerated in this spec. Q: Should they be addressed in PR-2.5 monitoring (passive observation) or actively in PR-2.3 prompt template (active fix)? Suggest SD review session memory for those 4 items + confirm scope. Until SD answers, defaulting to PR-2.5 passive observation with a follow-up issue per item if any regress.


15. Trace metadata stamps (observability)

To keep Langfuse / Metabase dashboards usable across v4.1 / v6.1 / v6.2 traffic, every v6.2 turn stamps the assistant message's extra_metadata (the llm_meta dict in triage_agent.py:1124-1144) with:

{
    "prompt_arch": "v6.2_flexible",         # matches case.workflow_state.prompt_arch (§5.1)
    "prompt_arch_version": "v6.2-flexible",
    "prompt_arch_source": "sticky" | "force_override" | "first_resolve",  # per review M2
    "sop_contract_id": contract.sop_id,
    "consolidated_state_revision": <int hash>,
    "stage_id": None,                       # explicit None — v6.2 has no stage
    "addendum_ids": [],                     # always empty
    "processed_document_ids": [...doc IDs from consolidated_state.documents...],
    "doc_status_counts": {                  # per review S2 — distribution by status enum
        "queued": 0, "processing": 1, "complete": 2,
        "failed_transient": 0, "failed_permanent": 0, "expired": 0,
    },
    "extractor_observations_count": len(turn_result.extractor_observations),
    "v6_2_fallback_reason": <str or None>,  # set on internal fall-through
    "previous_turn_promises_count": <int>,  # per review S4 — axis 6 telemetry
}

This preserves the queryability of every v6.1 dashboard: prompt_arch=v6.2_flexible becomes a filterable cohort; per-case consolidated_state_revision lets the grader correlate prompt content with response quality; the explicit stage_id=None distinguishes v6.2-was-served from "stage_id metadata bug."

15.1 Metabase panels + alerting

The following Metabase panels (auto-deployed via metabase-dashboards repo on the canary tenant dashboard) drive the PR-2.5 monitoring loop:

Panel Query Alert
prompt_arch_v6_2_serve_rate dispatch_count{result="served"} / sum(dispatch_count) per tenant per hour Telegram alert if served_rate < 0.95 for 1h on canary tenant (signals silent fall-through)
prompt_arch_v6_2_fallback_reasons Top-N values of llm_meta.v6_2_fallback_reason over the past 24h Manual review weekly during canary phase
prompt_arch_distribution Histogram of prompt_arch_sticky_resolved{source} by arch per tenant per day Alert if force_override rate > 1% (signals operators are manually overriding too often — symptom of a bad rollout)
sop_contract_cache_invalidated Time series of sop_contract_cache_invalidated{trigger} events Alert if spike outside active admin session (signals unexpected file mutation or bug)
prompt_arch_v6_2_token_ceiling_hit Count of token-ceiling truncations per hour Alert if > 5/hr (signals history-tail truncation is firing often → prompt budget too tight or runaway history)
doc_status_distribution Histogram of llm_meta.doc_status_counts rolled up across all v6.2 turns Manual review during canary; surface failed_permanent outliers

Wire Telegram alerts via app/services/alerting.py. Manual flip remains the primary rollback mechanism (per the §10 rollback criterion); these panels feed the human decision, not an auto-rollback loop (deliberately conservative for the first tenant flip — automation can come post-Sindhu).

15.2 PostHog events for cohort funnel comparison

Phase 2 introduces a fundamentally different conversation experience (single-LLM judgment vs. stage machine). Without funnel tracking, SD can't compare v6.1 vs v6.2 cohort progression through register → engage → match → consent. Required PostHog events:

Per-turn events. - v6_2_turn_started{arch, sop_id, tenant_id, case_id, turn_index} — fires per assistant turn under v6.2 sticky arch. Tags every downstream funnel event with prompt_arch. - v6_2_extractor_observation_captured{layer, confidence_bucket, tenant_id} — fires per extractor observation in TurnResult.extractor_observations. Confidence bucket: {low, mid, high} based on extractor confidence threshold. - v6_2_sop_checklist_status{sop_id, satisfied_count, missing_count, tenant_id, turn_index} — per turn. Funnel metric for "are conversations actually completing the SOP checklist faster under v6.2 vs v6.1?" - v6_2_doc_status_rendered{status, tenant_id, turn_index} — per turn that includes documents. Status is one of the §7 enum values.

Cohort comparison tag. Standard Curaway funnel events (register → engage → match → consent) already emit via apps/<portal>/src/services/analytics.ts helpers. Each event MUST be tagged with prompt_arch ∈ {v4, v6.1, v6.2_flexible} (sourced from case.workflow_state.prompt_arch). The dispatcher writes this stamp into a request-scoped context that PostHog helpers pick up.

PHI redaction (per §17 + CLAUDE.md ground rule 5). PostHog events MUST NOT include conversation text, extracted clinical fields, patient names, MRNs, diagnostic codes, or FHIR-shaped JSON. Only IDs (case_id, sop_id, tenant_id, clerk_user_id-as-opaque-token) and structural counts (turn_index, satisfied_count, missing_count, doc status enum value).

Phase 2.5 monitoring dashboard. Must include a conversion-rate comparison per arch: register → engage → match → consent funnel side-by-side for v6.1 cohort vs v6.2 cohort, filtered by tenant_id (Sindhu canary first). Surfaces whether v6.2 has a positive, neutral, or negative funnel impact before broader tenant flip.

Backend Event-model events. No new EventType enum values needed; existing intake_complete, case_promoted_to_matching events already capture the state changes. Add prompt_arch as a metadata field on those events (JSONB) so backend analytics can also cohort by arch.


16. Phase 4 (Document Interpretation) — DISPATCHED IN PARALLEL

SD's hard requirement for v6.2 is "make v6.1 work the way v4.1 did + interpret reports." Phase 2 alone scaffolds doc state in the prompt but does NOT close the report-interpretation requirement — Phase 4 must follow to populate the findings sub-dict that closes D1-D6 structurally.

Parallel dispatch. Phase 4 spec is being authored concurrently with this Phase 2 amendment. Phase 4 has independent design unknowns (vision model choice, signed-URL pipeline for vision-LLM calls, cost model for image-input Haiku/Sonnet at 5MB DICOM-derived JPGs, 18-SOP interpretation_schema population). These design unknowns do NOT share code with Phase 2 — Phase 4 plugs into the findings dict that Phase 2's <documents> block already surfaces.

Sequencing. 1. Phase 2 PR chain (PR-2.1 → PR-2.4.5) ships the prompt architecture + documents wiring + file split. 2. Phase 4 PR chain ships in parallel — vision extractor pipeline, interpretation_schema population for 18 SOPs, signed-URL setup, findings population in consolidated_state.documents[doc_id].findings. 3. BOTH must land before PR-2.5 (Sindhu rollout). PR-2.5 acceptance gates on Phase 4 completion because the D1-D6 structural fix requires Phase 4's findings to be present for at least one doc type per fixture conversation.

Phase 4 scoping kickoff. File the Phase 4 spec dispatch IMMEDIATELY in parallel with PR-2.1 implementation. Don't wait for Phase 2 to ship. Phase 4 spec must define: vision model choice + cost projection, signed-URL pipeline, interpretation_schema shape per SOP, findings reconciliation when vision extractor confidence is low, fallback when vision call fails.


17. Security / Compliance — GDPR + PHI redaction

17.1 GDPR Article 17 cascade

case.consolidated_state JSONB column carries demographics + medical conditions + financial signals + document references — all PHI per HIPAA + GDPR definitions. Must be included in the GDPR Article 17 erasure cascade.

Action. - Verify app/services/gdpr_service.py erasure handler enumerates case.consolidated_state in its column list. - If missing, file a follow-up issue BEFORE PR-2.5 dispatches; block PR-2.5 until coverage confirmed. - Add an integration test test_gdpr_erasure_cascades_to_consolidated_state that creates a case, populates consolidated_state via the writer, triggers erasure, asserts the column is NULL or {} post-erasure.

Why this matters. Pre-Phase-2 the JSONB column existed but only as a dual-write mirror — the deletion handler may have been written assuming layer_state + ehr_snapshot were the only PHI surfaces. Phase 2 makes consolidated_state the SOURCE OF TRUTH for the v6.2 hot-path; a missing erasure entry is a GDPR violation.

17.2 PHI redaction in PostHog + Langfuse + Telegram

  • PostHog events (per §15.2): IDs + counts ONLY; no conversation text, no extracted fields.
  • Langfuse traces: existing convention — case_id + patient_id are opaque UUIDs (not PHI), conversation text passes through Langfuse as the LLM call body (already covered by existing Langfuse PHI-handling agreement; no Phase 2 change).
  • Telegram alerts: tenant_id labels only — never patient_id, never patient names. Verified at §15.1 alert wiring.

17.3 Access logs

The JWTRedactingAccessFormatter in app/observability/access_log.py already scrubs ?token=<JWT> from access-log lines. No Phase 2 change.


18. Multilingual deferral

Scope. v6.2 ships English-only initially. Arabic + Hindi locales (currently served by v4.1 fallback) remain on v4.1 during Phase 2 rollout. Defer locale support to Phase 7 (polish round).

Why. - conversation_v6.yaml is currently English-only (base body lines 34-1028). - The grave_disclosure DSL's first-person tokens are hardcoded English ("I'm so sorry", "that must be …"). Must add locale-aware token sets in Phase 7. - The SOP checklist rendering (SOPContract.checklist_for_prompt) produces English labels ("Still needed", "Captured", "Documents still needed"). - The <documents> status-phrasing renderings in §3.1 are English-only.

Routing. Under v6.2 sticky arch, if case.detected_language != "en", the dispatcher MUST fall through to v4.1 (existing localized prompt path) with fallback_reason = "non_english_locale_v6_2_deferred" and increment prompt_arch_v6_2_locale_fallthrough swallow_metric. Sindhu canary is English-only, so no impact on Phase 2.5.

Phase 7 deliverables. Localize conversation_v6.yaml + locale-aware grave_disclosure tokens + localize SOP checklist labels + <documents> status phrasings + RTL dir="auto" in any rendered chat-UI surface (frontend, post-Phase 6).

Memory pin. After spec lands, pin feedback_v6_2_locale_deferral.md so future sessions don't re-litigate the deferral.


19. Audit findings disposition

Maps each Sindhu / work-queue audit finding to its closure path under the amended spec. Verified closure rate is now ~92% (D1-D6 closes via §6.1 + Phase 4; the remaining gaps are tracked + dispatched as parallel work or punted explicitly).

Finding Disposition under amended spec
D1-D6 (doc-visibility hallucinations, case 0a6b7e48) Addressed by §6.1 documents wiring (Phase 2) + Phase 4 (findings population). Both ship before PR-2.5. R13 mitigated.
#1308 (doc-visibility runtime guard, work-queue Phase 3.2) SUBSUMED by Phase 4 SOP-aware document interpretation. No standalone runtime guard needed once Phase 4 lands findings. Issue #1308 can be closed when Phase 4 PR chain lands.
#1303 / #1304 (axis 11 SOP metadata plumbing, work-queue) Amended in PR-2.3 acceptance: Transcript dataclass extension carries sop_contract_id; axis-11 grader unblocked on v6.2 fixtures. Closes #1303 directly; #1304 (axis parser tests) becomes a follow-up under the same epic.
#1310 (P2 SOP synonym sweep — rotator cuff, fracture_fixation, cervical) Phase 1 inheritance — not blocking Phase 2. PR-2.5 monitoring includes sop_sticky_rate Metabase panel to spot synonym misses early. Open standalone work-queue item.
22dd0c91 (Sindhu polish items, 4 items) OPEN QUESTION for SD (§14 Q5). Pending SD validation of scope: PR-2.5 monitoring (passive) vs PR-2.3 prompt template (active).
20f7fffe (Sindhu original test) OPEN QUESTION for SD (§14 Q5). Pending SD review of session memory.
9e9379c6 (Abdul Moeed, A3/A4 grave-disclosure + age-first) Closed at Phase 1 level. Preserved in <base_voice_and_safety> (§3.1); guarded by tests/prompts/test_v6_prompt_hygiene_1194.py.
0bb36212 (ACL repair audit) Closed by PR-1309 + PR-1310 (Phase 1 SOP synonyms).
A1 / A2 (voice rule strings "Your body is telling you …", vague "needs to change") Preserved base prompt voice rules + response_policy.py. Adding the exact A1/A2 strings to voice_rules.yaml forbidden_phrases is a Phase 2 nice-to-have; reviewer-flagged as small follow-up.
A3 (cost guidance withheld) Closed structurally — v6.2 single-call removes the addendum-classifier gate that suppressed cost talk. PR-2.5 fixture verifies.
B1 / C4 (double-question stacking, 4-item ask stack) Closed at §3.1 prompt foot "pick ONE, don't enumerate." Replay-harness fixtures catch regression.
B2 (generic empathy templating) Preserved base emotional-word echo rule (CLAUDE.md ground rule 9). Structurally untouched.
B3 / C3 (duplicate decision prompt, unnecessary clarification loop on "months") §3.1 <previous_turn_promises> + "don't re-ask anything already in Captured" foot. Tested by PR-2.5 fixture.
C1 (SOP not loading on procedure name only) Closed at Phase 1 level (SOP synonyms + 3-tier resolver).
C2 (mso_second_opinion addendum fires at 0.70 confidence) Closed structurally — addendums DROPPED in §3.3. C2 failure mode cannot occur.
C5 (explicit "find providers" ignored) Implicit fix — v6.2 single-call sees the message + checklist + decides. PR-2.5 fixture pins.
C6 / C7 / C8 (budget extraction loop, INR/lakh) Closed at §3.1 multicurrency budget_display_text helper + single-LLM-call removes the extractor retry loop. R6 mitigated.
E1 / E2 (OCR queue stall variants) Phase 4 — not in Phase 2 scope. Surface ETA / status enum prepares the runway.
case 0a6b7e48 — 22 issues per turn baseline Replay-harness baseline; PR-2.5 acceptance "< 3 issues per turn averaged across 5 conversations." Confirmed closure rate via the matrix above.

20. Phase 3 / 5 / 6 scope clarifications

Phase 3 → renamed PR-2.6 (extractor reconciler)

Phase 3 was originally scoped as a separate phase ("formalize extractor observer protocol + reconciliation"). Re-evaluated: this is small enough to be a Phase 2 sub-step, not a separate phase. Rename to PR-2.6 — extractor reconciler.

Scope. - Wire _run_extractor_observers (from PR-2.4.5 split) to emit deltas to a post-merge reconciler queue. - Reconciler validates extractor delta against the LLM's stated extracted_data in the JSON envelope. - Conflict rule: "LLM wins on contradictions, extractor wins on additions" (per §8). - Persist reconciled deltas via existing consolidated_state_writer.

Sequencing. PR-2.6 ships AFTER PR-2.5 (so Sindhu canary validates the prompt architecture first; extractor formalization is a downstream polish). PR-2.6 ≈ 1 day.

Phase 5 — post-LLM SOP scope + completeness gate

Runtime path. Chat router calls SOPContract.scope_check(response_text, sop) after LLM response is published to WS but before intake_complete is computed. The scope check verifies the LLM didn't promote to matching while SOPContract.missing_for_matching != []; if violated, the router re-prompts with a "still need to complete X before matching" hint.

Minimal stub spec. Phase 5 spec must define: - SOPContract.scope_check(response_text: str, sop: SOPContract) -> ScopeCheckResult API. - Re-prompt loop: max 1 retry per turn to avoid infinite loops; on second failure, allow the LLM response through with a logged metric sop_scope_violation_uncaught. - Wiring point in chat router post-turn block (before auto_invoke_matcher.fire). - Test fixtures: 3 cases where LLM tries to promote prematurely; assert scope_check catches and re-prompts.

Gating threshold. Phase 5 must ship BEFORE tenant flip beyond Sindhu's tenant. Until Phase 5 lands, v6.2 has only voice_rules.yaml + response_policy.py as guardrails — enough for canary, NOT enough for Naidu's full clinical review or broader rollout.

Phase 6 — cutover criteria (MEASURABLE)

Phase 6 (drop stages.yaml, swap UI/dashboard reads to consolidated_state) requires a MEASURABLE definition of "v6.2 stable." Hand-waved criteria are insufficient. Required numbers:

  1. 5 consecutive Sindhu canary cases with ≤3 issues per turn (down from 22 on case 0a6b7e48 baseline). "Issues" measured by axis grader + Sindhu manual review.
  2. consolidated_state_shim_failure metric at zero for 24h sustained across all tenants (not just Sindhu's). Confirms the dual-write shim is healthy.
  3. v6.2 vs v6.1 axis-grader scores within 5% on shared fixtures — no axis (warmth, clinical safety, friction, pacing, no-repeat) regresses by more than 5% on the shared 5-turn fixture corpus.
  4. No new clinical_safety axis violations for 7 days on v6.2 cases. Zero axis-1 < 0.7 events across all v6.2 traffic.

If ALL four hold for 7 consecutive days, Phase 6 may dispatch (drop stages.yaml, swap UI reads). If any breach, the 7-day clock resets.


21. Summary

Phase 2 is a prompt-architecture flattening behind three safety nets (Flagsmith flag + backfill-complete gate + v6.1 fallback on any internal exception + per-case sticky arch preventing mid-conversation switches). The hot path becomes:

                       CACHED PREFIX                              MUTABLE TAIL
base_voice_and_safety ──┐                  ┌── sop_contract_checklist (Captured/Still-needed)
sop_static_definition ──┤── cache boundary ┤── patient_context + documents (consolidated_state)
                        │                  ├── previous_turn_promises (regex-scan last assistant msg)
                        │                  └── history (last 30 turns) + latest_user_message
                 single prompt ──> Haiku ──> response ──> post-turn sidecars:
                                                          • PFS scorer (pfs_v6_2.compute)
                                                          • intake_complete (SOPContract.missing_for_matching)
                                                          • auto_invoke_matcher hook

No stage machine, no addendum classifier, no multi-segment cache choreography. The model gets the contract status + voice rules + patient picture + document state + previous-turn promises + history, and exercises v4.1-style end-to-end judgment to ask the next thing. Lifecycle continuity (PFS, intake_complete, matcher promotion) is preserved via post-turn sidecars that read from consolidated_state instead of layer_state.

The dual-write shim from Phase 1 makes the read swap safe. The SOPContract API from Phase 1 makes the checklist trivially renderable. SOPContract.invalidate_cache() keeps Naidu's YAML edits live without container restart. Per-case stickiness (workflow_state.prompt_arch) prevents mid-conversation arch swaps. Phase 2 is the moment those four prep items pay off — and the moment SD sees whether the v6.2 hypothesis ("LLM judgment + SOP checklist beats stage machine + extractor gating") holds in production.