v6.2 Flexible: Phase 4 Document Interpretation Spec

Status: DRAFT — proposes the vision-extractor pipeline that fills consolidated_state.documents.{doc_id}.findings against the SOP-declared interpretation_schema. Filed in parallel with Phase 2 PR-2.1 per the Phase 2 DoD review (docs/specs/v6.2-flexible-phase2-dod-and-audit-coverage-review.md §Part 4, recommendation #11).

Author: Phase 4 dispatch, 2026-06-03.

SD's stated hard requirement: "SOP framework needs to interpret the reports (images/documents/xrays) — without which some of the validation checks will not work."

Companion specs:
- docs/specs/v6.2-flexible-phase1-sop-contract-spec.md (interpretation_schema field declared on SOP required_documents[]; TKR + ACL examples §2)
- docs/specs/v6.2-flexible-phase2-prompt-assembly-spec.md §6 documents block, §7 findings shape, §11 cache invalidation, status enum
- docs/specs/v6.2-flexible-phase2-dod-and-audit-coverage-review.md Gap 1 (D1-D6 documents wiring) + Phase 4 scoping risks

1. Goal restated

Phase 1 declared interpretation_schema on required_documents[] (config/prompts/sops/tkr.yaml:907-918 knee_xray schema → joint_space_mm: float, osteophyte_grade: int, malalignment_deg: float; config/prompts/sops/acl_repair.yaml:185-192 knee_mri schema → acl_tear_grade: int, associated_meniscus_tear: bool, associated_collateral_injury: bool). Phase 1's SOPContract.document_findings_complete() method (app/services/sop_contract.py:571-599) already validates findings against the schema and returns False when any mandatory document type lacks any schema key.

Phase 2 surfaces document presence + status + findings in the prompt (docs/specs/v6.2-flexible-phase2-prompt-assembly-spec.md §3.1 <documents> block, §7 enum). The prompt already renders Findings: {doc.findings | join(", ")} when status == "complete". The shape is there; the values aren't filled.

Phase 4 fills them. Concretely:

A document interpreter service that, when a doc finishes OCR (or as a TTL fallback for queued OCR), runs a vision-LLM call against the SOP's interpretation_schema and emits structured findings.
Findings storage at consolidated_state.documents.{doc_id}.findings — the same dict shape Phase 2's prompt template reads.
The Phase 1 SOP completeness gate — SOPContract.document_findings_complete(consolidated_state) — starts returning True only when every mandatory doc has its findings populated; auto_invoke_matcher (Phase 2 §12.3) consumes this gate.
Prompt visibility preserved per Phase 2 — the LLM sees what was extracted and can ask the patient to confirm or supplement.
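The completeness gate in the list above can be pictured with a minimal sketch. This is not the code at app/services/sop_contract.py; the dict keys `type`, `mandatory`, `doc_type`, and `findings` are assumptions made for illustration, and a key that is present with a null value is assumed to count as populated (matching the "complete" status description in §2.1).

```python
from typing import Any


def findings_complete(
    required_docs: list[dict[str, Any]],
    documents: dict[str, dict[str, Any]],
) -> bool:
    """True only when every mandatory doc type has findings carrying every schema key."""
    for entry in required_docs:
        if not entry.get("mandatory", False):
            continue  # optional docs never block the gate
        schema_keys = {name for name, _ in entry.get("interpretation_schema", ())}
        # Hypothetical shape: each stored document records its doc_type and findings.
        uploaded = [d for d in documents.values() if d.get("doc_type") == entry["type"]]
        if not any(schema_keys <= set(d.get("findings") or {}) for d in uploaded):
            return False
    return True
```

With an empty `findings: {}` the gate stays closed, which is the pre-Phase-4 state this spec sets out to change.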

The Sindhu D1-D6 audit class (audit 0a6b7e48, fabricated "the extraction failed" / "uploading now" status hallucinations) is structurally addressed by Phase 2's <documents> block — Phase 4 turns the empty findings: {} into actual content so the LLM stops asking patients to verbally describe an X-ray that's already been read.

2. Architecture overview

2.1 New module: `app/services/document_interpreter.py`

# app/services/document_interpreter.py
"""v6.2 Phase 4 — vision-based document interpretation against SOP schemas."""

from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Literal

from app.services.sop_contract import SOPContract, RequiredDocumentEntry


InterpretationStatus = Literal[
    "skipped_no_schema",      # SOP doesn't declare a schema for this doc type
    "skipped_oversized",      # file too large to compress to safe vision-LLM size
    "skipped_not_in_sop",     # uploaded doc type doesn't match any SOP required_documents
    "complete",               # findings emitted, all schema keys present (some may be null)
    "failed_transient",       # vision call errored, retry pending
    "failed_permanent",       # exhausted retries (default 3×)
]


@dataclass(frozen=True)
class InterpretationResult:
    doc_id: str
    doc_type: str
    status: InterpretationStatus
    findings: dict[str, Any]           # {schema_field: value or None}
    rationale: dict[str, str]          # {schema_field: one-sentence rationale}
    notes: list[str]                   # informational; off-schema observations
    llm_meta: dict[str, Any]           # tier, provider, latency_ms, cost_usd
    failure_reason: str | None


async def interpret_document(
    *,
    document: "DocumentReference",     # app.models.document.DocumentReference
    sop_contract: SOPContract,
    tenant_id: str,
    case_id: str,
    patient_id: str,
    db: "AsyncSession",
    langfuse_handler: object | None = None,
) -> InterpretationResult:
    """
    Run the vision interpreter against a single document.

    Steps:
      1. Resolve the SOP's required_documents entry whose `type` matches
         document.document_category (or skip if no match → status=skipped_not_in_sop).
      2. If the matched entry has no `interpretation_schema`, status=skipped_no_schema.
      3. Fetch + preprocess the binary from R2 (vision_preprocessor.prepare_for_vision).
         If preprocessing fails / file too large after compression, status=skipped_oversized.
      4. Synthesize the vision prompt from interpretation_schema (synthesize_vision_prompt).
      5. Choose tier: presence-detection only (Haiku) vs structured numeric findings
         (Sonnet). See §2.2 tier-selection rules.
      6. Run vision call via llm_gateway.invoke() with the synthesized prompt.
      7. Parse JSON response strict=False; enforce schema-bounded output
         (drop fields not in interpretation_schema → notes[]). See §5.
      8. Return InterpretationResult; caller commits findings via
         consolidated_state_writer (see §6.1).

    NEVER raises. Internal errors → status=failed_transient + failure_reason.
    """
    ...


def synthesize_vision_prompt(
    *,
    doc_type: str,
    interpretation_schema: tuple[tuple[str, str], ...],
) -> str:
    """Translate SOP interpretation_schema → vision-LLM prompt.

    See §4 for the worked example and template. Pure function — unit-testable
    against fixture schemas without any LLM call.
    """
    ...

This module is intentionally isolated: it imports nothing from app.agents.* (no cross-domain coupling). It talks to r2_client for binaries, llm_gateway for the vision call, and sop_contract for the schema; the caller hands the result to consolidated_state_writer for the write-back. That keeps it extraction-ready: it could become an HTTP handler unchanged.

2.2 Vision-LLM tier selection

Per .claude/rules/backend-agents.md Model Tiers and CLAUDE.md cost constraints ($1K total budget, ~$9/mo current LLM spend at 100 cases):

Schema field type	Use case	Tier	Rationale
`bool` (e.g., `associated_meniscus_tear`, `bone_bruise_present`)	Presence / absence detection	Haiku 4.5	Cheap (~$0.001/image); model only needs to say yes/no with a rationale
`int` enum (e.g., `osteophyte_grade: 0-3`, `acl_tear_grade: 0-3`)	Categorical grading on bounded scale	Sonnet 4.6	Numeric grading benefits from stronger visual reasoning; cost ~$0.005/image acceptable
`float` measurement (e.g., `joint_space_mm`, `malalignment_deg`, `hba1c`)	Quantitative reading	Sonnet 4.6	Measurement extraction is the highest-stakes field; downgrade-to-null on low confidence
OCR-extractable text fields (e.g., `recency_days`)	Date / numeric from report header	Haiku 4.5	Already legible text; vision call equivalent to OCR re-read

Schema-level decision rule. _choose_tier(interpretation_schema) in document_interpreter.py:
- If ANY schema field is float or int (excluding recency_days-style date fields), use Sonnet.
- Else use Haiku.

Override available via config/model_registry.yaml per-SOP key (document_interpretation.knee_xray.tier: sonnet|haiku) so admin can dial cost/quality per SOP. Tier override resolved per call, never cached.
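A minimal sketch of the decision rule and the override, under stated assumptions: the function is named `choose_tier` here to mark it as illustrative, date-style fields are recognised by name (the spec only names recency_days), and the override arrives as an already-resolved argument rather than being read from model_registry.yaml.

```python
from typing import Literal, Optional

Tier = Literal["haiku", "sonnet"]
Schema = tuple[tuple[str, str], ...]

# Assumption: date-arithmetic fields are matched by name.
_DATE_STYLE_FIELDS = frozenset({"recency_days"})


def choose_tier(schema: Schema, override: Optional[Tier] = None) -> Tier:
    """Pick the vision tier for one document's interpretation_schema."""
    if override is not None:
        return override  # per-SOP override wins; resolved on every call, never cached
    for field, type_str in schema:
        if field in _DATE_STYLE_FIELDS:
            continue  # legible header text, not a visual measurement
        if type_str in ("float", "int"):
            return "sonnet"
    return "haiku"
```

Passing the override in as a plain argument keeps the rule a pure function, so it can be unit-tested without touching the registry.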

Cost estimate.
- TKR case: 2-3 docs (knee_xray + optional bloodwork_recent) × ~$0.005 (Sonnet) = $0.01-0.015/case.
- ACL case: 1 doc (knee_mri) × ~$0.005 = $0.005/case.
- Bloodwork (Haiku): ~$0.001/case.
- At 100 cases/mo: ~$1-2/mo total vision spend. Small (roughly 11-22%) next to the $9/mo conversation LLM spend.
- At 1,000 cases/mo (post-canary scale): ~$10-15/mo. Still small (roughly 11-17%) next to the $90/mo LLM spend.
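The arithmetic behind the estimate, as a small worked sketch. The per-image prices are the approximate figures from the tier table; the helper name is illustrative only.

```python
SONNET_PER_IMAGE_USD = 0.005  # approximate, from the tier table
HAIKU_PER_IMAGE_USD = 0.001   # approximate, from the tier table


def monthly_vision_cost(cases_per_month: int, docs_per_case: int, per_image_usd: float) -> float:
    """Upper-bound monthly spend if every case sends docs_per_case images at one tier."""
    return cases_per_month * docs_per_case * per_image_usd


# TKR at the high end (3 Sonnet docs per case):
tkr_100 = monthly_vision_cost(100, 3, SONNET_PER_IMAGE_USD)      # about $1.50/mo
tkr_1000 = monthly_vision_cost(1_000, 3, SONNET_PER_IMAGE_USD)   # about $15/mo
```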

GPT-4o was considered as an alternative but rejected: cost parity with Sonnet and weaker clinical structured-output adherence per Curaway's existing extractor evaluation (docs/reference/llm-evaluation.md). Commercial radiology APIs (e.g., Aidoc, RapidAI) are deferred as out of MVP scope; they would need clinical validation + procurement.

3. Vision preprocessor and signed-URL pipeline

3.1 Current state

app/integrations/claude_pdf_extractor.py is the current vision touchpoint. extract_text_from_pdf (:234-303) base64-encodes the entire file and sends it inline to Claude — fine for small scanned PDFs (<5MB). extract_text_from_image (:306-383) does the same for JPEG/PNG. There is no resize/compress step. analyze_clinical_image (:407-484) is the existing observational-only path; it's NOT structured-extraction, so it doesn't fit the SOP schema use case.

R2 access lives in app.integrations.r2_client (used at app/services/document_service.py:19, 43). Documents are stored under {tenant_id}/{patient_id}/{file_id}.{ext} (app/models/document.py:37-42).

3.2 The problem

Anthropic vision API accepts base64 image bodies up to ~5MB after encoding (~3.7MB raw). DICOM-derived JPGs from MRI scans routinely run 8-15MB. Sending these inline either fails outright or hits cost cliffs (image_tokens scale with pixel count).
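The relationship between the encoded cap and the raw size follows from base64's 4-bytes-out-per-3-bytes-in expansion. A quick sketch, assuming decimal megabytes (5MB = 5,000,000 bytes) and padded base64:

```python
def base64_encoded_len(raw_bytes: int) -> int:
    """Length of the padded base64 encoding of raw_bytes bytes."""
    return 4 * ((raw_bytes + 2) // 3)


def max_raw_bytes(encoded_cap: int) -> int:
    """Largest raw payload whose padded base64 form fits within encoded_cap."""
    return (encoded_cap // 4) * 3


# A 5,000,000-byte encoded cap leaves room for 3,750,000 raw bytes,
# which is where the "~3.7MB raw" figure comes from.
raw_budget = max_raw_bytes(5_000_000)
```

An 8MB MRI-derived JPG would encode to well over 10MB, so it cannot be sent inline without a resize step.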

3.3 Design: `app/services/vision_preprocessor.py`

"""v6.2 Phase 4 — vision input preprocessor (resize + recompress)."""

from __future__ import annotations
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class PreparedImage:
    media_type: str               # "image/jpeg" | "image/png" | "application/pdf"
    base64_data: str
    pixels_w: int
    pixels_h: int
    bytes_after: int
    compression_applied: bool
    notes: str                    # e.g., "downsized from 4096w to 1200w; jpeg q=80"


async def prepare_for_vision(
    *,
    document: "DocumentReference",
    target_max_width_px: int = 1200,
    target_max_bytes: int = 3_500_000,    # safe under Anthropic 5MB cap
    jpeg_quality: int = 80,
) -> PreparedImage | Literal["too_large"]:
    """Fetch binary from R2, resize/recompress if needed, return base64-ready payload.

    Steps:
      1. Fetch raw bytes from R2 via r2_client (re-uses storage_key lookup).
      2. If document.mime_type == 'application/pdf' AND size_bytes <= 3.5MB,
         pass through unchanged (Anthropic accepts inline PDFs).
      3. If size_bytes > 3.5MB OR the file is an image wider than
         target_max_width_px, resize via Pillow (PIL.Image.thumbnail) to
         target_max_width_px while preserving aspect ratio. Re-encode as JPEG
         at jpeg_quality.
      4. If post-compression bytes still > target_max_bytes, drop quality to 60
         and retry. If still > target_max_bytes, return "too_large" sentinel.
      5. Return PreparedImage with base64-encoded data + diagnostic metadata.

    Original file untouched in R2 — preprocessing produces an ephemeral payload
    only for the vision call. Never writes back to R2.
    """
    ...

Pillow is added as a Python dep (requirements.txt); it's a 5MB install with no transitive surprises and is already implicitly needed for any future image work. The PreparedImage notes field surfaces compression diagnostics into InterpretationResult.llm_meta.preprocessing_notes for observability.
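Steps 3-4 of prepare_for_vision amount to a two-rung quality ladder. The sketch below isolates that control flow; `encode` is a hypothetical callback standing in for "resize, then re-encode as JPEG at this quality" so the ladder can be exercised without Pillow or R2.

```python
from typing import Callable, Literal, Union


def compress_with_fallback(
    encode: Callable[[int], bytes],
    max_bytes: int = 3_500_000,
    qualities: tuple[int, ...] = (80, 60),
) -> Union[tuple[bytes, int], Literal["too_large"]]:
    """Try each JPEG quality in order; return (payload, quality) for the first that fits."""
    for quality in qualities:
        payload = encode(quality)
        if len(payload) <= max_bytes:
            return payload, quality
    return "too_large"  # caller maps this to status=skipped_oversized
```

Returning the quality alongside the payload gives the caller what it needs to fill the PreparedImage notes field.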

Failure mode: "too_large" → interpret_document returns status=skipped_oversized. The downstream prompt template renders (file expired before processing — ask the patient to re-upload smaller version) per Phase 2 §7 expired enum value (re-uses the same surface; no new enum value needed).

4. Interpretation schema → vision prompt synthesis

4.1 The synthesizer

synthesize_vision_prompt(doc_type, interpretation_schema) is the heart of Phase 4. It takes a tuple-of-(field, type_str) pairs (the shape RequiredDocumentEntry.interpretation_schema already exposes per app/services/sop_contract.py:78) and emits a vision-LLM prompt asking for those exact fields.

4.2 Worked example: TKR knee_xray

Input:

doc_type = "knee_xray"
interpretation_schema = (
    ("joint_space_mm", "float"),
    ("osteophyte_grade", "int"),
    ("malalignment_deg", "float"),
)

Synthesized prompt (cacheable across all knee_xray calls — cache_control: ephemeral marker placed at end of the system block):

You are a radiology-assistant vision extractor for a cross-border medical
travel coordination platform. The image is a knee X-ray uploaded by a patient.

Your job: extract these structured findings, EXACTLY these fields, NOTHING
else:

  - joint_space_mm (float): the medial joint space width in millimeters.
    Decimal value. Null if not measurable from the image.
  - osteophyte_grade (int): bone spur severity on a 0-3 scale.
    0=none, 1=mild, 2=moderate, 3=severe. Null if not determinable.
  - malalignment_deg (float): varus/valgus angle in degrees.
    Positive = varus, negative = valgus. Null if not measurable.

For EACH field, also emit a one-sentence visual rationale describing what in
the image you used to arrive at the value (or what was missing if null).

OUTPUT RULES — non-negotiable:
- Return strict JSON only. No prose, no markdown fences.
- Shape: {"findings": {field: value or null}, "rationale": {field: "..."},
          "notes": [optional list of other observable features NOT in the
                    schema above — informational only, NOT used for clinical
                    gating]}
- If the image is NOT a knee X-ray (wrong body region, non-clinical photo,
  unreadable), return {"findings": {field: null for each},
                       "rationale": {field: "image is not a knee X-ray"},
                       "notes": ["wrong image type: <observed>"]}.
- DO NOT diagnose. DO NOT suggest treatment. DO NOT mention surgery decisions.
- DO NOT add fields beyond the schema. Anything observed but off-schema goes
  in "notes".

The synthesizer produces this exact text deterministically — no LLM call needed to compose it. Unit-testable via fixture: test_synthesize_vision_prompt_tkr_knee_xray asserts the output matches the expected golden string.

4.3 Edge cases

Schema with no fields: interpretation_schema = () → interpret_document returns status=skipped_no_schema without calling vision. Some SOP YAMLs had this state at Phase 1 ship time; Phase 4 PR-4.0 (see §8) populates schemas for all 18 SOPs.
Schema with only bool fields: synthesizer omits the float/int grading scale boilerplate; just emits the field name + "true|false|null" instruction.
Bloodwork numeric fields: recency_days is a date-arithmetic field, not a vision-measurement; the synthesizer emits "extract the report date and compute the number of days between that date and today (use the report's stated date in the upper-right header if visible)", so the model gets explicit field-specific guidance.

The synthesizer logic per field-type:

def _field_instruction(field: str, type_str: str) -> str:
    if type_str == "bool":
        return f"  - {field} (bool): true / false / null"
    if type_str == "int":
        # Special-case schemas with known grading scales; otherwise generic.
        if field in _KNOWN_GRADES:    # {"osteophyte_grade", "acl_tear_grade", ...}
            return f"  - {field} (int): {_KNOWN_GRADES[field]}"
        return f"  - {field} (int): integer value or null"
    if type_str == "float":
        if field.endswith("_mm"):
            return f"  - {field} (float): millimeters, decimal value or null"
        if field.endswith("_deg"):
            return f"  - {field} (float): degrees, decimal value or null"
        return f"  - {field} (float): decimal value or null"
    return f"  - {field} ({type_str}): value or null"

_KNOWN_GRADES is a small dict in document_interpreter.py that captures domain-specific scales for the 18 SOPs. It's hardcoded (not config) because each scale's exact wording is clinical and reviewed with Naidu — config-izing it invites accidental editing.
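How the per-field instructions could be assembled into a deterministic prompt, as a minimal sketch. The `_KNOWN_GRADES` entry below is a one-item illustration with wording copied from the §4.2 prompt, and the surrounding template lines are heavily abbreviated; this does not reproduce the golden string.

```python
Schema = tuple[tuple[str, str], ...]

# Illustrative subset only; the real dict covers the reviewed scales.
_KNOWN_GRADES = {
    "osteophyte_grade": (
        "bone spur severity on a 0-3 scale. "
        "0=none, 1=mild, 2=moderate, 3=severe. Null if not determinable."
    ),
}
_UNIT_SUFFIXES = {"_mm": "millimeters", "_deg": "degrees"}


def field_instruction(field: str, type_str: str) -> str:
    if type_str == "bool":
        return f"  - {field} (bool): true / false / null"
    if type_str == "int":
        return f"  - {field} (int): {_KNOWN_GRADES.get(field, 'integer value or null')}"
    if type_str == "float":
        unit = next((u for s, u in _UNIT_SUFFIXES.items() if field.endswith(s)), None)
        prefix = f"{unit}, " if unit else ""
        return f"  - {field} (float): {prefix}decimal value or null"
    return f"  - {field} ({type_str}): value or null"


def synthesize_vision_prompt(*, doc_type: str, interpretation_schema: Schema) -> str:
    """Pure function: same schema in, same prompt text out."""
    lines = [
        f"The image is a document of type {doc_type} uploaded by a patient.",
        "Extract EXACTLY these fields, NOTHING else:",
        "",
        *[field_instruction(f, t) for f, t in interpretation_schema],
        "",
        "Return strict JSON only. No prose, no markdown fences.",
    ]
    return "\n".join(lines)
```

Because nothing here depends on time, randomness, or an LLM call, a golden-string test can pin the output exactly.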

5. SOP scope validation (anti-hallucination)

Vision LLMs may emit findings outside the SOP's schema ("I see a tumor"), wrong-image findings ("I see a chest X-ray, here are lung findings"), or over-grade beyond the requested scale.

5.1 Schema-bounded write-back

document_interpreter._parse_vision_response enforces:

Drop any key in findings not in interpretation_schema.
Coerce types: int → int, float → float, bool → bool. Failed coercion → null + note "type coercion failed for field X (got Y)".
Drop any rationale key whose corresponding finding was dropped.
Off-schema observations flow into InterpretationResult.notes (informational only — Phase 2 prompt does NOT render notes; they're observability-only).
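The four rules above can be sketched as a single pass over the schema. This is an illustrative stand-in for _parse_vision_response, not its actual body; the return shape (a three-tuple) and the choice to accept numeric strings such as "2.1" are assumptions.

```python
import json
from typing import Any

Schema = tuple[tuple[str, str], ...]


def _coerce(value: Any, type_str: str) -> Any:
    """Coerce value to type_str, raising ValueError/TypeError when it cannot be done."""
    if value is None:
        return None
    if type_str == "bool":
        if isinstance(value, bool):
            return value
        raise ValueError("not a bool")
    if isinstance(value, bool):  # bool subclasses int; reject it for numeric fields
        raise ValueError("bool given for numeric field")
    if type_str == "int":
        as_float = float(value)
        if as_float != int(as_float):
            raise ValueError("non-integral value")
        return int(as_float)
    if type_str == "float":
        return float(value)
    return value


def parse_vision_response(raw: str, schema: Schema) -> tuple[dict, dict, list[str]]:
    payload = json.loads(raw, strict=False)
    raw_findings = payload.get("findings", {})
    raw_rationale = payload.get("rationale", {})
    notes = list(payload.get("notes", []))
    findings: dict[str, Any] = {}
    rationale: dict[str, str] = {}
    for field, type_str in schema:  # only schema keys are ever written
        value = raw_findings.get(field)
        try:
            findings[field] = _coerce(value, type_str)
        except (TypeError, ValueError, OverflowError):
            findings[field] = None
            notes.append(f"type coercion failed for field {field} (got {value!r})")
        if field in raw_rationale:
            rationale[field] = str(raw_rationale[field])
    for extra in sorted(set(raw_findings) - {f for f, _ in schema}):
        notes.append(f"off-schema field dropped: {extra}")
    return findings, rationale, notes
```

Iterating over the schema rather than over the response means an off-schema key can never reach findings, even if a later edit forgets an explicit drop step.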

5.2 Wrong-image-domain detection

When the vision response says "image is not a knee X-ray" in rationale or notes:
- ALL findings become null with rationale "image type mismatch."
- InterpretationResult.status = "complete" (the call succeeded; the findings are honestly null).
- A swallow_metric document_interpretation_wrong_image_type{doc_type, observed_type} increments so the team can spot patients uploading the wrong file.
- Phase 2 prompt renders Findings: joint_space_mm=null (image type mismatch); …, so the model can then ask the patient to re-upload the correct file.

5.3 Implausible-value detection

For fields with bounded ranges (osteophyte_grade: 0-3, acl_tear_grade: 0-3), values outside the range are coerced to null with rationale "value out of declared range." Swallow_metric document_interpretation_implausible_value increments — informational, not blocking.

For continuous fields (joint_space_mm, malalignment_deg), Phase 4 does NOT enforce bounds at the parser layer because anatomical reference ranges are clinical knowledge that doesn't belong in the parser. Out-of-range values pass through with the rationale, and the human spot-check on the first 50 production cases (§10, R2) catches outliers.
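A sketch of the split between bounded grades and continuous fields. `_GRADE_BOUNDS` is a hypothetical table holding only the two 0-3 scales named in this section; continuous fields are deliberately absent from it, so they pass through untouched.

```python
from typing import Any

# Hypothetical bounds table; only bounded grading scales belong here.
_GRADE_BOUNDS = {"osteophyte_grade": (0, 3), "acl_tear_grade": (0, 3)}


def enforce_declared_range(
    findings: dict[str, Any], rationale: dict[str, str]
) -> tuple[dict[str, Any], dict[str, str], list[str]]:
    """Null out bounded-grade values outside their range; return the flagged field names."""
    clean_findings, clean_rationale = dict(findings), dict(rationale)
    flagged: list[str] = []
    for field, (low, high) in _GRADE_BOUNDS.items():
        value = clean_findings.get(field)
        if value is not None and not (low <= value <= high):
            clean_findings[field] = None
            clean_rationale[field] = "value out of declared range."
            flagged.append(field)  # caller increments the implausible-value metric
    return clean_findings, clean_rationale, flagged
```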

5.4 PHI in vision call

Per CLAUDE.md ground rule "No PHI in logs, SSE, or external channels" + .claude/rules/coding-principles.md §5 PII rules:
- The vision call sends the IMAGE BINARY to Anthropic. Anthropic is a HIPAA-eligible processor under a BAA (verify before Phase 4.5 rollout; open question §11 Q4).
- The synthesized prompt contains NO patient identifiers (no name, no MRN, no DOB). The synthesizer is pure schema-only.
- Vision response storage: findings lands in consolidated_state.documents.{doc_id}.findings which inherits the existing case.tenant_id RLS policy. No new PHI surface beyond what Phase 2 already added.
- Image preprocessing strips EXIF (PIL image.info discarded on re-save), which eliminates GPS / camera metadata leakage. Test test_vision_preprocessor_strips_exif pins this.

6. Trigger points

Phase 4 interpretation runs at three points:

6.1 OCR completion (primary trigger)

app/services/document_processing.py:run_post_ocr_pipeline (:391) is the existing hook where OCR-completed docs land in extracted_data (:539-540). Phase 4 extends this:

# Pseudocode addition inside run_post_ocr_pipeline after OCR extracted_data write:
if doc.ocr_status == "completed" and case.sop_id:
    sop_contract = SOPContract.load(code=case.procedure_code, name=case.procedure_name)
    matching_entry = _find_required_doc_entry(sop_contract, doc.document_category)
    if matching_entry and matching_entry.interpretation_schema:
        result = await interpret_document(
            document=doc,
            sop_contract=sop_contract,
            tenant_id=case.tenant_id,
            case_id=case.id,
            patient_id=case.patient_id,
            db=db,
        )
        await _write_findings_to_consolidated_state(case, doc.id, result)
        await consolidated_state_writer.write_consolidated_state(
            db, case, reason="explicit_update"
        )

The interpretation runs AFTER OCR text extraction so both ocr_text (for fallback / human review) and findings (for SOP gating) populate. Independent failure modes — vision can succeed even if OCR returned empty (e.g., pure imaging with no text labels).

6.2 TTL fallback for stalled OCR

app/services/document_retry_service.py already runs a sweep over ocr_status='queued' OR 'processing' cases that exceeded their TTL. Phase 4 PR-4.3 extends the sweep: when a doc's OCR has been queued > required_documents[].ttl_seconds (default 300s, declared per SOP per tkr.yaml:911), skip OCR and go directly to vision interpretation. The vision call is independent of OCR success.

The TTL semantic from required_documents[].ttl_seconds is interpreted in Phase 4: "after this many seconds, the SOP says don't wait for OCR — interpret directly." This is the first runtime use of the field; Phase 1 only declared it.
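The sweep's bypass decision reduces to one comparison. A minimal sketch, assuming the sweep knows when the document entered the OCR queue (`queued_at` is a hypothetical field name) and that both timestamps are timezone-aware:

```python
from datetime import datetime, timedelta, timezone

DEFAULT_TTL_SECONDS = 300  # default declared per SOP


def should_bypass_ocr(
    ocr_status: str,
    queued_at: datetime,
    now: datetime,
    ttl_seconds: int = DEFAULT_TTL_SECONDS,
) -> bool:
    """True when OCR is still pending and has waited longer than the SOP's ttl_seconds."""
    if ocr_status not in ("queued", "processing"):
        return False  # finished or failed OCR is handled by the primary trigger
    return (now - queued_at) > timedelta(seconds=ttl_seconds)
```

Taking `now` as a parameter rather than calling datetime.now() inside keeps the rule testable with fixed timestamps.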

6.3 Re-interpretation on SOP YAML change

SOPContract.invalidate_cache() (app/services/sop_contract.py:649-680) fires on admin SOP save. Phase 4 hooks into the same fan-out to mark all consolidated_state.documents.{*}.findings as stale for cases whose SOP changed. Implementation: a background QStash job triggered by the admin save handler that re-runs interpret_document for all complete docs of the affected SOP. Deferred to Phase 4.5 — Phase 4 MVP doesn't auto-re-interpret; admin manually triggers re-interpretation via a /admin/cases/{case_id}/reinterpret-documents endpoint.

7. Failure modes

Failure	Detection	Outcome	Recovery
Vision API timeout (>30s)	`httpx.TimeoutException` raised	`status=failed_transient`, swallow_metric increment	Retry once after 60s via document_retry_service sweep
Vision API non-2xx	exception caught	`status=failed_transient`	Same retry loop; after 3 failures → `status=failed_permanent`
File too large after compression	`prepare_for_vision` returns "too_large"	`status=skipped_oversized`, findings={}	Prompt template tells patient to re-upload smaller version (Phase 2 §7)
Schema field model can't extract	parser returns null for that field	`findings.field = null` with rationale	Phase 2 prompt shows null → LLM asks patient to confirm verbally
Wrong image domain (knee X-ray uploaded for an oncology SOP)	rationale text match OR off-schema notes signal	All findings null, status=complete, swallow_metric	Prompt template renders null findings; LLM asks for correct file
SOP has no schema for this doc type	`_find_required_doc_entry` returns no match OR `interpretation_schema` is `()`	`status=skipped_no_schema`, findings={}	OCR text still populated; prompt template renders "Findings: (not interpreted for this SOP)"
Anthropic API key missing	settings check at top of interpret_document	`status=failed_permanent`, failure_reason="ANTHROPIC_API_KEY not set"	Operator config issue; Telegram alert

All failure paths increment document_interpretation_outcome{status, doc_type, tier} swallow_metric for the §8 monitoring panel.
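The transient-to-permanent escalation and the outcome metric can be sketched together. The in-memory Counter is a stand-in for the swallow_metric backend, and both function names are illustrative; only the 3-attempt default and the label set come from this spec.

```python
from collections import Counter

MAX_ATTEMPTS = 3  # "default 3×" per the InterpretationStatus comment

# Stand-in for document_interpretation_outcome{status, doc_type, tier}.
outcome_counter: Counter = Counter()


def status_after_failure(failed_attempts: int, max_attempts: int = MAX_ATTEMPTS) -> str:
    """Status to record once failed_attempts vision calls have errored for a doc."""
    return "failed_permanent" if failed_attempts >= max_attempts else "failed_transient"


def record_outcome(status: str, doc_type: str, tier: str) -> None:
    outcome_counter[(status, doc_type, tier)] += 1
```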

8. Phase 4 PR chain (~3 days)

PR-4.0: Populate `interpretation_schema` for all 18 SOPs (~0.5 day)

For each of 18 SOPs in config/prompts/sops/, fill required_documents[].interpretation_schema with the right schema. Five SOPs (tkr, acl_repair, hip_replacement, spine_fusion, knee_arthroscopy) need their imaging schema reviewed with Naidu — file inline comments in the YAML for his pass. The remaining SOPs use placeholder schemas (e.g., bloodwork_recent: {hba1c, hgb, creatinine, recency_days}) which are clinical-knowledge-low-risk.
Test test_all_sops_have_interpretation_schema_for_mandatory_imaging: every SOP with a mandatory imaging-type document has a non-empty interpretation_schema.
Acceptance: all 18 SOPs have schemas; CI test passes.

PR-4.1: Vision preprocessor + signed URL pipeline (~1 day)

New module app/services/vision_preprocessor.py per §3.3.
Add Pillow to requirements.txt (≥10.0.0; verify no Railway image-build surprises).
Test test_vision_preprocessor_resizes_oversized_jpg (5MB input → ≤3.5MB output, width=1200).
Test test_vision_preprocessor_strips_exif (input has GPS EXIF → output has none).
Test test_vision_preprocessor_passes_through_small_pdf (2MB PDF → unchanged).
Test test_vision_preprocessor_returns_too_large_after_min_quality (15MB unrescuable file → "too_large").
Acceptance: Pillow installed cleanly on Railway; all 4 tests green; no production wiring yet.

PR-4.2: Interpretation schema synthesizer + interpreter service (~1 day)

New module app/services/document_interpreter.py per §2.1.
synthesize_vision_prompt per §4 (deterministic, no LLM).
_choose_tier per §2.2.
_parse_vision_response per §5.
LLM-gateway integration: llm_gateway.invoke(case_id=case.id, patient_id=case.patient_id, agent_name="document_interpreter.{doc_type}", prompt_version="v6.2-phase4-vision-v1", language="en") — per .claude/rules/backend-agents.md Tracing rule, every call traces in Langfuse with case_id/patient_id metadata.
Unit tests: golden-string assertion for synthesized prompts (TKR knee_xray, ACL knee_mri, TKR bloodwork_recent); mocked LLM tests for happy-path findings + wrong-image-domain + implausible-value parser branches.
Snapshot tests: assert _KNOWN_GRADES covers the 0-3 enum fields from all 18 SOPs' schemas.
Acceptance: all tests green; module imports cleanly; no wiring yet.

PR-4.3: Wire into OCR completion + TTL fallback (~0.5 day)

Modify app/services/document_processing.py:run_post_ocr_pipeline per §6.1.
Modify app/services/document_retry_service.py TTL sweep to dispatch direct-to-vision per §6.2.
New helper _write_findings_to_consolidated_state(case, doc_id, result) extends consolidated_state.documents with the findings dict + status enum.
Extend consolidated_state_writer.write_consolidated_state to include documents in the transform output (closes the Phase 2 DoD review's Gap 1: currently consolidated_state_backfill._empty_consolidated_state() returns "documents": {} and the writer never updates it). Phase 2 PR-2.1 was supposed to land this; Phase 4 makes it a MUST since Phase 4 produces the content.
Integration test: upload a fixture knee X-ray, simulate OCR complete → assert case.consolidated_state.documents.{doc_id}.findings.joint_space_mm is populated.
Integration test: simulate OCR queued > 300s → assert vision interpretation fires and findings land.
Acceptance: integration tests pass; document_findings_complete (Phase 1) returns True for a fully-interpreted TKR case.

PR-4.4: Prompt rendering integration + observability (~0.5 day)

Confirm Phase 2 <documents> block (already shipped, docs/specs/v6.2-flexible-phase2-prompt-assembly-spec.md §3.1) renders findings dict from consolidated_state.documents.{doc_id}.findings. Phase 2 template handles this; PR-4.4 just adds a fixture test that the rendered prompt for a TKR case with populated findings contains the expected text "knee_xray: joint_space_mm=2.1mm (severe narrowing); osteophyte_grade=3".
Hook auto_invoke_matcher to also check SOPContract.document_findings_complete(case.consolidated_state) under the v6.2 sticky arch (Phase 2 §12.3 already added the missing_for_matching check; Phase 4 adds the documents_complete check as a parallel gate).
Metabase panels: document_interpretation_outcome distribution by status / tier / doc_type / SOP. Alert if failed_permanent rate > 5% for any single doc_type over 1h.
Langfuse trace metadata stamps: doc_interpretation_count, doc_interpretation_total_cost_usd_cents per case (rolled into extra_metadata of the assistant message).
Acceptance: Sindhu's case 0a6b7e48 X-ray (re-uploaded through canary) produces non-empty findings; the v6.2 prompt rendered for the next turn shows the populated Findings: joint_space_mm=2.1mm; osteophyte_grade=3; malalignment_deg=4.5 instead of (no documents on file). This is the structural close of the D1-D6 hallucination class.
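The failed_permanent alert condition from the Metabase item above, expressed as a sketch over raw outcome rows. The input shape (a list of `(doc_type, status)` pairs already filtered to the 1h window) is an assumption; the real panel would query the metric store.

```python
from collections import Counter


def doc_types_over_threshold(
    outcomes: list[tuple[str, str]], threshold: float = 0.05
) -> list[str]:
    """Doc types whose failed_permanent share in the window exceeds threshold."""
    totals: Counter = Counter()
    failures: Counter = Counter()
    for doc_type, status in outcomes:
        totals[doc_type] += 1
        if status == "failed_permanent":
            failures[doc_type] += 1
    return sorted(d for d in totals if failures[d] / totals[d] > threshold)
```

The rate is computed per doc_type, so a single failing document class cannot hide behind healthy volume elsewhere.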

Total effort: ~3.5 working days (0.5 + 1 + 1 + 0.5 + 0.5). PR-4.0 (schema population) and PR-4.1 (preprocessor) can run in parallel since they touch disjoint files, which brings elapsed time to roughly 3 days. PR-4.2 through PR-4.4 are sequential.

9. Sequencing with Phase 2 / Phase 3

Phase 2's prompt template already renders the documents block and handles empty findings gracefully: it shows "Findings: (not yet extracted)" until Phase 4 fills the dict.

Sequencing recommendation: dispatch Phase 4 PRs IN PARALLEL with Phase 2 PR-2.3 (read-path swap). Both can land independently. The join point is Phase 2.5 Sindhu rollout — Phase 4 should ship BEFORE Phase 2.5 because:

Sindhu's case 0a6b7e48 had multiple documents that produced D1-D6 hallucinations. Without Phase 4, Sindhu's first canary cases will see empty findings, so the structural fix Phase 2 advertises only half-fires.
The Phase 2 DoD review's Gap 1 specifically called out that consolidated_state.documents was never being populated; PR-4.3 closes this. Without PR-4.3, the Phase 2 §3.1 <documents> block renders "(no documents on file)" universally.

If the timeline forces a choice between Phase 4 and Phase 2.5: ship Phase 4 first, defer Phase 2.5 by ~3 days. The v6.2 prompt architecture isn't usefully testable for Sindhu without documents wired through.

Phase 3 (extractor reconciler) is independent of Phase 4 — it touches extractor outputs, not document findings. They can ship in any order.

Existing-doc backlog (B2 gating requirement)

The runner + sweeper only process new documents whose ocr_status transitions through pending/queued/processing. Documents that completed OCR before PR-4.3 ships have extracted_findings=NULL and are never processed by the live path. The B2 claim "v6.2 prompt renders real findings" is FALSE for those docs until the backfill runs.

Gating sequence — PR-2.5 Sindhu rollout MUST be preceded by:

Run the backfill script against Sindhu's tenant (dry-run first):

railway run -s curaway --environment production -- \
    python scripts/backfill_vision_interpretation.py \
    --tenant-id <sindhu-tenant-id> --dry-run

railway run -s curaway --environment production -- \
    python scripts/backfill_vision_interpretation.py \
    --tenant-id <sindhu-tenant-id>

Verify the backlog drained to zero using the script's final log line ("Remaining backlog (complete + no findings): 0") or the spot-check SQL in docs/runbook/v6-2-rollout-checklist.md.
Only after backlog = 0 proceed to the PR-2.5 Sindhu flag flip.

The script (scripts/backfill_vision_interpretation.py) is idempotent — docs that already have findings are skipped.
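The idempotency property can be illustrated with a minimal sketch. This is not the contents of scripts/backfill_vision_interpretation.py; the dict keys and the injected `interpret` callable are assumptions, chosen so the selection rule ("complete + no findings") can be run in isolation.

```python
from typing import Any, Callable

Doc = dict[str, Any]


def backlog(docs: list[Doc]) -> list[Doc]:
    """Docs whose OCR completed but which carry no findings yet."""
    return [d for d in docs if d.get("ocr_status") == "completed" and not d.get("findings")]


def run_backfill(docs: list[Doc], interpret: Callable[[Doc], dict], dry_run: bool = False) -> int:
    """Interpret every backlog doc; return how many were (or would be) processed."""
    pending = backlog(docs)
    if not dry_run:
        for doc in pending:
            doc["findings"] = interpret(doc)
    return len(pending)
```

A second run finds nothing to do because the selection rule excludes any doc that already has findings. Note that under this rule an interpreter returning an empty dict would leave the doc in the backlog.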

10. Risk model

ID	Risk	Mitigation
R1	Vision API per-image cost balloons beyond estimate	Schema-driven tier selection (Haiku for booleans, Sonnet for measurements). Metabase panel + Telegram alert when monthly vision spend crosses $20 (2× projected). Per-SOP tier override in `model_registry.yaml` for hot-fix.
R2	Vision hallucinates fields outside schema	Schema-bounded parser drops non-schema fields. Off-schema observations relegated to `notes` (not surfaced to LLM). Human spot-check on first 50 production interpretations + a `notes`-distribution Metabase panel to catch systematic off-schema leakage.
R3	PHI sent in vision call without BAA coverage	Verify Anthropic BAA covers image content (§11 Q4). Strip EXIF in preprocessor (test pinned). No PHI in synthesized prompt (schema-only). Same PII-redaction layer that gates conversation LLM applies — verify in PR-4.2 reviewer pass.
R4	OCR + vision both run and produce contradictory data	OCR populates `ocr_text` (free-form), vision populates `findings` (structured). No contradiction surface — they're disjoint columns. If both fail, Phase 2 prompt template tells patient to describe verbally.
R5	Sonnet vision call latency stalls the turn	All Phase 4 calls are out-of-band (post-OCR or TTL-driven), NOT synchronous to the conversation turn. Findings appear in the NEXT turn's prompt. Worst case: 60s lag, well below the ttl_seconds=300 declared per SOP.
R6	Naidu's schema review changes the field set; production findings become stale	`SOPContract.invalidate_cache` fan-out exists (`sop_contract.py:649-680`). Re-interpretation deferred to Phase 4.5 admin endpoint; until then, schema changes propagate forward but old findings remain (with their original schema). Acceptable for MVP.
R7	Vision call cost-tracking gap (similar to the pre-`_record_usage` bug at `claude_pdf_extractor.py`)	`llm_gateway.invoke` is mandatory per `.claude/rules/backend-agents.md`. No direct `anthropic.Anthropic().messages.create()` calls in `document_interpreter.py`. Reviewer subagent (code-reviewer) must flag any direct SDK use.
R8	Per-tenant rollout: vision interpretation must be switchable off for any tenant whose docs aren't ready	New Flagsmith flag `document_interpretation_enabled` (default false). Phase 2.5 / Phase 4.5 rollout flips it for Sindhu's tenant first, then percentage-rolls. Default off preserves current behavior (Phase 2 prompt shows "(not yet extracted)").
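
The R2 mitigation (schema-bounded parsing) can be sketched as follows. This is a minimal illustration, not the PR-4 parser: `parse_findings`, `_matches`, and the in-memory `KNEE_XRAY_SCHEMA` dict are hypothetical names, with field names and types copied from the knee_xray schema cited in §1. Routing wrong-typed values to `notes` instead of rejecting the whole response is an assumption made for this sketch.

```python
from typing import Any, Dict, Tuple

# Hypothetical in-memory form of the knee_xray interpretation_schema (see §1).
KNEE_XRAY_SCHEMA: Dict[str, type] = {
    "joint_space_mm": float,
    "osteophyte_grade": int,
    "malalignment_deg": float,
}


def _matches(value: Any, expected: type) -> bool:
    """Type check that does not let bool pass as a number."""
    if expected is bool:
        return isinstance(value, bool)
    if isinstance(value, bool):
        return False  # bool is an int subclass in Python; reject it for numerics
    if expected is float:
        return isinstance(value, (int, float))
    return isinstance(value, expected)


def parse_findings(
    raw: Dict[str, Any], schema: Dict[str, type]
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split a raw vision response into (findings, notes).

    Only schema keys with a matching type reach `findings`. Everything else
    goes to `notes`, which is kept for audit and not surfaced to the LLM.
    """
    findings: Dict[str, Any] = {}
    notes: Dict[str, Any] = {}
    for key, value in raw.items():
        expected = schema.get(key)
        if expected is not None and _matches(value, expected):
            findings[key] = expected(value)  # normalize, e.g. 2 -> 2.0
        else:
            notes[key] = value  # off-schema or wrong-typed (assumption)
    return findings, notes
```

A schema key that is missing from `findings` after parsing stays missing, so the completeness gate described in §1 continues to report the document as incomplete instead of accepting a hallucinated value.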

11. Open questions for SD¶

Should vision interpretation be always-on (every uploaded doc gets analyzed) or on-demand (only when SOP needs that doc)? Lean: on-demand. The current spec routes through the SOP's required_documents[] lookup; docs not declared in the SOP get status=skipped_not_in_sop and consume no vision budget. Cost is the driver: at 1,000 cases/mo with always-on across all uploaded files, the $10-15/mo estimate could grow 3-5× because patients upload incidental files (insurance cards, passport scans) that we don't need to interpret.
Should the vision call run synchronously (blocks the conversation turn until findings are ready) or async (turn proceeds with "doc analyzing", findings appear in next turn)? Lean: async. Vision calls run 2-8s at Sonnet tier; synchronous would add that latency to every turn after an upload. Async means turn N renders status=processing, turn N+1 renders findings. Phase 2's §7 enum already supports this (processing value + eta_seconds).
Default vision model tier — Haiku (cost) or Sonnet (clinical accuracy)? Or per-SOP override? Lean: per-SOP override with the §2.2 schema-driven default (Haiku for booleans, Sonnet for numerics). The model_registry.yaml per-SOP override lets admin dial it. Default schema-driven choice keeps the simple case simple.
PR-4 chain order — Sindhu rollout dependency: ship PR-4.* before PR-2.5? Lean: yes, ship Phase 4 PR-4.0 through PR-4.4 BEFORE flipping prompt_arch_v6_2_flexible for Sindhu's tenant. Without Phase 4, Sindhu's canary cases will see empty findings and the structural fix Phase 2 advertises (D1-D6 close) won't actually fire. The 3-day Phase 4 delay is small relative to the credibility cost of a canary that visibly under-delivers on the documents axis.
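
The schema-driven tier default with a per-SOP override (third question above) might look like the sketch below. The function name `select_tier`, the string type names, and the `"haiku"` / `"sonnet"` labels are illustrative assumptions; the real mapping lives in model_registry.yaml and is not reproduced here. Treating a schema with any numeric field as Sonnet-tier, including mixed int + bool schemas such as knee_mri, is also an assumption of this sketch.

```python
from typing import Mapping, Optional

# Assumption: schema field types are available as plain strings.
NUMERIC_TYPES = {"int", "float"}


def select_tier(schema: Mapping[str, str], override: Optional[str] = None) -> str:
    """Pick a vision model tier for one document type.

    A per-SOP override wins when present. Otherwise any numeric field
    selects "sonnet", and an all-boolean schema selects "haiku".
    """
    if override is not None:
        return override
    if any(field_type in NUMERIC_TYPES for field_type in schema.values()):
        return "sonnet"
    return "haiku"


# Schemas as cited in §1 (field names and types only).
knee_xray = {
    "joint_space_mm": "float",
    "osteophyte_grade": "int",
    "malalignment_deg": "float",
}
knee_mri = {
    "acl_tear_grade": "int",
    "associated_meniscus_tear": "bool",
    "associated_collateral_injury": "bool",
}
```

Under these assumptions both cited schemas resolve to Sonnet by default, and an admin override can still pin either one to Haiku as a hot-fix.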

12. Summary¶

Phase 4 fills the document findings dict that Phase 2's prompt template already renders. The pipeline is: SOP declares interpretation_schema (Phase 1, already in place for TKR + ACL) → vision preprocessor pulls the file from R2 and resizes to under 3.5MB (new) → prompt synthesizer emits a schema-bounded vision-LLM prompt (new) → llm_gateway.invoke runs the call at Haiku or Sonnet tier per schema type (new) → schema-bounded parser writes findings into consolidated_state.documents.{doc_id}.findings (new, also closes the Phase 2 Gap 1 documents writer issue) → Phase 2 prompt template surfaces the findings → SOPContract.document_findings_complete (Phase 1, already in place) gates auto_invoke_matcher promotion.

Estimated incremental monthly cost: $1-2 at canary scale, ~$10-15 at 1,000 cases/mo. Tiered model selection keeps cost in the single-digit percent range of total LLM spend. No commercial radiology API; no DICOM parsing (the JPG-derived files patients upload are sufficient for the joint-space / grade fields the SOPs declare).
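
As a quick arithmetic check, the figures quoted in this spec can be combined: the $10-15/mo estimate at 1,000 cases/mo, the 3-5× always-on multiplier from §11, and the $20 alert threshold from R1. The variable names are illustrative, and the numbers are the spec's own estimates, not measurements.

```python
from typing import Tuple

Range = Tuple[float, float]

BASELINE_USD: Range = (10.0, 15.0)         # on-demand estimate at 1,000 cases/mo
ALWAYS_ON_MULTIPLIER: Range = (3.0, 5.0)   # growth if every upload is interpreted
ALERT_THRESHOLD_USD = 20.0                 # R1 Telegram alert threshold


def scale(cost: Range, multiplier: Range) -> Range:
    """Multiply the low bounds together and the high bounds together."""
    return (cost[0] * multiplier[0], cost[1] * multiplier[1])


always_on = scale(BASELINE_USD, ALWAYS_ON_MULTIPLIER)

print(f"on-demand: ${BASELINE_USD[0]:.0f}-{BASELINE_USD[1]:.0f}/mo")
print(f"always-on: ${always_on[0]:.0f}-{always_on[1]:.0f}/mo")
print("on-demand stays under alert:", BASELINE_USD[1] < ALERT_THRESHOLD_USD)
print("always-on crosses alert:", always_on[0] > ALERT_THRESHOLD_USD)
```

On these estimates the on-demand design stays below the alert threshold across its whole range, while always-on would land at $30-75/mo and trip the alert even at its low end.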

The architectural bet: schema-declared vision extraction beats free-form OCR + downstream parsing because (a) the LLM sees the SOP's required fields directly in the synthesizer prompt, (b) the parser is schema-bounded so hallucinations can't sneak through, (c) the same interpretation_schema is the single source of truth from SOP YAML to SOP completeness gate. The Sindhu D1-D6 hallucination class — "the extraction failed", "I see your X-ray uploading now" — closes structurally because the LLM finally sees real findings instead of empty placeholders.