v6.2 Flexible: Phase 4 Document Interpretation Spec
Status: DRAFT — proposes the vision-extractor pipeline that fills consolidated_state.documents.{doc_id}.findings against the SOP-declared interpretation_schema. Filed in parallel with Phase 2 PR-2.1 per the Phase 2 DoD review (docs/specs/v6.2-flexible-phase2-dod-and-audit-coverage-review.md §Part 4, recommendation #11).
Author: Phase 4 dispatch, 2026-06-03.
SD's stated hard requirement: "SOP framework needs to interpret the reports (images/documents/xrays) — without which some of the validation checks will not work."
Companion specs:
- docs/specs/v6.2-flexible-phase1-sop-contract-spec.md (interpretation_schema field declared on SOP required_documents[]; TKR + ACL examples §2)
- docs/specs/v6.2-flexible-phase2-prompt-assembly-spec.md §6 documents block, §7 findings shape, §11 cache invalidation, status enum
- docs/specs/v6.2-flexible-phase2-dod-and-audit-coverage-review.md Gap 1 (D1-D6 documents wiring) + Phase 4 scoping risks
1. Goal restated
Phase 1 declared interpretation_schema on required_documents[] (config/prompts/sops/tkr.yaml:907-918 knee_xray schema → joint_space_mm: float, osteophyte_grade: int, malalignment_deg: float; config/prompts/sops/acl_repair.yaml:185-192 knee_mri schema → acl_tear_grade: int, associated_meniscus_tear: bool, associated_collateral_injury: bool). Phase 1's SOPContract.document_findings_complete() method (app/services/sop_contract.py:571-599) already validates findings against the schema and returns False when any mandatory document type lacks any schema key.
Phase 2 surfaces document presence + status + findings in the prompt (docs/specs/v6.2-flexible-phase2-prompt-assembly-spec.md §3.1 <documents> block, §7 enum). The prompt already renders Findings: {doc.findings | join(", ")} when status == "complete". The shape is there; the values aren't filled.
Phase 4 fills them. Concretely:
- A document interpreter service that, when a doc finishes OCR (or as a TTL fallback for queued OCR), runs a vision-LLM call against the SOP's interpretation_schema and emits structured findings.
- Findings storage at consolidated_state.documents.{doc_id}.findings, the same dict shape Phase 2's prompt template reads.
- The Phase 1 SOP completeness gate, SOPContract.document_findings_complete(consolidated_state), starts returning True only when every mandatory doc has its findings populated; auto_invoke_matcher (Phase 2 §12.3) consumes this gate.
- Prompt visibility preserved per Phase 2: the LLM sees what was extracted and can ask the patient to confirm or supplement.
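The completeness gate can be illustrated with a minimal sketch. It is not the SOPContract implementation: the `mandatory` mapping, the `doc_type` key on each document entry, and the function name are assumptions made for illustration. Only the rule itself (every mandatory doc type must carry every schema key, null values allowed) comes from the spec.

```python
from typing import Any

# Hypothetical stand-in for the SOP's mandatory documents:
# {doc_type: (schema_field, ...)}. The real data lives in SOPContract.
MandatorySchemas = dict[str, tuple[str, ...]]


def findings_complete(
    mandatory: MandatorySchemas,
    documents: dict[str, dict[str, Any]],
) -> bool:
    """True only if each mandatory doc type has a document carrying all schema keys.

    A key that is present with a null value still counts as populated;
    a missing key does not.
    """
    for doc_type, fields in mandatory.items():
        satisfied = any(
            doc.get("doc_type") == doc_type  # "doc_type" key is an assumption
            and all(key in doc.get("findings", {}) for key in fields)
            for doc in documents.values()
        )
        if not satisfied:
            return False
    return True
```

With an empty `findings: {}` the gate stays closed, which is the pre-Phase-4 state for every case.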
The Sindhu D1-D6 audit class (audit 0a6b7e48, fabricated "the extraction failed" / "uploading now" status hallucinations) is structurally addressed by Phase 2's <documents> block — Phase 4 turns the empty findings: {} into actual content so the LLM stops asking patients to verbally describe an X-ray that's already been read.
2. Architecture overview
2.1 New module: app/services/document_interpreter.py
    # app/services/document_interpreter.py
    """v6.2 Phase 4: vision-based document interpretation against SOP schemas."""
    from __future__ import annotations

    from dataclasses import dataclass
    from typing import Any, Literal

    from app.services.sop_contract import SOPContract, RequiredDocumentEntry

    InterpretationStatus = Literal[
        "skipped_no_schema",   # SOP doesn't declare a schema for this doc type
        "skipped_oversized",   # file too large to compress to safe vision-LLM size
        "skipped_not_in_sop",  # uploaded doc type doesn't match any SOP required_documents
        "complete",            # findings emitted, all schema keys present (some may be null)
        "failed_transient",    # vision call errored, retry pending
        "failed_permanent",    # exhausted retries (default 3×)
    ]


    @dataclass(frozen=True)
    class InterpretationResult:
        doc_id: str
        doc_type: str
        status: InterpretationStatus
        findings: dict[str, Any]   # {schema_field: value or None}
        rationale: dict[str, str]  # {schema_field: one-sentence rationale}
        notes: list[str]           # informational; off-schema observations
        llm_meta: dict[str, Any]   # tier, provider, latency_ms, cost_usd
        failure_reason: str | None


    async def interpret_document(
        *,
        document: "DocumentReference",  # app.models.document.DocumentReference
        sop_contract: SOPContract,
        tenant_id: str,
        case_id: str,
        patient_id: str,
        db: "AsyncSession",
        langfuse_handler: object | None = None,
    ) -> InterpretationResult:
        """
        Run the vision interpreter against a single document.

        Steps:
        1. Resolve the SOP's required_documents entry whose `type` matches
           document.document_category (or skip if no match → status=skipped_not_in_sop).
        2. If the matched entry has no `interpretation_schema`, status=skipped_no_schema.
        3. Fetch + preprocess the binary from R2 (vision_preprocessor.prepare_for_vision).
           If preprocessing fails / file too large after compression, status=skipped_oversized.
        4. Synthesize the vision prompt from interpretation_schema (synthesize_vision_prompt).
        5. Choose tier: presence-detection only (Haiku) vs structured numeric findings
           (Sonnet). See §2.2 tier-selection rules.
        6. Run vision call via llm_gateway.invoke() with the synthesized prompt.
        7. Parse JSON response strict=False; enforce schema-bounded output
           (drop fields not in interpretation_schema → notes[]). See §5.
        8. Return InterpretationResult; caller commits findings via
           consolidated_state_writer (see §6.1).

        NEVER raises. Internal errors → status=failed_transient + failure_reason.
        """
        ...


    def synthesize_vision_prompt(
        *,
        doc_type: str,
        interpretation_schema: tuple[tuple[str, str], ...],
    ) -> str:
        """Translate SOP interpretation_schema → vision-LLM prompt.

        See §4 for the worked example and template. Pure function: unit-testable
        against fixture schemas without any LLM call.
        """
        ...
This module is intentionally isolated: it doesn't import from app.agents.* (no cross-domain imports). It talks to r2_client for binaries, llm_gateway for the vision call, sop_contract for the schema, and consolidated_state_writer for the write-back. It is extraction-ready and could become an HTTP handler unchanged.
2.2 Vision-LLM tier selection
Per .claude/rules/backend-agents.md Model Tiers and CLAUDE.md cost constraints ($1K total budget, ~$9/mo current LLM spend at 100 cases):
| Schema field type | Use case | Tier | Rationale |
|---|---|---|---|
| bool (e.g., associated_meniscus_tear, bone_bruise_present) | Presence / absence detection | Haiku 4.5 | Cheap (~$0.001/image); model only needs to say yes/no with a rationale |
| int enum (e.g., osteophyte_grade: 0-3, acl_tear_grade: 0-3) | Categorical grading on bounded scale | Sonnet 4.6 | Numeric grading benefits from stronger visual reasoning; cost ~$0.005/image acceptable |
| float measurement (e.g., joint_space_mm, malalignment_deg, hba1c) | Quantitative reading | Sonnet 4.6 | Measurement extraction is the highest-stakes field; downgrade-to-null on low confidence |
| OCR-extractable text fields (e.g., recency_days) | Date / numeric from report header | Haiku 4.5 | Already legible text; vision call equivalent to OCR re-read |
Schema-level decision rule. _choose_tier(interpretation_schema) in document_interpreter.py:
- If ANY schema field is float or int (excluding recency_days-style date fields), use Sonnet.
- Else use Haiku.
Override available via config/model_registry.yaml per-SOP key (document_interpretation.knee_xray.tier: sonnet|haiku) so admin can dial cost/quality per SOP. Tier override resolved per call, never cached.
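The decision rule plus override can be sketched as a pure function. This is an illustrative sketch, not the `_choose_tier` in the PR: the `override` parameter stands in for whatever value is resolved from model_registry.yaml, and the `DATE_STYLE_FIELDS` set is an assumption (the spec names only recency_days).

```python
from typing import Literal, Optional

Tier = Literal["haiku", "sonnet"]
Schema = tuple[tuple[str, str], ...]

# Assumption: fields read as dates/header text rather than visual measurements.
DATE_STYLE_FIELDS = frozenset({"recency_days"})


def choose_tier(schema: Schema, override: Optional[Tier] = None) -> Tier:
    """Schema-driven default tier, with an optional per-SOP override winning."""
    if override is not None:
        return override
    for field, type_str in schema:
        # Any numeric field that is a real measurement or grade forces Sonnet.
        if type_str in ("float", "int") and field not in DATE_STYLE_FIELDS:
            return "sonnet"
    return "haiku"
```

Passing the override as an argument on every call, instead of caching it, mirrors the "resolved per call, never cached" requirement.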
Cost estimate.
- TKR case: 2-3 docs (knee_xray + optional bloodwork_recent) × ~$0.005 (Sonnet) = $0.01-0.015/case.
- ACL case: 1 doc (knee_mri) × ~$0.005 = $0.005/case.
- Bloodwork (Haiku): ~$0.001/case.
- At 100 cases/mo: ~$1-2/mo total vision spend, small next to the $9/mo conversation LLM spend.
- At 1,000 cases/mo (post-canary scale): ~$10-15/mo, still small next to the ~$90/mo LLM spend.
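The monthly figures follow from cases × docs × per-image cost. A small sketch of that arithmetic, using the per-image prices from the tier table; the function name is illustrative, and prices are kept in thousandths of a dollar so the multiplication stays exact.

```python
# Per-image prices from the tier table, in thousandths of a dollar.
SONNET_MILLIUSD = 5  # ~$0.005/image
HAIKU_MILLIUSD = 1   # ~$0.001/image


def monthly_vision_cost_usd(
    cases_per_month: int, docs_per_case: int, milliusd_per_image: int
) -> float:
    """Monthly vision spend in dollars for a uniform docs-per-case workload."""
    return cases_per_month * docs_per_case * milliusd_per_image / 1000


# TKR-style workload at Sonnet tier, 2 to 3 docs per case.
low_canary = monthly_vision_cost_usd(100, 2, SONNET_MILLIUSD)
high_canary = monthly_vision_cost_usd(100, 3, SONNET_MILLIUSD)
low_scaled = monthly_vision_cost_usd(1000, 2, SONNET_MILLIUSD)
high_scaled = monthly_vision_cost_usd(1000, 3, SONNET_MILLIUSD)
```

Under these assumptions the canary range comes out at $1.00 to $1.50 and the 1,000-case range at $10 to $15, matching the estimate.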
GPT-4o was considered as an alternative but rejected: cost parity with Sonnet, weaker clinical structured-output adherence per Curaway's existing extractor evaluation (docs/reference/llm-evaluation.md). Commercial radiology APIs (e.g., Aidoc, RapidAI) are deferred as out of MVP scope; they would need clinical validation + procurement.
3. Vision preprocessor and signed-URL pipeline
3.1 Current state
app/integrations/claude_pdf_extractor.py is the current vision touchpoint. extract_text_from_pdf (:234-303) base64-encodes the entire file and sends it inline to Claude — fine for small scanned PDFs (<5MB). extract_text_from_image (:306-383) does the same for JPEG/PNG. There is no resize/compress step. analyze_clinical_image (:407-484) is the existing observational-only path; it's NOT structured-extraction, so it doesn't fit the SOP schema use case.
R2 access lives in app.integrations.r2_client (used at app/services/document_service.py:19, 43). Documents are stored under {tenant_id}/{patient_id}/{file_id}.{ext} (app/models/document.py:37-42).
3.2 The problem
Anthropic vision API accepts base64 image bodies up to ~5MB after encoding (~3.7MB raw). DICOM-derived JPGs from MRI scans routinely run 8-15MB. Sending these inline either fails outright or hits cost cliffs (image_tokens scale with pixel count).
3.3 Design: app/services/vision_preprocessor.py

    """v6.2 Phase 4: vision input preprocessor (resize + recompress)."""
    from __future__ import annotations

    from dataclasses import dataclass
    from typing import Literal


    @dataclass(frozen=True)
    class PreparedImage:
        media_type: str  # "image/jpeg" | "image/png" | "application/pdf"
        base64_data: str
        pixels_w: int
        pixels_h: int
        bytes_after: int
        compression_applied: bool
        notes: str  # e.g., "downsized from 4096w to 1200w; jpeg q=80"


    async def prepare_for_vision(
        *,
        document: "DocumentReference",
        target_max_width_px: int = 1200,
        target_max_bytes: int = 3_500_000,  # safe under Anthropic 5MB cap
        jpeg_quality: int = 80,
    ) -> PreparedImage | Literal["too_large"]:
        """Fetch binary from R2, resize/recompress if needed, return base64-ready payload.

        Steps:
        1. Fetch raw bytes from R2 via r2_client (re-uses storage_key lookup).
        2. If document.mime_type == 'application/pdf' AND size_bytes <= 3.5MB,
           pass through unchanged (Anthropic accepts inline PDFs).
        3. If size_bytes > 3.5MB OR mime_type is an image format larger than
           target_max_width_px, resize via Pillow (PIL.Image.thumbnail) to
           target_max_width_px while preserving aspect ratio. Re-encode as JPEG
           at jpeg_quality.
        4. If post-compression bytes still > target_max_bytes, drop quality to 60
           and retry. If still > target_max_bytes, return "too_large" sentinel.
        5. Return PreparedImage with base64-encoded data + diagnostic metadata.

        Original file untouched in R2; preprocessing produces an ephemeral payload
        only for the vision call. Never writes back to R2.
        """
        ...
Pillow is added as a Python dep (requirements.txt); it's a 5MB install with no transitive surprises and is already implicitly needed for any future image work. The PreparedImage notes field surfaces compression diagnostics into InterpretationResult.llm_meta.preprocessing_notes for observability.
Failure mode: "too_large" → interpret_document returns status=skipped_oversized. The downstream prompt template renders (file expired before processing — ask the patient to re-upload smaller version) per Phase 2 §7 expired enum value (re-uses the same surface; no new enum value needed).
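The sizing logic in steps 3 and 4 can be sketched without any imaging library. The sketch below is a simplification under stated assumptions: `encode` is a hypothetical callable standing in for a JPEG re-save at a given quality (in the real preprocessor this would presumably wrap Pillow), `None` plays the role of the "too_large" sentinel, and the cap constant reflects the ~5MB post-encoding figure quoted in §3.2.

```python
import math
from typing import Callable, Optional

ANTHROPIC_BASE64_CAP = 5_000_000  # assumption: the ~5MB post-encoding cap from §3.2


def base64_len(raw_bytes: int) -> int:
    """Length of the padded base64 encoding of raw_bytes bytes."""
    return 4 * math.ceil(raw_bytes / 3)


def fit_width(width: int, height: int, max_width: int = 1200) -> tuple[int, int]:
    """Scale down to max_width preserving aspect ratio; never upscale."""
    if width <= max_width:
        return width, height
    scale = max_width / width
    return max_width, max(1, round(height * scale))


def compress_with_fallback(
    encode: Callable[[int], bytes],
    max_bytes: int = 3_500_000,
    qualities: tuple[int, ...] = (80, 60),
) -> Optional[bytes]:
    """Try each JPEG quality in order; return None if nothing fits max_bytes."""
    for quality in qualities:
        payload = encode(quality)
        if len(payload) <= max_bytes:
            return payload
    return None
```

The 3.5MB raw target leaves headroom after encoding: `base64_len(3_500_000)` is 4,666,668 characters, under the 5MB cap.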
4. Interpretation schema → vision prompt synthesis
4.1 The synthesizer
synthesize_vision_prompt(doc_type, interpretation_schema) is the heart of Phase 4. It takes a tuple-of-(field, type_str) pairs (the shape RequiredDocumentEntry.interpretation_schema already exposes per app/services/sop_contract.py:78) and emits a vision-LLM prompt asking for those exact fields.
4.2 Worked example: TKR knee_xray
Input:
    doc_type = "knee_xray"
    interpretation_schema = (
        ("joint_space_mm", "float"),
        ("osteophyte_grade", "int"),
        ("malalignment_deg", "float"),
    )
Synthesized prompt (cacheable across all knee_xray calls — cache_control: ephemeral marker placed at end of the system block):
You are a radiology-assistant vision extractor for a cross-border medical
travel coordination platform. The image is a knee X-ray uploaded by a patient.
Your job: extract these structured findings, EXACTLY these fields, NOTHING
else:
- joint_space_mm (float): the medial joint space width in millimeters.
Decimal value. Null if not measurable from the image.
- osteophyte_grade (int): bone spur severity on a 0-3 scale.
0=none, 1=mild, 2=moderate, 3=severe. Null if not determinable.
- malalignment_deg (float): varus/valgus angle in degrees.
Positive = varus, negative = valgus. Null if not measurable.
For EACH field, also emit a one-sentence visual rationale describing what in
the image you used to arrive at the value (or what was missing if null).
OUTPUT RULES — non-negotiable:
- Return strict JSON only. No prose, no markdown fences.
- Shape: {"findings": {field: value or null}, "rationale": {field: "..."},
"notes": [optional list of other observable features NOT in the
schema above — informational only, NOT used for clinical
gating]}
- If the image is NOT a knee X-ray (wrong body region, non-clinical photo,
unreadable), return {"findings": {field: null for each},
"rationale": {field: "image is not a knee X-ray"},
"notes": ["wrong image type: <observed>"]}.
- DO NOT diagnose. DO NOT suggest treatment. DO NOT mention surgery decisions.
- DO NOT add fields beyond the schema. Anything observed but off-schema goes
in "notes".
The synthesizer produces this exact text deterministically — no LLM call needed to compose it. Unit-testable via fixture: test_synthesize_vision_prompt_tkr_knee_xray asserts the output matches the expected golden string.
4.3 Edge cases
- Schema with no fields: interpretation_schema = () → interpret_document returns status=skipped_no_schema without calling vision. Some SOP YAMLs were in this state at Phase 1 ship time; Phase 4 PR-4.0 (see §8) populates schemas for all 18 SOPs.
- Schema with only bool fields: synthesizer omits the float/int grading scale boilerplate; just emits the field name + "true|false|null" instruction.
- Bloodwork numeric fields: recency_days is a date-arithmetic field, not a vision-measurement; the synthesizer emits "extract the report date and compute the days elapsed from that date to today (use the report's stated date in the upper-right header if visible)", so the model gets explicit field-specific guidance.
The synthesizer logic per field-type:
    def _field_instruction(field: str, type_str: str) -> str:
        if type_str == "bool":
            return f" - {field} (bool): true / false / null"
        if type_str == "int":
            # Special-case schemas with known grading scales; otherwise generic.
            if field in _KNOWN_GRADES:  # {"osteophyte_grade", "acl_tear_grade", ...}
                return f" - {field} (int): {_KNOWN_GRADES[field]}"
            return f" - {field} (int): integer value or null"
        if type_str == "float":
            if field.endswith("_mm"):
                return f" - {field} (float): millimeters, decimal value or null"
            if field.endswith("_deg"):
                return f" - {field} (float): degrees, decimal value or null"
            return f" - {field} (float): decimal value or null"
        return f" - {field} ({type_str}): value or null"
_KNOWN_GRADES is a small dict in document_interpreter.py that captures domain-specific scales for the 18 SOPs. It's hardcoded (not config) because each scale's exact wording is clinical and reviewed with Naidu — config-izing it invites accidental editing.
5. SOP scope validation (anti-hallucination)
Vision LLMs may emit findings outside the SOP's schema ("I see a tumor"), wrong-image findings ("I see a chest X-ray, here are lung findings"), or over-grade beyond the requested scale.
5.1 Schema-bounded write-back
document_interpreter._parse_vision_response enforces:
- Drop any key in findings not in interpretation_schema.
- Coerce types: int → int, float → float, bool → bool. Failed coercion → null + note "type coercion failed for field X (got Y)".
- Drop any rationale key whose corresponding finding was dropped.
- Off-schema observations flow into InterpretationResult.notes (informational only; the Phase 2 prompt does NOT render notes, they're observability-only).
5.2 Wrong-image-domain detection
When the vision response says "image is not a knee X-ray" in rationale or notes:
- ALL findings become null with rationale "image type mismatch."
- InterpretationResult.status = "complete" (the call succeeded; the findings are honestly null).
- A swallow_metric document_interpretation_wrong_image_type{doc_type, observed_type} increments so the team can spot patients uploading the wrong file.
- Phase 2 prompt renders Findings: joint_space_mm=null (image type mismatch); … — the model can then ask the patient to re-upload the correct file.
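One way the mismatch handling could look, as a sketch only. The marker phrases are assumptions taken from the prompt wording in §4.2 ("image is not a ...", "wrong image type: ..."); the spec does not pin the exact matching rule, and both function names are hypothetical.

```python
from typing import Any

# Assumption: marker phrases mirror the wording the §4.2 prompt asks the model to use.
_MISMATCH_MARKERS = ("image is not a", "wrong image type")


def is_wrong_image(rationale: dict[str, str], notes: list[str]) -> bool:
    """True if any rationale or note text carries a mismatch marker."""
    texts = list(rationale.values()) + list(notes)
    return any(marker in text.lower() for text in texts for marker in _MISMATCH_MARKERS)


def null_out(schema_fields: tuple[str, ...]) -> dict[str, Any]:
    """All-null findings with a uniform rationale; the call itself still succeeded."""
    return {
        "status": "complete",
        "findings": {field: None for field in schema_fields},
        "rationale": {field: "image type mismatch" for field in schema_fields},
    }
```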
5.3 Implausible-value detection
For fields with bounded ranges (osteophyte_grade: 0-3, acl_tear_grade: 0-3), values outside the range are coerced to null with rationale "value out of declared range." The swallow_metric document_interpretation_implausible_value increments (informational, not blocking).
For continuous fields (joint_space_mm, malalignment_deg), Phase 4 does NOT enforce bounds at the parser layer because anatomical reference ranges are clinical knowledge that doesn't belong in the parser. Out-of-range values pass through with the rationale, and the human spot-check on the first 50 production interpretations (§10, R2) catches outliers.
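The bounded-versus-continuous split can be sketched as a lookup that only knows the declared grade ranges. The `BOUNDED_RANGES` table and function name are illustrative; the two ranges are the ones named in this section.

```python
from typing import Optional, Union

Number = Union[int, float]

# Only the bounded grade fields named above; continuous fields are deliberately absent.
BOUNDED_RANGES = {"osteophyte_grade": (0, 3), "acl_tear_grade": (0, 3)}


def enforce_range(
    field: str, value: Optional[Number]
) -> tuple[Optional[Number], Optional[str]]:
    """Return (value, rationale_override); out-of-range bounded values become null."""
    bounds = BOUNDED_RANGES.get(field)
    if value is None or bounds is None:
        return value, None  # nulls and continuous fields pass through untouched
    low, high = bounds
    if low <= value <= high:
        return value, None
    return None, "value out of declared range"
```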
5.4 PHI in vision call
Per CLAUDE.md ground rule "No PHI in logs, SSE, or external channels" + .claude/rules/coding-principles.md §5 PII rules:
- The vision call sends the IMAGE BINARY to Anthropic. Anthropic is a HIPAA-eligible processor under a BAA (verify before Phase 4.5 rollout — open question §11 Q4).
- The synthesized prompt contains NO patient identifiers (no name, no MRN, no DOB). The synthesizer is pure schema-only.
- Vision response storage: findings lands in consolidated_state.documents.{doc_id}.findings which inherits the existing case.tenant_id RLS policy. No new PHI surface beyond what Phase 2 already added.
- Image preprocessing strips EXIF (PIL image.info discarded on re-save) — eliminates GPS / camera metadata leakage. Test test_vision_preprocessor_strips_exif pins this.
6. Trigger points
Phase 4 interpretation runs at three points:
6.1 OCR completion (primary trigger)
app/services/document_processing.py:run_post_ocr_pipeline (:391) is the existing hook where OCR-completed docs land in extracted_data (:539-540). Phase 4 extends this:
    # Pseudocode addition inside run_post_ocr_pipeline after OCR extracted_data write:
    if doc.ocr_status == "completed" and case.sop_id:
        sop_contract = SOPContract.load(code=case.procedure_code, name=case.procedure_name)
        matching_entry = _find_required_doc_entry(sop_contract, doc.document_category)
        if matching_entry and matching_entry.interpretation_schema:
            result = await interpret_document(
                document=doc,
                sop_contract=sop_contract,
                tenant_id=case.tenant_id,
                case_id=case.id,
                patient_id=case.patient_id,
                db=db,
            )
            await _write_findings_to_consolidated_state(case, doc.id, result)
            await consolidated_state_writer.write_consolidated_state(
                db, case, reason="explicit_update"
            )
The interpretation runs AFTER OCR text extraction so both ocr_text (for fallback / human review) and findings (for SOP gating) populate. Independent failure modes — vision can succeed even if OCR returned empty (e.g., pure imaging with no text labels).
6.2 TTL fallback for stalled OCR
app/services/document_retry_service.py already runs a sweep over docs whose ocr_status is 'queued' or 'processing' and that have exceeded their TTL. Phase 4 PR-4.3 extends the sweep: when a doc's OCR has been queued longer than required_documents[].ttl_seconds (default 300s, declared per SOP per tkr.yaml:911), skip OCR and go directly to vision interpretation. The vision call is independent of OCR success.
The TTL semantic from required_documents[].ttl_seconds is interpreted in Phase 4: "after this many seconds, the SOP says don't wait for OCR — interpret directly." This is the first runtime use of the field; Phase 1 only declared it.
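The TTL check itself reduces to a small predicate. A sketch, with a hypothetical function name and the timestamps passed in explicitly so it stays testable; how the sweep obtains `queued_at` is not specified here.

```python
from datetime import datetime, timedelta, timezone


def should_skip_ocr(
    status: str,
    queued_at: datetime,
    now: datetime,
    ttl_seconds: int = 300,  # SOP default quoted above
) -> bool:
    """True when a stalled OCR job has waited longer than the SOP's ttl_seconds."""
    if status not in ("queued", "processing"):
        return False  # finished or failed docs are not the sweep's concern
    return now - queued_at > timedelta(seconds=ttl_seconds)
```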
6.3 Re-interpretation on SOP YAML change
SOPContract.invalidate_cache() (app/services/sop_contract.py:649-680) fires on admin SOP save. Phase 4 hooks into the same fan-out to mark all consolidated_state.documents.{*}.findings as stale for cases whose SOP changed. Implementation: a background QStash job triggered by the admin save handler that re-runs interpret_document for all complete docs of the affected SOP. Deferred to Phase 4.5 — Phase 4 MVP doesn't auto-re-interpret; admin manually triggers re-interpretation via a /admin/cases/{case_id}/reinterpret-documents endpoint.
7. Failure modes
| Failure | Detection | Outcome | Recovery |
|---|---|---|---|
| Vision API timeout (>30s) | httpx.TimeoutException raised | status=failed_transient, swallow_metric increment | Retry once after 60s via document_retry_service sweep |
| Vision API non-2xx | exception caught | status=failed_transient | Same retry loop; after 3 failures → status=failed_permanent |
| File too large after compression | prepare_for_vision returns "too_large" | status=skipped_oversized, findings={} | Prompt template tells patient to re-upload smaller version (Phase 2 §7) |
| Schema field model can't extract | parser returns null for that field | findings.field = null with rationale | Phase 2 prompt shows null → LLM asks patient to confirm verbally |
| Wrong image domain (knee X-ray uploaded for an oncology SOP) | rationale text match OR off-schema notes signal | All findings null, status=complete, swallow_metric | Prompt template renders null findings; LLM asks for correct file |
| SOP has no schema for this doc type | _find_required_doc_entry returns no match OR interpretation_schema is () | status=skipped_no_schema (skipped_not_in_sop when no entry matches), findings={} | OCR text still populated; prompt template renders "Findings: (not interpreted for this SOP)" |
| Anthropic API key missing | settings check at top of interpret_document | status=failed_permanent, failure_reason="ANTHROPIC_API_KEY not set" | Operator config issue; Telegram alert |
All failure paths increment document_interpretation_outcome{status, doc_type, tier} swallow_metric for the §8 monitoring panel.
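The status transitions in the table can be condensed into one classifier. This is a sketch: the `kind` labels are hypothetical names for the table's rows, not identifiers from the codebase, and the 3-attempt ceiling is the default quoted in §2.1.

```python
def classify_failure(kind: str, failed_attempts: int, max_attempts: int = 3) -> str:
    """Map a failure kind plus attempt count to an InterpretationStatus value."""
    if kind == "missing_api_key":
        return "failed_permanent"  # operator config issue; retrying cannot help
    if kind == "too_large":
        return "skipped_oversized"
    # Timeouts and non-2xx responses share the retry loop.
    if failed_attempts >= max_attempts:
        return "failed_permanent"
    return "failed_transient"
```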
8. Phase 4 PR chain (~3 days)
PR-4.0: Populate interpretation_schema for all 18 SOPs (~0.5 day)
- For each of 18 SOPs in config/prompts/sops/, fill required_documents[].interpretation_schema with the right schema. Five SOPs (tkr, acl_repair, hip_replacement, spine_fusion, knee_arthroscopy) need their imaging schema reviewed with Naidu; file inline comments in the YAML for his pass. The remaining SOPs use placeholder schemas (e.g., bloodwork_recent: {hba1c, hgb, creatinine, recency_days}) which are clinical-knowledge-low-risk.
- Test test_all_sops_have_interpretation_schema_for_mandatory_imaging: every SOP with a mandatory imaging-type document has a non-empty interpretation_schema.
- Acceptance: all 18 SOPs have schemas; CI test passes.
PR-4.1: Vision preprocessor + signed URL pipeline (~1 day)
- New module app/services/vision_preprocessor.py per §3.3.
- Add Pillow to requirements.txt (≥10.0.0; verify no Railway image-build surprises).
- Test test_vision_preprocessor_resizes_oversized_jpg (5MB input → ≤3.5MB output, width=1200).
- Test test_vision_preprocessor_strips_exif (input has GPS EXIF → output has none).
- Test test_vision_preprocessor_passes_through_small_pdf (2MB PDF → unchanged).
- Test test_vision_preprocessor_returns_too_large_after_min_quality (15MB unrescuable file → "too_large").
- Acceptance: Pillow installed cleanly on Railway; all 4 tests green; no production wiring yet.
PR-4.2: Interpretation schema synthesizer + interpreter service (~1 day)
- New module app/services/document_interpreter.py per §2.1.
- synthesize_vision_prompt per §4 (deterministic, no LLM).
- _choose_tier per §2.2.
- _parse_vision_response per §5.
- LLM-gateway integration: llm_gateway.invoke(case_id=case.id, patient_id=case.patient_id, agent_name="document_interpreter.{doc_type}", prompt_version="v6.2-phase4-vision-v1", language="en"). Per the .claude/rules/backend-agents.md Tracing rule, every call traces in Langfuse with case_id/patient_id metadata.
- Unit tests: golden-string assertion for synthesized prompts (TKR knee_xray, ACL knee_mri, TKR bloodwork_recent); mocked LLM tests for happy-path findings + wrong-image-domain + implausible-value parser branches.
- Snapshot tests: assert _KNOWN_GRADES covers the 0-3 enum fields from all 18 SOPs' schemas.
- Acceptance: all tests green; module imports cleanly; no wiring yet.
PR-4.3: Wire into OCR completion + TTL fallback (~0.5 day)
- Modify app/services/document_processing.py:run_post_ocr_pipeline per §6.1.
- Modify app/services/document_retry_service.py TTL sweep to dispatch direct-to-vision per §6.2.
- New helper _write_findings_to_consolidated_state(case, doc_id, result) extends consolidated_state.documents with the findings dict + status enum.
- Extend consolidated_state_writer.write_consolidated_state to include documents in the transform output (closes the Phase 2 DoD review's Gap 1: currently consolidated_state_backfill._empty_consolidated_state() returns "documents": {} and the writer never updates it). Phase 2 PR-2.1 was supposed to land this; Phase 4 makes it a MUST since Phase 4 produces the content.
- Integration test: upload a fixture knee X-ray, simulate OCR complete → assert case.consolidated_state.documents.{doc_id}.findings.joint_space_mm is populated.
- Integration test: simulate OCR queued > 300s → assert vision interpretation fires and findings land.
- Acceptance: integration tests pass; document_findings_complete (Phase 1) returns True for a fully-interpreted TKR case.
PR-4.4: Prompt rendering integration + observability (~0.5 day)
- Confirm Phase 2 <documents> block (already shipped, docs/specs/v6.2-flexible-phase2-prompt-assembly-spec.md §3.1) renders the findings dict from consolidated_state.documents.{doc_id}.findings. The Phase 2 template handles this; PR-4.4 just adds a fixture test that the rendered prompt for a TKR case with populated findings contains the expected text "knee_xray: joint_space_mm=2.1mm (severe narrowing); osteophyte_grade=3".
- Hook auto_invoke_matcher to also check SOPContract.document_findings_complete(case.consolidated_state) under the v6.2 sticky arch (Phase 2 §12.3 already added the missing_for_matching check; Phase 4 adds the documents_complete check as a parallel gate).
- Metabase panels: document_interpretation_outcome distribution by status / tier / doc_type / SOP. Alert if failed_permanent rate > 5% for any single doc_type over 1h.
- Langfuse trace metadata stamps: doc_interpretation_count, doc_interpretation_total_cost_usd_cents per case (rolled into extra_metadata of the assistant message).
- Acceptance: Sindhu's case 0a6b7e48 X-ray (re-uploaded through canary) produces non-empty findings; the v6.2 prompt rendered for the next turn shows the populated Findings: joint_space_mm=2.1mm; osteophyte_grade=3; malalignment_deg=4.5 instead of (no documents on file). This is the structural close of the D1-D6 hallucination class.
Total chain: ~3.5 working days of effort, or ~3 days elapsed, since PR-4.0 (schema population) and PR-4.1 (preprocessor) can run in parallel (they touch disjoint files). PR-4.2 through PR-4.4 are sequential.
9. Sequencing with Phase 2 / Phase 3
Phase 2's prompt template already renders the documents block and handles empty findings gracefully: it shows "Findings: (not yet extracted)" until Phase 4 fills the dict.
Sequencing recommendation: dispatch Phase 4 PRs IN PARALLEL with Phase 2 PR-2.3 (read-path swap). Both can land independently. The join point is Phase 2.5 Sindhu rollout — Phase 4 should ship BEFORE Phase 2.5 because:
- Sindhu's case 0a6b7e48 had multiple documents that produced D1-D6 hallucinations. Without Phase 4, Sindhu's first canary cases will see empty findings, so the structural fix Phase 2 advertises only half-fires.
- The Phase 2 DoD review's Gap 1 specifically called out that consolidated_state.documents was never being populated; PR-4.3 closes this. Without PR-4.3, the Phase 2 §3.1 <documents> block renders "(no documents on file)" universally.
If timeline forces a choice between Phase 4 and Phase 2.5: ship Phase 4 first, defer Phase 2.5 by ~3 days. The v6.2 prompt architecture isn't usefully testable for Sindhu without documents wired through.
Phase 3 (extractor reconciler) is independent of Phase 4 — it touches extractor outputs, not document findings. They can ship in any order.
Existing-doc backlog (B2 gating requirement)
The runner + sweeper only process new documents whose ocr_status transitions
through pending/queued/processing. Documents that completed OCR before
PR-4.3 ships have extracted_findings=NULL and are never processed by the live
path. The B2 claim "v6.2 prompt renders real findings" is FALSE for those docs
until the backfill runs.
Gating sequence — PR-2.5 Sindhu rollout MUST be preceded by:
1. Run the backfill script (scripts/backfill_vision_interpretation.py) against Sindhu's tenant, dry-run first.
2. Verify the backlog drained to zero using the script's final log line ("Remaining backlog (complete + no findings): 0") or the spot-check SQL in docs/runbook/v6-2-rollout-checklist.md.
3. Only after backlog = 0 proceed to the PR-2.5 Sindhu flag flip.
The script (scripts/backfill_vision_interpretation.py) is idempotent — docs
that already have findings are skipped.
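The idempotency property can be illustrated with an in-memory sketch. It is not the contents of scripts/backfill_vision_interpretation.py: the dict-shaped docs, the `interpret` callable, and the `dry_run` parameter are assumptions for illustration. Only the selection rule (OCR completed, findings still NULL) and the skip-if-already-populated behavior come from the text above.

```python
from typing import Any, Callable


def _needs_backfill(doc: dict[str, Any]) -> bool:
    """Backlog rule: OCR finished but no findings were ever extracted."""
    return doc["ocr_status"] == "completed" and doc.get("extracted_findings") is None


def backfill(
    docs: list[dict[str, Any]],
    interpret: Callable[[dict[str, Any]], dict[str, Any]],
    dry_run: bool = True,
) -> int:
    """Interpret backlog docs (unless dry_run) and return the remaining backlog size."""
    pending = [doc for doc in docs if _needs_backfill(doc)]
    if not dry_run:
        for doc in pending:
            doc["extracted_findings"] = interpret(doc)
    return sum(1 for doc in docs if _needs_backfill(doc))
```

A second real run finds nothing pending and calls `interpret` zero times, which is what makes re-running the script safe before the flag flip.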
10. Risk model
| ID | Risk | Mitigation |
|---|---|---|
| R1 | Vision API per-image cost balloons beyond estimate | Tiered model selection (Haiku for booleans, Sonnet for measurements). Metabase panel + Telegram alert when monthly vision spend crosses $20 (2× projected). Per-SOP tier override in model_registry.yaml for hot-fix. |
| R2 | Vision hallucinates fields outside schema | Schema-bounded parser drops non-schema fields. Off-schema observations relegated to notes (not surfaced to LLM). Human spot-check on first 50 production interpretations + a notes-distribution Metabase panel to catch systematic off-schema leakage. |
| R3 | PHI sent in vision call without BAA coverage | Verify Anthropic BAA covers image content (§11 Q4). Strip EXIF in preprocessor (test pinned). No PHI in synthesized prompt (schema-only). Same PII-redaction layer that gates conversation LLM applies — verify in PR-4.2 reviewer pass. |
| R4 | OCR + vision both run and produce contradictory data | OCR populates ocr_text (free-form), vision populates findings (structured). No contradiction surface — they're disjoint columns. If both fail, Phase 2 prompt template tells patient to describe verbally. |
| R5 | Sonnet vision call latency stalls the turn | All Phase 4 calls are out-of-band (post-OCR or TTL-driven), NOT synchronous to the conversation turn. Findings appear in the NEXT turn's prompt. Worst case: 60s lag, well below the ttl_seconds=300 declared per SOP. |
| R6 | Naidu's schema review changes the field set; production findings become stale | SOPContract.invalidate_cache fan-out exists (sop_contract.py:649-680). Re-interpretation deferred to Phase 4.5 admin endpoint; until then, schema changes propagate forward but old findings remain (with their original schema). Acceptable for MVP. |
| R7 | Vision call cost-tracking gap (similar to the pre-_record_usage bug at claude_pdf_extractor.py) | llm_gateway.invoke is mandatory per .claude/rules/backend-agents.md. No direct anthropic.Anthropic().messages.create() calls in document_interpreter.py. Reviewer subagent (code-reviewer) must flag any direct SDK use. |
| R8 | Per-tenant rollout: vision interpretation needs an off switch for tenants whose docs aren't ready | New Flagsmith flag document_interpretation_enabled (default false). Phase 2.5 / Phase 4.5 rollout flips it for Sindhu's tenant first, then percentage-rolls. Default off preserves current behavior (Phase 2 prompt shows "(not yet extracted)"). |
11. Open questions for SD
1. Should vision interpretation be always-on (every uploaded doc gets analyzed) or on-demand (only when SOP needs that doc)? Lean: on-demand. The current spec routes through the SOP's required_documents[] lookup; docs not declared in the SOP get status=skipped_not_in_sop and consume no vision budget. Cost is the driver: at 1,000 cases/mo with always-on across all uploaded files, the $10-15/mo estimate could grow 3-5× because patients upload incidental files (insurance cards, passport scans) we don't need to interpret.
2. Should the vision call run synchronously (blocks the conversation turn until findings are ready) or async (turn proceeds with "doc analyzing", findings appear in next turn)? Lean: async. Vision calls run 2-8s at Sonnet tier; synchronous would add that latency to every turn after an upload. Async means turn N renders status=processing, turn N+1 renders findings. Phase 2's §7 enum already supports this (processing value + eta_seconds).
3. Default vision model tier: Haiku (cost) or Sonnet (clinical accuracy)? Or per-SOP override? Lean: per-SOP override with the §2.2 schema-driven default (Haiku for booleans, Sonnet for numerics). The model_registry.yaml per-SOP override lets admin dial it. Default schema-driven choice keeps the simple case simple.
4. PR-4 chain order and the Sindhu rollout dependency: ship PR-4.* before PR-2.5? Lean: yes, ship Phase 4 PR-4.0 through PR-4.4 BEFORE flipping prompt_arch_v6_2_flexible for Sindhu's tenant. Without Phase 4, Sindhu's canary cases will see empty findings and the structural fix Phase 2 advertises (D1-D6 close) won't actually fire. The 3-day Phase 4 delay is small relative to the credibility cost of a canary that visibly under-delivers on the documents axis.
12. Summary
Phase 4 fills the document findings dict that Phase 2's prompt template already renders. The pipeline is: SOP declares interpretation_schema (Phase 1, already in place for TKR + ACL) → vision preprocessor pulls the file from R2 and resizes to under 3.5MB (new) → prompt synthesizer emits a schema-bounded vision-LLM prompt (new) → llm_gateway.invoke runs the call at Haiku or Sonnet tier per schema type (new) → schema-bounded parser writes findings into consolidated_state.documents.{doc_id}.findings (new, also closes the Phase 2 Gap 1 documents writer issue) → Phase 2 prompt template surfaces the findings → SOPContract.document_findings_complete (Phase 1, already in place) gates auto_invoke_matcher promotion.
Estimated incremental monthly cost: $1-2 at canary scale, ~$10-15 at 1,000 cases/mo. Tiered model selection keeps vision cost to roughly 11-22% of conversation LLM spend ($1-2 against ~$9/mo at 100 cases; $10-15 against ~$90/mo at 1,000). No commercial radiology API; no DICOM parsing (the JPG-derived files patients upload are sufficient for the joint-space / grade fields the SOPs declare).
The architectural bet: schema-declared vision extraction beats free-form OCR + downstream parsing because (a) the LLM sees the SOP's required fields directly in the synthesizer prompt, (b) the parser is schema-bounded so hallucinations can't sneak through, (c) the same interpretation_schema is the single source of truth from SOP YAML to SOP completeness gate. The Sindhu D1-D6 hallucination class — "the extraction failed", "I see your X-ray uploading now" — closes structurally because the LLM finally sees real findings instead of empty placeholders.