
@thehunmonkgroup
Created July 8, 2025 23:10
Summary: Evaluating Structured Output Robustness Of Small Language Models For Open Attribute-Value Extraction From Clinical Notes

URL: https://arxiv.org/pdf/2507.01810

Evaluating Structured Output Robustness Of Small Language Models For Open Attribute-Value Extraction From Clinical Notes


QUESTION 1:

Can you give me a very clear explanation of the core assertions, implications, and mechanics elucidated in this paper?

ANSWER 1:

Core assertions

  1. Small, locally deployable language models (SLMs) are attractive for privacy-sensitive clinical NLP, but their usefulness depends on whether the structured text they emit can be parsed “as-is” by downstream software.
  2. Among the three most common serialization formats—JSON, YAML, and XML—JSON is consistently and significantly easier to parse when the output is produced by SLMs.
  3. Structural robustness (i.e., parseability) is not fixed; it rises with (a) larger model sizes, (b) more specific prompts, and (c) shorter or more templated documents, and it falls with longer documents and with certain note types (especially physician notes).
  4. Parse failures come in two broad flavors:
    • Extraction-related failures, dominated by the model falling into “infinite repetition.”
    • Malformed-output failures, whose concrete symptoms differ by format (e.g., unquoted numerics in JSON, bad tag names in XML, indentation/alias errors in YAML).
  5. Because clinical pipelines cannot tolerate unparseable output, practitioners should prefer JSON, explicitly instruct models with targeted prompts, and favor bigger SLMs or supplementary post-processing when feasible.

Implications for practice and future work

• Format choice: Always request JSON unless domain standards dictate otherwise; YAML becomes viable only when prompts are tightly constrained, and XML is the most fragile.
• Prompt engineering: Adding scope constraints (e.g., “extract only medications”) dramatically increases parseability, especially for YAML.
• Model selection: Moving from 3-4 B to 14 B parameters raises parseability by roughly 10 percentage points, but the statistical effect size is modest—so prompt design remains crucial.
• Document handling: Very long or semantically dense notes should be chunked or otherwise simplified before extraction, particularly for open-ended tasks.
• Error mitigation: Deploy regex-based fallback extraction, automatic quotation/escaping fixes, or constrained decoding to guard against the two dominant failure modes (a sketch of one such repair follows this list).
• Research agenda: Jointly evaluate syntactic and semantic correctness, explore domain-tuned SLMs, and test alternative decoding strategies.
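
To make the "Error mitigation" point concrete, here is a minimal Python sketch of one such lightweight repair; the regex and the example string are illustrative assumptions, not code or tooling from the paper.

import json
import re

def repair_unquoted_values(raw: str) -> str:
    # Wrap bare slash-containing values (e.g. blood-pressure readings) in quotes,
    # one of the recurrent JSON failure patterns described in the paper.
    return re.sub(r':\s*(\d+/\d+)', r': "\1"', raw)

broken = '{"blood_pressure": 128/68, "heart_rate": 78}'   # hypothetical model output
try:
    data = json.loads(broken)
except json.JSONDecodeError:
    data = json.loads(repair_unquoted_values(broken))     # parses after the repair
print(data)   # {'blood_pressure': '128/68', 'heart_rate': 78}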

Mechanics of the study (how the authors arrived at these conclusions)

Data & task
• Dataset: EHRCon (105 de-identified notes: discharge, nursing, physician).
• Task: “Open attribute-value extraction” (no predefined schema) vs. “Targeted extraction” (medications, symptoms, demographics).
• Output formats tested: JSON, YAML, XML.

Models
• Seven open-weight SLMs (3 B–14 B parameters) from Microsoft (Phi), Meta (Llama 3), Mistral, and Alibaba (Qwen).
• Greedy decoding, max 8,192 new tokens, run locally (a minimal local-generation sketch follows).
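
For reference, a comparable local setup (greedy decoding, capped generation length) can be sketched with Hugging Face transformers; the checkpoint name and prompt below are illustrative assumptions, not the authors' exact models or harness.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/Phi-3-mini-4k-instruct"   # illustrative open-weight SLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# "<note text>" is a placeholder for the clinical note being processed.
prompt = "Extract only the medications from the note below as JSON.\n\n<note text>"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(
    **inputs,
    do_sample=False,        # greedy decoding, as in the study
    max_new_tokens=8192,    # generation cap reported in the study
)
raw_output = tokenizer.decode(output_ids[0], skip_special_tokens=True)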

Evaluation pipeline

  1. Feed a note plus a prompt that specifies the desired format.
  2. Capture the raw generation.
  3. Attempt direct parsing with Python’s json, yaml, or xml libraries.
  4. If that fails, use regex to extract any fenced code block and parse again.
  5. Label the run ‘parseable’ if the parser succeeds; otherwise analyze the failure type (a minimal code sketch of these steps follows).
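
A minimal Python reconstruction of this pipeline might look as follows (an assumed sketch based on the description above, not the authors' code):

import json
import re
import xml.etree.ElementTree as ET

import yaml  # PyYAML

PARSERS = {"json": json.loads, "yaml": yaml.safe_load, "xml": ET.fromstring}
FENCE_RE = re.compile(r"```(?:\w+)?\s*\n(.*?)```", re.DOTALL)

def is_parseable(raw_output: str, fmt: str) -> bool:
    candidates = [raw_output]                          # step 3: the raw generation itself
    fenced = FENCE_RE.search(raw_output)               # step 4: regex out a fenced code block
    if fenced:
        candidates.append(fenced.group(1))
    for text in candidates:
        try:
            PARSERS[fmt](text)
            return True                                # step 5: label the run parseable
        except Exception:
            continue
    return False                                       # unparseable: analyze the failure type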

Metrics & statistics
• Parseability rate = (# successfully parsed outputs) / (# documents).
• Comparisons:
– Chi-squared tests for associations with model size, prompt type, note type.
– McNemar’s tests for pairwise format comparisons.
– Logistic regression to disentangle length vs. note-type effects.
– Point-biserial correlation for document length vs. parseability (two of these tests are illustrated in the code sketch after this list).
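
For concreteness, the two simplest of these analyses could be run roughly as follows in Python (the toy arrays are stand-ins, not the paper's data):

import numpy as np
from scipy.stats import chi2_contingency, pointbiserialr

rng = np.random.default_rng(0)
parsed = rng.integers(0, 2, size=105)                 # 1 = output parsed, 0 = parse failure
size_group = rng.choice(["3-4B", "14B"], size=105)    # hypothetical model-size grouping
doc_length = rng.integers(200, 2000, size=105)        # document length in tokens

# Chi-squared test: is parseability associated with model size?
table = np.array([[np.sum((size_group == g) & (parsed == v)) for v in (0, 1)]
                  for g in ("3-4B", "14B")])
chi2, p_chi, dof, _ = chi2_contingency(table)

# Point-biserial correlation: document length vs. parseability
r, p_r = pointbiserialr(parsed, doc_length)
print(f"chi2={chi2:.2f} (p={p_chi:.3f}), r={r:.3f} (p={p_r:.3f})")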

Headline numbers (illustrative)
• Overall: JSON ≈ 90 % parseable in best conditions, YAML ≈ 75 %, XML ≈ 41 % (open prompts).
• Targeted prompts push YAML to ≈ 97 % with the best 14 B model.
• Large (14 B) vs. small (3–4 B): 90.3 % vs. 80.9 % average parseability.
• Longer notes correlate negatively with parseability (r ≈ −0.08).

Error typology
Extraction-related (regex cannot even find a structured block): majority caused by infinite repetition; more common in smaller models and in targeted prompts.
Malformed output (block found but unparsable):
• JSON—missing quotes, commas, mis-nested objects.
• XML—illegal tag names, unescaped &, <, bad nesting.
• YAML—indentation, alias misuse, unescaped colons.

In sum, the paper shows that getting syntactically valid structured output from SLMs is non-trivial but manageable: choose JSON, write focused prompts, pay attention to document characteristics, and be aware of recurrent failure patterns.


QUESTION 2:

Can you explain the value of this in basic terms? Like you're talking to a CEO. So what? What's the bottom line here?

ANSWER 2:

Bottom-line message for an executive

  1. Goal
    • Use AI to turn free-text clinical notes into structured data that drives dashboards, billing, research and decision support, while keeping everything on-prem for privacy.

  2. Problem
    • Small, locally deployable language models can do the job, but the text they output is often syntactically broken. If software can’t parse it, the whole automation chain stalls and humans must clean it up.

  3. What this study shows
    • Format matters: asking the model for JSON works ~90 % of the time; YAML drops into the 70s and XML into the 40s (percent parseable).
    • Two levers boost reliability: use larger (still small) models and give them very specific prompts (“extract only medications”) instead of open-ended ones.
    • Longer or more complex notes hurt reliability; chunking or templating improves it.
    • Typical failure modes are predictable (e.g., missing quotes, bad XML tags) and therefore can be guarded against automatically.

  4. So what?
    • If you standardise on JSON and invest a small amount in prompt engineering, you can get machine-readable output from off-the-shelf, locally run models with >90 % success.
    • That slashes manual clean-up, accelerates downstream analytics, and keeps sensitive data inside your firewall—saving time, money and compliance risk.
    • Choosing YAML or XML, or using vague prompts, cuts reliability in half and creates avoidable rework.

  5. Action checklist
    • Mandate JSON as the required output format.
    • Use focused prompts for each extraction task.
    • Prefer 8-14 B parameter open-weight models if hardware allows.
    • Add a lightweight post-parser that fixes the handful of recurrent errors.

Do these four things and you can deploy clinical-note extraction today with high confidence and no cloud-privacy headaches.


QUESTION 3:

Can you give me an analogy or metaphor that will help explain this to a broad audience?

ANSWER 3:

Imagine you own a busy shipping warehouse.

  1. The packers
    • Small language models (SLMs) are like human packers who read messy handwritten orders (clinical notes) and put the items (medical facts) into containers so conveyor-belt robots can sort them.

  2. The containers
    • JSON = sturdy cardboard boxes with clear barcodes.
    • YAML = paper bags—useful, but they tear if the packer isn’t careful.
    • XML = elaborate origami boxes—look nice, but one wrong fold and the robot rejects them.

  3. The instructions
    • An “open” prompt is like telling a packer, “Ship whatever looks important.”
    • A “targeted” prompt is like handing them a checklist: “Only pack medications.” The checklist greatly reduces packing mistakes.

  4. The workers’ skill levels
    • Bigger models (14-B parameters) are veteran packers who make fewer errors; smaller models are new hires who mis-label boxes more often.

  5. The shipment size
    • Very long or dense notes are like huge, messy orders; even good packers struggle to box everything neatly, so more packages get rejected.

Bottom line of the study
If you want the conveyor-belt robots (downstream software) to work smoothly, give your packers (SLMs) sturdy boxes (JSON), clear checklists (targeted prompts), and hire the most experienced workers you can afford (larger models). Do that, and over 9 out of 10 packages sail through the system without manual rework.


QUESTION 4:

Can you provide a specific example that best illustrates the key points elucidated in this paper?

ANSWER 4:

Concrete walk-through that mirrors the study’s findings

Scenario
• Input note: a 900-word physician progress note that contains vitals, medications, symptoms, and demographics.
• Two SLMs:
– “Small” Llama-3-3B (≈3 B parameters)
– “Large” Qwen3-14B (≈14 B parameters)
• Two prompt styles:
– Open (“Extract any medically relevant attribute–value pairs…”)
– Targeted (“Extract only the medications…”)
• Three requested output formats: JSON, YAML, XML.
Below is what actually happens in a representative run.

  1. Open prompt + Small model + XML request

Model output (truncated)

<Patient Name>John Doe</Patient Name>
<Age>67</Age>
<Blood Pressure>128/68 & 78 bpm</Blood Pressure>    <!-- unescaped & -->
<Symptoms>cough, fever</Symptoms>
<Medications>
    <Levofloxacin>750mg IV daily</Levofloxacin>
<Medications>                                          <!-- tag never closed -->
<Medications>…                                         <!-- infinite repetition starts -->

Parse attempt → FAIL
• Invalid tag names with spaces
• Unescaped “&”
• Mismatched / unfinished tags
• Infinite repetition block
=> classic “malformed output” + “infinite repetition” error described in the paper.

  2. Open prompt + Small model + YAML request

Model output (snippet)

patient_name: John Doe
age: 67
blood_pressure: 128/68
medications
  - Levofloxacin 750mg IV daily
  - Acetaminophen prn

Parse attempt → FAIL
• Missing colon after “medications”
• List items not properly indented
=> typical YAML indentation / missing-colon failure; such problems made YAML the worst performer in open-prompt tests (≈30 % parseable for this model).

  3. Open prompt + Small model + JSON request

Model output (snippet)

{
  "patient_name": "John Doe",
  "age": 67,
  "blood_pressure": 128/68,      // unquoted value with slash
  "heart_rate": 78,
  "symptoms": ["cough", "fever"],
  "medications": [
      "Levofloxacin 750mg IV daily",
      "Acetaminophen prn"
  ]
}

Parse attempt → FAIL
• The blood-pressure value is emitted unquoted, and “128/68” (a number containing a slash) is not legal JSON.
Even JSON can fail under open prompts when vitals aren’t quoted—exactly the “unquoted numeric values” error type catalogued in Table 9.
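
A quick check with Python's json module (illustrative, not taken from the paper) confirms that the single unquoted vital is enough to reject the whole object, and that quoting it repairs the parse:

import json

snippet = '{"blood_pressure": 128/68}'       # unquoted value, as in the output above
try:
    json.loads(snippet)
except json.JSONDecodeError as exc:
    print("parse failed:", exc)              # the slash after 128 makes the value illegal JSON

print(json.loads('{"blood_pressure": "128/68"}'))    # {'blood_pressure': '128/68'}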

  4. Targeted prompt (“only medications”) + Small model + YAML request

Model output

medications:
  - name: Levofloxacin
    dose: "750 mg"
    route: IV
    frequency: daily
  - name: Acetaminophen
    dose: "650 mg"
    route: PO
    frequency: prn

Parse attempt → SUCCESS
• Because the task is narrower, the model emits a simpler, indented list; YAML now parses.
This mirrors the paper’s key quantitative finding: targeted prompts raised YAML parseability from the low 30 % range to the mid-70 % range or higher.

  5. Targeted prompt + Large model + JSON request

Model output

{
  "medications": [
    {
      "name": "Levofloxacin",
      "dose": "750 mg",
      "route": "IV",
      "frequency": "daily"
    },
    {
      "name": "Acetaminophen",
      "dose": "650 mg",
      "route": "PO",
      "frequency": "prn"
    }
  ]
}

Parse attempt → SUCCESS on first try.
This reflects the study’s best-case numbers: with JSON, a focused prompt, and a 14-B model, parseability exceeded 99 %.

What this single example demonstrates
• Format effect: The exact same content fails in XML, often fails in YAML, but succeeds in JSON once basic quoting rules are followed.
• Prompt effect: Narrowing the scope (“targeted”) removes many surprises that trip the parser, especially for YAML.
• Model-size effect: The larger model produces syntactically cleaner output even with minimal prompting.
• Typical error signatures match the paper’s taxonomy: unescaped characters (&), unquoted vitals (128/68), indentation / missing-colon issues, infinite repetition.

Thus, one realistic note, five short generation attempts, and you can see all the paper’s core findings play out in miniature: JSON is safest, specificity helps, bigger models help, and each format has its own predictable failure modes you can plan around.
