@stabgan
Created June 19, 2025 17:11
Extracting vasopressors using ontologies

A Comprehensive, Ontology-Driven Strategy for Vasopressor and Inotrope Value Set Generation

Introduction: The Imperative for an Ontology-Driven Value Set

The accurate calculation of the Sequential Organ Failure Assessment (SOFA) score is a cornerstone of modern critical care research, particularly in defining cohorts for conditions like sepsis-associated acute kidney injury (SA-AKI). A critical and often challenging component of this score is the cardiovascular assessment, which quantifies the level of pharmacological support required to maintain hemodynamic stability. Relying on a manually curated, hardcoded list of vasopressor and inotrope medications to identify these interventions is scientifically untenable. Such static lists are inherently fragile: they are prone to omissions, fail to account for the continuous introduction of new drug formulations and brand names, and are neither transparent nor reproducible.1 An incomplete value set leads directly to underestimation of cardiovascular support and therefore to erroneously low SOFA scores. This misclassification understates the severity of a patient's condition and can compromise, or even invalidate, a study's findings.

To address these shortcomings, this report outlines a robust, multi-pronged strategy for programmatically generating a definitive value set for vasopressor and inotrope medications. The proposed methodology moves beyond simple lexical matching and leverages a rich ecosystem of standard medical ontologies, including RxNorm, the Anatomical Therapeutic Chemical (ATC) classification system, and SNOMED CT. This approach treats the creation of a value set not as a one-time technical task, but as a core component of rigorous, reproducible clinical science. The objective is to construct a comprehensive, semantically grounded lookup table that can be used to reliably identify all relevant medication administrations within source Electronic Health Record (EHR) data.

A key advantage of this ontology-driven process is its capacity to create a "living" value set. Medical knowledge is not static; new drugs are approved, and new brand names enter the market.1 Standard terminologies like RxNorm are updated regularly to reflect these changes. A scripted, ontology-based generation process can be re-executed periodically against these updated terminologies. This ensures the value set remains current and comprehensive over the lifetime of a research project, a critical feature for ensuring long-term validity and adapting to the evolving landscape of pharmacology.

Section 1: Foundational Lexicon and Pharmacological Framework

Before querying complex terminologies, it is essential to establish a foundational understanding of the target agents. This involves creating a comprehensive lexical corpus to seed our searches and defining a clear pharmacological classification to guide the application of clinical rules, such as those for the SOFA score.

1.1 The Core Lexical Corpus

The initial step in identifying concepts within large terminologies is a lexical search. A broad and inclusive set of search terms is required to maximize the initial recall of potentially relevant concepts. This corpus, synthesized from extensive review of clinical literature and drug databases, will be used to query the string or term fields within the ontology tables (e.g., rxnorm.rxnconso.str, snomed_description.term).

Table 1: Comprehensive Lexical Search Terms

Category | Search Terms (Case-Insensitive)
Generic Names | Norepinephrine, Epinephrine, Vasopressin, Dopamine, Dobutamine, Phenylephrine, Angiotensin II, Isoproterenol, Milrinone, Levosimendan, Terlipressin [4]
Common Brand Names | Levophed, Adrenalin, Vasostrict, Pitressin, Intropin, Dobutrex, Neo-Synephrine, Giapreza, Isuprel, Primacor, Vazculep, Biorphen, Auvi-Q, EpiPen [1]
Drug Classes & Keywords | "vasopressor", "pressor", "inotrope", "inotropic agent", "adrenergic agent", "sympathomimetic", "catecholamine", "cardiotonic", "cardiac stimulant" [5]
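
A lexical pass of this kind can be sketched in PostgreSQL as follows. This is a minimal illustration, not a definitive implementation: it assumes the snomed_ct.snomed_description table used later in this report, with concept_id, term, and active columns, and the search_terms list is a hypothetical inline subset of Table 1. ILIKE supplies the case-insensitive match.

```sql
-- Sketch: case-insensitive lexical search over SNOMED CT descriptions.
-- search_terms is a hypothetical inline subset of Table 1; extend as needed.
WITH search_terms (term_pattern) AS (
    VALUES
        ('%norepinephrine%'),
        ('%levophed%'),
        ('%vasopressin%'),
        ('%dobutamine%'),
        ('%vasopressor%'),
        ('%inotrop%')          -- covers "inotrope" and "inotropic agent"
)
SELECT DISTINCT
    d.concept_id,
    d.term,
    t.term_pattern AS matched_pattern
FROM snomed_ct.snomed_description AS d
JOIN search_terms AS t
    ON d.term ILIKE t.term_pattern
WHERE d.active = true
ORDER BY d.concept_id, d.term;
```

Substring patterns are deliberately broad at this stage to maximize recall. They will also return unrelated concepts, which is why the results need semantic filtering before they enter the value set.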

1.2 Clinical Classification of Target Agents

Simply identifying a drug is insufficient for accurate SOFA scoring. The scoring rules explicitly differentiate between specific agents and dose ranges.5 For instance, dobutamine (a pure inotrope) at any dose scores 2 points on the cardiovascular component, whereas norepinephrine or epinephrine at any dose scores at least 3. Furthermore, from a research perspective, distinguishing between agents with pure vasopressor effects versus those with mixed inotropic and vasopressor effects allows for more nuanced subgroup analyses. The classification below, based on primary receptor activity and clinical effect, provides the necessary framework for the drug_class attribute in the final value set.4

Table 2: Pharmacological Classification of Primary Vasoactive Agents

Drug | Class | Primary Receptor(s) / Mechanism
Norepinephrine | Vasopressor / Inotrope (Inopressor) | α1 > β1
Epinephrine | Vasopressor / Inotrope (Inopressor) | α1, β1, β2
Vasopressin / Terlipressin | Vasopressor (Non-adrenergic) | V1
Phenylephrine | Vasopressor (Pure) | α1
Angiotensin II | Vasopressor (Non-adrenergic) | AT1
Dopamine | Vasopressor / Inotrope (Dose-dependent) | D > β1 > α1
Dobutamine | Inotrope | β1 > β2
Milrinone | Inotrope (Inodilator) | PDE-3 Inhibitor
Isoproterenol | Inotrope / Chronotrope | β1, β2
Levosimendan | Inotrope (Calcium Sensitizer) | Troponin C
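
To carry this classification into the final value set, Table 2 can be encoded as a small reference relation keyed by generic name. The following is a minimal PostgreSQL sketch: the relation name drug_class_map and its column names are hypothetical, and the rows simply restate Table 2.

```sql
-- Sketch: Table 2 as an inline reference relation (names are hypothetical).
WITH drug_class_map (generic_name, drug_class) AS (
    VALUES
        ('norepinephrine', 'Vasopressor / Inotrope (Inopressor)'),
        ('epinephrine',    'Vasopressor / Inotrope (Inopressor)'),
        ('vasopressin',    'Vasopressor (Non-adrenergic)'),
        ('terlipressin',   'Vasopressor (Non-adrenergic)'),
        ('phenylephrine',  'Vasopressor (Pure)'),
        ('angiotensin ii', 'Vasopressor (Non-adrenergic)'),
        ('dopamine',       'Vasopressor / Inotrope (Dose-dependent)'),
        ('dobutamine',     'Inotrope'),
        ('milrinone',      'Inotrope (Inodilator)'),
        ('isoproterenol',  'Inotrope / Chronotrope'),
        ('levosimendan',   'Inotrope (Calcium Sensitizer)')
)
-- Example lookup: classify a free-text label by whole-word, case-insensitive match.
-- \m and \M are PostgreSQL regex anchors for the start and end of a word.
SELECT m.generic_name, m.drug_class
FROM drug_class_map AS m
WHERE 'Norepinephrine 8mg in 250mL D5W' ~* ('\m' || m.generic_name || '\M');
```

The word-boundary anchors matter here: a plain substring test for "epinephrine" would also match "Norepinephrine" and assign the label to two drugs. With the anchors, the example returns only the norepinephrine row.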

Section 2: A Multi-Pronged Ontology Querying Strategy

A single ontology is insufficient to capture the full spectrum of relevant drugs and their relationships. This section details a multi-pronged strategy that leverages the unique strengths of ATC, RxNorm, and SNOMED CT to build a comprehensive list of concepts, which will then be filtered for clinical relevance using UMLS semantic types.

2.1 Traversing the Anatomical Therapeutic Chemical (ATC) Hierarchy

The ATC system, maintained by the World Health Organization, provides a robust, hierarchical classification of drugs based on their therapeutic and pharmacological properties.13 This top-down approach is ideal for identifying entire classes of relevant agents.

The strategy begins by identifying appropriate seed classes within the ATC hierarchy. The most relevant class is C01C (CARDIAC STIMULANTS EXCLUDING CARDIAC GLYCOSIDES), whose subclasses include C01CA (Adrenergic and dopaminergic agents), C01CE (Phosphodiesterase inhibitors), and C01CX (Other cardiac stimulants). C01CA and C01CE capture most of the target catecholamine and inodilator agents, while C01CX holds agents such as levosimendan (C01CX08).13 However, a search limited to the 'C' (Cardiovascular System) branch would be incomplete. Vasopressin and its analogues, which are critical vasopressors, are classified under the 'H' (Systemic Hormonal Preparations) branch. Therefore, it is essential to also include H01BA (Vasopressin and analogues) in the query.14 This demonstrates the necessity of a multi-class approach, as clinically related drugs can reside in disparate parts of the ontology.

The query will target the atc.who_atc_ddd table to find all level-5 chemical substances (is_drug = true) that fall under these parent classes.

Table 3: Key ATC Codes for Seeding

ATC Code | Description
C01C | Cardiac stimulants excl. cardiac glycosides
C01CA | Adrenergic and dopaminergic agents
C01CE | Phosphodiesterase inhibitors
C01CX | Other cardiac stimulants
H01BA | Vasopressin and analogues

SQL

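-- SQL Example: Extract level 5 drug names from relevant ATC classes
SELECT DISTINCT
    atc_name AS concept_name,
    atc_code AS concept_code,
    'ATC' AS source_ontology
FROM
    atc.who_atc_ddd
WHERE
    is_drug = true AND (
        atc_code LIKE 'C01CA%' OR  -- adrenergic and dopaminergic agents
        atc_code LIKE 'C01CE%' OR  -- phosphodiesterase inhibitors
        atc_code LIKE 'C01CX%' OR  -- other cardiac stimulants (e.g., levosimendan)
        atc_code LIKE 'H01BA%'     -- vasopressin and analogues
    );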

2.2 Deconstructing RxNorm for Comprehensive Drug Product Identification

RxNorm is the United States standard for clinical drugs and is indispensable for mapping to EHR data, particularly for identifying National Drug Codes (NDCs) and all available product formulations.16 The true power of RxNorm for this task lies not in a simple lexical search of the rxnconso table, but in leveraging the rxnrel table, which explicitly encodes the relationships between drug concepts.17

The strategy is a two-step expansion process:

  1. Identify Seed Ingredients: First, query rxnorm.rxnconso using the lexical terms from Table 1. The search is restricted to term types (tty) of 'IN' (Ingredient) and 'PIN' (Precise Ingredient) and the RxNorm source vocabulary (sab = 'RXNORM'). This provides a clean, foundational set of RxNorm Concept Unique Identifiers (RxCUIs) for the core active ingredients (e.g., Norepinephrine, Norepinephrine Bitartrate).19
  2. Expand to All Related Products: Next, use these seed RxCUIs to query the rxnorm.rxnrel table. By joining on rxcui1 (the ingredient concept) and following relationships (rela) such as 'ingredient_of', 'precise_ingredient_of', and 'has_tradename', it is possible to programmatically "explode" the initial set. RxNorm generally links an ingredient to its products through intermediate concepts (for example, a drug component, SCDC, or a brand name, BN) rather than directly, so the traversal may need more than one hop, and each relationship is stored in both directions under inverse names. This automated discovery process identifies the related rxcui2 concepts, including every specific 'SCD' (Semantic Clinical Drug, e.g., "norepinephrine 1 MG/ML Injectable Solution") and 'SBD' (Semantic Branded Drug, e.g., "norepinephrine 1 MG/ML Injectable Solution [Levophed]"), so that formulations and brand names are not missed.18
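
Step 1 can be sketched as a prefix match on the ingredient names. This is an illustrative PostgreSQL query under stated assumptions: seed_names is a hypothetical inline subset of the generic names in Table 1, and the rxnorm.rxnconso columns (rxcui, str, tty, sab) are those named in the text above.

```sql
-- Sketch of step 1: resolve seed ingredient RxCUIs by name.
-- seed_names is a hypothetical inline subset of the generic names in Table 1.
WITH seed_names (generic_name) AS (
    VALUES
        ('norepinephrine'), ('epinephrine'), ('vasopressin'),
        ('dopamine'), ('dobutamine'), ('phenylephrine'), ('milrinone')
)
SELECT DISTINCT
    c.rxcui,
    c.str AS ingredient_name,
    c.tty
FROM rxnorm.rxnconso AS c
JOIN seed_names AS s
    -- Prefix match keeps salt forms (e.g. "norepinephrine bitartrate") while
    -- preventing 'epinephrine' from matching "norepinephrine".
    ON LOWER(c.str) LIKE s.generic_name || '%'
WHERE c.sab = 'RXNORM'
  AND c.tty IN ('IN', 'PIN')
ORDER BY ingredient_name;
```

Restricting to the IN and PIN term types keeps the seed set small enough to inspect by eye before it is expanded through rxnrel.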

SQL

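-- SQL Example: Find clinical (SCD) and branded (SBD) drugs for the ingredient Norepinephrine
-- The seed is resolved by name rather than by a hardcoded RxCUI.
-- RxNorm links an ingredient to its products through a drug component (SCDC),
-- so the expansion takes two hops: IN -> SCDC -> SCD/SBD.
-- Each relationship is stored in both directions under inverse names; listing both
-- names keeps the filter independent of the RXNREL direction convention.
-- Verify the relationship names against your RxNorm load.
WITH seed_ingredient AS (
    SELECT DISTINCT rxcui
    FROM rxnorm.rxnconso
    WHERE sab = 'RXNORM' AND tty = 'IN' AND LOWER(str) = 'norepinephrine'
)
SELECT DISTINCT
    c2.str AS concept_name,
    c2.rxcui AS concept_id,
    'RxNorm' AS source_ontology
FROM
    seed_ingredient s
JOIN
    rxnorm.rxnrel r1 ON r1.rxcui1 = s.rxcui          -- hop 1: ingredient -> component
    AND r1.rela IN ('ingredient_of', 'has_ingredient')
JOIN
    rxnorm.rxnrel r2 ON r2.rxcui1 = r1.rxcui2        -- hop 2: component -> product
    AND r2.rela IN ('constitutes', 'consists_of')
JOIN
    rxnorm.rxnconso c2 ON r2.rxcui2 = c2.rxcui
WHERE
    c2.tty IN ('SBD', 'SCD')
    AND c2.sab = 'RXNORM';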

2.3 Navigating the SNOMED CT Clinical Terminology

SNOMED CT offers a comprehensive, formal ontology of clinical concepts structured by formal logic, primarily through Is a relationships. This structure is perfectly suited for defining a class of substances (e.g., "Vasopressor agent") and programmatically identifying all of its members.22

The strategy relies on identifying the correct high-level "seed" concepts and then recursively traversing the hierarchy to find all descendants. Using a SNOMED CT browser, the definitive seed concepts have been identified.23

Table 4: Definitive SNOMED CT Seed Concepts

Concept Name | Concept ID
Vasopressor agent (substance) | 372695006
Cardiotonic agent (substance) | 372807009

The core of the SNOMED CT query is a recursive Common Table Expression (CTE). The query starts with the seed concept IDs and iteratively queries the snomed_relationship table to find all concepts (source_id) that have an Is a relationship (identified by type_id = 116680003) to the concepts already in the set (destination_id). This process continues until the entire sub-hierarchy is traversed, yielding a complete list of all substances that are subtypes of vasopressors or cardiotonics.

SQL

-- SQL Example: Recursively find all descendants of the 'Vasopressor agent' concept
WITH RECURSIVE concept_hierarchy AS (
    -- Anchor member: the seed concept
    SELECT id FROM snomed_ct.snomed_concept WHERE id = 372695006

    -- UNION (rather than UNION ALL) drops concepts reached through more than one parent
    UNION

    -- Recursive member: join to find children ('Is a' relationship)
    SELECT r.source_id
    FROM snomed_ct.snomed_relationship r
    JOIN concept_hierarchy h ON r.destination_id = h.id
    WHERE r.type_id = 116680003 AND r.active = true
)
SELECT
    h.id AS concept_id,
    d.term AS concept_name,
    'SNOMED_CT' AS source_ontology
FROM
    concept_hierarchy h
JOIN
    snomed_ct.snomed_description d ON h.id = d.concept_id
WHERE
    d.active = true AND d.type_id = 900000000000003001; -- Fully Specified Name

2.4 Semantic Filtering with the UMLS Metathesaurus

After aggregating concepts from ATC, RxNorm, and SNOMED CT, the list must be filtered to ensure all concepts are pharmacologically relevant. A lexical search for "dopamine," for example, could retrieve concepts related to neurotransmitter pathways rather than the administered drug. The UMLS Semantic Network provides a robust mechanism for this filtering.24

The strategy involves two steps: first, mapping all collected concepts to their UMLS Concept Unique Identifier (CUI) via the umls.mrconso table; second, retaining only those CUIs that are assigned a relevant semantic type in the umls.mrsty table. To balance precision with recall, a tiered filtering approach is recommended.

Concepts assigned the semantic types T200 |Clinical Drug| or T121 |Pharmacologic Substance| are of the highest relevance and can be included with high confidence.25 Broader types like T123 |Biologically Active Substance| and T109 |Organic Chemical| may also capture relevant agents but could introduce noise (e.g., metabolites, precursors). Concepts identified only through these broader types should be flagged for a brief manual review to ensure they represent administered medications. This tiered approach maximizes the comprehensiveness of the value set while maintaining a high-quality core.

Table 5: Recommended UMLS Semantic Types for Filtering

Tier | TUI | Semantic Type Name
Core (High Confidence) | T200 | Clinical Drug
Core (High Confidence) | T121 | Pharmacologic Substance
Expanded (Review Recommended) | T123 | Biologically Active Substance
Expanded (Review Recommended) | T109 | Organic Chemical
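
The tiered filter can be sketched as a single aggregate query. This PostgreSQL illustration rests on assumptions: saaki.staged_concepts is a hypothetical staging table holding the combined output of the ATC, RxNorm, and SNOMED CT queries (columns concept_name, concept_code, source_ontology), and umls.mrconso and umls.mrsty are assumed to carry the standard cui, sab, code, and tui columns.

```sql
-- Sketch: map staged concepts to CUIs and assign a filter tier.
-- saaki.staged_concepts is a hypothetical staging table (see lead-in).
SELECT
    m.cui,
    CASE
        WHEN bool_or(t.tui IN ('T200', 'T121')) THEN 'core'
        ELSE 'expanded_needs_review'
    END AS filter_tier,
    array_agg(DISTINCT t.tui ORDER BY t.tui) AS semantic_types
FROM saaki.staged_concepts AS sc
JOIN umls.mrconso AS m
    -- Translate the source_ontology labels into UMLS source abbreviations.
    ON m.sab = CASE sc.source_ontology
                   WHEN 'ATC'       THEN 'ATC'
                   WHEN 'RxNorm'    THEN 'RXNORM'
                   WHEN 'SNOMED_CT' THEN 'SNOMEDCT_US'
               END
   AND m.code = sc.concept_code
JOIN umls.mrsty AS t
    ON t.cui = m.cui
WHERE t.tui IN ('T200', 'T121', 'T123', 'T109')
GROUP BY m.cui;
```

A CUI lands in the core tier if any of its semantic types is T200 or T121. CUIs that carry only T123 or T109 are kept but labeled for manual review, and CUIs with none of the four types drop out.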

Section 3: Synthesis, Mapping, and Final Value Set Construction

The final phase of the process involves the data engineering required to synthesize the ontologically-derived concepts, map them to concrete identifiers within the target EHR schemas, and construct the final, analysis-ready lookup table.

3.1 Unifying and Deduplicating Ontological Concepts

The concepts gathered from the disparate ontologies must be unified into a single, deduplicated list. The UMLS CUI serves as the ideal primary key for this purpose, as it represents a single medical meaning that may be expressed with different codes or names across various terminologies.

The process is as follows:

  1. All concept names and codes from the ATC, RxNorm, and SNOMED CT queries are collected into a single staging table.
  2. This table is joined with umls.mrconso to resolve each entry to its corresponding CUI.
  3. The results are then grouped by CUI. This step effectively deduplicates the list, collapsing multiple representations (e.g., the ATC name for norepinephrine, its RxNorm ingredient name, and its SNOMED CT concept name) into a single entity.
  4. During this grouping, the source ontologies for each CUI are aggregated into an array (e.g., {'ATC', 'RXNORM', 'SNOMED_CT'}), providing critical provenance for each concept in the final value set.
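
The four steps above can be sketched as one grouped query. As before, saaki.staged_concepts is a hypothetical staging table (concept_name, concept_code, source_ontology) and the umls.mrconso column names are assumed; the query is an illustration in PostgreSQL rather than a definitive implementation.

```sql
-- Sketch: collapse staged concepts to one row per CUI with provenance.
-- saaki.staged_concepts is a hypothetical staging table (see lead-in).
SELECT
    m.cui AS concept_id,
    array_agg(DISTINCT sc.source_ontology ORDER BY sc.source_ontology) AS source_ontologies,
    array_agg(DISTINCT sc.concept_name ORDER BY sc.concept_name)       AS source_names
FROM saaki.staged_concepts AS sc
JOIN umls.mrconso AS m
    ON m.sab = CASE sc.source_ontology
                   WHEN 'ATC'       THEN 'ATC'
                   WHEN 'RxNorm'    THEN 'RXNORM'
                   WHEN 'SNOMED_CT' THEN 'SNOMEDCT_US'
               END
   AND m.code = sc.concept_code
GROUP BY m.cui
ORDER BY m.cui;
```

Staged rows that fail to resolve to a CUI disappear under the inner join. Switching to a LEFT JOIN and inspecting rows where m.cui IS NULL is one way to audit what was lost.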

3.2 Mapping Concepts to EHR-Specific Identifiers

This step bridges the gap between abstract ontological concepts and the concrete data in the MIMIC-IV and eICU-CRD databases. The mapping strategy must prioritize data sources that reflect actual, continuous drug administration, as this is what the SOFA score requires. Based on the provided data definitions, mimic.inputevents and eicu.infusiondrug are the highest-quality targets, as they are explicitly designed to record infusions.27 Mapping to order tables like mimic.prescriptions is a useful secondary or confirmatory step but is not the primary goal.

  • MIMIC itemid Mapping: The primary target is the inputevents.itemid. To achieve this, the list of all string names associated with our unified concepts (from umls.mrconso) is joined to the mimic.d_items table. The join condition will be a case-insensitive string match on the label column (e.g., LOWER(d_items.label) LIKE '%norepinephrine%'). Because a bare substring pattern such as '%epinephrine%' also matches norepinephrine labels, matches should be anchored on word boundaries or otherwise disambiguated. To improve accuracy, the d_items.label field should be pre-processed to remove concentrations and fluid types (e.g., "Norepinephrine 8mg in 250mL D5W" becomes "Norepinephrine"). This process yields a direct mapping from a concept CUI to a set of itemids that can be queried in the inputevents table.
  • NDC Code Mapping (Confirmatory): For concepts with an RxNorm origin, RxCUIs can be mapped to NDC codes via the rxnorm.rxnsat table. The query will filter rxnsat for records where rxcui is in our list and the attribute name (atn) is 'NDC'. The attribute value (atv) column contains the NDC, which can then be used to search the mimic.prescriptions.ndc column.27
  • eICU Free-Text Pattern Generation: The eicu.infusiondrug.drugname column is a free-text field.27 The most robust way to query it is by generating a set of SQL ILIKE patterns for each concept. For each CUI in our unified list, we will extract all its associated generic and brand names and convert them into patterns (e.g., "Levophed" becomes '%levophed%'). These patterns will be stored in the final lookup table for direct use in queries against the eICU data.
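
The MIMIC itemid mapping and the eICU pattern generation can be sketched together in PostgreSQL. Several names here are assumptions: concept_names is a hypothetical inline stand-in for the names pulled from umls.mrconso, the identifier_type value 'EICU_ILIKE_PATTERN' is illustrative, and mimic.d_items is assumed to expose the itemid, label, and linksto columns.

```sql
-- Sketch: derive MIMIC itemids and eICU ILIKE patterns from concept names.
-- concept_names is a hypothetical inline stand-in for names from umls.mrconso.
WITH concept_names (concept_name, drug_name) AS (
    VALUES
        ('Norepinephrine', 'norepinephrine'),
        ('Norepinephrine', 'levophed'),
        ('Epinephrine',    'epinephrine'),
        ('Epinephrine',    'adrenalin')
)
SELECT
    n.concept_name,
    'MIMIC_ITEMID'  AS identifier_type,
    di.itemid::text AS identifier_value
FROM concept_names AS n
JOIN mimic.d_items AS di
    -- Whole-word, case-insensitive match: \m and \M anchor word boundaries,
    -- so 'epinephrine' does not match a "Norepinephrine" label.
    ON di.label ~* ('\m' || n.drug_name || '\M')
WHERE di.linksto = 'inputevents'

UNION ALL

SELECT
    n.concept_name,
    'EICU_ILIKE_PATTERN',
    '%' || n.drug_name || '%'
FROM concept_names AS n;
```

The stored eICU patterns inherit the substring problem: '%epinephrine%' also matches norepinephrine rows in eicu.infusiondrug.drugname. A consuming query can handle this by testing the norepinephrine pattern first, or by replacing ILIKE with the same word-boundary regular expression.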

3.3 The Definitive saaki.vasopressor_lookup Table

The culmination of this entire process is a single, well-structured, and comprehensive lookup table. This table is designed to be directly consumed by the SOFA score calculation script, providing all necessary information to identify vasopressor and inotrope use across both MIMIC and eICU datasets.

Table 6: Proposed Schema for saaki.vasopressor_lookup

Column Name | Data Type | Description | Example
concept_name | VARCHAR | The preferred, human-readable name for the drug concept. | 'Norepinephrine'
concept_id | VARCHAR | The key for the concept, the UMLS CUI. | 'C0028323'
identifier_type | VARCHAR | The type of EHR identifier. Critical for query logic. | 'MIMIC_ITEMID'
identifier_value | VARCHAR | The actual value to be used in a query. | '221906'
source_ontologies | TEXT[] | An array of ontologies that identified this concept. | {'ATC', 'RXNORM', 'SNOMED_CT'}
drug_class | VARCHAR | Pharmacological classification from Table 2. Essential for SOFA rules. | 'Vasopressor / Inotrope (Inopressor)'
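
The schema in Table 6 can be sketched as PostgreSQL DDL, followed by an example consumer query. Two points are assumptions rather than part of the proposal: the composite primary key (chosen because one concept maps to many identifiers) and the mimic.inputevents column names used in the consumer query.

```sql
-- Sketch: DDL for the lookup table proposed in Table 6.
CREATE SCHEMA IF NOT EXISTS saaki;

CREATE TABLE IF NOT EXISTS saaki.vasopressor_lookup (
    concept_name      VARCHAR NOT NULL,
    concept_id        VARCHAR NOT NULL,   -- UMLS CUI
    identifier_type   VARCHAR NOT NULL,   -- e.g. 'MIMIC_ITEMID'
    identifier_value  VARCHAR NOT NULL,
    source_ontologies TEXT[]  NOT NULL,   -- provenance, e.g. {'ATC','RXNORM'}
    drug_class        VARCHAR,            -- classification from Table 2
    -- Assumed key: a concept appears once per identifier it maps to.
    PRIMARY KEY (concept_id, identifier_type, identifier_value)
);

-- Example consumer: vasoactive infusions recorded in MIMIC.
SELECT
    ie.stay_id,
    ie.starttime,
    ie.endtime,
    ie.rate,
    ie.rateuom,
    l.concept_name,
    l.drug_class
FROM mimic.inputevents AS ie
JOIN saaki.vasopressor_lookup AS l
    ON l.identifier_type = 'MIMIC_ITEMID'
   AND l.identifier_value = ie.itemid::text;
```

Storing identifier_value as text lets itemids, NDC codes, and ILIKE patterns share one column, at the cost of a cast in each consuming join.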

Conclusion: Ensuring Long-Term Viability and Scientific Rigor

The strategy detailed in this report presents a comprehensive, reproducible, and transparent methodology for creating a vasopressor and inotrope value set. By systematically leveraging the hierarchical and relational structures of multiple standard ontologies, this approach overcomes the inherent limitations and scientific fragility of manually curated lists. The resulting saaki.vasopressor_lookup table is not merely a list of identifiers; it is a rich artifact containing pharmacological classifications and provenance data that directly supports accurate SOFA score calculation and enables more sophisticated downstream research.

The primary strength of this ontology-driven framework is its maintainability. To ensure the value set remains current and reflects the latest in clinical pharmacology, it is recommended that the generation script be re-executed on a periodic basis, for instance, every six to twelve months. This schedule would align with major release cycles of the source terminologies like RxNorm and SNOMED CT, transforming the value set from a static file into a dynamic, "living" resource. Adopting this robust and systematic approach provides the highest possible standard for cohort definition, ensuring the integrity of the SOFA score calculation and bolstering the validity of all subsequent research findings for the SA-AKI project.
