A Multi-Ontology Strategy for the Generation of a Systemic Antibiotic Value Set for Sepsis Cohort Identification

Executive Summary

The accurate identification of a sepsis cohort from electronic health record (EHR) data is critically dependent on the ability to comprehensively detect all instances of systemic antibiotic administration. Relying on small, hardcoded lists of common antibiotics is a methodologically flawed approach that leads to under-ascertainment and significant selection bias, thereby compromising the scientific validity of the research. This report outlines a robust, reproducible, and exhaustive strategy for programmatically generating a definitive value set for systemic antibiotic medications. By leveraging the rich semantic structures and relational data within multiple standard ontologies—including the Anatomical Therapeutic Chemical (ATC) classification, RxNorm, SNOMED CT, and the Unified Medical Language System (UMLS)—this multi-modal approach transcends simple lexical searching. The strategy encompasses the creation of a foundational lexical scaffold, deep interrogation of each ontology using its unique strengths, and a detailed data engineering plan to unify the findings and map them to the specific data structures of the MIMIC-IV and eICU-CRD databases. The final output is a comprehensive lookup table, saaki.antibiotic_lookup, which contains a complete set of identifiers (e.g., RxCUIs, NDCs, MIMIC itemids, name patterns) for all relevant systemic antibiotic agents. This artifact is designed not only to maximize the sensitivity and specificity of cohort identification but also to ensure the entire process is transparent, auditable, and scientifically defensible.

Section 1: Foundational Lexical Scaffolding for Antibiotic Identification

The initial phase in constructing a comprehensive antibiotic value set is to establish a broad lexical foundation. This is not the final value set itself, but rather a critical bootstrapping mechanism designed to cast a wide net into the vast description tables of the available ontologies (e.g., rxnorm.rxnconso.str, snomed_ct.snomed_description.term). This process overcomes the "cold start" problem of not knowing the specific concept identifiers to query, by instead leveraging the human-readable names of the drugs. The terms are meticulously compiled from clinical pharmacology references, drug compendia, and established sepsis treatment guidelines.1

1.1 Core Keywords and Broad Class Names

The process begins by defining a set of foundational terms that identify the concept of an antibiotic at the highest level of abstraction. This includes general keywords that describe the drug's function and the names of major pharmacological classes. These terms serve as the primary anchors for initial, broad searches within the ontology text fields.

Keywords: A set of core terms directly describing the antibacterial function is essential. These include 'antibiotic', 'antibacterial', and 'anti-bacterial'. These keywords are highly effective for initial filtering of concepts in terminologies like SNOMED CT and UMLS.
Broad Classes: The names of major antibiotic classes provide a powerful method for grouping related agents. This list is derived from standard pharmacological classifications.4 Key class names include:
'penicillin', 'cephalosporin', 'fluoroquinolone', 'macrolide', 'carbapenem', 'aminoglycoside', 'tetracycline', 'lincosamide', 'glycopeptide', 'lipopeptide', 'oxazolidinone', 'sulfonamide', 'nitroimidazole', and 'monobactam'.

These terms will be used in case-insensitive pattern matching queries (e.g., using ILIKE or full-text search capabilities) against ontology description tables to retrieve an initial set of "seed" concepts, which will then fuel the more structured queries detailed in Section 2.
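As a rough illustration of this seeding step, the sketch below applies a case-insensitive search to a handful of description strings in memory. The row identifiers and the helper names (`build_seed_regex`, `find_seed_concepts`) are hypothetical and stand in for rows read from an ontology description table.

```python
import re

# Core keywords and class names from Section 1.1 (subset shown for brevity).
SEED_TERMS = ["antibiotic", "antibacterial", "anti-bacterial",
              "penicillin", "cephalosporin", "carbapenem"]


def build_seed_regex(terms):
    """Compile one case-insensitive regex that matches any seed term."""
    # Longest terms first so the alternation prefers the most specific match.
    ordered = sorted(terms, key=len, reverse=True)
    return re.compile("|".join(re.escape(t) for t in ordered), re.IGNORECASE)


def find_seed_concepts(rows, terms=SEED_TERMS):
    """Return (concept_id, description) pairs whose description contains a seed term.

    `rows` is an iterable of (concept_id, description) tuples.
    """
    pattern = build_seed_regex(terms)
    return [(cid, text) for cid, text in rows if pattern.search(text)]


# Hypothetical rows; the identifiers are placeholders, not real concept codes.
sample_rows = [
    ("C1", "Cephalosporin antibacterial"),
    ("C2", "Loop diuretic"),
    ("C3", "History of antibiotic allergy"),
]
seeds = find_seed_concepts(sample_rows)
```

The third row is returned even though it is not a drug concept. Substring matching of this kind only produces candidates, which is why the seeds are refined by the structured queries in Section 2.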

1.2 Exhaustive Generic and Combination Agent Lexicon

Expanding beyond broad classes, the next step is to compile an exhaustive list of specific generic drug names. This is particularly crucial as many EHR medication entries utilize generic nomenclature. This list must be extensive, covering not only common agents but also less frequent but critically important drugs used in severe infections, as well as the numerous combination products prevalent in the ICU setting.2 Combination agents, such as beta-lactam/beta-lactamase inhibitor pairs, are fundamental to modern sepsis therapy and must be explicitly included.

This lexicon is compiled from a wide range of sources, including sepsis guidelines, drug databases, and lists of common antibiotics, to ensure maximum coverage.1 A sample of these critical generic and combination names includes:

'piperacillin-tazobactam'
'ampicillin-sulbactam'
'amoxicillin-clavulanate'
'imipenem-cilastatin'
'meropenem'
'ertapenem'
'doripenem'
'ceftriaxone'
'ceftazidime'
'cefepime'
'ceftaroline'
'vancomycin'
'daptomycin'
'linezolid'
'tedizolid'
'levofloxacin'
'moxifloxacin'
'ciprofloxacin'
'azithromycin'
'clindamycin'
'gentamicin'
'tobramycin'
'amikacin'
'doxycycline'
'tigecycline'
'metronidazole'
'aztreonam'

This comprehensive list serves two purposes: it acts as a direct source for fuzzy matching against messy EHR free-text fields and provides a rich set of specific terms to seed ontology searches for ingredient-level concepts.
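One possible way to do the fuzzy matching mentioned above is with the standard library's `difflib`. This is a minimal sketch, not a validated matcher: the lexicon is a subset of the list above, and the function name and the 0.85 similarity cutoff are assumptions that would need tuning against real data.

```python
import difflib
import re

# Subset of the generic lexicon, split into single-ingredient tokens.
GENERIC_LEXICON = ["piperacillin", "tazobactam", "vancomycin", "meropenem",
                   "ceftriaxone", "cefepime", "levofloxacin", "metronidazole"]


def match_generic_names(free_text, lexicon=GENERIC_LEXICON, cutoff=0.85):
    """Return the sorted lexicon entries that approximately match a word in free_text."""
    tokens = re.findall(r"[a-z]+", free_text.lower())
    hits = set()
    for token in tokens:
        # get_close_matches ranks candidates by SequenceMatcher similarity ratio
        # and drops anything below the cutoff.
        hits.update(difflib.get_close_matches(token, lexicon, n=1, cutoff=cutoff))
    return sorted(hits)
```

With this cutoff, a single dropped letter such as "piperacilin" still resolves to "piperacillin", while dose units and route abbreviations fall well below the threshold.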

1.3 Critical Brand Name Compendium for EHR Free-Text Mining

EHR free-text fields, such as mimic.prescriptions.drug and mimic.emar.medication, are notoriously unstructured and frequently contain proprietary brand names instead of generic names.10 Failure to account for these brand names would result in a significant loss of data and introduce bias. Therefore, compiling a comprehensive list of common U.S. brand names for systemic antibiotics is a non-negotiable requirement for achieving high recall.

This compendium is built from various drug information sources and clinical guidelines.7 These brand names will be used to generate robust search patterns for direct querying of free-text medication fields. Furthermore, they are essential for identifying branded drug concepts within RxNorm, which are typically denoted by the Term Type (TTY) of 'BN' (Brand Name) or 'SBD' (Semantic Branded Drug). A representative sample of these critical brand names includes:

'Zosyn' (piperacillin-tazobactam)
'Unasyn' (ampicillin-sulbactam)
'Augmentin' (amoxicillin-clavulanate)
'Merrem' (meropenem)
'Invanz' (ertapenem)
'Maxipime' (cefepime)
'Teflaro' (ceftaroline)
'Rocephin' (ceftriaxone)
'Vancocin' (vancomycin)
'Cubicin' (daptomycin)
'Zyvox' (linezolid)
'Levaquin' (levofloxacin)
'Cipro' (ciprofloxacin)
'Avelox' (moxifloxacin)
'Zithromax' (azithromycin)
'Cleocin' (clindamycin)
'Flagyl' (metronidazole)
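The brand list can be turned into a simple brand-to-generic resolver. The sketch below uses a subset of the pairs above; the dictionary and function names are hypothetical, and whole-word matching is one design choice among several.

```python
import re

# Subset of the brand compendium above, keyed by lowercase brand name.
BRAND_TO_GENERIC = {
    "zosyn": "piperacillin-tazobactam",
    "unasyn": "ampicillin-sulbactam",
    "merrem": "meropenem",
    "rocephin": "ceftriaxone",
    "zyvox": "linezolid",
    "flagyl": "metronidazole",
}

# One case-insensitive alternation with word boundaries on both sides.
_BRAND_RE = re.compile(
    r"\b(" + "|".join(re.escape(b) for b in BRAND_TO_GENERIC) + r")\b",
    re.IGNORECASE,
)


def generics_from_brands(text):
    """Return generic names implied by brand names in text, in order of first appearance."""
    found = []
    for match in _BRAND_RE.finditer(text):
        generic = BRAND_TO_GENERIC[match.group(1).lower()]
        if generic not in found:
            found.append(generic)
    return found
```

Resolving brands to generics early lets the later ontology lookups work from a single canonical name per agent.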

The following table consolidates the lexical terms from this section into a single, organized artifact, providing a ready-to-use resource for the initial phase of ontology querying.

Table 1: Comprehensive Lexical Search Terms for Systemic Antibiotics

Term	Term_Type	Source_Examples
antibiotic	Keyword	1
antibacterial	Keyword	15
penicillin	Class	5
cephalosporin	Class	5
carbapenem	Class	4
fluoroquinolone	Class	4
macrolide	Class	4
aminoglycoside	Class	1
tetracycline	Class	4
glycopeptide	Class	4
lincosamide	Class	4
oxazolidinone	Class	4
monobactam	Class	5
piperacillin-tazobactam	Generic Combination	2
ampicillin-sulbactam	Generic Combination	12
imipenem-cilastatin	Generic Combination	12
ceftazidime-avibactam	Generic Combination	13
meropenem-vaborbactam	Generic Combination	13
ceftolozane-tazobactam	Generic Combination	13
vancomycin	Generic	4
daptomycin	Generic	4
linezolid	Generic	4
ceftriaxone	Generic	5
cefepime	Generic	3
levofloxacin	Generic	4
azithromycin	Generic	4
clindamycin	Generic	4
gentamicin	Generic	1
metronidazole	Generic	4
Zosyn	Brand	12
Unasyn	Brand	12
Merrem	Brand	12
Maxipime	Brand	13
Rocephin	Brand	7
Levaquin	Brand	7
Zyvox	Brand	7
Vancocin	Brand	12
Cubicin	Brand	12
Flagyl	Brand	7

Section 2: A Multi-Ontology Strategy for Comprehensive Concept Discovery

With a foundational lexical list established, the next phase involves the deep, structured interrogation of multiple biomedical ontologies. Each terminology possesses unique strengths and structural properties. A multi-modal strategy that combines these strengths is far more robust than relying on any single source. This section details the specific, actionable query strategies for ATC, RxNorm, SNOMED CT, and UMLS to build a definitive set of antibiotic concepts.

2.1 Navigating the ATC Hierarchy: The J01 Core and Its Clinically Relevant Periphery

The Anatomical Therapeutic Chemical (ATC) classification system provides a strict hierarchical structure for drugs based on their therapeutic and chemical properties. Its primary value is in providing a high-level, curated classification of systemic antibacterial agents.

The core of our search within ATC will target the J01 group, which is explicitly defined as "ANTIBACTERIALS FOR SYSTEMIC USE".15 Querying the atc.who_atc_ddd table for all codes that fall under this branch (e.g., where atc_level2 = 'J01' or atc_code LIKE 'J01%') will capture the vast majority of relevant drugs, including penicillins (J01C), cephalosporins (J01D), carbapenems (J01DH), macrolides (J01FA), and aminoglycosides (J01G).15

However, a rigid adherence to only the J01 class is clinically insufficient and would lead to significant omissions. Clinical practice, especially in the ICU, necessitates the use of agents classified elsewhere in ATC for treating severe systemic bacterial infections. A prime example is metronidazole, which is a cornerstone of therapy for anaerobic infections common in intra-abdominal sepsis.3 Within ATC, parenteral metronidazole falls under J01XD01, but its oral and rectal formulations are classified under P01AB01, as a nitroimidazole derivative for treating protozoal diseases.19 To ignore this code would be a critical error, because enteral metronidazole is also used to treat systemic anaerobic infections. Therefore, our strategy must be augmented to explicitly include P01AB01. Other drugs in the P01AB class, such as tinidazole, are not typically used for systemic bacterial infections in the ICU and can be excluded.21

Another area requiring careful clinical judgment is the A07AA class ("Antibiotics", within A07A "Intestinal Antiinfectives").22 This class includes oral formulations of vancomycin and rifaximin. While these are antibiotics, their intended use is for local action within the gastrointestinal tract (e.g., oral vancomycin for Clostridioides difficile colitis, rifaximin for hepatic encephalopathy).24 Rifaximin is designed to be non-absorbable.26 While oral vancomycin can be systemically absorbed, particularly in critically ill patients with renal failure or inflamed GI tracts 27, its administration is not for the purpose of treating a suspected systemic infection. Including these agents would introduce noise by flagging patients treated for local GI conditions as having suspected systemic infection. Therefore, parenteral formulations classified elsewhere in ATC (e.g., IV vancomycin, classified in J01XA01) will be included, but the oral-only formulations from A07AA will be explicitly excluded from the final systemic value set. This nuanced decision highlights the necessity of blending ontological structure with clinical context.

The following SQL demonstrates the ATC query logic:

SQL

-- Retrieve concepts from the primary systemic antibacterial class
SELECT atc_code, atc_name, 'J01' AS rationale
FROM atc.who_atc_ddd
WHERE atc_code LIKE 'J01%' AND is_drug = true

UNION ALL

-- Explicitly include systemically used agents from other classes with clinical justification
SELECT atc_code, atc_name, 'P01AB - Systemic Anaerobic Use' AS rationale
FROM atc.who_atc_ddd
WHERE atc_code = 'P01AB01'; -- Metronidazole
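The same inclusion and exclusion rules can be expressed as a small classifier, which is convenient when the rationale has to be recorded next to each code. This is a sketch of the decisions described in this section; the function name and the wording of the rationale strings are illustrative assumptions.

```python
def classify_atc(atc_code):
    """Return (is_systemic_antibiotic, rationale) for one ATC code.

    Encodes the rules described above: include J01, include metronidazole
    under P01AB01, and exclude the locally acting A07AA agents.
    """
    code = atc_code.strip().upper()
    if code.startswith("A07AA"):
        return (False, "A07AA - intestinal anti-infective, local GI action")
    if code.startswith("J01"):
        return (True, "J01")
    if code == "P01AB01":
        return (True, "P01AB - Systemic Anaerobic Use")
    return (False, "outside systemic antibacterial scope")


# Example: build (code, included, rationale) rows for a few codes.
decisions = [(c, *classify_atc(c)) for c in ["J01XA01", "P01AB01", "P01AB02", "A07AA09"]]
```

Returning the rationale with the decision keeps the excluded codes visible, so they can be stored in the lookup table rather than silently dropped.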

2.2 Maximizing RxNorm for Ingredient-to-Product Expansion

RxNorm serves as the central hub for our value set, acting as the bridge between abstract concepts and prescribable medications. It is the designated U.S. standard for clinical drugs and provides a rich relational structure linking ingredients, strengths, dose forms, and brand names.29 The strategy for leveraging RxNorm is a powerful two-stage expansion process that moves from ingredients to all their corresponding products.

Stage 1: Seed Ingredient Identification. The first step is to create a high-confidence set of antibiotic "ingredient" concepts within RxNorm. This is achieved by using the lexical terms from Section 1 and the drug names derived from our ATC and SNOMED CT queries. We query the rxnorm.rxnconso table for concepts where the Term Type (tty) is 'IN' (Ingredient) or 'PIN' (Precise Ingredient) and the source (sab) is 'RXNORM', ensuring we are working with the canonical RxNorm concepts.31

SQL

-- Stage 1: Find RxCUIs for seed antibiotic ingredients
CREATE TEMP TABLE seed_abx_ingredients AS
SELECT DISTINCT rxcui, str
FROM rxnorm.rxnconso
WHERE sab = 'RXNORM'
AND tty IN ('IN', 'PIN')
AND lower(str) IN (
-- Populate with the comprehensive lexical list from Section 1
'amoxicillin', 'clavulanate', 'piperacillin', 'tazobactam', 'vancomycin', 'ceftriaxone',...
);

Stage 2: Relational Graph Traversal. An ingredient is not a prescribable drug; a product is. A single ingredient like "amoxicillin" can manifest in dozens of prescribable products (e.g., "Amoxicillin 250 MG Oral Capsule", "Amoxicillin 500 MG / Clavulanate 125 MG Oral Tablet [Augmentin]"). To capture this entire "pharmaceutical reality," we must traverse the RxNorm relational graph. The rxnorm.rxnrel table contains these connections, using the rela column to define the relationship type.31 Key relationships include has_ingredient, tradename_of, consists_of, and isa.32 A recursive query is the most effective method to start with our seed ingredients and find all connected concepts that represent actual drug products, which are typically SCD (Semantic Clinical Drug), SBD (Semantic Branded Drug), GPCK (Generic Pack), and BPCK (Brand Name Pack).30

The logic involves starting with the seed ingredient RxCUIs and iteratively following rxnrel edges outward, repeating the process for the newly found concepts until no new product concepts appear. The traversal should be restricted to the relationship types listed above and should stop at product-level term types: an unrestricted walk over every relationship would pass through shared nodes such as dose forms and pull in unrelated drugs. Applied to a single ingredient like 'piperacillin', this finds the combination product 'piperacillin-tazobactam', its brand name 'Zosyn', and all specific strength and dose form variations (e.g., 'Piperacillin 4000 MG / Tazobactam 500 MG Injection'); the co-ingredient 'tazobactam' is reached through those combination products. This graph traversal is fundamental to achieving a truly comprehensive value set.
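The traversal can be sketched as a breadth-first search over relationship edges. The example below uses a toy edge list with placeholder identifiers (not real RxCUIs) and assumes edges are stored as (from, rela, to). It restricts the walk to a whitelist of relationship types, because a walk over every relationship reaches unrelated drugs through shared nodes such as dose forms.

```python
from collections import deque

# Toy edges as (from_id, rela, to_id). Identifiers are placeholders whose
# prefix mimics the RxNorm term type; they are not real RxCUIs.
EDGES = [
    ("IN:piperacillin", "ingredient_of", "SCDC:pip_4000mg"),
    ("SCDC:pip_4000mg", "constitutes", "SCD:pip_tazo_inj"),
    ("IN:tazobactam", "ingredient_of", "SCDC:tazo_500mg"),
    ("SCDC:tazo_500mg", "constitutes", "SCD:pip_tazo_inj"),
    ("SCD:pip_tazo_inj", "has_tradename", "SBD:zosyn_inj"),
    ("SCD:pip_tazo_inj", "has_dose_form", "DF:injection"),
    ("DF:injection", "dose_form_of", "SCD:heparin_inj"),
]

# Relationship types that lead from an ingredient towards its products.
ALLOWED_RELA = {"ingredient_of", "constitutes", "has_tradename"}
PRODUCT_PREFIXES = ("SCD:", "SBD:", "GPCK:", "BPCK:")


def expand_products(seeds, edges=EDGES, allowed=ALLOWED_RELA):
    """Breadth-first expansion from seed ingredients to product-level concepts."""
    adjacency = {}
    for src, rela, dst in edges:
        if rela in allowed:
            adjacency.setdefault(src, []).append(dst)
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    # Keep only product-level concepts in the result.
    return sorted(n for n in seen if n.startswith(PRODUCT_PREFIXES))


products = expand_products(["IN:piperacillin"])
```

Adding the dose form relationships to the whitelist makes the heparin product reachable from piperacillin in this toy graph, which is the failure mode the whitelist is meant to prevent.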

2.3 Definitive SNOMED CT Traversal from Validated Seed Concepts

SNOMED CT, as a formal clinical ontology, offers a powerful, logic-based structure for identifying concepts. Relying on simple text searches within SNOMED CT is fraught with peril, as it is semantically imprecise and can erroneously retrieve concepts like "History of antibiotic allergy" or "Guideline for antibiotic use".33 The scientifically robust method is to leverage the ontology's poly-hierarchical structure.

The core of this structure is the 'Is a' relationship (parent-child), which is identified by type_id = 116680003 in the snomed_ct.snomed_relationship table.10 The strategy hinges on identifying the correct high-level "grouper" concept(s) and then programmatically finding all their descendants. The modern SNOMED CT medicinal product model provides ideal starting points.36 The following concepts serve as excellent, validated seeds for our traversal:

763158003 |Medicinal product containing antibiotic (product)|: This concept serves as a grouper for all manufactured products that contain an antibiotic.
108601006 |Substance with antibacterial mechanism of action (substance)|: This concept groups the antibiotic substances themselves, based on their defined mechanism of action.

By starting with these two high-confidence concepts, we can write a recursive Common Table Expression (CTE) in SQL to traverse the Is a hierarchy downwards. Provided the seed identifiers are first confirmed against their fully specified names in the loaded release (snomed_ct.snomed_description), this approach retrieves every concept that is logically defined within SNOMED CT as either an antibiotic substance or a product containing one, and it avoids lexical false positives by construction. This makes it considerably more precise than any lexical method.

SQL

-- Find all SNOMED CT concepts that are descendants of the primary antibiotic groupers
WITH RECURSIVE snomed_antibiotics AS (
-- Anchor: Start with the definitive high-level concepts
SELECT id AS concept_id
FROM snomed_ct.snomed_concept
WHERE id IN (
763158003, -- Medicinal product containing antibiotic (product)
108601006 -- Substance with antibacterial mechanism of action (substance)
)

UNION ALL

-- Recursive Step: Find all direct children of the concepts found in the previous step
SELECT rel.source_id
FROM snomed_ct.snomed_relationship rel
JOIN snomed_antibiotics sa ON rel.destination_id = sa.concept_id
WHERE rel.active = true
  AND rel.type_id = 116680003 -- 'Is a' relationship

)
SELECT DISTINCT concept_id FROM snomed_antibiotics;
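To check the shape of this recursion without a full SNOMED CT load, the sketch below runs an equivalent recursive CTE against an in-memory SQLite database. The table layout is a simplified assumption, the concept ids 1 to 5 are placeholders, and `active` is stored as 0/1 because SQLite has no boolean type. It uses UNION rather than UNION ALL so that a concept with several parents is expanded only once.

```python
import sqlite3

IS_A = 116680003  # SNOMED CT 'Is a' relationship type


def descendants(conn, root_ids):
    """Return root_ids plus all of their active 'Is a' descendants, sorted."""
    placeholders = ",".join("?" for _ in root_ids)
    sql = f"""
        WITH RECURSIVE tree(concept_id) AS (
            SELECT id FROM snomed_concept WHERE id IN ({placeholders})
            UNION
            SELECT rel.source_id
            FROM snomed_relationship AS rel
            JOIN tree ON rel.destination_id = tree.concept_id
            WHERE rel.active = 1 AND rel.type_id = ?
        )
        SELECT concept_id FROM tree ORDER BY concept_id
    """
    return [row[0] for row in conn.execute(sql, [*root_ids, IS_A])]


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE snomed_concept (id INTEGER PRIMARY KEY);
    CREATE TABLE snomed_relationship (
        source_id INTEGER, destination_id INTEGER, type_id INTEGER, active INTEGER);
""")
# Placeholder ids 1-5, not real SNOMED CT identifiers.
conn.executemany("INSERT INTO snomed_concept VALUES (?)", [(i,) for i in range(1, 6)])
conn.executemany("INSERT INTO snomed_relationship VALUES (?, ?, ?, ?)", [
    (2, 1, IS_A, 1),  # 2 is a 1
    (3, 2, IS_A, 1),  # 3 is a 2
    (3, 1, IS_A, 1),  # 3 is also a 1 (poly-hierarchy)
    (4, 1, IS_A, 0),  # inactive relationship, ignored
    (5, 2, 999, 1),   # different relationship type, ignored
])
```

Calling `descendants(conn, [1])` on this toy data returns the root and its two active descendants, and skips the concepts that are linked only through an inactive or non-hierarchical relationship.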

2.4 Leveraging UMLS Semantic Types for Broad-Spectrum Filtering

The Unified Medical Language System (UMLS) acts as a metathesaurus, aggregating over 200 vocabularies, making it an unparalleled resource for breadth.37 Its primary utility in this project is not for definitive classification (a task better suited to the more granular models of RxNorm and ATC) but for broad, high-recall candidate generation. The key to harnessing UMLS is the mrsty table, which links concepts (identified by a Concept Unique Identifier, or CUI) to high-level categories called Semantic Types (identified by a TUI).10

The UMLS Semantic Network provides a hierarchy of these types.38 Analysis of this network reveals the key relationships for our purpose:

T195 |Antibiotic|: This is the most specific and highest-precision semantic type for our query.39 Any concept assigned this TUI is a strong candidate.
T121 |Pharmacologic Substance|: This is the direct parent type of T195 in the Semantic Network.40 It is a much broader category that includes all drugs.42 Querying this type alone would be too noisy.
T200 |Clinical Drug|: This type represents a specific manufactured product (substance, strength, and form).43 The usage notes for T200 explicitly state that it should not be double-typed with T121 or T195, indicating they represent different levels of abstraction.43

This structure informs a hybrid strategy. We use UMLS to identify drug substances and then rely on the RxNorm expansion (Section 2.2) to find the corresponding products. The optimal UMLS query combines high precision with high recall:

Select all CUIs with the semantic type T195. This provides a baseline of high-confidence antibiotic concepts.
Select all CUIs with the semantic type T121 only if their associated string in umls.mrconso matches a term from our lexical list (Section 1). This leverages our domain knowledge to filter the broad T121 category, capturing antibiotics that may not have been assigned the more specific T195 type.

This approach avoids the noise of querying all of T121 while ensuring we do not miss concepts that are pharmacologic substances and are known to be antibiotics based on our lexical scaffold.

SQL

-- Generate a broad list of candidate antibiotic CUIs from UMLS
SELECT cui FROM umls.mrsty WHERE tui = 'T195'

UNION

SELECT DISTINCT m.cui
FROM umls.mrsty m
JOIN umls.mrconso c ON m.cui = c.cui
WHERE m.tui = 'T121'
-- Use the lexical list as a filter for the broader semantic type
AND lower(c.str) IN ('amoxicillin', 'zosyn', 'cefepime',...); -- Populated from Table 1
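The set logic behind this query is easier to see on a few in-memory rows. The sketch below mirrors the two branches of the UNION; the CUIs are placeholders and the function name is hypothetical.

```python
def candidate_cuis(mrsty_rows, mrconso_rows, lexicon):
    """Combine the T195 branch with the lexically filtered T121 branch.

    mrsty_rows:   iterable of (cui, tui)
    mrconso_rows: iterable of (cui, string)
    lexicon:      antibiotic names from the lexical scaffold
    """
    lexicon_lc = {term.lower() for term in lexicon}
    antibiotic_typed = {cui for cui, tui in mrsty_rows if tui == "T195"}
    pharmacologic = {cui for cui, tui in mrsty_rows if tui == "T121"}
    lexically_known = {cui for cui, string in mrconso_rows if string.lower() in lexicon_lc}
    # T195 is accepted as is; T121 only when a name confirms it.
    return antibiotic_typed | (pharmacologic & lexically_known)


# Placeholder CUIs, not real UMLS identifiers.
mrsty = [("CUI_A", "T195"), ("CUI_A", "T121"), ("CUI_B", "T121"),
         ("CUI_C", "T121"), ("CUI_D", "T200")]
mrconso = [("CUI_A", "Vancomycin"), ("CUI_B", "Cefepime"),
           ("CUI_C", "Furosemide"), ("CUI_D", "Zosyn")]
candidates = candidate_cuis(mrsty, mrconso, ["cefepime", "zosyn", "vancomycin"])
```

In this toy data the T200 concept is not returned even though its name is in the lexicon, which matches the plan of leaving product-level concepts to the RxNorm expansion.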

Section 3: Engineering a Robust, Unified Antibiotic Value Set

The final and most critical phase of this project is the data engineering process. This involves synthesizing the concepts discovered from the multiple ontologies into a single, coherent value set and then devising precise strategies to map these concepts to the heterogeneous medication data fields within the MIMIC-IV and eICU-CRD databases. The outcome of this phase is the final, actionable lookup table that will drive the sepsis cohort identification.

3.1 Unification and Deduplication of Ontological Concepts

Having gathered lists of concept identifiers and names from ATC, SNOMED CT, RxNorm, and UMLS, the next step is to merge them into a single, non-redundant master set. Given its role as the U.S. national standard for clinical drugs and its rich relational model, RxNorm is the ideal "hub" or central terminology for this unification process.30 Most concepts from other major terminologies, such as ATC and SNOMED CT, are cross-mapped to RxNorm concepts within the UMLS Metathesaurus. Therefore, the RxNorm Concept Unique Identifier (RxCUI) will serve as the primary key for our unified concepts.

The unification strategy is as follows:

Create Staging Tables: For each ontology-specific query in Section 2, materialize the results into temporary staging tables (e.g., tmp_atc_codes, tmp_snomed_ids, tmp_umls_cuis, tmp_rxnorm_rxcuis).
Map to RxCUI: For the concepts from ATC, SNOMED CT, and UMLS, use the umls.mrconso or rxnorm.rxnconso tables to find their corresponding RxCUIs. This is typically done by joining on the native code (e.g., snomed_concept.id to mrconso.code where sab = 'SNOMEDCT_US') to get a CUI, and then finding the RxCUI for that CUI.
Union and Deduplicate: Perform a UNION operation on the RxCUI lists from all four sources. This automatically handles deduplication, resulting in a single master list of all RxCUIs that have been identified as being related to an antibiotic by at least one ontology.
Final Concept Selection: Join this master RxCUI list back to rxnorm.rxnconso to retrieve the canonical names (tty = 'PSN') and filter for the desired term types that represent prescribable products (SCD, SBD) or medication packs (GPCK, BPCK). This step ensures the final list contains actionable drug products, not just abstract ingredients.

SQL

-- Conceptual logic for creating the final, unified set of antibiotic RxCUIs
CREATE TABLE final_abx_rxcuis AS
SELECT rxcui FROM tmp_rxcuis_from_atc
UNION
SELECT rxcui FROM tmp_rxcuis_from_snomed
UNION
SELECT rxcui FROM tmp_rxcuis_from_umls
UNION
SELECT rxcui FROM tmp_rxcuis_from_rxnorm_expansion;
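The "Map to RxCUI" step that feeds these staging tables is a two-hop lookup through the Metathesaurus: native code to CUI, then CUI to the RxNorm atom on the same concept. A minimal sketch on toy rows follows. All codes and CUIs are placeholders, and it assumes the rows carry (cui, sab, code) with the RxCUI in the code field of RXNORM-sourced rows.

```python
def rxcuis_for_native_codes(native_codes, source_sab, mrconso_rows):
    """Map source-vocabulary codes to RxCUIs via shared CUIs.

    mrconso_rows: iterable of (cui, sab, code) tuples.
    """
    wanted = set(native_codes)
    # Hop 1: native code -> CUI within the source vocabulary.
    cuis = {cui for cui, sab, code in mrconso_rows
            if sab == source_sab and code in wanted}
    # Hop 2: CUI -> code of the RXNORM atom on the same concept.
    return {code for cui, sab, code in mrconso_rows
            if sab == "RXNORM" and cui in cuis}


# Placeholder rows; none of these identifiers are real.
mrconso = [
    ("CUI_1", "SNOMEDCT_US", "S100"),
    ("CUI_1", "RXNORM", "R1"),
    ("CUI_2", "ATC", "J01XA01"),
    ("CUI_2", "RXNORM", "R2"),
    ("CUI_3", "SNOMEDCT_US", "S300"),  # no RxNorm atom on this CUI
]
from_snomed = rxcuis_for_native_codes(["S100", "S300"], "SNOMEDCT_US", mrconso)
from_atc = rxcuis_for_native_codes(["J01XA01"], "ATC", mrconso)
final_rxcuis = from_snomed | from_atc | {"R2"}  # set union deduplicates
```

Codes that have no RxNorm atom on their CUI drop out silently here, so in practice it is worth listing the unmapped codes for manual review.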

3.2 Mapping Strategies for Coded EHR Data

The target EHR databases contain several coded columns for medication data, which allow for high-precision mapping.10

National Drug Code (NDC) Mapping: The mimic.prescriptions.ndc column provides the most direct and reliable mapping target. The RxNorm data files include the rxnsat table, which serves as an explicit crosswalk between RxCUIs and NDCs.45 We can join our final concept list (final_abx_rxcuis enriched with names in the Final Concept Selection step, referred to as final_abx_concepts in the query below) with rxnorm.rxnsat on rxcui where the attribute name (atn) is 'NDC'. The attribute value (atv) will be the 11-digit NDC code. This generates a definitive list of all NDCs corresponding to our antibiotic value set, which can then be used to perform a direct, high-performance join against the mimic.prescriptions table.

SQL

-- Generate NDC list for the final lookup table
SELECT
abx.rxcui,
abx.concept_name,
ndc.atv AS identifier_value
FROM final_abx_concepts abx
JOIN rxnorm.rxnsat ndc ON abx.rxcui = ndc.rxcui
WHERE ndc.atn = 'NDC' AND ndc.sab = 'RXNORM';
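A join on NDC only works if both sides use the same 11-digit form. If any source supplies hyphenated 10-digit codes, they can be padded segment by segment (labeler to 5 digits, product to 4, package to 2). The sketch below shows that conversion; the function name is hypothetical, and whether a given source needs this step is an assumption to verify.

```python
# Hyphenated layouts: the three 10-digit configurations plus the 11-digit 5-4-2 form.
_VALID_SHAPES = {(4, 4, 2), (5, 3, 2), (5, 4, 1), (5, 4, 2)}


def ndc_to_11_digits(ndc):
    """Return the 11-digit, unhyphenated form of an NDC, or None if it is not recognizable."""
    value = ndc.strip()
    parts = value.split("-")
    if len(parts) == 3:
        if not all(p.isdigit() for p in parts):
            return None
        if tuple(len(p) for p in parts) not in _VALID_SHAPES:
            return None
        labeler, product, package = parts
        # Left-pad each segment with zeros to the 5-4-2 layout.
        return labeler.zfill(5) + product.zfill(4) + package.zfill(2)
    if value.isdigit() and len(value) == 11:
        return value
    # Placeholders and malformed values cannot be mapped.
    return None
```

An unhyphenated 10-digit string is rejected on purpose, since without the hyphens there is no way to tell which segment needs the padding.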

MIMIC itemid Mapping: Mapping to the itemid used in tables like mimic.inputevents is more challenging as it is a local hospital identifier. The meaning of each itemid is provided by the label column in the mimic.d_items dictionary table. There is no direct ontological link between an itemid and an RxCUI. The mapping must therefore be performed using string-matching techniques between the d_items.label and the list of antibiotic names (generic and brand) derived from our unified concept list. This process requires careful text normalization on both sides (e.g., lowercasing, removing extraneous information like solution volumes, standardizing drug-salt forms) and manual validation to ensure accuracy. A fuzzy matching approach using ILIKE or regular expressions is necessary here.

SQL

-- Conceptual logic for mapping to MIMIC itemids via normalized string matching
SELECT
di.itemid AS identifier_value,
di.label AS source_label,
abx.concept_name
FROM mimic.d_items di
JOIN final_abx_concepts abx
ON lower(di.label) LIKE '%' || lower(abx.normalized_name) || '%'
WHERE di.linksto IN ('inputevents', 'chartevents'); -- Filter for relevant tables
-- NOTE: 'abx.normalized_name' is a hypothetical column requiring pre-processing.
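The pre-processing that a normalized name column implies can be sketched as follows. The specific rules (which units and salt words to strip) and the item identifiers are illustrative assumptions; real labels would need a reviewed rule set and manual validation of the resulting matches.

```python
import re

# Dose or volume tokens such as "1g", "250 mL", "500 mg".
_DOSE = re.compile(r"\b\d+(\.\d+)?\s*(ml|mg|g|gm|mcg|units?)\b")
# Common salt words that do not change the active ingredient.
_SALTS = re.compile(r"\b(hcl|hydrochloride|sulfate|sodium)\b")
_NON_ALNUM = re.compile(r"[^a-z0-9]+")


def normalize_label(label):
    """Lowercase a label, strip doses and salt words, and collapse punctuation to spaces."""
    text = label.lower()
    text = _DOSE.sub(" ", text)
    text = _SALTS.sub(" ", text)
    return _NON_ALNUM.sub(" ", text).strip()


def match_items(items, names):
    """Return (itemid, name) pairs where a normalized name occurs as whole words in the label.

    items: iterable of (itemid, label); names: already-normalized antibiotic names.
    """
    matches = []
    for itemid, label in items:
        padded = f" {normalize_label(label)} "
        for name in names:
            if f" {name} " in padded:
                matches.append((itemid, name))
    return matches


# Placeholder item ids and labels for illustration only.
items = [(1, "Vancomycin HCl 1g / 250 mL"),
         (2, "Piperacillin/Tazobactam (Zosyn)"),
         (3, "Heparin Sodium")]
matched = match_items(items, ["vancomycin", "piperacillin tazobactam", "zosyn"])
```

Matching on whole words is stricter than a `LIKE '%name%'` substring test, which trades a little recall for fewer accidental hits on short names.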

3.3 Pattern Generation for Free-Text Medication Fields

A substantial portion of medication data resides in unstructured free-text columns across both MIMIC-IV (prescriptions.drug, emar.medication) and eICU-CRD (medication.drugname, infusiondrug.drugname, admissiondrug.drugname).10 To capture these administrations, we must generate a series of robust search patterns for each antibiotic concept.

The strategy involves the following steps for each canonical antibiotic concept in our final list:

Gather Synonyms: From rxnorm.rxnconso, retrieve all associated strings, including generic names, brand names, and common synonyms.
Generate Patterns: Programmatically create a variety of patterns to account for common variations in documentation:
- Component Patterns: For combination drugs like "piperacillin-tazobactam," create patterns that look for both components, ignoring order and intervening characters (e.g., '%piperacillin%tazobactam%').
- Brand Name Patterns: Create patterns for all known brand names (e.g., '%zosyn%').
- Abbreviation Patterns: Include clinically common abbreviations (e.g., 'pip-tazo', 'pip/tazo').
- Stem Patterns: Use word stems to capture variations (e.g., 'vanco%' to match 'vancomycin', 'vancocin').

These generated patterns will be stored in the final lookup table and used in ILIKE or regular expression-based queries against the free-text columns to achieve maximum sensitivity.
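The pattern types listed above can be generated programmatically. The sketch below is one way to do it: the function names, the five-character stem length, and the choice to emit stems only for single-ingredient names are all assumptions. The small `ilike` helper emulates SQL ILIKE semantics so the patterns can be tried without a database.

```python
import re


def generate_patterns(generic, brands=(), abbreviations=(), stem_len=5):
    """Build ILIKE-style patterns for one antibiotic concept, without duplicates."""
    patterns = []

    def add(pattern):
        if pattern not in patterns:
            patterns.append(pattern)

    components = [c for c in generic.lower().replace("/", "-").split("-") if c]
    if len(components) > 1:
        # Component patterns: accept the components in either order.
        add("%" + "%".join(components) + "%")
        add("%" + "%".join(reversed(components)) + "%")
    else:
        add(f"%{components[0]}%")
    for brand in brands:
        add(f"%{brand.lower()}%")
    for abbreviation in abbreviations:
        add(f"%{abbreviation.lower()}%")
    if len(components) == 1:
        # Stem pattern, anchored at the start of the field.
        add(f"{components[0][:stem_len]}%")
    return patterns


def ilike(text, pattern):
    """Emulate SQL ILIKE: '%' matches any run of characters, '_' exactly one."""
    regex = re.escape(pattern).replace("%", ".*").replace("_", ".")
    return re.fullmatch(regex, text, re.IGNORECASE | re.DOTALL) is not None


pip_tazo = generate_patterns("piperacillin-tazobactam",
                             brands=["Zosyn"], abbreviations=["pip-tazo", "pip/tazo"])
vanco = generate_patterns("vancomycin", brands=["Vancocin"])
```

Because the stem pattern is anchored at the start, it matches "Vancomycin 1g IV" but not "IV Vancomycin", so it complements the `%...%` patterns rather than replacing them.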

3.4 Final Schema and Implementation of the saaki.antibiotic_lookup Table

The culmination of this entire process is the creation of a single, powerful lookup table. A well-designed schema for this table is paramount, as it must not only store the identifiers but also preserve the provenance and clinical rationale behind the value set. The initially proposed schema is a sound starting point, which can be enhanced with additional columns to increase its scientific value and transparency.

A simple list of identifiers is useful, but a truly rigorous artifact must be auditable. It should document why each item is included and, just as importantly, why certain related items are excluded. For example, the clinical judgment to exclude oral vancomycin from the systemic value set is a critical piece of methodological information. Storing this rationale directly in the table makes the entire process transparent and scientifically defensible to reviewers, collaborators, and future researchers.

The proposed final schema for the saaki.antibiotic_lookup table is detailed below.

Table 2: Final Schema for saaki.antibiotic_lookup

Column Name	Data Type	Description
concept_name	VARCHAR	The preferred, human-readable name of the antibiotic concept (e.g., 'Vancomycin', 'Piperacillin-Tazobactam').
concept_id	BIGINT	The primary concept identifier, which will be the RxNorm Concept Unique Identifier (RxCUI). This serves as the anchor for linking all other identifiers.
identifier_type	VARCHAR	The type of the identifier in the identifier_value column. Examples: 'MIMIC_ITEMID', 'NDC', 'NAME_PATTERN', 'ATC_CODE', 'SNOMED_CT_ID', 'EXCLUDED_CONCEPT'.
identifier_value	VARCHAR	The actual value of the identifier (e.g., '225798', '00409400201', '%vancomycin%', 'J01XA01', '763158003').
is_systemic	BOOLEAN	A flag indicating if the specific formulation or product is intended for systemic administration. TRUE for IV vancomycin, FALSE for oral vancomycin.
source_ontologies	TEXT[]	A PostgreSQL array listing the source ontologies that provided evidence for including this concept (e.g., '{ATC,RxNorm,SNOMED}'). This provides an audit trail for provenance.
rationale	TEXT	A free-text field for documenting key inclusion or exclusion decisions. Essential for ambiguous or controversial agents (e.g., "Oral formulation from ATC class A07AA; intended for local GI action and excluded from systemic value set.").
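To make the intended use of these columns concrete, the sketch below models a few rows in memory and filters them the way a cohort query would. The class and function names are hypothetical and the concept_id values are placeholders rather than real RxCUIs. The identifier values reuse the examples from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LookupRow:
    """One row of the lookup table, mirroring the columns in Table 2."""
    concept_name: str
    concept_id: int
    identifier_type: str
    identifier_value: str
    is_systemic: bool
    source_ontologies: tuple = ()
    rationale: str = ""


def systemic_identifiers(rows, identifier_type):
    """Return identifier values of one type, restricted to systemic rows."""
    return sorted(r.identifier_value for r in rows
                  if r.is_systemic and r.identifier_type == identifier_type)


# concept_id values are placeholders, not real RxCUIs.
rows = [
    LookupRow("Vancomycin", 1, "ATC_CODE", "J01XA01", True, ("ATC", "RxNorm")),
    LookupRow("Vancomycin", 1, "NDC", "00409400201", True, ("RxNorm",)),
    LookupRow("Vancomycin", 1, "NAME_PATTERN", "%vancomycin%", True, ("RxNorm",)),
    LookupRow("Vancomycin (oral)", 2, "EXCLUDED_CONCEPT", "A07AA", False, ("ATC",),
              "Oral formulation from ATC class A07AA; intended for local GI action "
              "and excluded from systemic value set."),
]
audit_trail = [r for r in rows if r.rationale]
```

Keeping the excluded row in the table, flagged with is_systemic set to false and a rationale, is what allows a reviewer to see the decision instead of inferring it from an absence.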

Conclusion

This report has detailed a comprehensive, multi-ontology strategy for generating a definitive value set for systemic antibiotic medications. By moving beyond simplistic, hardcoded lists and embracing a methodologically rigorous approach, this plan ensures the highest possible accuracy for the critical task of sepsis cohort identification. The strategy's strength lies in its layered, synergistic use of multiple standard terminologies: leveraging the curated hierarchy of ATC, the rich relational graph of RxNorm, the formal logic of SNOMED CT, and the broad semantic categorization of UMLS.

The process begins with the creation of a thorough lexical scaffold, which fuels subsequent, more precise ontological queries. It incorporates critical clinical domain knowledge to navigate ambiguities, such as the systemic use of drugs classified outside the primary J01 ATC group and the careful exclusion of locally acting agents. The final data engineering phase unifies these disparate sources using RxCUI as a central anchor and provides concrete methods for mapping the resulting concepts to both the coded and free-text data fields present in MIMIC-IV and eICU-CRD.

The final deliverable, the saaki.antibiotic_lookup table, is more than just a list of identifiers. Its proposed schema, which includes fields for provenance and rationale, transforms it into a transparent, auditable, and scientifically defensible artifact. Implementing this strategy will provide a foundational data component of the highest quality, ensuring that the subsequent sepsis analysis is built upon a comprehensive and accurate identification of antibiotic exposures, thereby maximizing the validity and impact of the research findings.
