AI Healthcare Tech Stacks — and How to Master Them


Liz T.
March 15, 2026 · 20 min read
In brief

A complete engineering and scientific guide to building, deploying, and critiquing AI systems in clinical and healthcare settings — covering data standards, ML pipelines, clinical NLP, medical imaging, federated learning, MLOps, compliance, and the hard unsolved problems the field still hasn't cracked.

The Complete Stack — Seven Layers

AI healthcare systems are not single models. They are vertically integrated stacks where each layer must be designed for clinical correctness, regulatory compliance, and operational resilience. A production system at a hospital or pharma company spans seven layers: interfaces (EHR APIs, FHIR endpoints), AI services (clinical NLP, imaging AI, LLM agents), MLOps (MLflow, Kubeflow, model monitoring), ML/DL frameworks (PyTorch, MONAI, scikit-learn), the data layer (FHIR R4, HL7, DICOM, OMOP CDM), compliance (HIPAA de-identification, GDPR, FDA SaMD audit logs), and infrastructure (Kubernetes, Airflow, Kafka, GCP/AWS GovCloud).

Key insight: Most ML practitioners underestimate the interface, data standard, and compliance layers. These are where clinical AI projects most commonly fail or get blocked by procurement and legal — not in model accuracy.

Data Layer — FHIR, HL7 & DICOM

Healthcare data is radically different from general-purpose data. It arrives in clinical standards (FHIR, HL7, DICOM), spans structured codes (ICD-10, SNOMED CT, LOINC, RxNorm), and must be treated as patient-linked sensitive data from the moment it is ingested.

FHIR R4 — Fast Healthcare Interoperability Resources

FHIR is the dominant modern standard for exchanging clinical data via REST APIs. Understanding resources, bundles, and SMART on FHIR auth is non-negotiable for any AI healthcare engineer.

```python
# fhir_client.py — Fetching patient observations from a FHIR R4 server

import requests
from fhir.resources.patient import Patient
from fhir.resources.bundle import Bundle
from fhir.resources.observation import Observation
from typing import List, Optional
import pandas as pd
```
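The listing above is truncated at the imports; the heart of such a client is flattening a searchset Bundle into analysis-ready rows. A stdlib-only sketch of that step (the `observation_rows` helper and the example payload are illustrative, not from the original file):

```python
# Minimal sketch: flatten a FHIR R4 searchset Bundle of Observations into rows.
# A real client would GET {base}/Observation?patient={id}&code=... first.

from typing import Dict, List

def observation_rows(bundle: Dict) -> List[Dict]:
    """Extract (code, value, unit, time) rows from a FHIR searchset Bundle."""
    rows = []
    for entry in bundle.get("entry", []):
        obs = entry.get("resource", {})
        if obs.get("resourceType") != "Observation":
            continue
        coding = (obs.get("code", {}).get("coding") or [{}])[0]
        quantity = obs.get("valueQuantity", {})
        rows.append({
            "loinc": coding.get("code"),
            "display": coding.get("display"),
            "value": quantity.get("value"),
            "unit": quantity.get("unit"),
            "effective": obs.get("effectiveDateTime"),
        })
    return rows

if __name__ == "__main__":
    bundle = {
        "resourceType": "Bundle", "type": "searchset",
        "entry": [{"resource": {
            "resourceType": "Observation",
            "code": {"coding": [{"system": "http://loinc.org",
                                 "code": "4548-4", "display": "HbA1c"}]},
            "valueQuantity": {"value": 7.2, "unit": "%"},
            "effectiveDateTime": "2025-11-03T09:15:00Z",
        }}],
    }
    print(observation_rows(bundle))
```

A real client would also authenticate via SMART on FHIR OAuth2 and follow `Bundle.link` paging; both are omitted here.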

DICOM — Medical Imaging Standard

DICOM is the universal format for medical imaging — CT, MRI, PET, X-ray, ultrasound. Every pixel of every clinical scan arrives wrapped in DICOM metadata. You cannot build medical imaging AI without mastering it.

```python
# dicom_pipeline.py — DICOM ingestion, anonymization & tensor conversion

import pydicom
import numpy as np
import torch
from pathlib import Path
from typing import Dict, Tuple
```
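The listing stops at the imports. The two steps its filename promises (anonymization and pixel conversion) can be sketched without pydicom by operating on a plain tag dictionary; the tag names mirror real DICOM attributes, but `PHI_TAGS` and the helper names are this sketch's own:

```python
# Sketch of DICOM anonymization + CT pixel conversion using plain dicts/lists
# in place of a pydicom Dataset, so it runs standalone.

import hashlib
from typing import Dict, List

PHI_TAGS = {"PatientName", "PatientBirthDate", "PatientAddress",
            "InstitutionName", "ReferringPhysicianName"}

def anonymize(tags: Dict[str, str]) -> Dict[str, str]:
    """Drop direct identifiers; replace PatientID with a one-way hash."""
    out = {k: v for k, v in tags.items() if k not in PHI_TAGS}
    if "PatientID" in out:
        out["PatientID"] = hashlib.sha256(out["PatientID"].encode()).hexdigest()[:16]
    return out

def to_hounsfield(pixels: List[int], slope: float, intercept: float) -> List[float]:
    """Apply DICOM RescaleSlope / RescaleIntercept to raw CT pixel values."""
    return [p * slope + intercept for p in pixels]

def window(hu: List[float], center: float, width: float) -> List[float]:
    """Clamp HU values to a display window and scale to [0, 1] for the model."""
    lo, hi = center - width / 2, center + width / 2
    return [(min(max(v, lo), hi) - lo) / (hi - lo) for v in hu]
```

In a real pipeline the same logic runs over `pydicom.Dataset` elements and the windowed array becomes a `torch` tensor; hashing the PatientID (rather than deleting it) preserves longitudinal linkage without exposing the identifier.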

Controversy — FHIR Is Not as Interoperable as Advertised: While FHIR R4 is mandated under the 21st Century Cures Act, every major EHR vendor implements it differently. Epic's FHIR sandbox exposes a curated subset of resources; structured data critical for AI — detailed procedure notes, nuanced medication dosing, device data — is often inaccessible or buried in unstructured text. Real AI pipelines still rely heavily on proprietary Epic/Cerner APIs, bulk FHIR exports requiring separate vendor agreements, or direct database access.

OMOP Common Data Model — Research-Grade Standardization

For research and ML training across multiple hospital systems, OMOP CDM is the gold standard. It maps ICD, SNOMED, RxNorm, and LOINC to unified concept IDs, enabling multi-site studies without data sharing.

```python
# omop_query.py — Cohort extraction from OMOP CDM via SQLAlchemy

from sqlalchemy import create_engine, text
import pandas as pd

engine = create_engine("postgresql://user:pass@host/omop_cdm")

# Extract T2D cohort: patients with HbA1c > 6.5% and metformin prescription
```
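Since the SQL itself is cut off, here is a runnable approximation of the cohort logic against an in-memory SQLite stand-in for the CDM. The table and column names follow OMOP; the concept IDs used (3004410 for HbA1c, 1503297 for metformin) should be verified against your vocabulary tables before reuse:

```python
# T2D cohort: persons with an HbA1c measurement > 6.5 AND a metformin exposure.
# SQLite stands in for Postgres; the CDM table shapes are simplified.

import sqlite3

COHORT_SQL = """
SELECT DISTINCT m.person_id
FROM measurement m
JOIN drug_exposure d ON d.person_id = m.person_id
WHERE m.measurement_concept_id = 3004410
  AND m.value_as_number > 6.5
  AND d.drug_concept_id = 1503297
"""

def t2d_cohort(conn: sqlite3.Connection) -> list:
    return sorted(r[0] for r in conn.execute(COHORT_SQL))

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE measurement (person_id INT, measurement_concept_id INT, value_as_number REAL)")
    conn.execute("CREATE TABLE drug_exposure (person_id INT, drug_concept_id INT)")
    conn.executemany("INSERT INTO measurement VALUES (?,?,?)",
                     [(1, 3004410, 7.1), (2, 3004410, 5.9), (3, 3004410, 8.0)])
    conn.executemany("INSERT INTO drug_exposure VALUES (?,?)",
                     [(1, 1503297), (2, 1503297)])
    print(t2d_cohort(conn))
```

The same SQL runs unchanged through SQLAlchemy's `engine.connect()` against a real OMOP Postgres instance, which is the point of the CDM: cohort definitions are portable across sites.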

ML / Deep Learning Frameworks

The ML layer sits atop the data layer and below the clinical interface. Choosing the right framework is a function of task type, data modality, and regulatory classification of the model output.


Risk Scoring with XGBoost + SHAP

The most commonly deployed AI model in hospitals is not a transformer — it is a gradient-boosted tree predicting readmission risk, sepsis onset, or deterioration from structured EHR features.

```python
# sepsis_risk_model.py — Training and SHAP-explained XGBoost sepsis risk model

import xgboost as xgb
import shap
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score
```
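The training code is cut off, but the metric it imports is worth internalizing: AUROC is the probability that a randomly chosen septic patient is scored above a randomly chosen non-septic one. A stdlib sketch equivalent to `roc_auc_score`, with ties counted as half:

```python
# AUROC from first principles: the fraction of (positive, negative) score
# pairs ranked correctly — the Mann–Whitney U statistic, which is what
# sklearn's roc_auc_score computes.

from typing import Sequence

def auroc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For rare outcomes like sepsis, also track `average_precision_score`: AUROC can look excellent while the positive predictive value at deployable thresholds stays clinically useless.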

Clinical NLP & Large Language Models

Around 70–80% of actionable clinical information exists in unstructured text: physician notes, discharge summaries, radiology reports, pathology reports, operative notes. Clinical NLP extracts structured knowledge from this text. Modern LLMs are rapidly reshaping this space — not without significant controversy.

Named Entity Recognition with ClinicalBERT

```python
# clinical_ner.py — Clinical NER: extracting diagnoses, meds & findings from notes

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
import spacy
from scispacy.linking import EntityLinker  # links entities → UMLS concepts

# Load fine-tuned clinical NER model
```
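The model-loading code is truncated; to make the output contract concrete, here is a toy dictionary-based stand-in for the NER-plus-UMLS-linking step. The lexicon and CUIs below are illustrative examples, not a clinical vocabulary:

```python
# Toy dictionary NER: same output shape as a transformer NER + UMLS linker
# (span, type, concept ID), with none of the generalization.

import re
from typing import Dict, List

LEXICON = {
    "atrial fibrillation": ("DIAGNOSIS", "C0004238"),
    "metformin": ("MEDICATION", "C0025598"),
    "shortness of breath": ("FINDING", "C0013404"),
}

def extract_entities(note: str) -> List[Dict]:
    """Return entity spans with type and linked concept ID (CUI)."""
    ents = []
    for phrase, (etype, cui) in LEXICON.items():
        for m in re.finditer(re.escape(phrase), note, flags=re.IGNORECASE):
            ents.append({"text": m.group(0), "start": m.start(),
                         "end": m.end(), "type": etype, "cui": cui})
    return sorted(ents, key=lambda e: e["start"])
```

What the real model adds over this sketch is everything hard: misspellings, abbreviations ("afib", "SOB"), negation ("denies chest pain"), and family history versus patient history.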

LLMs in Clinical Settings — Structured Extraction

Foundation LLMs are being tested for discharge summary generation, clinical decision support, and patient-facing Q&A. The engineering challenge is not just accuracy — it is structured output, hallucination control, and latency within clinical workflow.

```python
# llm_clinical_extraction.py — Structured extraction from notes using an LLM + Pydantic

from anthropic import Anthropic
from pydantic import BaseModel, Field
from typing import List, Optional
import json

class Medication(BaseModel):
```
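The schema definition is cut off mid-class, but the guardrail pattern it gestures at can be sketched with stdlib dataclasses in place of Pydantic: validate the model's JSON strictly and raise on missing or unexpected fields rather than trusting free text. The field names here are assumptions:

```python
# Strict schema validation of LLM output, stdlib-only. Reject anything that
# does not match the schema — silent acceptance is how hallucinated dosing
# fields reach a chart.

import json
from dataclasses import MISSING, dataclass, fields
from typing import List, Optional

@dataclass
class Medication:
    name: str
    dose: str
    route: Optional[str] = None  # optional field

def parse_medications(llm_output: str) -> List[Medication]:
    """Parse and validate LLM JSON; fail loudly on any schema violation."""
    data = json.loads(llm_output)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of medications")
    allowed = {f.name for f in fields(Medication)}
    required = {f.name for f in fields(Medication) if f.default is MISSING}
    meds = []
    for item in data:
        missing = required - set(item)
        if missing:
            raise ValueError(f"missing required fields: {missing}")
        unknown = set(item) - allowed
        if unknown:
            raise ValueError(f"unexpected fields: {unknown}")
        meds.append(Medication(**item))
    return meds
```

Pydantic gives the same contract with richer coercion and error messages; either way, the design point is that the validator, not the LLM, is the authority on what counts as a medication record.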

Controversy — LLM Hallucination in Clinical Contexts Is a Patient Safety Issue: LLMs hallucinate drug interactions, fabricate dosing information, confabulate lab reference ranges, and express false confidence on rare-disease presentations — precisely the cases where clinical judgment matters most. The NEJM AI editorial board has called for a moratorium on LLM clinical decision support deployment without prospective randomized controlled trials. As of 2025, no LLM has FDA clearance as a clinical decision support system.

Medical Imaging AI with MONAI

Medical imaging is the single most commercially successful AI vertical in healthcare. The FDA has cleared over 950 AI/ML-based medical devices as of 2025, the majority in radiology. The dominant framework is MONAI (Medical Open Network for AI), a PyTorch-based framework built by NVIDIA and the clinical AI community.

```python
# monai_segmentation.py — 3D lung nodule segmentation training loop with MONAI

import torch
from monai.networks.nets import UNet
from monai.losses import DiceCELoss
from monai.metrics import DiceMetric
from monai.transforms import (
    Compose, LoadImaged, EnsureChannelFirstd,
```
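The training loop is truncated, but the quantity it optimizes (via `DiceCELoss`) and tracks (via `DiceMetric`) is just the Dice coefficient. A stdlib sketch on flattened binary masks:

```python
# Dice coefficient: 2|P ∩ T| / (|P| + |T|) over binary masks; 1.0 is perfect
# overlap, 0.0 is disjoint. The eps term keeps empty-mask cases defined.

from typing import Sequence

def dice(pred: Sequence[int], target: Sequence[int], eps: float = 1e-6) -> float:
    assert len(pred) == len(target)
    intersection = sum(p * t for p, t in zip(pred, target))
    return (2.0 * intersection + eps) / (sum(pred) + sum(target) + eps)
```

Dice is preferred over pixel accuracy for segmentation because lesions occupy a tiny fraction of a scan: a model predicting "no nodule anywhere" scores near-perfect accuracy but near-zero Dice.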
Controversy — The "AI Radiologist" Headline vs. Clinical Reality: A 2020 systematic review (Nagendran et al., BMJ) examined 81 head-to-head AI-vs.-clinician studies and found 70 had high risk of bias, none were prospective randomized trials, and almost all tested AI on data from the same institution used for training. When models are tested at different hospitals or on different scanner models, performance drops are frequently large and clinically significant — a phenomenon called dataset shift.

Federated Learning — Training Without Moving Data

Federated learning (FL) solves one of healthcare AI's hardest problems: how to train on data from multiple hospitals without ever sharing patient records. Each hospital trains a local model on its own data and shares only model updates (weight deltas or gradients), which a central server aggregates into a global model.

```python
# federated_learning.py — Federated Averaging (FedAvg) with Flower framework

import flwr as fl
import torch
import torch.nn as nn
from collections import OrderedDict
from typing import Dict, List, Tuple
import numpy as np
```
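The Flower client code is cut off; the aggregation at the heart of FedAvg is simply a sample-weighted average of client weights, which Flower's built-in `FedAvg` strategy performs over NumPy arrays. A stdlib sketch of that one step, with weights flattened to a list of floats:

```python
# FedAvg aggregation: average client weight vectors, weighted by each
# client's sample count, so a 10,000-patient hospital moves the global
# model more than a 500-patient clinic.

from typing import List, Tuple

def fedavg(updates: List[Tuple[List[float], int]]) -> List[float]:
    """updates = [(weights, n_samples), ...] → weighted-average weights."""
    total = sum(n for _, n in updates)
    agg = [0.0] * len(updates[0][0])
    for weights, n in updates:
        for i, w in enumerate(weights):
            agg[i] += w * (n / total)
    return agg
```

One round of real FL is: broadcast the global weights, let each hospital run local epochs, then apply this function to what comes back; repeat until convergence.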
Open Question — Does Federated Learning Actually Preserve Privacy? Gradient inversion attacks (Zhu et al., NeurIPS 2019; Geiping et al., NeurIPS 2020) have demonstrated that training images — including medical images — can be partially or fully reconstructed from gradients, particularly in early training rounds with small batch sizes. Combining FL with differential privacy adds formal guarantees but at significant accuracy cost. The field has not yet found a principled resolution to this tension.

MLOps for Healthcare — Production Is Different Here

Healthcare MLOps has requirements that don't exist in standard ML engineering: model versioning tied to regulatory submissions, mandatory audit trails, clinical performance drift monitoring, and the ability to roll back a deployed model within hours if a safety signal emerges.

```python
# clinical_drift_monitor.py — Statistical drift detection for deployed clinical models

from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, ClassificationPreset
from evidently.metrics import DatasetDriftMetric
import pandas as pd
import mlflow
from datetime import datetime
```
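The Evidently report setup is truncated; as a self-contained illustration of what one per-feature drift statistic computes, here is the Population Stability Index (PSI) in stdlib Python. The usual industry convention (assumed here, not an Evidently default) reads PSI < 0.1 as stable, 0.1–0.25 as moderate, and > 0.25 as significant drift:

```python
# PSI: bin the reference and production distributions identically, then sum
# (a_i − e_i) · ln(a_i / e_i) over bins. Smoothing avoids log(0) on empty bins.

import math
from typing import Sequence

def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def frac(xs):
        counts = [0] * bins
        for x in xs:
            counts[min(int((x - lo) / width), bins - 1)] += 1
        return [(c + 0.5) / (len(xs) + 0.5 * bins) for c in counts]

    e, a = frac(expected), frac(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

In production this runs per feature on a schedule, and a breach triggers the rollback path described above — which is why the monitor must be wired to deployment tooling, not a dashboard nobody watches.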
Regulatory Requirement: Under FDA's proposed AI/ML SaMD framework and the EU AI Act, any model update — including retraining on new data — may constitute a "significant change" requiring re-submission or 510(k) notification. Change-management controls must be integrated from day one, not retrofitted after deployment.

HIPAA, GDPR & Regulatory Compliance

Compliance is not a legal checkbox — it is an engineering discipline. HIPAA's Safe Harbor de-identification standard specifies 18 PHI identifiers that must be removed or transformed. Non-compliance exposes institutions to penalties of up to $1.9M per violation category per year (HIPAA) and 4% of global annual turnover (GDPR).

```python
# hipaa_deidentification.py — HIPAA Safe Harbor de-identification pipeline

import re
import hashlib
from datetime import datetime, timedelta
import pandas as pd
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
```
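The Presidio pipeline is cut off; the regex-and-date-shift core of Safe Harbor scrubbing can be sketched in the stdlib. The patterns below cover only a few identifier classes, which is precisely why production pipelines layer an ML-based detector like Presidio on top — regexes alone miss names and free-text identifiers:

```python
# Regex scrub for a few PHI classes, plus date shifting by a per-patient
# random offset: intervals between events are preserved, absolute dates
# are not.

import random
import re
from datetime import datetime, timedelta

PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "MRN": re.compile(r"\bMRN[:# ]?\d{6,10}\b"),
}
DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")

def deidentify(text: str, patient_key: str) -> str:
    for label, pat in PATTERNS.items():
        text = pat.sub(f"[{label}]", text)
    # Deterministic per-patient offset: same key, same shift, so a patient's
    # timeline stays internally consistent across documents.
    offset = timedelta(days=random.Random(patient_key).randint(30, 365))
    def shift(m):
        d = datetime(int(m[1]), int(m[2]), int(m[3])) - offset
        return d.strftime("%Y-%m-%d")
    return DATE.sub(shift, text)
```

Note the trade-off baked into date shifting: it preserves the event intervals models actually learn from while removing the calendar dates Safe Harbor restricts.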

Explainability, Trust & Clinical Governance

A model that cannot explain its predictions to a clinician cannot safely be used in clinical decision-making. Explainability in healthcare is a precondition for physician trust, institutional liability management, and in the EU, a legal right under GDPR Article 22.

```python
# gradcam_explanation.py — Grad-CAM saliency maps for radiology AI interpretability

import torch
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
from typing import Optional
```
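The implementation is truncated. Stripped of the PyTorch hooks that capture activations and gradients, the Grad-CAM combination itself is short: channel weights from globally averaged gradients, a weighted sum of activation maps, then ReLU. A stdlib sketch on nested lists, isolating just that arithmetic:

```python
# Grad-CAM core: α_k = global-average-pooled gradient of the target logit
# w.r.t. channel k; saliency = ReLU(Σ_k α_k · A_k).

from typing import List

def grad_cam(acts: List[List[List[float]]],
             grads: List[List[List[float]]]) -> List[List[float]]:
    """acts/grads indexed [channel][h][w]; returns a saliency map [h][w]."""
    h, w = len(acts[0]), len(acts[0][0])
    alphas = [sum(sum(row) for row in g) / (h * w) for g in grads]
    cam = [[0.0] * w for _ in range(h)]
    for a_k, act in zip(alphas, acts):
        for i in range(h):
            for j in range(w):
                cam[i][j] += a_k * act[i][j]
    return [[max(v, 0.0) for v in row] for row in cam]  # ReLU
```

The PyTorch version differs only in capturing `acts` and `grads` via forward/backward hooks on the last convolutional layer and upsampling the map to image resolution for overlay.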
Controversy — SHAP and Grad-CAM Don't Actually Explain Models: Research by Rudin (2019, Nature Machine Intelligence) argues post-hoc explanations of black-box models are inherently unreliable: SHAP values change under perturbations that don't change model predictions. Adebayo et al. showed Grad-CAM saliency maps from a fully randomized neural network are visually indistinguishable from those of a trained model. The counterclaim: a Grad-CAM overlay showing the model focused on a lesion rather than an image artifact is actionable, even without mechanistic interpretation.

Controversies — What the Field Gets Wrong

The Racial Bias Problem Is Structural, Not Fixable by Debiasing

A 2019 Science paper (Obermeyer et al.) revealed that a widely used commercial healthcare algorithm systematically under-referred Black patients for high-risk care by using healthcare costs as a proxy for healthcare needs — a structurally racist proxy, since Black patients receive less care for equivalent illness severity due to systemic factors. Post-hoc debiasing can reduce bias on measured metrics while amplifying it on unmeasured ones. Critics argue that fair clinical AI requires fair clinical systems first — no debiasing technique can make an unjust dataset just.

The Academic-to-Deployment Gap Is Wider Than Reported

A 2021 Lancet Digital Health analysis reviewed 415 clinical AI papers in high-impact journals and found fewer than 2% were prospective clinical trials. The rest were retrospective analyses on curated, single-institution datasets. Several high-profile tools failed in prospective deployment — most notably the Epic sepsis model, which had a published AUROC of 0.76 but was independently evaluated at AUROC 0.63 in production, with high false positive rates leading to alert fatigue.

Data Governance in AI Training Is a Legal Minefield

Multiple major AI healthcare companies have faced legal challenges over patient data use. Google's Project Nightingale accessed medical records of approximately 50 million Americans without individual patient consent — permissible under a HIPAA Business Associate Agreement but widely criticized. The core legal tension: HIPAA permits use of de-identified data and BAA-covered PHI for treatment, payment, and healthcare operations — but training a commercial AI model arguably falls outside those categories. No court has definitively ruled on this.

Open Questions — The Research Frontier

These are the questions leading researchers are actively working on and haven't solved. Mastering AI healthcare means knowing where the field's knowledge ends.

Causality vs. correlation in EHR models. When a readmission model learns that patients discharged on Fridays are more likely to be readmitted, is it learning a causal signal or a confound of social determinants of health? How do we train models that learn causal mechanisms rather than hospital workflow artifacts?

Foundation models for medicine — are they ready? Models like Med-PaLM 2 and GPT-4 score highly on medical licensing exams. But USMLE-level reasoning is not the same as clinical decision-making. How do we evaluate clinical reasoning, not clinical recall?

Uncertainty quantification for high-stakes decisions. When a model is 65% confident in a diagnosis, what does that mean clinically? Bayesian deep learning, conformal prediction, and ensemble methods all produce confidence estimates — but calibration to real-world clinical outcomes at the patient level remains unsolved.
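Of the methods named above, split conformal prediction is the most mechanically transparent: calibrate a threshold on held-out nonconformity scores so prediction sets cover the true label with probability at least 1 − α. A stdlib sketch, with invented calibration probabilities standing in for any classifier's output:

```python
# Split-conformal prediction sets for classification. Nonconformity score:
# 1 − p(true class). Threshold: the conservative (1 − α) empirical quantile
# with the standard finite-sample correction.

import math
from typing import Dict, List

def calibrate(cal_probs: List[Dict[str, float]], cal_labels: List[str],
              alpha: float = 0.1) -> float:
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = min(math.ceil((n + 1) * (1.0 - alpha)), n)
    return scores[k - 1]

def prediction_set(probs: Dict[str, float], qhat: float) -> List[str]:
    """All labels whose nonconformity stays under the calibrated threshold."""
    return sorted(c for c, p in probs.items() if 1.0 - p <= qhat)
```

The clinically honest property is that an uncertain case returns a *set* ({sepsis, pneumonia}) rather than a falsely confident point prediction — though the guarantee is marginal over patients, not conditional on the individual patient, which is exactly the unsolved part.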

Continuous learning without catastrophic forgetting. Clinical distributions change: new pathogens, updated treatment guidelines, demographic shifts. How do you continuously update a deployed clinical AI model without degrading performance on the cohorts it was originally validated on?

The rare disease problem. Most EHR training datasets have fewer than 100 cases of most rare diseases. Standard deep learning fails at this sample size. Few-shot clinical reasoning — learning from 5–20 examples — remains an open research problem with enormous clinical value.

Multimodal clinical AI at scale. A physician integrates imaging, labs, vital signs, notes, and patient history simultaneously. Building AI systems that do the same — without privileging one modality or requiring all modalities at inference — is the grand challenge of clinical AI architecture.

The deepest open question — does clinical AI actually improve outcomes? As of 2025, fewer than 30 randomized controlled trials have tested whether AI clinical decision support actually improves patient outcomes. Several of the best-powered trials have been neutral or negative: a 2023 NEJM trial of AI-assisted diabetic retinopathy screening showed no improvement over standard care; a 2022 JAMA trial of AI sepsis alerts found increased alert fatigue with no mortality benefit. The field is, in the most charitable reading, ten years away from having the clinical trial data needed to answer this question rigorously. Practitioners who understand this uncertainty — and build AI systems that generate the evidence for their own validation — will define the next generation of the discipline.

Further Reading: Obermeyer et al., Science 2019 · Nagendran et al., BMJ 2020 · Rudin, Nature Machine Intelligence 2019 · Adebayo et al., NeurIPS 2018 · Wong et al., JAMA Internal Medicine 2021
