
There is a particular kind of organizational dysfunction that only emerges in technical fields: the fluency gap. It's the gap between teams that can talk about a concept and teams that have actually implemented it correctly. In MLOps, that gap is unusually wide, and unusually expensive.
The terminology of MLOps is now broadly understood. Model drift. Feature stores. CI/CD for ML. Data versioning. Monitoring pipelines. You will hear these terms in almost every serious engineering conversation about production machine learning. What you hear far less often is an honest accounting of what most teams are actually doing when they use them, and how far that reality sits from what the terms are supposed to mean.
This is not a criticism of individuals. It's a structural observation about how a discipline matures. The vocabulary travels faster than the practice. Teams learn the language before they've had the time, tooling, or failure history to learn what's underneath it. And in MLOps specifically, the cost of that gap tends to stay hidden until something expensive breaks.
What follows is an honest look at six of the most commonly misused concepts in MLOps: what teams say they're doing, what they're usually actually doing, and what genuinely getting it right looks like.
When people say: "We have model monitoring in place."
What's usually happening: A dashboard exists. It tracks prediction volume and latency. Someone receives an alert if the endpoint goes down. On particularly mature teams, there may be a weekly job that computes accuracy against a sample of labeled data (assuming labels arrive, which is already a bigger assumption than most teams acknowledge).
What's actually being missed: The thing that causes models to fail silently in production is almost never a server outage. It's distributional shift: the slow, invisible divergence between the data the model was trained on and the data it's now seeing. Prediction volume staying steady and latency staying low tells you nothing about whether the model is still doing what you think it's doing.
Real model monitoring operates at three levels simultaneously. First, data drift: are the statistical properties of incoming features (their distributions, their missingness patterns, their correlation structures) changing? Second, concept drift: even if inputs look the same, has the relationship between inputs and the correct output changed? Third, outcome monitoring: are the downstream business metrics the model is supposed to influence actually moving in the right direction?
Most teams have a partial version of the first layer. Almost none have all three connected in a way that produces actionable alerts rather than just retrospective dashboards.
What better looks like: Population Stability Index computed on every input feature at a defined cadence. Separate drift thresholds for high-cardinality and low-cardinality features. A labeled feedback loop, even a sampled, delayed one, that grounds statistical drift signals in actual outcome data. Alerts that page someone only when drift exceeds a threshold and a downstream metric has moved. Everything else is noise.
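To make that concrete, here is a minimal sketch of a PSI check in plain Python with NumPy. The bin count, the 0.2 PSI threshold, and the outcome-delta threshold are illustrative defaults rather than standards; the point is pairing the drift signal with an outcome signal before paging anyone.

```python
import numpy as np

def population_stability_index(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI between a feature's training (expected) and serving (actual) distributions."""
    # Bin edges are fixed from the training distribution so comparisons stay stable.
    # Serving values outside the training range are ignored here; a production
    # version would add open-ended edge bins.
    edges = np.histogram_bin_edges(expected, bins=bins)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)

    # Guard against empty bins before taking the log.
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)

    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

def should_page(psi: float, outcome_delta: float,
                psi_threshold: float = 0.2, outcome_threshold: float = 0.05) -> bool:
    # Page a human only when drift AND a downstream metric have both moved.
    return psi > psi_threshold and abs(outcome_delta) > outcome_threshold
```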
When people say: "We've implemented a feature store."
What's usually happening: A shared database table (occasionally a Redis cache) that multiple models read from. Sometimes a bit of shared transformation logic has been extracted into a common library. The word "store" is doing a lot of heavy lifting.
What's actually being missed: The defining problem a feature store exists to solve is training-serving skew: the subtle, catastrophic divergence between the features computed during training and the features served at inference time. This happens when the transformation logic is implemented in two different places, once in the training pipeline and once in the serving layer, and those implementations drift apart. It is one of the most common sources of production model underperformance, and it is almost invisible without explicit tooling to detect it.
A genuine feature store doesn't just share data. It shares transformation logic, ideally through a unified compute layer that runs the same code path at training time and at serving time. It also maintains point-in-time correctness: the ability to reconstruct, for any training example, exactly the feature values that would have been available at the moment of prediction, without leaking future data into the past. This is technically harder than it sounds, and most shared database tables do not do it.
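To illustrate what point-in-time correctness means in practice, here is a small pandas sketch (the column names and values are hypothetical): for each labeled event, the join may only pick up feature values observed at or before that event's timestamp, never after.

```python
import pandas as pd

# Hypothetical labels: one row per prediction event we want a training example for.
labels = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "event_time": pd.to_datetime(["2024-03-01", "2024-06-01", "2024-06-01"]),
    "churned": [0, 1, 0],
})

# Hypothetical feature snapshots: the value as it was known at each point in time.
features = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "feature_time": pd.to_datetime(["2024-02-15", "2024-05-20", "2024-05-28"]),
    "avg_order_value": [52.0, 38.0, 71.0],
})

# direction="backward" picks, per label row, the most recent feature value at or
# before event_time -- never a value from the future.
training_set = pd.merge_asof(
    labels.sort_values("event_time"),
    features.sort_values("feature_time"),
    left_on="event_time",
    right_on="feature_time",
    by="customer_id",
    direction="backward",
)
```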
What better looks like: A unified feature definition layer (in tools like Feast, Tecton, or a well-engineered custom implementation) where the transformation from raw data to feature value is written once and executed in both contexts. Time-travel queries for training set construction that respect point-in-time boundaries. Regular skew checks that compare feature distributions at training time against distributions at serving time in production. Governance metadata: who owns each feature, what it computes, what models depend on it.
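And a minimal sketch of the skew check itself, assuming you log a sample of the features actually served. The two-sample KS test and the 0.01 cutoff are illustrative choices; a per-feature PSI works just as well.

```python
import pandas as pd
from scipy.stats import ks_2samp

def skew_report(train_df: pd.DataFrame, serve_df: pd.DataFrame, alpha: float = 0.01) -> pd.DataFrame:
    """Flag numeric features whose serving distribution has drifted from training."""
    rows = []
    for col in train_df.select_dtypes("number").columns:
        stat, p_value = ks_2samp(train_df[col].dropna(), serve_df[col].dropna())
        rows.append({"feature": col, "ks_stat": stat, "p_value": p_value, "skewed": p_value < alpha})
    return pd.DataFrame(rows)
```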
When people say: "We have CI/CD for our ML models."
What's usually happening: There is a pipeline. Committing code triggers a run. Tests execute: usually unit tests on transformation functions, occasionally an integration test that checks the model loads without error. If everything passes, the model is deployed. This is software CI/CD applied to ML code. It is not ML CI/CD.
What's actually being missed: In software engineering, the primary artifact being validated is code behavior. In machine learning, the primary artifact is a trained model, and the trained model is not just a function of the code. It's a function of the code, the data, the hyperparameters, the random seeds, and the training infrastructure. A pipeline that validates the code but not the trained artifact is doing perhaps 30% of the necessary work.
ML CI/CD needs to validate the model itself: does it meet defined performance thresholds on a held-out evaluation set? Has its performance regressed relative to the currently deployed version? Does it behave correctly on a curated set of challenging edge cases, the slices of the data distribution where failures matter most? Are its predictions stable across runs with equivalent inputs? Is its inference latency within the defined SLA?
What better looks like: A multi-stage validation gate between training and deployment. Stage one: code quality (linting, unit tests, type checks). Stage two: training correctness (does the model train to convergence, does it reproduce expected performance within tolerance on the evaluation set?). Stage three: behavioral validation (slice-based evaluation on known hard cases, regression testing against the production model's performance, fairness checks if applicable). Stage four: infrastructure validation (latency benchmarks, memory footprint, shadow deployment on live traffic before full rollout). Most teams have stage one. Few have all four.
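As a sketch of what stages two and three might look like in code (the metric names, thresholds, and slices are illustrative, not prescriptive):

```python
from dataclasses import dataclass

@dataclass
class GateResult:
    stage: str
    passed: bool
    detail: str

def behavioral_gate(candidate, production, slices,
                    min_auc=0.80, max_regression=0.01, slice_floor=0.70):
    """Stages two and three of the gate: absolute bar, no regression, slice checks."""
    results = []

    # Stage two: does the candidate clear the absolute performance bar?
    results.append(GateResult(
        "training_correctness",
        candidate["auc"] >= min_auc,
        f"AUC {candidate['auc']:.3f} vs floor {min_auc}",
    ))

    # Stage three (a): regression test against the currently deployed model.
    regression = production["auc"] - candidate["auc"]
    results.append(GateResult(
        "no_regression",
        regression <= max_regression,
        f"regression {regression:.3f} vs allowed {max_regression}",
    ))

    # Stage three (b): slice-based evaluation on known hard cases.
    for name, auc in slices.items():
        results.append(GateResult(f"slice:{name}", auc >= slice_floor, f"AUC {auc:.3f}"))

    return results

# Deployment proceeds only if every gate passes:
# deploy = all(r.passed for r in behavioral_gate(candidate_metrics, prod_metrics, slice_aucs))
```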
When people say: "We version our data."
What's usually happening: Training datasets are saved to cloud storage with a timestamp in the folder name. Occasionally there is a naming convention. A very sophisticated team may have a README in the folder describing what the dataset contains.
What's actually being missed: Data versioning, done correctly, is not about saving files with dates in the name. It is about reproducibility, the ability to look at any model currently in production and reconstruct, precisely, the exact dataset it was trained on: which records were included, which were excluded, what transformations were applied, what the train/validation/test split boundaries were, and what the source data looked like at the moment of snapshot.
Without that, you cannot reliably debug production failures. You cannot run controlled experiments comparing models trained on different data. You cannot satisfy audit or compliance requirements that increasingly exist in regulated industries. And you cannot safely retrain a model, because you do not know what you're retraining on relative to what produced the version currently deployed.
What better looks like: Tools like DVC, Delta Lake, or LakeFS that track data at the commit level, capturing not just what the data was but how it was produced, from what upstream sources, through what transformations. Lineage graphs that connect raw source data through processing steps to the specific training set consumed by a specific model version. Automated snapshot creation at the start of each training run, linked to the model artifact and the code version that produced it. The test: if a model deployed eight months ago starts misbehaving, can you reproduce its training set in under an hour? If not, data versioning is not complete.
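As one lightweight illustration (assuming MLflow for tracking; the tag names are our own convention, not anything the library mandates), the start of every training run can fingerprint the exact data snapshot and tie it to the code version that consumed it:

```python
import hashlib
import subprocess
import mlflow

def dataset_fingerprint(path: str) -> str:
    """Content hash of the training snapshot, so it can be verified byte-for-byte later."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def log_training_lineage(train_path: str, source_query: str) -> None:
    """Tie the run to the exact data snapshot and code version that produced it.

    Call inside an active MLflow run (e.g. within `with mlflow.start_run():`).
    """
    git_sha = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()
    mlflow.set_tag("data.fingerprint", dataset_fingerprint(train_path))
    mlflow.set_tag("data.source_query", source_query)
    mlflow.set_tag("code.git_sha", git_sha)
```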
When people say: "We use a model registry."
What's usually happening: MLflow is running somewhere (often locally, occasionally on a shared server). Models are logged after training. The experiment tracker has a UI that someone occasionally opens. In the most common configuration, models are stored with a name and a version number, and the "registry" is primarily used as a file storage system with a slightly nicer interface than S3.
What's actually being missed: A model registry is not a storage system. It is a governance system. The distinction matters. Storage answers the question "where is this model?" Governance answers the questions that actually determine whether your ML deployment is safe and auditable: Who approved this model for production? What evaluation criteria did it pass? What data was it trained on? What is its expected performance profile? What is its failure behavior? Who is responsible for monitoring it? What is the escalation path if it degrades?
In regulated industries โ finance, healthcare, insurance โ these questions have legal answers required by compliance frameworks. But even outside regulation, teams that cannot answer them are flying blind in production.
What better looks like: A registry entry that includes, for every production model: the training dataset version, the evaluation report with slice-level breakdowns, the approval record (who signed off and against what criteria), the deployment configuration, the monitoring thresholds, the rollback procedure, and the model card describing its intended use, known limitations, and out-of-distribution behavior. MLflow can hold some of this. The organizational discipline to populate and maintain it is the harder part.
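As a sketch of what that looks like mechanically with MLflow's client API (the model name, version, and tag values here are hypothetical; the tag schema is a convention you define, not something the registry enforces):

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Governance metadata for a hypothetical registered model version.
# The keys are our own convention; the values are placeholders.
governance = {
    "training_data_version": "snapshot sha256:a1b2c3...",
    "evaluation_report": "s3://ml-reports/churn/v12/eval.html",
    "approved_by": "jane.doe",
    "approval_criteria": "AUC >= 0.82, no critical slice below 0.70",
    "monitoring_psi_threshold": "0.2",
    "rollback_procedure": "runbooks/churn-rollback.md",
}
for key, value in governance.items():
    client.set_model_version_tag(name="churn-model", version="12", key=key, value=value)
```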
When people say: "We have automated retraining."
What's usually happening: A scheduled job runs on a fixed cadence (weekly, monthly) that retrains the model on a recent window of data and redeploys it automatically if training completes without error. The implicit assumption is that more recent data is always better and that a model that trains successfully is a model that should be deployed.
Both assumptions are wrong often enough to cause serious problems.
What's actually being missed: Scheduled retraining without trigger logic is a proxy for the real thing. The real thing is event-driven retraining, retraining that fires because something measurable has changed: drift has exceeded a threshold, performance has degraded past a defined floor, a significant data distribution shift has been detected, or a labeled feedback sample has revealed systematic errors.
Beyond triggers, the deeper problem with most retraining pipelines is the absence of a challenger-champion framework. The retrained model is not automatically better than the incumbent. It has seen more recent data, yes, but it may have overfit to a transient pattern, lost performance on a data slice that matters, or regressed on an important behavioral property. Deploying without comparison is not automation. It is optimism.
What better looks like: Trigger-based retraining initiated by drift alerts or performance degradation signals, not just the calendar. A mandatory shadow period where the challenger model runs in parallel with the champion, accumulating real traffic results before any cutover decision. Defined promotion criteria: not just "training succeeded" but "challenger outperforms champion on the evaluation set by at least X% and shows no regression on critical slices." Automated rollback logic that fires if production metrics degrade within a defined window post-deployment. A full audit log: why was this retrain triggered, what did the evaluation show, who or what made the promotion decision.
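A minimal sketch of that promotion decision, assuming evaluation and shadow-traffic metrics have already been collected (the metric names, slices, and thresholds are illustrative):

```python
def promotion_decision(champion: dict, challenger: dict, shadow: dict,
                       min_lift: float = 0.01,
                       critical_slices: tuple = ("new_customers", "high_value")) -> dict:
    """Champion/challenger comparison; returns an auditable record, not just a boolean."""
    lift = challenger["auc"] - champion["auc"]
    slices_ok = all(
        challenger["slices"][s] >= champion["slices"][s] for s in critical_slices
    )
    # Shadow-period check: the challenger's live error rate must not exceed the champion's.
    shadow_ok = shadow["challenger_error_rate"] <= shadow["champion_error_rate"]

    # In practice this record is what gets written to the audit log.
    return {
        "promote": lift >= min_lift and slices_ok and shadow_ok,
        "auc_lift": lift,
        "critical_slices_ok": slices_ok,
        "shadow_ok": shadow_ok,
    }
```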
Reading across these six areas, the pattern is consistent. Teams have implemented the surface layer: the tooling, the naming, the dashboards, the pipelines. What's missing is the second-order thinking: what is this system supposed to guarantee, and how do we know it's actually guaranteeing that?
Model monitoring without outcome feedback doesn't guarantee model quality. Feature stores without skew detection don't guarantee training-serving alignment. CI/CD without behavioral validation doesn't guarantee safe deployment. Data versioning without lineage doesn't guarantee reproducibility. Model registries without governance don't guarantee auditability. Retraining pipelines without challenger frameworks don't guarantee improvement.
The gap between saying the term and doing the thing is not a knowledge gap. Most engineering teams know, at some level, that their monitoring could be deeper or their retraining logic more principled. It's a prioritization gap, one that stays invisible until something breaks expensively enough to make the cost concrete.
The teams that close it don't do so all at once. They pick one area, define what "actually correct" looks like for their specific context and failure modes, and build toward it deliberately. Then they do the next one.
That's the whole practice, underneath the terminology.
This article is part of our ongoing series on production machine learning. If you're thinking through any of these areas for your own systems, we're happy to go deeper. Reach out or explore the rest of our work.