
There is a version of data science education that works extremely well. It builds mathematical intuition, teaches the mechanics of learning algorithms, and produces graduates who can implement a model from scratch or interpret a paper with confidence. That version is genuinely valuable and genuinely incomplete.
The incompleteness is specific. It is not about advanced theory; most industry problems do not require cutting-edge research. It is not about scale; most companies are not running petabyte pipelines. The gap is something quieter and more structural: the difference between solving a problem that has already been cleaned, framed, and handed to you, and solving a problem that arrives as a vague complaint from a sales director who thinks the model "just doesn't feel right."
Students learn to answer questions. Industry teaches you to figure out what the question actually is.
What follows is an account of five techniques where this gap is most visible, and the specific, practical things that professional data scientists do differently that nobody writes on a syllabus.
1. Exploratory data analysis
What students do:
The typical student EDA looks like this: df.shape, df.describe(), df.isnull().sum(), a correlation heatmap, a few distribution plots. The conclusion is something like "the data has 12,000 rows and 8 features, some missing values in column C which we will impute, and feature X is correlated with the target." Then they move on to modelling within the same notebook.
This is not wrong. It is just not finished.
What industry actually needs:
In a professional setting, EDA is not a preprocessing step. It is an investigation. You are not confirming the data is usable; you are building a theory about the phenomenon the data describes, looking for the things that would break your model before you build it, and identifying the findings that change what questions are worth asking.
Professional analysts approach a new dataset with explicit hypotheses written down before they open it. They look for anomalies that are interesting rather than just inconvenient. They notice when the data was collected and think about whether collection methodology changed over time. They look at distributions by subgroup, not just the overall shape, because industry data is almost never homogeneous in the way textbook data is.
The output of a professional EDA is not a notebook. It is a document (sometimes a slide, sometimes a memo) with three to five findings, each one answering the question: so what? What does this mean for the model? What does this mean for the business? What would we do differently if this weren't true?
One specific technique professionals use that students almost never do: stratified temporal EDA. You split the data by time period and run the same summaries on each slice. If the distribution of your target variable shifts from Q1 to Q4, your model trained on the full dataset is learning a blended signal that doesn't reflect any real period. That shift will not appear in any summary statistic computed on the full dataset. It only appears if you look for it.
The real project fix:
Take a public dataset with a time column: energy consumption, hospital admissions, retail transactions. Before fitting anything, split it into four equal time windows and run your full EDA on each window separately. Write one paragraph per window on how the data has changed. Then write a final paragraph on what that means for how you would train a model. The discipline of writing it forces you to notice what you otherwise skim past.
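A minimal sketch of that exercise in pandas; the file name and the "timestamp" and "target" columns are placeholders for whatever dataset you pick:

```python
import pandas as pd

# Sketch of stratified temporal EDA. File and column names are assumptions.
df = pd.read_csv("energy_consumption.csv", parse_dates=["timestamp"])

# Cut the data into four equal-width time windows.
df["window"] = pd.cut(df["timestamp"], bins=4, labels=["W1", "W2", "W3", "W4"])

# Run the same summaries on every window and compare them side by side.
for name, window in df.groupby("window", observed=True):
    print(f"--- {name}: {window['timestamp'].min()} to {window['timestamp'].max()} ---")
    print(window["target"].describe())        # does the target distribution drift?
    print(window.isnull().mean().round(3))     # do missingness patterns change?
```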
2. Feature engineering
What students do:
Students apply a standard preprocessing pipeline: encode categoricals, scale numerics, maybe add interaction terms or polynomial features because the course mentioned them. The features that go into the model are almost entirely determined by what was already in the dataset.
What industry actually needs:
Professional feature engineering starts from domain knowledge, not from the dataset. Before a practitioner touches the data, they ask: what do I actually believe causes the outcome I'm trying to predict? What are the mechanisms? What signals would a human expert use to make this judgement, and can those signals be operationalised from the data I have?
This changes everything about what gets built. Instead of encoding a "days since last purchase" column as-is, a practitioner thinks: purchase recency has a non-linear relationship with churn, and the first two weeks of inactivity are not the same as week six. So they create a bucketed feature, or a decay function, or a binary flag for "inactive more than 30 days." Each one is a hypothesis about the mechanism. Each one can be tested.
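As a rough sketch, those three hypotheses might be operationalised like this; the column name, cut points, and decay constant are illustrative assumptions, not prescriptions:

```python
import numpy as np
import pandas as pd

# Each feature encodes one hypothesis about the churn mechanism.
customers = pd.DataFrame({"days_since_last_purchase": [3, 12, 25, 48, 90]})

# Hypothesis 1: the effect of inactivity is non-linear, so bucket it.
customers["recency_bucket"] = pd.cut(
    customers["days_since_last_purchase"],
    bins=[0, 14, 30, 60, np.inf],
    labels=["0-14", "15-30", "31-60", "60+"],
)

# Hypothesis 2: engagement decays smoothly with time since the last purchase.
customers["recency_decay"] = np.exp(-customers["days_since_last_purchase"] / 30)

# Hypothesis 3: crossing 30 days of inactivity is itself a meaningful event.
customers["inactive_30_plus"] = (customers["days_since_last_purchase"] > 30).astype(int)
```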
The professional also thinks about what not to include with the same rigour. Leakage is the most expensive mistake in production data science: including a feature that is only available after the outcome is known, so the model works perfectly in evaluation and fails completely in deployment. Professionals build a leakage audit as part of every feature engineering process: for each feature, they explicitly ask "at prediction time, does this value exist?" For a churn model predicting at the start of the month, a feature built from the last day of that month is leakage. It is obvious in retrospect. It is invisible if you are not specifically looking.
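The audit itself can be as simple as a table of features and a hard failure when a leaky one slips through; the feature names below are hypothetical:

```python
# For each candidate feature, record whether its value exists at the moment
# the prediction is made (here, the start of the month). Names are hypothetical.
available_at_prediction_time = {
    "tenure_months": True,
    "support_tickets_prior_month": True,
    "logins_last_day_of_month": False,   # only known after the outcome window
    "cancellation_reason": False,        # only exists once churn has happened
}

leaky = [f for f, ok in available_at_prediction_time.items() if not ok]
if leaky:
    raise ValueError(f"Leaky features must be dropped before training: {leaky}")
```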
One technique that separates intermediate from senior practitioners: feature importance is a diagnostic, not a conclusion. Students treat high-importance features as confirmation they did good work. Professionals treat them as the start of an investigation. Why is this feature so important? Is it genuinely predictive, or is it a proxy for something I should not be using? Is it stable over time, or is it capturing a historical pattern that won't persist? A model with one dominant feature is almost always fragile.
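One way to turn that investigation into a habit is to refit the same model on separate time slices and compare importances; a feature whose rank swings wildly between periods deserves scrutiny. A sketch, assuming a DataFrame df with a "quarter" column, a binary "churned" target, and the hypothetical features above:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

features = ["recency_decay", "tenure_months", "support_tickets_prior_month"]

importances = {}
for quarter, chunk in df.groupby("quarter"):
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(chunk[features], chunk["churned"])
    importances[quarter] = pd.Series(model.feature_importances_, index=features)

# Importances that jump around between quarters suggest a fragile, historical signal.
print(pd.DataFrame(importances).round(3))
```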
The real project fix:
Before engineering any features, write a document (even a single page) titled "What I believe causes this outcome." List five mechanisms. For each one, describe the data signal that would reflect that mechanism. Then build features that operationalise those signals. After modelling, come back and check: did the features that encoded your beliefs actually carry predictive signal? If not, why not? That gap between your hypothesis and the model's result is almost always where the real insight lives.
3. Model evaluation
What students do:
Pick the model with the highest accuracy or F1. Report the number. Possibly plot an ROC curve without explaining what it means. Conclude the model is "good."
What industry actually needs:
The first thing a professional asks when evaluating a model is not "what is the score?" It is "what is the cost of being wrong, and in which direction?" Those two questions completely reframe what evaluation means.
A fraud detection model with 99% precision sounds excellent until you learn that it catches only 30% of actual fraud, meaning 70% of fraudulent transactions pass through undetected. A medical screening model with 95% overall accuracy sounds fine until you realise that it misses 40% of positive cases because positives are rare and the model has learned to be sceptical of them. Accuracy is not a meaningful metric in isolation. It never has been. Industry practitioners know this instinctively because they have seen the cost of not knowing it.
The professional approach to evaluation involves several things students are rarely taught. First, threshold analysis: the default 0.5 decision threshold is almost never optimal. Professionals plot the business cost (false positives times their cost plus false negatives times their cost) across the full threshold range, and choose the threshold that minimises cost, not the one that maximises F1. This requires quantifying error costs, which requires talking to a stakeholder, which is itself a skill.
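A minimal sketch of that cost sweep; the two cost figures are placeholders that would come from a stakeholder conversation, and y_true and y_prob are the held-out labels and predicted probabilities from any classifier:

```python
import numpy as np

COST_FALSE_POSITIVE = 5.0     # e.g. cost of needlessly contacting a customer (assumed)
COST_FALSE_NEGATIVE = 120.0   # e.g. cost of a missed churner (assumed)

def expected_cost(y_true, y_prob, threshold):
    """Business cost of classifying at a given probability threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return fp * COST_FALSE_POSITIVE + fn * COST_FALSE_NEGATIVE

def best_threshold(y_true, y_prob):
    """Sweep the threshold range and return the cost-minimising cut-off."""
    thresholds = np.linspace(0.01, 0.99, 99)
    costs = [expected_cost(y_true, y_prob, t) for t in thresholds]
    return thresholds[int(np.argmin(costs))], min(costs)
```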
Second, calibration. A model that says "70% probability" should be right 70% of the time. Most models are not well calibrated by default; tree ensembles and neural networks frequently need post-processing, such as Platt scaling or isotonic regression, before their scores can be read as probabilities. Students almost never check calibration. Professionals check it because a poorly calibrated model produces probability estimates that stakeholders treat as real probabilities and make bad decisions on.
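The check itself is a few lines with scikit-learn; this sketch assumes a fitted classifier model, held-out X_val and y_val, and training data X_train and y_train:

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve

# Compare what the model claims against what actually happens, bin by bin.
prob_true, prob_pred = calibration_curve(
    y_val, model.predict_proba(X_val)[:, 1], n_bins=10
)
# Large gaps between prob_pred (claimed) and prob_true (observed) mean the
# scores cannot be handed to a stakeholder as probabilities.

# If calibration is poor, isotonic or sigmoid post-processing usually helps.
calibrated = CalibratedClassifierCV(model, method="isotonic", cv=5)
calibrated.fit(X_train, y_train)
```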
Third, performance by subgroup. A model with 88% overall accuracy might have 94% accuracy on the majority group and 61% accuracy on a minority group. The aggregate number hides the failure. Professionals always segment evaluation: by time period, by customer cohort, by data source, by whatever dimension is most operationally relevant.
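A sketch of that segmented view, assuming a validation DataFrame val with true labels in "y", predictions in "y_pred", and a "segment" column for whichever dimension matters:

```python
import pandas as pd
from sklearn.metrics import accuracy_score, recall_score

by_segment = val.groupby("segment").apply(
    lambda g: pd.Series({
        "n": len(g),
        "accuracy": accuracy_score(g["y"], g["y_pred"]),
        "recall": recall_score(g["y"], g["y_pred"], zero_division=0),
    })
)
print(by_segment)   # the aggregate metric hides exactly this breakdown
```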
The real project fix:
Build a cost matrix for any binary classification project. Assign a realistic dollar value to a false positive and a false negative; you can make these up, but commit to them. Write a one-page memo recommending a specific decision threshold, showing the expected cost at that threshold, and comparing it to the cost at 0.5. Then show the performance broken down by at least two subgroups. This is the evaluation a senior analyst would produce. It takes twice as long as reporting a single metric and is about ten times more useful.
4. Communicating uncertainty
What students do:
Return a number. "The model predicts this customer will churn." "The forecast for next quarter is €2.4M." No interval. No caveat. No indication of what the model does or does not know.
What industry actually needs:
Point predictions without uncertainty estimates are not just incomplete; they are actively misleading. A stakeholder who receives a forecast of €2.4M will plan around €2.4M. If the actual range is €1.8M to €3.1M, the plan may be built on a false precision that the model never justified. Communicating uncertainty is not pessimism or hedging. It is honesty about what the data can and cannot support.
Professionals distinguish between several types of uncertainty that students are rarely introduced to. Aleatoric uncertainty is irreducible: the inherent randomness in the outcome. Epistemic uncertainty reflects what the model doesn't know due to limited data, and it can be reduced with more information. For a customer segment with very few training examples, the model's uncertainty should be high; it simply hasn't seen enough cases to be confident. Communicating that clearly ("we have high confidence in our predictions for enterprise customers, low confidence for the SMB segment which is underrepresented in training") is a professional skill.
One specific technique: out-of-distribution detection. Every model has a region of input space where it was trained, and a surrounding region where it has no training data and is essentially extrapolating. Professional deployments flag predictions where the input is far from the training distribution, because those predictions are unreliable regardless of what confidence score the model returns. Students almost never implement this. It is one of the most practical and underused ideas in applied ML.
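One simple way to implement the flag is distance to the nearest training points; more robust options exist (Mahalanobis distance, isolation forests, density models). A sketch, assuming numeric feature arrays X_train and X_new:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Learn what "close to the training data" looks like.
nn = NearestNeighbors(n_neighbors=5).fit(X_train)
train_dist, _ = nn.kneighbors(X_train)
threshold = np.quantile(train_dist[:, -1], 0.99)   # 99th percentile of in-sample distances

# Flag incoming rows that sit far outside that envelope.
new_dist, _ = nn.kneighbors(X_new)
out_of_distribution = new_dist[:, -1] > threshold
# Predictions for flagged rows are extrapolation, whatever confidence the model reports.
```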
Prediction intervals are different from confidence intervals, and both are different from model probability outputs; conflating them is a common error. A 90% prediction interval says "90% of actual outcomes will fall in this range." A 90% confidence interval says "if we repeated this analysis many times, 90% of the intervals built this way would contain the true mean." A model probability says "based on the training data, this class has this likelihood." They answer different questions. Professionals know which one the stakeholder is actually asking for.
The real project fix:
For any regression project, implement prediction intervals using quantile regression or bootstrapping. Present results in the following structure: point estimate, 80% interval, 90% interval, and one sentence on what input conditions cause the interval to widen significantly. Then write a two-sentence interpretation for a non-technical reader: no jargon, no Greek letters. Practise writing: "We expect X, and we'd be surprised if the true value fell outside [A, B]. If the customer has fewer than six months of history, this estimate is much less reliable." That sentence structure, written clearly and consistently, is what professional analysts produce.
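The quantile-regression route is a few lines with scikit-learn's gradient boosting; this sketch assumes X_train, y_train, and X_new already exist:

```python
from sklearn.ensemble import GradientBoostingRegressor

# One model per quantile: the median as the point estimate, plus the
# bounds of the 80% and 90% prediction intervals.
quantiles = {"lower_90": 0.05, "lower_80": 0.10, "point": 0.50,
             "upper_80": 0.90, "upper_90": 0.95}

models = {
    name: GradientBoostingRegressor(loss="quantile", alpha=q).fit(X_train, y_train)
    for name, q in quantiles.items()
}
intervals = {name: m.predict(X_new) for name, m in models.items()}
# Report the point estimate with both intervals, plus one sentence on which
# input conditions make the intervals widen.
```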
5. Code as communication
What students do:
A Jupyter notebook with 47 cells, some run in non-linear order, a variable called df2 that overwrites df in cell 23, outputs that depend on a CSV file that exists only on the author's laptop, no requirements.txt, and a comment density of approximately zero. The author can reproduce the results. No one else can.
What industry actually needs:
Code is read far more than it is written. The person who reads your code six months from now is often future you, who remembers nothing. The second most likely reader is a colleague who is trying to reproduce a result after something breaks in production. The third is a regulator or auditor. None of them have access to your mental state at 11pm when you wrote the thing.
Professional data science code is structured around the assumption that the author will not be present to explain it. That means: a single entry point that runs the full pipeline. A configuration file or argument parser for anything that might change (file paths, hyperparameters, dates). A requirements.txt or environment.yml that pins dependencies. A README that describes what the code does, what it needs, and what the expected output is. Version-controlled with meaningful commit messages.
Beyond the mechanics, there is a professional habit that students are almost never taught: writing down your assumptions explicitly in the code. Not in comments that say # load the data, but comments that say # assumes the input file has one row per customer per month with no duplicate customer-month combinations; validation below. And then actual validation that raises an error if that assumption is violated. Defensive data science: your pipeline should fail loudly when the data doesn't look like what you expected, rather than silently producing wrong results on corrupted input.
One pattern professionals use constantly and students almost never implement: a data validation step at ingestion. Before any transformation, assert that the shape, types, value ranges, and key distributions of the incoming data match expectations. This catches data pipeline failures immediately instead of allowing them to propagate silently through a model and produce wrong predictions that end up in a dashboard for three weeks before anyone notices.
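Put together, the habit looks something like this; the column names, ranges, and tolerances are illustrative assumptions for a monthly churn pipeline:

```python
import pandas as pd

def validate_input(df: pd.DataFrame) -> pd.DataFrame:
    """Fail loudly if the incoming data violates the pipeline's assumptions."""
    # Assumes one row per customer per month with no duplicates.
    dupes = df.duplicated(subset=["customer_id", "month"]).sum()
    if dupes:
        raise ValueError(f"{dupes} duplicate customer-month rows found")

    # Types and value ranges.
    if not pd.api.types.is_numeric_dtype(df["monthly_spend"]):
        raise TypeError("monthly_spend must be numeric")
    if (df["monthly_spend"] < 0).any():
        raise ValueError("monthly_spend contains negative values")

    # Distribution sanity check: missingness should stay near historical levels.
    if df["tenure_months"].isnull().mean() > 0.05:
        raise ValueError("tenure_months more than 5% missing; the upstream feed may be broken")

    return df
```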
The real project fix:
Refactor one past project with a single success criterion: a colleague with access to your GitHub repository and a fresh machine should be able to reproduce your final result with two commands, git clone and python run.py. Write a README.md with exactly three sections: what it does, what it needs, how to run it. Add one data validation function at the start of the pipeline that checks at least three things about the input data and raises a descriptive error if any fail. Time how long it takes. If it takes more than four hours, that is roughly what it would cost a team to clean up if you were unavailable. That is the professional standard of code as communication.
Each of these five fixes points at the same underlying shift: from doing the technique to taking responsibility for the outcome the technique produces.
A student doing EDA is demonstrating they know how to explore data. A professional doing EDA is accepting responsibility for finding anything in that data that could matter to the decision ahead. That is a different posture entirely.
Students are almost never told this directly, because it is hard to grade. It is easy to check whether someone ran a confusion matrix. It is much harder to assess whether they understood what it was for, chose the right threshold, communicated the result honestly, and structured the code so the next analyst doesn't have to start from scratch.
The projects that bridge this gap are not more complex. They are more deliberate. They impose constraints that simulate professional conditions: a non-technical audience, a time pressure, a colleague who needs to run your code, a stakeholder who needs a number they can act on. Those constraints are not arbitrary; they are the conditions under which data science actually creates value.
The students who seek out those constraints before they graduate are not just better candidates. They are faster to become useful once hired. And in an industry that increasingly has enough people who can train a model, that difference, between technically capable and genuinely useful, is the one that determines everything.