When Data Falls Silent: Diagnosing Missing Data Mechanisms
Stanis B.
April 15, 2026 · 7 min read
Every real dataset has holes. A column full of blanks, a timestamp that simply never arrived, a field the respondent left empty: on the surface they all look the same. In practice they are not. The reason a value is missing is as important as the value itself, and conflating different reasons leads to models that are quietly, confidently wrong. A patient cohort where sicker people drop out of a trial is a different problem than a sensor that randomly powers off. Imputing both the same way is the kind of mistake that passes cross-validation and fails in production.
The field has three named mechanisms for why values go missing. MCAR (missing completely at random) is the benign case: the absence of a data point has nothing to do with anything measured or unmeasured. MAR (missing at random) is more subtle: the probability of missingness depends on other observed columns, but not on the value that is actually missing. MNAR (missing not at random) is the hard case: the value is absent precisely because of what it would have been. Knowing which regime you are in changes everything that follows: which tests to run, which imputer to trust, and how much uncertainty to carry forward into your downstream model.
What follows is a working guide to diagnosing and handling all three, with Python code you can adapt to real pipelines.
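Before the per-mechanism sections, a minimal simulation makes the distinction concrete (the variable names and parameters are illustrative, not from any real dataset): the same income column, hidden under each of the three mechanisms, leaves behind observed samples with very different biases.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

age = rng.integers(20, 65, n)
income = 30_000 + 700 * age + rng.normal(0, 8_000, n)  # income rises with age

# MCAR: a coin flip, unrelated to anything in (or out of) the data
mcar_mask = rng.random(n) < 0.3

# MAR: younger people skip the question more often -- driven by observed age
mar_mask = rng.random(n) < np.where(age < 35, 0.5, 0.1)

# MNAR: high earners skip the question -- driven by the hidden value itself
mnar_mask = rng.random(n) < np.where(income > 65_000, 0.6, 0.1)

print(f"True mean: {income.mean():,.0f}")
for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    print(f"{name} observed mean: {income[~mask].mean():,.0f}")
```

Only the MCAR sample tracks the true mean. The MAR shift could be corrected by conditioning on the observed age column; the MNAR shift could not, because its driver is the missing value itself.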
MCAR: the benign blank
"A sensor randomly drops a packet. A survey respondent accidentally skips a page. The phone dies mid-form. The value is gone for reasons entirely unrelated to the data itself."
MCAR is the statistician's dream and the practitioner's most overused assumption. When data is truly missing completely at random, the rows with missing values are an unbiased random sample of all rows. You can delete them without introducing systematic error. You can fill them with the column mean without biasing the column's centre, though even then the variance shrinks. The danger is not in the technique; it is in the diagnosis. Many analysts assume MCAR because it is convenient, not because they have tested it.
Little's MCAR test gives you a formal chi-squared statistic. A high p-value (conventionally above 0.05) means you cannot reject the MCAR hypothesis. That is not proof; it is a failure to disprove. Combine it with a visual missingness heatmap and a check of whether missingness correlates with any observed column before you commit.
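A sketch of that companion check, using a small simulated frame in place of your data: build a 0/1 indicator for each column's missingness and correlate it against the fully observed columns. Near-zero correlations are consistent with MCAR; anything sizable points toward MAR. The same 0/1 matrix is also the raw material for a heatmap (e.g. `plt.imshow(df.isnull(), aspect='auto')`, or the missingno library).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
df = pd.DataFrame({
    'age': rng.integers(20, 65, 300).astype(float),
    'income': rng.normal(50_000, 12_000, 300),
})

# Knock out income completely at random in this toy frame
df.loc[rng.random(300) < 0.2, 'income'] = np.nan

# 0/1 missingness indicators, one column per original column
indicators = df.isnull().astype(int)

# Correlate each column's missingness indicator with the other columns
for col in df.columns[df.isnull().any()]:
    corr = df.drop(columns=col).corrwith(indicators[col])
    print(f"Missingness of '{col}' vs observed columns:")
    print(corr.round(3))
```

Here the correlation with age should be near zero, since the mask is genuinely random; on MAR data the driving columns light up instead.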
MCAR: Use cases
MCAR most commonly appears in IoT sensor streams with random dropout, large paper surveys with accidental skips, and automated data collection pipelines with transient network failures. It is far rarer in human-generated survey data on sensitive topics, where structured avoidance tends to dominate.
MCAR: Techniques
For MCAR, your choices are: mean or median imputation (fast, preserves column-level statistics), random sample imputation (preserves the full empirical distribution), or listwise deletion (removes the row entirely). The last option is only defensible when the percentage of missing rows is small (under five percent is a rough rule of thumb) and the dataset is large enough to absorb the loss.
```python
import pandas as pd
import numpy as np
from scipy.stats import chi2

np.random.seed(42)
n = 400

df = pd.DataFrame({
    'age': np.random.randint(20, 65, n).astype(float),
    'income': np.random.randint(25000, 95000, n).astype(float),
    'score': np.random.randint(300, 850, n).astype(float),
    'tenure': np.random.randint(0, 30, n).astype(float)
})

# Introduce 15% MCAR missingness: a purely random mask
random_mask = np.random.rand(*df.shape) < 0.15
df[random_mask] = np.nan

print("Missing counts per column:")
print(df.isnull().sum())

total_missing = df.isnull().sum().sum()
pct_missing = df.isnull().mean().mean()

print(
    f"\nTotal missing: {total_missing} of {df.size} cells "
    f"({pct_missing:.1%})"
)


def littles_mcar_test(data: pd.DataFrame):
    """
    Rough approximation of Little's MCAR test.
    H0: data is MCAR.
    For each column with missing values, compare the means of the
    other columns within the missing-rows subgroup to the overall
    means. Under MCAR the subgroup means match the overall means,
    so each standardised squared difference is roughly chi2(1).
    High p-value: insufficient evidence to reject MCAR.
    """
    d = data.copy()
    chi2_stat = 0.0
    dof = 0

    for col in d.columns:
        miss_rows = d[col].isnull()
        if miss_rows.sum() == 0:
            continue

        for other in d.columns:
            if other == col:
                continue
            sub = d.loc[miss_rows, other].dropna()
            full = d[other].dropna()
            if len(sub) < 2 or full.var() == 0:
                continue
            chi2_stat += (
                len(sub) * ((sub.mean() - full.mean()) ** 2) / full.var()
            )
            dof += 1

    p_value = 1 - chi2.cdf(chi2_stat, df=max(dof, 1))
    return chi2_stat, p_value


stat, p = littles_mcar_test(df)

print(
    f"\nLittle's MCAR test: chi2={stat:.3f}, "
    f"p-value={p:.3f}"
)

verdict = "MCAR likely" if p > 0.05 else "NOT MCAR: check MAR or MNAR"
print("Verdict:", verdict)

# Strategy 1: mean / median imputation
df_mean = df.fillna(df.mean(numeric_only=True))
df_median = df.fillna(df.median(numeric_only=True))

print("\nMean-imputed (first 3 rows):")
print(df_mean.head(3).round(1))


# Strategy 2: random sample imputation
def random_sample_impute(series: pd.Series) -> pd.Series:
    s = series.copy()
    missing_idx = s[s.isnull()].index
    fill_values = (
        s.dropna()
        .sample(len(missing_idx), replace=True)
        .values
    )
    s.loc[missing_idx] = fill_values
    return s


df_rs = df.apply(random_sample_impute)

print("\nDistribution comparison, income column:")
print(f"  Original mean (non-missing): {df['income'].mean():,.0f}")
print(f"  Mean-imputed mean:           {df_mean['income'].mean():,.0f}")
print(f"  Random-sample-imputed mean:  {df_rs['income'].mean():,.0f}")

# Strategy 3: listwise deletion
complete_rows = df.dropna()
pct_retained = len(complete_rows) / len(df)

print(
    f"\nListwise deletion: {len(complete_rows)} rows retained "
    f"({pct_retained:.1%} of original)"
)

if pct_retained < 0.80:
    print(
        "Warning: >20% of rows dropped; "
        "listwise deletion not recommended here."
    )
```
MAR: the pattern hiding in plain sight
"Younger respondents skip the income question more often. Men underreport health scores. Patients with lower education levels leave clinical trial forms incomplete. The missing value has a pattern, but the pattern lives in columns you can see."
MAR is the workhorse case in real-world tabular data. The missingness is not random, but it is explainable by other observed variables. Once you condition on those variables, the remaining uncertainty is random. This means model-based imputation, which explicitly uses the observed columns to predict the missing ones, is both valid and effective. You are not guessing blindly; you are using information the data has already given you.
The classic diagnostic is to create a binary indicator column for each variable with missingness: income_missing = 1 where income is absent, 0 otherwise. Then check the correlation between that indicator and every other column in the dataset. A strong correlation tells you missingness is not random and points directly at which observed variables are driving the pattern. That is your MAR signature.
MAR: Use cases
MAR dominates in survey research (demographic characteristics predict non-response on sensitive questions), clinical datasets (sicker patients attend fewer follow-up visits, and severity is measured), credit scoring (employment status predicts missing income fields), and recommender systems (active users are observed more frequently than passive ones).
MAR: Techniques
KNN imputation uses the k nearest neighbours โ measured by Euclidean distance across observed features โ to fill each gap with a distance-weighted average of the neighbours' values. It is non-parametric and captures non-linear relationships, but scales poorly with dimensionality. MICE (Multiple Imputation by Chained Equations, implemented in scikit-learn as IterativeImputer) cycles through each column with missing values, fitting a regression model using all other columns as predictors and imputing from that model. It repeats this cycle until convergence. MICE is the gold standard for MAR data when you have the compute budget for it.
```python
import pandas as pd
import numpy as np
from scipy.stats import ks_2samp
from sklearn.impute import KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge, LinearRegression

np.random.seed(0)
n = 500

age = np.random.randint(20, 65, n).astype(float)
edu_yrs = np.random.randint(8, 22, n).astype(float)

income = (
    18000
    + age * 900
    + edu_yrs * 1200
    + np.random.normal(0, 6000, n)
)

score = (
    300
    + income * 0.003
    + np.random.normal(0, 40, n)
)

df = pd.DataFrame({
    'age': age,
    'edu_yrs': edu_yrs,
    'income': income,
    'score': score
})

# MAR: younger and less educated respondents are more likely to
# skip income (intercept chosen so a sizable share goes missing)
miss_logit = 3 + (-0.06 * age) + (-0.08 * edu_yrs)
miss_prob = 1 / (1 + np.exp(-miss_logit))
df.loc[np.random.rand(n) < miss_prob, 'income'] = np.nan

missing_count = df['income'].isnull().sum()
missing_pct = df['income'].isnull().mean()

print(f"Missing income: {missing_count} / {n} ({missing_pct:.1%})")

# Detect the MAR pattern
df['income_missing'] = df['income'].isnull().astype(int)

corr = (
    df[['age', 'edu_yrs', 'score', 'income_missing']]
    .corr()['income_missing']
    .drop('income_missing')
)

print("\nCorrelation with income_missing indicator:")
print(corr.round(3))
print(
    "Negative correlation with age and edu_yrs "
    "confirms the MAR pattern"
)

df = df.drop(columns='income_missing')

# KNN imputation
knn = KNNImputer(n_neighbors=7, weights='distance')

df_knn = df.copy()
df_knn[df_knn.columns] = knn.fit_transform(df_knn)

print(f"\nKNN imputed income mean:  {df_knn['income'].mean():,.0f}")
print(f"True income mean:         {income.mean():,.0f}")

# MICE / IterativeImputer
mice = IterativeImputer(
    estimator=BayesianRidge(),
    max_iter=15,
    random_state=42,
    imputation_order='ascending'
)

df_mice = df.copy()
df_mice[df_mice.columns] = mice.fit_transform(df_mice)

print(f"MICE imputed income mean: {df_mice['income'].mean():,.0f}")

# Regression imputation
train_mask = df['income'].notna()

reg = LinearRegression()
reg.fit(
    df.loc[train_mask, ['age', 'edu_yrs', 'score']],
    df.loc[train_mask, 'income']
)

df_reg = df.copy()
missing_mask = df['income'].isna()
df_reg.loc[missing_mask, 'income'] = reg.predict(
    df.loc[missing_mask, ['age', 'edu_yrs', 'score']]
)

print(f"Regression imputed mean:  {df_reg['income'].mean():,.0f}")

r2 = reg.score(
    df.loc[train_mask, ['age', 'edu_yrs', 'score']],
    df.loc[train_mask, 'income']
)
print(f"Regression R² on observed: {r2:.3f}")

# Post-imputation distribution check
observed_income = df['income'].dropna()

for label, series in [
    ("KNN", df_knn['income']),
    ("MICE", df_mice['income']),
    ("Regression", df_reg['income'])
]:
    stat, p = ks_2samp(observed_income, series)
    print(f"KS test vs observed, {label:>10s}: stat={stat:.3f}, p={p:.3f}")
```
The KS test compares the imputed column's distribution to the observed one. A p-value above 0.05 is a rough sanity check that the imputer has not dramatically shifted the shape of the column. It is not definitive; under MAR the missing values may legitimately come from a different part of the distribution. But a very low p-value is a prompt to inspect whether the imputer is fabricating implausible values.
MNAR: the hole that knows its own shape
"High earners decline to report income. Patients who stop improving drop out of a clinical trial. People with the most debt skip the debt question. The value is absent because of what it would have been. The bias is structural."
MNAR is the case where no amount of clever imputation fully rescues you. The missingness is a function of the unobserved value itself, which means you cannot model it away using only the observed data; you would need to observe the very thing that is missing. This is a fundamentally different epistemological situation from MCAR or MAR, and it demands a different response: not confident imputation, but honest uncertainty quantification.
The practical tools are sensitivity analysis (what does my estimate look like under different assumptions about the missing values?), flagging missingness as an explicit binary feature in your model (which lets the model learn from the absence itself), and two-stage selection models like Heckman correction, which use the pattern of who is observed to partially correct for the selection bias. None of these eliminates the problem. All of them make the problem legible.
The telltale sign of MNAR is a gap between the observed mean and what you would expect the true mean to be from domain knowledge. If the average reported income in a survey is £42,000 but administrative records suggest the true average is £58,000, the people who chose not to report are disproportionately the high earners. That gap is the MNAR fingerprint.
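That gap can be pushed one step further with simple algebra: since true = r·observed + (1−r)·missing, where r is the response rate, the implied mean among nonrespondents follows directly. A sketch using the £42,000 / £58,000 figures above, with a 70% response rate as an assumed input:

```python
def implied_missing_mean(observed_mean: float,
                         true_mean: float,
                         response_rate: float) -> float:
    """Solve true = r*observed + (1-r)*missing for the missing-group mean."""
    return (true_mean - response_rate * observed_mean) / (1 - response_rate)

# Assumed 70% response rate, for illustration only
m = implied_missing_mean(observed_mean=42_000,
                         true_mean=58_000,
                         response_rate=0.70)
print(f"Implied mean income of nonrespondents: £{m:,.0f}")  # £95,333
```

If the implied nonrespondent mean is wildly implausible for your domain, either the benchmark is wrong or the missingness is even more selective than assumed; both are worth knowing before you impute anything.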
MNAR: Use cases
MNAR appears wherever the act of reporting a value is itself informative: income and wealth surveys, mental health assessments, substance use questionnaires, attrition in longitudinal studies, online reviews (people who had neutral experiences rarely bother), and loan default data (applicants who would default are often the ones who do not complete applications).
MNAR: Techniques
Sensitivity analysis constructs a range of plausible estimates by filling missing values with optimistic, neutral, and pessimistic assumptions, then checking how much the downstream statistic (mean, model coefficient, risk estimate) varies across that range. If it barely moves, the missingness does not matter much. If it swings by 30%, you have a problem you need to communicate. The indicator method adds a variable_missing column to your feature matrix and fills the original with any fixed value (often the median). This lets tree-based models and neural networks learn that "this person did not answer" is itself a predictive signal. The Heckman selection model uses a two-stage approach: first model who gets observed, then use the inverse Mills ratio from that model as a bias-correction term in the outcome regression.
```python
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

np.random.seed(7)
n = 600

# True income: high earners are less likely to report it
income_true = np.abs(np.random.normal(55000, 22000, n))
income_true = np.clip(income_true, 15000, 200000)

age = np.random.randint(22, 68, n)
edu_yrs = np.random.randint(8, 22, n)

# MNAR: missingness probability rises with the income value itself
logit_miss = -2.5 + (income_true - 55000) / 18000
miss_prob = 1 / (1 + np.exp(-logit_miss))
mnar_mask = np.random.rand(n) < miss_prob

df = pd.DataFrame({
    'income': income_true,
    'age': age,
    'edu_yrs': edu_yrs
})
df.loc[mnar_mask, 'income'] = np.nan

observed_income = df['income'].dropna()

print(f"Missing income : {mnar_mask.sum()} / {n} ({mnar_mask.mean():.1%})")
print(f"Observed mean  : £{observed_income.mean():,.0f}")
print(f"True mean      : £{income_true.mean():,.0f}")

bias = observed_income.mean() - income_true.mean()
print(f"Bias           : £{bias:+,.0f} (the MNAR fingerprint)")

# Strategy 1: missingness indicator (feature engineering)
df_flagged = df.copy()
df_flagged['income_observed'] = df_flagged['income'].notna().astype(int)
df_flagged['income'] = df_flagged['income'].fillna(
    df_flagged['income'].median()
)

print("\nFlagged dataset, sample:")
print(df_flagged.head(4))

# Strategy 2: sensitivity analysis
n_miss = df['income'].isnull().sum()
obs_sum = observed_income.sum()

scenarios = {
    'Optimistic (fill = p25)': observed_income.quantile(0.25),
    'Neutral (fill = median)': observed_income.median(),
    'Pessimistic (fill = p90)': observed_income.quantile(0.90),
}

print("\nSensitivity analysis: estimated population mean income")
for label, fill_val in scenarios.items():
    est_mean = (obs_sum + n_miss * fill_val) / n
    print(f"  {label}: £{est_mean:,.0f}")

print(f"  True mean              : £{income_true.mean():,.0f}")

# Strategy 3: Heckman-style two-stage selection correction

# Stage 1: model P(observed) from age and edu_yrs.
# (In this simulation the selection predictors are weak by construction,
# since missingness is driven by income itself, so the correction can
# only be partial.)
df_work = df.copy()
df_work['observed'] = df_work['income'].notna().astype(int)

sel_model = LogisticRegression(max_iter=300)
sel_model.fit(df_work[['age', 'edu_yrs']], df_work['observed'])

prob_observed = sel_model.predict_proba(df_work[['age', 'edu_yrs']])[:, 1]

# Correction term: a textbook Heckman model uses the inverse Mills
# ratio phi(z)/Phi(z) from a probit selection model; an
# inverse-probability weight stands in for it here
imr = np.where(prob_observed > 0.01, 1.0 / prob_observed, 0.0)

# Stage 2: outcome model on the observed rows, correction term included
obs_mask = df_work['observed'] == 1
X_obs = np.column_stack([
    df_work.loc[obs_mask, 'age'],
    df_work.loc[obs_mask, 'edu_yrs'],
    imr[obs_mask]
])
y_obs = df_work.loc[obs_mask, 'income']

outcome_model = LinearRegression()
outcome_model.fit(X_obs, y_obs)

# Predict income for the missing rows
miss_mask = df_work['observed'] == 0
X_miss = np.column_stack([
    df_work.loc[miss_mask, 'age'],
    df_work.loc[miss_mask, 'edu_yrs'],
    imr[miss_mask]
])
corrected_preds = outcome_model.predict(X_miss)

df_heckman = df.copy()
df_heckman.loc[miss_mask, 'income'] = corrected_preds

print(f"\nHeckman-corrected mean    : £{df_heckman['income'].mean():,.0f}")
print(f"Naive (observed only) mean: £{observed_income.mean():,.0f}")
print(f"True mean                 : £{income_true.mean():,.0f}")

residual_bias = df_heckman['income'].mean() - income_true.mean()
print(f"Residual bias after correction: £{residual_bias:+,.0f}")
```