The Undertow: What Bias Hides

Liz T.
February 1, 2026 · 10 min read
In brief

The most dangerous biases in artificial intelligence are not the ones that surface as errors – they are the ones that feel like common sense. They do not announce themselves in a wrong answer. They accumulate quietly, the way a river changes course not through force, but through years of invisible pressure. No benchmark catches them. No audit names them. And the five techniques that actually reveal them are not what most practitioners expect.

Latent Bias in AI: What No One Sees

We have grown comfortable with the idea that bias in artificial intelligence is something we can point to. A hiring algorithm that penalises women. A facial recognition system that fails darker skin tones. A language model that associates certain names with criminality. These are the visible injuries – documented, debated, and in some cases, corrected.

But there is another kind of bias. Quieter. More patient. It does not announce itself in a single wrong answer. It accumulates, the way a river slowly changes course – not through any single moment of force, but through years of invisible pressure. This is the undertow. And most practitioners, if they are being honest, will tell you it frightens them far more than anything they can measure.

What We Mean When We Say "Latent"

The word latent comes from the Latin latere – to lie hidden. In the context of machine learning, latent bias refers to systematic distortions embedded so deeply in a model's representations that they do not surface as discrete errors. They manifest, instead, as tendencies. Subtle gravitational pulls on outputs that, taken individually, seem entirely reasonable – and taken collectively, constitute a worldview.

A model does not need to say something false to be biased. It only needs to consistently frame things a certain way. To weight certain associations slightly higher. To reach for particular conclusions a fraction of a second sooner. The output looks clean. The reasoning looks sound. The undertow does its work beneath the surface.

This is the first misconception practitioners must dismantle: that bias is primarily a matter of incorrect outputs. It is not. The most consequential biases produce outputs that are technically accurate, contextually plausible, and culturally invisible – precisely because they mirror the assumptions of the dominant culture that generated the training data in the first place.

The Misconception of the Neutral Corpus

There is a belief – still surprisingly common, even among technically sophisticated teams – that if you gather enough data, diversity emerges naturally. That scale is a corrective force. That a model trained on the breadth of human expression will, by sheer volume, average out to something fair.

This is a seductive idea. It is also wrong.

Data is never neutral. It is a fossil record of human decisions – who wrote, who was published, who was quoted, who was digitised, who was deemed worth archiving. The internet, which forms the backbone of most large-scale training corpora, is not a mirror of humanity. It is a mirror of a particular subset of humanity, at a particular moment in history, with particular access to technology and particular incentives to produce content.

When you train on this data at scale, you do not average out bias. You crystallise it. You give it mathematical permanence. The undertow becomes structural.

How Practitioners Actually Encounter It

Ask a machine learning engineer when they first truly understood latent bias, and most will not point you to a paper. They will point you to a moment.

A sentiment model that rated identical customer complaints differently based on the name signed at the bottom. A summarisation system that, when given biographies of equally accomplished men and women, consistently foregrounded professional achievements for one and personal relationships for the other. A recommendation engine that, over months of deployment, quietly narrowed what users were shown – not through any single bad decision, but through ten thousand incrementally reinforcing ones.

What unites these experiences is the word eventually. The bias was not visible at launch. It revealed itself through time, through use, through the slow accumulation of asymmetric pressure. By the time it was noticed, it had already shaped behaviour – of the system, and of the people who had learned to trust it.

This is the clinical reality that benchmark evaluations rarely capture. You cannot find the undertow by standing on the shore. You have to swim in it.

The Measurement Problem

This brings us to one of the most persistent misconceptions in the field: that we can evaluate our way out of the problem.

Bias audits, fairness metrics, red-teaming exercises – these are valuable. They are also, by their nature, retrospective and bounded. They test for the biases we have already thought to look for, measured against the categories we have already decided are relevant. They are extraordinarily ill-equipped to detect the biases that exist below our current vocabulary for describing them.

Consider: twenty years ago, few evaluation frameworks would have thought to test for the way a model implicitly genders competence, or associates geographic origin with intellectual credibility, or subtly adjusts its register of formality depending on inferred socioeconomic markers in a user's writing. These were not on anyone's checklist. They were the undertow of their era.

The biases we cannot measure today are not the ones that do not exist. They are the ones we have not yet developed the language to name.

The Feedback Loop That Nobody Talks About

There is a dynamic at work in deployed AI systems that receives far less attention than it deserves. When a model influences behaviour – and at scale, models always influence behaviour – the resulting actions enter the world and eventually return to training pipelines. Outputs become inputs. The model's prior assumptions are reflected back to it, slightly amplified, and it learns from them again.

This is not a flaw in implementation. It is a structural property of systems that learn from the world they are simultaneously shaping. The undertow does not merely carry outputs downstream. It bends the river itself.

Practitioners who work in recommender systems and conversational AI are acutely familiar with this phenomenon. A model that slightly over-represents a particular perspective will surface that perspective more often. Users who encounter it more often will engage with it more often. Engagement signals feed back into the model. The slight over-representation becomes a moderate over-representation. Then a dominant one. The process is self-concealing – because at every step, the model is simply learning from what users responded to.
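
To make the loop concrete, here is a minimal, purely illustrative simulation – the starting share, the learning rate, and the linear update rule are assumptions chosen for clarity, not a model of any real recommender:

```python
# Illustrative only: a slight initial over-representation of perspective A,
# engagement proportional to exposure, and the engagement signal fed back
# into the next round's exposure weights.

def simulate_feedback(initial_share=0.52, rounds=20, learning_rate=0.1):
    """Track how a small initial skew compounds when outputs become inputs."""
    share = initial_share          # fraction of impressions given to perspective A
    history = [share]
    for _ in range(rounds):
        engagement_a = share       # users engage with what they are shown
        engagement_b = 1.0 - share
        # The system "learns" from engagement, nudging exposure toward whatever
        # received more interaction in the previous round.
        share += learning_rate * (engagement_a - engagement_b)
        share = min(max(share, 0.0), 1.0)
        history.append(share)
    return history

if __name__ == "__main__":
    for step, share in enumerate(simulate_feedback()):
        print(f"round {step:2d}: perspective A shown {share:.1%} of the time")
```

Even in this toy version, a 52/48 split does not stay a 52/48 split. It drifts steadily toward dominance, with no single step looking unreasonable.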

The result is not a broken system. It is a system working precisely as designed, pulling in a direction that no single human being chose.

Diving Below the Surface – Five Techniques to Find What Hides

This is where theory must give way to practice. Knowing that the undertow exists is not enough. The discipline lies in developing the instruments to feel it โ€” systematically, rigorously, and with enough intellectual honesty to act on what you find. The following five techniques represent the most substantive approaches available to practitioners today.

I. Counterfactual Probing – Changing One Variable at a Time

The most elegant entry point into latent bias is also the most methodologically clean: hold everything constant and change a single attribute. Swap a name. Replace a gender pronoun. Substitute one geography for another. Feed the resulting pairs through the model and measure the delta in outputs – not just in classification scores, but in language, tone, framing, and confidence.

What makes this technique powerful is its surgical precision. You are not asking the model to explain itself. You are watching what it does when the human signal changes but everything else remains identical. A model that produces substantively different outputs for "John has a strong opinion on this" versus "Fatima has a strong opinion on this" – when neither name should carry meaning in context – is revealing something about the geometry of its latent space that no aggregate metric would surface.
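
A minimal sketch of the workflow, assuming the system under test can be wrapped in a single scoring call – the templates, the name list, and the score_output placeholder are illustrative, not a real API:

```python
# Hold the sentence fixed, vary only the name, and measure how far the score moves.
TEMPLATES = [
    "{name} has a strong opinion on this and pushed back in the meeting.",
    "{name} asked for a raise after leading the project.",
    "{name} filed the same complaint about the late delivery.",
]

NAMES = ["John", "Fatima", "Priya", "DeShawn", "Emily", "Santiago"]

def score_output(text: str) -> float:
    """Placeholder: replace with a call to the model under test
    (a sentiment score, a hiring recommendation, a risk estimate)."""
    raise NotImplementedError("plug in your model here")

def counterfactual_deltas(score_fn=score_output):
    """For each template, collect per-name scores and the max-min spread.
    A large spread on a name-irrelevant task is the signal to investigate."""
    results = {}
    for template in TEMPLATES:
        scores = {name: score_fn(template.format(name=name)) for name in NAMES}
        results[template] = {
            "scores": scores,
            "spread": max(scores.values()) - min(scores.values()),
        }
    return results
```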

Practitioners who employ this rigorously do not limit themselves to the obvious demographic variables. They probe for class markers, regional dialects, writing style as a socioeconomic signal, and the subtle ways professional titles shift a model's deference. The goal is to map the invisible topology – to find every slope along which the output quietly slides.

II. Activation Patching and Representation Analysis – Reading the Interior

If counterfactual probing watches what the model does, activation patching attempts to understand why – by reaching into the model's interior and examining the representations it builds along the way.

In activation patching, researchers substitute the activations of specific layers or attention heads from one run into another – say, between a counterfactual pair that differs only in a name – to localise which components drive a difference in behaviour. Representation analysis complements this by examining what concepts are encoded in proximity to one another. Latent bias often lives not in any single weight but in the neighbourhood structure of the embedding space – in the fact that certain concepts are geometrically close in ways that were never explicitly programmed, but emerged from the statistical texture of the training data.

Word embedding studies have long demonstrated this: models trained on historical text encode associations between gender and profession, between ethnicity and criminality, between geography and intelligence, as measurable distances in vector space. The same principle applies to far more sophisticated architectures. The associations are subtler, more distributed, and harder to extract – but they are there.
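
The underlying measurement can be strikingly simple. A sketch, assuming word vectors are already loaded into a plain dictionary (from word2vec, GloVe, or an internal layer of a larger model); the anchor words and profession list are illustrative:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two word vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def gender_lean(embeddings: dict, professions, anchor_f="she", anchor_m="he"):
    """Positive: the profession sits closer to the feminine anchor;
    negative: closer to the masculine anchor. Zero would be neutral."""
    return {
        p: cosine(embeddings[p], embeddings[anchor_f])
           - cosine(embeddings[p], embeddings[anchor_m])
        for p in professions
        if p in embeddings and anchor_f in embeddings and anchor_m in embeddings
    }

# Hypothetical usage: gender_lean(vectors, ["nurse", "engineer", "teacher", "ceo"])
```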

Tools such as probing classifiers – lightweight models trained to predict demographic attributes from internal representations – can reveal whether a model has encoded information it was never supposed to use. If a probing classifier can accurately predict the inferred race of a subject from the activations of a summarisation model, the model is carrying that information, whether or not it ever surfaces explicitly in outputs. That is the undertow made visible.
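
A probing classifier can be as modest as a logistic regression fitted on pooled hidden states. A sketch, assuming the activations have already been extracted into an array and labels exist for the attribute being probed – the synthetic data in the comments only shows the shape of the workflow:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_for_attribute(activations: np.ndarray, labels: np.ndarray) -> float:
    """Mean cross-validated accuracy of a linear probe on the hidden states.
    Accuracy well above chance means the representation carries the attribute,
    whether or not it ever surfaces explicitly in outputs."""
    probe = LogisticRegression(max_iter=1000)
    return float(cross_val_score(probe, activations, labels, cv=5).mean())

# Shape of the workflow, with synthetic stand-ins for real activations:
# rng = np.random.default_rng(0)
# acts = rng.normal(size=(500, 768))       # pooled hidden states, one row per example
# labels = rng.integers(0, 2, size=500)    # the attribute the model was never meant to use
# print(probe_for_attribute(acts, labels)) # ~0.5 for random data; compare against chance
```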

III. Longitudinal Output Tracking – Watching the Drift

Many of the most consequential latent biases are not static properties of a model at launch. They are emergent properties of deployment – patterns that develop over time as the system interacts with users, collects feedback signals, and in many cases continues to learn.

Longitudinal output tracking involves building an observatory around a deployed model – systematically logging outputs across a representative sample of use cases, over months and years, and measuring statistical drift in language, framing, topic distribution, and sentiment across different demographic or contextual cohorts.
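
The observatory can start as nothing more than an append-only log and a monthly aggregation. A minimal sketch, under assumed field names (cohort, sentiment) – in practice these would be whatever output properties and cohort definitions the team decides to track:

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from statistics import mean

@dataclass
class LoggedOutput:
    timestamp: datetime
    cohort: str        # e.g. inferred dialect region, age band, customer segment
    sentiment: float   # or any scalar property of the output worth tracking

def monthly_cohort_means(logs):
    """Average the tracked property per (month, cohort) bucket."""
    buckets = defaultdict(list)
    for entry in logs:
        buckets[(entry.timestamp.strftime("%Y-%m"), entry.cohort)].append(entry.sentiment)
    return {key: mean(values) for key, values in buckets.items()}

def cohort_gap_over_time(monthly_means, cohort_a, cohort_b):
    """The month-by-month gap between two cohorts: a flat series is reassuring,
    a widening trend is the drift this technique exists to catch."""
    months = sorted({month for month, _ in monthly_means})
    return {
        m: monthly_means.get((m, cohort_a), float("nan"))
           - monthly_means.get((m, cohort_b), float("nan"))
        for m in months
    }
```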

This technique requires infrastructure and patience that most organisations resist investing in. It also requires a willingness to surface findings that may be uncomfortable – because drift is often directional, and it rarely drifts toward greater equity.

What practitioners consistently find when they do this work is that models do not stay where you put them. The feedback loops described earlier are real and measurable. A model that was reasonably balanced at deployment may show statistically significant asymmetries eighteen months later, having been shaped by millions of interactions that nobody examined individually but that collectively pushed the system in a particular direction. Longitudinal tracking is the instrument that makes this visible before it becomes undeniable.

IV. Adversarial Demographic Stress Testing – Pushing at the Edges

Standard evaluation benchmarks test a model under normal conditions. Adversarial demographic stress testing deliberately constructs edge cases, ambiguous scenarios, and high-stakes decision points – and then systematically varies the demographic signals embedded in those cases to see where the model's behaviour diverges.

The technique draws on the logic of stress testing in engineering: you do not learn where a structure will fail by loading it normally. You learn by pushing it toward its limits. In the context of latent bias, the "limits" are the scenarios where the model's prior assumptions exert the most pressure – moments of ambiguity, where the training data provides insufficient signal and the model falls back on statistical defaults that carry its embedded worldview most nakedly.

A medical diagnosis assistant, for instance, may perform comparably across demographic groups on clear-cut cases. It is in the ambiguous, symptom-complex presentations that latent associations between patient demographics and disease likelihood – encoded from decades of historically biased clinical literature – begin to influence outputs. Stress testing specifically targets these moments. It asks not "does the model work?" but "what does the model reach for when it isn't sure?" The answer is almost always revealing.
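
A stress-testing harness can reuse the counterfactual machinery, pointed at deliberately ambiguous cases. The vignettes, descriptors, and the get_recommendation placeholder below are illustrative assumptions, not a clinical or hiring benchmark:

```python
# Hold an ambiguous scenario fixed, vary only the demographic descriptor,
# and flag any scenario where the recommendation changes.
AMBIGUOUS_VIGNETTES = [
    "A {descriptor} patient reports intermittent chest tightness and fatigue, "
    "with normal vitals and an inconclusive ECG.",
    "A {descriptor} candidate has a five-year employment gap and strong but "
    "unverifiable references.",
]

DESCRIPTORS = [
    "55-year-old man", "55-year-old woman",
    "28-year-old man", "28-year-old woman",
]

def get_recommendation(prompt: str) -> str:
    """Placeholder: replace with a call to the system under test."""
    raise NotImplementedError("plug in your model here")

def divergence_report(recommend_fn=get_recommendation):
    """Collect recommendations per descriptor and flag vignettes where
    identical facts produce different answers."""
    report = {}
    for vignette in AMBIGUOUS_VIGNETTES:
        outputs = {d: recommend_fn(vignette.format(descriptor=d)) for d in DESCRIPTORS}
        report[vignette] = {
            "outputs": outputs,
            "diverges": len(set(outputs.values())) > 1,
        }
    return report
```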

V. Contrastive Discourse Analysis – Reading the Language, Not Just the Score

The fifth technique is perhaps the most underused, partly because it requires a form of close reading that feels more humanistic than technical. Contrastive discourse analysis involves systematically comparing the language a model uses – not just its classifications or scores – across different demographic and contextual conditions.

The insight behind this technique is that bias frequently lives in register, not in error. A model may correctly identify two equally qualified candidates. But if it describes one with active, agentic language – "she drove the initiative," "he executed with precision" – and the other with passive, relational language – "she was well-liked," "he collaborated effectively with his team" – the bias is present even though no factual error has occurred.

Practitioners conducting this analysis build large contrastive corpora of outputs – same semantic task, varied demographic inputs – and apply both automated linguistic analysis and human review to surface systematic differences in word choice, sentence structure, attribution of agency, use of hedging language, and the presence or absence of qualifications. The patterns that emerge are often deeply uncomfortable, because they reflect not machine errors but human patterns so naturalised that they were encoded into training data without anyone noticing they were there.
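
An automated first pass over such a corpus can be lightweight. A sketch, assuming two sets of generated texts (one per demographic condition); the word lists are tiny illustrative stand-ins for a proper agency/communion lexicon, and the output is a prompt for human review, not a verdict:

```python
import re
from collections import Counter

# Illustrative stand-ins for a real lexicon of agentic vs. communal language.
AGENTIC = {"drove", "led", "executed", "decided", "built", "launched", "negotiated"}
COMMUNAL = {"supported", "helped", "collaborated", "assisted", "liked", "cared"}

def register_profile(texts):
    """Rate of agentic vs. communal word use per 1,000 tokens across a corpus."""
    counts, total = Counter(), 0
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        total += len(tokens)
        counts["agentic"] += sum(t in AGENTIC for t in tokens)
        counts["communal"] += sum(t in COMMUNAL for t in tokens)
    per_thousand = 1000.0 / max(total, 1)
    return {k: v * per_thousand for k, v in counts.items()}

def contrast(texts_a, texts_b):
    """Side-by-side register profiles for two demographic conditions;
    a systematic gap in agency attribution is the pattern to look for."""
    return {"condition_a": register_profile(texts_a),
            "condition_b": register_profile(texts_b)}
```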

This is the undertow in its most precise form: not a mistake the model makes, but a worldview it has absorbed so thoroughly that it expresses it as fluently as any human would.

What Responsible Practice Actually Looks Like

The practitioners who navigate this most thoughtfully tend to share a particular disposition. They are less interested in the question "is our model biased?" – to which the honest answer is always yes, in ways we don't fully understand – and more interested in the question "what are we doing to surface what we cannot yet see?"

In practice, this means building for longitudinal observation – not just evaluating models at deployment, but tracking the drift of outputs and user behaviour over months and years. It means investing in interpretability not merely as a compliance exercise, but as a genuine epistemic practice. It means bringing in perspectives that are structurally absent from the teams building these systems – not as a gesture toward representation, but as a methodological necessity. And it means cultivating institutional humility – the organisational capacity to say, credibly and without defensiveness, that we will find things we did not anticipate, and we will take them seriously when we do.

The five techniques above are not a checklist to be completed and filed. They are instruments of ongoing attention – practices to be embedded in the culture of a team, not delegated to a quarterly audit cycle. Latent bias is not a problem you solve. It is a condition you remain vigilant against.

Thoughts

There is a temptation, when confronted with the scale and subtlety of latent bias, to reach for despair – or worse, for the false comfort of declaring the problem unsolvable and therefore not worth pursuing. Both responses are forms of avoidance.

The undertow is not a reason to stop building. It is a reason to build differently. To hold our systems with a lighter grip and a longer gaze. To resist the narrative that deployment is the end of responsibility rather than the beginning of a new kind of attention.

The most dangerous currents are the ones we stop believing exist. The work of a responsible practitioner – of a responsible field – is to keep looking for them, even when, especially when, the water appears perfectly still.

The question is never whether the undertow is there. The question is whether we are paying enough attention to feel it.
