KNN and Random Forest: A Practitioner's Technical Reference

Stanis B.
January 2, 2025 · 7 min read
In brief

Two algorithms that sit at opposite ends of the ML complexity spectrum. KNN is a lazy learner that does zero work at training time and everything at prediction. Random Forest is a battle-tested ensemble that handles messy, real-world data with minimal preprocessing. This guide breaks down how each one actually works, when practitioners reach for them, when they don't — and includes live interactive demos where you can place points and watch decision boundaries form in real time.

Most resources teach you what these algorithms do. This covers what you actually need to know to use them without making costly mistakes in production. Instead of focusing only on theory, we shift toward practical understanding: why models fail, how data quality shapes outcomes, and what trade-offs you make when choosing one approach over another. The goal is to build the intuition to debug, adapt, and make informed decisions, so you think like a practitioner working with real-world systems rather than someone who just runs models on clean datasets.

K-Nearest Neighbors (KNN)

The Mental Model: KNN

KNN is a memory-based, non-parametric, instance-based learner. There is no training phase in any meaningful sense — the algorithm is the dataset. Every prediction is a full scan (or approximate scan) of your training corpus at inference time. This distinction matters enormously for system design.

The inductive bias of KNN is purely geometric: similar inputs produce similar outputs, where similarity is defined entirely by your choice of distance metric. No assumptions about linearity, feature independence, or distribution shape. This makes it unusually flexible — and unusually brittle in high dimensions.

How KNN Actually Works

Step 1 — Index the training data. For naive KNN there is essentially no work here: just store the matrix. In practice, for anything beyond ~10k points you should be building a spatial index: a k-d tree (efficient up to ~20 features), a ball tree (better in higher dimensions), or an approximate nearest neighbor (ANN) structure like HNSW or FAISS for large-scale retrieval. The naive implementation is only acceptable as a baseline.
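
As a rough sketch of what the exact-index case looks like with scikit-learn (the data here is synthetic and the parameters are illustrative, not recommendations); HNSW and FAISS are separate libraries and not shown:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Stand-in training matrix: (n_samples, n_features).
X_train = np.random.rand(20_000, 10)

# Build the index once. 'kd_tree' is the usual choice below ~20 features;
# 'ball_tree' degrades more gracefully as dimensionality grows.
index = NearestNeighbors(n_neighbors=5, algorithm="ball_tree").fit(X_train)

# At inference, query the index instead of scanning the full matrix.
X_query = np.random.rand(3, 10)
distances, neighbor_ids = index.kneighbors(X_query)
```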

Step 2 — Compute distances at inference. For each query point, compute its distance to every training point. Default is Euclidean (L2). But the right metric depends on your data:

  • Manhattan (L1) — more robust to outliers, better for high-dimensional spaces
  • Cosine similarity — for text embeddings or any vector where magnitude doesn't matter, only direction
  • Minkowski — generalization of L1/L2, tune with parameter p
  • Mahalanobis — accounts for feature correlations; computationally expensive but eliminates the need to normalize separately
  • Hamming — for categorical or binary features

The metric you choose is your model. A bad metric choice cannot be rescued by tuning K.
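
A minimal illustration of that point with made-up vectors: the same query gets a different nearest neighbor depending on the metric, because cosine ignores magnitude while Euclidean does not.

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances, cosine_distances

# One candidate points in the same direction as the query but is much longer;
# the other is close in absolute position but points elsewhere.
query = np.array([[1.0, 1.0]])
candidates = np.array([[10.0, 10.0],   # same direction, large magnitude
                       [1.5, 0.2]])    # nearby in space, different direction

print(euclidean_distances(query, candidates).argmin())  # 1 -> the nearby point wins
print(cosine_distances(query, candidates).argmin())     # 0 -> the same-direction point wins
```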

Step 3 — Select K neighbors and aggregate. For classification: majority vote. For regression: mean (or median for robustness). Two refinements practitioners actually use (both appear in the sketch after this list):

  • Distance-weighted voting — neighbors vote with weight 1/d², so closer neighbors dominate. This consistently outperforms uniform voting on real data.
  • Soft predictions — output vote fractions as class probabilities rather than hard labels. Critical for calibration and downstream decision thresholds.
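
A minimal sketch of both refinements on synthetic data. Note one detail: scikit-learn's weights="distance" weights by 1/d rather than the 1/d² mentioned above; a custom callable could implement 1/d² if you want it.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; replace with your own features and labels.
X, y = make_classification(n_samples=2_000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Distance-weighted voting: closer neighbors get more say.
knn = KNeighborsClassifier(n_neighbors=15, weights="distance", metric="manhattan")
knn.fit(X_train, y_train)

# Soft predictions: per-class scores you can threshold downstream,
# rather than committing to a hard label.
probabilities = knn.predict_proba(X_test)
```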

The Curse of Dimensionality — What It Actually Means

This is not a theoretical concern. Beyond roughly 15–20 features, Euclidean distance stops discriminating well between points. In high-dimensional spaces, the ratio of the distance to the farthest neighbor versus the nearest neighbor converges to 1 — all points become nearly equidistant. Your "nearest" neighbors are no longer meaningfully similar to your query point.

The practical implication: KNN requires aggressive dimensionality reduction before use in high-dimensional problems. PCA, UMAP, or learned embeddings (via an autoencoder or a supervised model) are prerequisites, not optional steps.
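
One way to wire that up, as a sketch on stand-in data (the component count is illustrative, not a recommendation):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for a wide feature matrix and binary labels.
X_train = np.random.rand(1_000, 60)
y_train = np.random.randint(0, 2, size=1_000)

knn_pipeline = make_pipeline(
    StandardScaler(),      # scale first so PCA isn't dominated by large-scale features
    PCA(n_components=15),  # reduce dimensionality before the distance computation
    KNeighborsClassifier(n_neighbors=10, weights="distance"),
)
knn_pipeline.fit(X_train, y_train)
```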

What Practitioners Get Wrong with KNN

Not normalizing. A feature measured in thousands dominates all others in Euclidean space. StandardScaler or MinMaxScaler is not optional — it's load-bearing. This is the single most common reason KNN underperforms in practice.
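
A two-line illustration of why, with made-up numbers: an income-like feature in the tens of thousands contributes millions of times more to the squared distance than a small ratio feature, so the ratio is effectively invisible until you scale.

```python
import numpy as np

a = np.array([50_000.0, 0.10])   # feature 0: income in dollars, feature 1: a small ratio
b = np.array([52_000.0, 0.90])

diff = a - b
print(diff ** 2)              # [4.0e6, 0.64] -> the income term dominates by ~6 orders of magnitude
print(np.linalg.norm(diff))   # ≈ 2000.0: the ratio feature barely moves the distance
```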

Treating K as the only hyperparameter. The metric, the weighting scheme, and whether you're using approximate or exact search all matter as much as K, often more.

Using KNN on imbalanced data without adjustment. If 90% of your training data is Class A, KNN will almost always return Class A regardless of K. Use stratified sampling, weighted voting, or SMOTE before applying KNN to imbalanced datasets.

Ignoring inference cost. O(n·d) per query means that doubling your dataset doubles your prediction latency. This is not acceptable in production systems with strict SLA requirements. If you need KNN semantics at scale, you need ANN — not exact KNN.

When KNN Is Actually the Right Choice

  • Similarity and retrieval systems where distance is the product (recommendation engines, semantic search, anomaly detection via distance to k-th neighbor)
  • Small datasets (<50k rows) where inference cost is not a constraint
  • Establishing a non-parametric baseline before committing to a more complex model
  • Problems where the decision boundary is genuinely non-linear and irregular, and you lack enough data to train a neural network

When to Walk Away from KNN

Large datasets at inference time, high-dimensional raw features without reduction, real-time low-latency requirements, need for feature importance, noisy data with many outliers. In all of these cases, KNN will fail silently — it won't throw an error, it'll just produce quietly mediocre results.

Random Forest

The Mental Model: RF

Random Forest is a bagged ensemble of high-variance decision trees, with an additional randomization layer at the feature level. The core insight is counterintuitive: you can take a collection of individually overfit, high-variance models and produce a low-variance ensemble by averaging their predictions — provided the models' errors are sufficiently uncorrelated.

The two randomization mechanisms are what create that decorrelation:

Bootstrap sampling — each tree trains on a different random draw (with replacement) from the training set, so each tree sees roughly 63% of the unique training samples on average.

Random feature subsets at each split — at every node, only √p features (classification) or p/3 features (regression) are candidates for splitting. This prevents all trees from converging on the same dominant features.

Without the second mechanism, you have bagging. With it, you have Random Forest — and it's the feature randomization that accounts for most of the performance difference.
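
One way to see that difference empirically, as a sketch on synthetic data: in scikit-learn, max_features=None makes every split consider all features, which is effectively plain bagged trees, while max_features="sqrt" adds the feature randomization. Compare their cross-validated accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=3_000, n_features=40, n_informative=8, random_state=0)

# Every split sees all features -> effectively bagging of deep trees.
bagged = RandomForestClassifier(n_estimators=200, max_features=None, random_state=0)

# Random feature subsets at each split -> the actual Random Forest recipe.
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)

print(cross_val_score(bagged, X, y, cv=5).mean())
print(cross_val_score(forest, X, y, cv=5).mean())
```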

How RF Actually Works

Step 1 — Bootstrap sampling. Draw N samples with replacement from your N training points. Each bootstrap sample contains ~63.2% unique samples (1 - 1/e). The remaining ~36.8% — the out-of-bag (OOB) samples — form a natural held-out set for that tree. This is not a minor implementation detail; OOB error is a statistically valid estimate of generalization error that costs you nothing extra.
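
The 63.2% figure is easy to verify with a throwaway simulation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
bootstrap_idx = rng.integers(0, n, size=n)        # draw n indices with replacement
unique_fraction = len(np.unique(bootstrap_idx)) / n
print(unique_fraction, 1 - 1 / np.e)              # both ≈ 0.632
```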

Step 2 — Grow deep trees with random feature subsets. At each node, sample √p features (for classification) and find the best split among only those features. Split criterion is typically Gini impurity or information gain (entropy). Trees are grown to maximum depth — full overfit is intentional. Each individual tree has high variance and low bias. The ensemble averaging is what will fix the variance.

Step 3 — Aggregate. Classification: majority vote across all trees. Regression: mean of all tree outputs. The predicted class probability is the fraction of trees voting for each class — more calibrated than hard voting. Increasing the number of trees improves stability monotonically but with strongly diminishing returns. The variance of the ensemble is approximately ρσ² + (1-ρ)σ²/T, where ρ is the average pairwise tree correlation and T is tree count. After ~200–500 trees, further gains are marginal.
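
A minimal sketch of the end-to-end fit and aggregation on synthetic data. One implementation note: scikit-learn's predict_proba averages each tree's leaf class frequencies, which for fully grown trees reduces to approximately the vote fraction described above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 300 trees: past this point, additional trees buy stability, not accuracy.
rf = RandomForestClassifier(n_estimators=300, random_state=0, n_jobs=-1)
rf.fit(X_train, y_train)

# Averaged per-tree probabilities -- effectively the fraction of trees voting for each class.
proba = rf.predict_proba(X_test)
```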

Two Features Practitioners Underuse

OOB error as a free validation signal. Since each tree saw only ~63% of the data, it can score the remaining ~37% of samples without data leakage. Aggregating these OOB predictions across all trees gives you a nearly unbiased estimate of generalization error with no need for a separate validation split. In small-data regimes, this is genuinely valuable — you're not wasting 20% of your data on a held-out set.
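
In scikit-learn this is a single flag, sketched here on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0, n_jobs=-1)
rf.fit(X, y)

# Each sample is scored only by the trees that never saw it during training.
print(rf.oob_score_)   # accuracy estimate with no separate validation split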

Feature importance for data understanding. Mean decrease in impurity (MDI) aggregates how much each feature reduces Gini impurity across all splits in all trees. More reliable variant: permutation importance — randomly shuffle one feature's values and measure the drop in OOB (or held-out) accuracy. This is model-agnostic and more robust to high-cardinality features, though be aware that when two features are strongly correlated, permuting one can leave the other to carry the signal, so both can look less important than they are. Use it in the exploratory phase to prune noise before training XGBoost or a neural network.

One critical caveat: MDI-based importance is biased toward high-cardinality features (features with many unique values get more split opportunities). If you have a mix of continuous and categorical features, permutation importance is more trustworthy.
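
A sketch of both importance measures side by side. Note that scikit-learn's permutation_importance scores against whatever data you pass it (a held-out split here) rather than the OOB samples of the classic formulation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3_000, n_features=15, n_informative=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0, n_jobs=-1).fit(X_train, y_train)

mdi = rf.feature_importances_                       # impurity-based (MDI) importances
perm = permutation_importance(rf, X_val, y_val,     # shuffle each feature, measure the accuracy drop
                              n_repeats=10, random_state=0)
print(mdi)
print(perm.importances_mean)
```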

What Practitioners Get Wrong with RF

Assuming defaults are good enough. n_estimators=100 and max_features='sqrt' are reasonable starting points, not optimal configurations. The hyperparameters that matter most: max_features (controls tree correlation — lower means more diverse trees), min_samples_leaf (controls overfitting — especially important in small or noisy datasets), max_depth (explicit depth cap can improve generalization on noisy data).
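
A sketch of searching over exactly those parameters; the search space below is illustrative, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=3_000, n_features=25, random_state=0)

param_distributions = {
    "max_features": ["sqrt", "log2", 0.3, 0.5],   # lower -> more diverse (less correlated) trees
    "min_samples_leaf": [1, 2, 5, 10, 20],        # larger -> smoother, less overfit leaves
    "max_depth": [None, 8, 12, 16, 24],           # explicit cap can help on noisy data
}

search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=300, random_state=0, n_jobs=-1),
    param_distributions, n_iter=20, cv=5, random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```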

Ignoring memory at scale. Each tree stores its full node structure. 500 deep trees on a dataset with 100 features can consume several GB of RAM. Profile memory before productionizing.

Using it for extrapolation. Random Forest cannot predict outside the range of its training data. It's a piecewise constant function bounded by the min/max of training targets. For regression problems where test data may fall outside the training distribution, RF will systematically under/over-predict at the boundaries. This is not a tunable problem — it's structural.
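
A toy demonstration of the boundary effect:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Train on a simple linear trend over x in [0, 10], so targets top out around 30.
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = 3.0 * X.ravel()

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# A linear model would return ~60 here; the forest is capped near the largest
# training target (~30), because trees can only average leaf values they have seen.
print(rf.predict([[20.0]]))
```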

Treating it as interpretable. A 300-tree forest is a black box. Feature importance gives you correlational signal, not causal explanation. If you need to explain individual predictions, use SHAP values (TreeSHAP is efficient for tree ensembles). If you need full model interpretability for compliance or stakeholder requirements, use a single shallow decision tree instead.
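
A sketch of per-prediction attributions, assuming the third-party shap package is installed (pip install shap):

```python
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1_000, n_features=10, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(rf)            # TreeSHAP: efficient SHAP for tree ensembles
shap_values = explainer.shap_values(X[:50])   # per-feature contribution to each individual prediction
```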

When Random Forest Is the Right Choice

  • Structured/tabular data with mixed feature types and unknown interactions
  • When you want a strong baseline without extensive preprocessing or hyperparameter tuning
  • When feature importance is part of the deliverable
  • Medium to large datasets (handles millions of rows, though XGBoost/LightGBM will be faster)
  • When you want a usable uncertainty signal from the prediction variance across trees (a rough sketch follows this list)
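
On that last point, one rough heuristic for regression (a measure of tree disagreement, not a calibrated interval), sketched on synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 5))
y = X[:, 0] * 2.0 + rng.normal(scale=0.5, size=2_000)

rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)

X_new = rng.normal(size=(10, 5))
per_tree = np.stack([tree.predict(X_new) for tree in rf.estimators_])  # (n_trees, n_queries)

mean_prediction = per_tree.mean(axis=0)   # the usual forest prediction
spread = per_tree.std(axis=0)             # disagreement across trees: a rough uncertainty signal
```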

When to Walk Away from RF

High-dimensional sparse data (text, one-hot encoded categoricals with many levels — linear models or gradient boosting win here), problems requiring extrapolation beyond training range, ultra-low-latency inference requirements, when full prediction explainability is legally or operationally required.

Choosing Between Them in Practice

The decision is almost never about abstract performance — it's about which constraints dominate your problem.

The standard tabular ML progression in practice is: Logistic Regression → Random Forest → XGBoost/LightGBM. KNN sits outside this progression — it's used for specific retrieval-oriented problems, small-data baselines, or cases where the similarity interpretation is the point of the model, not a side effect.

What KNN teaches you regardless of whether you use it: the geometry of feature spaces, the effect of scale and dimensionality on distance, and why preprocessing is not optional. That intuition transfers to every distance-based model you'll encounter — including the embedding layers inside the neural networks that have largely replaced KNN in production retrieval systems, but which operate on the same underlying principles.
