
Most data science tutorials start with a dataset and a model. They end with an accuracy score. Then they stop — and that's exactly the problem.
An ML product is different from an ML model. A model is a mathematical function. A product is something that changes a decision, saves money, or surfaces insight that wasn't visible before. The gap between those two things is where most projects die.
Capital Bikeshare is one of the best public datasets in existence for practicing this full arc. It's not a toy. It's a real system operating in a real city, with real costs when things go wrong, and a clean paper trail of everything that's happened for years. If you can build an ML product on this, you can build one on almost anything.
This article walks you through the entire journey — why this problem matters to a business, what the technical architecture looks like, and the code that ties it all together.
Capital Bikeshare operates roughly 700 stations across Washington DC and surrounding areas. Every day, thousands of riders pick up a bike at one station and drop it at another. This sounds straightforward. The math, however, creates a slow-motion disaster.
Commuters are not symmetric. In the morning, bikes flow from residential neighbourhoods toward downtown, transit hubs, and office corridors. By 9am, stations near Capitol Hill are overflowing. Stations near Columbia Heights are empty. By 5pm, the same thing happens in reverse. But it's never perfectly symmetric — weather, events, holidays, and random variation mean the imbalance is never the same two days in a row.
The result: riders arrive at an empty station and can't pick up a bike. Or they arrive at a full station and can't return one. Both are failures. Both cost money in a surprisingly direct way.
The business impact of this problem is large and underappreciated.
Direct operational cost. Bikeshare systems employ teams of drivers operating trucks and vans to manually move bikes from full stations to empty ones. This is called rebalancing. In a city the size of Washington DC, this is a multi-million-dollar annual operational line item. The trucks run on diesel. The drivers earn wages. The routes are planned by coordinators who are guessing, not calculating.
Rider churn. A rider who arrives at an empty station misses their meeting, misses their train, gets soaked in unexpected rain. They remember. Studies on micromobility retention consistently show that a failed trip in the first few months of use dramatically increases the probability of a rider cancelling their subscription. A single empty-station failure can cost a bikeshare operator many times the value of that one trip in long-term revenue.
Ghost rebalancing. Without prediction, operators often send trucks based on yesterday's pattern. But yesterday was sunny and today is raining. The truck arrives, moves bikes that didn't need moving, and misses the real shortage forming three stations away. This is not a hypothetical — it's the norm in reactive systems.
Regulatory pressure. Cities that contract bikeshare operators often include service-level agreements with financial penalties for high rates of empty or full stations during peak hours. Missing SLAs means paying fines and risking contract renewal.
The core prediction problem is this: given everything I know right now about a station, how many bikes will riders demand there over the next 1, 2, and 4 hours?
If you can answer that accurately, a dispatcher can pre-position bikes before shortages form, prioritise the stations most likely to fail, and plan routes around predicted demand instead of yesterday's pattern.
The ML model doesn't drive a truck. It tells the dispatcher where to send one, and when, and with how many bikes. That's the product.
The system publishes monthly CSV files of every trip taken. Each row is one completed trip and includes:
ride_id — unique trip identifier
rideable_type — classic or electric bike
started_at, ended_at — trip start and end timestamps
start_station_name, start_station_id
end_station_name, end_station_id
start_lat, start_lng, end_lat, end_lng
member_casual — subscription member or casual rider

This is rich. But raw trips are not what the model needs. The model needs station-hour level demand: how many bikes departed from and arrived at each station each hour.
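Here is that aggregation in Polars. A minimal sketch: the filename pattern is assumed, and it leans on a recent Polars version for group_by, the full join, and upsample.

```python
import polars as pl

# Load one monthly trip file (filename pattern assumed; adjust to your download).
trips = pl.read_csv("202401-capitalbikeshare-tripdata.csv", try_parse_dates=True)

def hourly_counts(df: pl.DataFrame, ts: str, station: str, name: str) -> pl.DataFrame:
    """Count trips per station per hour for one end of the trip."""
    return (
        df.with_columns(pl.col(ts).dt.truncate("1h").alias("hour"))
        .group_by(station, "hour")
        .agg(pl.len().alias(name))
        .rename({station: "station_id"})
    )

departures = hourly_counts(trips, "started_at", "start_station_id", "departures")
arrivals = hourly_counts(trips, "ended_at", "end_station_id", "arrivals")

# Full outer join so hours with traffic on only one side survive.
demand = (
    departures.join(arrivals, on=["station_id", "hour"], how="full", coalesce=True)
    .fill_null(0)
    .sort("station_id", "hour")
)

# Densify: insert the hours with zero trips, so every station has exactly
# one row per hour.
demand = (
    demand.upsample(time_column="hour", every="1h", group_by="station_id", maintain_order=True)
    .with_columns(
        pl.col("station_id").forward_fill(),
        pl.col("departures", "arrivals").fill_null(0),
    )
)
```

The densify step matters more than it looks: a station with zero trips in an hour is data, not a missing row, and the lag features later depend on a regular hourly grid.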
Weather is one of the strongest signals in the entire dataset. A 10°C drop combined with rain will cut demand by 40–60% at recreational stations. This is knowable in advance from forecasts.
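Pulling it in is one request and one join. The sketch below uses Open-Meteo's historical archive endpoint with central DC coordinates standing in for the whole network; at prediction time you would query the forecast endpoint instead, and the parameter names should be checked against the current docs.

```python
import polars as pl
import requests

# Hourly history for central Washington DC from Open-Meteo's archive API.
resp = requests.get(
    "https://archive-api.open-meteo.com/v1/archive",
    params={
        "latitude": 38.9072,
        "longitude": -77.0369,
        "start_date": "2024-01-01",
        "end_date": "2024-01-31",
        "hourly": "temperature_2m,precipitation,wind_speed_10m",
        "timezone": "America/New_York",
    },
    timeout=30,
)
resp.raise_for_status()

weather = (
    pl.DataFrame(resp.json()["hourly"])
    .with_columns(pl.col("time").str.to_datetime("%Y-%m-%dT%H:%M").alias("hour"))
    .drop("time")
)

# One weather row per hour fans out across every station in the join.
demand = demand.join(weather, on="hour", how="left")
```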
A raw demand count on its own gives the model almost nothing to learn from. Features are the translation layer between raw data and learnable patterns. This is the craft of the job.
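Here is a sketch of the core feature set: calendar signals, lags, and a rolling mean. The lag offsets only mean what they claim because the frame above has exactly one row per station per hour.

```python
import polars as pl

features = demand.sort("station_id", "hour").with_columns(
    # Calendar signals: the commute shape is driven by hour and weekday.
    pl.col("hour").dt.hour().alias("hour_of_day"),
    pl.col("hour").dt.weekday().alias("day_of_week"),
    # Lags: demand 1 hour, 1 day, and 1 week ago at the same station.
    pl.col("departures").shift(1).over("station_id").alias("lag_1h"),
    pl.col("departures").shift(24).over("station_id").alias("lag_24h"),
    pl.col("departures").shift(168).over("station_id").alias("lag_168h"),
    # A daily rolling mean smooths hour-to-hour noise.
    pl.col("departures").rolling_mean(window_size=24).over("station_id").alias("roll_24h"),
).with_columns(
    # Target: departures in the following hour.
    pl.col("departures").shift(-1).over("station_id").alias("target_1h"),
)
```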
A random train/test split on time series data is one of the most common — and silent — mistakes in ML. If you randomly split, your model trains on September and predicts May. This seems fine. But the model has seen the future: lag features from future rows leak backward. Your validation metrics are lies, and you won't find out until you deploy.
Always split chronologically.
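In code, that is one date cutoff, not a random shuffle (the date itself is illustrative):

```python
from datetime import datetime

# Chronological split: train on the past, score on the future,
# exactly as the model will be used in production.
cutoff = datetime(2024, 9, 1)
train = features.filter(pl.col("hour") < cutoff).drop_nulls("target_1h")
test = features.filter(pl.col("hour") >= cutoff).drop_nulls("target_1h")
```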
LightGBM wins on tabular data with temporal structure. It handles missing values natively, trains fast, and gives you feature importances without extra work.
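A reasonable starting configuration is sketched below; the hyperparameters are starting points, not tuned values. The early rows with null lags arrive as NaN, and LightGBM's native missing-value handling absorbs them.

```python
import lightgbm as lgb

feature_cols = [
    "hour_of_day", "day_of_week", "lag_1h", "lag_24h", "lag_168h",
    "roll_24h", "temperature_2m", "precipitation", "wind_speed_10m",
]

model = lgb.LGBMRegressor(
    n_estimators=2000,
    learning_rate=0.05,
    num_leaves=63,
    objective="regression_l1",  # optimise MAE directly, the metric we report
)

model.fit(
    train.select(feature_cols).to_pandas(),
    train["target_1h"].to_pandas(),
    eval_set=[(test.select(feature_cols).to_pandas(), test["target_1h"].to_pandas())],
    callbacks=[lgb.early_stopping(stopping_rounds=100)],
)
```

One shortcut to flag: this sketch early-stops on the test window, which leaks it into model selection. A real pipeline would carve a separate validation window between train and test.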
A model with good MAE is necessary. It's not sufficient. Operators need to trust the system. A dashboard that says "Station X will be empty at 8am" with no reasoning gets ignored. A dashboard that says "Station X will be empty because it's a Monday morning with heavy demand predicted — it's run dry 14 of the last 20 Monday mornings — and rain is incoming" gets acted on.
SHAP (SHapley Additive exPlanations) gives you exactly this.
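A minimal sketch of that reasoning layer, reusing the model and test frame from the training step:

```python
import shap

# TreeExplainer is exact and fast for gradient-boosted trees.
explainer = shap.TreeExplainer(model)
X_test = test.select(feature_cols).to_pandas()
shap_values = explainer.shap_values(X_test)

# Top three drivers behind a single station-hour prediction. Positive values
# push predicted demand up, negative values push it down.
row = 0
top = sorted(zip(feature_cols, shap_values[row]), key=lambda t: -abs(t[1]))[:3]
for name, contribution in top:
    print(f"{name}: {contribution:+.2f} bikes")
```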
This is where the product separates itself. Predictions are intelligence. Actions are value. The optimiser converts predictions into truck dispatch instructions.
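Here is the shape of that step, stripped to its core. A real optimiser solves a vehicle-routing problem with travel times and truck capacities; this greedy sketch just pairs the worst projected shortage with the worst projected overflow, and every threshold and station ID in it is illustrative.

```python
def plan_moves(projected: dict[str, int], low=2, high=18, truck_size=15):
    """Greedy pairing: the station with the worst projected shortage gets
    bikes from the station with the worst projected overflow.

    projected: predicted bikes on hand per station four hours from now.
    Returns (from_station, to_station, n_bikes) dispatch instructions.
    """
    shortages = sorted((b, s) for s, b in projected.items() if b < low)
    overflows = sorted(((b, s) for s, b in projected.items() if b > high), reverse=True)

    moves = []
    for (short_b, short_s), (full_b, full_s) in zip(shortages, overflows):
        n = min(truck_size, full_b - high, low - short_b + 3)
        if n > 0:
            moves.append((full_s, short_s, n))
    return moves

# Illustrative station IDs and projections, not real forecasts.
print(plan_moves({"31200": 0, "31101": 24, "31624": 1, "31280": 22}))
# [('31101', '31200', 5), ('31280', '31624', 4)]
```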
The impact of this system is measurable across three dimensions.
Operational efficiency. A reactive system sends trucks after stations fail. A predictive system pre-positions trucks before stations fail. The difference in truck mileage between these two approaches, on a real DC-scale system, is estimated at 20–35% reduction in total vehicle distance per day. Fewer kilometres means lower fuel costs, lower driver hours, and reduced emissions.
Rider experience. Failed trips (arriving at an empty or full station) directly measure service quality. Predictive rebalancing can reduce failed trips during peak hours by 40–60% compared to reactive scheduling, based on published results from similar systems in New York and Paris.
Model accuracy. A well-engineered LightGBM model on this dataset typically achieves an MAE of 1.5–2.5 bikes per station per hour on the test set, compared with a same-hour-last-week baseline of 3.5–5 bikes. That's a 40–50% improvement over naive forecasting.
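That comparison is cheap to run yourself, because the naive forecast is already sitting in the feature frame as lag_168h:

```python
from sklearn.metrics import mean_absolute_error

y_test = test["target_1h"].to_pandas()
mae_model = mean_absolute_error(y_test, model.predict(X_test))

# Naive baseline: demand will be whatever it was at this hour last week,
# i.e. the lag_168h column used directly as the prediction.
mae_baseline = mean_absolute_error(y_test, X_test["lag_168h"].fillna(0))

print(f"model MAE:    {mae_model:.2f} bikes/station/hour")
print(f"baseline MAE: {mae_baseline:.2f} bikes/station/hour")
```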
The explainability layer means operators understand why the model is alerting them — which means they override it less often, act on it faster, and trust it more over time. Trust is itself a KPI.
The MLOps layer ensures this doesn't become a science project. A model that degrades silently is worse than no model — operators rely on it and don't notice when accuracy drops. Monitoring prediction error against actuals, version control over models, and automated retraining keep the system honest.
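A minimal sketch of that monitoring loop with MLflow; the reference value, the drift threshold, and the retraining hook are placeholders for whatever your pipeline provides.

```python
import mlflow
from sklearn.metrics import mean_absolute_error

# Nightly job: score yesterday's predictions against what actually happened.
# `actuals` and `preds` would come from your prediction store (hypothetical).
live_mae = mean_absolute_error(actuals, preds)

TRAINING_MAE = 2.1  # illustrative reference measured on the test set at training time

with mlflow.start_run(run_name="daily-monitoring"):
    mlflow.log_metric("live_mae", live_mae)
    # Simple degradation rule: retrain once live error drifts 25% past the reference.
    if live_mae > TRAINING_MAE * 1.25:
        trigger_retraining()  # hypothetical hook into the training pipeline
```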
The journey from "I trained a model" to "I built a product" has a specific shape. It starts with a business problem stated in dollars and decisions, not accuracy scores. It runs through the data engineering that tutorials skip and real systems can't. It demands feature engineering that takes domain knowledge seriously. It requires evaluation discipline that most notebook projects never enforce.
And it ends not with a number on a leaderboard, but with a dispatcher looking at a map, seeing which stations will fail in the next four hours, and knowing exactly which truck to send.
That is the product. Everything else is the work that earns it.
Dataset: capitalbikeshare.com/system-data | Weather: open-meteo.com | Stack: Polars · LightGBM · SHAP · FastAPI · MLflow