
Most data science tutorials start with a dataset and a model. They end with an accuracy score. Then they stop — and that's exactly the problem.
An ML product is different from an ML model. A model is a mathematical function. A product is something that changes a decision, saves money, or surfaces insight that wasn't visible before. The gap between those two things is where most projects die.
Capital Bikeshare is one of the best public datasets in existence for practicing this full arc. It's not a toy. It's a real system operating in a real city, with real costs when things go wrong, and a clean paper trail of everything that's happened for years. If you can build an ML product on this, you can build one on almost anything.
This article walks you through the entire journey — why this problem matters to a business, what the technical architecture looks like, and the code that ties it all together.
Capital Bikeshare operates roughly 700 stations across Washington DC and surrounding areas. Every day, thousands of riders pick up a bike at one station and drop it at another. This sounds straightforward. The math, however, creates a slow-motion disaster.
Commuters are not symmetric. In the morning, bikes flow from residential neighbourhoods toward downtown, transit hubs, and office corridors. By 9am, stations near Capitol Hill are overflowing. Stations near Columbia Heights are empty. By 5pm, the same thing happens in reverse. But it's never perfectly symmetric — weather, events, holidays, and random variation mean the imbalance is never the same two days in a row.
The result: riders arrive at an empty station and can't pick up a bike. Or they arrive at a full station and can't return one. Both are failures. Both cost money in a surprisingly direct way.
The business impact of this problem is large and underappreciated.
Direct operational cost. Bikeshare systems employ teams of drivers operating trucks and vans to manually move bikes from full stations to empty ones. This is called rebalancing. In a city the size of Washington DC, this is a multi-million-dollar annual operational line item. The trucks run on diesel. The drivers earn wages. The routes are planned by coordinators who are guessing, not calculating.
Rider churn. A rider who arrives at an empty station misses their meeting, misses their train, gets soaked in unexpected rain. They remember. Studies on micromobility retention consistently show that a failed trip in the first few months of use dramatically increases the probability of a rider cancelling their subscription. A single empty-station failure can cost a bikeshare operator many times the value of that one trip in long-term revenue.
Ghost rebalancing. Without prediction, operators often send trucks based on yesterday's pattern. But yesterday was sunny and today is raining. The truck arrives, moves bikes that didn't need moving, and misses the real shortage forming three stations away. This is not a hypothetical — it's the norm in reactive systems.
Regulatory pressure. Cities that contract bikeshare operators often include service-level agreements with financial penalties for high rates of empty or full stations during peak hours. Missing SLAs means paying fines and risking contract renewal.
The core prediction problem is this: given everything I know right now about a station, how many bikes will riders demand there over the next 1, 2, and 4 hours?
If you can answer that accurately, a dispatcher can pre-position bikes before shortages form, prioritise the stations most likely to fail, and plan routes around predicted demand instead of yesterday's pattern.
The ML model doesn't drive a truck. It tells the dispatcher where to send one, and when, and with how many bikes. That's the product.
The system publishes monthly CSV files of every trip taken. Each row is one completed trip and includes:
ride_id — unique trip identifier
rideable_type — classic or electric bike
started_at, ended_at — trip start and end timestamps
start_station_name, start_station_id
end_station_name, end_station_id
start_lat, start_lng, end_lat, end_lng
member_casual — subscription member or casual rider

This is rich. But raw trips are not what the model needs. The model needs station-hour level demand: how many bikes departed from and arrived at each station each hour.
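Here is that aggregation in Polars. A minimal sketch: the filename pattern is assumed, and it leans on a recent Polars version for group_by, the full join, and upsample.

```python
import polars as pl

# Load one monthly trip file (filename pattern assumed; adjust to your download).
trips = pl.read_csv("202401-capitalbikeshare-tripdata.csv", try_parse_dates=True)

def hourly_counts(df: pl.DataFrame, ts: str, station: str, name: str) -> pl.DataFrame:
    """Count trips per station per hour for one end of the trip."""
    return (
        df.with_columns(pl.col(ts).dt.truncate("1h").alias("hour"))
        .group_by(station, "hour")
        .agg(pl.len().alias(name))
        .rename({station: "station_id"})
    )

departures = hourly_counts(trips, "started_at", "start_station_id", "departures")
arrivals = hourly_counts(trips, "ended_at", "end_station_id", "arrivals")

# Full outer join so hours with traffic on only one side survive.
demand = (
    departures.join(arrivals, on=["station_id", "hour"], how="full", coalesce=True)
    .fill_null(0)
    .sort("station_id", "hour")
)

# Densify: insert the hours with zero trips, so every station has exactly
# one row per hour.
demand = (
    demand.upsample(time_column="hour", every="1h", group_by="station_id", maintain_order=True)
    .with_columns(
        pl.col("station_id").forward_fill(),
        pl.col("departures", "arrivals").fill_null(0),
    )
)
```

The densify step matters more than it looks: a station with zero trips in an hour is data, not a missing row, and the lag features later depend on a regular hourly grid.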
Weather is one of the strongest signals in the entire dataset. A 10°C drop combined with rain will cut demand by 40–60% at recreational stations. This is knowable in advance from forecasts.
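Pulling it in is one request and one join. The sketch below uses Open-Meteo's historical archive endpoint with central DC coordinates standing in for the whole network; at prediction time you would query the forecast endpoint instead, and the parameter names should be checked against the current docs.

```python
import polars as pl
import requests

# Hourly history for central Washington DC from Open-Meteo's archive API.
resp = requests.get(
    "https://archive-api.open-meteo.com/v1/archive",
    params={
        "latitude": 38.9072,
        "longitude": -77.0369,
        "start_date": "2024-01-01",
        "end_date": "2024-01-31",
        "hourly": "temperature_2m,precipitation,wind_speed_10m",
        "timezone": "America/New_York",
    },
    timeout=30,
)
resp.raise_for_status()

weather = (
    pl.DataFrame(resp.json()["hourly"])
    .with_columns(pl.col("time").str.to_datetime("%Y-%m-%dT%H:%M").alias("hour"))
    .drop("time")
)

# One weather row per hour fans out across every station in the join.
demand = demand.join(weather, on="hour", how="left")
```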
A raw demand count on its own gives the model almost nothing to learn from. Features are the translation layer between raw data and learnable patterns. This is the craft of the job.
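Here is a sketch of the core feature set: calendar signals, lags, and a rolling mean. The lag offsets only mean what they claim because the frame above has exactly one row per station per hour.

```python
import polars as pl

features = demand.sort("station_id", "hour").with_columns(
    # Calendar signals: the commute shape is driven by hour and weekday.
    pl.col("hour").dt.hour().alias("hour_of_day"),
    pl.col("hour").dt.weekday().alias("day_of_week"),
    # Lags: demand 1 hour, 1 day, and 1 week ago at the same station.
    pl.col("departures").shift(1).over("station_id").alias("lag_1h"),
    pl.col("departures").shift(24).over("station_id").alias("lag_24h"),
    pl.col("departures").shift(168).over("station_id").alias("lag_168h"),
    # A daily rolling mean smooths hour-to-hour noise.
    pl.col("departures").rolling_mean(window_size=24).over("station_id").alias("roll_24h"),
).with_columns(
    # Target: departures in the following hour.
    pl.col("departures").shift(-1).over("station_id").alias("target_1h"),
)
```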
A random train/test split on time series data is one of the most common — and silent — mistakes in ML. If you randomly split, your model trains on September and predicts May. This seems fine. But the model has seen the future: lag features from future rows leak backward. Your validation metrics are lies, and you won't find out until you deploy.
Always split chronologically.
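In code, that is one date cutoff, not a random shuffle (the date itself is illustrative):

```python
from datetime import datetime

# Chronological split: train on the past, score on the future,
# exactly as the model will be used in production.
cutoff = datetime(2024, 9, 1)
train = features.filter(pl.col("hour") < cutoff).drop_nulls("target_1h")
test = features.filter(pl.col("hour") >= cutoff).drop_nulls("target_1h")
```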
LightGBM wins on tabular data with temporal structure. It handles missing values natively, trains fast, and gives you feature importances without extra work.
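A reasonable starting configuration is sketched below; the hyperparameters are starting points, not tuned values. The early rows with null lags arrive as NaN, and LightGBM's native missing-value handling absorbs them.

```python
import lightgbm as lgb

feature_cols = [
    "hour_of_day", "day_of_week", "lag_1h", "lag_24h", "lag_168h",
    "roll_24h", "temperature_2m", "precipitation", "wind_speed_10m",
]

model = lgb.LGBMRegressor(
    n_estimators=2000,
    learning_rate=0.05,
    num_leaves=63,
    objective="regression_l1",  # optimise MAE directly, the metric we report
)

model.fit(
    train.select(feature_cols).to_pandas(),
    train["target_1h"].to_pandas(),
    eval_set=[(test.select(feature_cols).to_pandas(), test["target_1h"].to_pandas())],
    callbacks=[lgb.early_stopping(stopping_rounds=100)],
)
```

One shortcut to flag: this sketch early-stops on the test window, which leaks it into model selection. A real pipeline would carve a separate validation window between train and test.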
A model with good MAE is necessary. It's not sufficient. Operators need to trust the system. A dashboard that says "Station X will be empty at 8am" with no reasoning gets ignored. A dashboard that says "Station X will be empty because it's a Monday morning with heavy demand predicted — it's run dry 14 of the last 20 Monday mornings — and rain is incoming" gets acted on.
SHAP (SHapley Additive exPlanations) gives you exactly this.
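A minimal sketch of that reasoning layer, reusing the model and test frame from the training step:

```python
import shap

# TreeExplainer is exact and fast for gradient-boosted trees.
explainer = shap.TreeExplainer(model)
X_test = test.select(feature_cols).to_pandas()
shap_values = explainer.shap_values(X_test)

# Top three drivers behind a single station-hour prediction. Positive values
# push predicted demand up, negative values push it down.
row = 0
top = sorted(zip(feature_cols, shap_values[row]), key=lambda t: -abs(t[1]))[:3]
for name, contribution in top:
    print(f"{name}: {contribution:+.2f} bikes")
```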
This is where the product separates itself. Predictions are intelligence. Actions are value. The optimiser converts predictions into truck dispatch instructions.
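Here is the shape of that step, stripped to its core. A real optimiser solves a vehicle-routing problem with travel times and truck capacities; this greedy sketch just pairs the worst projected shortage with the worst projected overflow, and every threshold and station ID in it is illustrative.

```python
def plan_moves(projected: dict[str, int], low=2, high=18, truck_size=15):
    """Greedy pairing: the station with the worst projected shortage gets
    bikes from the station with the worst projected overflow.

    projected: predicted bikes on hand per station four hours from now.
    Returns (from_station, to_station, n_bikes) dispatch instructions.
    """
    shortages = sorted((b, s) for s, b in projected.items() if b < low)
    overflows = sorted(((b, s) for s, b in projected.items() if b > high), reverse=True)

    moves = []
    for (short_b, short_s), (full_b, full_s) in zip(shortages, overflows):
        n = min(truck_size, full_b - high, low - short_b + 3)
        if n > 0:
            moves.append((full_s, short_s, n))
    return moves

# Illustrative station IDs and projections, not real forecasts.
print(plan_moves({"31200": 0, "31101": 24, "31624": 1, "31280": 22}))
# [('31101', '31200', 5), ('31280', '31624', 4)]
```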
The impact of this system is measurable across three dimensions.
Operational efficiency. A reactive system sends trucks after stations fail. A predictive system pre-positions trucks before stations fail. The difference in truck mileage between these two approaches, on a real DC-scale system, is estimated at 20–35% reduction in total vehicle distance per day. Fewer kilometres means lower fuel costs, lower driver hours, and reduced emissions.
Rider experience. Failed trips (arriving at an empty or full station) directly measure service quality. Predictive rebalancing can reduce failed trips during peak hours by 40–60% compared to reactive scheduling, based on published results from similar systems in New York and Paris.
Model accuracy. A well-engineered LightGBM model on this dataset typically achieves an MAE of 1.5–2.5 bikes per station per hour on the test set, compared with a same-hour-last-week baseline of 3.5–5 bikes. That's a 40–50% improvement over naive forecasting.
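That comparison is cheap to run yourself, because the naive forecast is already sitting in the feature frame as lag_168h:

```python
from sklearn.metrics import mean_absolute_error

y_test = test["target_1h"].to_pandas()
mae_model = mean_absolute_error(y_test, model.predict(X_test))

# Naive baseline: demand will be whatever it was at this hour last week,
# i.e. the lag_168h column used directly as the prediction.
mae_baseline = mean_absolute_error(y_test, X_test["lag_168h"].fillna(0))

print(f"model MAE:    {mae_model:.2f} bikes/station/hour")
print(f"baseline MAE: {mae_baseline:.2f} bikes/station/hour")
```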
The explainability layer means operators understand why the model is alerting them — which means they override it less often, act on it faster, and trust it more over time. Trust is itself a KPI.
The MLOps layer ensures this doesn't become a science project. A model that degrades silently is worse than no model — operators rely on it and don't notice when accuracy drops. Monitoring prediction error against actuals, version control over models, and automated retraining keep the system honest.
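A minimal sketch of that monitoring loop with MLflow; the reference value, the drift threshold, and the retraining hook are placeholders for whatever your pipeline provides.

```python
import mlflow
from sklearn.metrics import mean_absolute_error

# Nightly job: score yesterday's predictions against what actually happened.
# `actuals` and `preds` would come from your prediction store (hypothetical).
live_mae = mean_absolute_error(actuals, preds)

TRAINING_MAE = 2.1  # illustrative reference measured on the test set at training time

with mlflow.start_run(run_name="daily-monitoring"):
    mlflow.log_metric("live_mae", live_mae)
    # Simple degradation rule: retrain once live error drifts 25% past the reference.
    if live_mae > TRAINING_MAE * 1.25:
        trigger_retraining()  # hypothetical hook into the training pipeline
```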
The journey from "I trained a model" to "I built a product" has a specific shape. It starts with a business problem stated in dollars and decisions, not accuracy scores. It runs through the data engineering that tutorials skip and real systems can't. It demands feature engineering that takes domain knowledge seriously. It requires evaluation discipline that most notebook projects never enforce.
And it ends not with a number on a leaderboard, but with a dispatcher looking at a map, seeing which stations will fail in the next four hours, and knowing exactly which truck to send.
That is the product. Everything else is the work that earns it.
Dataset: capitalbikeshare.com/system-data | Weather: open-meteo.com | Stack: Polars · LightGBM · SHAP · FastAPI · MLflow