CitiBike NYC: Demand, Risk, and Net Flow (2023–2025)

Exploring how CitiBike usage and crash risk data can create value for CitiBike and an insurance partner.

1. Overview

1.1 Background

CitiBike is New York City's largest bikeshare system, used for millions of short trips each month. Despite its scale and its role as a public-transport substitute, riders currently receive no accident or liability insurance. Using the system requires accepting a broad liability waiver, so riders effectively self-insure against injury, medical costs, third-party liability, and damage to the bike.

CitiBike itself is also not fully insulated: waivers do not cover incidents involving maintenance issues, equipment failure, or negligence. This leaves residual legal exposure, reputational risk, and potential settlement pressure.

Other micromobility providers have taken a different approach. For example, Lime partnered with Allianz to provide automatic accident and liability insurance for riders. A similar model may be feasible for CitiBike, but it requires a high-resolution, data-driven understanding of location- and time-specific risk.

Key idea: We combine CitiBike trip data with NYPD crash data to construct a transparent, interpretable risk-per-trip measure by station, by time of day, and by their interaction. This enables user communication (e.g., in-app warnings), pricing of insurance products, and targeted safety interventions.

1.2 Data & Time Frame

1.3 Structure

1.4 Key Findings

  • Section 2 — Data Analysis
    • Demand & maturity: Strong seasonal peaks; growth from 2023 to 2024 but clear stagnation in 2025, which suggests CitiBike has entered a mature phase.
    • Usage structure: Extremely stable weekday/weekend split and hourly commuter peaks; trips are short.
    • Imbalance: A small set of stations act as persistent sources/sinks and drive most rebalancing needs.
    • Operational timing: Weekends (especially Sundays) are best for major rebalancing and repairs; daily rebalancing is most effective before/after the morning and early-evening peaks, and repairs fit best overnight.
    • Actionable insights: Prioritize e-bike availability at peak times; target winter-demand stimulation; focus rebalancing on the structural source/sink stations.
  • Section 3 — Risk Analysis
    • Spatial risk: Most stations are low-risk, but a small number form persistent medium- or high-risk clusters.
    • Temporal risk: Time of day matters far more than day of week: late evening and night show 1.5× to 4× higher risk than daytime, while patterns across weekdays are nearly identical.
    • Station × time-of-day: Some stations that are safe at midday become high-risk at night, so aggregate station averages conceal meaningful temporal spikes.
    • Actionable insights: Warn riders at high-risk stations/times; use station × time risk for insurance pricing; target nighttime location hotspots for safety improvements.
  • Section 4 — Net Flow Prediction
    • Imbalance patterns: Most stations are balanced; extreme imbalances are predictable and recur at the same locations.
    • Prediction approach: Classify stations into under-supply / balanced / over-supply using lagged net flow, calendar structure, and spatial features.
    • Actionable insights: Enable proactive rebalancing before stations empty/fill, reducing service failures and operational costs.

2. Data Analysis: CitiBike (2023–2025)

This section summarizes CitiBike demand based strictly on the exploratory data analysis. The analysis investigates: (i) overall daily demand, (ii) net flow and structural imbalances, (iii) usage patterns across bike type, membership, weekday/weekend, and hour of day, and (iv) trip duration and distance distributions.

2.1 Demand & System Maturity

[Figure: Daily usage and average daily demand per station (30-day rolling mean)]

Actionable insights:

2.2 Net Flow & Imbalance

Net flow for station \(j\) and day \(t\) is defined as:

\[ \text{NetFlow}_{j,t} = A_{j,t} - D_{j,t}, \]

where \(A_{j,t}\) counts arrivals at and \(D_{j,t}\) departures from station \(j\) on day \(t\). Negative values indicate stations that tend to empty; positive values indicate stations that tend to fill.

[Figure: System-wide average absolute net flow + top persistent source/sink stations]
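
As a minimal sketch of the net-flow definition, the snippet below counts arrivals and departures per station and day from raw trip records using only the Python standard library. The field names (start_station, end_station, started_at, ended_at) and the toy trips are assumptions for illustration, not the schema used in the analysis; a trip crossing midnight is counted as a departure on its start day and an arrival on its end day.

```python
from collections import Counter
from datetime import date, datetime


def daily_net_flow(trips):
    """Return {(station, day): arrivals - departures} for a list of trips.

    Field names are hypothetical, not the official CitiBike schema.
    """
    net = Counter()
    for trip in trips:
        start_day = datetime.fromisoformat(trip["started_at"]).date()
        end_day = datetime.fromisoformat(trip["ended_at"]).date()
        net[(trip["start_station"], start_day)] -= 1  # one departure, D_{j,t}
        net[(trip["end_station"], end_day)] += 1      # one arrival, A_{j,t}
    return dict(net)


# Toy data: two trips from A to B and one trip back on the same day.
trips = [
    {"start_station": "A", "end_station": "B",
     "started_at": "2024-05-01T08:00:00", "ended_at": "2024-05-01T08:12:00"},
    {"start_station": "A", "end_station": "B",
     "started_at": "2024-05-01T08:30:00", "ended_at": "2024-05-01T08:41:00"},
    {"start_station": "B", "end_station": "A",
     "started_at": "2024-05-01T17:05:00", "ended_at": "2024-05-01T17:20:00"},
]
net_flow = daily_net_flow(trips)
print(net_flow[("A", date(2024, 5, 1))])  # A loses one bike on balance
print(net_flow[("B", date(2024, 5, 1))])  # B gains one bike on balance
```

On the full trip data a dataframe group-by would be the more practical choice; the logic stays the same.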

Actionable insights:

2.3 Usage Patterns: Who, When, and How People Ride

Bike Type

[Figure: Bike type shares]

Membership vs Casual

[Figure: Membership shares]

Actionable insight:

Usage by Weekday

[Figure: Weekday vs Weekend usage]

Usage by Hour

[Figure: Hourly usage]

Actionable insights:

Trip Duration & Distance

[Figure: Trip duration and distance]

2.4 Summary

  • CitiBike demand is strongly seasonal and growth has slowed since 2024.
  • Net flow imbalances are structural and predictable, concentrated in a small set of stations.
  • Usage patterns (bike type, membership, weekday/weekend, hourly) are stable across years.
  • Trips are short and consistent in duration and distance.
  • These regularities support risk estimation (Section 3) and net-flow prediction (Section 4).

3. Risk Analysis

We combine CitiBike trips with NYPD collision data to estimate an exposure-adjusted, Empirical-Bayes-smoothed risk per trip by station, time of day, and station × time-of-day.

3.1 Assigning Crashes to Stations

Crashes are assigned to the nearest station using a BallTree with Haversine distance, subject to a strict 300 m cutoff so that only crashes plausibly related to CitiBike trips are included.
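
The assignment rule can be illustrated with a brute-force nearest-station search. This is a stand-in for the BallTree used in the analysis, not a reproduction of it: it returns the same nearest neighbour but scans every station, so it only suits small examples. Station ids and coordinates are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres
MAX_DIST_M = 300.0            # cutoff stated in the text


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def assign_crash(crash, stations, max_dist_m=MAX_DIST_M):
    """Return the id of the nearest station within the cutoff, else None."""
    best_id, best_dist = None, float("inf")
    for station_id, (lat, lon) in stations.items():
        dist = haversine_m(crash[0], crash[1], lat, lon)
        if dist < best_dist:
            best_id, best_dist = station_id, dist
    return best_id if best_dist <= max_dist_m else None


# Hypothetical stations (lat, lon) and two crashes.
stations = {"s1": (40.7500, -73.9900), "s2": (40.7600, -73.9800)}
near_crash = (40.7505, -73.9900)  # a few dozen metres north of s1
far_crash = (40.7000, -74.0500)   # kilometres away from both stations

print(assign_crash(near_crash, stations))
print(assign_crash(far_crash, stations))
```

With scikit-learn, `sklearn.neighbors.BallTree(..., metric="haversine")` performs the same lookup efficiently; it expects coordinates in radians and returns distances on the unit sphere, which must be multiplied by the Earth radius before applying the cutoff.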

3.2 Risk per Trip

For station \(j\) and time bucket \(b\):

Raw risk is the ratio of hazard \(H_{j,b}\) to exposure \(E_{j,b}\) (trips), with a small constant \(\epsilon\) guarding against division by zero: \[ R_{j,b} = \frac{H_{j,b}}{E_{j,b} + \epsilon}. \] To stabilize noisy ratios, we apply Empirical Bayes smoothing:

\[ R_{j,b}^{\mathrm{EB}} = \lambda_{j,b} R_{j,b} + (1 - \lambda_{j,b}) \mu_b, \qquad \lambda_{j,b} = \frac{E_{j,b}}{E_{j,b} + C}, \]

where \(\mu_b\) is the mean risk specific to time bucket \(b\) and \(C\) is a credibility constant that controls the strength of shrinkage. Hence, stations with low exposure (few trips) have their risk estimates shrunk more strongly toward \(\mu_b\).
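
To make the shrinkage concrete, here is a small sketch of the two formulas above. The numeric values for hazard, exposure, \(\mu_b\), and \(C\) are invented for illustration and are not estimates from the analysis.

```python
def eb_risk(hazard, exposure, prior_mean, credibility, eps=1e-9):
    """Empirical-Bayes-smoothed risk per trip.

    raw = H / (E + eps), lam = E / (E + C), result = lam * raw + (1 - lam) * mu.
    """
    raw = hazard / (exposure + eps)
    lam = exposure / (exposure + credibility)
    return lam * raw + (1.0 - lam) * prior_mean


MU_B = 0.001  # hypothetical mean risk for one time bucket
C = 1000.0    # hypothetical credibility constant

# Two stations with the same raw risk (0.01) but very different exposure.
low_exposure = eb_risk(hazard=1, exposure=100, prior_mean=MU_B, credibility=C)
high_exposure = eb_risk(hazard=100, exposure=10_000, prior_mean=MU_B, credibility=C)

print(low_exposure)   # pulled most of the way toward MU_B
print(high_exposure)  # stays close to the raw risk
```

Both stations have the same raw ratio, yet the low-exposure station ends up much closer to the bucket mean, which is the intended behaviour for noisy, thinly observed cells.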

3.3 Crash Severity

Crash severity \(S_i\) weights injuries and fatalities heavily, and assigns much higher severity to cyclist-involved crashes. The specification ensures that:

3.4 Station-Level Risk

Hazard and exposure are aggregated to obtain EB-smoothed station-level risk: \[ R_j^{\mathrm{EB}} = \lambda_j R_j + (1-\lambda_j)\mu. \]

[Figure: Station-level EB risk map]

Key findings from the analysis:

3.5 Time-of-Day & Weekday Risk

Hazard and exposure are aggregated to obtain EB-smoothed time-level risk: \[ R_b^{\mathrm{EB}} = \lambda_b R_b + (1-\lambda_b)\mu. \]

[Figure: EB risk by time of day]

Findings:

Interpretation: System-wide risk is not commuter-driven; it is strongly shaped by visibility and nighttime traffic conditions.

3.6 Station × Time-of-Day Risk

[Figure: Station × time-of-day risk grid]

Findings:

3.7 Use for CitiBike & Insurers

  • User safety alerts: highlight high-risk stations and nighttime conditions.
  • Insurance pricing: use the risk measure for risk-based premiums.
  • Operational targeting: identify night-time or location-specific safety improvement opportunities.

4. Net Flow Prediction

Net flow determines where CitiBike needs to rebalance. For station \(j\) and day \(t\):

\[ \text{NetFlow}_{j,t} = A_{j,t} - D_{j,t}, \]

where negative values indicate emptying stations and positive values indicate filling stations.

4.1 From Distribution to Prediction Target

[Figure: Distribution of station-day net flows]

We therefore predict imbalance classes rather than exact values.

4.2 Ternary Imbalance Classification

For each station-day \((j,t)\):

\[ y_{j,t} = \begin{cases} -1, & \text{if } \text{NetFlow}_{j,t} < -5, \\ 0, & \text{if } |\text{NetFlow}_{j,t}| \le 5, \\ +1, & \text{if } \text{NetFlow}_{j,t} > 5. \end{cases} \]
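
The case distinction maps directly onto a small function. The threshold of 5 comes from the definition above; the function and variable names are illustrative.

```python
THRESHOLD = 5  # bikes per station-day, as in the definition above


def imbalance_class(net_flow, threshold=THRESHOLD):
    """Map a station-day net flow to -1, 0 or +1."""
    if net_flow < -threshold:
        return -1
    if net_flow > threshold:
        return 1
    return 0  # |net_flow| <= threshold


example_flows = [-12, -5, 0, 5, 9]
example_classes = [imbalance_class(x) for x in example_flows]
print(example_classes)
```

Note that the boundary values −5 and +5 fall into the balanced class because the outer inequalities are strict.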

4.3 Prediction Setup and Data Split

We predict tomorrow's imbalance class \(y_{j,t+1}\) for each station using only information available up to day \(t\).

Time-series cross-validation would be ideal for hyperparameter tuning but is computationally expensive for the full data set. A time-consistent train-validation-test split provides an efficient and clean alternative.
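
A time-consistent split can be sketched as follows. The cutoff dates and the row layout are hypothetical; the point is only that every training day precedes every validation day, which in turn precedes every test day.

```python
from datetime import date, timedelta


def time_split(rows, train_end, val_end):
    """Split rows chronologically on their 'day' field (no shuffling)."""
    train = [r for r in rows if r["day"] <= train_end]
    val = [r for r in rows if train_end < r["day"] <= val_end]
    test = [r for r in rows if r["day"] > val_end]
    return train, val, test


# Ten consecutive toy station-days for a single hypothetical station.
rows = [{"station": "A", "day": date(2024, 1, 1) + timedelta(days=i)} for i in range(10)]

train, val, test = time_split(rows, train_end=date(2024, 1, 6), val_end=date(2024, 1, 8))
print(len(train), len(val), len(test))
```

Splitting on dates rather than on random rows keeps future information out of training and tuning.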

4.4 Baseline Models

Two simple baselines provide reference points for evaluating model improvements.

Because under- and over-supply cases are rare, accuracy is uninformative. We instead report Macro-F1 (balanced performance across classes) and per-class recall (ability to detect rare but critical events).

Evaluation metrics.

The F1-score for a class is the harmonic mean of precision and recall, balancing how often the model’s positive predictions are correct (precision) with how many true cases it successfully detects (recall). Macro-F1 computes the F1-score separately for each class (−1, 0, +1) and then averages them with equal weight, preventing the majority class from dominating performance. Per-class recall measures the share of actual instances of each class that the model identifies correctly; high recall for −1 and +1 is operationally crucial because missing an emptying or filling station is far more costly than misclassifying a balanced one. These metrics therefore capture performance on the rare but important imbalance events that CitiBike cares about.
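
The following sketch computes per-class precision, recall, and F1 and then the unweighted Macro-F1 for the three classes. The labels are toy data, and undefined ratios (no predictions or no true cases for a class) are set to 0 as a convention chosen here.

```python
def per_class_metrics(y_true, y_pred, classes=(-1, 0, 1)):
    """Return {class: {'precision', 'recall', 'f1'}} for the given labels."""
    metrics = {}
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[c] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


def macro_f1(metrics):
    """Unweighted mean of the per-class F1 scores."""
    return sum(m["f1"] for m in metrics.values()) / len(metrics)


# Toy data: 8 balanced days, 1 emptying day, 1 filling day.
y_true = [0] * 8 + [-1, 1]
y_pred = [0] * 10  # a model that always predicts "balanced"

scores = per_class_metrics(y_true, y_pred)
print(scores[-1]["recall"], scores[1]["recall"])  # both minority recalls are zero
print(round(macro_f1(scores), 3))
```

In this toy case accuracy is 80%, yet Macro-F1 stays below 0.3 because two of the three classes score zero, which illustrates why accuracy alone is misleading here.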

4.5 Predictive Models and Features

Each station-day is characterized by lagged imbalance signals, calendar structure, station location, and lagged weather, all of which are observable at prediction time.
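
As an illustration of features that are observable at prediction time, the sketch below builds lagged net-flow and calendar features for one station and target day. The specific lags (1 and 7 days) and feature names are assumptions; station location and lagged weather are omitted for brevity.

```python
from datetime import date, timedelta


def build_features(net_flow, station, target_day, lags=(1, 7)):
    """Features for predicting target_day using only earlier observations.

    net_flow maps (station, day) to the observed net flow; a missing lag is None.
    """
    features = {}
    for lag in lags:
        past_day = target_day - timedelta(days=lag)
        features[f"net_flow_lag{lag}"] = net_flow.get((station, past_day))
    # Calendar features of the target day are known in advance.
    features["weekday"] = target_day.weekday()  # Monday = 0
    features["is_weekend"] = target_day.weekday() >= 5
    return features


# Hypothetical observed net flows for station "A".
observed = {("A", date(2024, 5, 1)): -7, ("A", date(2024, 5, 7)): 3}

feats = build_features(observed, "A", target_day=date(2024, 5, 8))
print(feats)
```

Every net-flow value used is dated strictly before the target day, so the feature row does not leak the outcome it is meant to predict.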

We estimate two models:

Class weights compensate for imbalance during training. These make the model pay more attention to the rare −1 and +1 cases by penalizing their misclassification more strongly, preventing the model from defaulting to the majority class. All tuning follows the time-consistent train-validation setup.
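
One common way to set such weights is inverse class frequency, sketched below. The weighting scheme actually used is not stated above, so this "balanced" heuristic (n_samples / (n_classes × class count)) is an assumption chosen for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * count) for c, count in counts.items()}


# Toy training labels: the balanced class dominates.
train_labels = [0] * 8 + [-1, 1]
weights = balanced_class_weights(train_labels)
print(weights)
```

Rare classes receive proportionally larger weights, so misclassifying a −1 or +1 station-day costs more during training than misclassifying a balanced one.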

4.6 Results

Models are evaluated on the test set using Macro-F1 and per-class recall. Accuracy is not emphasized because the data are strongly imbalanced.

Baseline Models

The two baseline models provide reference points. Baseline 0 (always predicting “balanced”) achieves high accuracy but detects none of the exporter (class −1, emptying) or importer (class +1, filling) cases (Macro-F1 ≈ 0.28). Baseline 1 (persistence), which predicts tomorrow’s class as today’s, performs better (Macro-F1 ≈ 0.38) by exploiting simple day-to-day persistence, but still identifies only about 16–19% of exporter/importer station-days.
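
Both baselines are simple enough to sketch in a few lines. The class sequence below is a toy example for a single hypothetical station, not data from the analysis.

```python
def always_balanced(n_days):
    """Baseline 0: predict class 0 for every station-day."""
    return [0] * n_days


def persistence_pairs(series):
    """Baseline 1: pair each day's true class with the previous day's class.

    Returns (y_true, y_pred); the first day has no prediction and is dropped.
    """
    return series[1:], series[:-1]


# Toy daily classes for one station over five days.
daily_classes = [0, -1, -1, 0, 1]

y_true, y_pred_persist = persistence_pairs(daily_classes)
y_pred_zero = always_balanced(len(y_true))

hits_persist = sum(t == p for t, p in zip(y_true, y_pred_persist))
hits_zero = sum(t == p for t, p in zip(y_true, y_pred_zero))
print(hits_persist, hits_zero)
```

Persistence can only catch an imbalance that was already present the day before, which is consistent with its limited recall on exporter and importer days.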

Multinomial Logistic Regression

The multinomial logistic regression with class weights improves minority detection substantially. It identifies roughly 50–56% of exporter and importer station-days and reaches a Macro-F1 ≈ 0.40. This confirms that lagged net flow, calendar structure, and weather features contain useful predictive signal, but the linear decision boundary limits performance on more complex spatial–temporal patterns.

Gradient-Boosted Trees (CatBoost)

The CatBoost gradient-boosting model is the best-performing approach. On the test set it achieves:

Compared to Baseline 0, CatBoost improves Macro-F1 by about 0.23 (0.28 → 0.51) and raises minority-class recall from 0% to 60%. Relative to the persistence baseline, minority recall improves by more than a factor of three (≈0.16–0.19 → 0.60). Compared to logistic regression, CatBoost increases Macro-F1 from roughly 0.40 to 0.51 and improves both exporter and importer recall (from ≈0.50–0.56 to 0.60), reflecting its ability to capture nonlinear spatial–temporal and weather interactions.

Trade-offs and Interpretation

CatBoost attains this gain by predicting minority classes more often, which lowers recall for the balanced class (≈0.57 vs. ≈0.78 for the persistence baseline). This trade-off is appropriate for the task: misclassifying some balanced days as imbalanced has low operational cost, whereas missing exporter or importer days leads directly to empty or full stations. Overall, CatBoost makes the correct trade-off for proactive rebalancing and is the preferred model for net-flow imbalance prediction.

Future Model Refinement

Although CatBoost performs best, several refinements are possible. Additional features, such as improved weather representations or station-level clustering, may strengthen predictions. More targeted hyperparameter tuning and probabilistic calibration could further improve minority-class detection. Exploring alternative gradient-boosting models may also yield performance gains.

4.7 Operational Value

  • Proactive rebalancing: identify tomorrow's problematic stations.
  • Fewer outages: reduce empty/full stations and lost rides.
  • Cost efficiency: target redistribution to where imbalance is predicted, not just observed.

5. Conclusion & Strategic Takeaways

5.1 Main Findings

5.2 Proposed Actions

5.3 Value for CitiBike

5.4 Value for an Insurance Partner

5.5 Next Steps