CitiBike NYC: Demand, Risk, and Net Flow Analysis (2023–2025)

1. CitiBike & Project Overview

1.1 CitiBike in New York City

CitiBike is New York City’s largest bike sharing system with thousands of bikes and a dense network of stations across Manhattan, Brooklyn, Queens, parts of the Bronx, and Jersey City. For many trips, CitiBike is a realistic alternative to public transport, ride-hailing, or private car use.

This project uses CitiBike trip data (January 2023–October 2025) jointly with NYPD collision data to:

understand how and where CitiBike is used,
quantify safety risk around stations using a transparent risk measure, and
explore how these insights can support insurance products and operational decisions.

1.2 Data & Time Frame

Usage data: CitiBike trip records (start and end station, timestamps, trip duration, bike type, user type).
Crash data: NYPD collision records with detailed information on injuries, fatalities, and vehicles involved.
Geospatial data: Station locations and, where necessary, cleaned/merged station identifiers.

Key idea: CitiBike trip data gives us exposure (how often and where bikes are used), while crash data provides information on severity. Combining both allows us to construct a risk per trip measure that is meaningful for users, CitiBike, and a potential insurer.

1.3 Structure of the Presentation

The remainder of this webpage is organized as follows:

Section 2 – Data Analysis: How demand evolves over time, across space, and by user type.
Section 3 – Risk Analysis: How we construct and interpret a station-level and time-of-day risk measure.
Section 4 – Net Flow Prediction: How predictive models can support rebalancing and capacity planning.
Section 5 – Conclusion: Strategic takeaways and implications for an insurance partnership.

2. Data Analysis: CitiBike (2023–2025)

We analyze CitiBike demand along four dimensions: (i) system-wide usage and maturity, (ii) net flow and imbalance, (iii) usage patterns by user/bike type and time, and (iv) trip duration and distance. These patterns are essential inputs for both the risk analysis (Section 3) and net flow prediction (Section 4).

2.1 Demand & System Maturity

[Figure: Daily usage and average daily demand per station (30-day rolling mean)]

Strong seasonality: high summer, low winter across all years.
Usage grows notably from 2023 → 2024, but per-station usage barely grows from 2024 → 2025.
This suggests CitiBike is entering a mature phase in which organic growth slows.

Actionable insights:

Stimulate winter demand: targeted promotions or corporate partnerships to smooth the seasonal cycle.
Acquire risk-averse non-users: insurance-backed offerings can convert hesitant potential riders.

2.2 Net Flow & Imbalance

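For station $j$ and day $t$, let $A_{j,t}$ be the number of trips ending at the station (arrivals) and $D_{j,t}$ the number of trips starting there (departures). Then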

\[ \text{NetFlow}_{j,t} = A_{j,t} - D_{j,t}, \]

measuring how many bikes a station gains (positive) or loses (negative).
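As a minimal sketch of this definition, the snippet below counts arrivals and departures per station-day and takes their difference. It reads $A_{j,t}$ as arrivals and $D_{j,t}$ as departures; the tuple layout of `trips` and the function name `daily_net_flow` are illustrative assumptions, not the project's actual pipeline.

```python
from collections import Counter

def daily_net_flow(trips):
    """Net flow per (station, day) = arrivals - departures.

    trips: iterable of (start_station, start_date, end_station, end_date).
    This record layout is a hypothetical simplification of the trip data.
    """
    departures = Counter()
    arrivals = Counter()
    for start_station, start_date, end_station, end_date in trips:
        departures[(start_station, start_date)] += 1
        arrivals[(end_station, end_date)] += 1
    # Include station-days that appear only as origin or only as destination.
    keys = set(departures) | set(arrivals)
    return {key: arrivals[key] - departures[key] for key in keys}

# Toy data: two trips A -> B and one trip B -> A on the same day.
trips = [
    ("A", "2023-07-01", "B", "2023-07-01"),
    ("A", "2023-07-01", "B", "2023-07-01"),
    ("B", "2023-07-01", "A", "2023-07-01"),
]
net_flow = daily_net_flow(trips)
```

In this toy example station A loses one bike on balance and station B gains one, so the net flows across the system sum to zero.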

[Figure: Average absolute net flow per station + top persistent source/sink stations]

Average $|\text{NetFlow}_{j,t}|$ follows stable yearly patterns—larger summer imbalances and smaller winter imbalances.
A small subset of stations shows large persistent imbalances, either consistently gaining or losing bikes.
These stations drive a large share of operational pressure and risk of empty/full stations.

Actionable insights:

Forecasting focus: prediction models should emphasize these structurally extreme stations.
Rebalancing optimization: schedule proactive redistribution for high-imbalance corridors.

2.3 Usage Patterns: Who, When, and How People Ride

[Figure: Bike type and membership shares; weekday/weekend and hourly profiles; duration and distance distributions]

Bike type: e-bikes have become a large and stable share of trips; classic bikes still important.
Membership: members dominate usage; casual riders peak in high-demand seasons.
Temporal patterns: strong commuting peaks; weekday/weekend splits stable over time.
Exposure: most trips are 10–20 minutes and 1–2 km, with long trips being rare.

Actionable insights:

Fleet composition: ensure e-bike availability at peak times and high-demand stations.
Targeted campaigns: weekend or off-peak promotions can shift demand where there is spare capacity.
Risk quantification: stable duration and distance distributions help calibrate exposure in Section 3’s risk measure.

2.4 Summary

Demand growth has slowed, especially per station, suggesting a maturing system.
Net flow imbalances are seasonal and concentrated in a small set of stations.
Usage patterns (who rides, when, and how far) are stable and predictable.
These structured patterns provide the foundation for risk and prediction models.

3. Risk Analysis

We combine CitiBike trips with NYPD crashes to construct a risk per trip measure by station and time, for safety insights and insurance pricing.

3.1 From Crashes to Risk per Trip

Each crash $i$ is assigned to its nearest station $j$, provided that station lies within 300 m. The nearest-neighbor search uses a BallTree with the Haversine distance $d((\text{lat}_i,\text{lon}_i),(\text{lat}_j,\text{lon}_j))$.
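The matching rule can be illustrated with a brute-force search in plain Python. This is a stand-in for the BallTree, not the BallTree itself: it returns the same nearest station but scans every station per crash, so it only suits small examples. The Earth radius, the station coordinates, and the function names are assumptions made for this sketch.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters (assumed value)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def nearest_station(crash, stations, max_dist_m=300.0):
    """Return (station_id, distance_m) of the closest station within max_dist_m, else None."""
    best = None
    for station_id, (lat, lon) in stations.items():
        d = haversine_m(crash[0], crash[1], lat, lon)
        if d <= max_dist_m and (best is None or d < best[1]):
            best = (station_id, d)
    return best

# Hypothetical station coordinates, for illustration only.
stations = {"S1": (40.7500, -73.9900), "S2": (40.7600, -73.9800)}

match = nearest_station((40.7510, -73.9900), stations)     # roughly 111 m north of S1
no_match = nearest_station((40.8000, -73.9000), stations)  # several km from both stations
```

Here the first crash is matched to S1, while the second has no station within 300 m and the sketch returns `None` for it.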

For station $j$ and time bucket $b$:

$H_{j,b}$: total crash hazard, i.e. the summed severity $S_i$ (Section 3.2) of the crashes assigned to station $j$ in bucket $b$,
$E_{j,b}$: number of CitiBike trips (exposure),
$\epsilon > 0$: small constant.

\[ R_{j,b} = \frac{H_{j,b}}{E_{j,b} + \epsilon} \]

is the raw risk per trip. Raw ratios are noisy where exposure is low, so we stabilize them with Empirical Bayes smoothing:

\[ R_{j,b}^{\text{EB}} = \lambda_{j,b} R_{j,b} + (1 - \lambda_{j,b})\mu,\quad \lambda_{j,b} = \frac{E_{j,b}}{E_{j,b} + C}, \]

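where $\mu$ is a (global or time-specific) mean risk and $C>0$ is a credibility constant. The larger the exposure $E_{j,b}$ is relative to $C$, the more weight the station's own raw risk receives; with little exposure, the estimate shrinks toward $\mu$.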
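A short sketch of the two formulas above shows how the shrinkage behaves. The values chosen for `mu`, `C`, and the hazard and exposure totals are made-up illustrations, not estimates from the data.

```python
def raw_risk(hazard, exposure, eps=1e-9):
    """Raw risk per trip: R = H / (E + eps)."""
    return hazard / (exposure + eps)

def eb_risk(hazard, exposure, mu, C, eps=1e-9):
    """Empirical Bayes smoothed risk: lam * R + (1 - lam) * mu, lam = E / (E + C)."""
    lam = exposure / (exposure + C)
    return lam * raw_risk(hazard, exposure, eps) + (1 - lam) * mu

mu = 0.01   # hypothetical mean risk per trip
C = 1000.0  # hypothetical credibility constant

# Same raw risk (0.05) but very different exposure.
eb_small = eb_risk(hazard=5.0, exposure=100.0, mu=mu, C=C)          # low exposure
eb_large = eb_risk(hazard=5000.0, exposure=100_000.0, mu=mu, C=C)   # high exposure
eb_empty = eb_risk(hazard=0.0, exposure=0.0, mu=mu, C=C)            # no trips at all
```

Both stations have the same raw risk, yet the low-exposure station is pulled most of the way toward `mu` while the high-exposure station stays close to its raw value. A station-bucket with no trips simply receives `mu`.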

3.2 Crash Severity

For crash $i$, severity is:

\[ S_i = \bigl(1 + W_I \,\text{injured}_i + W_K \,\text{killed}_i\bigr) \cdot \bigl(\alpha + (1 - \alpha)\,\text{cyclist}_i\bigr), \]

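with baseline parameters $W_I=5$, $W_K=20$, $\alpha=0.1$, where $\text{cyclist}_i \in \{0,1\}$ indicates whether a cyclist was involved. Non-cyclist crashes still contribute at weight $\alpha$, but crashes involving cyclists, injuries, or fatalities are strongly up-weighted.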
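The severity formula can be evaluated directly. The sketch below uses the baseline parameters from the text and treats `cyclist` as a 0/1 indicator, which is an assumption about the encoding; the example crashes are invented.

```python
def crash_severity(injured, killed, cyclist, w_i=5.0, w_k=20.0, alpha=0.1):
    """S = (1 + W_I * injured + W_K * killed) * (alpha + (1 - alpha) * cyclist)."""
    base = 1 + w_i * injured + w_k * killed
    cyclist_weight = alpha + (1 - alpha) * cyclist
    return base * cyclist_weight

# Invented example crashes.
s_minor_car = crash_severity(injured=0, killed=0, cyclist=0)     # property damage, no cyclist
s_injured_car = crash_severity(injured=2, killed=0, cyclist=0)   # two injuries, no cyclist
s_injured_bike = crash_severity(injured=1, killed=0, cyclist=1)  # one injury, cyclist involved
s_fatal_bike = crash_severity(injured=0, killed=1, cyclist=1)    # one fatality, cyclist involved
```

With these inputs, a cyclist crash with one injury scores about 6, more than five times the roughly 1.1 of a two-injury crash without a cyclist, which reflects the down-weighting by $\alpha$.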

3.3 Spatial & Temporal Risk Patterns

[Figure: Station map colored by $R_{j}^{\text{EB}}$]

Low-risk clusters (quiet areas, good infrastructure), medium-risk corridors, and high-risk hotspots.
High-risk stations are natural targets for infrastructure upgrades, enforcement, or in-app warnings.

[Figure: Risk by time-of-day and weekday/weekend]

Risk per trip is higher in evenings and late evenings; differences exist between weekdays and weekends.
Temporal patterns are stable ⇒ suitable for pricing and risk communication (“higher risk at night”).

[Figure: Station × time-of-day risk grid $R_{j,t}^{\text{EB}}$]

Some stations are safe overall but risky at specific times (e.g. late evening).
Station × time-of-day risk is key for fine-grained pricing and targeted mitigation.

3.4 Use for CitiBike & Insurers

CitiBike: prioritize safety interventions and warn riders at high-risk stations/times.
Insurers: use $R_{j,b}^{\text{EB}}$ or $R_{j,t}^{\text{EB}}$ as rating factors and for portfolio monitoring.
Jointly: identify hotspots for collaborative, data-driven risk reduction.

4. Net Flow Prediction

Net flow drives rebalancing needs. As in Section 2.2, for station $j$ and day $t$ with arrivals $A_{j,t}$ and departures $D_{j,t}$:

\[ \text{NetFlow}_{j,t} = A_{j,t} - D_{j,t}, \]

where negative values indicate stations that tend to empty, positive values stations that tend to fill.

4.1 From Distribution to Prediction Target

[Figure: Distribution of station-day net flows]

Most station-days lie near zero (balanced).
Operational concern is in the tails: large negative/positive imbalances.
Source–sink patterns are persistent and seasonal ⇒ structurally predictable.

We therefore predict imbalance classes rather than exact net flow values.

4.2 Ternary Imbalance Classification

For each station-day $(j,t)$, define:

\[ y_{j,t} = \begin{cases} -1, & \text{if } \text{NetFlow}_{j,t} < -5, \\ 0, & \text{if } |\text{NetFlow}_{j,t}| \le 5, \\ +1, & \text{if } \text{NetFlow}_{j,t} > 5. \end{cases} \]

$-1$: likely under-supply (emptying station),
$0$: balanced,
$+1$: likely over-supply (filling station).
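The labeling rule translates into a small function. The threshold of 5 comes from the definition above; the function name and the parameterization of the threshold are illustrative choices.

```python
def imbalance_class(net_flow, threshold=5):
    """Map a station-day net flow to -1 (emptying), 0 (balanced), or +1 (filling)."""
    if net_flow < -threshold:
        return -1
    if net_flow > threshold:
        return 1
    return 0

# Boundary values of exactly -5 and +5 count as balanced.
labels = [imbalance_class(x) for x in (-12, -5, 0, 5, 9)]
```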

4.3 Features & Model

For each station-day, we use:

Net flow history: lag-1, lag-7, short rolling mean.
Calendar: weekday/weekend, month, year.
Station: latitude, longitude (and later capacity/land-use if available).
Lagged weather: yesterday’s temperature, rain, snow (observable at prediction time).
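As a rough sketch of the net flow history features, the function below builds lag-1, lag-7, and a short rolling mean for one station, using only days strictly before the prediction day so that nothing leaks from the target. The 3-day window is an assumption (the text only says "short"), and the series values are invented.

```python
def lag_features(series, t, window=3):
    """Features for predicting day t from one station's daily net flows.

    series: list of daily net flows, indexed by day.
    Returns None when fewer than 7 days of history are available.
    """
    if t < 7:
        return None
    recent = series[t - window:t]  # the `window` days ending at day t-1
    return {
        "lag_1": series[t - 1],
        "lag_7": series[t - 7],
        "roll_mean": sum(recent) / window,
    }

# Invented net flow history for a single station.
series = [2, -1, 0, 3, -4, 6, -7, 8, 1, -2]
features = lag_features(series, t=8)
too_early = lag_features(series, t=3)
```

Calendar, station, and lagged weather features would be joined onto the same station-day key in the same leakage-free way.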

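We treat predicting $y_{j,t} \in \{-1,0,+1\}$ as a supervised classification problem and use tree-based models (e.g. gradient-boosted trees). Because most station-days are balanced, we apply class weighting during training and evaluate with macro F1, so that the under- and over-supply classes count as much as the balanced class.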
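To see why macro F1 is a useful yardstick here, the sketch below computes it by hand for a trivial predictor that always outputs "balanced". The labels are invented, and the function is a plain-Python illustration rather than the evaluation code used in the project.

```python
def macro_f1(y_true, y_pred, labels=(-1, 0, 1)):
    """Unweighted mean of per-class F1 scores; a class with no TP, FP, or FN scores 0."""
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Invented station-days: mostly balanced, one emptying, one filling.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, -1, 1]
y_pred = [0] * len(y_true)  # naive model: always predict "balanced"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
```

The naive predictor reaches 80% accuracy but a macro F1 below 0.3, because it never identifies the emptying or filling stations that matter operationally.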

4.4 Operational Value

Proactive rebalancing: flag tomorrow’s problematic stations and plan routes and staffing.
Fewer outages: reduce empty/full stations and associated lost rides.
Cost efficiency: focus redistribution efforts where imbalances are predicted, not just observed.

5. Conclusion & Strategic Takeaways

5.1 Summary

CitiBike usage exhibits strong seasonality and signs of recent growth stagnation, especially in per-station demand.
Trips are short and highly predictable in duration and distance, which is favorable for modeling and risk assessment.
The constructed risk measure provides a granular, interpretable view of crash risk per trip, by station and time-of-day.
Net flow patterns are stable and structured, suggesting that predictive models can meaningfully support rebalancing.

5.2 Value for CitiBike and an Insurance Partner

For CitiBike:
- Identify stations and times with elevated risk and target infrastructure or informational interventions.
- Use net flow predictions to improve service reliability and reduce operational costs.
For an insurer:
- Use $R_{j,b}^{\text{EB}}$ as a risk factor for pricing per-ride or membership insurance for riders.
- Monitor portfolio risk across space and time, and design reinsurance or risk limits where needed.
For both:
- Develop joint, data-driven safety initiatives at high-risk stations and times.
- Communicate transparent, quantitative risk information to riders (e.g., “higher risk at this station at night”).

5.3 Next Steps

Immediate next steps include:

Finalizing and validating the net flow prediction model.
Refining the risk measure (e.g., alternative severity weights, additional covariates).
Prototyping a web app where stakeholders can explore demand and risk interactively by station, time, and scenario.