CitiBike NYC: Demand, Risk, and Net Flow (2023–2025)

Exploring how CitiBike usage and crash risk data can create value for CitiBike and an insurance partner.

1. CitiBike & Project Overview

1.1 CitiBike in New York City

CitiBike is New York City’s largest bike-sharing system, with thousands of bikes and a dense network of stations across Manhattan, Brooklyn, Queens, parts of the Bronx, and Jersey City. For many trips, CitiBike is a realistic alternative to public transport, ride-hailing, or private car use.

This project uses CitiBike trip data (January 2023–October 2025) jointly with NYPD collision data to:

1.2 Data & Time Frame

Key idea: CitiBike trip data gives us exposure (how often and where bikes are used), while crash data provides information on severity. Combining the two lets us construct a risk-per-trip measure that is meaningful for users, CitiBike, and a potential insurer.

1.3 Structure of the Presentation

The remainder of this webpage is organized as follows:

2. Data Analysis: CitiBike (2023–2025)

We analyze CitiBike demand along four dimensions: (i) system-wide usage and maturity, (ii) net flow and imbalance, (iii) usage patterns by user/bike type and time, and (iv) trip duration and distance. These patterns are essential inputs for both the risk analysis (Section 3) and net flow prediction (Section 4).

2.1 Demand & System Maturity

[Figure: Daily usage and average daily demand per station (30-day rolling mean)]

Actionable insights:

2.2 Net Flow & Imbalance

For station \(j\) and day \(t\),

\[ \text{NetFlow}_{j,t} = A_{j,t} - D_{j,t}, \]

where \(A_{j,t}\) counts arrivals and \(D_{j,t}\) counts departures, so that net flow measures how many bikes a station gains (positive) or loses (negative).
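The definition above can be illustrated with a minimal sketch that counts arrivals and departures per station-day. The tuple layout `(start_station, start_day, end_station, end_day)` and the function name `daily_net_flow` are assumptions made for illustration, not the schema of the CitiBike trip files.

```python
from collections import Counter

def daily_net_flow(trips):
    """Return {(station, day): arrivals - departures}.

    `trips` is an iterable of (start_station, start_day, end_station, end_day)
    tuples; this layout is a hypothetical simplification of the trip data.
    """
    arrivals = Counter()
    departures = Counter()
    for start_station, start_day, end_station, end_day in trips:
        departures[(start_station, start_day)] += 1  # bike leaves the start station
        arrivals[(end_station, end_day)] += 1        # bike arrives at the end station
    keys = set(arrivals) | set(departures)
    # Counter returns 0 for missing keys, so one-sided stations are handled.
    return {key: arrivals[key] - departures[key] for key in keys}

# Toy data: two trips from A to B and one trip from B to A on the same day.
trips = [
    ("A", "2024-05-01", "B", "2024-05-01"),
    ("A", "2024-05-01", "B", "2024-05-01"),
    ("B", "2024-05-01", "A", "2024-05-01"),
]
net_flow = daily_net_flow(trips)
```

In this toy example station A loses one bike on net and station B gains one, so the net flows across the system sum to zero as long as every trip starts and ends on the same day.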

[Figure: Average absolute net flow per station + top persistent source/sink stations]

Actionable insights:

2.3 Usage Patterns: Who, When, and How People Ride

[Figure: Bike type and membership shares; weekday/weekend and hourly profiles; duration and distance distributions]

Actionable insights:

2.4 Summary

3. Risk Analysis

We combine CitiBike trips with NYPD crashes to construct a risk-per-trip measure by station and time, supporting both safety insights and insurance pricing.

3.1 From Crashes to Risk per Trip

Each crash is assigned to the nearest station within 300 m using a BallTree with Haversine distance \(d((\text{lat}_i,\text{lon}_i),(\text{lat}_j,\text{lon}_j))\).
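The assignment rule can be sketched without any spatial index. The code below is a brute-force stand-in for the BallTree lookup, not the project's implementation: it computes the Haversine distance to every station and keeps the nearest one if it lies within the radius. The station identifiers, coordinates, and function names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))

def assign_to_station(crash, stations, max_dist_m=300.0):
    """Return the id of the nearest station within max_dist_m, else None.

    `crash` is a (lat, lon) pair; `stations` maps station id -> (lat, lon).
    Brute force is O(number of stations) per crash; a BallTree answers the
    same query faster on large station sets.
    """
    best_id, best_dist = None, float("inf")
    for station_id, (lat, lon) in stations.items():
        dist = haversine_m(crash[0], crash[1], lat, lon)
        if dist < best_dist:
            best_id, best_dist = station_id, dist
    return best_id if best_dist <= max_dist_m else None

# Hypothetical stations and crashes, for illustration only.
stations = {"S1": (40.7500, -73.9900), "S2": (40.7600, -73.9800)}
near = assign_to_station((40.7505, -73.9900), stations)  # roughly 56 m north of S1
far = assign_to_station((40.8000, -73.9000), stations)   # several km from both
```

Crashes farther than 300 m from every station return `None` and therefore contribute to no station's risk.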

For station \(j\) and time bucket \(b\):

\[ R_{j,b} = \frac{H_{j,b}}{E_{j,b} + \epsilon} \]

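is the raw risk per trip, where \(H_{j,b}\) denotes the crash harm assigned to station \(j\) in bucket \(b\), \(E_{j,b}\) the exposure (number of trips), and \(\epsilon\) a small constant that guards against division by zero. To stabilize estimates for stations with little exposure, we use Empirical Bayes smoothing: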

\[ R_{j,b}^{\text{EB}} = \lambda_{j,b} R_{j,b} + (1 - \lambda_{j,b})\mu,\quad \lambda_{j,b} = \frac{E_{j,b}}{E_{j,b} + C}, \]

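where \(\mu\) is a (global or time-specific) mean risk and \(C>0\) is a credibility constant. Stations with little exposure (\(E_{j,b} \ll C\)) are shrunk toward \(\mu\), while high-exposure stations keep estimates close to their raw risk.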
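A minimal sketch of the two formulas shows the shrinkage at work. The values \(\mu = 0.001\) and \(C = 1000\) are hypothetical choices for illustration and are not the project's calibrated parameters.

```python
def raw_risk(harm, exposure, eps=1e-9):
    """Raw risk per trip: H / (E + eps)."""
    return harm / (exposure + eps)

def eb_risk(harm, exposure, mu, c, eps=1e-9):
    """Empirical Bayes risk: lambda * raw + (1 - lambda) * mu, lambda = E / (E + C)."""
    lam = exposure / (exposure + c)
    return lam * raw_risk(harm, exposure, eps) + (1 - lam) * mu

MU = 0.001  # hypothetical mean risk per trip
C = 1000.0  # hypothetical credibility constant

# Low exposure: 1 unit of harm over 10 trips gives a raw risk of 0.1,
# but the smoothed value stays close to MU.
low = eb_risk(harm=1, exposure=10, mu=MU, c=C)

# High exposure: 100 units of harm over 100,000 trips; raw risk and
# smoothed risk are both about 0.001.
high = eb_risk(harm=100, exposure=100_000, mu=MU, c=C)

# No exposure at all: lambda is 0, so the estimate equals MU.
empty = eb_risk(harm=0, exposure=0, mu=MU, c=C)
```

With these numbers the low-exposure station's estimate drops from a raw 0.1 to roughly 0.002, which is the intended effect: a single crash at a rarely used station does not dominate the risk map.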

3.2 Crash Severity

For crash \(i\), severity is:

\[ S_i = \bigl(1 + W_I \,\text{injured}_i + W_K \,\text{killed}_i\bigr) \cdot \bigl(\alpha + (1 - \alpha)\,\text{cyclist}_i\bigr), \]

with baseline parameters \(W_I=5\), \(W_K=20\), \(\alpha=0.1\). Non-cyclist crashes contribute, but cyclist and severe crashes are strongly up-weighted, matching our focus.
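As a worked example, the severity formula can be evaluated for a few crash types. The sketch assumes that \(\text{injured}_i\) and \(\text{killed}_i\) are person counts and that \(\text{cyclist}_i\) is a 0/1 indicator for cyclist involvement; the source does not state this explicitly.

```python
W_I = 5      # weight per injured person (baseline from the text)
W_K = 20     # weight per person killed (baseline from the text)
ALPHA = 0.1  # down-weighting factor for non-cyclist crashes

def severity(injured, killed, cyclist):
    """S_i = (1 + W_I*injured + W_K*killed) * (ALPHA + (1 - ALPHA)*cyclist).

    `cyclist` is assumed to be a boolean or 0/1 flag (assumption).
    """
    harm = 1 + W_I * injured + W_K * killed
    relevance = ALPHA + (1 - ALPHA) * (1 if cyclist else 0)
    return harm * relevance

property_only_car = severity(0, 0, cyclist=False)  # 1 * 0.1  = 0.1
injury_car = severity(1, 0, cyclist=False)         # 6 * 0.1  = 0.6
injury_cyclist = severity(1, 0, cyclist=True)      # 6 * 1.0  = 6
fatal_cyclist = severity(0, 1, cyclist=True)       # 21 * 1.0 = 21
```

Under the baseline parameters a cyclist injury crash counts ten times as much as the same crash without a cyclist, and a cyclist fatality counts 210 times as much as a non-cyclist crash without injuries.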

3.3 Spatial & Temporal Risk Patterns

[Figure: Station map colored by \(R_{j}^{\text{EB}}\)]
[Figure: Risk by time-of-day and weekday/weekend]
[Figure: Station × time-of-day risk grid \(R_{j,b}^{\text{EB}}\)]

3.4 Use for CitiBike & Insurers

4. Net Flow Prediction

Net flow drives rebalancing needs. For station \(j\) and day \(t\):

\[ \text{NetFlow}_{j,t} = A_{j,t} - D_{j,t}, \]

where negative values indicate stations that tend to empty, positive values stations that tend to fill.

4.1 From Distribution to Prediction Target

[Figure: Distribution of station-day net flows]

We therefore predict imbalance classes rather than exact net flow values.

4.2 Ternary Imbalance Classification

For each station-day \((j,t)\), define:

\[ y_{j,t} = \begin{cases} -1, & \text{if } \text{NetFlow}_{j,t} < -5, \\ 0, & \text{if } |\text{NetFlow}_{j,t}| \le 5, \\ +1, & \text{if } \text{NetFlow}_{j,t} > 5. \end{cases} \]
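The labeling rule translates directly into code. The function name `imbalance_class` is hypothetical; the threshold of 5 is taken from the definition above.

```python
def imbalance_class(net_flow, threshold=5):
    """Map a station-day net flow to -1 (emptying), 0 (balanced), or +1 (filling)."""
    if net_flow < -threshold:
        return -1
    if net_flow > threshold:
        return 1
    return 0  # |net_flow| <= threshold

# Boundary values fall into the balanced class.
labels = [imbalance_class(v) for v in (-12, -6, -5, 0, 5, 6, 30)]
```

Note that net flows of exactly \(-5\) or \(+5\) are labeled 0, matching the non-strict inequality in the middle case.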

4.3 Features & Model

For each station-day, we use:

We treat the prediction of \(y_{j,t} \in \{-1,0,+1\}\) as a supervised three-class classification problem, using tree-based models (e.g. gradient-boosted trees). Class weighting during training, or macro F1 as the evaluation metric, ensures that the under-/over-supply classes carry weight.
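To see why macro F1 and class weighting matter here, the sketch below scores a trivial predictor that always outputs the balanced class. It is a plain-Python illustration with made-up labels, not the project's evaluation code; the weighting formula is the common inverse-frequency heuristic \(n / (k \cdot n_c)\), which is one possible choice.

```python
from collections import Counter

def macro_f1(y_true, y_pred, classes=(-1, 0, 1)):
    """Unweighted mean of per-class F1 scores (F1 = 2TP / (2TP + FP + FN))."""
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)  # F1 set to 0 if undefined
    return sum(scores) / len(scores)

def balanced_class_weights(y_true):
    """Inverse-frequency weights n / (k * n_c) for each class present in y_true."""
    counts = Counter(y_true)
    n, k = len(y_true), len(counts)
    return {c: n / (k * count) for c, count in counts.items()}

# Made-up labels: mostly balanced station-days, one under- and one over-supply day.
y_true = [0, 0, 0, 0, -1, 1]
y_pred = [0, 0, 0, 0, 0, 0]  # trivial model: always predict "balanced"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
weights = balanced_class_weights(y_true)
```

The trivial model reaches an accuracy of about 0.67 but a macro F1 of only about 0.27, because it never identifies an imbalanced station-day. The inverse-frequency weights (0.5 for the balanced class, 2.0 for each imbalance class in this toy sample) could be passed as per-sample weights to a tree-based learner.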

4.4 Operational Value

5. Conclusion & Strategic Takeaways

5.1 Summary

5.2 Value for CitiBike and an Insurance Partner

5.3 Next Steps

Immediate next steps include: