1. CitiBike & Project Overview
1.1 CitiBike in New York City
CitiBike is New York City’s largest bike sharing system with thousands of bikes and a dense network of stations across Manhattan, Brooklyn, Queens, parts of the Bronx, and Jersey City. For many trips, CitiBike is a realistic alternative to public transport, ride-hailing, or private car use.
This project uses CitiBike trip data (January 2023–October 2025) jointly with NYPD collision data to:
- understand how and where CitiBike is used,
- quantify safety risk around stations using a transparent risk measure, and
- explore how these insights can support insurance products and operational decisions.
1.2 Data & Time Frame
- Usage data: CitiBike trip records (start and end station, timestamps, trip duration, bike type, user type).
- Crash data: NYPD collision records with detailed information on injuries, fatalities, and vehicles involved.
- Geospatial data: Station locations and, where necessary, cleaned/merged station identifiers.
1.3 Structure of the Presentation
The remainder of this webpage is organized as follows:
- Section 2 – Data Analysis: How demand evolves over time, across space, and by user type.
- Section 3 – Risk Analysis: How we construct and interpret a station-level and time-of-day risk measure.
- Section 4 – Net Flow Prediction: How predictive models can support rebalancing and capacity planning.
- Section 5 – Conclusion: Strategic takeaways and implications for an insurance partnership.
2. Data Analysis: CitiBike (2023–2025)
We analyze CitiBike demand along four dimensions: (i) system-wide usage and maturity, (ii) net flow and imbalance, (iii) usage patterns by user/bike type and time, and (iv) trip duration and distance. These patterns are essential inputs for both the risk analysis (Section 3) and net flow prediction (Section 4).
2.1 Demand & System Maturity
- Strong seasonality: high summer, low winter across all years.
- Usage grows notably from 2023 → 2024, but per-station usage barely grows from 2024 → 2025.
- This suggests CitiBike is entering a mature phase in which organic growth slows.
Actionable insights:
- Stimulate winter demand: targeted promotions or corporate partnerships to smooth the seasonal cycle.
- Acquire risk-averse non-users: insurance-backed offerings can convert hesitant potential riders.
2.2 Net Flow & Imbalance
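For station \(j\) and day \(t\), let \(A_{j,t}\) be the number of trips arriving at the station and \(D_{j,t}\) the number of trips departing from it. Then
\[ \text{NetFlow}_{j,t} = A_{j,t} - D_{j,t} \]
measures how many bikes the station gains (positive) or loses (negative) on that day.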
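As a minimal sketch of this definition, the snippet below counts arrivals and departures per station and day and takes the difference. It reads \(A_{j,t}\) as arrivals and \(D_{j,t}\) as departures. The function name `daily_net_flow`, the tuple layout, and the toy trips are assumptions for illustration, not the project's actual pipeline. It also assumes each trip is booked on a single calendar day.

```python
from collections import Counter

def daily_net_flow(trips):
    """Return {(station, day): arrivals - departures}.

    trips: iterable of (start_station, end_station, day) tuples.
    Assumption: a trip counts on one day for both its start and its end.
    """
    arrivals = Counter()
    departures = Counter()
    for start, end, day in trips:
        departures[(start, day)] += 1   # a bike leaves the start station
        arrivals[(end, day)] += 1       # a bike reaches the end station
    keys = set(arrivals) | set(departures)
    # Counter returns 0 for missing keys, so one-sided stations work too.
    return {key: arrivals[key] - departures[key] for key in keys}

# Toy data (hypothetical station names).
trips = [
    ("A", "B", "2024-07-01"),
    ("A", "B", "2024-07-01"),
    ("B", "A", "2024-07-01"),
    ("A", "C", "2024-07-01"),
]
net_flow = daily_net_flow(trips)
```

In the toy data, station A loses bikes on balance and stations B and C gain them. Because every trip adds one arrival and one departure, net flows over all stations sum to zero on any day.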
- Average \(|\text{NetFlow}_{j,t}|\) follows stable yearly patterns: larger summer imbalances and smaller winter imbalances.
- A small subset of stations shows large persistent imbalances, either consistently gaining or losing bikes.
- These stations drive a large share of operational pressure and risk of empty/full stations.
Actionable insights:
- Forecasting focus: prediction models should emphasize these structurally extreme stations.
- Rebalancing optimization: schedule proactive redistribution for high-imbalance corridors.
2.3 Usage Patterns: Who, When, and How People Ride
- Bike type: e-bikes have become a large and stable share of trips; classic bikes remain important.
- Membership: members dominate usage; casual riders peak in high-demand seasons.
- Temporal patterns: strong commuting peaks; weekday/weekend splits stable over time.
- Exposure: most trips are 10–20 minutes and 1–2 km, with long trips being rare.
Actionable insights:
- Fleet composition: ensure e-bike availability at peak times and high-demand stations.
- Targeted campaigns: weekend or off-peak promotions can shift demand where there is spare capacity.
- Risk quantification: stable duration and distance distributions help calibrate exposure in Section 3’s risk measure.
2.4 Summary
- Demand growth has slowed, especially per station, suggesting a maturing system.
- Net flow imbalances are seasonal and concentrated in a small set of stations.
- Usage patterns (who rides, when, and how far) are stable and predictable.
- These structured patterns provide the foundation for risk and prediction models.
3. Risk Analysis
We combine CitiBike trips with NYPD crashes to construct a risk-per-trip measure by station and time of day, which supports both safety insights and insurance pricing.
3.1 From Crashes to Risk per Trip
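Each crash \(i\) is assigned to its nearest station \(j\), provided that station lies within 300 m; crashes farther than 300 m from every station are not assigned. The nearest-neighbor search uses a BallTree with Haversine distance \(d((\text{lat}_i,\text{lon}_i),(\text{lat}_j,\text{lon}_j))\).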
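The assignment rule can be sketched as follows. The analysis itself uses a BallTree for the search. To stay dependency-free, this sketch swaps in a brute-force loop, which gives the same nearest station but scales poorly to large inputs. The function names, the mean Earth radius of 6,371 km, and the coordinates are illustrative assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters (assumed value)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def assign_to_station(crash, stations, max_dist_m=300.0):
    """Return the id of the nearest station within max_dist_m, else None.

    crash: (lat, lon); stations: {station_id: (lat, lon)}.
    Brute-force stand-in for the BallTree query described in the text.
    """
    best_id, best_dist = None, float("inf")
    for station_id, (lat, lon) in stations.items():
        dist = haversine_m(crash[0], crash[1], lat, lon)
        if dist < best_dist:
            best_id, best_dist = station_id, dist
    return best_id if best_dist <= max_dist_m else None

# Hypothetical stations and crashes.
stations = {"S1": (40.7500, -73.9900), "S2": (40.7600, -73.9800)}
near = assign_to_station((40.7505, -73.9900), stations)   # roughly 56 m north of S1
far = assign_to_station((40.8000, -73.9000), stations)    # kilometers from both
```

Crashes that return `None` are left out of the station-level hazard, which keeps distant crashes from inflating a station's risk.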
For station \(j\) and time bucket \(b\):
- \(H_{j,b}\): total crash hazard, i.e. the summed severity \(S_i\) (Section 3.2) of the crashes assigned to station \(j\) in bucket \(b\),
- \(E_{j,b}\): number of CitiBike trips (exposure),
- \(\epsilon > 0\): small constant.
\[ R_{j,b} = \frac{H_{j,b}}{E_{j,b} + \epsilon} \]
is the raw risk per trip. Raw risk is noisy at stations with few trips, so we stabilize it with Empirical Bayes smoothing:
\[ R_{j,b}^{\text{EB}} = \lambda_{j,b} R_{j,b} + (1 - \lambda_{j,b})\mu,\quad \lambda_{j,b} = \frac{E_{j,b}}{E_{j,b} + C}, \]
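where \(\mu\) is a (global or time-specific) mean risk and \(C>0\) is a credibility constant. The larger the exposure \(E_{j,b}\) relative to \(C\), the closer \(\lambda_{j,b}\) is to 1 and the more weight the station's own raw risk receives; low-exposure cells are shrunk toward \(\mu\).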
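The two formulas above translate directly into code. The following is a sketch only: the values chosen for \(\mu\), \(C\), and \(\epsilon\) are hypothetical and are not the calibrated values used in the analysis.

```python
def raw_risk(hazard, exposure, eps=1e-9):
    """Raw risk per trip: H / (E + eps)."""
    return hazard / (exposure + eps)

def eb_risk(hazard, exposure, mu, C, eps=1e-9):
    """Empirical Bayes smoothed risk: lam * R + (1 - lam) * mu, lam = E / (E + C)."""
    lam = exposure / (exposure + C)
    return lam * raw_risk(hazard, exposure, eps) + (1.0 - lam) * mu

# Hypothetical settings: mean risk mu and credibility constant C.
mu, C = 0.002, 1000.0

# Low-exposure cell: 10 trips, hazard 1.0. Raw risk is about 0.1,
# but the smoothed value stays close to mu.
small = eb_risk(hazard=1.0, exposure=10, mu=mu, C=C)

# High-exposure cell: 100,000 trips, hazard 500. Raw risk is about 0.005,
# and the smoothed value stays close to the raw risk.
large = eb_risk(hazard=500.0, exposure=100_000, mu=mu, C=C)
```

The smoothed value always lies between \(\mu\) and the raw risk, so a single severe crash at a rarely used station cannot dominate the ranking.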
3.2 Crash Severity
For crash \(i\), severity is:
\[ S_i = \bigl(1 + W_I \,\text{injured}_i + W_K \,\text{killed}_i\bigr) \cdot \bigl(\alpha + (1 - \alpha)\,\text{cyclist}_i\bigr), \]
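with baseline parameters \(W_I=5\), \(W_K=20\), \(\alpha=0.1\). A crash with \(\text{cyclist}_i = 0\) still contributes, but only at weight \(\alpha\), one tenth of a comparable cyclist crash under the baseline. Injuries and fatalities raise the score through \(W_I\) and \(W_K\). Cyclist and severe crashes are therefore strongly up-weighted, matching our focus.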
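A few worked values make the weighting concrete. The sketch below implements the severity formula with the baseline parameters. It assumes that \(\text{injured}_i\) and \(\text{killed}_i\) are person counts and that \(\text{cyclist}_i\) is a 0/1 indicator; the function name is illustrative.

```python
def crash_severity(injured, killed, cyclist, w_i=5.0, w_k=20.0, alpha=0.1):
    """Severity S_i = (1 + W_I*injured + W_K*killed) * (alpha + (1 - alpha)*cyclist).

    Assumption: injured and killed are counts, cyclist is 0 or 1.
    """
    base = 1.0 + w_i * injured + w_k * killed
    cyclist_weight = alpha + (1.0 - alpha) * cyclist
    return base * cyclist_weight

# Property damage only, no cyclist: (1) * 0.1
s_minor = crash_severity(injured=0, killed=0, cyclist=0)
# One injury, cyclist involved: (1 + 5) * 1.0
s_cyclist_injury = crash_severity(injured=1, killed=0, cyclist=1)
# One fatality, no cyclist: (1 + 20) * 0.1
s_fatal_no_cyclist = crash_severity(injured=0, killed=1, cyclist=0)
```

Under the baseline, a single-injury cyclist crash (about 6) outweighs a fatal crash without cyclist involvement (about 2.1), which shows how strongly \(\alpha\) shapes the measure and why alternative weights are worth testing.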
3.3 Spatial & Temporal Risk Patterns
- Low-risk clusters (quiet areas, good infrastructure), medium-risk corridors, and high-risk hotspots.
- High-risk stations are natural targets for infrastructure upgrades, enforcement, or in-app warnings.
- Risk per trip is higher in evenings and late evenings; differences exist between weekdays and weekends.
- Temporal patterns are stable ⇒ suitable for pricing and risk communication (“higher risk at night”).
- Some stations are safe overall but risky at specific times (e.g. late evening).
- Station × time-of-day risk is key for fine-grained pricing and targeted mitigation.
3.4 Use for CitiBike & Insurers
- CitiBike: prioritize safety interventions and warn riders at high-risk stations/times.
- Insurers: use the station × time-bucket risk \(R_{j,b}^{\text{EB}}\) as a rating factor and for portfolio monitoring.
- Jointly: identify hotspots for collaborative, data-driven risk reduction.
4. Net Flow Prediction
Net flow drives rebalancing needs. For station \(j\) and day \(t\):
\[ \text{NetFlow}_{j,t} = A_{j,t} - D_{j,t}, \]
where \(A_{j,t}\) counts arrivals and \(D_{j,t}\) counts departures, so negative values indicate stations that tend to empty and positive values indicate stations that tend to fill.
4.1 From Distribution to Prediction Target
- Most station-days lie near zero (balanced).
- Operational concern is in the tails: large negative/positive imbalances.
- Source–sink patterns are persistent and seasonal ⇒ structurally predictable.
We therefore predict imbalance classes rather than exact net flow values.
4.2 Ternary Imbalance Classification
For each station-day \((j,t)\), define:
\[ y_{j,t} = \begin{cases} -1, & \text{if } \text{NetFlow}_{j,t} < -5, \\ 0, & \text{if } |\text{NetFlow}_{j,t}| \le 5, \\ +1, & \text{if } \text{NetFlow}_{j,t} > 5. \end{cases} \]
- \(-1\): likely under-supply (emptying station),
- \(0\): balanced,
- \(+1\): likely over-supply (filling station).
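The labeling rule can be sketched in a few lines. The threshold of 5 bikes comes from the definition above; the function name and the `threshold` parameter are illustrative.

```python
def imbalance_class(net_flow, threshold=5):
    """Map a daily net flow to -1 (emptying), 0 (balanced), or +1 (filling)."""
    if net_flow < -threshold:
        return -1
    if net_flow > threshold:
        return 1
    return 0  # covers |net_flow| <= threshold, boundaries included

# Boundary values -5 and +5 count as balanced.
labels = [imbalance_class(x) for x in (-12, -5, 0, 5, 6)]
```

Keeping the threshold as a parameter makes it easy to check how sensitive the class balance is to the cut-off.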
4.3 Features & Model
For each station-day, we use:
- Net flow history: lag-1, lag-7, short rolling mean.
- Calendar: weekday/weekend, month, year.
- Station: latitude, longitude (and later capacity/land-use if available).
- Lagged weather: yesterday’s temperature, rain, snow (observable at prediction time).
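The net flow history features can be built as below. This is a sketch under assumptions: the source only says "short rolling mean", so the 3-day window is a placeholder, and the function and key names are hypothetical. Only values strictly before day \(t\) are used, so nothing unavailable at prediction time leaks into the features.

```python
def history_features(series, t, window=3):
    """Lag-1, lag-7, and rolling-mean features for day index t.

    series: daily net flows for one station, ordered by day.
    Returns None when there is not enough history.
    """
    if t < max(7, window) or t > len(series):
        return None
    recent = series[t - window:t]          # the `window` days before t
    return {
        "lag_1": series[t - 1],            # yesterday
        "lag_7": series[t - 7],            # same weekday last week
        "roll_mean": sum(recent) / window,
    }

# Toy series for one station (hypothetical values).
series = [0, 1, 2, 3, 4, 5, 6, 7, 8]
features = history_features(series, t=8)
too_early = history_features(series, t=3)
```

Because `t` may equal `len(series)`, the same function yields features for tomorrow, the first day without an observed net flow.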
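We treat \(y_{j,t} \in \{-1,0,+1\}\) as a supervised classification problem and use tree-based models (e.g. gradient-boosted trees). Because most station-days are balanced, we apply class weighting during training and evaluate with macro F1, so that the under-/over-supply classes matter as much as the majority class.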
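To illustrate why macro F1 suits this imbalanced target, the sketch below computes it by hand and compares it with accuracy for a trivial model that always predicts "balanced". The function name and the toy labels are illustrative. In practice a library routine such as scikit-learn's `f1_score` with `average="macro"` would serve the same purpose.

```python
def macro_f1(y_true, y_pred, classes=(-1, 0, 1)):
    """Unweighted mean of per-class F1 scores, with F1 = 2TP / (2TP + FP + FN)."""
    scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)  # absent class scores 0
    return sum(scores) / len(scores)

# Toy station-days: mostly balanced, one emptying, one filling.
y_true = [0, 0, 0, 0, -1, 1]
y_always_balanced = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_always_balanced)) / len(y_true)
score = macro_f1(y_true, y_always_balanced)
```

The trivial model reaches an accuracy of about 0.67 but a macro F1 of only about 0.27, because it never identifies an emptying or filling station.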
4.4 Operational Value
- Proactive rebalancing: flag tomorrow’s problematic stations and plan routes and staffing.
- Fewer outages: reduce empty/full stations and associated lost rides.
- Cost efficiency: focus redistribution efforts where imbalances are predicted, not just observed.
5. Conclusion & Strategic Takeaways
5.1 Summary
- CitiBike usage exhibits strong seasonality and signs of recent growth stagnation, especially in per-station demand.
- Trips are short and highly predictable in duration and distance, which is favorable for modeling and risk assessment.
- The constructed risk measure provides a granular, interpretable view of crash risk per trip, by station and time-of-day.
- Net flow patterns are stable and structured, suggesting that predictive models can meaningfully support rebalancing.
5.2 Value for CitiBike and an Insurance Partner
- For CitiBike:
  - Identify stations and times with elevated risk and target infrastructure or informational interventions.
  - Use net flow predictions to improve service reliability and reduce operational costs.
- For an insurer:
  - Use \(R_{j,b}^{\text{EB}}\) as a risk factor for pricing per-ride or membership insurance for riders.
  - Monitor portfolio risk across space and time, and design reinsurance or risk limits where needed.
- For both:
  - Develop joint, data-driven safety initiatives at high-risk stations and times.
  - Communicate transparent, quantitative risk information to riders (e.g., “higher risk at this station at night”).
5.3 Next Steps
Immediate next steps include:
- Finalizing and validating the net flow prediction model.
- Refining the risk measure (e.g., alternative severity weights, additional covariates).
- Prototyping a web app where stakeholders can explore demand and risk interactively by station, time, and scenario.