End-to-end real-time data pipeline for bike-share availability metrics. Ingests Citi Bike GBFS feeds, captures changes via CDC, computes streaming aggregates, and serves analytical data through a low-latency API.
| Component | Role |
|---|---|
ingest/ |
Polls Citi Bike GBFS feed into Postgres every 60s |
connectors/ |
Debezium CDC connector config |
flink/ |
Flink SQL: tumbling 1-min availability windows |
analytics/consumer/ |
Kafka → DuckDB landing |
analytics/dbt/ |
dbt mart: mart_station_availability_hourly |
api/ |
FastAPI read-only API over DuckDB |
docker compose up -dThis starts Postgres, Kafka, Debezium, Flink, the GBFS ingestor, and the DuckDB consumer.
curl -X POST http://localhost:8083/connectors \
-H "Content-Type: application/json" \
-d @connectors/debezium-postgres.jsondocker compose exec jobmanager ./bin/sql-client.sh -f /opt/flink/sql/station_availability_1min.sqlcd analytics/dbt
dbt runcd api
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
uvicorn main:app --reload --port 8000$ curl localhost:8000/metrics/stations/hourly?limit=2[
{
"station_id": "66dd5a42-0aca-11e7-82f6-3863bb44ef7c",
"hour_ts": "2026-04-03T10:00:00",
"avg_bikes_available": 18.0,
"avg_docks_available": 2.0,
"avg_capacity": 22.0,
"avg_availability_ratio": 0.818,
"min_availability_ratio": 0.818,
"total_low_availability_events": 0,
"total_events": 8
}
]$ curl localhost:8000/system/freshness{
"latest_emitted_at": "2026-04-03T10:40:00",
"latest_mart_hour_ts": "2026-04-03T10:00:00",
"checked_at": "2026-04-03T10:40:55.494591+00:00",
"raw_lag_seconds": 55
}- Streaming correctness — Flink watermarks and tumbling windows produce deterministic 1-minute aggregates from CDC events
- Batch/serving parity — the same DuckDB mart backs both dbt analysis and API serving
- Production-style data contracts — each layer (raw → staging → mart → API) has a clear schema boundary
