A production-style ETL pipeline designed to automate the ingestion, transformation, and loading of Airbnb listings data into a PostgreSQL data warehouse.
The workflow is fully containerized with Docker and orchestrated using Apache Airflow for daily scheduling and monitoring.
The Airbnb Data Pipeline automates the following key processes:
- Extract – Ingests Airbnb listings from CSV files or a public API
- Transform – Cleans, enriches, and validates the data using pandas
- Load – Loads the processed data into a PostgreSQL warehouse using SQLAlchemy
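As a rough sketch of the first two steps, the snippet below reads a local CSV export of listings and applies a few representative cleaning rules with pandas; the file path, column names (`id`, `price`), and enrichment logic are illustrative assumptions rather than the project's exact implementation.

```python
import pandas as pd


def extract(csv_path: str = "data/listings.csv") -> pd.DataFrame:
    """Read the raw Airbnb listings export into a DataFrame."""
    return pd.read_csv(csv_path)


def transform(raw: pd.DataFrame) -> pd.DataFrame:
    """Clean, enrich, and validate the raw listings."""
    df = raw.copy()

    # Drop rows missing identifiers or prices we cannot recover.
    df = df.dropna(subset=["id", "price"])

    # Normalize price strings such as "$1,250.00" into floats.
    df["price"] = (
        df["price"].astype(str).str.replace(r"[$,]", "", regex=True).astype(float)
    )

    # Example enrichment: flag listings priced above the median.
    df["above_median_price"] = df["price"] > df["price"].median()

    # Basic validation before handing off to the load step.
    if (df["price"] <= 0).any():
        raise ValueError("Found non-positive prices after cleaning")

    return df
```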
This setup demonstrates how to design a modular, scalable, and reproducible ETL process suitable for real-world data engineering environments. The pipeline is built on the following stack:
- Python – Data manipulation, validation, and scripting
- Apache Airflow – Workflow orchestration and task scheduling
- PostgreSQL – Centralized data storage and modeling
- SQLAlchemy – ORM for efficient database interaction
- Docker & Docker Compose – Containerization for portability and consistency
Below is a snapshot of the pipeline running in Apache Airflow, showing the three core ETL tasks executed in sequence:
Each DAG run performs:
- Extract: Fetch raw Airbnb data
- Transform: Apply cleaning and enrichment logic
- Load: Store refined data in PostgreSQL
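A minimal sketch of how such a DAG might be wired up, assuming a recent Airflow 2.x release, an `etl` module exposing the functions sketched above, and a shared volume for staging files; the task ids, paths, and schedule are placeholders, not the project's exact configuration.

```python
from datetime import datetime, timedelta

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator

# extract/transform/load are the functions sketched above; the "etl"
# module name is a placeholder for the project's actual package layout.
from etl import extract, transform, load

RAW = "/opt/airflow/data/listings_raw.csv"      # staging paths on a shared
CLEAN = "/opt/airflow/data/listings_clean.csv"  # volume (assumed layout)


def run_extract() -> None:
    extract().to_csv(RAW, index=False)


def run_transform() -> None:
    transform(pd.read_csv(RAW)).to_csv(CLEAN, index=False)


def run_load() -> None:
    load(pd.read_csv(CLEAN))


with DAG(
    dag_id="airbnb_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=run_extract)
    transform_task = PythonOperator(task_id="transform", python_callable=run_transform)
    load_task = PythonOperator(task_id="load", python_callable=run_load)

    # Run the three ETL steps in sequence, matching the Airflow UI view.
    extract_task >> transform_task >> load_task
```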
This project highlights hands-on experience with modern data engineering practices, including:
- Workflow automation using Airflow DAGs
- Data extraction and cleaning with pandas
- Data modeling and loading into PostgreSQL
- Deployment via Docker Compose for easy reproducibility
It demonstrates how to build a scalable, maintainable ETL pipeline from scratch, providing a solid foundation for production-grade data workflows. Planned future enhancements include:
- Add data quality validation with Great Expectations
- Integrate automated logging and alerting
- Extend to real-time ingestion using Kafka or Spark Streaming
