🏠 Airbnb Data Pipeline (Python • Airflow • Docker • PostgreSQL)


A production-style ETL pipeline designed to automate the ingestion, transformation, and loading of Airbnb listings data into a PostgreSQL data warehouse.
The workflow is fully containerized with Docker and orchestrated using Apache Airflow for daily scheduling and monitoring.


⚙️ Project Overview

The Airbnb Data Pipeline automates the following key processes:

  1. Extract – Ingests Airbnb listings from CSV files or a public API
  2. Transform – Cleans, enriches, and validates the data using pandas
  3. Load – Loads the processed data into a PostgreSQL warehouse using SQLAlchemy

This setup demonstrates a modular, scalable, and reproducible ETL process suitable for real-world data engineering environments; a minimal code sketch of the three steps follows.
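
A minimal sketch of the three steps, assuming a local CSV of listings and a PostgreSQL warehouse (the file path, column names, and connection string below are illustrative, not taken from the repo):

```python
import pandas as pd
from sqlalchemy import create_engine

# Illustrative locations -- adjust to your environment.
RAW_CSV = "data/listings.csv"
DB_URI = "postgresql+psycopg2://airflow:airflow@postgres:5432/warehouse"

def extract(path: str) -> pd.DataFrame:
    """Read raw Airbnb listings from CSV."""
    return pd.read_csv(path)

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Clean and enrich the raw listings."""
    df = df.dropna(subset=["id", "price"])  # drop rows missing key fields
    df["price"] = (
        df["price"].astype(str).str.replace(r"[$,]", "", regex=True).astype(float)
    )  # "$1,200.00" -> 1200.0
    df["ingested_at"] = pd.Timestamp.now(tz="UTC")  # audit column
    return df

def load(df: pd.DataFrame, uri: str) -> None:
    """Write the processed listings to the PostgreSQL warehouse."""
    engine = create_engine(uri)
    df.to_sql("listings", engine, if_exists="replace", index=False)

if __name__ == "__main__":
    load(transform(extract(RAW_CSV)), DB_URI)
```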


🚀 Technologies Used

  • Python – Data manipulation, validation, and scripting
  • Apache Airflow – Workflow orchestration and task scheduling
  • PostgreSQL – Centralized data storage and modeling
  • SQLAlchemy – ORM for efficient database interaction (a model sketch follows this list)
  • Docker & Docker Compose – Containerization for portability and consistency
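
For the SQLAlchemy point above, the warehouse table might be declared as a 2.0-style declarative model; the table and column names here are hypothetical, chosen to match common Airbnb listings fields:

```python
from sqlalchemy import Float, Integer, String
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class Listing(Base):
    """One Airbnb listing row in the warehouse (illustrative schema)."""
    __tablename__ = "listings"

    id: Mapped[int] = mapped_column(Integer, primary_key=True)
    name: Mapped[str] = mapped_column(String(255))
    neighbourhood: Mapped[str] = mapped_column(String(100))
    room_type: Mapped[str] = mapped_column(String(50))
    price: Mapped[float] = mapped_column(Float)
```

Calling Base.metadata.create_all(engine) creates the table once, after which the load step can insert through ORM sessions or pandas' to_sql.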

📊 Airflow DAG in Action

Below is a snapshot of the pipeline running in Apache Airflow, showing the three core ETL tasks executed in sequence:

(Screenshot: the Airbnb DAG running in the Airflow UI.)

Each DAG run performs the following steps (a minimal DAG definition is sketched after this list):

  • Extract: Fetch raw Airbnb data
  • Transform: Apply cleaning and enrichment logic
  • Load: Store refined data in PostgreSQL
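
A minimal Airflow 2.x DAG wiring these three tasks in sequence; the DAG id, schedule, start date, and stub callables are illustrative stand-ins for the project's real ETL functions:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Stub callables standing in for the real extract/transform/load functions.
def extract():
    print("fetching raw Airbnb data")

def transform():
    print("cleaning and enriching listings")

def load():
    print("writing refined data to PostgreSQL")

with DAG(
    dag_id="airbnb_etl",               # illustrative DAG id
    schedule="@daily",                 # Airflow 2.4+; older versions use schedule_interval
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_transform >> t_load  # enforce extract -> transform -> load
```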

🎯 Purpose & Key Takeaways

This project highlights hands-on experience with modern data engineering practices, including:

  • Workflow automation using Airflow DAGs
  • Data extraction and cleaning with pandas
  • Data modeling and loading into PostgreSQL
  • Deployment via Docker Compose for easy reproducibility

It demonstrates how to build a scalable, maintainable ETL pipeline from scratch, providing a solid foundation for production-grade data workflows.


🧠 Future Improvements

  • Add data quality validation with Great Expectations (an interim pandas-based check is sketched below)
  • Integrate automated logging and alerting
  • Extend to real-time ingestion using Kafka or Spark Streaming
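
As an interim step toward the first item, a lightweight pandas check could gate the load task until Great Expectations is integrated; the rules and the completeness threshold below are illustrative:

```python
import pandas as pd

def validate_listings(df: pd.DataFrame) -> None:
    """Fail fast on basic quality problems before loading (illustrative rules)."""
    assert df["id"].notna().all(), "null listing ids"
    assert df["id"].is_unique, "duplicate listing ids"
    assert (df["price"] >= 0).all(), "negative prices"
    # Hypothetical completeness rule: >= 95% of rows must have a neighbourhood.
    assert df["neighbourhood"].notna().mean() >= 0.95, "too many missing neighbourhoods"
```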
