📊 AWS Data Analytics Pipeline

Overview

This project demonstrates a fully serverless data analytics pipeline on AWS. It ingests raw sales data stored in Amazon S3, catalogs the data using AWS Glue, queries it with Amazon Athena, and visualizes the insights using Amazon QuickSight.

Key Features

  • ✅ 100% Infrastructure as Code using Terraform
  • ✅ Uses IAM best practices to manage permissions
  • ✅ Visual dashboards for quick business insights
  • ✅ Easy to extend with larger datasets and more complex ETL

Architecture

Pipeline Diagram

Pipeline Components

  1. Amazon S3: Raw sales CSV data storage
  2. AWS Glue: Data crawler that scans S3 and creates metadata in the Glue Data Catalog
  3. Glue Data Catalog: Stores schema information for Athena queries
  4. Amazon Athena: Executes SQL queries on cataloged data
  5. Amazon QuickSight: Connects to Athena and builds interactive dashboards
  6. IAM Roles: Securely manages service permissions with least privilege access
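Since the whole stack is defined in Terraform, the components above can be sketched as a few wired-together resources. This is an illustrative sketch only, not the repository's actual module layout; the bucket and resource names are placeholders:

```hcl
# Illustrative sketch -- resource and bucket names are placeholders,
# not the names used by this repository's Terraform modules.
resource "aws_s3_bucket" "raw_sales" {
  bucket = "example-raw-sales-data" # hypothetical bucket name
}

resource "aws_glue_catalog_database" "sales" {
  name = "sales_db"
}

resource "aws_glue_crawler" "sales" {
  name          = "sales-crawler"
  role          = aws_iam_role.glue_crawler.arn # IAM role defined elsewhere
  database_name = aws_glue_catalog_database.sales.name

  # Point the crawler at the raw-data bucket; it infers the schema
  # and writes table definitions into the Glue Data Catalog.
  s3_target {
    path = "s3://${aws_s3_bucket.raw_sales.bucket}/"
  }
}
```

Athena and QuickSight then read from the catalog database the crawler populates, so no compute resources need to be declared for querying.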

Features

  • Infrastructure as Code (IaC): Easily reproducible and version-controlled with Terraform
  • Scalable: Handles datasets from hundreds of rows to millions of records
  • Serverless: No servers to manage, maintain, or patch
  • Secure: Implements IAM roles with least privilege access principles
  • Interactive Visualization: Business dashboards built with Amazon QuickSight
  • Cost-Effective: Pay only for what you use with serverless architecture

Deployment Guide

Prerequisites

  • AWS CLI configured with appropriate permissions
  • Terraform installed (version 1.0+)
  • Git for repository management

Step-by-Step Deployment

1. Clone the Repository

git clone https://github.com/your-username/AWS-Data-Analytics-Pipeline.git
cd AWS-Data-Analytics-Pipeline

2. Customize Variables

Edit variables.tf to modify:

  • Bucket names
  • AWS region
  • Project tags
  • Resource naming conventions
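The variables might look something like the following. These definitions are hypothetical; check variables.tf in the repository for the actual variable names, which may differ:

```hcl
# Hypothetical variable definitions -- the real names live in variables.tf.
variable "bucket_name" {
  description = "S3 bucket that stores the raw sales CSV files"
  type        = string
  default     = "my-sales-analytics-bucket"
}

variable "aws_region" {
  description = "Region in which all pipeline resources are created"
  type        = string
  default     = "us-east-1"
}

variable "project_tags" {
  description = "Tags applied to every resource for cost tracking"
  type        = map(string)
  default = {
    Project = "aws-data-analytics-pipeline"
  }
}
```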

3. Initialize Terraform

terraform init

4. Review and Apply Infrastructure

terraform plan
terraform apply

5. Upload Sample Data

Upload sales_data.csv (or your own dataset) to the created S3 bucket:

aws s3 cp sample-data/sales_data.csv s3://your-bucket-name/

6. Run the Glue Crawler

  • Navigate to AWS Glue Console
  • Start the created crawler to populate the Data Catalog
  • Verify table creation in the Data Catalog
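The same steps can be done from the AWS CLI instead of the console. The crawler and database names below are placeholders; substitute whatever names your terraform apply created:

```shell
# Start the crawler, poll its state, then list the tables it created.
# "sales-crawler" and "sales_db" are placeholder names.
aws glue start-crawler --name sales-crawler
aws glue get-crawler --name sales-crawler --query 'Crawler.State'
aws glue get-tables --database-name sales_db --query 'TableList[].Name'
```

The crawler is finished when its state returns to READY; the table list should then include one table per distinct schema found under the S3 path.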

7. Query with Athena

  • Open Amazon Athena Console
  • Run SQL queries on your cataloged data
  • Example query:
SELECT product_category, SUM(sales_amount) AS total_sales
FROM your_table_name
GROUP BY product_category
ORDER BY total_sales DESC;

8. Create QuickSight Dashboard

  • Connect QuickSight to your Athena data source
  • Create visualizations using the drag-and-drop interface
  • Publish dashboards for business stakeholders

Screenshots

AWS Glue Crawler

Glue Crawler

Amazon Athena Query Editor

Athena Query

Amazon QuickSight Dashboard

QuickSight Dashboard



Security Best Practices

This project implements several security best practices:

  • No Hardcoded Credentials: All access is managed through IAM roles
  • State File Security: terraform.tfstate is excluded via .gitignore
  • Least Privilege Access: IAM policies grant minimal required permissions
  • Resource Isolation: Dedicated IAM roles for each service component
  • Encryption: S3 buckets and data transfers are encrypted
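As one concrete example of the least-privilege approach, the crawler's role can be limited to read-only access on just the data bucket. This is a sketch under assumed names (the role reference and bucket ARN are placeholders), not the repository's actual policy:

```hcl
# Sketch of a least-privilege policy for the Glue crawler role.
# The role reference and bucket ARNs are placeholders.
resource "aws_iam_role_policy" "crawler_s3_read" {
  name = "crawler-s3-read"
  role = aws_iam_role.glue_crawler.id # role defined elsewhere

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = ["s3:GetObject", "s3:ListBucket"]
      Resource = [
        "arn:aws:s3:::example-raw-sales-data",
        "arn:aws:s3:::example-raw-sales-data/*"
      ]
    }]
  })
}
```

Granting only s3:GetObject and s3:ListBucket means a compromised crawler role cannot write, delete, or touch any other bucket.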

Customization and Extension

Adding New Data Sources

  1. Update S3 bucket structure in variables.tf
  2. Modify Glue crawler configuration for new data formats
  3. Adjust Athena queries for additional tables

Scaling for Production

  • Enable CloudTrail for audit logging
  • Implement data partitioning strategies
  • Add automated data quality checks
  • Set up monitoring and alerting with CloudWatch
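Partitioning is the highest-leverage item in that list for Athena costs. Assuming a hypothetical year=/month= key layout in S3, a partitioned table could be declared like this (table and column names are illustrative):

```sql
-- Hypothetical layout: s3://example-raw-sales-data/sales/year=2025/month=07/...
-- Table, column, and bucket names are illustrative.
CREATE EXTERNAL TABLE sales_partitioned (
  product_category string,
  sales_amount     double
)
PARTITIONED BY (year string, month string)
STORED AS PARQUET
LOCATION 's3://example-raw-sales-data/sales/';

-- Register the partitions that already exist under the prefix:
MSCK REPAIR TABLE sales_partitioned;
```

Queries that filter on year and month then scan only the matching prefixes instead of the whole bucket.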

ETL Enhancement

  • Add Glue ETL jobs for data transformation
  • Implement data validation and cleansing
  • Schedule automated data processing workflows
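A transformation job can be added to the same Terraform configuration. The job name, role reference, and script location below are placeholders, assuming a PySpark cleansing script uploaded to S3:

```hcl
# Hypothetical Glue ETL job -- job name, role, and script path are placeholders.
resource "aws_glue_job" "clean_sales" {
  name     = "clean-sales-data"
  role_arn = aws_iam_role.glue_etl.arn # role defined elsewhere

  command {
    name            = "glueetl"
    script_location = "s3://example-raw-sales-data/scripts/clean_sales.py"
    python_version  = "3"
  }

  glue_version = "4.0"
}
```

A glue trigger (on a cron schedule or on crawler completion) can then run the job automatically, covering the scheduling bullet above.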

Cost Optimization

  • S3: Use Intelligent Tiering for automatic cost optimization
  • Athena: Optimize queries and use columnar formats (Parquet)
  • Glue: Schedule crawlers efficiently to avoid unnecessary runs
  • QuickSight: Choose appropriate licensing model based on user count
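The Athena and S3 items above combine well: a single CTAS statement can rewrite the crawler's CSV-backed table as partitioned Parquet. Names here are illustrative; replace your_table_name with the table the crawler created:

```sql
-- Athena CTAS: rewrite the CSV table as partitioned Parquet, reducing the
-- bytes each query scans (Athena bills per TB scanned).
-- Table names and the S3 location are illustrative.
CREATE TABLE sales_parquet
WITH (
  format = 'PARQUET',
  external_location = 's3://example-raw-sales-data/parquet/',
  partitioned_by = ARRAY['product_category']
) AS
SELECT sales_amount, product_category  -- partition column must come last
FROM your_table_name;
```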

Troubleshooting

Common Issues

Glue Crawler Fails

  • Check IAM permissions for S3 access
  • Verify S3 bucket and path configuration
  • Ensure data format is supported

Athena Query Errors

  • Confirm Data Catalog table exists
  • Check query syntax and table names
  • Verify result location S3 bucket permissions

QuickSight Connection Issues

  • Ensure QuickSight has permissions to access Athena
  • Check VPC configuration if using private subnets
  • Verify data source configuration

Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

License

This project is licensed under the MIT License. See LICENSE.txt for details.


Author

Built by Hasan Adnan 🚀


Acknowledgments

  • AWS Documentation and Best Practices
  • Terraform AWS Provider Documentation
  • Community feedback and contributions

Last updated: July 2025
