A curated collection of 70+ diverse datasets for data science, machine learning, and analytics projects
Get Started • View All Datasets • Find Your Dataset
| Total Files | Categories | Beginner Friendly | Updated |
|---|---|---|---|
| 70+ | 11 | 15+ | Regularly |
Perfect for: Data Science • Machine Learning • Analytics • Learning & Teaching • Business Projects • Competitions & Kaggle
This repository contains a comprehensive collection of 70+ datasets spanning various domains including healthcare, entertainment, transportation, demographics, finance, and more. Each dataset is carefully organized and ready for analysis!
70+ Curated Datasets | Well-Organized | Documented | Ready to Use | Quality Verified
| Healthcare | Entertainment | Transport | Real Estate | Demographics |
|---|---|---|---|---|
| 8+ Datasets | 8+ Datasets | 3+ Datasets | 2+ Datasets | 2+ Datasets |

| Finance | Education | Science | Forecasting | Environment |
|---|---|---|---|---|
| 5+ Datasets | 3+ Datasets | 5+ Datasets | 4+ Datasets | 3+ Datasets |
Each category includes dataset descriptions, file names, use cases, and difficulty levels!
### Healthcare & Medical (8 datasets) - POPULAR
Medical data for health analytics and prediction models
| Dataset | File | Purpose | Type | Level |
|---|---|---|---|---|
| Diabetes Prediction | `diabetes.csv`, `diabetes1.csv` | Classification for diabetes risk | Classification | Beginner |
| Health Camp Data | `Health_Care_Dataset/` | Multi-camp attendance analysis | Analytics | Intermediate |
| Heart Disease | `gfg_heart.csv`, `heart_disease_uci.csv` | Cardiology prediction | Classification | Beginner |
| Medical Costs | `medical_cost_gfg.csv` | Healthcare expense analysis | Regression | Beginner |
| Clothing Reviews | `RNN_Clothing-Review.csv` | NLP sentiment analysis | NLP | Advanced |
Quick Start Code:

```python
import pandas as pd

df = pd.read_csv('diabetes.csv')
print(df.shape)     # View dimensions
df.describe()       # Get statistics
df.isnull().sum()   # Check for missing values
```

### Entertainment & Media (8 datasets) - POPULAR
Streaming platforms, movies, and content analysis data
| Dataset | File | Purpose | Type | Level |
|---|---|---|---|---|
| Netflix | `Netflix_titles.csv`, `Netflix_credits.csv` | Content analysis & trends | Analysis | Beginner |
| HBO Content | `HBO_titles.csv`, `HBO_credits.csv` | Streaming platform comparison | Comparison | Beginner |
| IMDB Dataset | `IMDB-Dataset.csv` | Movie database analysis | Analysis | Beginner |
| Box Office | `gfg_boxoffice.csv` | Revenue & performance metrics | Analysis | Beginner |
| Trending Data | `Trending/trending.csv` | Social media trends | TimeSeries | Intermediate |
Quick Start Code:

```python
import pandas as pd

netflix = pd.read_csv('Netflix_titles.csv')
netflix['type'].value_counts()      # Content distribution
netflix.groupby('country').size()   # Country analysis
```

### Transportation & Mobility (3 datasets)
Vehicle data, traffic, and transportation analytics
| Dataset | File | Purpose | Type | Level |
|---|---|---|---|---|
| Cars Dataset | `Project_2_Cars_Dataset.csv` | Vehicle specs & pricing | Regression | Beginner |
| Police Data | `Project_3_Police Data.csv` | Traffic & incidents | Analysis | Intermediate |
| Vehicle Failure | `vehicle_failure.csv` | Maintenance prediction | Classification | Intermediate |
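Quick Start Code (a sketch: the real columns of `vehicle_failure.csv` aren't documented here, so this uses a small hypothetical stand-in; adjust the names to the actual file):

```python
import pandas as pd

# Hypothetical stand-in for vehicle_failure.csv -- real column names may differ
vehicles = pd.DataFrame({
    'mileage':   [12000, 85000, 43000, 97000, 15000, 62000],
    'age_years': [1, 7, 3, 8, 2, 5],
    'failed':    [0, 1, 0, 1, 0, 1],
})

# Failure rate overall, then by a simple mileage band
failure_rate = vehicles['failed'].mean()
vehicles['high_mileage'] = vehicles['mileage'] > 50000
rate_by_band = vehicles.groupby('high_mileage')['failed'].mean()
print(f"Overall failure rate: {failure_rate:.2f}")
print(rate_by_band)
```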
### Real Estate (2 datasets)
Housing market and property data
| Dataset | File | Purpose | Type | Level |
|---|---|---|---|---|
| Housing Data | `Project_5_Housing_Data.csv`, `House_Price_India.csv` | Price prediction & analysis | Regression | Beginner |
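Quick Start Code (a sketch; the column names `area_sqft` and `price` are assumptions, so adjust them to the real file):

```python
import pandas as pd

# Hypothetical stand-in for House_Price_India.csv -- adjust column names to the real file
houses = pd.DataFrame({
    'area_sqft': [1000, 1500, 2000, 2500],
    'price':     [150000, 220000, 290000, 360000],
})

# Price per square foot is a useful engineered feature for price prediction
houses['price_per_sqft'] = houses['price'] / houses['area_sqft']
print(houses[['area_sqft', 'price_per_sqft']])
print(f"Area/price correlation: {houses['area_sqft'].corr(houses['price']):.3f}")
```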
### Demographics & Census (2 datasets)
Population and demographic statistics
| Dataset | File | Purpose | Type | Level |
|---|---|---|---|---|
| Census 2011 | `Project_6_Census_2011.csv` | Population statistics | Analysis | Intermediate |
| Demographics | `demographics.csv`, `dermographic data.csv` | Demographic analysis | Analysis | Beginner |
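Quick Start Code (a sketch on a hypothetical stand-in, since the real schema of `demographics.csv` isn't documented here):

```python
import pandas as pd

# Hypothetical stand-in for demographics.csv -- real columns may differ
people = pd.DataFrame({
    'age':    [23, 35, 45, 29, 61, 50],
    'gender': ['F', 'M', 'F', 'F', 'M', 'M'],
})

# Age bands are a common first grouping in demographic analysis
people['age_band'] = pd.cut(people['age'], bins=[0, 30, 50, 100],
                            labels=['<30', '30-50', '50+'])
band_counts = people.groupby('age_band', observed=True).size()
print(band_counts)
```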
### Finance & Business (5 datasets)
Financial and business-related datasets
| Dataset | File | Purpose | Type | Level |
|---|---|---|---|---|
| Loan Datasets | `gfg_LoanDataset---LoansDatasest.csv`, `loan_approval_dataset.csv` | Loan approval prediction | Classification | Intermediate |
| Churn Modeling | `Churn_Modelling_gfg.csv` | Customer retention analysis | Classification | Intermediate |
| Employee Attrition | `MFG10YearTerminationData(EMPLOYEE-ATTRITION).csv` | Workforce analytics | Classification | Intermediate |
| Stock Data | `stock_data.csv` | Market analysis | TimeSeries | Intermediate |
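Quick Start Code (a sketch; `Geography` and `Exited` follow the usual churn-modelling schema, but verify against the real file):

```python
import pandas as pd

# Hypothetical stand-in for Churn_Modelling_gfg.csv -- real schema may differ
customers = pd.DataFrame({
    'Geography': ['France', 'Spain', 'France', 'Germany', 'Germany', 'Spain'],
    'Exited':    [0, 0, 1, 1, 1, 0],
})

# Churn rate per region: a typical first cut in retention analysis
churn_by_region = customers.groupby('Geography')['Exited'].mean().sort_values(ascending=False)
print(churn_by_region)
```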
### Education & Learning (3 datasets)
Educational resources and student data
| Dataset | File | Purpose | Type | Level |
|---|---|---|---|---|
| Udemy Courses | `Project_7_Udemy_Dataset.csv`, `Udmey Data/` | Course analysis & pricing | Analysis | Beginner |
| Student Performance | `student-pass-fail-data.csv` | Academic prediction | Classification | Beginner |
| Mall Customers | `gfg_Mall_Customers-.csv` | Customer segmentation | Clustering | Intermediate |
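Quick Start Code (a sketch; the `subject`, `price`, and `num_subscribers` columns are assumptions modelled on typical Udemy exports):

```python
import pandas as pd

# Hypothetical stand-in for Project_7_Udemy_Dataset.csv -- column names are assumptions
courses = pd.DataFrame({
    'subject':         ['Web Dev', 'Business', 'Web Dev', 'Music', 'Business'],
    'price':           [95, 50, 120, 30, 75],
    'num_subscribers': [12000, 3000, 25000, 800, 5000],
})

# Average price and total reach per subject
summary = courses.groupby('subject').agg(
    avg_price=('price', 'mean'),
    total_subscribers=('num_subscribers', 'sum'),
)
print(summary)
```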
### Science & Classic ML Datasets (5 datasets) - FOR BEGINNERS
Classic datasets perfect for learning and tutorials
| Dataset | File | Purpose | Type | Level |
|---|---|---|---|---|
| Iris | `IRIS.csv` | Classic classification | Classification | Beginner |
| Titanic | `Titanic_dataset.csv`, `GFG_titanic.csv`, `Titanic_Dataset_SmartED.csv` | Survival prediction | Classification | Beginner |
| Wine Quality | `redwinequality.csv`, `whitewinequality.csv` | Quality prediction | Classification/Regression | Beginner |
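If you want to start before downloading anything, the same Iris data ships with scikit-learn:

```python
from sklearn.datasets import load_iris

# Load Iris as a pandas DataFrame: 150 rows, 4 features plus the target column
iris = load_iris(as_frame=True)
df = iris.frame

print(df.shape)
print(df['target'].value_counts())  # 50 samples per species
```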
### Forecasting & Time Series (4 datasets)
Time series and forecasting datasets
| Dataset | File | Purpose | Type | Level |
|---|---|---|---|---|
| Weather Data | `Project_1_Weather_Dataset.csv`, `daily-min-temperatures.csv` | Temperature forecasting | TimeSeries | Intermediate |
| Sales Forecasting | `sales_forecasting_dataset_SmartEd_Project.csv`, `stores_sales_forecasting_SmartED.csv` | Revenue prediction | TimeSeries | Intermediate |
| IPL Data | `ipl_data.csv` | Sports analytics | Analysis | Beginner |
| Rainfall | `Rainfall_dataset.csv` | Climate patterns | TimeSeries | Intermediate |
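Quick Start Code (a sketch on a synthetic series standing in for `daily-min-temperatures.csv`; a rolling mean is a common first smoothing step before forecasting):

```python
import numpy as np
import pandas as pd

# Synthetic daily temperature series standing in for daily-min-temperatures.csv
dates = pd.date_range('2020-01-01', periods=30, freq='D')
temps = pd.Series(10 + np.sin(np.arange(30) / 5), index=dates, name='temp')

# 7-day rolling mean: the first 6 values are NaN (not enough history yet)
rolling = temps.rolling(window=7).mean()
print(rolling.tail())
```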
### Environment & Special Topics (3 datasets)
Environmental and specialized datasets
| Dataset | File | Purpose | Type | Level |
|---|---|---|---|---|
| COVID-19 Data | `Project_4_Covid_19_data.csv` | Pandemic analysis | TimeSeries | Intermediate |
| Amazon Prime | `Amazone_titles.csv`, `Amazone_credits.csv` | Content analysis | Analysis | Beginner |
| Zomato Data | `Zomato-data-.csv` | Restaurant trends | Analysis | Beginner |
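Quick Start Code (a sketch on synthetic numbers standing in for `Project_4_Covid_19_data.csv`; COVID files often report cumulative totals, and `diff()` recovers daily new cases):

```python
import pandas as pd

# Synthetic cumulative case counts standing in for Project_4_Covid_19_data.csv
covid = pd.DataFrame({
    'date': pd.date_range('2020-03-01', periods=5, freq='D'),
    'cumulative_cases': [10, 25, 45, 80, 130],
})

# diff() turns cumulative totals into daily new cases; the first day has no
# previous value, so fill it with the cumulative count itself
covid['new_cases'] = covid['cumulative_cases'].diff().fillna(covid['cumulative_cases'])
print(covid)
```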
```bash
# Install required packages
pip install pandas numpy matplotlib seaborn scikit-learn jupyter
```

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load your chosen dataset
df = pd.read_csv('diabetes.csv')

# Quick exploration
print(df.info())      # Data types & missing values
print(df.describe())  # Statistical summary
print(df.head(10))    # First 10 rows
print(df.shape)       # Dimensions (rows, columns)

# Visual inspection
df.hist(figsize=(12, 8))
plt.tight_layout()
plt.show()
```

### 1. Exploratory Data Analysis (EDA)
```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load data
df = pd.read_csv('diabetes.csv')

# Basic exploration
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(f"Missing values:\n{df.isnull().sum()}")

# Statistical summary
print(df.describe())

# Distribution analysis
plt.figure(figsize=(12, 4))
df['diabetes'].value_counts().plot(kind='bar')
plt.title('Diabetes Distribution')
plt.show()

# Correlation heatmap (numeric columns only, so non-numeric columns don't raise)
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()
```

### 2. Classification Pipeline
```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import pandas as pd

# Load data
df = pd.read_csv('diabetes.csv')

# Prepare features and target
X = df.drop(['diabetes'], axis=1)
y = df['diabetes']

# Handle categorical variables
X_encoded = pd.get_dummies(X, drop_first=True)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.2, random_state=42
)

# Train Random Forest model
model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate model
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

# Feature importance
feature_importance = pd.DataFrame({
    'feature': X_encoded.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)
print("\nTop Features:")
print(feature_importance.head(10))
```

### 3. Regression Pipeline
```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load housing data
df = pd.read_csv('Project_5_Housing_Data.csv')

# Prepare data (adjust column names as needed)
X = df.drop('price', axis=1)  # Features
y = df['price']               # Target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluation
print(f'R² Score: {r2_score(y_test, y_pred):.4f}')
print(f'RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}')
print(f'MAE: {mean_absolute_error(y_test, y_pred):.4f}')

# Visualization
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title('Regression: Actual vs Predicted')
plt.show()
```

### 4. Text Classification & NLP
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd

# Load spam dataset
df = pd.read_csv('spam_ham_dataset.csv')

# Create pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000, stop_words='english')),
    ('classifier', MultinomialNB())
])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['label'], test_size=0.2, random_state=42
)

# Train
pipeline.fit(X_train, y_train)

# Evaluate
y_pred = pipeline.predict(X_test)
print(f'Accuracy: {accuracy_score(y_test, y_pred):.4f}')
print(classification_report(y_test, y_pred))

# Test on new data
test_messages = [
    'Congratulations! You won a prize!',
    'Hello, can we schedule a meeting tomorrow?'
]
predictions = pipeline.predict(test_messages)
for msg, pred in zip(test_messages, predictions):
    print(f"'{msg}' -> {pred}")
```

### 5. Clustering & Segmentation
```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import pandas as pd
import matplotlib.pyplot as plt

# Load mall customers data
df = pd.read_csv('gfg_Mall_Customers-.csv')

# Select features for clustering
X = df[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']].values

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Determine optimal k (elbow method)
inertias = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)

plt.figure(figsize=(10, 6))
plt.plot(range(1, 11), inertias, 'bo-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')
plt.show()

# Apply clustering with optimal k (e.g., k=5)
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_scaled)

# Visualization: centers are learned in scaled space, so map them back
# to original units before plotting them over the raw points
centers = scaler.inverse_transform(kmeans.cluster_centers_)
plt.figure(figsize=(10, 6))
scatter = plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='viridis', s=100, alpha=0.6)
plt.scatter(centers[:, 0], centers[:, 1],
            c='red', marker='X', s=200, edgecolors='black', linewidths=2)
plt.xlabel('Age')
plt.ylabel('Annual Income')
plt.title('Customer Segmentation (K-Means)')
plt.colorbar(scatter, label='Cluster')
plt.show()

# Cluster analysis
df['Cluster'] = clusters
print(df.groupby('Cluster')[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']].mean())
```

| Category | Count | Beginner Friendly | Best For |
|---|---|---|---|
| Healthcare | 8 | Yes | Classification, Health Analytics |
| Entertainment | 8 | Yes | EDA, Visualization, Trends |
| Transportation | 3 | Yes | Regression, Analysis |
| Real Estate | 2 | Yes | Regression, Price Prediction |
| Demographics | 2 | Yes | Analysis, Population Studies |
| Finance | 5 | Some | Classification, Forecasting |
| Education | 3 | Yes | Analysis, Clustering |
| Science & ML | 5 | Yes | Learning, Tutorials |
| Forecasting | 4 | Some | TimeSeries, ARIMA, LSTM |
| Environment | 3 | Some | Analysis, Trends |
| Food & Dining | 1 | Yes | Analysis, Visualization |
## Complete Setup Guide
```bash
# Data manipulation & analysis
pip install pandas numpy

# Visualization
pip install matplotlib seaborn plotly

# Machine Learning
pip install scikit-learn xgboost lightgbm catboost

# Deep Learning (optional; the PyPI package for PyTorch is "torch")
pip install tensorflow torch

# Statistical Analysis
pip install scipy statsmodels

# Jupyter Notebooks
pip install jupyter jupyterlab ipywidgets

# Data Quality (pandas-profiling is now published as ydata-profiling)
pip install ydata-profiling missingno
```

Or install everything at once:

```bash
pip install pandas numpy matplotlib seaborn scikit-learn jupyter plotly scipy statsmodels xgboost
```

Verify the installation:

```python
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt

print(f"Pandas: {pd.__version__}")
print(f"NumPy: {np.__version__}")
print(f"Scikit-Learn: {sklearn.__version__}")
print("All libraries installed successfully!")
```

Repository layout:

```
Datasets/
├── README.md (this file)
├── LICENSE
│
├── HEALTHCARE DATASETS
│   ├── diabetes.csv
│   ├── diabetes1.csv
│   ├── gfg_heart.csv
│   ├── heart_disease_uci.csv
│   ├── medical_cost_gfg.csv
│   ├── RNN_Clothing-Review.csv
│   └── Health_Care_Dataset/
│       ├── Patient_Profile.csv
│       ├── Health_Camp_Detail.csv
│       ├── First_Health_Camp_Attended.csv
│       ├── Second_Health_Camp_Attended.csv
│       ├── Third_Health_Camp_Attended.csv
│       ├── Train.csv
│       ├── test.csv
│       └── Cleaned_Data/
│
├── ENTERTAINMENT DATASETS
│   ├── Netflix_titles.csv
│   ├── Netflix_credits.csv
│   ├── HBO_titles.csv
│   ├── HBO_credits.csv
│   ├── Amazone_titles.csv
│   ├── Amazone_credits.csv
│   ├── IMDB-Dataset.csv
│   ├── gfg_boxoffice.csv
│   └── Trending/
│       ├── trending.csv
│       └── Cleaned data/
│
├── TRANSPORTATION & FORECASTING
│   ├── Project_1_Weather_Dataset.csv
│   ├── Project_2_Cars_Dataset.csv
│   ├── Project_3_Police Data.csv
│   ├── daily-min-temperatures.csv
│   ├── stock_data.csv
│   ├── vehicle_failure.csv
│   └── ipl_data.csv
│
├── REAL ESTATE & BUSINESS
│   ├── House_Price_India.csv
│   ├── Project_5_Housing_Data.csv
│   ├── gfg_LoanDataset---LoansDatasest.csv
│   ├── loan_approval_dataset.csv
│   ├── Churn_Modelling_gfg.csv
│   └── MFG10YearTerminationData(EMPLOYEE-ATTRITION).csv
│
├── EDUCATION & DEMOGRAPHICS
│   ├── Project_6_Census_2011.csv
│   ├── Project_7_Udemy_Dataset.csv
│   ├── demographics.csv
│   ├── dermographic data.csv
│   ├── student-pass-fail-data.csv
│   ├── gfg_Mall_Customers-.csv
│   └── Udmey Data/
│
├── CLASSIC ML & SCIENCE
│   ├── IRIS.csv
│   ├── Titanic_dataset.csv
│   ├── GFG_titanic.csv
│   ├── Titanic_Dataset_SmartED.csv
│   ├── redwinequality.csv
│   └── whitewinequality.csv
│
├── TEXT & SPECIAL
│   ├── spam_ham_dataset.csv
│   ├── Project_Text_Classification_synthetic_text_data.csv
│   ├── Zomato-data-.csv
│   ├── Project_4_Covid_19_data.csv
│   ├── Rainfall_dataset.csv
│   └── sales_forecasting_dataset_SmartEd_Project.csv
│
└── CLEANED & TEST DATA
    ├── testdata.csv
    ├── CleaneD_testdata_File.csv
    ├── Naivs_diabetes.csv
    ├── customer_purchase_behavior.csv
    ├── Position_Salaries.csv
    └── stores_sales_forecasting_SmartED.csv
```
## Beginner Learning Path (Start Here!)

1. **Load & Explore**: Start with `IRIS.csv` or `Titanic_dataset.csv`
2. **Practice**: Use the EDA code examples above
3. **Visualize**: Create plots with matplotlib & seaborn
4. **Classification**: Try `diabetes.csv` with logistic regression
5. **Regression**: Use `House_Price_India.csv` for price prediction
6. **Understand**: Learn about train-test splits and model evaluation
7. **Ensemble Methods**: Apply Random Forests to any dataset
8. **Clustering**: Segment customers with `gfg_Mall_Customers-.csv`
9. **NLP Basics**: Text classification with `spam_ham_dataset.csv`
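The classification step of this path can be sketched on scikit-learn's built-in Iris data, so it runs without any download; swap in `diabetes.csv` once you have it locally:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load built-in data, hold out 20% for testing
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a logistic regression classifier and score it on the held-out data
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print(f"Test accuracy: {clf.score(X_test, y_test):.3f}")
```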
## Intermediate Learning Path

1. **Feature Engineering**: Work with `Titanic_dataset.csv`
2. **Time Series**: Learn with `daily-min-temperatures.csv`
3. **Feature Selection**: Apply to `Churn_Modelling_gfg.csv`
4. **Model Tuning**: Hyperparameter optimization on any dataset
5. **Cross-Validation**: Implement k-fold on classification problems
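The cross-validation step above, as a minimal sketch on built-in data; the same `cross_val_score` call works on any dataset in this collection:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation: five train/validate splits, one accuracy per fold
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(model, X, y, cv=5)

print(f"Fold accuracies: {scores.round(3)}")
print(f"Mean: {scores.mean():.3f} +/- {scores.std():.3f}")
```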
## Advanced Learning Path

1. **Deep Learning**: NLP with `RNN_Clothing-Review.csv`
2. **LSTM Models**: Time series forecasting
3. **Ensemble Stacking**: Combine multiple models
4. **Advanced NLP**: Sentiment analysis & text generation
5. **Big Data Techniques**: Handle large datasets efficiently
We welcome contributions! Here's how you can help:
1. Fork the repository
2. Create a feature branch: `git checkout -b feature/add-dataset`
3. Add your dataset with documentation
4. Commit changes: `git commit -m "Add new dataset: [name]"`
5. Push to the branch: `git push origin feature/add-dataset`
6. Submit a Pull Request
When adding a dataset, please include:
- Clear description of the dataset
- Data dictionary/schema
- Usage examples
- Source attribution
- Data quality assessment
- Size and format information
- Pandas Documentation
- Scikit-Learn Guide
- Matplotlib Tutorials
- Kaggle Competitions
- Google Colab - Free cloud notebooks
This dataset collection is available under an open-source license (see the LICENSE file).
You may:

- Use for educational and research purposes
- Use for commercial projects (with attribution)
- Modify and redistribute datasets
- Create derivative works

You may not:

- Claim original ownership
- Remove attribution from original sources
Dataset Collection by itsluckysharma01
GitHub: https://github.com/itsluckysharma01/Datasets
- Star this repository
- Fork or clone the repo
- Read this README
- Install required packages
- Pick a beginner dataset
- Run the code examples
- Start your project!
- Issues: Open an issue on GitHub
- Discussions: Use GitHub Discussions for questions
- Email: itsluckysharma01@email.com
Made with ❤️ for the Data Science Community