📊 Comprehensive Dataset Collection

🚀 Welcome to the Ultimate Datasets Repository! 🚀

✨ A curated collection of 70+ diverse datasets for data science, machine learning, and analytics projects

⬇️ Get Started • 📖 View All Datasets • 🎯 Find Your Dataset

🎨 Repository Statistics

📦 Total Files: 70+ | 🏆 Categories: 11 | ⭐ Beginner Friendly: 15+ | 🔥 Updated: Regularly

🎯 Overview

🌟 Perfect For:

╔═══════════════════════════════════════════════════════╗
║                                                       ║
║  🔬 Data Science       │  🤖 Machine Learning       ║
║  📊 Analytics          │  🎓 Learning & Teaching    ║
║  💼 Business Projects  │  🏆 Competitions & Kaggle  ║
║                                                       ║
╚═══════════════════════════════════════════════════════╝

This repository contains a comprehensive collection of 70+ datasets spanning various domains including healthcare, entertainment, transportation, demographics, finance, and more. Each dataset is carefully organized and ready for analysis!

βœ… What You Get:

✨ 70+ Curated Datasets | 🎯 Well-Organized | πŸ“– Documented | πŸš€ Ready to Use | πŸ† Quality Verified


πŸš€ Quick Navigation

πŸ‘‡ Click Any Category Below to Explore:

πŸ₯ Healthcare 🎬 Entertainment πŸš— Transport 🏠 Real Estate 🌍 Demographics
8+ Datasets 8+ Datasets 3+ Datasets 2+ Datasets 2+ Datasets
πŸ“– Explore πŸ“– Explore πŸ“– Explore πŸ“– Explore πŸ“– Explore
πŸ’° Finance πŸŽ“ Education πŸ”¬ Science πŸ“Š Forecasting 🌾 Environment
5+ Datasets 3+ Datasets 5+ Datasets 4+ Datasets 3+ Datasets
πŸ“– Explore πŸ“– Explore πŸ“– Explore πŸ“– Explore πŸ“– Explore

🎯 Interactive Category Browser

πŸ“š Browse & Explore All Categories (Click Headers to Expand)

Each category includes dataset descriptions, file names, use cases, and difficulty levels!

πŸ₯ Healthcare & Medical (8 datasets) ⭐ POPULAR

Medical data for health analytics and prediction models

Dataset File Purpose Type Level
πŸ’Š Diabetes Prediction diabetes.csv, diabetes1.csv Classification for diabetes risk Classification 🟒 Beginner
πŸ₯ Health Camp Data Health_Care_Dataset/ Multi-camp attendance analysis Analytics 🟑 Intermediate
❀️ Heart Disease gfg_heart.csv, heart_disease_uci.csv Cardiology prediction Classification 🟒 Beginner
πŸ’° Medical Costs medical_cost_gfg.csv Healthcare expense analysis Regression 🟒 Beginner
πŸ‘• Clothing Reviews RNN_Clothing-Review.csv NLP sentiment analysis NLP πŸ”΄ Advanced

πŸ’‘ Quick Start Code:

import pandas as pd
df = pd.read_csv('diabetes.csv')
print(df.shape)      # View dimensions
df.describe()        # Get statistics
df.isnull().sum()    # Check for missing values
🎬 Entertainment & Media (8 datasets) ⭐ POPULAR

Streaming platforms, movies, and content analysis data

Dataset File Purpose Type Level
πŸŽ₯ Netflix Netflix_titles.csv, Netflix_credits.csv Content analysis & trends Analysis 🟒 Beginner
πŸ“Ί HBO Content HBO_titles.csv, HBO_credits.csv Streaming platform comparison Comparison 🟒 Beginner
🎬 IMDB Dataset IMDB-Dataset.csv Movie database analysis Analysis 🟒 Beginner
πŸ’΅ Box Office gfg_boxoffice.csv Revenue & performance metrics Analysis 🟒 Beginner
πŸ”₯ Trending Data Trending/trending.csv Social media trends TimeSeries 🟑 Intermediate

πŸ’‘ Quick Start Code:

import pandas as pd

netflix = pd.read_csv('Netflix_titles.csv')
netflix['type'].value_counts()     # Content distribution (Movie vs. TV Show)
netflix.groupby('country').size()  # Titles per country
πŸš— Transportation & Mobility (3 datasets)

Vehicle data, traffic, and transportation analytics

Dataset File Purpose Type Level
πŸš™ Cars Dataset Project_2_Cars_Dataset.csv Vehicle specs & pricing Regression 🟒 Beginner
🚨 Police Data Project_3_Police Data.csv Traffic & incidents Analysis 🟑 Intermediate
βš™οΈ Vehicle Failure vehicle_failure.csv Maintenance prediction Classification 🟑 Intermediate
🏠 Real Estate (2 datasets)

Housing market and property data

Dataset File Purpose Type Level
🏑 Housing Data Project_5_Housing_Data.csv, House_Price_India.csv Price prediction & analysis Regression 🟒 Beginner
🌍 Demographics & Census (2 datasets)

Population and demographic statistics

Dataset File Purpose Type Level
πŸ“Š Census 2011 Project_6_Census_2011.csv Population statistics Analysis 🟑 Intermediate
πŸ‘₯ Demographics demographics.csv, dermographic data.csv Demographic analysis Analysis 🟒 Beginner
πŸ’° Finance & Business (5 datasets)

Financial and business-related datasets

Dataset File Purpose Type Level
πŸ“‹ Loan Datasets gfg_LoanDataset---LoansDatasest.csv, loan_approval_dataset.csv Loan approval prediction Classification 🟑 Intermediate
πŸ“‰ Churn Modeling Churn_Modelling_gfg.csv Customer retention analysis Classification 🟑 Intermediate
πŸ‘” Employee Attrition MFG10YearTerminationData(EMPLOYEE-ATTRITION).csv Workforce analytics Classification 🟑 Intermediate
πŸ“ˆ Stock Data stock_data.csv Market analysis TimeSeries 🟑 Intermediate
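💡 Quick Start Code (an illustrative sketch on a synthetic frame; Churn_Modelling_gfg.csv likely names these columns differently, e.g. 'Exited' and 'Tenure', so adjust before use):

```python
import pandas as pd

# Tiny stand-in for Churn_Modelling_gfg.csv -- column names are placeholders
df = pd.DataFrame({
    'tenure':  [1, 8, 3, 10, 2],
    'balance': [0.0, 120000.0, 45000.0, 80000.0, 0.0],
    'exited':  [1, 0, 1, 0, 1],
})

print(f"Churn rate: {df['exited'].mean():.0%}")  # overall churn share
print(df.groupby('exited')['tenure'].mean())     # average tenure by outcome
```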
πŸŽ“ Education & Learning (3 datasets)

Educational resources and student data

Dataset File Purpose Type Level
🎯 Udemy Courses Project_7_Udemy_Dataset.csv, Udmey Data/ Course analysis & pricing Analysis 🟒 Beginner
πŸ“š Student Performance student-pass-fail-data.csv Academic prediction Classification 🟒 Beginner
πŸ›οΈ Mall Customers gfg_Mall_Customers-.csv Customer segmentation Clustering 🟑 Intermediate
πŸ”¬ Science & Classic ML Datasets (5 datasets) ⭐ FOR BEGINNERS

Classic datasets perfect for learning and tutorials

Dataset File Purpose Type Level
🌸 Iris IRIS.csv Classic classification Classification 🟒 Beginner
βš“ Titanic Titanic_dataset.csv, GFG_titanic.csv, Titanic_Dataset_SmartED.csv Survival prediction Classification 🟒 Beginner
🍷 Wine Quality redwinequality.csv, whitewinequality.csv Quality prediction Regression 🟢 Beginner
πŸ“Š Forecasting & TimeSeries (4 datasets)

Time series and forecasting datasets

Dataset File Purpose Type Level
🌀️ Weather Data Project_1_Weather_Dataset.csv, daily-min-temperatures.csv Temperature forecasting TimeSeries 🟑 Intermediate
πŸ’Ή Sales Forecasting sales_forecasting_dataset_SmartEd_Project.csv, stores_sales_forecasting_SmartED.csv Revenue prediction TimeSeries 🟑 Intermediate
🏏 IPL Data ipl_data.csv Sports analytics Analysis 🟒 Beginner
🌧️ Rainfall Rainfall_dataset.csv Climate patterns TimeSeries 🟑 Intermediate
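💡 Quick Start Code (resampling and smoothing on a synthetic daily series; daily-min-temperatures.csv should work the same way once its date column is parsed and set as the index):

```python
import numpy as np
import pandas as pd

# Synthetic daily series standing in for daily-min-temperatures.csv
dates = pd.date_range('2024-01-01', periods=60, freq='D')
ts = pd.Series(10 + np.sin(np.arange(60) / 9.5), index=dates, name='temp')

weekly = ts.resample('W').mean()  # downsample daily values to weekly means
smooth = ts.rolling(7).mean()     # 7-day moving average (first 6 rows are NaN)
print(weekly.round(2).head(3))
```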
🌾 Environment & Special Topics (3 datasets)

Environmental and specialized datasets

Dataset File Purpose Type Level
🦠 COVID-19 Data Project_4_Covid_19_data.csv Pandemic analysis TimeSeries 🟑 Intermediate
🎬 Amazon Prime Amazone_titles.csv, Amazone_credits.csv Content analysis Analysis 🟒 Beginner
πŸ” Zomato Data Zomato-data-.csv Restaurant trends Analysis 🟒 Beginner

πŸš€ Quick Start Guide

πŸ“‹ Prerequisites

# Install required packages
pip install pandas numpy matplotlib seaborn scikit-learn jupyter

⚑ Load & Explore Any Dataset (30 seconds)

import pandas as pd
import numpy as np

# Load your chosen dataset
df = pd.read_csv('diabetes.csv')

# Quick exploration
print(df.info())        # Data types & missing values
print(df.describe())    # Statistical summary
print(df.head(10))      # First 10 rows
print(df.shape)         # Dimensions (rows, columns)

# Visual inspection
import matplotlib.pyplot as plt
df.hist(figsize=(12, 8))
plt.tight_layout()
plt.show()

πŸ’‘ Common Usage Patterns

🎯 Choose Your Use Case:

πŸ“Š 1. Exploratory Data Analysis (EDA)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load data
df = pd.read_csv('diabetes.csv')

# Basic exploration
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(f"Missing values:\n{df.isnull().sum()}")

# Statistical summary
print(df.describe())

# Distribution analysis ('diabetes' may be named 'Outcome' in some versions of this file)
plt.figure(figsize=(12, 4))
df['diabetes'].value_counts().plot(kind='bar')
plt.title('Diabetes Distribution')
plt.show()

# Correlation heatmap (numeric_only avoids errors on non-numeric columns)
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()
πŸ€– 2. Classification Pipeline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import pandas as pd

# Load data
df = pd.read_csv('diabetes.csv')

# Prepare features and target ('diabetes' may be named 'Outcome' in some files)
X = df.drop(['diabetes'], axis=1)
y = df['diabetes']

# Handle categorical variables
X_encoded = pd.get_dummies(X, drop_first=True)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.2, random_state=42
)

# Train Random Forest model
model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate model
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

# Feature importance
feature_importance = pd.DataFrame({
    'feature': X_encoded.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)
print("\nTop Features:")
print(feature_importance.head(10))
πŸ“ˆ 3. Regression Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import pandas as pd
import numpy as np

# Load housing data
df = pd.read_csv('Project_5_Housing_Data.csv')

# Prepare data (adjust column names as needed)
X = df.drop('price', axis=1)  # Features
y = df['price']               # Target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluation
print(f'R² Score: {r2_score(y_test, y_pred):.4f}')
print(f'RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}')
print(f'MAE: {mean_absolute_error(y_test, y_pred):.4f}')

# Visualization
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title('Regression: Actual vs Predicted')
plt.show()
πŸ”€ 4. Text Classification & NLP
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
import pandas as pd

# Load spam dataset
df = pd.read_csv('spam_ham_dataset.csv')

# Create pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000, stop_words='english')),
    ('classifier', MultinomialNB())
])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['label'], test_size=0.2, random_state=42
)

# Train
pipeline.fit(X_train, y_train)

# Evaluate
from sklearn.metrics import accuracy_score, classification_report
y_pred = pipeline.predict(X_test)
print(f'Accuracy: {accuracy_score(y_test, y_pred):.4f}')
print(classification_report(y_test, y_pred))

# Test on new data
test_messages = [
    'Congratulations! You won a prize!',
    'Hello, can we schedule a meeting tomorrow?'
]
predictions = pipeline.predict(test_messages)
for msg, pred in zip(test_messages, predictions):
    print(f"'{msg}' -> {pred}")
🎯 5. Clustering & Segmentation
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import pandas as pd
import matplotlib.pyplot as plt

# Load mall customers data
df = pd.read_csv('gfg_Mall_Customers-.csv')

# Select features for clustering
X = df[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']].values

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Determine optimal k (elbow method)
inertias = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)

plt.figure(figsize=(10, 6))
plt.plot(range(1, 11), inertias, 'bo-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')
plt.show()

# Apply clustering with optimal k (e.g., k=5)
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_scaled)

# Visualization
plt.figure(figsize=(10, 6))
scatter = plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='viridis', s=100, alpha=0.6)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='red', marker='X', s=200, edgecolors='black', linewidths=2)
plt.xlabel('Age')
plt.ylabel('Annual Income')
plt.title('Customer Segmentation (K-Means)')
plt.colorbar(scatter, label='Cluster')
plt.show()

# Cluster analysis
df['Cluster'] = clusters
print(df.groupby('Cluster')[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']].mean())

πŸ“Š Dataset Quick Reference

Category Count Beginner Friendly Best For
πŸ₯ Healthcare 8 βœ… YES Classification, Health Analytics
🎬 Entertainment 8 βœ… YES EDA, Visualization, Trends
πŸš— Transportation 3 βœ… YES Regression, Analysis
🏠 Real Estate 2 βœ… YES Regression, Price Prediction
🌍 Demographics 2 βœ… YES Analysis, Population Studies
πŸ’° Finance 5 🟑 SOME Classification, Forecasting
πŸŽ“ Education 3 βœ… YES Analysis, Clustering
πŸ”¬ Science & ML 5 βœ… YES Learning, Tutorials
πŸ“Š Forecasting 4 🟑 SOME TimeSeries, ARIMA, LSTM
🌾 Environment 3 🟑 SOME Analysis, Trends
πŸ” Food & Dining 1 βœ… YES Analysis, Visualization

πŸ› οΈ Recommended Tools & Libraries

πŸ“¦ Complete Setup Guide

Essential Stack

# Data manipulation & analysis
pip install pandas numpy

# Visualization
pip install matplotlib seaborn plotly

# Machine Learning
pip install scikit-learn xgboost lightgbm catboost

# Deep Learning (Optional)
pip install tensorflow torch

# Statistical Analysis
pip install scipy statsmodels

# Jupyter Notebooks
pip install jupyter jupyterlab ipywidgets

# Data Quality
pip install ydata-profiling missingno  # ydata-profiling is the renamed pandas-profiling

All-in-One Installation

pip install pandas numpy matplotlib seaborn scikit-learn jupyter plotly scipy statsmodels xgboost

Verify Installation

import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt

print(f"Pandas: {pd.__version__}")
print(f"NumPy: {np.__version__}")
print(f"Scikit-Learn: {sklearn.__version__}")
print(f"βœ… All libraries installed successfully!")

πŸ“ Directory Structure

πŸ“ Datasets/
β”œβ”€β”€ πŸ“„ README.md                           (This file)
β”œβ”€β”€ πŸ“„ LICENSE
β”‚
β”œβ”€β”€ 🩺 HEALTHCARE DATASETS
β”‚   β”œβ”€β”€ diabetes.csv
β”‚   β”œβ”€β”€ diabetes1.csv
β”‚   β”œβ”€β”€ gfg_heart.csv
β”‚   β”œβ”€β”€ heart_disease_uci.csv
β”‚   β”œβ”€β”€ medical_cost_gfg.csv
β”‚   β”œβ”€β”€ RNN_Clothing-Review.csv
β”‚   └── Health_Care_Dataset/
β”‚       β”œβ”€β”€ Patient_Profile.csv
β”‚       β”œβ”€β”€ Health_Camp_Detail.csv
β”‚       β”œβ”€β”€ First_Health_Camp_Attended.csv
β”‚       β”œβ”€β”€ Second_Health_Camp_Attended.csv
β”‚       β”œβ”€β”€ Third_Health_Camp_Attended.csv
β”‚       β”œβ”€β”€ Train.csv
β”‚       β”œβ”€β”€ test.csv
β”‚       └── Cleaned_Data/
β”‚
β”œβ”€β”€ 🎬 ENTERTAINMENT DATASETS
β”‚   β”œβ”€β”€ Netflix_titles.csv
β”‚   β”œβ”€β”€ Netflix_credits.csv
β”‚   β”œβ”€β”€ HBO_titles.csv
β”‚   β”œβ”€β”€ HBO_credits.csv
β”‚   β”œβ”€β”€ Amazone_titles.csv
β”‚   β”œβ”€β”€ Amazone_credits.csv
β”‚   β”œβ”€β”€ IMDB-Dataset.csv
β”‚   β”œβ”€β”€ gfg_boxoffice.csv
β”‚   └── Trending/
β”‚       β”œβ”€β”€ trending.csv
β”‚       └── Cleaned data/
β”‚
β”œβ”€β”€ πŸš— TRANSPORTATION & πŸ“Š FORECASTING
β”‚   β”œβ”€β”€ Project_1_Weather_Dataset.csv
β”‚   β”œβ”€β”€ Project_2_Cars_Dataset.csv
β”‚   β”œβ”€β”€ Project_3_Police Data.csv
β”‚   β”œβ”€β”€ daily-min-temperatures.csv
β”‚   β”œβ”€β”€ stock_data.csv
β”‚   β”œβ”€β”€ vehicle_failure.csv
β”‚   └── ipl_data.csv
β”‚
β”œβ”€β”€ 🏠 REAL ESTATE & πŸ’Ό BUSINESS
β”‚   β”œβ”€β”€ House_Price_India.csv
β”‚   β”œβ”€β”€ Project_5_Housing_Data.csv
β”‚   β”œβ”€β”€ gfg_LoanDataset---LoansDatasest.csv
β”‚   β”œβ”€β”€ loan_approval_dataset.csv
β”‚   β”œβ”€β”€ Churn_Modelling_gfg.csv
β”‚   └── MFG10YearTerminationData(EMPLOYEE-ATTRITION).csv
β”‚
β”œβ”€β”€ πŸŽ“ EDUCATION & 🌍 DEMOGRAPHICS
β”‚   β”œβ”€β”€ Project_6_Census_2011.csv
β”‚   β”œβ”€β”€ Project_7_Udemy_Dataset.csv
β”‚   β”œβ”€β”€ demographics.csv
β”‚   β”œβ”€β”€ dermographic data.csv
β”‚   β”œβ”€β”€ student-pass-fail-data.csv
β”‚   β”œβ”€β”€ gfg_Mall_Customers-.csv
β”‚   └── Udmey Data/
β”‚
β”œβ”€β”€ πŸ”¬ CLASSIC ML & SCIENCE
β”‚   β”œβ”€β”€ IRIS.csv
β”‚   β”œβ”€β”€ Titanic_dataset.csv
β”‚   β”œβ”€β”€ GFG_titanic.csv
β”‚   β”œβ”€β”€ Titanic_Dataset_SmartED.csv
β”‚   β”œβ”€β”€ redwinequality.csv
β”‚   └── whitewinequality.csv
β”‚
β”œβ”€β”€ πŸ“ TEXT & SPECIAL
β”‚   β”œβ”€β”€ spam_ham_dataset.csv
β”‚   β”œβ”€β”€ Project_Text_Classification_synthetic_text_data.csv
β”‚   β”œβ”€β”€ Zomato-data-.csv
β”‚   β”œβ”€β”€ Project_4_Covid_19_data.csv
β”‚   β”œβ”€β”€ Rainfall_dataset.csv
β”‚   └── sales_forecasting_dataset_SmartEd_Project.csv
β”‚
└── 🧹 CLEANED & TEST DATA
    β”œβ”€β”€ testdata.csv
    β”œβ”€β”€ CleaneD_testdata_File.csv
    β”œβ”€β”€ Naivs_diabetes.csv
    β”œβ”€β”€ customer_purchase_behavior.csv
    β”œβ”€β”€ Position_Salaries.csv
    └── stores_sales_forecasting_SmartED.csv

πŸŽ“ Learning Paths

🟒 Beginner Learning Path (Start Here!)

Week 1-2: Basics

  1. Load & Explore: Start with IRIS.csv or Titanic_dataset.csv
  2. Practice: Use the EDA code examples above
  3. Visualize: Create plots with matplotlib & seaborn

Week 3-4: Simple Models

  1. Classification: Try diabetes.csv with logistic regression
  2. Regression: Use House_Price_India.csv for price prediction
  3. Understand: Learn about train-test splits and model evaluation

Week 5-6: Advanced Concepts

  1. Ensemble Methods: Apply Random Forests to any dataset
  2. Clustering: Segment customers with gfg_Mall_Customers-.csv
  3. NLP Basics: Text classification with spam_ham_dataset.csv
🟑 Intermediate Learning Path
  1. Feature Engineering: Work with Titanic_dataset.csv
  2. Time Series: Learn with daily-min-temperatures.csv
  3. Feature Selection: Apply to Churn_Modelling_gfg.csv
  4. Model Tuning: Hyperparameter optimization on any dataset
  5. Cross-Validation: Implement k-fold on classification problems
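The k-fold item above can be sketched with scikit-learn's built-in iris data (standing in for IRIS.csv so the snippet runs without any files):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation: each fold serves once as the held-out test set
X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(f"5-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```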
πŸ”΄ Advanced Learning Path
  1. Deep Learning: NLP with RNN_Clothing-Review.csv
  2. LSTM Models: Time series forecasting
  3. Ensemble Stacking: Combine multiple models
  4. Advanced NLP: Sentiment analysis & text generation
  5. Big Data Techniques: Handle large datasets efficiently

🀝 Contributing

We welcome contributions! Here's how you can help:

πŸ“‹ Contribution Guidelines

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/add-dataset
  3. Add your dataset with documentation
  4. Commit changes: git commit -m "Add new dataset: [name]"
  5. Push to branch: git push origin feature/add-dataset
  6. Submit a Pull Request

πŸ“ Dataset Submission Requirements

When adding a dataset, please include:

  • βœ… Clear description of dataset
  • βœ… Data dictionary/schema
  • βœ… Usage examples
  • βœ… Source attribution
  • βœ… Data quality assessment
  • βœ… Size and format information

πŸ“š Resources & Links


βš–οΈ License & Usage

This dataset collection is released under an open-source license; see the LICENSE file in the repository for the exact terms.

βœ… You Can:

  • Use for educational and research purposes
  • Use for commercial projects (with attribution)
  • Modify and redistribute datasets
  • Create derivative works

❌ You Cannot:

  • Claim original ownership
  • Remove attribution from original sources

Citation Format:

Dataset Collection by itsluckysharma01
GitHub: https://github.com/itsluckysharma01/Datasets

πŸŽ‰ Getting Started Today!

Quick Checklist:

  • ⭐ Star this repository
  • πŸ“₯ Fork or clone the repo
  • πŸ“– Read this README
  • πŸ’» Install required packages
  • πŸš€ Pick a beginner dataset
  • πŸ“ Run the code examples
  • 🎯 Start your project!

πŸ“§ Support & Questions


🌟 Star This Repository If You Find It Helpful!

Made with ❀️ for the Data Science Community


⬆ Back to Top
