Locality Sensitive Hashing for Document Similarity

A repository for finding similar documents using Locality Sensitive Hashing (LSH). This project involves reading a dataset of 48,505 articles, applying shingling and minhashing, and using band hashing to efficiently detect similar documents.

Project Structure

LSH-Document-Similarity.ipynb: Jupyter Notebook implementing the full pipeline, including data preprocessing, hashing, and similarity detection.
Datasets:
- articles.json: Contains 48,505 articles with their content and metadata.
- submissions.csv: Final output file with nearest neighbors for each document.
Report.pdf: Explanation of the approach, results, and findings.

Technologies Used

Python
Pandas & NumPy
NLTK (for n-grams processing)
SciPy (for sparse matrix representation)
Matplotlib & Seaborn (for visualization)

Steps to Implement

Read the Data
- Load JSON articles.
- Clean text (remove punctuation, convert to lowercase).
Shingle the Documents
- Generate 2-grams (bigrams) for each article (can experiment with other n-values).
- Convert n-grams into a binary vector representation.
Convert n-grams to Binary Vectors
- Select the top 10,000 most frequent n-grams (optional optimization).
- Represent documents as sparse matrices using scipy.sparse.csr_matrix.
Generate Hash Functions
- Create multiple hash functions to map n-grams into buckets.
- Use different techniques: hash(value) % num_buckets, XOR with random integers, etc.
Compute MinHash Signatures
- Use a faster algorithm from the lecture notes.
- Generate a signature matrix for all documents.
Band Hashing
- Divide MinHash signatures into bands.
- Hash each band into buckets to group candidate similar documents.
Tune Parameters
- Experiment with different band sizes and hash functions.
- Plot the probability of similar articles falling into the same bucket.
Find Nearest Neighbors
- Use Jaccard Similarity to refine candidate pairs.
- Store nearest neighbors in submissions.csv.

How to Run

Clone the repository:

git clone https://github.com/YOUR_USERNAME/LSH-Document-Similarity.git
cd LSH-Document-Similarity

Install dependencies:
```
pip install -r requirements.txt
```

Run the Jupyter Notebook:

jupyter notebook LSH-Document-Similarity.ipynb

Results & Report

The final output file, submissions.csv, contains each article's nearest neighbors.
The Report.pdf explains the methodology, optimizations, and results.

Acknowledgments

This project follows the Locality Sensitive Hashing approach from the lecture materials.
Thanks to the course instructor for providing the dataset and guidance.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
LSH-Document-Similarity.ipynb		LSH-Document-Similarity.ipynb
README.md		README.md
submissions.csv		submissions.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Locality Sensitive Hashing for Document Similarity

Project Structure

Technologies Used

Steps to Implement

How to Run

Results & Report

Acknowledgments

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Locality Sensitive Hashing for Document Similarity

Project Structure

Technologies Used

Steps to Implement

How to Run

Results & Report

Acknowledgments

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages