A repository for finding similar documents using Locality Sensitive Hashing (LSH). This project involves reading a dataset of 48,505 articles, applying shingling and minhashing, and using band hashing to efficiently detect similar documents.
- LSH-Document-Similarity.ipynb: Jupyter Notebook implementing the full pipeline, including data preprocessing, hashing, and similarity detection.
- Datasets:
articles.json: Contains 48,505 articles with their content and metadata.submissions.csv: Final output file with nearest neighbors for each document.
- Report.pdf: Explanation of the approach, results, and findings.
- Python
- Pandas & NumPy
- NLTK (for n-grams processing)
- SciPy (for sparse matrix representation)
- Matplotlib & Seaborn (for visualization)
-
Read the Data
- Load JSON articles.
- Clean text (remove punctuation, convert to lowercase).
-
Shingle the Documents
- Generate 2-grams (bigrams) for each article (can experiment with other n-values).
- Convert n-grams into a binary vector representation.
-
Convert n-grams to Binary Vectors
- Select the top 10,000 most frequent n-grams (optional optimization).
- Represent documents as sparse matrices using
scipy.sparse.csr_matrix.
-
Generate Hash Functions
- Create multiple hash functions to map n-grams into buckets.
- Use different techniques:
hash(value) % num_buckets, XOR with random integers, etc.
-
Compute MinHash Signatures
- Use a faster algorithm from the lecture notes.
- Generate a signature matrix for all documents.
-
Band Hashing
- Divide MinHash signatures into bands.
- Hash each band into buckets to group candidate similar documents.
-
Tune Parameters
- Experiment with different band sizes and hash functions.
- Plot the probability of similar articles falling into the same bucket.
-
Find Nearest Neighbors
- Use Jaccard Similarity to refine candidate pairs.
- Store nearest neighbors in
submissions.csv.
- Clone the repository:
git clone https://github.com/YOUR_USERNAME/LSH-Document-Similarity.git cd LSH-Document-Similarity - Install dependencies:
pip install -r requirements.txt
- Run the Jupyter Notebook:
jupyter notebook LSH-Document-Similarity.ipynb
- The final output file,
submissions.csv, contains each article's nearest neighbors. - The Report.pdf explains the methodology, optimizations, and results.
- This project follows the Locality Sensitive Hashing approach from the lecture materials.
- Thanks to the course instructor for providing the dataset and guidance.