A high-performance, scalable, multi-threaded text processing system built in Python for compliance scoring, search, storage, and analytics.
Process large volumes of text, apply rule-based scoring, store results with deduplication, and visualize analytics—all with one command.
Get the app running in 3 minutes:
# 1. Clone the repository
git clone https://github.com/ayush1k/Python-Parallel-Text-Handling-Processor.git
cd Python-Parallel-Text-Handling-Processor
# 2. Run automated setup (Linux/Mac)
bash setup.sh
# 3. Run the Streamlit dashboard
source .venv/bin/activate
streamlit run streamlit_app.py
Windows users: Use setup.bat instead of setup.sh
That's it! The dashboard opens at http://localhost:8501 with sample data ready to explore.
- ✅ Upload & process text files — Batch or individual
- ✅ Apply custom scoring rules — Regex, keywords, patterns
- ✅ Auto-deduplicate chunks — SHA-256 hashing
- ✅ Search results — Keyword & regex search
- ✅ Export to CSV — For analysis in Excel/Sheets
- ✅ Generate PDF reports — With charts and word clouds
- ✅ Auto-suggest rules — AI-like pattern detection
- ✅ View analytics — Charts, histograms, metrics
- ✅ Send email summaries — Optional alerts
| Feature | Details |
|---|---|
| ⚡ Parallel Processing | Multi-threaded chunk scoring (configurable workers) |
| 📚 Rule Engine | 7+ pre-configured rule types (keyword, regex, length, etc.) |
| 🗃️ Smart Storage | SQLite with hash-based deduplication |
| 🔍 Full Search | Keyword and regex pattern matching |
| 📊 Dashboard | Streamlit UI with file upload, analytics, rule editor |
| 🤖 Smart Rules | Auto-generates new rules from frequent patterns |
| 📈 Reporting | PDF generation with charts and statistics |
| 📤 Export | CSV export with full metadata |
| 📧 Email Alerts | SMTP integration for notifications |
| 🛠 Extensible | Clean architecture — easy to add custom rules |
Perfect for exploration, rule testing, and visualization:
streamlit run streamlit_app.py
Features:
- File upload & management
- Live pipeline execution with progress
- Rule editor with backup
- Search & filter records
- Analytics charts (scores, wordcloud, rule hits)
- PDF report builder
- Storage improver suggestions
Perfect for automation and large-scale processing:
python run.py
Automatically:
- Loads all .txt files from data/support_text_files/
- Applies rules from data/rules1.json
- Chunks and scores in parallel (see the sketch below)
- Deduplicates based on text hash
- Saves to SQLite database
- Runs storage improver (suggests new rules)
- Exports search results to CSV
- Generates email summary (optional)
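The parallel scoring step relies on ThreadPoolExecutor (see the tech stack below). Here is a minimal sketch of how chunks might be scored concurrently; the `apply_rules` helper and the worker count are illustrative assumptions, not the exact implementation in the repo:

```python
from concurrent.futures import ThreadPoolExecutor

def apply_rules(chunk: str, rules: list[dict]) -> int:
    """Hypothetical helper: total score for one chunk under keyword rules."""
    score = 0
    for rule in rules:
        if rule["type"] == "keyword_any":
            if any(kw in chunk.lower() for kw in rule["keywords"]):
                score += rule["score"]
    return score

def score_chunks(chunks: list[str], rules: list[dict], workers: int = 6) -> list[int]:
    # Each chunk is scored independently, so threads can run side by side.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda c: apply_rules(c, rules), chunks))

if __name__ == "__main__":
    rules = [{"type": "keyword_any", "keywords": ["urgent", "critical"], "score": 10}]
    print(score_chunks(["Urgent: server down", "routine question"], rules))
```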
- Python 3.9 or higher
- Git
git clone https://github.com/ayush1k/Python-Parallel-Text-Handling-Processor.git
cd Python-Parallel-Text-Handling-Processor
Linux/Mac:
bash setup.sh
Windows:
setup.bat
This will:
- Create Python virtual environment
- Install all dependencies from requirements.txt
- Create necessary folders
# Create virtual environment
python3 -m venv .venv
# Activate it
source .venv/bin/activate # Linux/Mac
# or
.venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
.
├── app/ # Core application modules
│ ├── checker/
│ │ ├── checker.py # Rule evaluation engine
│ │ └── rules.py # Rule definitions & evaluators
│ ├── storage/
│ │ ├── storage.py # SQLite database layer
│ │ └── storage_improver.py # Auto-rule generator
│ ├── text_processing/
│ │ ├── text_breaker.py # Text cleaning & chunking
│ │ ├── text_loader.py # File loading
│ │ └── parallel_break_loader.py # Full pipeline orchestrator
│ ├── search_export/
│ │ ├── search_save.py # Search & CSV export
│ │ └── emailer.py # Email notifications
│ └── utils.py # Shared utilities
│
├── data/
│ ├── rules1.json # ✨ Pre-configured scoring rules
│ └── support_text_files/
│ ├── sample1.txt # ✨ Sample urgent ticket
│ ├── sample2.txt # ✨ Sample routine inquiry
│ └── sample3.txt # ✨ Sample critical alert
│
├── output/ # Generated CSV exports
├── improver_output/ # Auto-generated rule suggestions
│
├── streamlit_app.py # 🎨 Dashboard UI
├── run.py # 🤖 Batch pipeline
├── requirements.txt # ✨ Python dependencies
├── .env # ✨ Configuration (with defaults)
├── setup.sh / setup.bat # ✨ Automated setup scripts
├── SETUP.md # ✨ Detailed setup guide
├── INSTALL_COMPLETE.md # ✨ Installation summary
├── LICENSE # MIT License
└── README.md # This file
✨ = Newly added permanent files
Default configuration file included. Customize as needed:
# Database
DB_PATH=checks.db
# Folders
TEXT_FOLDER=data/support_text_files
RULES_PATH=data/rules1.json
EXPORT_DIR=output
# Email (optional)
SMTP_SERVER=smtp.gmail.com
SMTP_PORT=587
EMAIL_ADDRESS=your_email@gmail.com
EMAIL_PASSWORD=your_app_password
EMAIL_FROM=your_email@gmail.com
EMAIL_TO=recipient@example.com
# Logging
LOG_LEVEL=INFO
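Since python-dotenv is part of the tech stack, the settings above are presumably read from the environment. A minimal sketch of loading them (variable names mirror the .env shown above; the app's actual loading code may differ):

```python
import os
from dotenv import load_dotenv  # python-dotenv

load_dotenv()  # reads key=value pairs from .env into the environment

DB_PATH = os.getenv("DB_PATH", "checks.db")
TEXT_FOLDER = os.getenv("TEXT_FOLDER", "data/support_text_files")
RULES_PATH = os.getenv("RULES_PATH", "data/rules1.json")
EXPORT_DIR = os.getenv("EXPORT_DIR", "output")
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")
```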
Each rule has:
- id: Unique identifier
- type: Rule type (see below)
- score: Points awarded if matched
- Custom parameters based on type
Available Rule Types:
- keyword_any - Matches any keyword in a list
- contains_phrase - Matches an exact phrase
- regex_match - Regex pattern matching
- word_count_min - Minimum word count
- length_min - Minimum character length
- uppercase_ratio - Uppercase character ratio
- starts_with / ends_with - Text boundaries
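A minimal sketch of how a few of these rule types might be evaluated. This is a hypothetical evaluator; field names such as phrase, pattern, min_words, min_length, min_ratio, prefix, and suffix are illustrative, not the exact schema used in app/checker/rules.py:

```python
import re

def evaluate_rule(rule: dict, text: str) -> bool:
    """Return True if the chunk matches the rule; a match adds rule['score'] to the chunk."""
    kind = rule["type"]
    if kind == "keyword_any":
        lowered = text.lower()
        return any(kw.lower() in lowered for kw in rule["keywords"])
    if kind == "contains_phrase":
        return rule["phrase"].lower() in text.lower()
    if kind == "regex_match":
        return re.search(rule["pattern"], text) is not None
    if kind == "word_count_min":
        return len(text.split()) >= rule["min_words"]
    if kind == "length_min":
        return len(text) >= rule["min_length"]
    if kind == "uppercase_ratio":
        letters = [c for c in text if c.isalpha()]
        ratio = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
        return ratio >= rule["min_ratio"]
    if kind == "starts_with":
        return text.startswith(rule["prefix"])
    if kind == "ends_with":
        return text.endswith(rule["suffix"])
    return False
```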
Example rule:
{
"id": 1,
"type": "keyword_any",
"keywords": ["urgent", "critical", "important"],
"score": 10,
"description": "Urgent keywords"
}
Input Text Files
↓
[Text Loader] - Load & clean raw text
↓
[Text Breaker] - Split into word chunks
↓
[Deduplicator] - Hash-based dedup check
↓
[Rule Checker] - Apply scoring rules (parallel)
↓
[Storage Layer] - Save to SQLite database
↓
[Outputs] - CSV, Reports, Emails
- Text Ingestion - Loads .txt files, normalizes whitespace
- Chunking - Splits text into fixed-size word groups, assigns UIDs (see the sketch after this list)
- Hashing - Computes SHA-256 for deduplication
- Scoring - Applies rules in parallel threads
- Storage - SQLite with indexes for fast queries
- Analysis - Searches, exports, generates reports
- Intelligence - Auto-suggests new rules from patterns
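A minimal sketch of the chunking and hashing steps described above: fixed-size word groups, a UID per chunk, and a SHA-256 digest used for deduplication. The chunk size and UID scheme here are illustrative assumptions:

```python
import hashlib
import uuid

def break_text(text: str, chunk_size: int = 50) -> list[dict]:
    """Split normalized text into fixed-size word groups with a UID and a dedup hash."""
    words = text.split()
    chunks = []
    for start in range(0, len(words), chunk_size):
        chunk_text = " ".join(words[start:start + chunk_size])
        chunks.append({
            "uid": uuid.uuid4().hex,
            "text": chunk_text,
            # SHA-256 of the chunk text; identical chunks hash to the same value.
            "hash": hashlib.sha256(chunk_text.encode("utf-8")).hexdigest(),
        })
    return chunks
```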
Place .txt files in data/support_text_files/:
data/support_text_files/
├── ticket_1.txt
├── ticket_2.txt
└── customer_feedback.txt
Run the pipeline or upload via dashboard.
Edit data/rules1.json to add new rules:
{
"id": 8,
"type": "keyword_any",
"keywords": ["password", "authentication", "login"],
"score": 15,
"description": "Security-related keywords"
}
Dashboard has a built-in rules editor with backup!
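Outside the dashboard, a rule can also be appended programmatically. A small sketch, assuming rules1.json holds a JSON array and writing a timestamped backup first (the backup naming is illustrative):

```python
import json
import shutil
import time
from pathlib import Path

RULES_PATH = Path("data/rules1.json")

def add_rule(new_rule: dict) -> None:
    # Keep a backup before editing, in the same spirit as the dashboard's backup feature.
    shutil.copy(RULES_PATH, f"{RULES_PATH}.{int(time.time())}.bak")
    rules = json.loads(RULES_PATH.read_text(encoding="utf-8"))
    rules.append(new_rule)
    RULES_PATH.write_text(json.dumps(rules, indent=2), encoding="utf-8")

add_rule({
    "id": 8,
    "type": "keyword_any",
    "keywords": ["password", "authentication", "login"],
    "score": 15,
    "description": "Security-related keywords",
})
```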
Dashboard provides:
- CSV export for Excel analysis
- PDF reports with charts
- Rule hit summaries
- Wordcloud visualizations
Each chunk gets a score based on rules:
- 0-20: Low priority/normal
- 20-40: Medium priority/attention needed
- 40+: High priority/urgent
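A tiny sketch of mapping a chunk score to those bands (the thresholds mirror the list above):

```python
def priority(score: int) -> str:
    if score >= 40:
        return "high"    # urgent
    if score >= 20:
        return "medium"  # attention needed
    return "low"         # normal

print(priority(12), priority(25), priority(55))  # low medium high
```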
Auto-generates new rules from frequent patterns:
python run.py
Output → improver_output/suggestions.json
Example suggestion:
{
"type": "keyword_any",
"keywords": ["customer"],
"score": 1,
"source": "auto-generated"
}
You can manually review and add these to rules1.json.
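One plausible way the storage improver could derive suggestions like the one above is to count frequent words across stored chunks and emit keyword_any rules. This is a hedged sketch, not the actual improver logic; the threshold, word filter, and output path are illustrative:

```python
import json
from collections import Counter
from pathlib import Path

def suggest_rules(chunks: list[str], min_count: int = 25) -> list[dict]:
    counts = Counter(word.lower() for chunk in chunks for word in chunk.split())
    return [
        {"type": "keyword_any", "keywords": [word], "score": 1, "source": "auto-generated"}
        for word, n in counts.most_common(20)
        if n >= min_count and len(word) > 3
    ]

suggestions = suggest_rules(
    ["the customer reported an issue", "customer asked about billing"], min_count=2
)
Path("improver_output").mkdir(exist_ok=True)
Path("improver_output/suggestions.json").write_text(json.dumps(suggestions, indent=2))
```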
Optional SMTP integration sends email alerts:
- Add valid Gmail/SMTP credentials to .env
- Set SEND_EMAIL = True in run.py
- Pipeline will email summaries with:
- Total chunks processed
- Average score
- Top high-scoring items
- Alerts for low-score items
Note: For Gmail, use App Passwords.
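A minimal sketch of the email step using Python's standard smtplib, with credentials read from the .env variables shown earlier. The subject and body are illustrative:

```python
import os
import smtplib
from email.message import EmailMessage

def send_summary(body: str) -> None:
    msg = EmailMessage()
    msg["Subject"] = "Text processing summary"
    msg["From"] = os.environ["EMAIL_FROM"]
    msg["To"] = os.environ["EMAIL_TO"]
    msg.set_content(body)

    with smtplib.SMTP(os.environ["SMTP_SERVER"], int(os.environ["SMTP_PORT"])) as server:
        server.starttls()  # Gmail on port 587 requires STARTTLS
        server.login(os.environ["EMAIL_ADDRESS"], os.environ["EMAIL_PASSWORD"])
        server.send_message(msg)
```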
Dashboard search supports:
- Keywords: find all chunks containing "error"
- Regex: find chunks matching pattern \d{4}
- Score ranges: find chunks with score 20-40
Results export directly to CSV.
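A minimal sketch of keyword/regex search over the SQLite store followed by a CSV export via pandas. The table and column names (chunks, text) are assumptions, not the exact schema:

```python
import sqlite3
from typing import Optional

import pandas as pd

def search_chunks(db_path: str, keyword: Optional[str] = None,
                  pattern: Optional[str] = None) -> pd.DataFrame:
    conn = sqlite3.connect(db_path)
    df = pd.read_sql_query("SELECT * FROM chunks", conn)  # assumed table name
    conn.close()
    if keyword:
        df = df[df["text"].str.contains(keyword, case=False, regex=False, na=False)]
    if pattern:
        df = df[df["text"].str.contains(pattern, regex=True, na=False)]
    return df

results = search_chunks("checks.db", keyword="error")
results.to_csv("output/search_results.csv", index=False)
```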
Dashboard provides:
- Score distribution - Histogram of all scores
- Top rules - Most frequently triggered rules
- Wordcloud - Visualize high-frequency terms
- Timeline - Chunks processed over time
- Statistics - Mean, median, min, max scores
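A small sketch of the score statistics and histogram, assuming results are available as a DataFrame with a score column (the input CSV here is illustrative; plotting uses Matplotlib from the tech stack):

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("output/search_results.csv")  # any exported results CSV with a 'score' column

print(df["score"].describe())  # mean, min, max, quartiles
print(df["score"].median())    # median score

df["score"].plot(kind="hist", bins=20, title="Score distribution")
plt.xlabel("score")
plt.savefig("output/score_histogram.png")
```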
| Component | Technology |
|---|---|
| Language | Python 3.9+ |
| Database | SQLite3 |
| UI | Streamlit |
| Parallelism | ThreadPoolExecutor |
| Data | Pandas |
| Visualization | Plotly, Matplotlib, WordCloud |
| Reports | ReportLab |
| SMTP | smtplib (standard library) |
| Configuration | python-dotenv |
- SETUP.md - Detailed setup and troubleshooting
- INSTALL_COMPLETE.md - Installation summary
- LICENSE - MIT License
Tested with:
- ✅ 10,000+ text chunks
- ✅ 100+ scoring rules
- ✅ Parallel processing (6 workers)
- ✅ SHA-256 hashing for 10,000+ items
Typical performance:
- Chunk processing: ~100 chunks/second
- Rule evaluation: ~50ms per chunk
- Deduplication: <1ms per chunk
- ML-based scoring (BERT, spaCy)
- FastAPI REST endpoints
- Vector embeddings & semantic search
- Rule auto-learning with feedback
- Docker containerization
- Postgres support (scale beyond SQLite)
- Real-time streaming pipeline
- Advanced visualization (3D plots, networks)
Port 8501 already in use?
streamlit run streamlit_app.py --server.port 8502
Need to reset the database?
rm checks.db
python run.py # Recreates fresh
Dependencies installation failed?
pip install --upgrade pip
pip install -r requirements.txt
Email not sending?
- Verify credentials in .env
- Check Gmail App Passwords
- Check SMTP settings and firewall
See SETUP.md for more troubleshooting.
Contributions welcome! Areas to improve:
- Additional rule types
- Machine learning integration
- API layer
- Performance optimization
- Documentation
Please open an issue or pull request.
Project Lead:
- Charan Teja Mangali — Lead Developer, System Architect & Mentor
Contributors:
- Ayush Kumar — Full-stack implementation
MIT License — See LICENSE for details.
You are free to:
- ✅ Use for commercial and private purposes
- ✅ Modify and distribute
- ✅ Include in projects
With the condition:
- Include original license and copyright notice
If this project helped you, please consider:
- ⭐ Giving it a star on GitHub
- 🔗 Sharing with others
- 💬 Leaving feedback
- 🐛 Reporting issues
- 📖 Check SETUP.md and INSTALL_COMPLETE.md
- 🐛 Open an issue
- 📧 Contact maintainers
Happy Text Processing! 🚀