Python Parallel Text Handling Processor

A high-performance, scalable, multi-threaded text processing system built in Python for compliance scoring, search, storage, and analytics.

Python 3.9+ | License: MIT | Streamlit App

Process large volumes of text, apply rule-based scoring, store results with deduplication, and visualize analytics—all with one command.


Quick Start

Get the app running in 3 minutes:

# 1. Clone the repository
git clone https://github.com/ayush1k/Python-Parallel-Text-Handling-Processor.git
cd Python-Parallel-Text-Handling-Processor

# 2. Run automated setup (Linux/Mac)
bash setup.sh

# 3. Run the Streamlit dashboard
source .venv/bin/activate
streamlit run streamlit_app.py

Windows users: Use setup.bat instead of setup.sh

That's it! The dashboard opens at http://localhost:8501 with sample data ready to explore.


What You Can Do

  • Upload & process text files — Batch or individual
  • Apply custom scoring rules — Regex, keywords, patterns
  • Auto-deduplicate chunks — SHA-256 hashing
  • Search results — Keyword & regex search
  • Export to CSV — For analysis in Excel/Sheets
  • Generate PDF reports — With charts and word clouds
  • Auto-suggest rules — AI-like pattern detection
  • View analytics — Charts, histograms, metrics
  • Send email summaries — Optional alerts

Key Features

| Feature | Details |
| --- | --- |
| ⚡ Parallel Processing | Multi-threaded chunk scoring (configurable workers) |
| 📚 Rule Engine | 7+ pre-configured rule types (keyword, regex, length, etc.) |
| 🗃️ Smart Storage | SQLite with hash-based deduplication |
| 🔍 Full Search | Keyword and regex pattern matching |
| 📊 Dashboard | Streamlit UI with file upload, analytics, rule editor |
| 🤖 Smart Rules | Auto-generates new rules from frequent patterns |
| 📈 Reporting | PDF generation with charts and statistics |
| 📤 Export | CSV export with full metadata |
| 📧 Email Alerts | SMTP integration for notifications |
| 🛠 Extensible | Clean architecture — easy to add custom rules |

Two Ways to Run

Option 1: Interactive Dashboard (Recommended)

Perfect for exploration, rule testing, and visualization:

streamlit run streamlit_app.py

Features:

  • File upload & management
  • Live pipeline execution with progress
  • Rule editor with backup
  • Search & filter records
  • Analytics charts (scores, wordcloud, rule hits)
  • PDF report builder
  • Storage improver suggestions

Option 2: Batch Pipeline

Perfect for automation and large-scale processing:

python run.py

Automatically:

  1. Loads all .txt files from data/support_text_files/
  2. Applies rules from data/rules1.json
  3. Chunks and scores in parallel
  4. Deduplicates based on text hash
  5. Saves to SQLite database
  6. Runs storage improver (suggests new rules)
  7. Exports search results to CSV
  8. Generates email summary (optional)
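The eight steps above can be sketched end to end. The helper names below are hypothetical stand-ins for the modules under app/, and the rule handling is reduced to keyword rules only:

```python
# Minimal sketch of the batch pipeline (hypothetical helpers, not the real app/ API).
import hashlib
from concurrent.futures import ThreadPoolExecutor

def chunk_text(text, words_per_chunk=50):
    """Split cleaned text into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)]

def score_chunk(chunk, rules):
    """Apply keyword rules to one chunk and sum the matched scores."""
    lowered = chunk.lower()
    return sum(r["score"] for r in rules
               if r["type"] == "keyword_any"
               and any(k in lowered for k in r["keywords"]))

def run_pipeline(texts, rules, workers=6):
    """Chunk, deduplicate by SHA-256 of the text, then score in parallel."""
    seen, unique = set(), []
    for text in texts:
        for chunk in chunk_text(text):
            h = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
            if h not in seen:
                seen.add(h)
                unique.append(chunk)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(lambda c: score_chunk(c, rules), unique))
    return list(zip(unique, scores))
```

Storage, the improver, CSV export, and email (steps 5-8) consume the `(chunk, score)` pairs this sketch returns.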

Installation

Prerequisites

  • Python 3.9 or higher
  • Git

Step 1: Clone Repository

git clone https://github.com/ayush1k/Python-Parallel-Text-Handling-Processor.git
cd Python-Parallel-Text-Handling-Processor

Step 2: Automated Setup (Recommended)

Linux/Mac:

bash setup.sh

Windows:

setup.bat

This will:

  • Create Python virtual environment
  • Install all dependencies from requirements.txt
  • Create necessary folders

Step 3: Manual Setup (Alternative)

# Create virtual environment
python3 -m venv .venv

# Activate it
source .venv/bin/activate    # Linux/Mac
# or
.venv\Scripts\activate       # Windows

# Install dependencies
pip install -r requirements.txt

Project Structure

.
├── app/                              # Core application modules
│   ├── checker/
│   │   ├── checker.py               # Rule evaluation engine
│   │   └── rules.py                 # Rule definitions & evaluators
│   ├── storage/
│   │   ├── storage.py               # SQLite database layer
│   │   └── storage_improver.py      # Auto-rule generator
│   ├── text_processing/
│   │   ├── text_breaker.py          # Text cleaning & chunking
│   │   ├── text_loader.py           # File loading
│   │   └── parallel_break_loader.py # Full pipeline orchestrator
│   ├── search_export/
│   │   ├── search_save.py           # Search & CSV export
│   │   └── emailer.py               # Email notifications
│   └── utils.py                     # Shared utilities
│
├── data/
│   ├── rules1.json                  # ✨ Pre-configured scoring rules
│   └── support_text_files/
│       ├── sample1.txt              # ✨ Sample urgent ticket
│       ├── sample2.txt              # ✨ Sample routine inquiry
│       └── sample3.txt              # ✨ Sample critical alert
│
├── output/                          # Generated CSV exports
├── improver_output/                 # Auto-generated rule suggestions
│
├── streamlit_app.py                 # 🎨 Dashboard UI
├── run.py                           # 🤖 Batch pipeline
├── requirements.txt                 # ✨ Python dependencies
├── .env                             # ✨ Configuration (with defaults)
├── setup.sh / setup.bat             # ✨ Automated setup scripts
├── SETUP.md                         # ✨ Detailed setup guide
├── INSTALL_COMPLETE.md              # ✨ Installation summary
├── LICENSE                          # MIT License
└── README.md                        # This file

✨ = Newly added files


⚙️ Configuration

.env File

Default configuration file included. Customize as needed:

# Database
DB_PATH=checks.db

# Folders
TEXT_FOLDER=data/support_text_files
RULES_PATH=data/rules1.json
EXPORT_DIR=output

# Email (optional)
SMTP_SERVER=smtp.gmail.com
SMTP_PORT=587
EMAIL_ADDRESS=your_email@gmail.com
EMAIL_PASSWORD=your_app_password
EMAIL_FROM=your_email@gmail.com
EMAIL_TO=recipient@example.com

# Logging
LOG_LEVEL=INFO
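The project loads these values with python-dotenv; below is a stdlib-only sketch of the equivalent lookup, using the defaults shown above:

```python
# Stdlib-only sketch of reading .env settings; in the project itself
# this is handled by python-dotenv's load_dotenv().
import os

def load_env(path=".env"):
    """Parse simple KEY=VALUE lines into os.environ, skipping comments."""
    try:
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#") or "=" not in line:
                    continue
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())
    except FileNotFoundError:
        pass  # no .env present: fall back to the defaults below

load_env()
DB_PATH = os.environ.get("DB_PATH", "checks.db")
RULES_PATH = os.environ.get("RULES_PATH", "data/rules1.json")
```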

data/rules1.json Format

Each rule has:

  • id: Unique identifier
  • type: Rule type (see below)
  • score: Points awarded if matched
  • Custom parameters based on type

Available Rule Types:

  • keyword_any - Matches any keyword in a list
  • contains_phrase - Matches an exact phrase
  • regex_match - Regex pattern matching
  • word_count_min - Minimum word count
  • length_min - Minimum character length
  • uppercase_ratio - Uppercase character ratio
  • starts_with / ends_with - Text boundaries

Example rule:

{
  "id": 1,
  "type": "keyword_any",
  "keywords": ["urgent", "critical", "important"],
  "score": 10,
  "description": "Urgent keywords"
}
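As an illustration, the rule types can be dispatched to small evaluator functions. The parameter names `phrase`, `min_words`, and `min_length` are assumptions for this sketch; the authoritative definitions live in app/checker/rules.py:

```python
# Illustrative dispatch of rule types to evaluators. Parameter names such
# as "phrase", "min_words", and "min_length" are assumptions for this
# sketch; see app/checker/rules.py for the real definitions.
import re

EVALUATORS = {
    "keyword_any":     lambda r, t: any(k.lower() in t.lower() for k in r["keywords"]),
    "contains_phrase": lambda r, t: r["phrase"].lower() in t.lower(),
    "regex_match":     lambda r, t: re.search(r["pattern"], t) is not None,
    "word_count_min":  lambda r, t: len(t.split()) >= r["min_words"],
    "length_min":      lambda r, t: len(t) >= r["min_length"],
}

def apply_rule(rule, text):
    """Award the rule's score when its evaluator matches the text."""
    return rule["score"] if EVALUATORS[rule["type"]](rule, text) else 0
```

With this dispatch, the example rule above would award 10 points to any chunk containing "urgent", "critical", or "important".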

System Architecture

Processing Pipeline

Input Text Files
      ↓
[Text Loader] - Load & clean raw text
      ↓
[Text Breaker] - Split into word chunks
      ↓
[Deduplicator] - Hash-based dedup check
      ↓
[Rule Checker] - Apply scoring rules (parallel)
      ↓
[Storage Layer] - Save to SQLite database
      ↓
[Outputs] - CSV, Reports, Emails

Key Components

  1. Text Ingestion - Loads .txt files, normalizes whitespace
  2. Chunking - Splits text into fixed-size word groups, assigns UIDs
  3. Hashing - Computes SHA-256 for deduplication
  4. Scoring - Applies rules in parallel threads
  5. Storage - SQLite with indexes for fast queries
  6. Analysis - Searches, exports, generates reports
  7. Intelligence - Auto-suggests new rules from patterns

Usage Examples

Add Custom Text Files

Place .txt files in data/support_text_files/:

data/support_text_files/
├── ticket_1.txt
├── ticket_2.txt
└── customer_feedback.txt

Run the pipeline or upload via dashboard.

Create Custom Rules

Edit data/rules1.json to add new rules:

{
  "id": 8,
  "type": "keyword_any",
  "keywords": ["password", "authentication", "login"],
  "score": 15,
  "description": "Security-related keywords"
}

Dashboard has a built-in rules editor with backup!

Export Results

Dashboard provides:

  • CSV export for Excel analysis
  • PDF reports with charts
  • Rule hit summaries
  • Wordcloud visualizations

Interpret Scores

Each chunk gets a score based on rules:

  • 0-19: Low priority / normal
  • 20-39: Medium priority / attention needed
  • 40+: High priority / urgent
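These bands are a reading convention rather than something the pipeline enforces; as a helper they might look like:

```python
# Helper mapping a chunk's score to the priority bands listed above.
def priority(score):
    if score >= 40:
        return "high"
    if score >= 20:
        return "medium"
    return "low"
```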

Storage Improver

Auto-generates new rules from frequent patterns:

python run.py

Output → improver_output/suggestions.json

Example suggestion:

{
  "type": "keyword_any",
  "keywords": ["customer"],
  "score": 1,
  "source": "auto-generated"
}

You can manually review and add these to rules1.json.
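One plausible shape for the suggester, assuming it counts frequent non-stopword terms (the real logic is in app/storage/storage_improver.py, and the stopword list here is invented):

```python
# Sketch of an auto-suggester that proposes keyword rules from frequent
# terms. Illustrative only; see app/storage/storage_improver.py.
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "is", "in"}

def suggest_rules(chunks, min_count=3):
    """Propose a keyword_any rule for each term seen at least min_count times."""
    counts = Counter(
        w for c in chunks for w in c.lower().split() if w not in STOPWORDS
    )
    return [
        {"type": "keyword_any", "keywords": [w], "score": 1, "source": "auto-generated"}
        for w, n in counts.most_common() if n >= min_count
    ]
```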


Email Summaries

Optional SMTP integration sends email alerts:

  1. Add valid Gmail/SMTP credentials to .env
  2. Set SEND_EMAIL = True in run.py
  3. Pipeline will email summaries with:
    • Total chunks processed
    • Average score
    • Top high-scoring items
    • Alerts for low-score items

Note: For Gmail, use App Passwords.
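A sketch of the summary sender using the stdlib smtplib and email modules; the field names and message layout here are illustrative, not the exact output of app/search_export/emailer.py:

```python
# Illustrative SMTP summary sender. Field names and layout are assumptions;
# the real sender is app/search_export/emailer.py, configured via .env.
import smtplib
from email.message import EmailMessage

def build_summary(total, avg_score, top_items):
    """Compose a plain-text summary; top_items is a list of (text, score)."""
    msg = EmailMessage()
    msg["Subject"] = f"Pipeline summary: {total} chunks, avg score {avg_score:.1f}"
    body = "\n".join(f"{score:>4}  {text[:60]}" for text, score in top_items)
    msg.set_content("Top items:\n" + body)
    return msg

def send_summary(msg, server, port, user, password, to):
    """Deliver the summary over STARTTLS (e.g. smtp.gmail.com:587)."""
    with smtplib.SMTP(server, port) as smtp:
        smtp.starttls()             # Gmail requires TLS on port 587
        smtp.login(user, password)  # use an App Password, not the account password
        msg["From"], msg["To"] = user, to
        smtp.send_message(msg)
```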


Search & Filter

Dashboard search supports:

  • Keywords: find all chunks containing "error"
  • Regex: find chunks matching pattern \d{4}
  • Score ranges: find chunks with score 20-40

Results export directly to CSV.
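The two text search modes differ in where the matching happens: keyword search can run in SQL via LIKE, while regex filtering happens in Python. A minimal sqlite3 sketch (in-memory here; the app queries checks.db, whose actual schema may differ):

```python
# Sketch of keyword and regex search over the chunk store. Uses an
# in-memory database with an assumed schema; the app queries checks.db.
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (uid INTEGER PRIMARY KEY, text TEXT, score INTEGER)")
conn.executemany("INSERT INTO chunks (text, score) VALUES (?, ?)",
                 [("error code 1234", 25), ("all good", 2)])

def search_keyword(term):
    """Substring search pushed down to SQL via LIKE."""
    cur = conn.execute("SELECT text, score FROM chunks WHERE text LIKE ?",
                       (f"%{term}%",))
    return cur.fetchall()

def search_regex(pattern):
    """Regex filtering done in Python over fetched rows."""
    rx = re.compile(pattern)
    cur = conn.execute("SELECT text, score FROM chunks")
    return [(t, s) for t, s in cur.fetchall() if rx.search(t)]
```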


Analytics

Dashboard provides:

  • Score distribution - Histogram of all scores
  • Top rules - Most frequently triggered rules
  • Wordcloud - Visualize high-frequency terms
  • Timeline - Chunks processed over time
  • Statistics - Mean, median, min, max scores
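For the statistics row, Python's stdlib statistics module is enough; a minimal sketch:

```python
# The dashboard's summary statistics reduce to the stdlib statistics module.
import statistics

def score_stats(scores):
    """Mean, median, min, and max over all chunk scores."""
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "min": min(scores),
        "max": max(scores),
    }
```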

Tech Stack

| Component | Technology |
| --- | --- |
| Language | Python 3.9+ |
| Database | SQLite3 |
| UI | Streamlit |
| Parallelism | ThreadPoolExecutor |
| Data | Pandas |
| Visualization | Plotly, Matplotlib, WordCloud |
| Reports | ReportLab |
| Email | SMTP |
| Configuration | python-dotenv |


🚀 Performance

Tested with:

  • ✅ 10,000+ text chunks
  • ✅ 100+ scoring rules
  • ✅ Parallel processing (6 workers)
  • ✅ SHA-256 hashing for 10,000+ items

Typical performance:

  • Chunk processing: ~100 chunks/second
  • Rule evaluation: ~50ms per chunk
  • Deduplication: <1ms per chunk

🔮 Future Roadmap

  • ML-based scoring (BERT, spaCy)
  • FastAPI REST endpoints
  • Vector embeddings & semantic search
  • Rule auto-learning with feedback
  • Docker containerization
  • Postgres support (scale beyond SQLite)
  • Real-time streaming pipeline
  • Advanced visualization (3D plots, networks)

Troubleshooting

Port 8501 already in use?

streamlit run streamlit_app.py --server.port 8502

Need to reset the database?

rm checks.db
python run.py  # Recreates fresh

Dependencies installation failed?

pip install --upgrade pip
pip install -r requirements.txt

Email not sending?

  • Verify credentials in .env
  • Check Gmail App Passwords
  • Check SMTP settings and firewall

See SETUP.md for more troubleshooting.


Contributing

Contributions welcome! Areas to improve:

  • Additional rule types
  • Machine learning integration
  • API layer
  • Performance optimization
  • Documentation

Please open an issue or pull request.


Credits

Project Lead:

  • Charan Teja Mangali — Lead Developer, System Architect & Mentor

Contributors:

  • Ayush Kumar — Full-stack implementation

License

MIT License — See LICENSE for details.

You are free to:

  • ✅ Use for commercial and private purposes
  • ✅ Modify and distribute
  • ✅ Include in projects

On the condition that you:

  • Include the original license and copyright notice

Show Your Support

If this project helped you, please consider:

  • ⭐ Giving it a star on GitHub
  • 🔗 Sharing with others
  • 💬 Leaving feedback
  • 🐛 Reporting issues

Happy Text Processing! 🚀
