Python Parallel Text Handling Processor

A high-performance, scalable, multi-threaded text processing system built in Python for compliance scoring, search, storage, and analytics.

Python 3.9+ | License: MIT | Streamlit App

Process large volumes of text, apply rule-based scoring, store results with deduplication, and visualize analytics—all with one command.


Quick Start

Get the app running in 3 minutes:

# 1. Clone the repository
git clone https://github.com/ayush1k/Python-Parallel-Text-Handling-Processor.git
cd Python-Parallel-Text-Handling-Processor

# 2. Run automated setup (Linux/Mac)
bash setup.sh

# 3. Run the Streamlit dashboard
source .venv/bin/activate
streamlit run streamlit_app.py

Windows users: Use setup.bat instead of setup.sh

That's it! The dashboard opens at http://localhost:8501 with sample data ready to explore.


What You Can Do

  • Upload & process text files — Batch or individual
  • Apply custom scoring rules — Regex, keywords, patterns
  • Auto-deduplicate chunks — SHA-256 hashing
  • Search results — Keyword & regex search
  • Export to CSV — For analysis in Excel/Sheets
  • Generate PDF reports — With charts and word clouds
  • Auto-suggest rules — AI-like pattern detection
  • View analytics — Charts, histograms, metrics
  • Send email summaries — Optional alerts

Key Features

| Feature | Details |
| --- | --- |
| ⚡ Parallel Processing | Multi-threaded chunk scoring (configurable workers) |
| 📚 Rule Engine | 7+ pre-configured rule types (keyword, regex, length, etc.) |
| 🗃️ Smart Storage | SQLite with hash-based deduplication |
| 🔍 Full Search | Keyword and regex pattern matching |
| 📊 Dashboard | Streamlit UI with file upload, analytics, rule editor |
| 🤖 Smart Rules | Auto-generates new rules from frequent patterns |
| 📈 Reporting | PDF generation with charts and statistics |
| 📤 Export | CSV export with full metadata |
| 📧 Email Alerts | SMTP integration for notifications |
| 🛠 Extensible | Clean architecture — easy to add custom rules |

Two Ways to Run

Option 1: Interactive Dashboard (Recommended)

Perfect for exploration, rule testing, and visualization:

streamlit run streamlit_app.py

Features:

  • File upload & management
  • Live pipeline execution with progress
  • Rule editor with backup
  • Search & filter records
  • Analytics charts (scores, wordcloud, rule hits)
  • PDF report builder
  • Storage improver suggestions

Option 2: Batch Pipeline

Perfect for automation and large-scale processing:

python run.py

Automatically:

  1. Loads all .txt files from data/support_text_files/
  2. Applies rules from data/rules1.json
  3. Chunks and scores in parallel
  4. Deduplicates based on text hash
  5. Saves to SQLite database
  6. Runs storage improver (suggests new rules)
  7. Exports search results to CSV
  8. Generates email summary (optional)
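The eight steps above can be sketched end to end. The helper names below are hypothetical stand-ins for the modules under app/, and the rule handling is reduced to keyword rules only:

```python
# Minimal sketch of the batch pipeline (hypothetical helpers, not the real app/ API).
import hashlib
from concurrent.futures import ThreadPoolExecutor

def chunk_text(text, words_per_chunk=50):
    """Split cleaned text into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)]

def score_chunk(chunk, rules):
    """Apply keyword rules to one chunk and sum the matched scores."""
    lowered = chunk.lower()
    return sum(r["score"] for r in rules
               if r["type"] == "keyword_any"
               and any(k in lowered for k in r["keywords"]))

def run_pipeline(texts, rules, workers=6):
    """Chunk, deduplicate by SHA-256 of the text, then score in parallel."""
    seen, unique = set(), []
    for text in texts:
        for chunk in chunk_text(text):
            h = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
            if h not in seen:
                seen.add(h)
                unique.append(chunk)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(lambda c: score_chunk(c, rules), unique))
    return list(zip(unique, scores))
```

Storage, the improver, CSV export, and email (steps 5-8) consume the `(chunk, score)` pairs this sketch returns.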

Installation

Prerequisites

  • Python 3.9 or higher
  • Git

Step 1: Clone Repository

git clone https://github.com/ayush1k/Python-Parallel-Text-Handling-Processor.git
cd Python-Parallel-Text-Handling-Processor

Step 2: Automated Setup (Recommended)

Linux/Mac:

bash setup.sh

Windows:

setup.bat

This will:

  • Create Python virtual environment
  • Install all dependencies from requirements.txt
  • Create necessary folders

Step 3: Manual Setup (Alternative)

# Create virtual environment
python3 -m venv .venv

# Activate it
source .venv/bin/activate    # Linux/Mac
# or
.venv\Scripts\activate       # Windows

# Install dependencies
pip install -r requirements.txt

Project Structure

.
├── app/                              # Core application modules
│   ├── checker/
│   │   ├── checker.py               # Rule evaluation engine
│   │   └── rules.py                 # Rule definitions & evaluators
│   ├── storage/
│   │   ├── storage.py               # SQLite database layer
│   │   └── storage_improver.py      # Auto-rule generator
│   ├── text_processing/
│   │   ├── text_breaker.py          # Text cleaning & chunking
│   │   ├── text_loader.py           # File loading
│   │   └── parallel_break_loader.py # Full pipeline orchestrator
│   ├── search_export/
│   │   ├── search_save.py           # Search & CSV export
│   │   └── emailer.py               # Email notifications
│   └── utils.py                     # Shared utilities
│
├── data/
│   ├── rules1.json                  # ✨ Pre-configured scoring rules
│   └── support_text_files/
│       ├── sample1.txt              # ✨ Sample urgent ticket
│       ├── sample2.txt              # ✨ Sample routine inquiry
│       └── sample3.txt              # ✨ Sample critical alert
│
├── output/                          # Generated CSV exports
├── improver_output/                 # Auto-generated rule suggestions
│
├── streamlit_app.py                 # 🎨 Dashboard UI
├── run.py                           # 🤖 Batch pipeline
├── requirements.txt                 # ✨ Python dependencies
├── .env                             # ✨ Configuration (with defaults)
├── setup.sh / setup.bat             # ✨ Automated setup scripts
├── SETUP.md                         # ✨ Detailed setup guide
├── INSTALL_COMPLETE.md              # ✨ Installation summary
├── LICENSE                          # MIT License
└── README.md                        # This file

✨ = Newly added files


⚙️ Configuration

.env File

Default configuration file included. Customize as needed:

# Database
DB_PATH=checks.db

# Folders
TEXT_FOLDER=data/support_text_files
RULES_PATH=data/rules1.json
EXPORT_DIR=output

# Email (optional)
SMTP_SERVER=smtp.gmail.com
SMTP_PORT=587
EMAIL_ADDRESS=your_email@gmail.com
EMAIL_PASSWORD=your_app_password
EMAIL_FROM=your_email@gmail.com
EMAIL_TO=recipient@example.com

# Logging
LOG_LEVEL=INFO
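The project loads these values with python-dotenv; below is a stdlib-only sketch of the equivalent lookup, using the defaults shown above:

```python
# Stdlib-only sketch of reading .env settings; in the project itself
# this is handled by python-dotenv's load_dotenv().
import os

def load_env(path=".env"):
    """Parse simple KEY=VALUE lines into os.environ, skipping comments."""
    try:
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#") or "=" not in line:
                    continue
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())
    except FileNotFoundError:
        pass  # no .env present: fall back to the defaults below

load_env()
DB_PATH = os.environ.get("DB_PATH", "checks.db")
RULES_PATH = os.environ.get("RULES_PATH", "data/rules1.json")
```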

data/rules1.json Format

Each rule has:

  • id: Unique identifier
  • type: Rule type (see below)
  • score: Points awarded if matched
  • Custom parameters based on type

Available Rule Types:

  • keyword_any - Matches any keyword in a list
  • contains_phrase - Matches an exact phrase
  • regex_match - Regex pattern matching
  • word_count_min - Minimum word count
  • length_min - Minimum character length
  • uppercase_ratio - Uppercase character ratio
  • starts_with / ends_with - Text boundaries

Example rule:

{
  "id": 1,
  "type": "keyword_any",
  "keywords": ["urgent", "critical", "important"],
  "score": 10,
  "description": "Urgent keywords"
}
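As an illustration, the rule types can be dispatched to small evaluator functions. The parameter names `phrase`, `min_words`, and `min_length` are assumptions for this sketch; the authoritative definitions live in app/checker/rules.py:

```python
# Illustrative dispatch of rule types to evaluators. Parameter names such
# as "phrase", "min_words", and "min_length" are assumptions for this
# sketch; see app/checker/rules.py for the real definitions.
import re

EVALUATORS = {
    "keyword_any":     lambda r, t: any(k.lower() in t.lower() for k in r["keywords"]),
    "contains_phrase": lambda r, t: r["phrase"].lower() in t.lower(),
    "regex_match":     lambda r, t: re.search(r["pattern"], t) is not None,
    "word_count_min":  lambda r, t: len(t.split()) >= r["min_words"],
    "length_min":      lambda r, t: len(t) >= r["min_length"],
}

def apply_rule(rule, text):
    """Award the rule's score when its evaluator matches the text."""
    return rule["score"] if EVALUATORS[rule["type"]](rule, text) else 0
```

With this dispatch, the example rule above would award 10 points to any chunk containing "urgent", "critical", or "important".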

System Architecture

Processing Pipeline

Input Text Files
      ↓
[Text Loader] - Load & clean raw text
      ↓
[Text Breaker] - Split into word chunks
      ↓
[Deduplicator] - Hash-based dedup check
      ↓
[Rule Checker] - Apply scoring rules (parallel)
      ↓
[Storage Layer] - Save to SQLite database
      ↓
[Outputs] - CSV, Reports, Emails

Key Components

  1. Text Ingestion - Loads .txt files, normalizes whitespace
  2. Chunking - Splits text into fixed-size word groups, assigns UIDs
  3. Hashing - Computes SHA-256 for deduplication
  4. Scoring - Applies rules in parallel threads
  5. Storage - SQLite with indexes for fast queries
  6. Analysis - Searches, exports, generates reports
  7. Intelligence - Auto-suggests new rules from patterns

Usage Examples

Add Custom Text Files

Place .txt files in data/support_text_files/:

data/support_text_files/
├── ticket_1.txt
├── ticket_2.txt
└── customer_feedback.txt

Run the pipeline or upload via dashboard.

Create Custom Rules

Edit data/rules1.json to add new rules:

{
  "id": 8,
  "type": "keyword_any",
  "keywords": ["password", "authentication", "login"],
  "score": 15,
  "description": "Security-related keywords"
}

Dashboard has a built-in rules editor with backup!

Export Results

Dashboard provides:

  • CSV export for Excel analysis
  • PDF reports with charts
  • Rule hit summaries
  • Wordcloud visualizations

Interpret Scores

Each chunk gets a score based on rules:

  • 0-19: Low priority / normal
  • 20-39: Medium priority / attention needed
  • 40+: High priority / urgent
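These bands are a reading convention rather than something the pipeline enforces; as a helper they might look like:

```python
# Helper mapping a chunk's score to the priority bands listed above.
def priority(score):
    if score >= 40:
        return "high"
    if score >= 20:
        return "medium"
    return "low"
```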

Storage Improver

Auto-generates new rules from frequent patterns:

python run.py

Output → improver_output/suggestions.json

Example suggestion:

{
  "type": "keyword_any",
  "keywords": ["customer"],
  "score": 1,
  "source": "auto-generated"
}

You can manually review and add these to rules1.json.
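One plausible shape for the suggester, assuming it counts frequent non-stopword terms (the real logic is in app/storage/storage_improver.py, and the stopword list here is invented):

```python
# Sketch of an auto-suggester that proposes keyword rules from frequent
# terms. Illustrative only; see app/storage/storage_improver.py.
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "is", "in"}

def suggest_rules(chunks, min_count=3):
    """Propose a keyword_any rule for each term seen at least min_count times."""
    counts = Counter(
        w for c in chunks for w in c.lower().split() if w not in STOPWORDS
    )
    return [
        {"type": "keyword_any", "keywords": [w], "score": 1, "source": "auto-generated"}
        for w, n in counts.most_common() if n >= min_count
    ]
```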


Email Summaries

Optional SMTP integration sends email alerts:

  1. Add valid Gmail/SMTP credentials to .env
  2. Set SEND_EMAIL = True in run.py
  3. Pipeline will email summaries with:
    • Total chunks processed
    • Average score
    • Top high-scoring items
    • Alerts for low-score items

Note: For Gmail, use App Passwords.
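A sketch of the summary sender using the stdlib smtplib and email modules; the field names and message layout here are illustrative, not the exact output of app/search_export/emailer.py:

```python
# Illustrative SMTP summary sender. Field names and layout are assumptions;
# the real sender is app/search_export/emailer.py, configured via .env.
import smtplib
from email.message import EmailMessage

def build_summary(total, avg_score, top_items):
    """Compose a plain-text summary; top_items is a list of (text, score)."""
    msg = EmailMessage()
    msg["Subject"] = f"Pipeline summary: {total} chunks, avg score {avg_score:.1f}"
    body = "\n".join(f"{score:>4}  {text[:60]}" for text, score in top_items)
    msg.set_content("Top items:\n" + body)
    return msg

def send_summary(msg, server, port, user, password, to):
    """Deliver the summary over STARTTLS (e.g. smtp.gmail.com:587)."""
    with smtplib.SMTP(server, port) as smtp:
        smtp.starttls()             # Gmail requires TLS on port 587
        smtp.login(user, password)  # use an App Password, not the account password
        msg["From"], msg["To"] = user, to
        smtp.send_message(msg)
```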


Search & Filter

Dashboard search supports:

  • Keywords: find all chunks containing "error"
  • Regex: find chunks matching pattern \d{4}
  • Score ranges: find chunks with score 20-40

Results export directly to CSV.
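The two text search modes differ in where the matching happens: keyword search can run in SQL via LIKE, while regex filtering happens in Python. A minimal sqlite3 sketch (in-memory here; the app queries checks.db, whose actual schema may differ):

```python
# Sketch of keyword and regex search over the chunk store. Uses an
# in-memory database with an assumed schema; the app queries checks.db.
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (uid INTEGER PRIMARY KEY, text TEXT, score INTEGER)")
conn.executemany("INSERT INTO chunks (text, score) VALUES (?, ?)",
                 [("error code 1234", 25), ("all good", 2)])

def search_keyword(term):
    """Substring search pushed down to SQL via LIKE."""
    cur = conn.execute("SELECT text, score FROM chunks WHERE text LIKE ?",
                       (f"%{term}%",))
    return cur.fetchall()

def search_regex(pattern):
    """Regex filtering done in Python over fetched rows."""
    rx = re.compile(pattern)
    cur = conn.execute("SELECT text, score FROM chunks")
    return [(t, s) for t, s in cur.fetchall() if rx.search(t)]
```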


Analytics

Dashboard provides:

  • Score distribution - Histogram of all scores
  • Top rules - Most frequently triggered rules
  • Wordcloud - Visualize high-frequency terms
  • Timeline - Chunks processed over time
  • Statistics - Mean, median, min, max scores
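For the statistics row, Python's stdlib statistics module is enough; a minimal sketch:

```python
# The dashboard's summary statistics reduce to the stdlib statistics module.
import statistics

def score_stats(scores):
    """Mean, median, min, and max over all chunk scores."""
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "min": min(scores),
        "max": max(scores),
    }
```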

Tech Stack

| Component | Technology |
| --- | --- |
| Language | Python 3.9+ |
| Database | SQLite3 |
| UI | Streamlit |
| Parallelism | ThreadPoolExecutor |
| Data | Pandas |
| Visualization | Plotly, Matplotlib, WordCloud |
| Reports | ReportLab |
| Email | SMTP |
| Configuration | python-dotenv |


🚀 Performance

Tested with:

  • ✅ 10,000+ text chunks
  • ✅ 100+ scoring rules
  • ✅ Parallel processing (6 workers)
  • ✅ SHA-256 hashing for 10,000+ items

Typical performance:

  • Chunk processing: ~100 chunks/second
  • Rule evaluation: ~50ms per chunk
  • Deduplication: <1ms per chunk

🔮 Future Roadmap

  • ML-based scoring (BERT, spaCy)
  • FastAPI REST endpoints
  • Vector embeddings & semantic search
  • Rule auto-learning with feedback
  • Docker containerization
  • Postgres support (scale beyond SQLite)
  • Real-time streaming pipeline
  • Advanced visualization (3D plots, networks)

Troubleshooting

Port 8501 already in use?

streamlit run streamlit_app.py --server.port 8502

Need to reset the database?

rm checks.db
python run.py  # Recreates fresh

Dependencies installation failed?

pip install --upgrade pip
pip install -r requirements.txt

Email not sending?

  • Verify credentials in .env
  • Check Gmail App Passwords
  • Check SMTP settings and firewall

See SETUP.md for more troubleshooting.


Contributing

Contributions welcome! Areas to improve:

  • Additional rule types
  • Machine learning integration
  • API layer
  • Performance optimization
  • Documentation

Please open an issue or pull request.


Credits

Project Lead:

  • Charan Teja Mangali — Lead Developer, System Architect & Mentor

Contributors:

  • Ayush Kumar — Full-stack implementation

License

MIT License — See LICENSE for details.

You are free to:

  • ✅ Use for commercial and private purposes
  • ✅ Modify and distribute
  • ✅ Include in projects

On the condition that you:

  • Include the original license and copyright notice

Show Your Support

If this project helped you, please consider:

  • ⭐ Giving it a star on GitHub
  • 🔗 Sharing with others
  • 💬 Leaving feedback
  • 🐛 Reporting issues

Happy Text Processing! 🚀
