Comprehensive system architecture for the AI-Augmented Security Operations Center (AI-SOC) platform.
The AI-SOC platform implements a microservices-based architecture designed for scalability, resilience, and operational intelligence. The system integrates traditional SIEM capabilities with cutting-edge machine learning and large language models to provide autonomous threat detection, analysis, and response capabilities.
Core Design Principles:
- Microservices Architecture: Independent, loosely-coupled services enable fault isolation and horizontal scaling
- Defense in Depth: Multi-layered security with network segmentation and zero-trust principles
- API-First Design: RESTful interfaces enable integration and extensibility
- Observable by Default: Comprehensive metrics, logs, and traces for operational visibility
- Infrastructure as Code: Complete configuration management via Docker Compose
┌────────────────────────────────────────────────────────────────────┐
│ External Data Sources │
│ Network Traffic, System Logs, Security Events, Threat Intelligence│
└───────────────────────────────┬────────────────────────────────────┘
│
┌───────────────────────┴────────────────────────┐
│ │
▼ ▼
┌──────────────────────┐ ┌─────────────────────────┐
│ Network Analysis │ │ External Log Sources │
│ ───────────────── │ │ ────────────────── │
│ • Suricata IDS/IPS │ │ • System Logs │
│ • Zeek Monitor │ │ • Application Logs │
│ • Packet Capture │ │ • Cloud Security Logs │
└──────────┬───────────┘ └────────────┬────────────┘
│ │
└─────────────────┬───────────────────────────┘
│
▼
┌────────────────────────────────┐
│ SIEM Core (Phase 1) │
│ ───────────────────────── │
│ • Wazuh Manager (Ingestion) │
│ • Wazuh Indexer (Storage) │
│ • Wazuh Dashboard (UI) │
└───────────┬────────────────────┘
│
┌───────────────┼────────────────┐
│ │ │
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────────┐
│ AI Services │ │ SOAR Stack │ │ Monitoring │
│ ─────────── │ │ ──────────── │ │ ────────── │
│ • ML Models │ │ • TheHive │ │ • Prometheus │
│ • LLM Agent │ │ • Cortex │ │ • Grafana │
│ • RAG/CTI │ │ • Shuffle │ │ • AlertManager │
└──────────────┘ └──────────────┘ └──────────────────┘
│ │ │
└───────────────┴─────────────────┘
│
▼
┌───────────────────────────────┐
│ Orchestration & Response │
│ ─────────────────────── │
│ • Automated Playbooks │
│ • Case Management │
│ • Incident Response │
└───────────────────────────────┘
Purpose: Collect and normalize security telemetry from diverse sources.
Components:
- Suricata IDS/IPS - Network-based intrusion detection using signature and anomaly detection
- Zeek Network Monitor - Passive network traffic analysis and metadata extraction
- Filebeat - Log shipping agent for centralized log collection
- Wazuh Agents - Host-based security monitoring and file integrity
Design Rationale:
- Multi-source ingestion provides comprehensive visibility across network and host layers
- Standard log formats (JSON, CEF, Syslog) enable interoperability
- Buffering and retry mechanisms ensure reliable data delivery
Performance Characteristics:
- Throughput: 10,000+ events/second sustained
- Latency: <100ms from event generation to indexing
- Reliability: 99.9% delivery guarantee with persistent queues
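The buffering and retry behavior described above can be sketched in a few lines of Python. This is an illustration of the delivery pattern (buffer until acknowledged, retry with exponential backoff), not Filebeat's or Wazuh's actual implementation; the function name and parameters are assumptions.

```python
import time
from collections import deque

def ship_events(events, send, max_retries=5, base_delay=0.01):
    """Deliver events to a sink with buffering and exponential-backoff retry.

    `send` is any callable that raises on failure (e.g. a network client).
    An event stays in the buffer until acknowledged, mirroring the
    persistent queues used by log shippers.
    """
    buffer = deque(events)
    delivered = []
    while buffer:
        event = buffer[0]
        for attempt in range(max_retries):
            try:
                send(event)
                delivered.append(event)
                buffer.popleft()  # ack: drop only after a successful send
                break
            except ConnectionError:
                time.sleep(base_delay * (2 ** attempt))  # exponential backoff
        else:
            raise RuntimeError(f"gave up after {max_retries} retries")
    return delivered
```

Because the buffer is only drained after a successful send, a transient sink outage delays delivery rather than losing events.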
Purpose: Centralized log aggregation, correlation, and persistent storage.
Components:
- Wazuh Manager - Event processing, correlation engine, API gateway
- Wazuh Indexer - OpenSearch-based distributed search and analytics engine
- Wazuh Dashboard - Web-based visualization and investigation interface
Technology Stack:
- OpenSearch 2.x (distributed search engine)
- Wazuh 4.8.2 (security information management)
- Kibana fork (visualization framework)
Design Rationale:
- OpenSearch provides horizontal scalability for petabyte-scale log storage
- Wazuh's rule-based correlation enables real-time threat detection
- RESTful API enables programmatic access for automation
Data Flow:
Event → Wazuh Manager → Rule Engine → Correlation → Indexer → Storage
↓
Alert Generation → Webhook → SOAR
Performance Characteristics:
- Indexing Rate: 50,000 events/second (3-node cluster)
- Query Latency: <500ms for 90th percentile
- Retention: 30 days hot storage, 365 days warm/cold tiers
- Storage Efficiency: 10:1 compression ratio
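The rule-matching step in the data flow above can be sketched as a minimal engine that checks an event against rule conditions and emits an alert on the first match. The condition syntax and field names here are illustrative, not actual Wazuh rule syntax (real Wazuh rules are XML with decoders, levels, and parent-rule chaining).

```python
def evaluate_event(event, rules):
    """Minimal rule-engine sketch: return an alert dict for the first rule
    whose condition fields all match the event, else None."""
    for rule in rules:
        if all(event.get(k) == v for k, v in rule["conditions"].items()):
            return {
                "rule_id": rule["id"],
                "level": rule["level"],
                "description": rule["description"],
                "event": event,
            }
    return None  # no rule matched: event is indexed, but no alert fires
```

In the real pipeline, an alert returned here would be forwarded to the SOAR layer via webhook.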
Purpose: Autonomous threat detection, classification, and contextual analysis using machine learning and large language models.
Architecture:
┌──────────────────────────────────────────────────────┐
│ AI Services Layer │
├──────────────────────────────────────────────────────┤
│ │
│ ┌───────────────┐ ┌──────────────────┐ │
│ │ ML Inference │◄────►│ Alert Triage │ │
│ │ Engine │ │ Service │ │
│ ├───────────────┤ ├──────────────────┤ │
│ │ Random Forest │ │ LLaMA 3.1:8b │ │
│ │ XGBoost │ │ Risk Scoring │ │
│ │ Decision Tree │ │ Prioritization │ │
│ └───────────────┘ └─────────┬────────┘ │
│ │ │
│ ┌──────────▼────────┐ │
│ │ RAG Service │ │
│ ├───────────────────┤ │
│ │ MITRE ATT&CK DB │ │
│ │ Threat Intel │ │
│ │ ChromaDB Vector │ │
│ └───────────────────┘ │
└──────────────────────────────────────────────────────┘
Components:
1. ML Inference Engine
- Models: Random Forest (primary), XGBoost (low-FP), Decision Tree (interpretable)
- Performance: 99.28% accuracy, 0.8ms inference latency
- API: FastAPI with automatic OpenAPI documentation
- Deployment: Docker containerized with health checks
2. Alert Triage Service
- LLM: LLaMA 3.1:8b via Ollama runtime
- Function: Natural language analysis of security alerts
- Capabilities:
- Risk scoring (0-100 scale)
- Attack classification
- Recommended response actions
- Executive summaries
3. RAG Service
- Knowledge Base: 823 MITRE ATT&CK techniques
- Vector Database: ChromaDB for semantic search
- Retrieval: Top-k context retrieval for LLM augmentation
- Latency: <50ms for 5 nearest neighbors
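The top-k retrieval the RAG Service performs can be sketched as cosine-similarity ranking over stored embeddings, which is what a vector store like ChromaDB does internally. The toy two-dimensional vectors below stand in for real sentence-transformer embeddings.

```python
import math

def top_k(query_vec, corpus, k=5):
    """Rank (doc_id, vector) pairs by cosine similarity to the query
    embedding and return the k best matches with their scores."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return dot / (norm_a * norm_b)

    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in corpus]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]
```

In production the returned document IDs would index into the MITRE ATT&CK knowledge base, and the retrieved technique descriptions would be injected into the LLM prompt.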
Design Rationale:
- Ensemble Approach: Multiple ML models provide redundancy and complementary strengths
- Hybrid Intelligence: Traditional ML (fast, deterministic) + LLM (contextual, adaptive)
- Offline-First: Models deployed locally, no external API dependencies
- Explainability: Decision tree model provides full transparency for compliance
Data Flow:
Alert → ML Classification → Prediction (BENIGN/ATTACK)
↓
Alert Triage
↓
┌───────────┴──────────┐
▼ ▼
RAG Retrieval LLM Analysis
(MITRE Techniques) (Natural Language)
│ │
└───────────┬───────────┘
▼
Enriched Alert (Risk Score,
Classification, Context)
▼
TheHive
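The final enrichment step in the flow above merges the three analysis results into one alert document. The field names below are illustrative assumptions; the actual services define their own schemas.

```python
def enrich_alert(alert, ml_prediction, rag_context, llm_summary):
    """Assemble the enriched alert handed to TheHive from the ML
    classification, retrieved MITRE context, and LLM triage output."""
    return {
        **alert,
        "classification": ml_prediction["label"],          # BENIGN / ATTACK
        "risk_score": llm_summary["risk_score"],           # 0-100 triage scale
        "mitre_techniques": [t["id"] for t in rag_context],
        "summary": llm_summary["text"],
    }
```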
Purpose: Security orchestration, automation, and response.
Components:
- TheHive - Collaborative case management platform
- Cortex - Observable analysis engine with 100+ analyzers
- Shuffle - Workflow automation and playbook execution
Integration Points:
- Wazuh → TheHive (webhook-based alert ingestion)
- TheHive → Cortex (automated IOC enrichment)
- TheHive → Shuffle (workflow triggers)
- Shuffle → Response Actions (firewall rules, EDR isolation, notifications)
Design Rationale:
- TheHive provides centralized case management for multi-analyst collaboration
- Cortex automates repetitive analysis tasks (IP reputation, file hashing, threat intel)
- Shuffle enables no-code playbook development for rapid response
Workflow Example:
Wazuh Alert → TheHive Case
↓
Cortex Analysis (IP reputation, geolocation)
↓
Shuffle Playbook Execution
↓
┌──────────┴──────────┐
▼ ▼
Block IP (Firewall) Notify SOC Team
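The Wazuh → TheHive handoff at the top of this workflow is a field-mapping exercise. The sketch below follows TheHive's alert model (title, sourceRef, severity, etc.), but verify the exact field names against your TheHive version; the severity mapping from Wazuh's 0-15 rule levels to TheHive's 1-4 scale is a local convention, not a standard.

```python
def wazuh_to_thehive(wazuh_alert):
    """Map a Wazuh alert to a TheHive alert payload (field names follow
    TheHive's alert model; confirm against your deployed version)."""
    level = wazuh_alert["rule"]["level"]
    # Local convention: low(<7)=1, medium(<10)=2, high(<13)=3, critical=4
    severity = 1 if level < 7 else 2 if level < 10 else 3 if level < 13 else 4
    return {
        "type": "wazuh",
        "source": "wazuh",
        "sourceRef": wazuh_alert["id"],
        "title": wazuh_alert["rule"]["description"],
        "description": (
            f"Rule {wazuh_alert['rule']['id']} fired on agent "
            f"{wazuh_alert['agent']['name']}"
        ),
        "severity": severity,
    }
```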
Purpose: Real-time health monitoring, performance metrics, and alerting.
Components:
- Prometheus - Time-series metrics database
- Grafana - Visualization and dashboards
- AlertManager - Alert routing and deduplication
- Loki - Log aggregation for troubleshooting
- cAdvisor + Node Exporter - Container and host metrics
Metrics Collection:
- 13 scrape targets across all services
- 15-second scrape interval
- 30-day retention for high-resolution data
Dashboards:
- SIEM Stack Health (Wazuh Manager, Indexer, Dashboard)
- ML Model Performance (inference latency, prediction distribution)
- AI Services Metrics (LLM response times, RAG retrieval accuracy)
- Infrastructure Resources (CPU, RAM, disk, network)
Alerting Rules:
- Service down detection (<30 seconds)
- Resource exhaustion (CPU >80%, RAM >90%)
- ML model drift detection
- Abnormal false positive rates
Design Rationale:
- Prometheus exposes metrics in the de facto standard exposition format, supported by most observability tooling
- Grafana enables custom dashboards for different stakeholder personas (SOC analyst, engineer, executive)
- AlertManager prevents alert fatigue through intelligent grouping and inhibition
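The grouping behavior credited to AlertManager above can be sketched as collapsing alerts that share the same label values into a single notification. This is an illustration of the principle, not AlertManager's actual implementation (which also handles timing windows, routing trees, and inhibition rules).

```python
from collections import defaultdict

def group_alerts(alerts, group_by=("alertname", "service")):
    """Group alerts by the values of the given labels; each group would
    become one notification instead of one message per alert."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(label) for label in group_by)
        groups[key].append(alert)
    return dict(groups)
```

Ten replicas of the same `HighCPU` alert thus produce one page, not ten.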
Isolation Strategy: Backend/Frontend network separation per stack.
| Network | Subnet | Purpose | Security Posture |
|---|---|---|---|
| siem-backend | 172.20.0.0/24 | SIEM internal comms | No external exposure |
| siem-frontend | 172.21.0.0/24 | SIEM web UI | HTTPS only |
| soar-backend | 172.26.0.0/24 | SOAR databases | No external exposure |
| soar-frontend | 172.27.0.0/24 | SOAR web UIs | HTTP (reverse proxy recommended) |
| monitoring | 172.28.0.0/24 | Observability stack | Internal only |
| ai-network | 172.30.0.0/24 | AI/ML services | API gateway protected |
Benefits:
- Compromised web UI cannot directly access backend databases
- Lateral movement requires crossing network boundaries
- Simplified firewall rule management
- Clear trust boundaries for security policies
Externally Accessible:
- 443 (Wazuh Dashboard - HTTPS)
- 3000 (Grafana)
- 9010 (TheHive)
- 9011 (Cortex)
- 3001 (Shuffle)
- 8500 (ML Inference API)
- 8100 (Alert Triage API)
- 8300 (RAG Service API)
Internal Only:
- 9200 (Wazuh Indexer - OpenSearch)
- 55000 (Wazuh Manager API)
- 9042 (Cassandra)
- 8200 (ChromaDB)
- 11434 (Ollama LLM)
See Network Topology for complete port mapping.
| Component | Technology | Version | Justification |
|---|---|---|---|
| SIEM | Wazuh | 4.8.2 | Open-source, MITRE ATT&CK mapping, active community |
| Search Engine | OpenSearch | 2.x | Elasticsearch fork, scalable, no licensing restrictions |
| Case Management | TheHive | 5.2.9 | Purpose-built for SOC workflows, Cortex integration |
| Orchestration | Shuffle | 1.4.0 | Open-source SOAR, drag-drop workflows |
| Database | Cassandra | 4.1.3 | Distributed, fault-tolerant, scales horizontally |
| Vector DB | ChromaDB | Latest | AI-native, embedding support, simple API |
| Object Storage | MinIO | Latest | S3-compatible, self-hosted |
| Component | Technology | Version | Justification |
|---|---|---|---|
| ML Framework | scikit-learn | 1.3+ | Industry standard, battle-tested algorithms |
| LLM Runtime | Ollama | Latest | Local inference, model management, OpenAI-compatible API |
| LLM Model | LLaMA 3.1 | 8B params | State-of-the-art open-source, optimal size/performance |
| API Framework | FastAPI | 0.100+ | Async support, automatic docs, type safety |
| Vector Embeddings | sentence-transformers | Latest | Pre-trained models, semantic similarity |
| Component | Technology | Version | Justification |
|---|---|---|---|
| Container Runtime | Docker | 24.0+ | Industry standard, mature ecosystem |
| Orchestration | Docker Compose | V2 | Simplified multi-container management |
| Monitoring | Prometheus | 2.48+ | De facto standard, extensive integrations |
| Visualization | Grafana | 10.2+ | Powerful dashboards, alerting, multi-datasource |
| Log Aggregation | Loki | 2.9+ | Prometheus-style log queries, low storage overhead |
SIEM Stack:
- Wazuh Manager: Multi-node cluster with load balancing
- Wazuh Indexer: OpenSearch cluster (3+ nodes for HA)
- Capacity: 100,000+ events/second with 5-node indexer cluster
AI Services:
- ML Inference: Stateless, add replicas behind load balancer
- Alert Triage: Horizontal scaling limited by Ollama GPU availability
- RAG Service: Stateless, ChromaDB supports distributed deployment
SOAR Stack:
- TheHive: Multi-master cluster with Cassandra ring
- Shuffle: Worker scaling for parallel workflow execution
Resource Limits (per service):
- Wazuh Indexer: 16GB RAM (configurable JVM heap)
- ML Inference: 1GB RAM, 1 CPU (sufficient for 1,000 req/sec)
- Ollama LLM: 8GB RAM minimum (16GB for larger models)
- ChromaDB: 4GB RAM for 100K vectors
| Metric | Small Deployment | Medium | Large |
|---|---|---|---|
| Event Throughput | 1,000/sec | 10,000/sec | 100,000/sec |
| Concurrent Analysts | 5 | 25 | 100 |
| Data Retention | 30 days | 90 days | 365 days |
| Query Response (p95) | <1s | <500ms | <200ms |
| ML Inference Latency | <5ms | <2ms | <1ms |
Critical Services (require 99.9% uptime):
- Wazuh Manager: 2+ nodes with failover
- Wazuh Indexer: 3+ nodes (quorum-based)
- Cassandra: 3+ nodes (RF=3)
Non-Critical Services (tolerate brief downtime):
- Grafana: Single instance acceptable (read-only impact)
- Shuffle: Workflow queue prevents data loss
Volumes:
- All stateful services use named Docker volumes
- Volume backup strategy: daily snapshots
- Retention: 30 days for volume backups
Backup Procedures:
# Wazuh Indexer snapshot (assumes a snapshot repository named "backup" is
# already registered; add auth/TLS flags to match your deployment)
docker exec wazuh-indexer curl -sk -X PUT "https://localhost:9200/_snapshot/backup/snapshot-$(date +%Y%m%d)"
# Cassandra backup
docker exec cassandra nodetool snapshot
# ChromaDB export
docker exec chromadb curl "http://localhost:8000/api/v1/export"
Layer 1: Network Segmentation
- Isolated Docker networks per stack
- No direct backend exposure to internet
- Firewall rules restrict inter-service communication
Layer 2: Authentication & Authorization
- API key authentication for service-to-service
- OAuth2/SAML for user authentication
- Role-based access control (RBAC) in TheHive
Layer 3: Encryption
- TLS 1.3 for all external communication
- Self-signed certificates (development)
- Let's Encrypt integration (production)
Layer 4: Secrets Management
- Environment variable injection
- Docker secrets for production
- HashiCorp Vault integration (future)
Layer 5: Audit Logging
- All API calls logged to Wazuh
- Immutable audit trail
- Retention: 365 days minimum
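The "immutable audit trail" property can be illustrated with a hash chain: each record carries the SHA-256 of the previous record, so any in-place edit invalidates every later hash. This is a sketch of the tamper-evidence principle only; a production trail would also need write-once storage and anchoring.

```python
import hashlib
import json

def append_entry(log, entry):
    """Append an audit record whose hash covers both the entry and the
    previous record's hash, forming a tamper-evident chain."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    body = {"entry": entry, "prev": prev_hash}
    record = dict(body)
    record["hash"] = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()
    ).hexdigest()
    log.append(record)
    return log

def verify(log):
    """Recompute every hash; any edited or reordered record breaks the chain."""
    prev = "0" * 64
    for rec in log:
        expected = hashlib.sha256(
            json.dumps({"entry": rec["entry"], "prev": prev},
                       sort_keys=True).encode()
        ).hexdigest()
        if rec["hash"] != expected or rec["prev"] != prev:
            return False
        prev = rec["hash"]
    return True
```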
Assumed Threats:
- External network attackers
- Compromised web application
- Insider threats (malicious analyst)
- Supply chain attacks (vulnerable dependencies)
Mitigations:
- Web Application Firewall (WAF) recommended
- Principle of least privilege
- Audit logging and anomaly detection
- Dependency scanning (Dependabot, Snyk)
See Security Guide for detailed hardening procedures.
Webhooks:
- Wazuh → TheHive: Alert creation on rule match
- TheHive → Shuffle: Case status changes trigger workflows
- AlertManager → Shuffle: Infrastructure alerts trigger remediation
Benefits:
- Loose coupling between services
- Asynchronous processing prevents blocking
- Retry mechanisms handle transient failures
RESTful APIs:
- All services expose standardized REST endpoints
- OpenAPI/Swagger documentation auto-generated
- Consistent error handling (RFC 7807 Problem Details)
Example API Flow:
POST /triage
→ GET /ml-inference/predict (ML classification)
→ GET /rag-service/retrieve (MITRE context)
→ POST /ollama/api/generate (LLM analysis)
→ Response: Enriched alert
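The `/triage` flow above can be sketched as a pipeline with the three downstream calls injected as callables; in production each would be an HTTP request to the ML, RAG, and Ollama services. The short-circuit on BENIGN and the return shape are assumptions for illustration.

```python
def triage(alert, classify, retrieve, generate):
    """Orchestrate the triage pipeline over three injected service calls."""
    prediction = classify(alert)            # GET /ml-inference/predict
    if prediction["label"] == "BENIGN":
        # Skip the expensive LLM path for benign traffic
        return {**alert, "classification": "BENIGN", "risk_score": 0}
    context = retrieve(alert)               # GET /rag-service/retrieve
    analysis = generate(alert, context)     # POST /ollama/api/generate
    return {
        **alert,
        "classification": prediction["label"],
        "mitre_context": context,
        "risk_score": analysis["risk_score"],
        "summary": analysis["text"],
    }
```

Injecting the calls keeps the orchestration logic testable without live services.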
Code Commit → GitHub Actions
↓
Unit Tests
↓
Docker Build
↓
Integration Tests
↓
Deploy to Staging
↓
Smoke Tests
↓
Production Deployment
Environment Variables:
- .env file for local development
- Docker Compose env_file directive
- Secrets injected at runtime
Infrastructure as Code:
- All configurations version-controlled
- Declarative Docker Compose specifications
- Idempotent deployment scripts
- Multi-class ML classification (24 attack types)
- Reverse proxy (Nginx/Traefik) for HTTPS termination
- Secrets management (HashiCorp Vault)
- Automated backups
- Kubernetes migration for production deployments
- Multi-region deployment for disaster recovery
- Advanced ML models (deep learning, transformers)
- Custom Cortex analyzers
- Multi-agent collaboration framework
- Automated playbook generation via LLM
- Predictive threat modeling
- Zero-trust network architecture
Wazuh Dashboard → Wazuh Manager → Wazuh Indexer
TheHive → Cassandra + MinIO
Cortex → Cassandra + TheHive
Shuffle → OpenSearch
Alert Triage → ML Inference + RAG Service + Ollama
RAG Service → ChromaDB
Grafana → Prometheus + Loki
AlertManager → Prometheus
Minimum (Development/Testing):
- CPU: 4 cores (8 threads)
- RAM: 16GB
- Disk: 50GB SSD
- Network: 100Mbps
Recommended (Production):
- CPU: 8 cores (16 threads)
- RAM: 32GB
- Disk: 250GB NVMe SSD
- Network: 1Gbps
See System Requirements for detailed specifications.
- SIEM: Security Information and Event Management
- SOAR: Security Orchestration, Automation, and Response
- RAG: Retrieval-Augmented Generation
- CTI: Cyber Threat Intelligence
- MITRE ATT&CK: Adversarial Tactics, Techniques, and Common Knowledge framework
- IOC: Indicator of Compromise
- EDR: Endpoint Detection and Response
Architecture Documentation Version: 1.0 Last Updated: October 24, 2025 Maintained By: AI-SOC Architecture Team