
Architecture Overview

Comprehensive system architecture for the AI-Augmented Security Operations Center (AI-SOC) platform.


Executive Summary

The AI-SOC platform implements a microservices-based architecture designed for scalability, resilience, and operational intelligence. The system integrates traditional SIEM capabilities with machine learning and large language models to provide autonomous threat detection, analysis, and response.

Core Design Principles:

  • Microservices Architecture: Independent, loosely-coupled services enable fault isolation and horizontal scaling
  • Defense in Depth: Multi-layered security with network segmentation and zero-trust principles
  • API-First Design: RESTful interfaces enable integration and extensibility
  • Observable by Default: Comprehensive metrics, logs, and traces for operational visibility
  • Infrastructure as Code: Complete configuration management via Docker Compose

System Architecture

High-Level Architecture

┌────────────────────────────────────────────────────────────────────┐
│                        External Data Sources                        │
│  Network Traffic, System Logs, Security Events, Threat Intelligence│
└───────────────────────────────┬────────────────────────────────────┘
                                │
        ┌───────────────────────┴────────────────────────┐
        │                                                  │
        ▼                                                  ▼
┌──────────────────────┐                    ┌─────────────────────────┐
│  Network Analysis    │                    │   External Log Sources  │
│  ─────────────────   │                    │   ──────────────────    │
│  • Suricata IDS/IPS  │                    │   • System Logs         │
│  • Zeek Monitor      │                    │   • Application Logs    │
│  • Packet Capture    │                    │   • Cloud Security Logs │
└──────────┬───────────┘                    └────────────┬────────────┘
           │                                             │
           └─────────────────┬───────────────────────────┘
                             │
                             ▼
            ┌────────────────────────────────┐
            │      SIEM Core (Phase 1)       │
            │  ─────────────────────────     │
            │  • Wazuh Manager (Ingestion)   │
            │  • Wazuh Indexer (Storage)     │
            │  • Wazuh Dashboard (UI)        │
            └───────────┬────────────────────┘
                        │
        ┌───────────────┼────────────────┐
        │               │                 │
        ▼               ▼                 ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────────┐
│  AI Services │ │ SOAR Stack   │ │   Monitoring     │
│  ─────────── │ │ ──────────── │ │   ──────────     │
│  • ML Models │ │ • TheHive    │ │   • Prometheus   │
│  • LLM Agent │ │ • Cortex     │ │   • Grafana      │
│  • RAG/CTI   │ │ • Shuffle    │ │   • AlertManager │
└──────────────┘ └──────────────┘ └──────────────────┘
        │               │                 │
        └───────────────┴─────────────────┘
                        │
                        ▼
        ┌───────────────────────────────┐
        │   Orchestration & Response    │
        │   ───────────────────────     │
        │   • Automated Playbooks       │
        │   • Case Management           │
        │   • Incident Response         │
        └───────────────────────────────┘

Architectural Layers

Layer 1: Data Ingestion

Purpose: Collect and normalize security telemetry from diverse sources.

Components:

  • Suricata IDS/IPS - Network-based intrusion detection using signature and anomaly detection
  • Zeek Network Monitor - Passive network traffic analysis and metadata extraction
  • Filebeat - Log shipping agent for centralized log collection
  • Wazuh Agents - Host-based security monitoring and file integrity

Design Rationale:

  • Multi-source ingestion provides comprehensive visibility across network and host layers
  • Standard log formats (JSON, CEF, Syslog) enable interoperability
  • Buffering and retry mechanisms ensure reliable data delivery

Performance Characteristics:

  • Throughput: 10,000+ events/second sustained
  • Latency: <100ms from event generation to indexing
  • Reliability: 99.9% delivery guarantee with persistent queues

Layer 2: SIEM Core

Purpose: Centralized log aggregation, correlation, and persistent storage.

Components:

  • Wazuh Manager - Event processing, correlation engine, API gateway
  • Wazuh Indexer - OpenSearch-based distributed search and analytics engine
  • Wazuh Dashboard - Web-based visualization and investigation interface

Technology Stack:

  • OpenSearch 2.x (distributed search engine)
  • Wazuh 4.8.2 (security information management)
  • Kibana fork (visualization framework)

Design Rationale:

  • OpenSearch provides horizontal scalability for petabyte-scale log storage
  • Wazuh's rule-based correlation enables real-time threat detection
  • RESTful API enables programmatic access for automation

Data Flow:

Event → Wazuh Manager → Rule Engine → Correlation → Indexer → Storage
                ↓
          Alert Generation → Webhook → SOAR
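Indexed alerts can also be pulled programmatically over the Indexer's REST API. A minimal stdlib-only sketch, assuming the default `wazuh-alerts-*` index pattern and an unauthenticated local endpoint (production clusters require TLS and credentials):

```python
import json
import urllib.request

def build_alert_query(min_level: int = 10, minutes: int = 15) -> dict:
    """Build an OpenSearch query for recent high-severity Wazuh alerts."""
    return {
        "size": 100,
        "sort": [{"@timestamp": {"order": "desc"}}],
        "query": {
            "bool": {
                "filter": [
                    {"range": {"rule.level": {"gte": min_level}}},
                    {"range": {"@timestamp": {"gte": f"now-{minutes}m"}}},
                ]
            }
        },
    }

def fetch_alerts(host: str = "http://localhost:9200") -> dict:
    """POST the query against the wazuh-alerts-* pattern (live stack only)."""
    req = urllib.request.Request(
        f"{host}/wazuh-alerts-*/_search",
        data=json.dumps(build_alert_query()).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:  # add TLS/auth for real clusters
        return json.load(resp)
```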

Performance Characteristics:

  • Indexing Rate: 50,000 events/second (3-node cluster)
  • Query Latency: <500ms for 90th percentile
  • Retention: 30 days hot storage, 365 days warm/cold tiers
  • Storage Efficiency: 10:1 compression ratio

Layer 3: AI Services

Purpose: Autonomous threat detection, classification, and contextual analysis using machine learning and large language models.

Architecture:

┌──────────────────────────────────────────────────────┐
│              AI Services Layer                        │
├──────────────────────────────────────────────────────┤
│                                                       │
│  ┌───────────────┐      ┌──────────────────┐        │
│  │ ML Inference  │◄────►│  Alert Triage    │        │
│  │    Engine     │      │    Service       │        │
│  ├───────────────┤      ├──────────────────┤        │
│  │ Random Forest │      │ LLaMA 3.1:8b     │        │
│  │ XGBoost       │      │ Risk Scoring     │        │
│  │ Decision Tree │      │ Prioritization   │        │
│  └───────────────┘      └─────────┬────────┘        │
│                                    │                  │
│                         ┌──────────▼────────┐        │
│                         │  RAG Service      │        │
│                         ├───────────────────┤        │
│                         │ MITRE ATT&CK DB   │        │
│                         │ Threat Intel      │        │
│                         │ ChromaDB Vector   │        │
│                         └───────────────────┘        │
└──────────────────────────────────────────────────────┘

Components:

1. ML Inference Engine

  • Models: Random Forest (primary), XGBoost (low-FP), Decision Tree (interpretable)
  • Performance: 99.28% accuracy, 0.8ms inference latency
  • API: FastAPI with automatic OpenAPI documentation
  • Deployment: Docker containerized with health checks
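A client-side sketch of calling the inference API. The `/predict` path and the `{"features": [...]}` payload shape are illustrative assumptions, not the service's documented contract:

```python
import json
import urllib.request

# NOTE: endpoint path, payload shape, and response fields below are
# illustrative assumptions about the ML Inference service.
ML_API = "http://localhost:8500"

def build_predict_payload(features: list[float]) -> bytes:
    """Serialize a flow feature vector for the assumed /predict endpoint."""
    return json.dumps({"features": features}).encode()

def classify_flow(features: list[float]) -> dict:
    """Submit one feature vector and return the parsed JSON prediction."""
    req = urllib.request.Request(
        f"{ML_API}/predict",
        data=build_predict_payload(features),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```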

2. Alert Triage Service

  • LLM: LLaMA 3.1:8b via Ollama runtime
  • Function: Natural language analysis of security alerts
  • Capabilities:
    • Risk scoring (0-100 scale)
    • Attack classification
    • Recommended response actions
    • Executive summaries
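Ollama exposes a `/api/generate` endpoint for non-streaming completions; the prompt framing below is an illustrative assumption about how the triage service might phrase its request:

```python
import json
import urllib.request

def build_triage_prompt(alert: dict) -> str:
    """Frame a Wazuh alert as a triage request for the local LLM."""
    return (
        "You are a SOC analyst. Assess the alert below.\n"
        "Return a risk score (0-100), an attack classification, "
        "and a recommended response action.\n\n"
        f"Alert: {json.dumps(alert, indent=2)}"
    )

def triage(alert: dict, host: str = "http://localhost:11434") -> str:
    """Call Ollama's /api/generate endpoint with streaming disabled."""
    body = json.dumps({
        "model": "llama3.1:8b",
        "prompt": build_triage_prompt(alert),
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        f"{host}/api/generate", data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]
```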

3. RAG Service

  • Knowledge Base: 823 MITRE ATT&CK techniques
  • Vector Database: ChromaDB for semantic search
  • Retrieval: Top-k context retrieval for LLM augmentation
  • Latency: <50ms for 5 nearest neighbors
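Top-k retrieval ranks stored embeddings by similarity to the query embedding. ChromaDB does this internally; the mechanism can be illustrated in a few lines of pure Python over toy vectors:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec: list[float], corpus: list[tuple], k: int = 5) -> list[tuple]:
    """corpus: (technique_id, embedding) pairs; return the k nearest by cosine."""
    scored = [(tid, cosine(query_vec, vec)) for tid, vec in corpus]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]
```

The retrieved technique IDs (and their descriptions) are then concatenated into the LLM prompt as grounding context.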

Design Rationale:

  • Ensemble Approach: Multiple ML models provide redundancy and complementary strengths
  • Hybrid Intelligence: Traditional ML (fast, deterministic) + LLM (contextual, adaptive)
  • Offline-First: Models deployed locally, no external API dependencies
  • Explainability: Decision tree model provides full transparency for compliance

Data Flow:

Alert → ML Classification → Prediction (BENIGN/ATTACK)
                          ↓
                    Alert Triage
                          ↓
              ┌───────────┴──────────┐
              ▼                       ▼
        RAG Retrieval           LLM Analysis
    (MITRE Techniques)       (Natural Language)
              │                       │
              └───────────┬───────────┘
                          ▼
              Enriched Alert (Risk Score,
               Classification, Context)
                          ▼
                      TheHive

Layer 4: SOAR Stack

Purpose: Security orchestration, automation, and response.

Components:

  • TheHive - Collaborative case management platform
  • Cortex - Observable analysis engine with 100+ analyzers
  • Shuffle - Workflow automation and playbook execution

Integration Points:

  • Wazuh → TheHive (webhook-based alert ingestion)
  • TheHive → Cortex (automated IOC enrichment)
  • TheHive → Shuffle (workflow triggers)
  • Shuffle → Response Actions (firewall rules, EDR isolation, notifications)
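The Wazuh → TheHive hand-off can be sketched as below, assuming TheHive 5's `/api/v1/alert` endpoint and bearer-token auth; the mapping of Wazuh's 0-15 rule levels onto TheHive's 1-4 severity scale is an illustrative choice, not the platform's actual mapping:

```python
import json
import urllib.request

def build_thehive_alert(wazuh_alert: dict) -> dict:
    """Map a Wazuh alert onto TheHive's alert schema."""
    rule = wazuh_alert.get("rule", {})
    return {
        "type": "wazuh",
        "source": "wazuh-manager",
        "sourceRef": wazuh_alert.get("id", "unknown"),
        "title": rule.get("description", "Wazuh alert"),
        "description": json.dumps(wazuh_alert, indent=2),
        # Illustrative mapping: rule level 0-15 -> severity 1-4
        "severity": min(4, max(1, int(rule.get("level", 3)) // 4)),
    }

def push_alert(wazuh_alert: dict,
               url: str = "http://localhost:9010/api/v1/alert",
               api_key: str = "...") -> dict:
    """POST the mapped alert to TheHive (live stack only)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_thehive_alert(wazuh_alert)).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```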

Design Rationale:

  • TheHive provides centralized case management for multi-analyst collaboration
  • Cortex automates repetitive analysis tasks (IP reputation, file hashing, threat intel)
  • Shuffle enables no-code playbook development for rapid response

Workflow Example:

Wazuh Alert → TheHive Case
                    ↓
          Cortex Analysis (IP reputation, geolocation)
                    ↓
         Shuffle Playbook Execution
                    ↓
         ┌──────────┴──────────┐
         ▼                      ▼
   Block IP (Firewall)    Notify SOC Team

Layer 5: Monitoring & Observability

Purpose: Real-time health monitoring, performance metrics, and alerting.

Components:

  • Prometheus - Time-series metrics database
  • Grafana - Visualization and dashboards
  • AlertManager - Alert routing and deduplication
  • Loki - Log aggregation for troubleshooting
  • cAdvisor + Node Exporter - Container and host metrics

Metrics Collection:

  • 13 scrape targets across all services
  • 15-second scrape interval
  • 30-day retention for high-resolution data

Dashboards:

  • SIEM Stack Health (Wazuh Manager, Indexer, Dashboard)
  • ML Model Performance (inference latency, prediction distribution)
  • AI Services Metrics (LLM response times, RAG retrieval accuracy)
  • Infrastructure Resources (CPU, RAM, disk, network)

Alerting Rules:

  • Service down detection (<30 seconds)
  • Resource exhaustion (CPU >80%, RAM >90%)
  • ML model drift detection
  • Abnormal false positive rates

Design Rationale:

  • Prometheus provides industry-standard metrics format (compatible with all major tools)
  • Grafana enables custom dashboards for different stakeholder personas (SOC analyst, engineer, executive)
  • AlertManager prevents alert fatigue through intelligent grouping and inhibition

Network Architecture

Network Segmentation

Isolation Strategy: Backend/Frontend network separation per stack.

| Network       | Subnet        | Purpose             | Security Posture                 |
|---------------|---------------|---------------------|----------------------------------|
| siem-backend  | 172.20.0.0/24 | SIEM internal comms | No external exposure             |
| siem-frontend | 172.21.0.0/24 | SIEM web UI         | HTTPS only                       |
| soar-backend  | 172.26.0.0/24 | SOAR databases      | No external exposure             |
| soar-frontend | 172.27.0.0/24 | SOAR web UIs        | HTTP (reverse proxy recommended) |
| monitoring    | 172.28.0.0/24 | Observability stack | Internal only                    |
| ai-network    | 172.30.0.0/24 | AI/ML services      | API gateway protected            |

Benefits:

  • Compromised web UI cannot directly access backend databases
  • Lateral movement requires crossing network boundaries
  • Simplified firewall rule management
  • Clear trust boundaries for security policies

Port Allocation

Externally Accessible:

  • 443 (Wazuh Dashboard - HTTPS)
  • 3000 (Grafana)
  • 9010 (TheHive)
  • 9011 (Cortex)
  • 3001 (Shuffle)
  • 8500 (ML Inference API)
  • 8100 (Alert Triage API)
  • 8300 (RAG Service API)

Internal Only:

  • 9200 (Wazuh Indexer - OpenSearch)
  • 55000 (Wazuh Manager API)
  • 9042 (Cassandra)
  • 8200 (ChromaDB)
  • 11434 (Ollama LLM)

See Network Topology for complete port mapping.


Technology Stack

Backend Services

| Component       | Technology | Version | Justification                                           |
|-----------------|------------|---------|---------------------------------------------------------|
| SIEM            | Wazuh      | 4.8.2   | Open-source, MITRE ATT&CK mapping, active community     |
| Search Engine   | OpenSearch | 2.x     | Elasticsearch fork, scalable, no licensing restrictions |
| Case Management | TheHive    | 5.2.9   | Purpose-built for SOC workflows, Cortex integration     |
| Orchestration   | Shuffle    | 1.4.0   | Open-source SOAR, drag-drop workflows                   |
| Database        | Cassandra  | 4.1.3   | Distributed, fault-tolerant, scales horizontally        |
| Vector DB       | ChromaDB   | Latest  | AI-native, embedding support, simple API                |
| Object Storage  | MinIO      | Latest  | S3-compatible, self-hosted                              |

AI/ML Stack

| Component         | Technology            | Version   | Justification                                            |
|-------------------|-----------------------|-----------|----------------------------------------------------------|
| ML Framework      | scikit-learn          | 1.3+      | Industry standard, battle-tested algorithms              |
| LLM Runtime       | Ollama                | Latest    | Local inference, model management, OpenAI-compatible API |
| LLM Model         | LLaMA 3.1             | 8B params | State-of-the-art open-source, optimal size/performance   |
| API Framework     | FastAPI               | 0.100+    | Async support, automatic docs, type safety               |
| Vector Embeddings | sentence-transformers | Latest    | Pre-trained models, semantic similarity                  |

Infrastructure

| Component         | Technology     | Version | Justification                                      |
|-------------------|----------------|---------|----------------------------------------------------|
| Container Runtime | Docker         | 24.0+   | Industry standard, mature ecosystem                |
| Orchestration     | Docker Compose | V2      | Simplified multi-container management              |
| Monitoring        | Prometheus     | 2.48+   | De facto standard, extensive integrations          |
| Visualization     | Grafana        | 10.2+   | Powerful dashboards, alerting, multi-datasource    |
| Log Aggregation   | Loki           | 2.9+    | Prometheus-style log queries, low storage overhead |

Scalability Considerations

Horizontal Scaling

SIEM Stack:

  • Wazuh Manager: Multi-node cluster with load balancing
  • Wazuh Indexer: OpenSearch cluster (3+ nodes for HA)
  • Capacity: 100,000+ events/second with 5-node indexer cluster

AI Services:

  • ML Inference: Stateless, add replicas behind load balancer
  • Alert Triage: Horizontal scaling limited by Ollama GPU availability
  • RAG Service: Stateless, ChromaDB supports distributed deployment

SOAR Stack:

  • TheHive: Multi-master cluster with Cassandra ring
  • Shuffle: Worker scaling for parallel workflow execution

Vertical Scaling

Resource Limits (per service):

  • Wazuh Indexer: 16GB RAM (configurable JVM heap)
  • ML Inference: 1GB RAM, 1 CPU (sufficient for 1,000 req/sec)
  • Ollama LLM: 8GB RAM minimum (16GB for larger models)
  • ChromaDB: 4GB RAM for 100K vectors

Performance Targets

| Metric               | Small Deployment | Medium     | Large       |
|----------------------|------------------|------------|-------------|
| Event Throughput     | 1,000/sec        | 10,000/sec | 100,000/sec |
| Concurrent Analysts  | 5                | 25         | 100         |
| Data Retention       | 30 days          | 90 days    | 365 days    |
| Query Response (p95) | <1s              | <500ms     | <200ms      |
| ML Inference Latency | <5ms             | <2ms       | <1ms        |

High Availability Design

Service Redundancy

Critical Services (require 99.9% uptime):

  • Wazuh Manager: 2+ nodes with failover
  • Wazuh Indexer: 3+ nodes (quorum-based)
  • Cassandra: 3+ nodes (RF=3)

Non-Critical Services (tolerate brief downtime):

  • Grafana: Single instance acceptable (read-only impact)
  • Shuffle: Workflow queue prevents data loss

Data Persistence

Volumes:

  • All stateful services use named Docker volumes
  • Volume backup strategy: daily snapshots
  • Retention: 30 days for volume backups

Backup Procedures:

# Wazuh Indexer: take a named snapshot (a repository called "backup" must be registered first)
docker exec wazuh-indexer curl -X PUT "localhost:9200/_snapshot/backup/snapshot-$(date +%F)"

# Cassandra: snapshot all keyspaces on this node
docker exec cassandra nodetool snapshot

# ChromaDB: archive the persistent data volume (adjust the in-container path to your mount)
docker exec chromadb tar czf /tmp/chromadb-backup.tgz /chroma/chroma
docker cp chromadb:/tmp/chromadb-backup.tgz ./

Security Architecture

Defense in Depth

Layer 1: Network Segmentation

  • Isolated Docker networks per stack
  • No direct backend exposure to internet
  • Firewall rules restrict inter-service communication

Layer 2: Authentication & Authorization

  • API key authentication for service-to-service
  • OAuth2/SAML for user authentication
  • Role-based access control (RBAC) in TheHive

Layer 3: Encryption

  • TLS 1.3 for all external communication
  • Self-signed certificates (development)
  • Let's Encrypt integration (production)

Layer 4: Secrets Management

  • Environment variable injection
  • Docker secrets for production
  • HashiCorp Vault integration (future)

Layer 5: Audit Logging

  • All API calls logged to Wazuh
  • Immutable audit trail
  • Retention: 365 days minimum

Threat Model

Assumed Threats:

  • External network attackers
  • Compromised web application
  • Insider threats (malicious analyst)
  • Supply chain attacks (vulnerable dependencies)

Mitigations:

  • Web Application Firewall (WAF) recommended
  • Principle of least privilege
  • Audit logging and anomaly detection
  • Dependency scanning (Dependabot, Snyk)

See Security Guide for detailed hardening procedures.


Integration Patterns

Event-Driven Architecture

Webhooks:

  • Wazuh → TheHive: Alert creation on rule match
  • TheHive → Shuffle: Case status changes trigger workflows
  • AlertManager → Shuffle: Infrastructure alerts trigger remediation

Benefits:

  • Loose coupling between services
  • Asynchronous processing prevents blocking
  • Retry mechanisms handle transient failures
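The retry behaviour mentioned above can be sketched as a small exponential-backoff helper; parameter names and defaults are illustrative:

```python
import time

def with_retries(fn, attempts: int = 4, base_delay: float = 0.5):
    """Call fn(), retrying with exponential backoff; re-raise on final failure."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** i))  # 0.5s, 1s, 2s, ...
```

Wrapping each webhook delivery in such a helper lets transient network faults heal without dropping events.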

API-First Design

RESTful APIs:

  • All services expose standardized REST endpoints
  • OpenAPI/Swagger documentation auto-generated
  • Consistent error handling (RFC 7807 Problem Details)

Example API Flow:

POST /triage
  → GET /ml-inference/predict (ML classification)
  → GET /rag-service/retrieve (MITRE context)
  → POST /ollama/api/generate (LLM analysis)
  → Response: Enriched alert
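The final merge step of that flow might look like the sketch below; the field names are illustrative, not the triage service's actual schema:

```python
def enrich_alert(alert: dict, prediction: dict,
                 mitre_hits: list, llm_summary: str) -> dict:
    """Combine pipeline outputs into the enriched alert handed to TheHive."""
    return {
        **alert,
        "ml": {
            "label": prediction.get("prediction"),
            "confidence": prediction.get("confidence"),
        },
        "mitre_techniques": [t["id"] for t in mitre_hits],
        "analysis": llm_summary,
    }
```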

Development & Deployment

CI/CD Pipeline (Planned)

Code Commit → GitHub Actions
                    ↓
              Unit Tests
                    ↓
              Docker Build
                    ↓
         Integration Tests
                    ↓
      Deploy to Staging
                    ↓
         Smoke Tests
                    ↓
    Production Deployment

Configuration Management

Environment Variables:

  • .env file for local development
  • Docker Compose env_file directive
  • Secrets injected at runtime

Infrastructure as Code:

  • All configurations version-controlled
  • Declarative Docker Compose specifications
  • Idempotent deployment scripts

Future Architecture Enhancements

Short-term (Weeks 3-4)

  • Multi-class ML classification (24 attack types)
  • Reverse proxy (Nginx/Traefik) for HTTPS termination
  • Secrets management (HashiCorp Vault)
  • Automated backups

Medium-term (Months 2-3)

  • Kubernetes migration for production deployments
  • Multi-region deployment for disaster recovery
  • Advanced ML models (deep learning, transformers)
  • Custom Cortex analyzers

Long-term (Months 4-6)

  • Multi-agent collaboration framework
  • Automated playbook generation via LLM
  • Predictive threat modeling
  • Zero-trust network architecture

Appendices

A. Service Dependencies

Wazuh Dashboard → Wazuh Manager → Wazuh Indexer
TheHive → Cassandra + MinIO
Cortex → Cassandra + TheHive
Shuffle → OpenSearch
Alert Triage → ML Inference + RAG Service + Ollama
RAG Service → ChromaDB
Grafana → Prometheus + Loki
AlertManager → Prometheus

B. Resource Requirements

Minimum (Development/Testing):

  • CPU: 4 cores (8 threads)
  • RAM: 16GB
  • Disk: 50GB SSD
  • Network: 100Mbps

Recommended (Production):

  • CPU: 8 cores (16 threads)
  • RAM: 32GB
  • Disk: 250GB NVMe SSD
  • Network: 1Gbps

See System Requirements for detailed specifications.

C. Glossary

  • SIEM: Security Information and Event Management
  • SOAR: Security Orchestration, Automation, and Response
  • RAG: Retrieval-Augmented Generation
  • CTI: Cyber Threat Intelligence
  • MITRE ATT&CK: Adversarial Tactics, Techniques, and Common Knowledge framework
  • IOC: Indicator of Compromise
  • EDR: Endpoint Detection and Response

Architecture Documentation Version: 1.0 · Last Updated: October 24, 2025 · Maintained By: AI-SOC Architecture Team