Week 8: Inference Frameworks in Practice
Part of the LLM Engineering & Deployment Certification Program
This repository contains code examples for deploying and benchmarking LLM inference frameworks. The module covers:
- Baseline Inference - Hugging Face Transformers + FastAPI serving
- vLLM - High-throughput serving with PagedAttention and continuous batching
- TGI - Hugging Face's Text Generation Inference server
- SGLang - Structured generation and prefix caching
- GPU Quantization - GPTQ and AWQ for faster inference
- GGUF & llama.cpp - CPU/local inference with quantized models
Prerequisites:

- Python 3.10+
- CUDA-capable GPU (required for vLLM, TGI, SGLang)
- ~8GB+ GPU memory for Llama 3.2 1B experiments
- Docker (optional, for TGI deployment)
Create a virtual environment:
```bash
python -m venv venv
```

Activate the virtual environment:

```bash
# On Windows:
venv\Scripts\activate

# On Mac/Linux:
source venv/bin/activate
```

For baseline experiments:

```bash
pip install torch transformers peft accelerate fastapi uvicorn
```

For vLLM:

```bash
pip install vllm
```

For llama.cpp (CPU inference):

```bash
pip install llama-cpp-python
```

Some models (like Llama) require authentication:

```bash
huggingface-cli login
```

Accept the model license on the Hugging Face model page before downloading.
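Before starting the GPU-backed servers (vLLM, TGI, SGLang), it is worth confirming that PyTorch can see your GPU. A quick check:

```python
import torch

# vLLM, TGI, and SGLang all need a working CUDA setup.
if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")
    print(f"Free GPU memory: {free / 1024**3:.1f} / {total / 1024**3:.1f} GiB")
else:
    print("No CUDA device found; only the llama.cpp / GGUF path will work.")
```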
The baseline FastAPI server demonstrates naive inference with Hugging Face Transformers:
```bash
cd code
uvicorn deploy_baseline:app --host 0.0.0.0 --port 8000
```

Endpoints:
| Endpoint | Description |
|---|---|
| `POST /generate` | Non-streaming text generation |
| `POST /generate_stream` | Streaming text generation |
| `POST /ttft_itl_batched` | Benchmark TTFT/ITL under load |
Test generation:
```bash
curl -X POST http://localhost:8000/generate \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Summarize: The quick brown fox...", "max_new_tokens": 100}'
```
Run benchmarks:

```bash
curl -X POST http://localhost:8000/ttft_itl_batched \
    -H "Content-Type: application/json" \
    -d '{"input_tokens": 512, "generated_tokens": 256, "num_prompts": 20, "batch_size": 4}'
```
Start a vLLM server with LoRA support:

```bash
vllm serve meta-llama/Llama-3.2-1B-Instruct \
    --enable-lora \
    --lora-modules tuned_model=moo3030/Llama-3.2-1B-QLoRA-Summarizer-adapters \
    --gpu-memory-utilization 0.7 \
    --max-model-len 2048
```
Test with OpenAI-compatible API:

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-3.2-1B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}]
    }'
```
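Because the server speaks the OpenAI API, you can also use the official `openai` Python client pointed at it (requires `pip install openai`; the `api_key` value is a placeholder that vLLM ignores by default):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",  # use "tuned_model" to route through the LoRA adapter
    messages=[{"role": "user", "content": "Summarize: The quick brown fox..."}],
    max_tokens=100,
)
print(resp.choices[0].message.content)
```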
Benchmark with vLLM's built-in tool:

```bash
vllm bench serve \
    --backend vllm \
    --model meta-llama/Llama-3.2-1B-Instruct \
    --dataset-name random \
    --random-input-len 512 \
    --random-output-len 256 \
    --num-prompts 20 \
    --max-concurrency 4
```

| Notebook | Description |
|---|---|
| `inference_baseline.ipynb` | Run baseline server and benchmark TTFT/ITL |
| `inference_vllm.ipynb` | Deploy vLLM, test streaming, concurrent requests |
| Lesson | Topic | Key Concepts |
|---|---|---|
| 1 | Inference Basics | Autoregressive generation, prefill vs decode, bottlenecks |
| 2 | Benchmarking | TTFT, ITL, E2E, throughput, warmup, percentile reporting |
| 3 | KV Cache | Cache structure, memory cost, fragmentation |
| 4 | Attention Optimizations | Flash Attention, Paged Attention |
| 5 | Quantization | INT8/INT4, GPTQ, AWQ, GGUF formats |
| 6 | Scheduling | Continuous batching, speculative decoding |
| Lesson | Topic | Key Concepts |
|---|---|---|
| 0 | HF Baseline | FastAPI serving, streaming, baseline metrics |
| 1 | vLLM | PagedAttention, continuous batching, OpenAI API |
| 2 | TGI | Hugging Face inference server, Docker deployment |
| 3 | SGLang | RadixAttention, structured output, prefix caching |
| 4 | GPU Quantization | Loading GPTQ/AWQ models in vLLM |
| 5 | GGUF & llama.cpp | CPU inference, quantization levels |
| 6 | NIM Evaluation | End-to-end framework comparison |
```
rt-llm-eng-cert-week8/
├── code/
│   ├── deploy_baseline.py         # Baseline FastAPI server
│   ├── inference_baseline.ipynb   # Baseline benchmarking notebook
│   └── inference_vllm.ipynb       # vLLM deployment notebook
├── .gitignore
├── requirements.txt
└── README.md
```
| Framework | Best For | Key Feature |
|---|---|---|
| HF Baseline | Prototyping, simple deployments | Simplicity, flexibility |
| vLLM | High-throughput production serving | PagedAttention, continuous batching |
| TGI | HuggingFace ecosystem integration | Docker deployment, safety features |
| SGLang | Structured output, agentic workloads | RadixAttention, constrained decoding |
| llama.cpp | CPU/local inference | GGUF quantization, no GPU required |
| Metric | What It Measures | What Drives It |
|---|---|---|
| TTFT | Time to first token | Prefill computation + scheduling |
| ITL | Time between tokens | Decode efficiency, memory bandwidth |
| TPOT | Time per output token | Average decode latency |
| Throughput | Tokens/second | Batching, hardware utilization |
| RPS | Requests/second | Overall system capacity |
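To make these definitions concrete, here is a small sketch of how the per-request numbers are typically derived from per-token arrival timestamps (the timestamps below are made up; in practice they come from the streaming loop):

```python
import statistics

def summarize_request(token_times, request_start):
    """Derive per-request latency metrics from per-token arrival times (seconds)."""
    ttft = token_times[0] - request_start                          # time to first token
    itls = [b - a for a, b in zip(token_times, token_times[1:])]   # inter-token latencies
    e2e = token_times[-1] - request_start                          # end-to-end latency
    tpot = (e2e - ttft) / max(len(token_times) - 1, 1)             # avg time per token after the first
    return ttft, itls, e2e, tpot

# Example with fabricated timestamps for three output tokens.
ttft, itls, e2e, tpot = summarize_request([0.12, 0.15, 0.18], request_start=0.0)
print(f"TTFT={ttft * 1000:.0f} ms  ITL p50={statistics.median(itls) * 1000:.0f} ms")
print(f"E2E={e2e:.2f} s  TPOT={tpot * 1000:.0f} ms  "
      f"decode throughput ≈ {len(itls) / (e2e - ttft):.1f} tokens/s")
```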
- GPU Memory: Llama 3.2 1B requires ~4GB in FP16. Larger models need more memory or quantization.
- vLLM: Requires CUDA. Use `--gpu-memory-utilization` to control memory allocation.
- Quantization: 4-bit models reduce memory ~4x with minimal quality loss.
- CPU Inference: llama.cpp with GGUF models works on any hardware but is slower (see the sketch below).
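For the CPU path, a minimal llama-cpp-python sketch (the GGUF filename below is a placeholder; download a quantized Llama 3.2 1B GGUF of your choice first):

```python
from llama_cpp import Llama

# Load a locally downloaded GGUF model; the path is a placeholder.
llm = Llama(
    model_path="./models/llama-3.2-1b-instruct-q4_k_m.gguf",
    n_ctx=2048,       # context window
    n_threads=8,      # tune to your CPU
    verbose=False,
)

out = llm(
    "Summarize in one sentence: The quick brown fox jumps over the lazy dog.",
    max_tokens=64,
    temperature=0.2,
)
print(out["choices"][0]["text"].strip())
```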
This work is licensed under CC BY-NC-SA 4.0.
You are free to share and adapt this material for non-commercial purposes, provided you:

- Give appropriate credit and indicate any changes made
- Distribute adaptations under the same license
See LICENSE for full terms.
For questions or issues related to this repository, please refer to the course materials or contact your instructor.