Running AI Models in Docker: A Practical Guide

Running AI models directly on your system can be messy—different Python versions, CUDA libraries, and dependencies often clash. Docker solves this by packaging everything in containers, giving you a clean, portable way to deploy models.
Why Use Docker for AI?
Isolation: No host pollution with Python/CUDA installs.
Portability: Run the same container on laptop, server, or cloud.
Reproducibility: Pin versions for consistent results.
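In practice, reproducibility means pinning an exact image tag (or, stricter, a digest) rather than :latest. A minimal sketch, using the CUDA base image as an example:
docker pull nvidia/cuda:12.3.2-base-ubuntu22.04
docker image inspect --format '{{index .RepoDigests 0}}' nvidia/cuda:12.3.2-base-ubuntu22.04   # prints nvidia/cuda@sha256:...
The printed digest can be referenced in scripts or compose files so the image never changes underneath you.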
Minimum Requirements
CPU (small models): 8–16 GB RAM, 4+ cores.
GPU (fast inference): NVIDIA GPU (24 GB VRAM recommended for 7B+ models).
Disk: At least 20–30 GB for Docker + models.
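A quick way to check these numbers on a Linux box (the nvidia-smi line only applies if an NVIDIA GPU is installed):
free -h                                                 # total RAM
nproc                                                   # CPU cores
df -h /var/lib/docker                                   # free space where Docker stores images by default
nvidia-smi --query-gpu=name,memory.total --format=csv   # GPU model and VRAM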
Setup
Install Docker (Linux, macOS, Windows supported).
For GPUs: Install the NVIDIA driver and NVIDIA Container Toolkit (a typical configuration is sketched after the verification command below).
Verify GPU access:
docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi
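On a Debian/Ubuntu host with NVIDIA's apt repository already configured, wiring the toolkit into Docker typically looks like this sketch (package names and steps differ by distribution, so check NVIDIA's install guide):
sudo apt-get install -y nvidia-container-toolkit     # toolkit package from NVIDIA's repo
sudo nvidia-ctk runtime configure --runtime=docker   # register the NVIDIA runtime with Docker
sudo systemctl restart docker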
Run a Model (Examples)
CPU (llama.cpp)
mkdir ~/models && cd ~/models
curl -L -o model.gguf "https://example.com/my-model.gguf"
docker run --rm -it -p 8080:8080 \
  -v "$PWD:/models" \
  ghcr.io/ggerganov/llama.cpp:server \
  --model /models/model.gguf --port 8080 --host 0.0.0.0
The --host 0.0.0.0 flag makes the server reachable through the published port.
Test:
curl -s http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Hello AI!","n_predict":50}'
GPU (vLLM)
Llama 3.1 is gated on Hugging Face, so pass an access token; mounting your local Hugging Face cache also avoids re-downloading the weights on every run.
docker run --rm -it --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HF_TOKEN=<your_hf_token> \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct
Query:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"meta-llama/Llama-3.1-8B-Instruct",
       "messages":[{"role":"user","content":"Give me 3 Docker tips"}]}'
Simple Option (Ollama)
The named ollama volume below keeps downloaded models across container restarts (drop --gpus all on a CPU-only machine):
docker run -d --name ollama -p 11434:11434 --gpus all \
  -v ollama:/root/.ollama ollama/ollama:latest
docker exec -it ollama ollama pull llama3.1
docker exec -it ollama ollama run llama3.1 "Summarize CI/CD in 3 lines"
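Ollama also exposes a REST API on the published port, so you can script against it without exec-ing into the container; a minimal non-streaming request looks like this:
curl -s http://localhost:11434/api/generate \
  -d '{"model":"llama3.1","prompt":"Summarize CI/CD in 3 lines","stream":false}'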
Pros & Cons
Pros
Clean setup
Easy GPU/CPU switching
Works locally & in the cloud
Cons
Needs disk and RAM
GPU setup can be tricky
Large model downloads
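Disk pressure is the most common of these in practice; standard Docker commands show where the space goes and reclaim it (prune removes stopped containers and dangling images, so run it deliberately):
docker system df      # disk used by images, containers, and volumes
docker system prune   # remove stopped containers, unused networks, dangling images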
Final Thoughts
For quick experiments, Ollama is the easiest. For production APIs, vLLM is best. And if you’re running lightweight quantized models on CPU, llama.cpp is a great choice. Docker makes all of them simple, reproducible, and portable.