
Running AI Models in Docker: A Practical Guide


Running AI models directly on your system can be messy—different Python versions, CUDA libraries, and dependencies often clash. Docker solves this by packaging everything in containers, giving you a clean, portable way to deploy models.

Why Use Docker for AI?

  • Isolation: No host pollution with Python/CUDA installs.

  • Portability: Run the same container on laptop, server, or cloud.

  • Reproducibility: Pin versions for consistent results.
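
To make the reproducibility point concrete, pin an explicit image tag (or digest) instead of :latest, so every machine pulls exactly the same image. The digest below is a placeholder, not a real value:

    # Explicit tag: everyone gets the same build.
    docker pull nvidia/cuda:12.3.2-base-ubuntu22.04
    # Stricter still: pin the content digest reported by `docker images --digests`.
    docker pull nvidia/cuda@sha256:<digest>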

Minimum Requirements

  • CPU (small models): 8–16 GB RAM, 4+ cores.

  • GPU (fast inference): NVIDIA GPU (24 GB VRAM recommended for 7B+ models).

  • Disk: At least 20–30 GB for Docker + models.
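
On a Linux host, a few standard commands give a quick read on whether you meet these numbers (the disk path assumes Docker's default data root):

    nproc                   # CPU cores
    free -h                 # total RAM
    df -h /var/lib/docker   # free disk where Docker stores images and layers
    nvidia-smi              # GPU model and VRAM, if the NVIDIA driver is installed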

Setup

  1. Install Docker (Linux, macOS, and Windows are supported).

  2. For GPUs: install the NVIDIA driver and the NVIDIA Container Toolkit (a configuration sketch follows these steps).

  3. Verify GPU access:

    docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi
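
For step 2, the exact package-repository setup depends on your distribution (see NVIDIA's documentation); on Debian/Ubuntu the final steps typically look like this sketch:

    # Install the toolkit, register the NVIDIA runtime with Docker, restart the daemon.
    sudo apt-get install -y nvidia-container-toolkit
    sudo nvidia-ctk runtime configure --runtime=docker
    sudo systemctl restart docker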
    

Run a Model (Examples)

CPU (llama.cpp)

    # Use the llama.cpp server image and bind to 0.0.0.0 so the published port is reachable.
    mkdir -p ~/models && cd ~/models
    curl -L -o model.gguf "https://example.com/my-model.gguf"
    docker run --rm -it -p 8080:8080 \
      -v "$PWD:/models" \
      ghcr.io/ggerganov/llama.cpp:server \
      --model /models/model.gguf --port 8080 --host 0.0.0.0
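
The command above runs in the foreground; to keep the server up in the background, a detached variant with a restart policy looks like this (the container name is arbitrary):

    docker run -d --name llama-server --restart unless-stopped -p 8080:8080 \
      -v "$HOME/models:/models" \
      ghcr.io/ggerganov/llama.cpp:server \
      --model /models/model.gguf --port 8080 --host 0.0.0.0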
    

Test:

    curl -s http://localhost:8080/completion \
     -H "Content-Type: application/json" \
     -d '{"prompt":"Hello AI!","n_predict":50}'
    

GPU (vLLM)

    docker run --rm -it --gpus all -p 8000:8000 \
      vllm/vllm-openai:latest \
      --model meta-llama/Llama-3.1-8B-Instruct
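
Note that meta-llama/Llama-3.1-8B-Instruct is a gated model on Hugging Face, so you normally need to pass an access token and, ideally, reuse your local Hugging Face cache so the weights are not re-downloaded on every run (this assumes your token is exported as HF_TOKEN):

    docker run --rm -it --gpus all -p 8000:8000 \
      -e HUGGING_FACE_HUB_TOKEN="$HF_TOKEN" \
      -v ~/.cache/huggingface:/root/.cache/huggingface \
      vllm/vllm-openai:latest \
      --model meta-llama/Llama-3.1-8B-Instruct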
    

Query:

    curl http://localhost:8000/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{"model":"meta-llama/Llama-3.1-8B-Instruct",
          "messages":[{"role":"user","content":"Give me 3 Docker tips"}]}'
    

Simple Option (Ollama)

    docker run -d --name ollama -p 11434:11434 --gpus all ollama/ollama:latest
    docker exec -it ollama ollama pull llama3.1
    docker exec -it ollama ollama run llama3.1 "Summarize CI/CD in 3 lines"
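
Two practical notes: mounting a named volume at /root/.ollama keeps pulled models across container recreations, and the published port also serves Ollama's REST API, so you can query it without exec'ing into the container:

    # If you recreate the container, add a named volume so pulled models persist.
    docker run -d --name ollama -p 11434:11434 --gpus all \
      -v ollama:/root/.ollama ollama/ollama:latest

    # Query the REST API directly.
    curl http://localhost:11434/api/generate \
     -d '{"model":"llama3.1","prompt":"Summarize CI/CD in 3 lines","stream":false}'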
    

Pros & Cons

Pros

  • Clean setup

  • Easy GPU/CPU switching

  • Works locally & in the cloud

Cons

  • Needs disk and RAM

  • GPU setup can be tricky

  • Large model downloads

Final Thoughts

For quick experiments, Ollama is the easiest. For production APIs, vLLM is best. And if you’re running lightweight quantized models on CPU, llama.cpp is a great choice. Docker makes all of them simple, reproducible, and portable.
