
Running AI Models in Docker: A Practical Guide


Running AI models directly on your system can be messy—different Python versions, CUDA libraries, and dependencies often clash. Docker solves this by packaging everything in containers, giving you a clean, portable way to deploy models.

Why Use Docker for AI?

  • Isolation: No host pollution with Python/CUDA installs.

  • Portability: Run the same container on laptop, server, or cloud.

  • Reproducibility: Pin versions for consistent results.
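
To make the reproducibility point concrete, pin an explicit image tag (or digest) instead of :latest, so every machine pulls exactly the same image. The digest below is a placeholder, not a real value:

    # Explicit tag: everyone gets the same build.
    docker pull nvidia/cuda:12.3.2-base-ubuntu22.04
    # Stricter still: pin the content digest reported by `docker images --digests`.
    docker pull nvidia/cuda@sha256:<digest>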

Minimum Requirements

  • CPU (small models): 8–16 GB RAM, 4+ cores.

  • GPU (fast inference): NVIDIA GPU (24 GB VRAM recommended for 7B+ models).

  • Disk: At least 20–30 GB for Docker + models.
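
On a Linux host, a few standard commands give a quick read on whether you meet these numbers (the disk path assumes Docker's default data root):

    nproc                   # CPU cores
    free -h                 # total RAM
    df -h /var/lib/docker   # free disk where Docker stores images and layers
    nvidia-smi              # GPU model and VRAM, if the NVIDIA driver is installed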

Setup

  1. Install Docker (Linux, macOS, and Windows are supported).

  2. For GPUs: install the NVIDIA driver and the NVIDIA Container Toolkit (a configuration sketch follows these steps).

  3. Verify GPU access:

    docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi
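
For step 2, the exact package-repository setup depends on your distribution (see NVIDIA's documentation); on Debian/Ubuntu the final steps typically look like this sketch:

    # Install the toolkit, register the NVIDIA runtime with Docker, restart the daemon.
    sudo apt-get install -y nvidia-container-toolkit
    sudo nvidia-ctk runtime configure --runtime=docker
    sudo systemctl restart docker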
    

Run a Model (Examples)

CPU (llama.cpp)

    # Use the llama.cpp server image and bind to 0.0.0.0 so the published port is reachable.
    mkdir -p ~/models && cd ~/models
    curl -L -o model.gguf "https://example.com/my-model.gguf"
    docker run --rm -it -p 8080:8080 \
      -v "$PWD:/models" \
      ghcr.io/ggerganov/llama.cpp:server \
      --model /models/model.gguf --port 8080 --host 0.0.0.0
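
The command above runs in the foreground; to keep the server up in the background, a detached variant with a restart policy looks like this (the container name is arbitrary):

    docker run -d --name llama-server --restart unless-stopped -p 8080:8080 \
      -v "$HOME/models:/models" \
      ghcr.io/ggerganov/llama.cpp:server \
      --model /models/model.gguf --port 8080 --host 0.0.0.0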
    

Test:

    curl -s http://localhost:8080/completion \
     -H "Content-Type: application/json" \
     -d '{"prompt":"Hello AI!","n_predict":50}'
    

GPU (vLLM)

    docker run --rm -it --gpus all -p 8000:8000 \
      vllm/vllm-openai:latest \
      --model meta-llama/Llama-3.1-8B-Instruct
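
Note that meta-llama/Llama-3.1-8B-Instruct is a gated model on Hugging Face, so you normally need to pass an access token and, ideally, reuse your local Hugging Face cache so the weights are not re-downloaded on every run (this assumes your token is exported as HF_TOKEN):

    docker run --rm -it --gpus all -p 8000:8000 \
      -e HUGGING_FACE_HUB_TOKEN="$HF_TOKEN" \
      -v ~/.cache/huggingface:/root/.cache/huggingface \
      vllm/vllm-openai:latest \
      --model meta-llama/Llama-3.1-8B-Instruct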
    

Query:

    curl http://localhost:8000/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{"model":"meta-llama/Llama-3.1-8B-Instruct",
          "messages":[{"role":"user","content":"Give me 3 Docker tips"}]}'
    

Simple Option (Ollama)

    docker run -d --name ollama -p 11434:11434 --gpus all ollama/ollama:latest
    docker exec -it ollama ollama pull llama3.1
    docker exec -it ollama ollama run llama3.1 "Summarize CI/CD in 3 lines"
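
Two practical notes: mounting a named volume at /root/.ollama keeps pulled models across container recreations, and the published port also serves Ollama's REST API, so you can query it without exec'ing into the container:

    # If you recreate the container, add a named volume so pulled models persist.
    docker run -d --name ollama -p 11434:11434 --gpus all \
      -v ollama:/root/.ollama ollama/ollama:latest

    # Query the REST API directly.
    curl http://localhost:11434/api/generate \
     -d '{"model":"llama3.1","prompt":"Summarize CI/CD in 3 lines","stream":false}'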
    

Pros & Cons

Pros

  • Clean setup

  • Easy GPU/CPU switching

  • Works locally & in the cloud

Cons

  • Needs disk and RAM

  • GPU setup can be tricky

  • Large model downloads

Final Thoughts

For quick experiments, Ollama is the easiest. For production APIs, vLLM is best. And if you’re running lightweight quantized models on CPU, llama.cpp is a great choice. Docker makes all of them simple, reproducible, and portable.
