
Ship Your ML Model Like a Pro: End-to-End CI/CD to AWS (ECR and ECS Fargate) with FastAPI & GitHub Actions


Want your ML model to go from “works on my laptop” to “serving real users behind a load balancer”? This tutorial walks you end-to-end: project layout, training a model, packaging an API, writing tests, containerizing, and setting up a production-grade CI/CD pipeline that builds, scans, pushes, and deploys to AWS.

You’ll finish with:

  • A working /predict API backed by your model

  • Automated tests + linting

  • A secure GitHub Actions pipeline using AWS OIDC, Amazon ECR and ECS Fargate

  • Zero server management (fully managed containers)


    What we’re building

    Flow:
    Commit → GitHub Actions: test → build → scan → push image to ECR → deploy ECS task → behind ALB → /predict live.

    Stack: Python, scikit-learn, FastAPI, Docker, pytest, GitHub Actions, AWS ECR, ECS Fargate, ALB.


    1) Repo structure

    ml-cicd-aws/
    ├─ app/
    │  ├─ main.py
    │  ├─ model.py
    │  ├─ schemas.py
    │  ├─ __init__.py
    ├─ data/
    │  └─ iris.csv              # (optional; we’ll use sklearn’s dataset)
    ├─ models/
    │  └─ model.pkl             # produced by training job
    ├─ tests/
    │  ├─ test_api.py
    │  └─ test_model.py
    ├─ training/
    │  └─ train.py
    ├─ Dockerfile
    ├─ requirements.txt
    ├─ runtime.txt              # optional pin for build tooling
    ├─ .dockerignore
    ├─ .gitignore
    ├─ Makefile
    └─ .github/
       └─ workflows/
          └─ ci-cd.yml

    2) Minimal ML model (training script)

    We’ll train a simple classifier (Iris) and persist it to models/model.pkl.

    training/train.py

    import joblib
    from pathlib import Path
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    
    OUT_DIR = Path(__file__).resolve().parents[1] / "models"
    OUT_DIR.mkdir(parents=True, exist_ok=True)
    MODEL_PATH = OUT_DIR / "model.pkl"
    
    def train():
        iris = load_iris()
        X, y = iris.data, iris.target
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, stratify=y, random_state=42
        )
        clf = RandomForestClassifier(n_estimators=120, random_state=42)
        clf.fit(X_train, y_train)
        joblib.dump({"model": clf, "target_names": iris.target_names}, MODEL_PATH)
        print(f"Saved: {MODEL_PATH}")
    
    if __name__ == "__main__":
        train()

    Run it locally:

    python -m venv .venv && source .venv/bin/activate
    pip install -r requirements.txt
    python training/train.py
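
    To sanity-check the artifact before wiring up the API, you can load it the same way the service will. A quick local check (not part of the repo layout above):

    import joblib

    # Load the saved bundle and run one prediction; [5.1, 3.5, 1.4, 0.2] is a classic setosa sample.
    blob = joblib.load("models/model.pkl")
    clf, names = blob["model"], blob["target_names"]
    print(names[clf.predict([[5.1, 3.5, 1.4, 0.2]])[0]])  # expected: setosa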

    3) FastAPI app to serve predictions

    app/schemas.py

    from pydantic import BaseModel, Field
    from typing import List
    
    class PredictRequest(BaseModel):
        # For iris: sepal_length, sepal_width, petal_length, petal_width
        instances: List[List[float]] = Field(..., examples=[[[5.1, 3.5, 1.4, 0.2]]])

    app/model.py

    import joblib
    from pathlib import Path
    
    class ModelService:
        def __init__(self, path: str = "models/model.pkl"):
            p = Path(path)
            if not p.exists():
                raise FileNotFoundError(f"Model not found at {p.resolve()}")
            blob = joblib.load(p)
            self.model = blob["model"]
            self.target_names = blob["target_names"]
    
        def predict(self, X):
            labels = self.model.predict(X)
            return [self.target_names[i] for i in labels]
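
    If you also want class probabilities alongside the labels (RandomForestClassifier exposes predict_proba), a small helper on top of ModelService could look like this sketch. It isn’t wired into the endpoints in this tutorial:

    from app.model import ModelService

    def predict_with_proba(svc: ModelService, X):
        """Return each predicted label together with its per-class probabilities."""
        labels = svc.model.predict(X)
        probas = svc.model.predict_proba(X)  # columns follow the iris class order 0..2
        return [
            {
                "label": str(svc.target_names[label]),
                "probabilities": {str(n): float(p) for n, p in zip(svc.target_names, row)},
            }
            for label, row in zip(labels, probas)
        ]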

    app/main.py

    from fastapi import FastAPI
    from app.schemas import PredictRequest
    from app.model import ModelService
    
    app = FastAPI(title="ML Iris Predictor")
    svc = ModelService()  # loads on startup
    
    @app.get("/health")
    def health():
        return {"status": "ok"}
    
    @app.post("/predict")
    def predict(payload: PredictRequest):
        preds = svc.predict(payload.instances)
        return {"predictions": preds}

    4) Tests

    tests/test_model.py

    import os
    from app.model import ModelService
    
    def test_model_loads():
        assert os.path.exists("models/model.pkl"), "Run training/train.py first"
        svc = ModelService()
        assert svc.model is not None

    tests/test_api.py

    from fastapi.testclient import TestClient
    from app.main import app
    
    client = TestClient(app)
    
    def test_health():
        r = client.get("/health")
        assert r.status_code == 200
        assert r.json()["status"] == "ok"
    
    def test_predict():
        payload = {"instances": [[5.1, 3.5, 1.4, 0.2]]}
        r = client.post("/predict", json=payload)
        assert r.status_code == 200
        assert "predictions" in r.json()
        assert isinstance(r.json()["predictions"], list)
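
    An optional extra case worth appending to tests/test_api.py: malformed payloads should be rejected by Pydantic validation with a 422, not crash the server.

    def test_predict_rejects_malformed_payload():
        # "instances" must be a list of float lists; a plain string should fail validation.
        r = client.post("/predict", json={"instances": "not-a-list"})
        assert r.status_code == 422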

    5) Requirements

    requirements.txt

    fastapi==0.115.2
    uvicorn[standard]==0.30.6
    scikit-learn==1.5.2
    joblib==1.4.2
    pydantic==2.9.2
    pytest==8.3.3
    httpx==0.27.2  # required by fastapi.testclient (TestClient) in the tests

    6) Containerization

    .dockerignore

    .venv
    __pycache__
    *.pyc
    *.pyo
    .git
    .gitignore
    *.ipynb
    data/

    Dockerfile

    FROM python:3.11-slim
    
    # System deps (curl is needed for the container health checks below)
    RUN apt-get update && apt-get install -y --no-install-recommends \
        gcc build-essential curl && \
        rm -rf /var/lib/apt/lists/*
    
    WORKDIR /app
    
    # Pre-copy requirements for better layer caching
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt
    
    # Copy the rest
    COPY app/ app/
    COPY models/ models/
    
    # FastAPI runs on 8000
    EXPOSE 8000
    
    # Container healthcheck; the ALB additionally probes /health through its target group
    HEALTHCHECK --interval=30s --timeout=3s \
     CMD curl -f http://localhost:8000/health || exit 1
    
    # Start the API
    CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

    Build & run locally:

    docker build -t ml-iris:local .
    docker run -p 8000:8000 ml-iris:local
    # test
    curl -s http://localhost:8000/health
    curl -s -X POST http://localhost:8000/predict -H "Content-Type: application/json" \
      -d '{"instances": [[5.1, 3.5, 1.4, 0.2],[6.0, 2.7, 5.1, 1.6]]}'

    7) Makefile (handy shortcuts)

    Makefile

    IMAGE ?= ml-iris
    TAG ?= local

    venv:
    	python -m venv .venv && . .venv/bin/activate && pip install -r requirements.txt
    
    train:
    	. .venv/bin/activate || true; python training/train.py
    
    test:
    	pytest -q
    
    build:
    	docker build -t $(IMAGE):$(TAG) .
    
    run:
    	docker run -p 8000:8000 $(IMAGE):$(TAG)
    
    .PHONY: venv train test build run

    8) AWS setup (one-time)

    8.1 Create ECR repository

    aws ecr create-repository --repository-name ml-iris --image-scanning-configuration scanOnPush=true

    Note the ECR URI: <ACCOUNT_ID>.dkr.ecr.ap-south-1.amazonaws.com/ml-iris

    8.2 Create ECS Cluster + Fargate Service (with ALB)

    You can click through in the console (ECS → Create cluster → Fargate), or use IaC later. For the first pass, the console is fine:

    • Cluster: ml-cluster

  • Task Definition: Fargate, CPU 0.25 vCPU, memory 0.5–1 GB, container port 8000

  • Load Balancer: Application Load Balancer, health check path /health, target type IP

  • Desired tasks: 1 (scale later)

  • Keep AWSVPC networking; pick public subnets (or private with NAT).
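
    One more one-time step worth doing here: the task definition in section 10 ships container logs to the CloudWatch Logs group /ecs/ml-iris, and if that group doesn’t exist the first task can fail to start. Create it from the console, the CLI, or a small boto3 script like this sketch (assumes boto3 is installed and AWS credentials are configured):

    import boto3
    from botocore.exceptions import ClientError

    logs = boto3.client("logs", region_name="ap-south-1")
    try:
        # Must match the awslogs-group used in the task definition.
        logs.create_log_group(logGroupName="/ecs/ml-iris")
        print("Created /ecs/ml-iris")
    except ClientError as e:
        if e.response["Error"]["Code"] == "ResourceAlreadyExistsException":
            print("Log group already exists")
        else:
            raise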


    9) GitHub Actions with OIDC (no long-lived AWS keys)

    9.1 AWS IAM Role for GitHub OIDC

    1. Create OIDC provider (usually already exists if you’ve done it before):

    • Provider URL: https://token.actions.githubusercontent.com

  • Audience: sts.amazonaws.com

    2. Create an IAM role trusted by that provider, and attach a policy that lets it push to ECR and deploy to ECS.

    Trust policy (example):

    {
      "Version": "2012-10-17",
      "Statement": [{
        "Effect": "Allow",
        "Principal": { "Federated": "arn:aws:iam::<ACCOUNT_ID>:oidc-provider/token.actions.githubusercontent.com" },
        "Action": "sts:AssumeRoleWithWebIdentity",
        "Condition": {
          "StringEquals": {
            "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
          },
          "StringLike": {
            "token.actions.githubusercontent.com:sub": "repo:<GITHUB_OWNER>/<REPO_NAME>:*"
          }
        }
      }]
    }

    Permissions policy (minimal, scoped to ECR/ECS):

    {
      "Version": "2012-10-17",
      "Statement": [
        {"Effect": "Allow","Action": ["ecr:GetAuthorizationToken"],"Resource": "*"},
        {"Effect": "Allow","Action": ["ecr:BatchCheckLayerAvailability","ecr:CompleteLayerUpload","ecr:UploadLayerPart","ecr:InitiateLayerUpload","ecr:PutImage","ecr:DescribeRepositories"],"Resource": "arn:aws:ecr:ap-south-1:<ACCOUNT_ID>:repository/ml-iris"},
        {"Effect": "Allow","Action": ["ecs:DescribeServices","ecs:DescribeTaskDefinition","ecs:RegisterTaskDefinition","ecs:UpdateService"],"Resource": "*"},
        {"Effect": "Allow","Action": ["iam:PassRole"],"Resource": "*","Condition": {"StringEquals":{"iam:PassedToService":"ecs-tasks.amazonaws.com"}}}
      ]
    }

    Save the role ARN as a GitHub Actions secret: AWS_ROLE_TO_ASSUME.

    Also add these Actions secrets:

  • AWS_REGION = ap-south-1 (or your region)

  • AWS_ACCOUNT_ID = your 12-digit account ID (the workflow uses it to build the ECR image URI)

  • ECR_REPO = ml-iris

  • ECS_CLUSTER = ml-cluster

  • ECS_SERVICE = ml-iris-service (the service you’ll create)

  • (Optional) IMAGE_TAG, which defaults to github.sha in the workflow


    10) CI/CD pipeline

    .github/workflows/ci-cd.yml

    name: CI/CD - ML to AWS Fargate
    
    on:
      push:
        branches: [ "main" ]
      workflow_dispatch:
    
    permissions:
      id-token: write   # for OIDC
      contents: read
    
    env:
      AWS_REGION: ${{ secrets.AWS_REGION }}
      ECR_REPO: ${{ secrets.ECR_REPO }}
      ECS_CLUSTER: ${{ secrets.ECS_CLUSTER }}
      ECS_SERVICE: ${{ secrets.ECS_SERVICE }}
      IMAGE_TAG: ${{ github.sha }}
    
    jobs:
      build-test:
        runs-on: ubuntu-latest
        steps:
          - name: Checkout
            uses: actions/checkout@v4
    
          - name: Set up Python
            uses: actions/setup-python@v5
            with:
              python-version: "3.11"
    
          - name: Install deps
            run: pip install -r requirements.txt
    
          - name: Train model (fresh artifact)
            run: python training/train.py
    
          - name: Run tests
            run: pytest -q
    
          - name: Build Docker image
            run: |
              IMAGE_URI=${{ secrets.AWS_ACCOUNT_ID }}.dkr.ecr.${{ env.AWS_REGION }}.amazonaws.com/${{ env.ECR_REPO }}:${{ env.IMAGE_TAG }}
              docker build -t $IMAGE_URI .
    
          - name: Trivy scan (optional but recommended)
            uses: aquasecurity/trivy-action@0.28.0
            with:
              image-ref: ${{ secrets.AWS_ACCOUNT_ID }}.dkr.ecr.${{ env.AWS_REGION }}.amazonaws.com/${{ env.ECR_REPO }}:${{ env.IMAGE_TAG }}
              format: 'table'
              vuln-type: 'os,library'
              exit-code: '0'  # don't fail build initially; tighten later
    
          - name: Configure AWS credentials (OIDC)
            uses: aws-actions/configure-aws-credentials@v4
            with:
              role-to-assume: ${{ secrets.AWS_ROLE_TO_ASSUME }}
              aws-region: ${{ env.AWS_REGION }}
    
          - name: Login to ECR
            id: ecr
            uses: aws-actions/amazon-ecr-login@v2
    
          - name: Push to ECR
            run: |
              IMAGE_URI=${{ secrets.AWS_ACCOUNT_ID }}.dkr.ecr.${{ env.AWS_REGION }}.amazonaws.com/${{ env.ECR_REPO }}:${{ env.IMAGE_TAG }}
              docker push $IMAGE_URI
    
          - name: Render task definition
            id: taskdef
            run: |
              IMAGE_URI=${{ secrets.AWS_ACCOUNT_ID }}.dkr.ecr.${{ env.AWS_REGION }}.amazonaws.com/${{ env.ECR_REPO }}:${{ env.IMAGE_TAG }}
              cat > taskdef.json << 'JSON'
              {
                "family": "ml-iris-td",
                "networkMode": "awsvpc",
                "requiresCompatibilities": ["FARGATE"],
                "cpu": "256",
                "memory": "512",
                "executionRoleArn": "arn:aws:iam::${ACCOUNT_ID}:role/ecsTaskExecutionRole",
                "taskRoleArn": "arn:aws:iam::${ACCOUNT_ID}:role/ecsTaskRole",
                "containerDefinitions": [{
                  "name": "ml-iris",
                  "image": "${IMAGE_URI}",
                  "essential": true,
                  "portMappings": [{"containerPort": 8000, "protocol": "tcp"}],
                  "logConfiguration": {
                    "logDriver": "awslogs",
                    "options": {
                      "awslogs-group": "/ecs/ml-iris",
                      "awslogs-region": "${REGION}",
                      "awslogs-stream-prefix": "ecs"
                    }
                  },
                  "healthCheck": {
                    "command": ["CMD-SHELL", "curl -f http://localhost:8000/health || exit 1"],
                    "interval": 30,
                    "timeout": 5,
                    "retries": 3,
                    "startPeriod": 10
                  }
                }]
              }
              JSON
              sed -i "s|\${IMAGE_URI}|$IMAGE_URI|g" taskdef.json
              sed -i "s|\${ACCOUNT_ID}|${{ secrets.AWS_ACCOUNT_ID }}|g" taskdef.json
              sed -i "s|\${REGION}|${{ env.AWS_REGION }}|g" taskdef.json
              cat taskdef.json
    
          - name: Register new task definition
            id: register
            run: |
              ARN=$(aws ecs register-task-definition --cli-input-json file://taskdef.json --query 'taskDefinition.taskDefinitionArn' --output text)
              echo "TASK_DEF_ARN=$ARN" >> $GITHUB_OUTPUT
    
          - name: Deploy service (rolling update)
            run: |
              aws ecs update-service \
                --cluster "${{ env.ECS_CLUSTER }}" \
                --service "${{ env.ECS_SERVICE }}" \
                --task-definition "${{ steps.register.outputs.TASK_DEF_ARN }}" \
                --force-new-deployment
    
          - name: Wait for stability
            run: |
              aws ecs wait services-stable \
                --cluster "${{ env.ECS_CLUSTER }}" \
                --services "${{ env.ECS_SERVICE }}"

    Set the AWS_ACCOUNT_ID secret and adjust the role ARNs, cluster, and service names to match your account.


    11) Create the ECS service (once)

    If you didn’t do the console wizard, you can create the service after the first task definition registers:

    # Example (adjust subnets, security groups, and the ALB target group ARN accordingly)
    aws ecs create-service \
      --cluster ml-cluster \
      --service-name ml-iris-service \
      --task-definition ml-iris-td \
      --desired-count 1 \
      --launch-type FARGATE \
      --network-configuration "awsvpcConfiguration={subnets=[subnet-xxx,subnet-yyy],securityGroups=[sg-zzz],assignPublicIp=ENABLED}" \
      --load-balancers "targetGroupArn=arn:aws:elasticloadbalancing:ap-south-1:<ACCOUNT_ID>:targetgroup/ml-tg/abc123,containerName=ml-iris,containerPort=8000"

    Point your ALB listener (port 80/443) to that target group. Health check path: /health.
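
    You’ll need the ALB’s DNS name for the next step. Copy it from the console, or look it up with a small boto3 sketch like this one (it assumes your load balancer is named ml-alb; adjust to whatever you created):

    import boto3

    elbv2 = boto3.client("elbv2", region_name="ap-south-1")
    lb = elbv2.describe_load_balancers(Names=["ml-alb"])["LoadBalancers"][0]
    print(lb["DNSName"])  # the hostname to use in the curl commands below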


    12) Try it live

    Once the workflow finishes, hit the ALB DNS:

    curl http://<your-alb-dns>/health
    curl -X POST http://<your-alb-dns>/predict -H "Content-Type: application/json" \
      -d '{"instances": [[6.2, 2.8, 4.8, 1.8]]}'

    You should see a species prediction (e.g., virginica).
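
    The same checks from Python, if you prefer that over curl (requests is not in requirements.txt, so pip install requests first, and substitute your ALB DNS name):

    import requests

    base = "http://<your-alb-dns>"
    print(requests.get(f"{base}/health", timeout=5).json())
    print(requests.post(f"{base}/predict",
                        json={"instances": [[6.2, 2.8, 4.8, 1.8]]},
                        timeout=5).json())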


    13) Production tips

    • Reproducible training: Move training into a separate job/artifact, version artifacts (model-YYYYMMDD.pkl) and pin them in releases.

  • Model registry: Store artifacts in S3 with versioning; pass the S3 URL via an ECS task env var (see the sketch after this list).

  • Secrets: Use SSM Parameter Store or Secrets Manager; mount via task definition.

  • Observability: Use CloudWatch Logs Insights on /ecs/ml-iris, and add request metrics via a sidecar (e.g., a Prometheus exporter) or API middleware.

  • Autoscaling: Configure Target Tracking on ECS service (CPU/Memory or ALB RequestCount).

  • Blue/Green: Use CodeDeploy for zero-downtime if you need canaries/linear rollouts.

  • Security: Restrict IAM to least privilege, use private subnets + NAT in production.
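
    For the model-registry tip above, here is a minimal sketch of pulling the artifact from S3 at container startup. It assumes boto3 is available in the image and that the task definition sets a MODEL_S3_URI environment variable (for example s3://your-bucket/models/model-20250101.pkl); neither is part of the repo above. ModelService could then be pointed at the returned path.

    import os
    from pathlib import Path

    import boto3

    def fetch_model(local_path: str = "models/model.pkl") -> str:
        """Download the model from S3 when MODEL_S3_URI is set; otherwise use the baked-in file."""
        uri = os.getenv("MODEL_S3_URI")
        if not uri:
            return local_path
        bucket, key = uri.removeprefix("s3://").split("/", 1)
        Path(local_path).parent.mkdir(parents=True, exist_ok=True)
        boto3.client("s3").download_file(bucket, key, local_path)
        return local_path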


    14) Local developer experience (optional)

    A docker-compose.yml for quick local spins:

    version: "3.9"
    services:
      api:
        build: .
        ports:
          - "8000:8000"

    15) Common pitfalls & fixes

    • Model file missing in container
      Make sure models/model.pkl exists before building the image (run training in CI first).

  • ALB health check failing
    Confirm containerPort 8000, target group health check path /health, security groups allow ALB → ECS.

  • Permission errors in CI
    Recheck IAM role trust policy sub matches repo:OWNER/REPO:*, and id-token: write is enabled.

    16) Wrap-up

    In this tutorial, you trained a scikit-learn model, exposed it via a FastAPI endpoint, containerized it with Docker, and built a secure CI/CD pipeline using GitHub Actions’ OIDC to deploy on AWS ECS Fargate behind an Application Load Balancer. Every push to main tests your code, scans your image, pushes to ECR, and rolls out a new task, giving you a reproducible, auditable path from experimentation to real-world traffic.
