Kubernetes’ Third Era: Production Reality

We’ve lived through two Kubernetes ages already. First came survival: get containers onto nodes and keep them alive. Then came scale: stretch clusters to thousands of nodes and millions of pods.
Scale, however, exposed the seams. Real infrastructure is messy. In 2025—with AI jobs everywhere—those seams rip open at 3 a.m. when a training run dies because the scheduler ignores GPU topology. Finance flags a bill spike from cross-zone egress nobody meant to pay for. Autoscalers yo-yo on noisy signals.
Kubernetes 1.34 (“Of Wind & Will”) is the turn toward production reality. Instead of new layers, it sharpens the layers we already have—making them hardware-, network-, and security-aware—so the platform’s assumptions finally match the operator’s world.
Where Project DevOps fits: Kubernetes 1.34 hands you smarter primitives; Project DevOps closes the loop—turning kernel and device signals (PSI, DRA, etc.) into continuous rightsizing, topology-aware placement, and calm, cost-sane autoscaling.
Advancements in Resource Intelligence
Until now, Kubernetes often treated diverse hardware like a single commodity pool. 1.34 changes that with features that understand the shape of resources, not just their counts.
1) Dynamic Resource Allocation (DRA) — GA
Problem: nvidia.com/gpu: 1 can't express which GPU it means (A100 vs. L4, with or without NVLink), so pods get misplaced or fail to bind.
What’s new: With DRA, you declare ResourceClaims against DeviceClasses. The scheduler becomes topology-aware and matches claims to concrete devices, not opaque integers. Allocations are AllocateOnce and immutable, so ML/HPC jobs get predictable access.
Prereqs:
- Enable the DynamicResourceAllocation feature gate on the apiserver, controller-manager, scheduler, and kubelet.
- Deploy a DRA-capable driver (e.g., the NVIDIA GPU Operator with DRA support). Without one, claims won't bind and pods won't schedule.
Why it matters: Tuned GPU placement can take utilization from ~30% to ~60%—more throughput per dollar.
Example ResourceClaim for A100s with NVLink
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: ml-training-claim
spec:
  devices:
    requests:
    - name: gpu-req
      exactly:
        deviceClassName: nvidia-gpu
        allocationMode: ExactCount
        count: 2
        selectors:
        - cel:
            expression: |
              device.capacity["nvidia.com"].memory.compareTo(quantity("40Gi")) >= 0 &&
              device.attributes["nvidia.com"].nvlink == "true" &&
              device.attributes["nvidia.com"].product == "A100"
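A Pod then consumes the claim by referencing it from its spec. A minimal sketch (the pod name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ml-trainer
spec:
  resourceClaims:
  - name: gpus
    resourceClaimName: ml-training-claim   # the claim defined above
  containers:
  - name: trainer
    image: registry.example.com/trainer:latest   # placeholder image
    resources:
      claims:
      - name: gpus   # binds this container to the claim entry above
```

The scheduler will only place the pod on a node whose devices satisfy the claim's CEL selectors.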
2) Swap Memory Support — Beta
Sudden memory spikes cause OOMKills. Swap support arrives with guardrails:
- Guaranteed pods: never swapped (latency-critical).
- BestEffort pods: still no swap (no requests to base a fair share on).
- Burstable pods: can use swap, proportional to their memory requests.
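The "proportional" part means a container's swap limit is its share of node memory applied to the node's swap space. A quick sketch of that arithmetic (the numbers are hypothetical):

```python
# Sketch of LimitedSwap accounting: each Burstable container's swap limit
# is its fraction of node memory, applied to the node's swap capacity.
def limited_swap_bytes(container_memory_request: int,
                       node_memory_capacity: int,
                       node_swap_capacity: int) -> int:
    """Approximate per-container swap limit under LimitedSwap."""
    return int(container_memory_request / node_memory_capacity * node_swap_capacity)

GiB = 1024 ** 3
# A container requesting 4Gi on a 32Gi node with 8Gi of swap gets 1Gi of swap.
print(limited_swap_bytes(4 * GiB, 32 * GiB, 8 * GiB) // GiB)  # → 1
```

So a container requesting half the node's memory can use half the node's swap; one requesting nothing (BestEffort) gets none.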
Enablement: configure OS swap first, then kubelet:
# /var/lib/kubelet/config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
failSwapOn: false
memorySwap:
  swapBehavior: LimitedSwap
Ops tips: start with a small, tainted node pool; watch node_swap_usage_bytes and pod_swap_usage_bytes; keep system daemons off swap (memory.swap.max=0 on system.slice, e.g. via systemd's MemorySwapMax= setting); and keep the control plane swap-free. Think of swap as an airbag, not extra RAM.
3) Pressure Stall Information (PSI) — Beta
CPU/memory “utilization OK” while users feel lag? The app isn’t busy—it’s stalled. PSI (from Linux 4.20) measures time spent waiting on CPU, memory reclaim, or I/O, exposed per cgroup with cgroups v2.
In 1.34: enable the KubeletPSI=true feature gate. PSI becomes a first-class signal you can export to Prometheus and surface via the Prometheus Adapter.
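These are the kernel's /proc/pressure files surfaced through Kubernetes. As a quick illustration of the underlying format, here is a minimal Python parser (the sample values are invented):

```python
# Parse the /proc/pressure/{cpu,memory,io} format that PSI exposes.
def parse_psi(text: str) -> dict:
    """Return {"some": {...}, "full": {...}} from a pressure file's contents."""
    out = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()  # e.g. "some", then "avg10=0.12" ...
        out[kind] = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return out

# Sample contents of /proc/pressure/memory (values made up):
sample = ("some avg10=0.12 avg60=0.08 avg300=0.02 total=123456\n"
          "full avg10=0.00 avg60=0.01 avg300=0.00 total=9876")
print(parse_psi(sample)["some"]["avg10"])  # → 0.12
```

"some" means at least one task stalled in that window; "full" means all tasks did, which is the stronger starvation signal.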
Example: HPA on memory stall time (external metric)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
# ...
spec:
  metrics:
  - type: External
    external:
      metric:
        name: psi_memory_stall_seconds_total
      target:
        type: AverageValue
        averageValue: "0.1" # 100ms stalled per second
Caution: the work is in interpretation. Short I/O spikes shouldn’t trigger hour-long scale-ups. Project DevOps correlates PSI with historical patterns to distinguish blips from real capacity needs.
Networking That’s Cheaper and Smarter
trafficDistribution for Services — Beta, on by default
Multi-AZ clusters often hairpin traffic across zones, inflating tail latency and egress costs. The Service trafficDistribution setting (which debuted as "PreferClose") is now ready for prod.
- PreferSameZone: prefer zone-local endpoints.
- PreferSameNode: prefer node-local, then zone-local, then cluster-wide; great for per-node agents (DNS, log shippers, metrics).
If your dataplane doesn’t understand hints, it falls back to standard behavior—no breakage, just no locality preference.
Example: keep DNS node-local
apiVersion: v1
kind: Service
metadata:
  name: local-dns
spec:
  selector:
    app: dns
  ports:
  - port: 53
    targetPort: 53
    protocol: UDP
  trafficDistribution: PreferSameNode
Result: steadier P99s and lower inter-AZ bills—finally aligned with your wallet.
Security Built-In (Not Bolted-On)
4) Mutating Admission Policies — Beta
Webhooks worked, but added latency and a point of failure. Now you can mutate in-process using CEL—no sidecar service, no TLS to rotate.
Example: default Pods to non-root
apiVersion: admissionregistration.k8s.io/v1beta1
kind: MutatingAdmissionPolicy
metadata:
  name: enforce-non-root
spec:
  matchConstraints:
    resourceRules:
    - apiGroups: [""]
      apiVersions: ["v1"]
      operations: ["CREATE"]
      resources: ["pods"]
  matchConditions:
  - name: must-have-non-root
    expression: "object.spec.securityContext == null || object.spec.securityContext.runAsNonRoot != true"
  failurePolicy: Fail
  reinvocationPolicy: IfNeeded
  mutations:
  - patchType: ApplyConfiguration
    applyConfiguration:
      expression: >
        Object{
          spec: Object.spec{
            securityContext: Object.spec.securityContext{
              runAsNonRoot: true
            }
          }
        }
---
apiVersion: admissionregistration.k8s.io/v1beta1
kind: MutatingAdmissionPolicyBinding
metadata:
  name: enforce-non-root-binding
spec:
  policyName: enforce-non-root
Enable: set --feature-gates=MutatingAdmissionPolicy=true and --runtime-config=admissionregistration.k8s.io/v1beta1=true on the apiserver.
Use this for defaults (labels, topology keys, resource floors, sidecar stubs). Keep policy engines (Kyverno/Gatekeeper) for higher-order validation and compliance.
5) Selector-Based Authorization — GA
Kubelets used to need broad list permissions. Now the authorizer understands fieldSelectors and labelSelectors, so a kubelet can only list pods for its node.
Try it:
# Denied: unscoped list
kubectl auth can-i list pods --as=system:node:<node> --as-group=system:nodes
# Allowed: scoped to its node
kubectl auth can-i list pods --as=system:node:<node> --as-group=system:nodes \
--field-selector spec.nodeName=<node>
Webhooks and CEL authorizer can also use selectors, enabling tighter least-privilege boundaries.
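For instance, an authorization webhook now sees the selector inside the SubjectAccessReview it evaluates, so it can distinguish a scoped list from an unscoped one. A sketch of the scoped request (node name is hypothetical):

```yaml
apiVersion: authorization.k8s.io/v1
kind: SubjectAccessReview
spec:
  user: system:node:worker-1       # hypothetical node identity
  groups: ["system:nodes"]
  resourceAttributes:
    verb: list
    resource: pods
    fieldSelector:                 # the new selector-aware attributes
      requirements:
      - key: spec.nodeName
        operator: In
        values: ["worker-1"]
```

A webhook can allow this request while denying the same verb with no fieldSelector, enforcing per-node least privilege.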
6) ServiceAccount Token Image Pulls — Beta
Static imagePullSecrets linger forever. With kubelet credential providers and ServiceAccount-scoped, short-lived tokens, image pulls use ephemeral, auto-rotated credentials.
Kubelet snippet:
# /etc/systemd/system/kubelet.service.d/10-credential-provider.conf
[Service]
Environment="KUBELET_EXTRA_ARGS=\
--image-credential-provider-config=/etc/kubernetes/credential-provider-config.yaml \
--image-credential-provider-bin-dir=/etc/kubernetes/credential-provider \
--feature-gates=KubeletServiceAccountTokenForCredentialProviders=true"
Minimal provider config (AWS ECR):
# /etc/kubernetes/credential-provider-config.yaml
apiVersion: credentialprovider.kubelet.k8s.io/v1
kind: CredentialProviderConfig
providers:
- name: ecr-credential-provider
  matchImages:
  - "*.dkr.ecr.*.amazonaws.com"
  defaultCacheDuration: 12h
  apiVersion: credentialprovider.kubelet.k8s.io/v1
  args: ["get-credentials"]
  env:
  - name: AWS_REGION
    value: us-west-2
Drop the provider binary, restart kubelet, and you’ve traded static secrets for per-workload, expiring credentials.
7) Anonymous Auth, Narrowed — GA
Keep /healthz, /readyz, /livez, and /version open to unauthenticated probes; lock everything else down with RBAC.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: unauthenticated-healthz
rules:
- nonResourceURLs: ["/healthz", "/readyz", "/livez", "/version"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: unauthenticated-healthz-binding
subjects:
- kind: Group
  name: system:unauthenticated
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: unauthenticated-healthz
  apiGroup: rbac.authorization.k8s.io
Operator & Developer Experience
8) VolumeAttributesClass — GA (live tuning, no remount)
Change PVC performance attributes in place (driver-dependent, via the CSI ModifyVolume call). No detach, no eviction, no downtime.
apiVersion: storage.k8s.io/v1
kind: VolumeAttributesClass
metadata:
  name: gold-performance
driverName: csi.vendor.com
parameters:
  provisioned-iops: "5000"
  throughput: "200MiB/s"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-pvc
spec:
  storageClassName: fast-ssd
  volumeAttributesClassName: gold-performance
  resources:
    requests:
      storage: 100Gi
Start frugal, bump to "gold" during an end-of-quarter crunch, and drop back later, if your CSI driver supports it.
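Dropping back is just another spec update: define a cheaper class and point the PVC's volumeAttributesClassName at it. A sketch (class name and parameter values are illustrative, not vendor defaults):

```yaml
apiVersion: storage.k8s.io/v1
kind: VolumeAttributesClass
metadata:
  name: standard-performance   # illustrative baseline tier
driverName: csi.vendor.com
parameters:
  provisioned-iops: "1000"
  throughput: "50MiB/s"
```

Patching the PVC between classes triggers the driver's ModifyVolume path, assuming the driver supports online modification.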
9) Job podReplacementPolicy — GA
Batch jobs used to double-allocate during retries (one pod terminating, another starting). With podReplacementPolicy, the controller waits for the failed pod to actually disappear before creating a replacement: smoother resource curves, fewer OOM alarms.
apiVersion: batch/v1
kind: Job
metadata:
  name: cleanup-job
spec:
  template:
    spec:
      containers:
      - name: cleanup
        image: busybox
        command: ["sh", "-c", "run-something"]
      restartPolicy: OnFailure
  podReplacementPolicy: Failed
Kubernetes Has Grown Up
Kubernetes 1.34 isn’t about “more magic.” It’s about closing gaps: acknowledging resource contention, topology quirks, cost traps, and security blast radii—and giving you direct, typed controls to manage them. DRA, PSI-driven decisions, node-local routing, in-process admission, and live storage tuning all aim at one thing: make the platform behave like production actually behaves.
That’s the third era: not survival, not just scale—production reality.
And getting value from these primitives is an ongoing practice. Signals like PSI and DRA matter only if they drive stable placement, right-sized requests, and non-thrashy autoscaling. Project DevOps does exactly that—turning these smarter building blocks into steady clusters, predictable P99s, and cloud bills that reflect efficiency rather than guesswork.