Workload-Variant-Autoscaler (WVA)
The Workload-Variant-Autoscaler (WVA) is a Kubernetes-based global autoscaler for inference model servers serving LLMs. WVA works alongside the standard Kubernetes HPA and external autoscalers such as KEDA to scale any object that supports the scale subresource. The high-level details of the algorithm are described here. WVA determines the optimal replica count for a given request traffic load by considering constraints such as GPU count (cluster resources), energy budget, and performance budget (latency/throughput).
What is a variant?
In WVA, a variant is a way of serving a given model: a scale target (Deployment, StatefulSet, or LWS) with a particular combination of hardware, runtimes, and serving approach. Variants for the same model share the same base model (e.g. meta/llama-3.1-8b); LoRA adapters can differ per variant. Each variant is a distinct setup—e.g. different accelerators (A100, H100, L4), parallelism, or performance requirements. Create one VariantAutoscaling per variant; when several variants serve the same model, WVA chooses which to scale (e.g. add capacity on the cheapest variant, remove it from the most expensive). See Configuration and Saturation Analyzer for details.
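For instance, two variants of the same base model on different accelerators can be declared as two VariantAutoscaling resources. A minimal sketch, reusing the fields from the example below in this README; the resource names and cost values are illustrative, not prescribed:

```yaml
# Two variants serving the same base model; names and costs are illustrative.
apiVersion: llmd.ai/v1alpha1
kind: VariantAutoscaling
metadata:
  name: llama-8b-a100          # variant running on A100 GPUs (hypothetical)
  namespace: llm-inference
spec:
  scaleTargetRef:
    kind: Deployment
    name: llama-8b-a100
  modelID: "meta/llama-3.1-8b"
  variantCost: "10.0"          # relative cost of one replica of this variant
---
apiVersion: llmd.ai/v1alpha1
kind: VariantAutoscaling
metadata:
  name: llama-8b-h100          # variant running on H100 GPUs (hypothetical)
  namespace: llm-inference
spec:
  scaleTargetRef:
    kind: Deployment
    name: llama-8b-h100
  modelID: "meta/llama-3.1-8b" # same base model as the A100 variant
  variantCost: "25.0"          # pricier accelerator, higher relative cost
```

With both resources in place, WVA can prefer adding capacity on the cheaper variant and removing it from the more expensive one, as described above.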
Key Features
- Intelligent Autoscaling: Optimizes replica count by observing the current state of the system
- Cost Optimization: Minimizes infrastructure costs by picking the correct accelerator variant
Documentation
User Guide
Integrations
How It Works
1. Platform admin deploys the llm-d infrastructure (including model servers) and waits for the servers to warm up and start serving requests
2. Platform admin creates a VariantAutoscaling CR for the running deployment
3. WVA continuously monitors request rates and server performance via Prometheus metrics
4. The capacity model obtains KV cache utilization and queue depth of inference servers with slack capacity to determine replicas
5. The actuator emits optimization metrics to Prometheus and updates the VariantAutoscaling status
6. An external autoscaler (HPA/KEDA) reads the metrics and scales the deployment accordingly
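As a sketch of the last step, an autoscaling/v2 HPA can target the same deployment and scale on an external metric published by WVA. The metric name and labels below are placeholders, not confirmed names; consult the Integrations documentation for the actual metric exposed by WVA:

```yaml
# Hypothetical HPA wiring; the external metric name below is a placeholder.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llama-8b-hpa
  namespace: llm-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama-8b
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: External
      external:
        metric:
          name: wva_desired_replicas   # placeholder metric name
          selector:
            matchLabels:
              variant: llama-8b        # illustrative label selector
        target:
          type: AverageValue
          averageValue: "1"
```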
Important Notes:
- WVA handles the creation order gracefully - you can create the VA before or after the deployment
- If a deployment is deleted, the VA status is immediately updated to reflect the missing deployment
- When the deployment is recreated, the VA automatically resumes operation
- Configure HPA stabilization window (recommend 120s+) for gradual scaling behavior
- WVA updates the VA status with current and desired allocations every reconciliation cycle
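The recommended stabilization window from the notes above maps onto the standard `behavior` stanza of an autoscaling/v2 HPA, for example:

```yaml
# Fragment of an autoscaling/v2 HPA spec applying the recommended 120s window.
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 120  # wait 120s before scaling down
    scaleUp:
      stabilizationWindowSeconds: 120  # optional: also smooth scale-up
```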
Example
```yaml
apiVersion: llmd.ai/v1alpha1
kind: VariantAutoscaling
metadata:
  name: llama-8b-autoscaler
  namespace: llm-inference
spec:
  scaleTargetRef:
    kind: Deployment
    name: llama-8b
  modelID: "meta/llama-3.1-8b"
  variantCost: "10.0" # Optional, defaults to "10.0"
```
More examples in config/samples/.
Upgrading
CRD Updates
Important: Helm does not automatically update CRDs during helm upgrade. When upgrading WVA to a new version with CRD changes, you must manually apply the updated CRDs first:
```shell
# Apply the latest CRDs before upgrading
kubectl apply -f charts/workload-variant-autoscaler/crds/

# Then upgrade the Helm release
helm upgrade workload-variant-autoscaler ./charts/workload-variant-autoscaler \
  --namespace workload-variant-autoscaler-system \
  [your-values...]
```
Breaking Changes
v0.5.1
- VariantAutoscaling CRD: the `scaleTargetRef` field was added as required. v0.4.1 VariantAutoscaling resources without `scaleTargetRef` must be updated before upgrading:
  - Impact on Scale-to-Zero: VAs without `scaleTargetRef` will not scale to zero properly, even with HPAScaleToZero enabled and HPA `minReplicas: 0`, because the HPA cannot reference the target deployment.
  - Migration: update existing VAs to include `scaleTargetRef`:

    ```yaml
    spec:
      scaleTargetRef:
        kind: Deployment
        name: <your-deployment-name>
    ```

  - Validation: after the CRD update, VAs without `scaleTargetRef` will fail validation.
Verifying CRD Version
To check if your cluster has the latest CRD schema:
```shell
# Check the CRD fields
kubectl get crd variantautoscalings.llmd.ai \
  -o jsonpath='{.spec.versions[0].schema.openAPIV3Schema.properties.spec.properties}' | jq 'keys'
```
Contributing
We welcome contributions! See the llm-d Contributing Guide for guidelines.
Join the llm-d autoscaling community meetings to get involved.
License
Apache 2.0 - see LICENSE for details.
Related Projects
References
For detailed documentation, visit the docs directory.
This content is automatically synced from README.md on the main branch of the llm-d-incubation/workload-variant-autoscaler repository.