Ray Serve: Python-Native Scaling
Dec 30, 2025 • 20 min read
Kubernetes' Horizontal Pod Autoscaler works well for stateless web services but fails for LLM serving. A model like Llama 3.1 70B takes 3-5 minutes to load into GPU VRAM — by the time HPA spins up a new pod and loads the model, your users have long since timed out. Ray Serve solves this with a Python-native approach: it manages model replicas inside already-running Ray workers, avoiding cold-start delays and enabling fractional GPU allocation, continuous batching, and multi-model pipelines.
1. Why Ray Serve Instead of Standard K8s Deployments
| Feature | Standard K8s + HPA | Ray Serve + KubeRay |
|---|---|---|
| Scale-up latency | 3-10 minutes (pod start + model load) | Seconds (model pre-loaded in warm workers) |
| GPU allocation | Whole GPU per pod | Fractional GPU (0.1 to 1.0) per replica |
| Request batching | Not built-in | Built-in continuous batching for LLMs |
| Multi-model pipelines | Complex service mesh needed | First-class DAG support in Python |
| Resource sharing | Per-pod isolation | Actors share a GPU for smaller models |
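Fractional GPU allocation in the table above is, at scheduling time, just arithmetic over the `num_gpus` fraction each replica requests. A minimal sketch of that packing math (the fractions and GPU counts below are illustrative, not from any particular cluster):

```python
import math

def replicas_per_cluster(num_gpus_total: int, num_gpus_per_replica: float) -> int:
    """How many replicas Ray can schedule given a fractional num_gpus request.

    Ray packs replicas onto GPUs by fraction: a replica asking for
    num_gpus=0.25 occupies a quarter of one GPU's schedulable capacity.
    """
    per_gpu = math.floor(1.0 / num_gpus_per_replica + 1e-9)  # replicas per physical GPU
    return num_gpus_total * per_gpu

# A 4-GPU node running small embedding models at num_gpus=0.25 each:
print(replicas_per_cluster(4, 0.25))  # 16 replicas
# The same node with one full GPU per replica:
print(replicas_per_cluster(4, 1.0))   # 4 replicas
```

Note that this is scheduling capacity only; whether the replicas actually fit depends on each model's VRAM footprint.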
2. Basic Ray Serve Application
```shell
pip install "ray[serve]" vllm
```
```python
# llama_service.py
from ray import serve
from vllm import LLM, SamplingParams

@serve.deployment(
    name="llama-3-8b",
    ray_actor_options={
        "num_gpus": 1.0,  # Each replica gets 1 full GPU
        "num_cpus": 4,
    },
    # Note: num_replicas cannot be set alongside autoscaling_config;
    # use initial_replicas inside the autoscaling config instead.
    autoscaling_config={
        "min_replicas": 1,        # Always keep 1 replica warm
        "initial_replicas": 2,    # Start with 2 replicas
        "max_replicas": 8,        # Scale up to 8 under load
        "target_num_ongoing_requests_per_replica": 10,  # Scale trigger
        "upscale_delay_s": 10,    # Wait 10s before scaling up
        "downscale_delay_s": 60,  # Wait 60s before scaling down
    },
    health_check_period_s=10,
    health_check_timeout_s=30,
)
class LlamaPredictor:
    def __init__(self):
        # Model loaded ONCE when the replica starts.
        # Subsequent requests share the loaded model — no cold start!
        self.llm = LLM(
            model="meta-llama/Meta-Llama-3-8B-Instruct",
            max_model_len=8192,
            dtype="bfloat16",
        )
        print("Model loaded and ready")

    async def __call__(self, request):
        data = await request.json()
        prompt = data["prompt"]
        max_tokens = data.get("max_tokens", 500)
        sampling_params = SamplingParams(
            temperature=data.get("temperature", 0.7),
            max_tokens=max_tokens,
        )
        # generate() is synchronous; for streaming or higher concurrency,
        # vLLM's async engine is the usual choice
        outputs = self.llm.generate([prompt], sampling_params)
        return {
            "generated_text": outputs[0].outputs[0].text,
            "tokens_generated": len(outputs[0].outputs[0].token_ids),
        }

# Bind to HTTP endpoint
app = LlamaPredictor.bind()

if __name__ == "__main__":
    serve.run(app, route_prefix="/")
    print("Serving at http://localhost:8000")
```

3. Request Batching for Higher Throughput
```python
from ray import serve

@serve.deployment(ray_actor_options={"num_gpus": 1})
class BatchedEmbeddings:
    def __init__(self):
        from sentence_transformers import SentenceTransformer
        self.model = SentenceTransformer("BAAI/bge-large-en-v1.5")

    # Collect requests until max_batch_size is reached or batch_wait_timeout_s elapses
    @serve.batch(max_batch_size=64, batch_wait_timeout_s=0.05)
    async def embed_batch(self, texts: list[str]) -> list[list[float]]:
        """Ray Serve collects concurrent requests and calls this with a batch."""
        embeddings = self.model.encode(texts, normalize_embeddings=True)
        return embeddings.tolist()  # One embedding per input text

    async def __call__(self, request):
        data = await request.json()
        embedding = await self.embed_batch(data["text"])
        return {"embedding": embedding}

# Throughput improvement from batching:
#   Without batching:   ~50 requests/second (each processed individually)
#   With batching (64): ~800 requests/second (GPU processes 64 inputs at once)
```

4. Multi-Model Pipeline (DAG)
```python
from ray import serve

# Chain multiple deployments into a pipeline
@serve.deployment(ray_actor_options={"num_gpus": 0.2})  # Lightweight model shares a GPU
class Classifier:
    def __init__(self):
        # load_classifier is a placeholder for your model-loading code
        self.classifier = load_classifier("toxic-content-classifier")

    async def classify(self, text: str) -> str:
        return self.classifier.predict(text)  # "safe" or "toxic"

@serve.deployment(ray_actor_options={"num_gpus": 1})
class Generator:
    def __init__(self):
        # load_llm is a placeholder for your model-loading code
        self.llm = load_llm("mistral-7b")

    async def generate(self, prompt: str) -> str:
        return self.llm.generate(prompt)

@serve.deployment
class Pipeline:
    def __init__(self, classifier, generator):
        self.classifier = classifier
        self.generator = generator

    async def __call__(self, request):
        data = await request.json()
        prompt = data["prompt"]
        # Route to classifier first (safety check)
        classification = await self.classifier.classify.remote(prompt)
        if classification == "toxic":
            return {"error": "Input violates content policy", "classification": "toxic"}
        # Generate response
        response = await self.generator.generate.remote(prompt)
        return {"response": response, "classification": "safe"}

# Bind with dependencies
classifier = Classifier.bind()
generator = Generator.bind()
pipeline = Pipeline.bind(classifier, generator)  # Dependencies injected automatically
serve.run(pipeline)
```

5. KubeRay: Deploying Ray on Kubernetes
```shell
# Install the KubeRay operator
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm install kuberay-operator kuberay/kuberay-operator --version 1.1.1
```
```yaml
# ray-service.yaml
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: llama-service
spec:
  serviceUnhealthySecondThreshold: 120  # Allow 2 min for model load
  serveConfigV2: |
    applications:
      - name: llama_app
        import_path: llama_service:app
        route_prefix: /
        runtime_env:
          pip: ["vllm==0.3.0", "ray[serve]"]
  rayClusterConfig:
    headGroupSpec:
      rayStartParams:
        dashboard-host: "0.0.0.0"
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0-gpu
              resources:
                requests:
                  cpu: "4"
                  memory: "16Gi"
    workerGroupSpecs:
      - groupName: gpu-workers
        replicas: 2        # 2 GPU nodes
        minReplicas: 1
        maxReplicas: 8
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.9.0-gpu
                resources:
                  limits:
                    nvidia.com/gpu: "1"
                    cpu: "12"
                    memory: "48Gi"
```

Frequently Asked Questions
How does Ray Serve handle GPU memory between replicas?
Each replica is a separate Ray actor. Fractional num_gpus values are a scheduling hint, not hard isolation: if you specify num_gpus: 0.5, Ray will place two replicas on one physical GPU, but each replica's process is responsible for keeping its own VRAM usage within its share. For small embedding models (1-2 GB), you can pack 8+ replicas onto a single A100. For large models like Llama 70B, you need tensor parallelism across multiple GPUs within a single replica, coordinated via Ray placement groups.
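Capacity planning here is mostly division: how many copies of a model's VRAM footprint fit in the card. A rough sketch (the footprints and headroom below are illustrative figures that vary with dtype, KV cache size, and framework overhead):

```python
def max_replicas_on_gpu(gpu_vram_gb: float, model_vram_gb: float, headroom_gb: float = 2.0) -> int:
    """Replicas of one model that fit on a single GPU, reserving some headroom
    for CUDA context and KV-cache growth."""
    usable = gpu_vram_gb - headroom_gb
    return max(int(usable // model_vram_gb), 0)

# An 80 GB A100 with a 1.5 GB embedding model:
print(max_replicas_on_gpu(80, 1.5))  # 52
# The same card with a ~16 GB Llama-3-8B (bf16) footprint:
print(max_replicas_on_gpu(80, 16))   # 4
```

In practice you would cap this at whatever the GPU's compute throughput sustains, not just what fits in memory.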
How do I monitor Ray Serve in production?
Ray provides a built-in dashboard at localhost:8265 (or via KubeRay port-forwarding) showing replica health, request queue depth, and per-replica metrics. Integrate with Prometheus by scraping Ray's metrics export endpoint, which exposes standard Prometheus-format metrics you can chart in Grafana dashboards.
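As a sketch, a Prometheus scrape job for Ray's metrics endpoint might look like the following. The port is whatever you set via `--metrics-export-port` (8080 here is an assumption), and the hostnames depend on your service discovery:

```yaml
scrape_configs:
  - job_name: ray
    scrape_interval: 15s
    static_configs:
      # One target per Ray node; 8080 assumes --metrics-export-port=8080
      - targets: ["ray-head:8080", "ray-worker-0:8080", "ray-worker-1:8080"]
```

In a KubeRay deployment you would typically replace `static_configs` with Kubernetes service discovery rather than hard-coded targets.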
Conclusion
Ray Serve on Kubernetes is the production-grade solution for teams that need to serve LLMs at scale without the prohibitive cold-start latencies of standard pod autoscaling. The autoscaling configuration — keeping minimum replicas warm while scaling up based on queue depth rather than CPU — is specifically designed for the bursty, GPU-memory-intensive patterns of AI workloads. For teams looking to self-host LLMs in production, KubeRay with vLLM is the most complete, battle-tested stack available.
Vivek
AI Engineer
Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.