Ray Serve: Python-Native Scaling
Dec 30, 2025 • 20 min read
Kubernetes' Horizontal Pod Autoscaler works well for stateless web services but fails for LLM serving. A model like Llama 3.1 70B takes 3-5 minutes to load into GPU VRAM — by the time HPA spins up a new pod and loads the model, your users have long since timed out. Ray Serve solves this with a Python-native approach: it manages model replicas inside already-running Ray workers, avoiding cold-start delays and enabling fractional GPU allocation, continuous batching, and multi-model pipelines.
1. Why Ray Serve Instead of Standard K8s Deployments
| Feature | Standard K8s + HPA | Ray Serve + KubeRay |
|---|---|---|
| Scale-up latency | 3-10 minutes (pod start + model load) | Seconds (model pre-loaded in warm workers) |
| GPU allocation | Whole GPU per pod | Fractional GPU (0.1 to 1.0) per replica |
| Request batching | Not built-in | Built-in continuous batching for LLMs |
| Multi-model pipelines | Complex service mesh needed | First-class DAG support in Python |
| Resource sharing | Per-pod isolation | Actors share a GPU for smaller models |
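Fractional GPU allocation in the table above is, at scheduling time, just arithmetic over the `num_gpus` fraction each replica requests. A minimal sketch of that packing math (the fractions and GPU counts below are illustrative, not from any particular cluster):

```python
import math

def replicas_per_cluster(num_gpus_total: int, num_gpus_per_replica: float) -> int:
    """How many replicas Ray can schedule given a fractional num_gpus request.

    Ray packs replicas onto GPUs by fraction: a replica asking for
    num_gpus=0.25 occupies a quarter of one GPU's schedulable capacity.
    """
    per_gpu = math.floor(1.0 / num_gpus_per_replica + 1e-9)  # replicas per physical GPU
    return num_gpus_total * per_gpu

# A 4-GPU node running small embedding models at num_gpus=0.25 each:
print(replicas_per_cluster(4, 0.25))  # 16 replicas
# The same node with one full GPU per replica:
print(replicas_per_cluster(4, 1.0))   # 4 replicas
```

Note that this is scheduling capacity only; whether the replicas actually fit depends on each model's VRAM footprint.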
2. Basic Ray Serve Application
```shell
pip install "ray[serve]" vllm
```
```python
# llama_service.py
from ray import serve
from vllm import LLM, SamplingParams

@serve.deployment(
    name="llama-3-8b",
    ray_actor_options={
        "num_gpus": 1.0,  # Each replica gets 1 full GPU
        "num_cpus": 4,
    },
    # Note: num_replicas cannot be set alongside autoscaling_config;
    # use initial_replicas inside the autoscaling config instead.
    autoscaling_config={
        "min_replicas": 1,        # Always keep 1 replica warm
        "initial_replicas": 2,    # Start with 2 replicas
        "max_replicas": 8,        # Scale up to 8 under load
        "target_num_ongoing_requests_per_replica": 10,  # Scale trigger
        "upscale_delay_s": 10,    # Wait 10s before scaling up
        "downscale_delay_s": 60,  # Wait 60s before scaling down
    },
    health_check_period_s=10,
    health_check_timeout_s=30,
)
class LlamaPredictor:
    def __init__(self):
        # Model loaded ONCE when the replica starts.
        # Subsequent requests share the loaded model — no cold start!
        self.llm = LLM(
            model="meta-llama/Meta-Llama-3-8B-Instruct",
            max_model_len=8192,
            dtype="bfloat16",
        )
        print("Model loaded and ready")

    async def __call__(self, request):
        data = await request.json()
        prompt = data["prompt"]
        max_tokens = data.get("max_tokens", 500)
        sampling_params = SamplingParams(
            temperature=data.get("temperature", 0.7),
            max_tokens=max_tokens,
        )
        # generate() is synchronous; for streaming or higher concurrency,
        # vLLM's async engine is the usual choice
        outputs = self.llm.generate([prompt], sampling_params)
        return {
            "generated_text": outputs[0].outputs[0].text,
            "tokens_generated": len(outputs[0].outputs[0].token_ids),
        }

# Bind to HTTP endpoint
app = LlamaPredictor.bind()

if __name__ == "__main__":
    serve.run(app, route_prefix="/")
    print("Serving at http://localhost:8000")
```

3. Request Batching for Higher Throughput
```python
from ray import serve

@serve.deployment(ray_actor_options={"num_gpus": 1})
class BatchedEmbeddings:
    def __init__(self):
        from sentence_transformers import SentenceTransformer
        self.model = SentenceTransformer("BAAI/bge-large-en-v1.5")

    # Collect requests until max_batch_size is reached or batch_wait_timeout_s elapses
    @serve.batch(max_batch_size=64, batch_wait_timeout_s=0.05)
    async def embed_batch(self, texts: list[str]) -> list[list[float]]:
        """Ray Serve collects concurrent requests and calls this with a batch."""
        embeddings = self.model.encode(texts, normalize_embeddings=True)
        return embeddings.tolist()  # One embedding per input text

    async def __call__(self, request):
        data = await request.json()
        embedding = await self.embed_batch(data["text"])
        return {"embedding": embedding}

# Throughput improvement from batching:
#   Without batching:   ~50 requests/second (each processed individually)
#   With batching (64): ~800 requests/second (GPU processes 64 inputs at once)
```

4. Multi-Model Pipeline (DAG)
```python
from ray import serve

# Chain multiple deployments into a pipeline
@serve.deployment(ray_actor_options={"num_gpus": 0.2})  # Lightweight model shares a GPU
class Classifier:
    def __init__(self):
        # load_classifier is a placeholder for your model-loading code
        self.classifier = load_classifier("toxic-content-classifier")

    async def classify(self, text: str) -> str:
        return self.classifier.predict(text)  # "safe" or "toxic"

@serve.deployment(ray_actor_options={"num_gpus": 1})
class Generator:
    def __init__(self):
        # load_llm is a placeholder for your model-loading code
        self.llm = load_llm("mistral-7b")

    async def generate(self, prompt: str) -> str:
        return self.llm.generate(prompt)

@serve.deployment
class Pipeline:
    def __init__(self, classifier, generator):
        self.classifier = classifier
        self.generator = generator

    async def __call__(self, request):
        data = await request.json()
        prompt = data["prompt"]
        # Route to classifier first (safety check)
        classification = await self.classifier.classify.remote(prompt)
        if classification == "toxic":
            return {"error": "Input violates content policy", "classification": "toxic"}
        # Generate response
        response = await self.generator.generate.remote(prompt)
        return {"response": response, "classification": "safe"}

# Bind with dependencies
classifier = Classifier.bind()
generator = Generator.bind()
pipeline = Pipeline.bind(classifier, generator)  # Dependencies injected automatically
serve.run(pipeline)
```

5. KubeRay: Deploying Ray on Kubernetes
```shell
# Install the KubeRay operator
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm install kuberay-operator kuberay/kuberay-operator --version 1.1.1
```
```yaml
# ray-service.yaml
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: llama-service
spec:
  serviceUnhealthySecondThreshold: 120  # Allow 2 min for model load
  serveConfigV2: |
    applications:
      - name: llama_app
        import_path: llama_service:app
        route_prefix: /
        runtime_env:
          pip: ["vllm==0.3.0", "ray[serve]"]
  rayClusterConfig:
    headGroupSpec:
      rayStartParams:
        dashboard-host: "0.0.0.0"
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0-gpu
              resources:
                requests:
                  cpu: "4"
                  memory: "16Gi"
    workerGroupSpecs:
      - groupName: gpu-workers
        replicas: 2        # 2 GPU nodes
        minReplicas: 1
        maxReplicas: 8
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.9.0-gpu
                resources:
                  limits:
                    nvidia.com/gpu: "1"
                    cpu: "12"
                    memory: "48Gi"
```

Frequently Asked Questions
How does Ray Serve handle GPU memory between replicas?
Each replica is a separate Ray actor. Fractional num_gpus values are a scheduling hint, not hard isolation: if you specify num_gpus: 0.5, Ray will place two replicas on one physical GPU, but each replica's process is responsible for keeping its own VRAM usage within its share. For small embedding models (1-2 GB), you can pack 8+ replicas onto a single A100. For large models like Llama 70B, you need tensor parallelism across multiple GPUs within a single replica, coordinated via Ray placement groups.
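Capacity planning here is mostly division: how many copies of a model's VRAM footprint fit in the card. A rough sketch (the footprints and headroom below are illustrative figures that vary with dtype, KV cache size, and framework overhead):

```python
def max_replicas_on_gpu(gpu_vram_gb: float, model_vram_gb: float, headroom_gb: float = 2.0) -> int:
    """Replicas of one model that fit on a single GPU, reserving some headroom
    for CUDA context and KV-cache growth."""
    usable = gpu_vram_gb - headroom_gb
    return max(int(usable // model_vram_gb), 0)

# An 80 GB A100 with a 1.5 GB embedding model:
print(max_replicas_on_gpu(80, 1.5))  # 52
# The same card with a ~16 GB Llama-3-8B (bf16) footprint:
print(max_replicas_on_gpu(80, 16))   # 4
```

In practice you would cap this at whatever the GPU's compute throughput sustains, not just what fits in memory.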
How do I monitor Ray Serve in production?
Ray provides a built-in dashboard at localhost:8265 (or via KubeRay port-forwarding) showing replica health, request queue depth, and per-replica metrics. Integrate with Prometheus by scraping Ray's metrics export endpoint, which exposes standard Prometheus-format metrics you can chart in Grafana dashboards.
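As a sketch, a Prometheus scrape job for Ray's metrics endpoint might look like the following. The port is whatever you set via `--metrics-export-port` (8080 here is an assumption), and the hostnames depend on your service discovery:

```yaml
scrape_configs:
  - job_name: ray
    scrape_interval: 15s
    static_configs:
      # One target per Ray node; 8080 assumes --metrics-export-port=8080
      - targets: ["ray-head:8080", "ray-worker-0:8080", "ray-worker-1:8080"]
```

In a KubeRay deployment you would typically replace `static_configs` with Kubernetes service discovery rather than hard-coded targets.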
Conclusion
Ray Serve on Kubernetes is the production-grade solution for teams that need to serve LLMs at scale without the prohibitive cold-start latencies of standard pod autoscaling. The autoscaling configuration — keeping minimum replicas warm while scaling up based on queue depth rather than CPU — is specifically designed for the bursty, GPU-memory-intensive patterns of AI workloads. For teams looking to self-host LLMs in production, KubeRay with vLLM is the most complete, battle-tested stack available.
Vivek
AI Engineer
Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.