Serverless Agents: The Time Bomb
Dec 30, 2025 • 18 min read
Serverless (AWS Lambda, Vercel Functions, Cloudflare Workers) is appealing for AI applications: scales to zero, no servers to manage, pay-per-invocation. But serverless has a fatal flaw for agentic workloads: timeout limits. A ReAct agent that searches the web, reads pages, and synthesizes answers might take 45-90 seconds. Vercel's Hobby plan kills requests after 10 seconds. AWS Lambda defaults to a 3-second timeout. This guide covers the patterns to work within — and around — serverless constraints.
1. Timeout Limits Across Platforms
| Platform | Default Timeout | Max Timeout | Notes |
|---|---|---|---|
| Vercel Hobby | 10s | 10s | Hard limit — no override |
| Vercel Pro | 15s | 300s (5 min) | Set maxDuration in route config |
| Vercel Edge Functions | No limit* | No limit* | *Limited CPU time, not wall time |
| AWS Lambda | 3s | 900s (15 min) | Configure in function settings |
| Cloudflare Workers | 10ms CPU (Free plan) | 30s CPU (Paid plan) | Billed by CPU time, not wall time |
| Cloudflare Durable Objects | 30s | 30s | Better for stateful agent sessions |
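Whatever your platform's limit, it helps to enforce your own deadline slightly below it, so the agent can return a partial answer instead of being killed mid-flight. A minimal sketch (the `withDeadline` helper and the timings are illustrative, not from any platform SDK):

```typescript
// Race a promise against a deadline so the function fails gracefully
// before the platform's hard timeout kills the invocation.
function withDeadline<T>(promise: Promise<T>, ms: number, fallback: T): Promise<T> {
  return Promise.race([
    promise,
    new Promise<T>((resolve) => setTimeout(() => resolve(fallback), ms)),
  ]);
}

// Usage: give the agent 8s on a 10s platform limit, keeping
// 2s of headroom to serialize and return a partial result.
async function handleRequest(runAgent: () => Promise<string>): Promise<string> {
  return withDeadline(runAgent(), 8_000, 'Partial answer: the agent ran out of time.');
}
```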
2. Pattern 1: Streaming Responses
Streaming is the most important technique. On several platforms the hard limit applies to time to first byte rather than total duration, so a response that starts streaming immediately can keep writing past the nominal timeout. Just as important, users see partial output right away rather than waiting for the full generation:
// app/api/agent/route.ts (Next.js App Router)
import OpenAI from 'openai';
import { OpenAIStream, StreamingTextResponse } from 'ai'; // Vercel AI SDK
export const maxDuration = 60; // Vercel Pro: extend to 60 seconds
export const dynamic = 'force-dynamic'; // Don't cache this route
const openai = new OpenAI();
export async function POST(req: Request) {
const { messages } = await req.json();
// Stream starts returning data immediately
const response = await openai.chat.completions.create({
model: 'gpt-4o',
messages,
stream: true,
tools: myAgentTools,
});
// OpenAIStream handles tool calls + streaming automatically
const stream = OpenAIStream(response, {
experimental_onToolCall: async (toolCall, appendToolCallMessage) => {
// Execute tool and append result to message stream
const result = await executeTool(toolCall.function.name, toolCall.function.arguments);
return appendToolCallMessage({ tool_call_id: toolCall.id, function_name: toolCall.function.name, tool_call_result: result });
},
});
// StreamingTextResponse pipes tokens directly to the client as they arrive
return new StreamingTextResponse(stream);
}

3. Pattern 2: Edge Functions for Unlimited Wall Time
// Edge runtime: billed by CPU time, not wall time
// An agent waiting on OpenAI API responses uses ~0ms CPU during the wait,
// so even a 5-minute agent run bills only a few milliseconds of CPU
export const runtime = 'edge'; // Switch to the Edge runtime
// Note: maxDuration applies to the Node.js runtime, not Edge; the practical
// Edge limits are CPU time and time to first byte, so start streaming immediately
export async function POST(req: Request) {
const { userQuery } = await req.json();
const encoder = new TextEncoder();
const stream = new TransformStream();
const writer = stream.writable.getWriter();
// Process in background: the writer can keep streaming for minutes while the response stays open
(async () => {
await writer.write(encoder.encode('Starting agent...\n'));
// Step 1: Plan
const plan = await planWithLLM(userQuery);
await writer.write(encoder.encode(`Plan: ${JSON.stringify(plan)}\n`));
// Step 2: Execute (can take many seconds)
for (const step of plan.steps) {
const result = await executeStep(step);
await writer.write(encoder.encode(`Step ${step.id}: ${result}\n`));
}
await writer.close();
})();
return new Response(stream.readable, {
headers: { 'Content-Type': 'text/event-stream' },
});
}

4. Pattern 3: Async Queue (For Very Long Tasks)
// For tasks that might take 5+ minutes (deep research agents, batch processing)
// Pattern: Submit → Get Job ID → Poll for Status
// Route 1: Accept job, return immediately
// POST /api/agent/submit
export async function POST(req: Request) {
const { task } = await req.json();
// Create job record in DB
const jobId = crypto.randomUUID();
await db.jobs.create({ id: jobId, task, status: 'queued', result: null });
// Push to queue (SQS, BullMQ with Redis, etc.)
await queue.add('agent-task', { jobId, task });
// Return immediately — don't wait for the agent
return Response.json({ jobId, status: 'queued' });
}
// Route 2: Check status (frontend polls this every 2 seconds)
// GET /api/agent/status?jobId=xxx
export async function GET(req: Request) {
const jobId = new URL(req.url).searchParams.get('jobId');
if (!jobId) return Response.json({ error: 'jobId is required' }, { status: 400 });
const job = await db.jobs.findUnique({ where: { id: jobId } });
return Response.json(job);
}
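On the frontend, the polling loop described above can be a small helper. A sketch, assuming the status route shape shown here and the 2-second interval suggested in the comments:

```typescript
// Poll the status route until the job settles, then return the result.
// The endpoint path matches the status route above; names are illustrative.
async function waitForJob(jobId: string, intervalMs = 2000): Promise<unknown> {
  for (;;) {
    const res = await fetch(`/api/agent/status?jobId=${jobId}`);
    const job = await res.json();
    if (job.status === 'completed') return job.result;
    if (job.status === 'failed') throw new Error(job.error ?? 'agent task failed');
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

In production you would also cap the number of attempts (or switch to server-sent events) so a stuck job doesn't poll forever.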
// Worker: runs on a long-lived process (ECS Fargate, EC2, etc.)
// Never times out — runs until the task is complete
import { Worker } from 'bullmq';
const worker = new Worker('agent-task', async (job) => {
const { jobId, task } = job.data;
try {
await db.jobs.update({ where: { id: jobId }, data: { status: 'running' } });
// Agent runs here — could take 10+ minutes, no timeout
const result = await runDeepResearchAgent(task);
await db.jobs.update({
where: { id: jobId },
data: { status: 'completed', result }
});
} catch (error) {
await db.jobs.update({
where: { id: jobId },
data: { status: 'failed', error: error instanceof Error ? error.message : String(error) }
});
}
}, { connection: redisConnection });

5. Cold Start Optimization
// Cold starts: Lambda takes 1-3 seconds to initialize your function
// If your function is idle for >15 min, next invocation pays cold start tax
// Strategy 1: Provisioned Concurrency (AWS Lambda)
// Keep N Lambda instances warm, always ready
// Cost: billed per GB-second of provisioned memory, roughly $0.002-$0.015/hr per instance depending on memory size
// Use for latency-sensitive agents
// Strategy 2: Ping-to-warm (cheap hack)
// Schedule CloudWatch Events to ping your function every 5 minutes
// Event payload distinguishes warm-up pings from real requests:
export async function handler(event) {
if (event.source === 'serverless-plugin-warmup') {
return { statusCode: 200, body: 'warmed' };
}
// ... real handler logic
}
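To verify the warm-up is actually working, you can log whether each invocation hit a cold or warm container: module scope survives across invocations in the same container, so a flag set there flips after the first call. A small sketch (the helper name is illustrative):

```typescript
// Module scope runs once per container, so this flag is true
// only on the first invocation after a cold start.
let coldStart = true;

function invocationTemperature(): 'cold' | 'warm' {
  if (coldStart) {
    coldStart = false;
    return 'cold';
  }
  return 'warm';
}
```

Emit this value as a metric and you can see directly how often the ping schedule keeps containers warm.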
// Strategy 3: Move heavy initialization outside handler
// BAD: initializes on every invocation
export async function handler(event) {
const db = await connectToDatabase(); // ← Runs every time!
const llm = new OpenAI();
// ...
}
// GOOD: initializes once per Lambda container (shared across invocations)
const db = await connectToDatabase(); // ← Runs once per cold start
const llm = new OpenAI();
export async function handler(event) {
// db and llm are already initialized — no overhead
// ...
}

Frequently Asked Questions
Should I use Vercel Functions or AWS Lambda for AI agents?
Use Vercel Functions when: your app is already on Next.js/Vercel, tasks complete under 60 seconds, and you want zero infrastructure management. Use AWS Lambda when: you need 15-minute timeouts (Lambda maximum), you need VPC access for RDS/ElastiCache, or you need more control over concurrency and scaling. For production agents that might take several minutes, the async queue pattern with a dedicated long-running worker is the most reliable architecture regardless of which platform you use for the API layer.
How do I handle tool call state across streaming chunks?
The Vercel AI SDK's OpenAIStream with experimental_onToolCall handles this automatically (note these helpers come from older AI SDK versions; AI SDK 3 and later replace them with streamText and a tools option). For custom streaming without the SDK, accumulate tool call chunks in a buffer: OpenAI sends tool call arguments in fragments across multiple streaming chunks, and you must concatenate them before parsing the JSON.
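A minimal sketch of that buffering, assuming the delta shape OpenAI's streaming chunks use (an index identifying the tool call, an optional id, and fragments of the function name and arguments):

```typescript
interface ToolCallDelta {
  index: number;
  id?: string;
  function?: { name?: string; arguments?: string };
}

interface ToolCall {
  id: string;
  name: string;
  arguments: string; // JSON string; parse only after the stream ends
}

// Concatenate streamed fragments into complete tool calls, keyed by index.
function accumulateToolCalls(deltas: ToolCallDelta[]): ToolCall[] {
  const calls: ToolCall[] = [];
  for (const d of deltas) {
    const call = (calls[d.index] ??= { id: '', name: '', arguments: '' });
    if (d.id) call.id = d.id;
    if (d.function?.name) call.name += d.function.name;
    if (d.function?.arguments) call.arguments += d.function.arguments;
  }
  return calls;
}
```

Only call JSON.parse on the accumulated arguments string once the stream signals the tool call is complete; parsing a fragment mid-stream will throw.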
Conclusion
The right serverless pattern for your agent depends on task duration: streaming Edge Functions for under-60-second tasks, async job queues with dedicated workers for tasks that might take minutes. The async queue pattern is the most robust choice for production agents because it decouples request acceptance from agent execution, makes retries trivial, and eliminates all timeout anxiety. Most sophisticated AI applications in production use this pattern even for tasks that usually complete quickly, because predictable reliability matters more than architectural simplicity.
Vivek
AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.