Building Private AI Systems for Enterprises
Enterprise AI deployment has different requirements from hobbyist local setups. The technical challenge is straightforward; the hard problems are governance, compliance, multi-user access control, audit logging, and integration with existing enterprise identity and data systems. This guide covers the full architecture of a production-grade private AI platform: one that your legal, security, and compliance teams can actually approve.
Reference Architecture
# Enterprise Private AI Stack
External Users (employees)
        │ HTTPS (internal network or VPN only)
        ▼
┌──────────────────────────────────────────┐
│ API Gateway (Kong / NGINX)               │
│  - Auth: SSO via Okta / Azure AD (SAML)  │
│  - Rate limiting per user/department     │
│  - Request logging (all prompts/outputs) │
└──────────────────────────────────────────┘
        │
        ▼
┌──────────────────────────────────────────┐
│ LLM Inference Layer (vLLM)               │
│  - Models: Llama 3.1 70B, Qwen 2.5 72B   │
│  - GPU cluster (A100s / H100s on-prem)   │
│  - OpenAI-compatible REST API            │
│  - Model routing (size based on request) │
└──────────────────────────────────────────┘
        │
        ▼
┌──────────────────────────────────────────┐
│ RAG / Knowledge Layer                    │
│  - Vector DB: Qdrant (self-hosted)       │
│  - Sources: Confluence, SharePoint, S3   │
│  - Embedding model: nomic-embed-text     │
└──────────────────────────────────────────┘
        │
        ▼
┌──────────────────────────────────────────┐
│ Observability (Langfuse self-hosted)     │
│  - Token usage by user/department        │
│  - Latency, error rate monitoring        │
│  - Prompt/response audit log (7 years)   │
│  - PII detection before logging          │
└──────────────────────────────────────────┘
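The "model routing" step in the inference layer can be sketched as a small gateway function. This is a minimal illustration under assumptions of my own (the token threshold, the character-per-token heuristic, and the served model names), not built-in vLLM behavior:

```python
# Minimal sketch of size-based model routing at the gateway.
# Threshold and model names are illustrative assumptions.

def route_model(messages: list[dict], *, long_context_threshold: int = 8000) -> str:
    """Pick a served model name from a rough size estimate of the request."""
    # Rough token estimate: ~4 characters per token for English text.
    total_chars = sum(len(m.get("content", "")) for m in messages)
    est_tokens = total_chars // 4
    # Long requests go to the 70B model; everything else gets the cheaper 8B.
    if est_tokens > long_context_threshold:
        return "llama-3.1-70b"
    return "llama-3.1-8b"

short = [{"role": "user", "content": "What is our PTO policy?"}]
long_req = [{"role": "user", "content": "x" * 40000}]
print(route_model(short))    # llama-3.1-8b
print(route_model(long_req)) # llama-3.1-70b
```

In production you would typically route on estimated tokens plus task type (e.g. contract review always goes to the larger model), but the gateway-side shape stays the same.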
vLLM: Production-Grade Inference Server
# pip install vllm
# Start vLLM serving Llama 3.1 70B on 4x A100 GPUs
# (--tensor-parallel-size 4 splits the model across the four GPUs)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --served-model-name llama-3.1-70b \
  --tensor-parallel-size 4 \
  --max-model-len 65536 \
  --api-key your-internal-api-key \
  --host 0.0.0.0 \
  --port 8000
# vLLM features critical for enterprise:
# - Continuous batching: handles concurrent users efficiently
# - PagedAttention: enables much higher token throughput
# - OpenAI-compatible API: works with existing LangChain code
# - Multi-LoRA: serve multiple fine-tuned adapters on one base model
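The multi-LoRA feature mentioned above is enabled with vLLM's `--enable-lora` flag; the adapter names and paths below are placeholders for your own fine-tuned artifacts:

```shell
# Serve two hypothetical fine-tuned adapters on one base model.
# Adapter names and paths are placeholders.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --enable-lora \
  --lora-modules legal-review=/models/loras/legal-review \
                 hr-assistant=/models/loras/hr-assistant
# Clients then select an adapter via the "model" field,
# e.g. model="legal-review" in the chat completion request.
```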
# Use the client exactly as you would with OpenAI:
from openai import OpenAI

client = OpenAI(
    base_url="http://your-gpu-server:8000/v1",
    api_key="your-internal-api-key",
)

response = client.chat.completions.create(
    model="llama-3.1-70b",
    messages=[{"role": "user", "content": "Summarize this contract..."}],
    max_tokens=1000,
)
Enterprise SSO Integration
# FastAPI gateway with Azure AD / Okta JWT validation
from fastapi import FastAPI, Depends, HTTPException, Security
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
import jwt, httpx  # PyJWT; httpx is used by the proxy endpoint below

app = FastAPI()
bearer = HTTPBearer()

TENANT_ID = "your-azure-tenant-id"
CLIENT_ID = "your-app-client-id"

# PyJWKClient fetches and caches the tenant's signing keys and selects
# the right key for each token by its "kid" header.
jwks_client = jwt.PyJWKClient(
    f"https://login.microsoftonline.com/{TENANT_ID}/discovery/v2.0/keys"
)

async def verify_enterprise_token(
    credentials: HTTPAuthorizationCredentials = Security(bearer),
) -> dict:
    """Validate Azure AD JWT token and extract user context."""
    token = credentials.credentials
    try:
        signing_key = jwks_client.get_signing_key_from_jwt(token)
        payload = jwt.decode(
            token,
            signing_key.key,
            algorithms=["RS256"],
            audience=CLIENT_ID,
        )
        return {
            "user_id": payload["oid"],
            "email": payload.get("upn", payload.get("email")),
            "department": payload.get("department", "unknown"),
            "roles": payload.get("roles", []),
        }
    except jwt.InvalidTokenError as e:
        raise HTTPException(status_code=401, detail=f"Invalid token: {e}")
@app.post("/v1/chat/completions")
async def chat_with_audit(request: dict, user: dict = Depends(verify_enterprise_token)):
    """
    Proxy to vLLM with:
    1. Enterprise SSO authentication
    2. Role-based model access (only admins get 70B)
    3. Full audit logging (who asked what, when)
    """
    model = "llama-3.1-70b" if "ai-admin" in user["roles"] else "llama-3.1-8b"

    # Log to audit system (BEFORE sending to model for completeness)
    await audit_log(user=user, prompt=request.get("messages"), model=model)

    # Forward to vLLM
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            "http://vllm-server:8000/v1/chat/completions",
            json={**request, "model": model},
            timeout=120,
        )
    return resp.json()
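The `audit_log` helper called by the gateway is left undefined above. A minimal sketch, assuming a local JSON-lines file as the sink (in production this would be your SIEM or an append-only database), might look like:

```python
import asyncio, hashlib, json, time
from pathlib import Path

AUDIT_PATH = Path("audit.log")  # stand-in for your real audit sink

async def audit_log(user: dict, prompt, model: str) -> dict:
    """Append one audit record per request, with a hash for tamper evidence."""
    record = {
        "timestamp": time.time(),
        "user_id": user["user_id"],
        "department": user.get("department", "unknown"),
        "model": model,
        "prompt": prompt,
        # Hash the prompt so later tampering with the stored copy is detectable.
        "prompt_sha256": hashlib.sha256(
            json.dumps(prompt, sort_keys=True).encode()
        ).hexdigest(),
    }
    with AUDIT_PATH.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return record

# Example: write one record
rec = asyncio.run(audit_log(
    {"user_id": "u1", "department": "legal"},
    [{"role": "user", "content": "Summarize this contract..."}],
    "llama-3.1-8b",
))
```

A synchronous file write inside an async handler is fine at pilot scale; under real load you would hand records to a queue so a slow audit sink never blocks inference.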
Compliance Requirements Checklist
- Audit logging: Every prompt and response must be logged with user ID, timestamp, model version, and token count. Retention period per your compliance requirements (typically 7 years for financial, 6 years for HIPAA).
- PII detection: Scan outputs before logging to avoid storing PII in audit logs where not needed. Use spaCy or AWS Comprehend (if using hybrid cloud) for entity detection.
- Data residency: GPU servers must sit in the correct geographic region. For EU data subjects, inference and log storage typically need to stay in EU data centers to keep GDPR compliance straightforward.
- Model versioning: Document which model version is deployed when, and maintain the ability to reproduce outputs from historical versions for legal proceedings.
- Access policy: Role-based access to different model sizes and capabilities. HR data questions should only be accessible to HR roles.
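The PII-detection item above can be prototyped with a simple regex pass before records reach the audit store. These patterns are illustrative only; a real deployment would use a proper NER pipeline such as spaCy or AWS Comprehend, as noted in the checklist:

```python
import re

# Illustrative patterns only; regexes miss names, addresses, and much else.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact_pii(text: str) -> str:
    """Replace matched PII spans with a typed placeholder before logging."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact_pii("Contact jane.doe@corp.com or 555-867-5309"))
# Contact [EMAIL] or [PHONE]
```

Keeping the placeholder typed (`[EMAIL]`, `[SSN]`) preserves enough context for later review of the audit log without storing the underlying identifier.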
Implementing Role-Based Data Interception (RBAC)
The heart of enterprise private AI is intercepting queries before they reach the LLM and filtering out or augmenting sensitive data based on the authenticated user's LDAP/Active Directory group.
// Enterprise Gateway Middleware Pattern
async function handleQuery(userToken, userPrompt) {
  // 1. Validate JWT and fetch AD roles
  const user = await validateEnterpriseToken(userToken);

  // 2. Document-level security in the RAG lookup
  const relevantDocs = await vectorDB.query(userPrompt, {
    filter: { allowedRoles: { $in: user.roles } }
  });

  // 3. System prompt augmentation
  const systemPrompt = `
    You are a secure corporate assistant.
    Only use these provided documents.
    The user's security clearance is: ${user.clearance}`;

  // 4. Send to VPC-isolated LLM instance (e.g. AWS Bedrock)
  return await invokeSecureLLM(systemPrompt, userPrompt, relevantDocs);
}

Conclusion
Building enterprise private AI is primarily a systems engineering and governance challenge, not an ML challenge. The models are available. The inference infrastructure is mature. The hard work is audit logging, SSO integration, access control, data residency compliance, and building the operational runbooks that let your IT team maintain the system. Start with a pilot for a single high-value use case (document review, internal knowledge search, code review), demonstrate ROI, and expand. The governance infrastructure you build for the pilot will scale to the full deployment.