opncrafter

AI Governance: Compliance as Code

Dec 30, 2025 • 20 min read

If you deploy AI in 2025 without a governance layer, you are taking on legal, reputational, and ethical liability. The EU AI Act entered into force in 2024, its obligations phase in over the following years, and regulatory equivalents are emerging globally. Beyond compliance, governance is good engineering: it makes your AI systems auditable, debuggable, and trustworthy. This guide provides a practical framework for building governance into your AI systems from day one.

1. The EU AI Act: Risk Tiers That Determine Your Requirements

| Risk Tier | Examples | Requirements |
|---|---|---|
| Unacceptable Risk | Social scoring, real-time biometric surveillance in public | Banned. Cannot deploy in EU. |
| High Risk | Recruiting AI, medical diagnosis, credit scoring, law enforcement, critical infrastructure | Mandatory conformity assessment, human oversight, explainability, bias audits, registration |
| General Purpose AI (GPAI) | LLMs, foundation models (GPT-4, Llama, Gemini) | Transparency about training data, capability evaluations, copyright compliance, systemic risk assessment for the most powerful models |
| Limited Risk | Chatbots, AI-generated content | Disclose AI involvement to users. Label AI-generated images/text. |
| Minimal Risk | AI spam filters, recommendation systems | No mandatory requirements; voluntary codes of conduct encouraged |
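The tier table above can be encoded as data so that deployment tooling checks obligations automatically. A minimal sketch, assuming you maintain your own tier assignments; the enum, obligation names, and functions are illustrative, not from any regulatory library:

```python
# Sketch: EU AI Act risk tiers as data, so a CI gate can check
# obligations automatically. Names summarize the table above and are
# illustrative -- map them to your own compliance checklist.
from enum import Enum

class RiskTier(Enum):
    UNACCEPTABLE = "unacceptable"
    HIGH = "high"
    GPAI = "general_purpose"
    LIMITED = "limited"
    MINIMAL = "minimal"

OBLIGATIONS = {
    RiskTier.UNACCEPTABLE: ["banned_in_eu"],
    RiskTier.HIGH: ["conformity_assessment", "human_oversight",
                    "explainability", "bias_audit", "registration"],
    RiskTier.GPAI: ["training_data_transparency", "capability_evals",
                    "copyright_compliance"],
    RiskTier.LIMITED: ["ai_disclosure", "content_labeling"],
    RiskTier.MINIMAL: [],
}

def deployment_checklist(tier: RiskTier) -> list[str]:
    """Return the obligations a system in this tier must satisfy."""
    return OBLIGATIONS[tier]

def can_deploy_in_eu(tier: RiskTier) -> bool:
    return "banned_in_eu" not in OBLIGATIONS[tier]
```

A CI gate can then refuse to ship a system whose checklist items are not all marked as satisfied.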

2. Data Lineage: Answering "Why Did the Model Say This?"

Regulators and users will ask you to explain AI decisions. Without data lineage tracking, you can't answer. Data lineage means knowing exactly: which training data influenced which model version, which model version produced which output, and which user prompt triggered that output:

-- Data lineage tracking schema (Postgres)
CREATE TABLE model_versions (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    model_name TEXT NOT NULL,          -- 'customer-classifier-v3'
    base_model TEXT,                   -- 'gpt-4o-2024-11-20'
    training_dataset_id UUID,          -- FK to your dataset registry
    training_dataset_version TEXT,     -- Git commit or data version hash
    fine_tuned_on TIMESTAMP,
    evaluation_metrics JSONB,          -- Accuracy, F1, fairness metrics
    deployed_at TIMESTAMP,
    deprecated_at TIMESTAMP
);

CREATE TABLE inference_log (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    model_version_id UUID REFERENCES model_versions(id),
    user_id TEXT,                      -- Pseudonymized if needed
    prompt_hash TEXT,                  -- Hash of prompt (not raw text if sensitive)
    response_hash TEXT,                -- Hash of response
    prompt_token_count INT,
    created_at TIMESTAMP DEFAULT now(),
    -- RAW prompt/response stored in separate encrypted table
    raw_log_id UUID
);

3. Model Cards: Your AI's Documentation

A Model Card is the README for your deployed model. Every production AI component should have one. Google publishes a formal spec; here's a practical template:

# Model Card: Customer Support Classifier v3

## Model Details
- **Developer:** Acme Corp AI Team
- **Model date:** 2025-01-15
- **Model version:** 3.0.1
- **Base model:** GPT-4o-mini (fine-tuned via OpenAI API)
- **License:** Internal use only

## Intended Use
- **Primary use:** Classify incoming support tickets by urgency (URGENT/HIGH/LOW)
- **Out-of-scope uses:** Medical advice, legal judgments, employment decisions

## Training Data
- **Dataset:** 50,000 labeled support tickets (Jan 2023 - Dec 2024)
- **Labeling:** Human-labeled by support team leads (inter-rater agreement: 0.91 Cohen's Îș)
- **Demographics:** US English. Limited performance for non-US English speakers.
- **Data version:** dataset-v3-2025-01-10 (SHA: abc123)

## Evaluation Metrics
| Metric            | Score | Notes                              |
|-------------------|-------|------------------------------------|
| Accuracy          | 94.2% | Held-out test set (n=5,000)        |
| F1 (URGENT)       | 0.921 | Critical for SLA compliance        |
| F1 (LOW)          | 0.963 |                                    |
| False Urgent Rate | 3.1%  | Risk: wasted engineer time         |
| False Low Rate    | 2.7%  | Risk: SLA breach on real urgencies |

## Known Limitations & Biases
- Lower accuracy (<85%) for non-native English text
- Training data reflects historical team biases in urgency assignment
- Cannot handle tickets in languages other than English

## Ethical Considerations
Outputs affect engineer workload allocation. Systematically mis-classifying
ticket types from certain customer segments could create disparate outcomes.
Recommend quarterly bias audits by customer segment.

## Changelog
- v3.0.1: Retrained with 6 months additional data. Improved URGENT F1 by 4%.
- v2.1.0: Added negation handling after false-positive analysis.
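Model cards drift out of date unless something enforces them. A minimal sketch of a CI check, assuming cards are stored as markdown files with `## Section` headings like the template above; `REQUIRED_SECTIONS` is an assumption to adapt to your own template:

```python
# Sketch: a CI check that fails if a model card is missing required
# sections. Assumes cards are markdown files using "## Section" headings
# like the template above; adjust REQUIRED_SECTIONS to your template.
import re

REQUIRED_SECTIONS = [
    "Model Details", "Intended Use", "Training Data",
    "Evaluation Metrics", "Known Limitations & Biases",
]

def validate_model_card(markdown: str) -> list[str]:
    """Return the required sections missing from the card, in order."""
    headings = set(re.findall(r"^##\s+(.+?)\s*$", markdown, re.MULTILINE))
    return [s for s in REQUIRED_SECTIONS if s not in headings]
```

Run it over every file in your model-card directory and fail the build on a non-empty result.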

4. Audit Logging: The Evidence Trail

When a regulator, a customer, or your own incident review asks "what did the model do, for whom, and when?", audit logs are the answer. Wrap every LLM call in logging middleware so the evidence trail is automatic rather than best-effort:

// Middleware: wrap your LLM calls with audit logging
async function loggedLLMCall(userId: string, prompt: string, options: LLMOptions) {
    const startTime = Date.now();
    const logEntry = {
        requestId: crypto.randomUUID(),
        userId: pseudonymize(userId),   // GDPR: hash user IDs
        modelVersion: options.model,
        promptTokens: await countTokens(prompt),
        timestamp: new Date().toISOString(),
        sessionId: options.sessionId,
    };

    try {
        const response = await callLLM(prompt, options);
        
        await db.inferenceLog.create({
            data: {
                ...logEntry,
                responseTokens: response.usage.completion_tokens,
                latencyMs: Date.now() - startTime,
                status: 'SUCCESS',
                // Store prompt+response in encrypted table for investigation
                rawLogId: await storeEncrypted({ prompt, response: response.content }),
            }
        });
        
        return response;
    } catch (error: any) {
        await db.inferenceLog.create({
            data: { ...logEntry, status: 'ERROR', errorCode: error.code }
        });
        throw error;
    }
}

// Access controls for raw logs
// - Engineers: can see aggregated metrics
// - Compliance team: can access raw logs with audit trail
// - Legal hold: auto-preserve logs for users involved in disputes
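The access rules above can start as a simple gate that records every attempt to read a raw log, allowed or not. A minimal sketch; the role names mirror the comments above, and the in-memory access log stands in for a real audit store:

```python
# Sketch: role-based access gate for raw (decrypted) logs, with its own
# audit trail. Role names mirror the access rules above; the list-based
# access log is a stand-in for a real append-only store.
RAW_LOG_ROLES = {"compliance", "legal"}

def read_raw_log(raw_log_id: str, requester_role: str,
                 access_log: list) -> bool:
    """Record the access attempt, then return whether it is allowed."""
    allowed = requester_role in RAW_LOG_ROLES
    access_log.append({"raw_log_id": raw_log_id,
                       "role": requester_role,
                       "allowed": allowed})
    return allowed
```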

5. Human Oversight: When AI Must Ask for Help

High-risk systems under the EU AI Act must have effective human oversight. In practice, that means defining explicit triggers that route a decision to a person instead of acting automatically:

// High-risk decision: routing to human review
async function classifyTicket(ticket: string, userId: string) {
    const prediction = await model.predict(ticket);
    
    // Mandatory human review triggers (for High-Risk AI systems):
    const requiresHumanReview = 
        prediction.confidence < 0.85 ||          // Low confidence
        ticket.length > 5000 ||                   // Unusually complex
        prediction.class === 'LEGAL_THREAT' ||    // High-stakes category
        await isVulnerableUser(userId);            // Protected user group
    
    if (requiresHumanReview) {
        await humanReviewQueue.add({ ticket, prediction, userId });
        return { status: 'PENDING_HUMAN_REVIEW', estimatedTime: '2 hours' };
    }
    
    return { status: 'AUTO_CLASSIFIED', result: prediction };
}

6. Fairness Evaluation

Bias audits are mandatory for high-risk systems and good practice everywhere else. The simplest useful audit slices your evaluation metrics by segment and flags disparities:

from sklearn.metrics import classification_report
import pandas as pd

# Fairness audit: evaluate model performance by demographic segment
def fairness_audit(predictions_df: pd.DataFrame) -> dict:
    segments = predictions_df.groupby('customer_region')
    
    results = {}
    for segment_name, segment_data in segments:
        report = classification_report(
            segment_data['true_label'],
            segment_data['predicted_label'],
            output_dict=True,
            zero_division=0,  # Avoid divide-by-zero on sparse segments
        )
        results[segment_name] = {
            'f1_macro': report['macro avg']['f1-score'],
            # Share of this segment's URGENT predictions that were wrong
            'false_urgent_rate': 1 - report['URGENT']['precision'],
        }
    
    # Flag significant disparity
    f1_scores = [v['f1_macro'] for v in results.values()]
    if max(f1_scores) - min(f1_scores) > 0.10:
        print("⚠  FAIRNESS ALERT: >10% F1 disparity across segments")
    
    return results
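The audit above compares F1 across segments; regulators also ask whether the model assigns outcomes at different rates across groups. A self-contained sketch of a selection-rate (demographic parity) check, using the "four-fifths rule" threshold from US employment-law practice as a heuristic; column names are illustrative:

```python
# Sketch: selection-rate (demographic parity) check. Flags groups whose
# rate of receiving a given prediction falls below 80% of the highest
# group's rate -- the "four-fifths rule" heuristic. Column names are
# illustrative; adapt them to your own prediction log.
import pandas as pd

def selection_rate_audit(df: pd.DataFrame, group_col: str,
                         pred_col: str, positive_label: str,
                         threshold: float = 0.8) -> dict:
    # Per-group share of predictions equal to the positive label
    rates = (df[pred_col] == positive_label).groupby(df[group_col]).mean()
    max_rate = rates.max()
    return {
        group: {"rate": rate, "flagged": rate < threshold * max_rate}
        for group, rate in rates.items()
    }
```

Pair this with the per-segment F1 audit: rate disparities catch allocation skew even when accuracy looks uniform.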

Frequently Asked Questions

Does the EU AI Act apply to US companies?

Yes, if you offer AI systems to EU residents or users — regardless of where your company is headquartered. This is the same extraterritorial principle as GDPR. If your SaaS product has any EU customers, you need to assess your AI risk tiers under the Act.

What's the minimum viable governance for a startup?

Start with: (1) Audit logs for all LLM calls with user ID and timestamp, (2) A one-page model card for each major AI feature, (3) A clear "AI Disclosure" in your product UI telling users when they're interacting with AI, and (4) A process for reviewing and deleting user data from logs upon request (GDPR Article 17 right to erasure).
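Item (1) can start as a single decorator around your existing LLM call. A minimal sketch, assuming a JSON-lines log file; the function and field names are illustrative, not from any specific library:

```python
# Sketch: minimum-viable audit logging as a decorator around your LLM
# call. Appends one JSON line per call, success or failure. Function and
# field names are illustrative.
import functools
import hashlib
import json
import time
import uuid

def audited(log_path: str):
    def decorator(llm_fn):
        @functools.wraps(llm_fn)
        def wrapper(user_id: str, prompt: str, **kwargs):
            entry = {
                "request_id": str(uuid.uuid4()),
                # GDPR: store a hash, not the raw user ID
                "user_hash": hashlib.sha256(user_id.encode()).hexdigest(),
                "prompt_hash": hashlib.sha256(prompt.encode()).hexdigest(),
                "timestamp": time.time(),
            }
            try:
                result = llm_fn(user_id, prompt, **kwargs)
                entry["status"] = "SUCCESS"
                return result
            except Exception:
                entry["status"] = "ERROR"
                raise
            finally:
                with open(log_path, "a") as f:
                    f.write(json.dumps(entry) + "\n")
        return wrapper
    return decorator
```

Because only hashes are stored, honoring an erasure request means deleting the lines whose `user_hash` matches the requesting user's hashed ID.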

Conclusion

Governance is not just a compliance checkbox — it's the engineering infrastructure that makes AI systems trustworthy, debuggable, and improvable over time. The audit logs that help you comply with the EU AI Act are the same logs that help you diagnose quality regressions and investigate user complaints. Build governance in from day one, not as a retrofit after regulatory pressure arrives.

Written by

Vivek

AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.
