AI Governance: Compliance as Code
Dec 30, 2025 · 20 min read
If you deploy AI in 2025 without a governance layer, you are creating legal, reputational, and ethical liability. The EU AI Act entered into force in August 2024, its obligations phase in through 2026 and beyond, and regulatory equivalents are emerging globally. Beyond compliance, governance is good engineering: it makes your AI systems auditable, debuggable, and trustworthy. This guide provides a practical framework for building governance into your AI systems from day one.
1. The EU AI Act: Risk Tiers That Determine Your Requirements
| Risk Tier | Examples | Requirements |
|---|---|---|
| Unacceptable Risk | Social scoring, real-time biometric surveillance in public | Banned. Cannot deploy in EU. |
| High Risk | Recruiting AI, medical diagnosis, credit scoring, law enforcement, critical infrastructure | Mandatory conformity assessment, human oversight, explainability, bias audits, registration |
| General Purpose AI (GPAI) | LLMs, foundation models (GPT-4, Llama, Gemini) | Transparency about training data, capability evaluations, copyright compliance, systemic risk assessment for most powerful models |
| Limited Risk | Chatbots, AI-generated content | Disclose AI involvement to users. Label AI-generated images/text. |
| Minimal Risk | AI spam filters, recommendation systems | No mandatory requirements, voluntary codes of conduct encouraged |
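One way to make these tiers operational is a lookup your deployment pipeline checks before a feature ships. The sketch below is illustrative, not legal advice: the use-case labels and their tier assignments are assumptions, and anything unmapped should fall through to a human.

```python
# Illustrative mapping of internal use-case labels to EU AI Act risk tiers.
# The labels and tier assignments are assumptions for this sketch, not a
# legal classification -- unknown use cases fall through to review.
RISK_TIERS = {
    "social_scoring": "unacceptable",   # banned outright in the EU
    "recruiting": "high",
    "medical_diagnosis": "high",
    "credit_scoring": "high",
    "support_chatbot": "limited",       # must disclose AI involvement
    "spam_filter": "minimal",
}

def risk_tier(use_case: str) -> str:
    """Return the risk tier for a use case, or flag it for legal review."""
    return RISK_TIERS.get(use_case, "unclassified")

def deployment_allowed(use_case: str) -> bool:
    """Hard gate: never auto-deploy banned or unclassified systems."""
    return risk_tier(use_case) not in ("unacceptable", "unclassified")
```

The point of the hard gate is that "we forgot to classify it" fails closed, not open.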
2. Data Lineage: Answering "Why Did the Model Say This?"
Regulators and users will ask you to explain AI decisions. Without data lineage tracking, you can't answer. Data lineage means knowing exactly: which training data influenced which model version, which model version produced which output, and which user prompt triggered that output:
-- Data lineage tracking schema (Postgres)
CREATE TABLE model_versions (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    model_name TEXT NOT NULL,           -- 'customer-classifier-v3'
    base_model TEXT,                    -- 'gpt-4o-2024-11-20'
    training_dataset_id UUID,           -- FK to your dataset registry
    training_dataset_version TEXT,      -- Git commit or data version hash
    fine_tuned_on TIMESTAMP,
    evaluation_metrics JSONB,           -- Accuracy, F1, fairness metrics
    deployed_at TIMESTAMP,
    deprecated_at TIMESTAMP
);

CREATE TABLE inference_log (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    model_version_id UUID REFERENCES model_versions(id),
    user_id TEXT,          -- Pseudonymized if needed
    prompt_hash TEXT,      -- Hash of prompt (not raw text if sensitive)
    response_hash TEXT,    -- Hash of response
    prompt_token_count INT,
    created_at TIMESTAMP DEFAULT now(),
    -- Raw prompt/response stored in a separate encrypted table
    raw_log_id UUID
);

3. Model Cards: Your AI's Documentation
A Model Card is the README for your deployed model. Every production AI component should have one. Google's Model Cards framework defines a formal structure; here's a practical template:
# Model Card: Customer Support Classifier v3
## Model Details
- **Developer:** Acme Corp AI Team
- **Model date:** 2025-01-15
- **Model version:** 3.0.1
- **Base model:** GPT-4o-mini (fine-tuned via OpenAI API)
- **License:** Internal use only
## Intended Use
- **Primary use:** Classify incoming support tickets by urgency (URGENT/HIGH/LOW)
- **Out-of-scope uses:** Medical advice, legal judgments, employment decisions
## Training Data
- **Dataset:** 50,000 labeled support tickets (Jan 2023 - Dec 2024)
- **Labeling:** Human-labeled by support team leads (inter-rater agreement: 0.91 Cohen's Îș)
- **Demographics:** US English. Limited performance for non-US English speakers.
- **Data version:** dataset-v3-2025-01-10 (SHA: abc123)
## Evaluation Metrics
| Metric | Score | Notes |
|-----------------|--------|------------------------------------|
| Accuracy | 94.2% | Held-out test set (n=5,000) |
| F1 (URGENT) | 0.921 | Critical for SLA compliance |
| F1 (LOW) | 0.963 | |
| False Urgent Rate | 3.1% | Risk: wasted engineer time |
| False Low Rate | 2.7% | Risk: SLA breach on real urgencies |
## Known Limitations & Biases
- Lower accuracy (<85%) for non-native English text
- Training data reflects historical team biases in urgency assignment
- Cannot handle tickets in languages other than English
## Ethical Considerations
Outputs affect engineer workload allocation. Systematically mis-classifying
ticket types from certain customer segments could create disparate outcomes.
Recommend quarterly bias audits by customer segment.
## Changelog
- v3.0.1: Retrained with 6 months additional data. Improved URGENT F1 by 4%.
- v2.1.0: Added negation handling after false-positive analysis.

4. Audit Logging: The Evidence Trail
// Middleware: wrap your LLM calls with audit logging
async function loggedLLMCall(userId: string, prompt: string, options: LLMOptions) {
  const startTime = Date.now();
  const logEntry = {
    requestId: crypto.randomUUID(),
    userId: pseudonymize(userId), // GDPR: hash user IDs
    modelVersion: options.model,
    promptTokens: await countTokens(prompt),
    timestamp: new Date().toISOString(),
    sessionId: options.sessionId,
  };
  try {
    const response = await callLLM(prompt, options);
    await db.inferenceLog.create({
      data: {
        ...logEntry,
        responseTokens: response.usage.completion_tokens,
        latencyMs: Date.now() - startTime,
        status: 'SUCCESS',
        // Store prompt+response in encrypted table for investigation
        rawLogId: await storeEncrypted({ prompt, response: response.content }),
      },
    });
    return response;
  } catch (error: any) {
    await db.inferenceLog.create({
      data: { ...logEntry, status: 'ERROR', errorCode: error.code },
    });
    throw error;
  }
}
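The `pseudonymize` call above is left undefined; a common implementation is a keyed hash (HMAC), sketched here in Python for clarity. The same user always maps to the same pseudonym, so logs stay joinable, but the raw ID cannot be recovered or recomputed without the secret key. The `PSEUDONYM_KEY` environment variable is an assumption of this sketch.

```python
import hashlib
import hmac
import os

# Keyed hash: deterministic per user, irreversible without the key.
# PSEUDONYM_KEY is an assumed env var; the fallback is for local dev only.
SECRET_KEY = os.environ.get("PSEUDONYM_KEY", "dev-only-key").encode()

def pseudonymize(user_id: str) -> str:
    """Return a stable, non-reversible pseudonym for a user ID."""
    return hmac.new(SECRET_KEY, user_id.encode(), hashlib.sha256).hexdigest()
```

Prefer HMAC over a bare SHA-256 of the ID: without a key, anyone with a user list can recompute the hashes and de-pseudonymize your logs.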
// Access controls for raw logs
// - Engineers: can see aggregated metrics
// - Compliance team: can access raw logs with audit trail
// - Legal hold: auto-preserve logs for users involved in disputes

5. Human Oversight: When AI Must Ask for Help
// High-risk decision: routing to human review
async function classifyTicket(ticket: string, userId: string) {
  const prediction = await model.predict(ticket);

  // Mandatory human review triggers (for High-Risk AI systems):
  const requiresHumanReview =
    prediction.confidence < 0.85 ||          // Low confidence
    ticket.length > 5000 ||                  // Unusually complex
    prediction.class === 'LEGAL_THREAT' ||   // High-stakes category
    (await isVulnerableUser(userId));        // Protected user group

  if (requiresHumanReview) {
    await humanReviewQueue.add({ ticket, prediction, userId });
    return { status: 'PENDING_HUMAN_REVIEW', estimatedTime: '2 hours' };
  }
  return { status: 'AUTO_CLASSIFIED', result: prediction };
}

6. Fairness Evaluation
from sklearn.metrics import classification_report
import pandas as pd

# Fairness audit: evaluate model performance by demographic segment
def fairness_audit(predictions_df: pd.DataFrame) -> dict:
    segments = predictions_df.groupby('customer_region')
    results = {}
    for segment_name, segment_data in segments:
        report = classification_report(
            segment_data['true_label'],
            segment_data['predicted_label'],
            output_dict=True,
        )
        results[segment_name] = {
            'f1_macro': report['macro avg']['f1-score'],
            # Assumes every segment contains URGENT examples
            'false_urgent_rate': 1 - report['URGENT']['precision'],
        }
    # Flag significant disparity
    f1_scores = [v['f1_macro'] for v in results.values()]
    if max(f1_scores) - min(f1_scores) > 0.10:
        print("⚠️ FAIRNESS ALERT: >10% F1 disparity across segments")
    return results

Frequently Asked Questions
Does the EU AI Act apply to US companies?
Yes, if you offer AI systems to EU residents or users, regardless of where your company is headquartered. This is the same extraterritorial principle as GDPR. If your SaaS product has any EU customers, you need to assess your AI risk tiers under the Act.
What's the minimum viable governance for a startup?
Start with: (1) Audit logs for all LLM calls with user ID and timestamp, (2) A one-page model card for each major AI feature, (3) A clear "AI Disclosure" in your product UI telling users when they're interacting with AI, and (4) A process for reviewing and deleting user data from logs upon request (GDPR Article 17 right to erasure).
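Item (4) is tractable when every log row keys on a pseudonymized user ID: erasure becomes a single delete. A minimal SQLite sketch follows; the table and column names loosely mirror the lineage schema above but are simplified for the example, and a real system must also purge the separate encrypted raw-log table.

```python
import sqlite3

def erase_user_logs(conn: sqlite3.Connection, pseudonym: str) -> int:
    """Delete all inference-log rows for a pseudonymized user (GDPR Art. 17).

    Returns the number of rows erased, which you can record in a
    separate erasure-audit table as evidence the request was honored.
    """
    cur = conn.execute("DELETE FROM inference_log WHERE user_id = ?", (pseudonym,))
    conn.commit()
    return cur.rowcount

# Usage sketch with an in-memory database and simplified schema
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inference_log (id INTEGER PRIMARY KEY, user_id TEXT, prompt_hash TEXT)")
conn.execute(
    "INSERT INTO inference_log (user_id, prompt_hash) "
    "VALUES ('u1', 'h1'), ('u1', 'h2'), ('u2', 'h3')"
)
deleted = erase_user_logs(conn, "u1")  # removes both of u1's rows
```

Logging the count of erased rows (not their contents) gives you proof of compliance without re-creating the data you just deleted.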
Conclusion
Governance is not just a compliance checkbox: it's the engineering infrastructure that makes AI systems trustworthy, debuggable, and improvable over time. The audit logs that help you comply with the EU AI Act are the same logs that help you diagnose quality regressions and investigate user complaints. Build governance in from day one, not as a retrofit after regulatory pressure arrives.
Vivek
AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning: no fluff, just working code and real-world context.