Protecting User Privacy: PII Redaction with Presidio

Jan 1, 2026 • 20 min read

When users interact with AI applications, they routinely share private information: medical symptoms, financial details, names of colleagues, addresses, phone numbers. Sending this information to a third-party LLM API (OpenAI, Anthropic, Google) may constitute a GDPR data transfer requiring consent, a HIPAA violation if it involves health information, or a breach of your data processing agreements. Microsoft Presidio is the enterprise-grade solution: an open-source PII detection and anonymization engine built for exactly this problem.

1. Architecture: Analyzer + Anonymizer Pipeline

pip install presidio-analyzer presidio-anonymizer
python -m spacy download en_core_web_lg  # NLP model for entity recognition

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

# 1. Initialize (loads spaCy NLP model + regex recognizers)
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

# Test text with multiple PII types
text = """
Hi, I'm David Johnson. Call me at (555) 867-5309 or 
email david.johnson@techcorp.com. My SSN is 123-45-6789.
My credit card is 4532-0149-2999-1234 (expires 08/27).
I live at 123 Elm Street, Springfield, IL 62701.
Patient ID: PAT-94872. Account: ACC-38291.
"""

# 2. Analyze: Find all PII with confidence scores
results = analyzer.analyze(
    text=text,
    entities=[
        "PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS", "US_SSN",
        "CREDIT_CARD", "LOCATION", "DATE_TIME"
    ],
    language='en',
    score_threshold=0.7,  # Only flag high-confidence detections (reduce false positives)
)

for result in results:
    print(f"{result.entity_type}: '{text[result.start:result.end]}' (confidence: {result.score:.2f})")

# Output:
# PERSON: 'David Johnson' (confidence: 0.85)
# PHONE_NUMBER: '(555) 867-5309' (confidence: 0.95)
# EMAIL_ADDRESS: 'david.johnson@techcorp.com' (confidence: 0.99)
# US_SSN: '123-45-6789' (confidence: 0.85)
# CREDIT_CARD: '4532-0149-2999-1234' (confidence: 0.99)
# LOCATION: 'Springfield, IL 62701' (confidence: 0.77)

# 3. Anonymize with different strategies per entity type
anonymized = anonymizer.anonymize(
    text=text,
    analyzer_results=results,
    operators={
        # Replace with placeholder tags (default strategy)
        "PERSON": OperatorConfig("replace", {"new_value": "<PERSON>"}),
        "PHONE_NUMBER": OperatorConfig("replace", {"new_value": "<PHONE>"}),
        "EMAIL_ADDRESS": OperatorConfig("replace", {"new_value": "<EMAIL>"}),
        
        # Hash for IDs (consistent — repeated IDs get same hash, useful for relationships)
        "US_SSN": OperatorConfig("hash", {"hash_type": "sha256"}),
        "CREDIT_CARD": OperatorConfig("mask", {"masking_char": "*", "chars_to_mask": 12, "from_end": False}),
        
        # Keep location at city level (redact street, keep city for context)
        "LOCATION": OperatorConfig("replace", {"new_value": "<LOCATION>"}),
    }
)

print(anonymized.text)
# Output:
# Hi, I'm <PERSON>. Call me at <PHONE> or email <EMAIL>.
# My SSN is a3b4c5d6e7f8... (hashed).
# My credit card is ************1234 (expires 08/27).
# I live at <LOCATION>.
# Patient ID: PAT-94872. Account: ACC-38291.  ← Custom IDs not detected yet

# Now safe to send to a third-party LLM API
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": anonymized.text}]
)

2. Custom Recognizers for Domain-Specific IDs

from presidio_analyzer import EntityRecognizer, Pattern, PatternRecognizer, RecognizerResult
from presidio_analyzer.nlp_engine import NlpArtifacts

# Pattern-based recognizer for internal Patient IDs (format: PAT-XXXXX)
patient_id_pattern = Pattern(
    name="patient_id_pattern",
    regex=r"\bPAT-\d{5,8}\b",
    score=0.9,  # High confidence for this specific format
)

patient_recognizer = PatternRecognizer(
    supported_entity="PATIENT_ID",
    patterns=[patient_id_pattern],
    context=["patient", "pid", "patient id", "patient number"],  # Context boosts confidence
)

# Pattern-based recognizer for Internal Account IDs (format: ACC-XXXXX)
account_recognizer = PatternRecognizer(
    supported_entity="ACCOUNT_ID",
    patterns=[Pattern("account_id", r"\bACC-\d{5,8}\b", 0.9)],
    context=["account", "account id", "acc"],
)

# For more complex cases: a custom recognizer subclassing EntityRecognizer,
# which gets access to NlpArtifacts (tokens, lemmas) alongside the raw text
import re

class CustomMedicalRecognizer(EntityRecognizer):
    """Detects medical record numbers using a regex plus surrounding context."""
    
    ENTITIES = ["MEDICAL_RECORD_NUMBER"]
    DEFAULT_EXPLANATION = "Identified as medical record number based on context"
    
    def __init__(self):
        super().__init__(supported_entities=self.ENTITIES)
    
    def load(self):
        pass  # No additional loading needed
    
    def analyze(self, text: str, entities, nlp_artifacts: NlpArtifacts):
        results = []
        # Look for "MRN" followed by an alphanumeric identifier
        for match in re.finditer(r'\bMRN[:\s#]*([A-Z0-9\-]{6,12})\b', text, re.IGNORECASE):
            results.append(RecognizerResult(
                entity_type="MEDICAL_RECORD_NUMBER",
                start=match.start(),
                end=match.end(),
                score=0.95,
            ))
        return results

# Register all custom recognizers with the analyzer
analyzer.registry.add_recognizer(patient_recognizer)
analyzer.registry.add_recognizer(account_recognizer)
analyzer.registry.add_recognizer(CustomMedicalRecognizer())

# Now the same analysis detects domain-specific IDs
results = analyzer.analyze(text=text, entities=["PATIENT_ID", "ACCOUNT_ID", "MEDICAL_RECORD_NUMBER"], language='en')
# Patient ID: PAT-94872 is now detected!
# Account: ACC-38291 is now detected!
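Before wiring recognizers into Presidio, the regexes themselves are easy to sanity-check in isolation. A quick standalone sketch (the `sample` string is illustrative, not from a real record):

```python
import re

# The same patterns registered with Presidio above, tested standalone
PATIENT_ID_RE = re.compile(r"\bPAT-\d{5,8}\b")
ACCOUNT_ID_RE = re.compile(r"\bACC-\d{5,8}\b")
MRN_RE = re.compile(r"\bMRN[:\s#]*([A-Z0-9\-]{6,12})\b", re.IGNORECASE)

sample = "Patient ID: PAT-94872. Account: ACC-38291. MRN: A1B2C3D4."

print(PATIENT_ID_RE.findall(sample))  # ['PAT-94872']
print(ACCOUNT_ID_RE.findall(sample))  # ['ACC-38291']
print(MRN_RE.findall(sample))         # ['A1B2C3D4'] — the captured record number
```

Testing the raw patterns first keeps recognizer debugging separate from scoring and context logic.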

3. Reversible Pseudonymization: The Clean Room Pattern

from presidio_anonymizer import AnonymizerEngine, DeanonymizeEngine
from presidio_anonymizer.entities import OperatorConfig

anonymizer_engine = AnonymizerEngine()
deanonymizer_engine = DeanonymizeEngine()

def clean_room_llm_call(user_text: str, openai_client) -> str:
    """
    Clean Room Pattern:
    1. Detect and replace PII with pseudonyms (PERSON_1, PERSON_2...)
    2. Store mapping: {PERSON_1: "David Johnson"}
    3. Send pseudonymized text to LLM — LLM NEVER sees real PII
    4. LLM response uses pseudonyms: "PERSON_1 is eligible for..."
    5. Restore real names in LLM response before showing to user
    Result: LLM provides accurate response, never sees actual PII
    """
    results = analyzer.analyze(
        text=user_text,
        entities=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "US_SSN"],
        language='en',
    )
    
    # Use the "encrypt" operator to create reversible pseudonyms.
    # (Unlike "hash", "encrypt" is reversible: the same AES key decrypts the tokens.)
    # The key must be a valid AES key length: 128, 192, or 256 bits.
    key = "REPLACE_WITH_A_32_CHAR_AES_KEY!!"  # 32 chars = AES-256; load from a secrets manager in production
    anonymized_result = anonymizer_engine.anonymize(
        text=user_text,
        analyzer_results=results,
        operators={
            "PERSON": OperatorConfig("encrypt", {"key": key}),
            "EMAIL_ADDRESS": OperatorConfig("encrypt", {"key": key}),
            "PHONE_NUMBER": OperatorConfig("encrypt", {"key": key}),
            "US_SSN": OperatorConfig("encrypt", {"key": key}),
        }
    )
    
    # Send pseudonymized text to LLM
    llm_response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": anonymized_result.text}]
    ).choices[0].message.content
    
    # Restore real values in the LLM response using the stored encryption results.
    # Caveat: anonymized_result.items carries offsets into the anonymized *input*;
    # if the LLM rephrases the text, locate the ciphertext tokens in the response
    # and rebuild the spans before decrypting.
    deanonymized_response = deanonymizer_engine.deanonymize(
        text=llm_response,
        entities=anonymized_result.items,  # the encrypted tokens and their positions
        operators={"DEFAULT": OperatorConfig("decrypt", {"key": key})},
    )
    
    return deanonymized_response.text  # Real names restored, accurate LLM response

# Example:
# User: "Is David Johnson (SSN: 123-45-6789) eligible for the loan?"  
# Sent to LLM: "Is F9k2mXpQ== (SSN: Z8j1nY4R==) eligible for the loan?"
# LLM responds: "F9k2mXpQ== meets the criteria for loan approval."
# Returned to user: "David Johnson meets the criteria for loan approval."
# OpenAI never processed real PII ✅
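One practical caveat: the spans in anonymized_result.items point into the anonymized input, while the LLM's reply is a different string. Before deanonymizing a rephrased reply, each ciphertext token has to be re-located in it. A minimal, Presidio-free sketch (helper name and data are hypothetical):

```python
def relocate_ciphertexts(ciphertexts, llm_response):
    """Find each ciphertext token in the LLM response and return
    (start, end, token) spans suitable for driving a deanonymizer."""
    spans = []
    for token in ciphertexts:
        pos = llm_response.find(token)
        if pos != -1:  # token may be absent if the LLM dropped it
            spans.append((pos, pos + len(token), token))
    return spans

response = "F9k2mXpQ== meets the criteria for loan approval."
print(relocate_ciphertexts(["F9k2mXpQ=="], response))
# [(0, 10, 'F9k2mXpQ==')]
```

Because the encrypted tokens are opaque base64-like strings, an exact substring search is usually enough to pin them down in the reply.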

Frequently Asked Questions

Does Presidio work for non-English languages?

Yes — Presidio supports multiple languages through its NLP engine configuration. For English, it uses spaCy's en_core_web_lg. For other languages, install additional spaCy models (python -m spacy download es_core_news_lg for Spanish, de_core_news_lg for German, etc.) and configure NlpEngineProvider with the appropriate model name. Regex-based recognizers (phone numbers, SSNs, credit cards) are inherently language-independent as far as pattern syntax goes, but you'll need country-specific patterns — US SSNs look different from UK National Insurance Numbers. Presidio's recognizer registry allows layering multiple country-specific regex patterns on the same entity type.
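As a sketch, a multi-language setup boils down to a model-per-language configuration handed to NlpEngineProvider. Shown here as the plain dict it consumes; the commented lines assume presidio-analyzer and both spaCy models are installed:

```python
# One spaCy model per language code; Presidio routes by the `language`
# argument passed to analyzer.analyze(...)
nlp_configuration = {
    "nlp_engine_name": "spacy",
    "models": [
        {"lang_code": "en", "model_name": "en_core_web_lg"},
        {"lang_code": "es", "model_name": "es_core_news_lg"},
    ],
}

# With presidio-analyzer installed:
# from presidio_analyzer import AnalyzerEngine
# from presidio_analyzer.nlp_engine import NlpEngineProvider
# provider = NlpEngineProvider(nlp_configuration=nlp_configuration)
# analyzer = AnalyzerEngine(nlp_engine=provider.create_engine(),
#                           supported_languages=["en", "es"])
# analyzer.analyze(text="Me llamo Ana García", language="es")

print(sorted(m["lang_code"] for m in nlp_configuration["models"]))  # ['en', 'es']
```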

What's the performance impact of running Presidio on every LLM request?

The spaCy NER model (the main latency component) takes 20-80ms for typical messages (100-500 characters) on modern hardware. For high-volume production APIs, run Presidio as a separate microservice to avoid loading the spaCy model on every serverless function cold start. Cache Presidio analyzer results for identical inputs (users often resend the same message). For pure pattern-matching (no NER), using only regex-based recognizers drops latency to under 5ms. A reasonable architecture: run NER-based detection on message ingestion into your RAG pipeline, regex-only detection on the real-time chat path, and async full-corpus scanning for data audit purposes.
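The caching idea can be sketched without Presidio at all. A hypothetical wrapper keyed on a SHA-256 of the input text, with a stub standing in for analyzer.analyze:

```python
import hashlib

# Hypothetical cache: identical resent messages skip the expensive NER pass.
# `run_analyzer` stands in for a call like analyzer.analyze(text=..., language='en').
_cache = {}

def cached_analyze(text, run_analyzer):
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = run_analyzer(text)
    return _cache[key]

calls = []
def fake_analyzer(text):
    calls.append(text)               # record each real analysis
    return [("PERSON", 0, 13)]       # stubbed detection result

cached_analyze("Hi, I'm David", fake_analyzer)
cached_analyze("Hi, I'm David", fake_analyzer)  # served from cache
print(len(calls))  # 1
```

In production the dict would be a bounded or TTL cache (e.g. Redis), since PII detections should not outlive your data-retention window.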

Conclusion

PII redaction is not optional for production AI applications handling real user data. Microsoft Presidio's combination of spaCy NER-based entity detection and regex pattern matching covers the vast majority of sensitive data types (names, phones, emails, SSNs, credit cards, medical IDs) without requiring custom ML training. The extensible recognizer registry allows adding domain-specific patterns for your internal IDs in minutes. The Clean Room Pattern — pseudonymize before LLM call, restore afterwards — satisfies GDPR's data minimization principle and HIPAA's de-identification requirements while maintaining the accuracy of LLM responses. For organizations building AI on sensitive data, Presidio + the Clean Room Pattern is the architectural foundation that compliance, legal, and security teams require.

Written by

Vivek

AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.
