Architecture Report

Evidence File Anonymization Pipeline

Phased architecture for PII anonymization of user-uploaded evidence files (PDF, images), moving from manual review at launch to an automated pipeline at scale.

Kleros Enterprise Feb 2026 v1.0 — Draft

Guiding Principles

The webapp is in early stages with a small user base. The initial evidence files may not contain PII at all. The architecture is designed to ship fast, collect real data on PII prevalence, and add automated anonymization only when justified by actual usage patterns.

Each phase is designed so that the transition to the next phase requires no breaking changes to the upload flow, database schema, or frontend contract. The processing stage exists from day one — it simply does more work in later phases.


Phased Rollout

Phase 1 — Ship Now

Manual Review with User Consent

Trigger: Launch. This is your starting state.

The user uploads evidence. A required consent checkbox grants Kleros permission to review and redact PII if necessary. The file passes through a no-op processing stage and is stored. Internally, staff review flagged files and manually redact using standard tools (PDF editor, image editor).

User Upload (+ consent ☑) → processEvidence() (no-op passthrough) → Store File (+ consent log) → Manual Review (internal staff)

Consent UI

The checkbox is required (not optional) to submit. Suggested wording: "I consent to Kleros reviewing this file and redacting personal information (PII) if necessary before it is used in the dispute."

What to store at upload time

```json
{
  "file_id": "evt_abc123",
  "original_hash": "sha256:9f3a...",   // integrity proof
  "consent_granted": true,
  "consent_timestamp": "2026-02-19T14:30:00Z",
  "consent_user_id": "usr_xyz789",
  "processing_stage": "none",          // "none" | "manual" | "automated"
  "pii_detected": null,                // populated after review
  "pii_categories": [],                // e.g. ["name","address","face"]
  "redacted_file_id": null             // points to clean version
}
```
PII tracking from day one. Even during manual review, log what categories of PII you find (names, addresses, faces, national IDs, financial data). This data directly informs whether and when to invest in automation.
  • Required consent checkbox on upload form
  • Log consent event (timestamp + user ID) in database
  • Store SHA-256 hash of original file for integrity
  • Implement processEvidence() as a no-op passthrough
  • Add processing_stage and pii_categories fields to evidence schema
  • Internal team process: review uploads weekly, manually redact if needed
  • Track PII occurrences in a simple spreadsheet or log
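The upload-time bookkeeping above can be sketched as a single record-builder. This is a minimal sketch assuming Node.js; the field names follow the schema shown earlier, but the interface and helper names are our own, not part of any existing codebase:

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape matching the upload-time record sketched above.
interface EvidenceRecord {
  file_id: string;
  original_hash: string;
  consent_granted: true;          // consent is required to submit, so always true here
  consent_timestamp: string;
  consent_user_id: string;
  processing_stage: "none" | "manual" | "automated";
  pii_detected: boolean | null;
  pii_categories: string[];
  redacted_file_id: string | null;
}

// Build the record at upload time, before any processing touches the file.
function buildEvidenceRecord(fileId: string, fileBytes: Buffer, userId: string): EvidenceRecord {
  return {
    file_id: fileId,
    original_hash: "sha256:" + createHash("sha256").update(fileBytes).digest("hex"),
    consent_granted: true,
    consent_timestamp: new Date().toISOString(),
    consent_user_id: userId,
    processing_stage: "none",     // Phase 1: no-op pipeline
    pii_detected: null,           // populated after manual review
    pii_categories: [],
    redacted_file_id: null,
  };
}
```

Hashing before any processing is what makes the integrity chain in the later "Evidence Integrity" section possible.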
Phase 2 — Informed Automation Decision

PII Triage & Pipeline Skeleton

Trigger: You've manually reviewed ~50–100 files and have data on PII frequency.

Before building any automation, you now have real data. This phase is a decision gate, not necessarily a code change. Based on what you've observed, take one of three paths:

| Observation | PII Rate | Action |
| --- | --- | --- |
| Almost no files contain PII | < 5% | Stay in Phase 1. Manual review is sustainable. Revisit quarterly. |
| Some files contain PII, mostly predictable patterns | 5–30% | Implement Phase 3A with Presidio (self-hosted). The PII types are known and the volume justifies automation but not cloud costs. |
| Many files, diverse and complex PII types | > 30% | Evaluate Phase 3B with a cloud API — Google Cloud DLP (broadest entity coverage) or AWS Comprehend (native Spanish PII support). Cost justified by volume. |
The PII categories matter as much as the rate. If 20% of files contain PII but it's always email addresses and phone numbers, a few regex patterns handle it — no ML needed. If it's names embedded in free text and faces in photos, you need NER and computer vision. Let the data guide the tooling choice.
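For the email-and-phone-only case, a few regex patterns really do suffice. A minimal sketch (the patterns are illustrative, not production-grade; real-world formats are messier):

```typescript
// Illustrative patterns only. Real email/phone formats have many edge cases.
const EMAIL_RE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const PHONE_RE = /\+?\d[\d\s().-]{7,}\d/g;

// Replace matches with the shared placeholder tokens used throughout this report.
function redactSimplePii(text: string): string {
  return text.replace(EMAIL_RE, "[EMAIL]").replace(PHONE_RE, "[PHONE]");
}
```

If the Phase 1 spreadsheet shows only these two categories, this function plus manual spot checks may be the entire Phase 3.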
⚡ Choose One Path
Phase 3A — Self-Hosted

Automated Pipeline with Presidio

Trigger: PII is frequent enough to automate, but types are predictable and volume is moderate.

The no-op processEvidence() is replaced with a call to a self-hosted Presidio microservice. No data leaves your infrastructure. Custom recognizers are added for entity types specific to your evidence.

User Upload (+ consent ☑) → processEvidence() (Presidio Analyzer) → Anonymizer (redact / mask) → Store Redacted (+ manifest)

For image evidence, add the Presidio Image Redactor container for face/text detection in images.
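A sketch of the HTTP calls behind a Presidio-backed `processEvidence()`, assuming the container ports above. The request shapes follow Presidio's documented REST API, but verify them against the version you deploy; the builder functions are kept pure so the payloads can be tested without a running container:

```typescript
// Request-body builders for Presidio's two services.
function buildAnalyzeRequest(text: string, language = "es") {
  return { text, language };           // POST http://localhost:5002/analyze
}

function buildAnonymizeRequest(text: string, analyzerResults: unknown[]) {
  return {
    text,
    analyzer_results: analyzerResults, // output of /analyze
    anonymizers: {
      // Default: replace every detected entity with a placeholder token.
      DEFAULT: { type: "replace", new_value: "[REDACTED]" },
    },
  };                                   // POST http://localhost:5001/anonymize
}

// Thin wrapper used by processEvidence() in Phase 3A (requires Node 18+ fetch).
async function presidioAnalyze(text: string): Promise<unknown[]> {
  const res = await fetch("http://localhost:5002/analyze", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildAnalyzeRequest(text)),
  });
  return res.json();
}
```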

✓ Advantages

  • No data leaves your infrastructure
  • No API costs or usage limits
  • MIT license, fully free
  • Highly customizable recognizers
  • Predictable latency (no network round-trip)
  • Can run air-gapped if needed

✗ Tradeoffs

  • Lower out-of-box accuracy vs. cloud APIs
  • ~20–30 built-in recognizers (vs. 150+)
  • You maintain the infrastructure
  • NER model tuning needed for best results
  • Image redaction is less mature
  • Multi-language support requires separate models

Infrastructure

```yaml
# docker-compose.yml (add alongside existing services)
presidio-analyzer:
  image: mcr.microsoft.com/presidio-analyzer:latest
  ports: ["5002:3000"]
presidio-anonymizer:
  image: mcr.microsoft.com/presidio-anonymizer:latest
  ports: ["5001:3000"]
presidio-image-redactor:   # only if handling image evidence
  image: mcr.microsoft.com/presidio-image-redactor:latest
  ports: ["5003:3000"]
```
or
Phase 3B — Cloud-Based

Automated Pipeline with Cloud API (Google Cloud DLP or AWS Comprehend)

Trigger: PII is frequent, diverse, multi-language, or you need maximum detection accuracy and can accept cloud processing.

The same processEvidence() integration point is used, but it calls a cloud NLP API instead of local processing. The original file is sent to the cloud provider, analyzed, de-identified, and only the redacted version is stored. Two viable options:

User Upload (+ consent ☑) → processEvidence() (Cloud DLP / Comprehend) → Deidentify / Redact (redact / mask) → Store Redacted (+ manifest)

Choosing Between Cloud DLP and AWS Comprehend

| Dimension | Google Cloud DLP | AWS Comprehend |
| --- | --- | --- |
| PII language support | 60+ languages, auto-detection | English + Spanish (first-class) |
| Entity types | 150+ infoTypes globally | 22 universal + 14 country-specific (US, UK, CA, IN) |
| Best for our use case | Maximum entity coverage, multi-jurisdiction | Spanish-first evidence, simpler API surface |
| API complexity | Higher — inspection templates, deidentify configs in JSON | Simpler — single DetectPiiEntities call |
| Redaction mode | Real-time deidentify with multiple techniques (masking, tokenization, date-shifting) | Async batch job required for redaction; real-time for detection only |
| Structured data | Yes — BigQuery, Cloud Storage, databases | Text only (no native PDF; text extraction required) |
| S3/Object Lambda integration | No (GCS native) | Yes — S3 Object Lambda for automatic PII gating |
| Custom entity types | Custom infoTypes via regex/dictionary | Custom entity recognizers (requires training) |
| Pricing model | Per content item inspected | Per 100-character unit (3-unit minimum per request) |
| Cost at low volume | ~$10–50/month | ~$5–30/month |
| Infrastructure | GCP project required | AWS account required |
| Data jurisdiction | EU regions available | EU regions available (eu-west-1, eu-central-1) |
Recommendation for Kleros: Given that Spanish is the first use case, AWS Comprehend has a meaningful edge — it explicitly supports Spanish PII detection as a first-class feature, whereas Cloud DLP's Spanish support is part of broader auto-detection. Comprehend's API is also simpler to integrate. However, if the use case expands to many jurisdictions and languages beyond EN/ES, Cloud DLP's 150+ infoTypes and 60+ language coverage becomes the stronger choice.

✓ Shared Advantages (both)

  • No infrastructure to maintain
  • Managed, scalable, pay-per-use
  • Well-calibrated confidence scores
  • SOC 2 / ISO certified providers
  • EU region deployment available

✗ Shared Tradeoffs (both)

  • Raw PII sent to third-party cloud
  • Vendor dependency (GCP or AWS)
  • Adds network latency to upload flow
  • Less customizable than Presidio
  • Consent must disclose cloud processing

Privacy consent implications. If you adopt any cloud-based pipeline, update the consent checkbox wording to inform users that their file will be processed by a third-party cloud service for anonymization. The current wording ("Kleros reviewing and redacting") implies internal-only processing.

Shared Redaction Schema

Regardless of which pipeline produces the anonymized file, use a consistent redaction format. This ensures downstream consumers (jurors, UI, audit logs) don't need to know which pipeline was used.

| PII Category | Placeholder Token | Example |
| --- | --- | --- |
| Person name | [PERSON_NAME] | "The claimant [PERSON_NAME] filed on..." |
| Email address | [EMAIL] | "Contact at [EMAIL]" |
| Phone number | [PHONE] | "Reached at [PHONE]" |
| Physical address | [ADDRESS] | "Located at [ADDRESS]" |
| National ID / SSN | [NATIONAL_ID] | "ID number [NATIONAL_ID]" |
| Financial account | [FINANCIAL_ACCT] | "Bank account [FINANCIAL_ACCT]" |
| Face in image | blurred region | Gaussian blur applied over detected face bounding box |
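Applying the placeholder tokens above is mostly offset bookkeeping: replace detected spans from the end of the text backwards so earlier offsets stay valid. A minimal sketch (the entity shape is our own simplification; vendor detectors use different field names such as Comprehend's BeginOffset/EndOffset):

```typescript
// Simplified entity shape: detectors return character offsets plus a type.
// The field names here are our own, not a vendor API.
interface DetectedEntity {
  type: string;    // e.g. "PERSON_NAME", "EMAIL"
  start: number;   // inclusive character offset
  end: number;     // exclusive character offset
}

// Replace spans right-to-left so earlier offsets are not shifted by edits.
function applyRedactions(text: string, entities: DetectedEntity[]): string {
  const sorted = [...entities].sort((a, b) => b.start - a.start);
  let out = text;
  for (const e of sorted) {
    out = out.slice(0, e.start) + `[${e.type}]` + out.slice(e.end);
  }
  return out;
}
```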

Redaction Manifest

Stored alongside each anonymized file. Records what was redacted (categories, not values) so arbitrators can assess evidence completeness without seeing the original PII.

```json
{
  "file_id": "evt_abc123",
  "redacted_file_id": "evt_abc123_redacted",
  "pipeline": "manual",   // "manual" | "presidio" | "cloud_dlp" | "comprehend"
  "redaction_timestamp": "2026-02-19T15:00:00Z",
  "entities_redacted": [
    { "type": "PERSON_NAME", "count": 2 },
    { "type": "EMAIL", "count": 1 },
    { "type": "FACE", "count": 1 }
  ],
  "original_hash": "sha256:9f3a...",
  "redacted_hash": "sha256:b7c1..."
}
```
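The entities_redacted counts can be derived directly from whatever entity list the pipeline produced. A small aggregation sketch (the helper name is our own):

```typescript
// Count detected entities by type for the manifest's entities_redacted field.
function countByType(entities: { type: string }[]): { type: string; count: number }[] {
  const counts = new Map<string, number>();
  for (const e of entities) {
    counts.set(e.type, (counts.get(e.type) ?? 0) + 1);
  }
  // Map preserves insertion order, so output order follows first occurrence.
  return [...counts.entries()].map(([type, count]) => ({ type, count }));
}
```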

The processEvidence() Interface

This is the single integration point that evolves across phases. The function signature and return type stay constant — only the implementation changes.

```typescript
// Phase 1: No-op
async function processEvidence(file: File): Promise<ProcessedEvidence> {
  return {
    file,             // unchanged
    pipeline: "none",
    entitiesFound: [],
    redactionApplied: false,
  };
}

// ─────────────────────────────────────────────────────────────────────────────
// Phase 3A: Presidio
async function processEvidence(file: File): Promise<ProcessedEvidence> {
  const text = await extractText(file);
  const analysis = await presidioAnalyze(text);
  const redacted = await presidioAnonymize(text, analysis);
  return {
    file: rebuildFile(file, redacted),
    pipeline: "presidio",
    entitiesFound: analysis.entities,
    redactionApplied: true,
  };
}

// ─────────────────────────────────────────────────────────────────────────────
// Phase 3B-i: Google Cloud DLP
async function processEvidence(file: File): Promise<ProcessedEvidence> {
  const content = await fileToBuffer(file);
  const result = await dlpClient.deidentifyContent({
    parent: `projects/${projectId}/locations/global`,
    item: { byteItem: { data: content, type: 'PDF' } },
    deidentifyConfig: REDACTION_CONFIG,
  });
  return {
    file: result.item,
    pipeline: "cloud_dlp",
    entitiesFound: result.overview.transformationSummaries,
    redactionApplied: true,
  };
}

// ─────────────────────────────────────────────────────────────────────────────
// Phase 3B-ii: AWS Comprehend
async function processEvidence(file: File): Promise<ProcessedEvidence> {
  const text = await extractText(file);
  const result = await comprehendClient.detectPiiEntities({
    Text: text,
    LanguageCode: "es", // or "en" — auto-detect upstream
  });
  const redacted = applyRedactions(text, result.Entities);
  return {
    file: rebuildFile(file, redacted),
    pipeline: "comprehend",
    entitiesFound: result.Entities,
    redactionApplied: true,
  };
}
```
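The implementations above share a ProcessedEvidence return type that is not spelled out. A sketch of what it might look like (the file field is simplified; in practice it may be a stored-object handle rather than the file itself):

```typescript
// Shared return type for every processEvidence() implementation.
interface ProcessedEvidence {
  file: unknown;                       // original or rebuilt (redacted) file
  pipeline: "none" | "manual" | "presidio" | "cloud_dlp" | "comprehend";
  entitiesFound: unknown[];            // raw detector output, pipeline-specific
  redactionApplied: boolean;
}
```

Keeping entitiesFound loosely typed is deliberate: each pipeline returns a different entity shape, and downstream consumers should rely on the redaction manifest, not this raw field.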

Important Considerations

Evidence Integrity for Dispute Resolution

In a legal/arbitration context, redacted evidence can be challenged. Always store the SHA-256 hash of the original file at upload time, before any processing. The redacted version gets its own hash. This creates a verifiable chain: "this redacted file was derived from this original, which the user uploaded at this timestamp with this consent."
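Verifying that chain later is just recomputing both hashes. A sketch assuming Node.js and the manifest fields from the previous section:

```typescript
import { createHash } from "node:crypto";

const sha256 = (bytes: Buffer) =>
  "sha256:" + createHash("sha256").update(bytes).digest("hex");

// Check that stored hashes still match the files in object storage.
function verifyIntegrity(
  originalBytes: Buffer,
  redactedBytes: Buffer,
  manifest: { original_hash: string; redacted_hash: string },
): boolean {
  return sha256(originalBytes) === manifest.original_hash
      && sha256(redactedBytes) === manifest.redacted_hash;
}
```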

Consider whether you need an encrypted escrow of the original — accessible only under specific conditions (e.g., multi-sig from arbitrators) — for cases where redactions are disputed.

PDF Complexity

PDFs are deceptively complex. PII can live in text content streams, form fields, annotations, embedded images, XMP metadata, and even in the document title or author fields. A basic text extraction misses several of these layers. Both Presidio and the cloud APIs (Cloud DLP, Comprehend) primarily handle extracted text — you'll need supplementary processing for metadata stripping and embedded image analysis.

Multi-Language Evidence

If Kleros handles disputes across jurisdictions, evidence may be in multiple languages. Presidio requires separate spaCy/Stanza models per language. AWS Comprehend supports English and Spanish PII detection natively, which aligns with the current first use case. Cloud DLP handles the broadest range with automatic language detection across 60+ languages. Factor this into your Phase 2 decision.

Consent Wording Evolution

If you transition from Phase 1 (internal review) to Phase 3B (Cloud DLP or Comprehend), the consent language must be updated to disclose third-party cloud processing. Files uploaded under the original consent should not retroactively be processed through any cloud API without re-consent.

Cost Estimation for Cloud APIs

Google Cloud DLP charges per content item inspected. AWS Comprehend charges per 100-character unit with a 3-unit minimum per request. At low volumes (hundreds of files/month), both are minimal — typically under $50/month. At scale (thousands of files with large PDFs), costs can become significant, particularly with Comprehend's per-character model on long documents. Presidio's cost is purely the compute infrastructure you already run.
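Comprehend's unit math is easy to sanity-check per document. A sketch using the unit and minimum rules stated above (the per-unit rate is a placeholder, not a quoted AWS price; look up current pricing):

```typescript
// Comprehend bills per 100-character unit with a 3-unit minimum per request.
// PRICE_PER_UNIT is a placeholder for illustration, not an actual AWS rate.
const PRICE_PER_UNIT = 0.0001;

function comprehendUnits(charCount: number): number {
  return Math.max(3, Math.ceil(charCount / 100));
}

// Estimate monthly cost from the character counts of all processed documents.
function estimateComprehendCost(charCounts: number[]): number {
  const units = charCounts.reduce((sum, c) => sum + comprehendUnits(c), 0);
  return units * PRICE_PER_UNIT;
}
```

The 3-unit minimum means many tiny requests cost disproportionately more than a few batched ones, which is worth remembering for short evidence snippets.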


Open Questions