Phased architecture for PII anonymization of user-uploaded evidence files (PDF, images). From manual review at launch to automated pipeline at scale.
The webapp is in early stages with a small user base. The initial evidence files may not contain PII at all. The architecture is designed to ship fast, collect real data on PII prevalence, and add automated anonymization only when justified by actual usage patterns.
Each phase is designed so that the transition to the next phase requires no breaking changes to the upload flow, database schema, or frontend contract. The processing stage exists from day one — it simply does more work in later phases.
The user uploads evidence. A required consent checkbox grants Kleros permission to review and redact PII if necessary. The file passes through a no-op processing stage and is stored. Internally, staff review flagged files and manually redact using standard tools (PDF editor, image editor).
The checkbox is required (not optional) to submit. Suggested wording:
- processEvidence() as a no-op passthrough
- processing_stage and pii_categories fields added to the evidence schema

Before building any automation, you now have real data. This phase is a decision gate, not necessarily a code change. Based on what you've observed, take one of three paths:
| Observation | PII Rate | Action |
|---|---|---|
| Almost no files contain PII | < 5% | Stay in Phase 1. Manual review is sustainable. Revisit quarterly. |
| Some files contain PII, mostly predictable patterns | 5–30% | Implement Phase 3 with Presidio (self-hosted). The PII types are known and the volume justifies automation but not cloud costs. |
| Many files, diverse and complex PII types | > 30% | Evaluate Phase 3B with a cloud API — Google Cloud DLP (broadest entity coverage) or AWS Comprehend (native Spanish PII support). Cost justified by volume. |
The no-op processEvidence() is replaced with a call to a self-hosted Presidio microservice.
No data leaves your infrastructure. Custom recognizers are added for entity types specific to your evidence.
For image evidence, add the Presidio Image Redactor container for face/text detection in images.
Pros:

- No data leaves your infrastructure
- No API costs or usage limits
- MIT license, fully free
- Highly customizable recognizers
- Predictable latency (no network round-trip)
- Can run air-gapped if needed

Cons:

- Lower out-of-box accuracy vs. cloud APIs
- ~20–30 built-in recognizers (vs. 150+)
- You maintain the infrastructure
- NER model tuning needed for best results
- Image redaction is less mature
- Multi-language support requires separate models
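The Presidio integration itself is small. A sketch against the default analyzer/anonymizer REST containers, which expose POST /analyze and /anonymize endpoints; the service URLs, env variable names, and the blanket [REDACTED] replacement are assumptions about your deployment, not part of the plan above:

```typescript
// Assumed service locations (Presidio docs map host ports 5002/5001 to the
// analyzer and anonymizer containers; adjust to your deployment).
const ANALYZER = process.env.PRESIDIO_ANALYZER ?? "http://localhost:5002";
const ANONYMIZER = process.env.PRESIDIO_ANONYMIZER ?? "http://localhost:5001";

interface AnalyzerResult {
  entity_type: string;
  start: number;
  end: number;
  score: number;
}

// Builds the /anonymize payload from analyzer output. Kept pure so it can be
// unit-tested without running the containers.
function buildAnonymizeRequest(text: string, results: AnalyzerResult[]) {
  return {
    text,
    analyzer_results: results,
    // Replace every detected entity; per-entity anonymizers can be added later.
    anonymizers: { DEFAULT: { type: "replace", new_value: "[REDACTED]" } },
  };
}

async function anonymizeText(text: string): Promise<string> {
  const results: AnalyzerResult[] = await fetch(`${ANALYZER}/analyze`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text, language: "en" }),
  }).then((r) => r.json());

  const anonymized = await fetch(`${ANONYMIZER}/anonymize`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildAnonymizeRequest(text, results)),
  }).then((r) => r.json());

  return anonymized.text; // anonymizer responds with { text, items }
}
```

Keeping the payload construction separate from the HTTP calls also makes it easy to swap in custom recognizer results later without touching the transport code.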
The same processEvidence() integration point is used, but it calls a cloud NLP API
instead of local processing. The original file is sent to the cloud provider, analyzed, de-identified,
and only the redacted version is stored. Two viable options:
| Dimension | Google Cloud DLP | AWS Comprehend |
|---|---|---|
| PII language support | 60+ languages, auto-detection | English + Spanish (first-class) |
| Entity types | 150+ infoTypes globally | 22 universal + 14 country-specific (US, UK, CA, IN) |
| Best for our use case | Maximum entity coverage, multi-jurisdiction | Spanish-first evidence, simpler API surface |
| API complexity | Higher — inspection templates, deidentify configs in JSON | Simpler — single DetectPiiEntities call |
| Redaction mode | Real-time deidentify with multiple techniques (masking, tokenization, date-shifting) | Async batch job required for redaction; real-time for detection only |
| Structured data | Yes — BigQuery, Cloud Storage, databases | Text only (no native PDF; text extraction required) |
| S3/Object Lambda integration | No (GCS native) | Yes — S3 Object Lambda for automatic PII gating |
| Custom entity types | Custom infoTypes via regex/dictionary | Custom entity recognizers (requires training) |
| Pricing model | Per content item inspected | Per 100-character unit (3 unit minimum per request) |
| Cost at low volume | ~$10–50/month | ~$5–30/month |
| Infrastructure | GCP project required | AWS account required |
| Data jurisdiction | EU regions available | EU regions available (eu-west-1, eu-central-1) |
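Whichever provider is chosen, applying the detections to text is a local, testable step. A sketch that consumes Comprehend-style offset entities ({ Type, BeginOffset, EndOffset }, the shape DetectPiiEntities returns); Cloud DLP findings can be mapped into the same shape. The token mapping is an assumption for illustration:

```typescript
// Entity shape mirroring Comprehend's DetectPiiEntities response items.
interface PiiEntity {
  Type: string;
  BeginOffset: number;
  EndOffset: number;
}

// Assumed mapping from provider entity types to placeholder tokens;
// extend per the provider's full type list.
const TOKEN_MAP: Record<string, string> = {
  NAME: "[PERSON_NAME]",
  EMAIL: "[EMAIL]",
  PHONE: "[PHONE]",
  ADDRESS: "[ADDRESS]",
  SSN: "[NATIONAL_ID]",
  BANK_ACCOUNT_NUMBER: "[FINANCIAL_ACCT]",
};

function redactText(text: string, entities: PiiEntity[]): string {
  // Replace right-to-left so earlier offsets remain valid after substitution.
  const sorted = [...entities].sort((a, b) => b.BeginOffset - a.BeginOffset);
  let out = text;
  for (const e of sorted) {
    const token = TOKEN_MAP[e.Type] ?? "[REDACTED]";
    out = out.slice(0, e.BeginOffset) + token + out.slice(e.EndOffset);
  }
  return out;
}
```

Because this step is provider-agnostic, the same function serves both cloud options and keeps the stored output format identical across them.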
Pros:

- No infrastructure to maintain
- Managed, scalable, pay-per-use
- Well-calibrated confidence scores
- SOC 2 / ISO certified providers
- EU region deployment available

Cons:

- Raw PII sent to third-party cloud
- Vendor dependency (GCP or AWS)
- Adds network latency to upload flow
- Less customizable than Presidio
- Consent must disclose cloud processing
Regardless of which pipeline produces the anonymized file, use a consistent redaction format. This ensures downstream consumers (jurors, UI, audit logs) don't need to know which pipeline was used.
| PII Category | Placeholder Token | Example |
|---|---|---|
| Person name | [PERSON_NAME] | "The claimant [PERSON_NAME] filed on..." |
| Email address | [EMAIL] | "Contact at [EMAIL]" |
| Phone number | [PHONE] | "Reached at [PHONE]" |
| Physical address | [ADDRESS] | "Located at [ADDRESS]" |
| National ID / SSN | [NATIONAL_ID] | "ID number [NATIONAL_ID]" |
| Financial account | [FINANCIAL_ACCT] | "Bank account [FINANCIAL_ACCT]" |
| Face in image | blurred region | Gaussian blur applied over detected face bounding box |
Stored alongside each anonymized file. Records what was redacted (categories, not values) so arbitrators can assess evidence completeness without seeing the original PII.
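A minimal sketch of such a manifest builder; the field names (evidenceId, processingStage) are hypothetical, and the key property is that it records category counts, never the redacted values:

```typescript
// Hypothetical manifest shape: categories and counts only, no PII values.
interface RedactionManifest {
  evidenceId: string;
  processingStage: string; // which pipeline produced the file
  categories: Record<string, number>; // e.g. { EMAIL: 2, PHONE: 1 }
  redactedAt: string; // ISO 8601 timestamp
}

function buildManifest(
  evidenceId: string,
  stage: string,
  detected: string[], // one entry per redaction, e.g. ["EMAIL", "EMAIL", "PHONE"]
): RedactionManifest {
  const categories: Record<string, number> = {};
  for (const c of detected) categories[c] = (categories[c] ?? 0) + 1;
  return {
    evidenceId,
    processingStage: stage,
    categories,
    redactedAt: new Date().toISOString(),
  };
}
```

Because the manifest only names categories, it can be shown to jurors and written to audit logs without itself becoming a PII leak.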
This is the single integration point that evolves across phases. The function signature and return type stay constant — only the implementation changes.
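A minimal sketch of that stable contract, with hypothetical TypeScript names (the fields mirror the processing_stage and pii_categories schema additions from Phase 1); only the function body changes in later phases:

```typescript
type PiiCategory =
  | "PERSON_NAME"
  | "EMAIL"
  | "PHONE"
  | "ADDRESS"
  | "NATIONAL_ID"
  | "FINANCIAL_ACCT";

interface ProcessedEvidence {
  bytes: Uint8Array; // file content; unchanged in Phase 1
  processingStage: "none" | "presidio" | "cloud_dlp" | "comprehend";
  piiCategories: PiiCategory[]; // empty in Phase 1; staff track PII out of band
}

// Async from day one so swapping in a network-backed pipeline later
// requires no change to callers.
async function processEvidence(bytes: Uint8Array): Promise<ProcessedEvidence> {
  // Phase 1: no-op passthrough.
  return { bytes, processingStage: "none", piiCategories: [] };
}
```

Making the function async up front is the one decision that matters here: Phases 3 and 3B both involve I/O, and retrofitting async onto a sync call chain is exactly the kind of breaking change the phased design is meant to avoid.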
In a legal/arbitration context, redacted evidence can be challenged. Always store the SHA-256 hash of the original file at upload time, before any processing. The redacted version gets its own hash. This creates a verifiable chain: "this redacted file was derived from this original, which the user uploaded at this timestamp with this consent."
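A sketch of that chain record using Node's built-in crypto module; the record shape and the consentVersion field are assumptions about how you track consent text versions:

```typescript
import { createHash } from "node:crypto";

function sha256Hex(bytes: Uint8Array): string {
  return createHash("sha256").update(bytes).digest("hex");
}

// Assumed record linking the stored redacted file back to the original upload.
interface EvidenceHashChain {
  originalSha256: string; // computed at upload time, before any processing
  redactedSha256: string; // hash of the stored, redacted version
  uploadedAt: string; // ISO 8601 timestamp
  consentVersion: string; // which consent text was in force at upload
}

function buildHashChain(
  original: Uint8Array,
  redacted: Uint8Array,
  consentVersion: string,
): EvidenceHashChain {
  return {
    originalSha256: sha256Hex(original),
    redactedSha256: sha256Hex(redacted),
    uploadedAt: new Date().toISOString(),
    consentVersion,
  };
}
```

Recording the consent version alongside the hashes also supports the Phase 1 → 3B transition: it makes "uploaded under the original consent" a queryable property rather than a guess.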
Consider whether you need an encrypted escrow of the original — accessible only under specific conditions (e.g., multi-sig from arbitrators) — for cases where redactions are disputed.
PDFs are deceptively complex. PII can live in text content streams, form fields, annotations, embedded images, XMP metadata, and even in the document title or author fields. A basic text extraction misses several of these layers. Both Presidio and the cloud APIs (Cloud DLP, Comprehend) primarily handle extracted text — you'll need supplementary processing for metadata stripping and embedded image analysis.
If Kleros handles disputes across jurisdictions, evidence may be in multiple languages. Presidio requires separate spaCy/Stanza models per language. AWS Comprehend supports English and Spanish PII detection natively, which aligns with the current first use case. Cloud DLP handles the broadest range with automatic language detection across 60+ languages. Factor this into your Phase 2 decision.
If you transition from Phase 1 (internal review) to Phase 3B (Cloud DLP or Comprehend), the consent language must be updated to disclose third-party cloud processing. Files uploaded under the original consent should not retroactively be processed through any cloud API without re-consent.
Google Cloud DLP charges per content item inspected. AWS Comprehend charges per 100-character unit with a 3-unit minimum per request. At low volumes (hundreds of files/month), both are minimal — typically under $50/month. At scale (thousands of files with large PDFs), costs can become significant, particularly with Comprehend's per-character model on long documents. Presidio's cost is purely the compute infrastructure you already run.
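The Comprehend unit arithmetic is worth making explicit, since the 3-unit minimum dominates for short texts while document length dominates for large PDFs. A sketch; the per-unit rate is a placeholder parameter, not current AWS pricing:

```typescript
// Comprehend bills per 100-character unit, with a 3-unit minimum per request.
function comprehendUnits(charCount: number): number {
  return Math.max(3, Math.ceil(charCount / 100));
}

// ratePerUnit is an assumption; check current AWS pricing before budgeting.
function comprehendCostUsd(charCount: number, ratePerUnit: number): number {
  return comprehendUnits(charCount) * ratePerUnit;
}

// A 120-character snippet still bills 3 units; a 50,000-character PDF bills 500.
```

This is why batching tiny snippets into one request is cheap insurance at low volume, and why long extracted PDF text is where the per-character model starts to hurt.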