Phased architecture for PII anonymization of user-uploaded evidence files (PDF, images). From manual review at launch to automated pipeline at scale.
The webapp is in early stages with a small user base. The initial evidence files may not contain PII at all. The architecture is designed to ship fast, collect real data on PII prevalence, and add automated anonymization only when justified by actual usage patterns.
Each phase is designed so that the transition to the next phase requires no breaking changes to the upload flow, database schema, or frontend contract. The processing stage exists from day one — it simply does more work in later phases.
The user uploads evidence. A required consent checkbox grants Kleros permission to review and redact PII if necessary. The file passes through a no-op processing stage and is stored. Internally, staff review flagged files and manually redact using standard tools (PDF editor, image editor).
The checkbox is required (not optional) to submit. Suggested wording:
- processEvidence() as a no-op passthrough
- processing_stage and pii_categories fields added to the evidence schema

Before building any automation, you now have real data. This phase is a decision gate, not necessarily a code change. Based on what you've observed, take one of three paths:
| Observation | PII Rate | Action |
|---|---|---|
| Almost no files contain PII | < 5% | Stay in Phase 1. Manual review is sustainable. Revisit quarterly. |
| Some files contain PII, mostly predictable patterns | 5–30% | Implement Phase 3 with Presidio (self-hosted). The PII types are known and the volume justifies automation but not cloud costs. |
| Many files, diverse and complex PII types | > 30% | Evaluate Phase 3B with a cloud API — Google Cloud DLP (broadest entity coverage) or AWS Comprehend (native Spanish PII support). Cost justified by volume. |
The no-op processEvidence() is replaced with a call to a self-hosted Presidio microservice.
No data leaves your infrastructure. Custom recognizers are added for entity types specific to your evidence.
For image evidence, add the Presidio Image Redactor container for face/text detection in images.
Pros:

- No data leaves your infrastructure
- No API costs or usage limits
- MIT license, fully free
- Highly customizable recognizers
- Predictable latency (no network round-trip)
- Can run air-gapped if needed

Cons:

- Lower out-of-box accuracy vs. cloud APIs
- ~20–30 built-in recognizers (vs. 150+)
- You maintain the infrastructure
- NER model tuning needed for best results
- Image redaction is less mature
- Multi-language support requires separate models
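The Presidio integration itself is small. A sketch against the default analyzer/anonymizer REST containers, which expose POST /analyze and /anonymize endpoints; the service URLs, env variable names, and the blanket [REDACTED] replacement are assumptions about your deployment, not part of the plan above:

```typescript
// Assumed service locations (Presidio docs map host ports 5002/5001 to the
// analyzer and anonymizer containers; adjust to your deployment).
const ANALYZER = process.env.PRESIDIO_ANALYZER ?? "http://localhost:5002";
const ANONYMIZER = process.env.PRESIDIO_ANONYMIZER ?? "http://localhost:5001";

interface AnalyzerResult {
  entity_type: string;
  start: number;
  end: number;
  score: number;
}

// Builds the /anonymize payload from analyzer output. Kept pure so it can be
// unit-tested without running the containers.
function buildAnonymizeRequest(text: string, results: AnalyzerResult[]) {
  return {
    text,
    analyzer_results: results,
    // Replace every detected entity; per-entity anonymizers can be added later.
    anonymizers: { DEFAULT: { type: "replace", new_value: "[REDACTED]" } },
  };
}

async function anonymizeText(text: string): Promise<string> {
  const results: AnalyzerResult[] = await fetch(`${ANALYZER}/analyze`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text, language: "en" }),
  }).then((r) => r.json());

  const anonymized = await fetch(`${ANONYMIZER}/anonymize`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildAnonymizeRequest(text, results)),
  }).then((r) => r.json());

  return anonymized.text; // anonymizer responds with { text, items }
}
```

Keeping the payload construction separate from the HTTP calls also makes it easy to swap in custom recognizer results later without touching the transport code.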
The same processEvidence() integration point is used, but it calls a cloud NLP API
instead of local processing. The original file is sent to the cloud provider, analyzed, de-identified,
and only the redacted version is stored. Two viable options:
| Dimension | Google Cloud DLP | AWS Comprehend |
|---|---|---|
| PII language support | 60+ languages, auto-detection | English + Spanish (first-class) |
| Entity types | 150+ infoTypes globally | 22 universal + 14 country-specific (US, UK, CA, IN) |
| Best for our use case | Maximum entity coverage, multi-jurisdiction | Spanish-first evidence, simpler API surface |
| API complexity | Higher — inspection templates, deidentify configs in JSON | Simpler — single DetectPiiEntities call |
| Redaction mode | Real-time deidentify with multiple techniques (masking, tokenization, date-shifting) | Async batch job required for redaction; real-time for detection only |
| Structured data | Yes — BigQuery, Cloud Storage, databases | Text only (no native PDF; text extraction required) |
| S3/Object Lambda integration | No (GCS native) | Yes — S3 Object Lambda for automatic PII gating |
| Custom entity types | Custom infoTypes via regex/dictionary | Custom entity recognizers (requires training) |
| Pricing model | Per content item inspected | Per 100-character unit (3 unit minimum per request) |
| Cost at low volume | ~$10–50/month | ~$5–30/month |
| Infrastructure | GCP project required | AWS account required |
| Data jurisdiction | EU regions available | EU regions available (eu-west-1, eu-central-1) |
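Whichever provider is chosen, applying the detections to text is a local, testable step. A sketch that consumes Comprehend-style offset entities ({ Type, BeginOffset, EndOffset }, the shape DetectPiiEntities returns); Cloud DLP findings can be mapped into the same shape. The token mapping is an assumption for illustration:

```typescript
// Entity shape mirroring Comprehend's DetectPiiEntities response items.
interface PiiEntity {
  Type: string;
  BeginOffset: number;
  EndOffset: number;
}

// Assumed mapping from provider entity types to placeholder tokens;
// extend per the provider's full type list.
const TOKEN_MAP: Record<string, string> = {
  NAME: "[PERSON_NAME]",
  EMAIL: "[EMAIL]",
  PHONE: "[PHONE]",
  ADDRESS: "[ADDRESS]",
  SSN: "[NATIONAL_ID]",
  BANK_ACCOUNT_NUMBER: "[FINANCIAL_ACCT]",
};

function redactText(text: string, entities: PiiEntity[]): string {
  // Replace right-to-left so earlier offsets remain valid after substitution.
  const sorted = [...entities].sort((a, b) => b.BeginOffset - a.BeginOffset);
  let out = text;
  for (const e of sorted) {
    const token = TOKEN_MAP[e.Type] ?? "[REDACTED]";
    out = out.slice(0, e.BeginOffset) + token + out.slice(e.EndOffset);
  }
  return out;
}
```

Because this step is provider-agnostic, the same function serves both cloud options and keeps the stored output format identical across them.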
Pros:

- No infrastructure to maintain
- Managed, scalable, pay-per-use
- Well-calibrated confidence scores
- SOC 2 / ISO certified providers
- EU region deployment available

Cons:

- Raw PII sent to third-party cloud
- Vendor dependency (GCP or AWS)
- Adds network latency to upload flow
- Less customizable than Presidio
- Consent must disclose cloud processing
Regardless of which pipeline produces the anonymized file, use a consistent redaction format. This ensures downstream consumers (jurors, UI, audit logs) don't need to know which pipeline was used.
| PII Category | Placeholder Token | Example |
|---|---|---|
| Person name | [PERSON_NAME] | "The claimant [PERSON_NAME] filed on..." |
| Email address | [EMAIL] | "Contact at [EMAIL]" |
| Phone number | [PHONE] | "Reached at [PHONE]" |
| Physical address | [ADDRESS] | "Located at [ADDRESS]" |
| National ID / SSN | [NATIONAL_ID] | "ID number [NATIONAL_ID]" |
| Financial account | [FINANCIAL_ACCT] | "Bank account [FINANCIAL_ACCT]" |
| Face in image | blurred region | Gaussian blur applied over detected face bounding box |
Stored alongside each anonymized file. Records what was redacted (categories, not values) so arbitrators can assess evidence completeness without seeing the original PII.
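A minimal sketch of such a manifest builder; the field names (evidenceId, processingStage) are hypothetical, and the key property is that it records category counts, never the redacted values:

```typescript
// Hypothetical manifest shape: categories and counts only, no PII values.
interface RedactionManifest {
  evidenceId: string;
  processingStage: string; // which pipeline produced the file
  categories: Record<string, number>; // e.g. { EMAIL: 2, PHONE: 1 }
  redactedAt: string; // ISO 8601 timestamp
}

function buildManifest(
  evidenceId: string,
  stage: string,
  detected: string[], // one entry per redaction, e.g. ["EMAIL", "EMAIL", "PHONE"]
): RedactionManifest {
  const categories: Record<string, number> = {};
  for (const c of detected) categories[c] = (categories[c] ?? 0) + 1;
  return {
    evidenceId,
    processingStage: stage,
    categories,
    redactedAt: new Date().toISOString(),
  };
}
```

Because the manifest only names categories, it can be shown to jurors and written to audit logs without itself becoming a PII leak.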
This is the single integration point that evolves across phases. The function signature and return type stay constant — only the implementation changes.
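A minimal sketch of that stable contract, with hypothetical TypeScript names (the fields mirror the processing_stage and pii_categories schema additions from Phase 1); only the function body changes in later phases:

```typescript
type PiiCategory =
  | "PERSON_NAME"
  | "EMAIL"
  | "PHONE"
  | "ADDRESS"
  | "NATIONAL_ID"
  | "FINANCIAL_ACCT";

interface ProcessedEvidence {
  bytes: Uint8Array; // file content; unchanged in Phase 1
  processingStage: "none" | "presidio" | "cloud_dlp" | "comprehend";
  piiCategories: PiiCategory[]; // empty in Phase 1; staff track PII out of band
}

// Async from day one so swapping in a network-backed pipeline later
// requires no change to callers.
async function processEvidence(bytes: Uint8Array): Promise<ProcessedEvidence> {
  // Phase 1: no-op passthrough.
  return { bytes, processingStage: "none", piiCategories: [] };
}
```

Making the function async up front is the one decision that matters here: Phases 3 and 3B both involve I/O, and retrofitting async onto a sync call chain is exactly the kind of breaking change the phased design is meant to avoid.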
In a legal/arbitration context, redacted evidence can be challenged. Always store the SHA-256 hash of the original file at upload time, before any processing. The redacted version gets its own hash. This creates a verifiable chain: "this redacted file was derived from this original, which the user uploaded at this timestamp with this consent."
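A sketch of that chain record using Node's built-in crypto module; the record shape and the consentVersion field are assumptions about how you track consent text versions:

```typescript
import { createHash } from "node:crypto";

function sha256Hex(bytes: Uint8Array): string {
  return createHash("sha256").update(bytes).digest("hex");
}

// Assumed record linking the stored redacted file back to the original upload.
interface EvidenceHashChain {
  originalSha256: string; // computed at upload time, before any processing
  redactedSha256: string; // hash of the stored, redacted version
  uploadedAt: string; // ISO 8601 timestamp
  consentVersion: string; // which consent text was in force at upload
}

function buildHashChain(
  original: Uint8Array,
  redacted: Uint8Array,
  consentVersion: string,
): EvidenceHashChain {
  return {
    originalSha256: sha256Hex(original),
    redactedSha256: sha256Hex(redacted),
    uploadedAt: new Date().toISOString(),
    consentVersion,
  };
}
```

Recording the consent version alongside the hashes also supports the Phase 1 → 3B transition: it makes "uploaded under the original consent" a queryable property rather than a guess.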
Consider whether you need an encrypted escrow of the original — accessible only under specific conditions (e.g., multi-sig from arbitrators) — for cases where redactions are disputed.
PDFs are deceptively complex. PII can live in text content streams, form fields, annotations, embedded images, XMP metadata, and even in the document title or author fields. A basic text extraction misses several of these layers. Both Presidio and the cloud APIs (Cloud DLP, Comprehend) primarily handle extracted text — you'll need supplementary processing for metadata stripping and embedded image analysis.
If Kleros handles disputes across jurisdictions, evidence may be in multiple languages. Presidio requires separate spaCy/Stanza models per language. AWS Comprehend supports English and Spanish PII detection natively, which aligns with the current first use case. Cloud DLP handles the broadest range with automatic language detection across 60+ languages. Factor this into your Phase 2 decision.
If you transition from Phase 1 (internal review) to Phase 3B (Cloud DLP or Comprehend), the consent language must be updated to disclose third-party cloud processing. Files uploaded under the original consent should not retroactively be processed through any cloud API without re-consent.
Google Cloud DLP charges per content item inspected. AWS Comprehend charges per 100-character unit with a 3-unit minimum per request. At low volumes (hundreds of files/month), both are minimal — typically under $50/month. At scale (thousands of files with large PDFs), costs can become significant, particularly with Comprehend's per-character model on long documents. Presidio's cost is purely the compute infrastructure you already run.
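The Comprehend unit arithmetic is worth making explicit, since the 3-unit minimum dominates for short texts while document length dominates for large PDFs. A sketch; the per-unit rate is a placeholder parameter, not current AWS pricing:

```typescript
// Comprehend bills per 100-character unit, with a 3-unit minimum per request.
function comprehendUnits(charCount: number): number {
  return Math.max(3, Math.ceil(charCount / 100));
}

// ratePerUnit is an assumption; check current AWS pricing before budgeting.
function comprehendCostUsd(charCount: number, ratePerUnit: number): number {
  return comprehendUnits(charCount) * ratePerUnit;
}

// A 120-character snippet still bills 3 units; a 50,000-character PDF bills 500.
```

This is why batching tiny snippets into one request is cheap insurance at low volume, and why long extracted PDF text is where the per-character model starts to hurt.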