Query Anonymization

Privacy-preserving query logging with Blake3 cryptographic fingerprinting for compliance without sacrificing observability.

Why Anonymize Queries?

The Privacy Problem

Raw query logging exposes sensitive data:

-- Raw query (PII exposed!)
SELECT * FROM users WHERE email = 'alice@example.com' AND ssn = '123-45-6789'

Depending on the data involved, logging raw queries like this can conflict with:

  • GDPR (personal data)
  • HIPAA (protected health information)
  • PCI DSS (payment card data)
  • SOC 2 (customer data protection)

The Solution: Anonymization

-- Anonymized query (safe to log)
SELECT * FROM users WHERE email = ? AND ssn = ?

Benefits

  • No PII in logs
  • Compliance-friendly
  • Still useful for query analysis
  • Preserves query structure

How It Works

Original Query
      ↓
Parse SQL (sqlparser crate)
      ↓
Extract literal values
      ↓
Generate fingerprints (Blake3 hash)
      ↓
Replace literals with placeholders
      ↓
Anonymized Query + Fingerprints

Example Processing

Input:

SELECT * FROM orders WHERE user_id = 12345 AND status = 'completed'

Output:

{
  "normalized_query": "SELECT * FROM orders WHERE user_id = ? AND status = ?",
  "value_fingerprints": [
    "blake3:abc123...",  // 12345
    "blake3:def456..."   // 'completed'
  ]
}
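
As a rough illustration of the pipeline above, the following Python sketch reproduces this example. It is not ScryData's implementation: a regular expression stands in for the sqlparser-based parsing step, BLAKE2b from Python's standard library stands in for Blake3 (which the standard library does not provide), and the names anonymize and fingerprint are hypothetical. Hashing the literal's source text, quotes included, is also an assumption.

```python
import hashlib
import re

# Matches single-quoted strings (with '' escapes), integers and decimals,
# and the keywords TRUE / FALSE / NULL. Illustrative only: a regex is not
# a SQL parser and will mishandle comments, quoted identifiers, and more.
LITERAL = re.compile(
    r"'(?:[^']|'')*'"
    r"|\b\d+(?:\.\d+)?\b"
    r"|\b(?:TRUE|FALSE|NULL)\b",
    re.IGNORECASE,
)


def fingerprint(literal: str) -> str:
    # Stand-in hash (assumption): BLAKE2b with a 32-byte digest, which
    # gives 64 hex characters. The prefix names the algorithm actually used.
    digest = hashlib.blake2b(literal.encode("utf-8"), digest_size=32).hexdigest()
    return f"blake2b:{digest}"


def anonymize(query: str) -> dict:
    fingerprints = []

    def replace(match):
        # Record a fingerprint for each literal, in order of appearance,
        # then swap the literal for a placeholder.
        fingerprints.append(fingerprint(match.group(0)))
        return "?"

    normalized = LITERAL.sub(replace, query)
    return {"normalized_query": normalized, "value_fingerprints": fingerprints}


result = anonymize(
    "SELECT * FROM orders WHERE user_id = 12345 AND status = 'completed'"
)
print(result["normalized_query"])
# SELECT * FROM orders WHERE user_id = ? AND status = ?
```

The fingerprints come back in the same order as the placeholders, so the first entry corresponds to the first ? in the normalized query.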

Blake3 Fingerprinting

ScryData uses Blake3 for fast, secure hashing:

  • Fast: ~1 GB/s hashing throughput
  • Secure: cryptographic hash function
  • Deterministic: the same value always produces the same fingerprint
  • Collision-resistant: different values produce different fingerprints, apart from a negligible chance of collision

Fingerprint Format

blake3:0123456789abcdef...
│      └─ Hex-encoded hash (64 chars)
└─ Algorithm prefix
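
A consumer of these logs may want to validate fingerprints before storing them. The sketch below assumes the layout shown above (an algorithm prefix, a colon, then 64 hex characters) and additionally assumes lowercase hex; parse_fingerprint is a hypothetical helper, not part of ScryData.

```python
import re
from typing import Tuple

# Assumed format, from the layout above: "<algorithm>:<64 hex chars>".
# Lowercase-only hex is an assumption of this sketch.
FINGERPRINT = re.compile(r"(?P<algorithm>[a-z0-9]+):(?P<digest>[0-9a-f]{64})")


def parse_fingerprint(text: str) -> Tuple[str, bytes]:
    """Split a fingerprint into its algorithm name and raw digest bytes."""
    match = FINGERPRINT.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid fingerprint: {text!r}")
    return match.group("algorithm"), bytes.fromhex(match.group("digest"))


algorithm, digest = parse_fingerprint("blake3:" + "ab" * 32)
print(algorithm, len(digest))  # blake3 32
```

64 hex characters decode to 32 bytes, which matches Blake3's default output length.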

Why Fingerprints?

Fingerprints enable privacy-preserving analytics. The same value always produces the same fingerprint, so you can:

  • Track access patterns without seeing actual values
  • Identify hot data (frequently accessed fingerprints)
  • Detect anomalies (sudden spikes in specific fingerprints)
  • Maintain compliance while preserving observability
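
To make the idea concrete, here is a minimal sketch of counting accesses by fingerprint alone. The observed values are invented for illustration, and BLAKE2b from Python's standard library stands in for Blake3; only the fingerprints would ever reach the log.

```python
import hashlib
from collections import Counter


def fingerprint(value: str) -> str:
    # Stand-in for Blake3 (assumption): BLAKE2b with a 32-byte digest.
    return "blake2b:" + hashlib.blake2b(value.encode("utf-8"), digest_size=32).hexdigest()


# Hypothetical literal values seen at the logging boundary.
observed = ["alice@example.com", "bob@example.com", "alice@example.com"]

# What the log stores: fingerprints only, never the raw values.
logged = [fingerprint(value) for value in observed]

# Analysis works on fingerprints: equal values collapse to the same key.
counts = Counter(logged)
hottest, hits = counts.most_common(1)[0]
print(hits)  # 2
```

The analyst learns that one value was accessed twice without learning what that value is.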

Hot Data Detection

ScryData tracks frequently accessed fingerprints using a Count-Min Sketch combined with a top-K heap:

curl http://localhost:9090/debug/hot_data
{
  "top_k": [
    {"fingerprint": "blake3:a1b2c3d4...", "access_count": 15234},
    {"fingerprint": "blake3:f6e5d4c3...", "access_count": 8901}
  ]
}

Use Cases:

  • Cache Optimization: Cache hot fingerprint results
  • Performance Tuning: Optimize queries accessing hot data
  • Security: Detect unusual access patterns (for example, credential stuffing)
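
For readers unfamiliar with the technique, the sketch below shows one way a Count-Min Sketch and a top-K candidate set can fit together. It is an illustration under assumptions, not ScryData's code: the class names, sketch dimensions, and salted BLAKE2b row hashes are all hypothetical, and for brevity the candidates live in a small dict with heapq used only to order the results, where a production version would more likely maintain a min-heap. The short fingerprints in the usage lines are placeholders.

```python
import hashlib
import heapq


class CountMinSketch:
    """Approximate counter: never under-counts, may over-count on collisions."""

    def __init__(self, width: int = 1024, depth: int = 4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _slots(self, key: str):
        # One hash per row, derived by salting BLAKE2b with the row index.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                key.encode("utf-8"), digest_size=8, salt=row.to_bytes(2, "big")
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, key: str) -> int:
        for row, col in self._slots(key):
            self.rows[row][col] += 1
        return self.estimate(key)

    def estimate(self, key: str) -> int:
        # The smallest counter across rows is the least polluted estimate.
        return min(self.rows[row][col] for row, col in self._slots(key))


class HotDataTracker:
    def __init__(self, k: int = 10):
        self.k = k
        self.sketch = CountMinSketch()
        self.candidates = {}  # fingerprint -> latest estimated count

    def record(self, fp: str) -> None:
        count = self.sketch.add(fp)
        if fp in self.candidates or len(self.candidates) < self.k:
            self.candidates[fp] = count
            return
        # Candidate set is full: evict the coldest entry if the newcomer is hotter.
        coldest = min(self.candidates, key=self.candidates.get)
        if count > self.candidates[coldest]:
            del self.candidates[coldest]
            self.candidates[fp] = count

    def top_k(self):
        ranked = heapq.nlargest(self.k, self.candidates.items(), key=lambda item: item[1])
        return [{"fingerprint": fp, "access_count": n} for fp, n in ranked]


tracker = HotDataTracker(k=2)
for fp in ["blake3:aa"] * 5 + ["blake3:bb"] * 3 + ["blake3:cc"]:
    tracker.record(fp)
print(tracker.top_k())
```

The sketch keeps memory fixed regardless of how many distinct fingerprints appear, at the cost of counts that are upper bounds, not exact values.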

Supported SQL

Statements and Clauses

  • SELECT, INSERT, UPDATE, DELETE
  • WHERE clauses, JOIN conditions
  • HAVING clauses, VALUES lists
  • Function arguments, subqueries

Literal Types

  • Numbers (integers, decimals)
  • Strings (single-quoted)
  • Booleans (TRUE, FALSE)
  • NULL

Preserved Elements

  • Table names (needed for query analysis)
  • Column names (needed for query analysis)
  • Function names, operators, SQL keywords

Examples

SELECT Query

-- Original
SELECT name, email FROM users WHERE id = 12345

-- Anonymized
SELECT name, email FROM users WHERE id = ?
-- Fingerprints: ["blake3:abc123..."]

INSERT Query

-- Original
INSERT INTO users (name, email, age) VALUES ('Alice', 'alice@example.com', 30)

-- Anonymized
INSERT INTO users (name, email, age) VALUES (?, ?, ?)
-- Fingerprints: ["blake3:aaa...", "blake3:bbb...", "blake3:ccc..."]

Complex Query

-- Original
SELECT o.id FROM orders o
JOIN users u ON o.user_id = u.id
WHERE o.status IN ('pending', 'processing')
  AND o.total > 100.00

-- Anonymized
SELECT o.id FROM orders o
JOIN users u ON o.user_id = u.id
WHERE o.status IN (?, ?)
  AND o.total > ?
-- Fingerprints: ["blake3:111...", "blake3:222...", "blake3:333..."]

Configuration

Config file:

[publisher]
anonymize = true  # Enable anonymization (default)

Environment variable:

export SCRY_PUBLISHER__ANONYMIZE=true

Development: Set anonymize = false to see actual query values while debugging. Raw values, including any PII, then appear in the logs, so keep this setting out of production.
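
The environment variable name appears to follow a prefix, section, double underscore, key pattern. The sketch below shows how such a lookup could be resolved; env_name and resolve_bool are hypothetical helpers, and both the precedence (environment over config file) and the accepted truthy strings are assumptions, not documented ScryData behavior.

```python
import os


def env_name(section: str, key: str, prefix: str = "SCRY") -> str:
    # [publisher] anonymize -> SCRY_PUBLISHER__ANONYMIZE
    return f"{prefix}_{section.upper()}__{key.upper()}"


def resolve_bool(file_config: dict, section: str, key: str,
                 default: bool, environ=os.environ) -> bool:
    # Assumption: an environment variable, when set, overrides the config file.
    raw = environ.get(env_name(section, key))
    if raw is not None:
        return raw.strip().lower() in {"1", "true", "yes", "on"}
    return bool(file_config.get(section, {}).get(key, default))


# Hypothetical parsed config file and environment.
config = {"publisher": {"anonymize": True}}
env = {"SCRY_PUBLISHER__ANONYMIZE": "false"}
print(resolve_bool(config, "publisher", "anonymize", default=True, environ=env))  # False
```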

Maintain Compliance Without Sacrificing Observability

ScryData's query anonymization keeps your logs compliant while preserving analysis capabilities.
