Why Anonymize Queries?
The Privacy Problem
Raw query logging exposes sensitive data:
-- Raw query (PII exposed!)
SELECT * FROM users WHERE email = 'alice@example.com' AND ssn = '123-45-6789'
Logging values like these can put you out of compliance with:
- GDPR (personal data)
- HIPAA (protected health information)
- PCI DSS (payment card data)
- SOC 2 (customer data protection)
The Solution: Anonymization
-- Anonymized query (safe to log)
SELECT * FROM users WHERE email = ? AND ssn = ?
Benefits
- No PII in logs
- Compliance-friendly
- Still useful for query analysis
- Preserves query structure
How It Works
Original Query
↓
Parse SQL (sqlparser crate)
↓
Extract literal values
↓
Generate fingerprints (Blake3 hash)
↓
Replace literals with placeholders
↓
Anonymized Query + Fingerprints
Example Processing
Input:
SELECT * FROM orders WHERE user_id = 12345 AND status = 'completed'
Output (fingerprints are listed in the order the literals appear: 12345, then 'completed'):
{
  "normalized_query": "SELECT * FROM orders WHERE user_id = ? AND status = ?",
  "value_fingerprints": [
    "blake3:abc123...",
    "blake3:def456..."
  ]
}
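The pipeline above can be sketched in a few lines of Python. This is a rough illustration, not ScryData's implementation: a regular expression stands in for the real SQL parser, the standard library's `hashlib.blake2b` stands in for Blake3 (which is not in the Python standard library), and the names `anonymize` and `fingerprint` are hypothetical.

```python
import hashlib
import re

# Regex stand-in for a SQL parser: finds string, numeric, boolean and NULL literals.
LITERAL_RE = re.compile(
    r"""
    '(?:[^']|'')*'              # single-quoted string; a doubled quote is an escape
    | \b\d+(?:\.\d+)?\b         # integer or decimal
    | \b(?:TRUE|FALSE|NULL)\b   # boolean and NULL keywords
    """,
    re.VERBOSE | re.IGNORECASE,
)


def fingerprint(value: str) -> str:
    # Assumption: blake2b replaces Blake3 here; digest_size=32 yields 64 hex chars.
    digest = hashlib.blake2b(value.encode("utf-8"), digest_size=32).hexdigest()
    return f"blake2b:{digest}"


def anonymize(query: str) -> dict:
    fingerprints = []

    def replace(match):
        # Record the fingerprint of each literal, in order of appearance.
        fingerprints.append(fingerprint(match.group(0)))
        return "?"

    normalized = LITERAL_RE.sub(replace, query)
    return {"normalized_query": normalized, "value_fingerprints": fingerprints}


result = anonymize("SELECT * FROM orders WHERE user_id = 12345 AND status = 'completed'")
print(result["normalized_query"])
# SELECT * FROM orders WHERE user_id = ? AND status = ?
```

A regex cannot handle comments, dollar-quoted strings, or dialect-specific syntax, which is why a real parser is the better choice in practice.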
Blake3 Fingerprinting
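ScryData uses Blake3, a cryptographic hash function, for fast, secure hashing:
- Throughput: ~1 GB/s hashing
- Secure: cryptographically safe
- Deterministic: the same value always yields the same fingerprint
- Collision-resistant: different values yield different fingerprints, barring a vanishingly unlikely hash collision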
Fingerprint Format
blake3:0123456789abcdef...
│      └─ Hex-encoded hash (64 chars)
└─ Algorithm prefix
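A consumer of these fingerprints might validate the format before storing them. The sketch below assumes the layout shown above (an algorithm prefix, a colon, then 64 hex characters) and additionally assumes lowercase hex; `parse_fingerprint` is a hypothetical helper, not part of ScryData.

```python
import re
from typing import Tuple

# Assumed format: "<algorithm>:<64 lowercase hex chars>".
FINGERPRINT_RE = re.compile(r"(?P<algorithm>[a-z0-9]+):(?P<digest>[0-9a-f]{64})")


def parse_fingerprint(text: str) -> Tuple[str, bytes]:
    """Split a fingerprint into its algorithm name and raw digest bytes."""
    match = FINGERPRINT_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid fingerprint: {text!r}")
    return match["algorithm"], bytes.fromhex(match["digest"])


algorithm, digest = parse_fingerprint("blake3:" + "ab" * 32)
print(algorithm, len(digest))  # blake3 32
```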
Why Fingerprints?
Fingerprints enable privacy-preserving analytics. The same value always produces the same fingerprint, so you can:
- Track access patterns without seeing actual values
- Identify hot data (frequently accessed fingerprints)
- Detect anomalies (sudden spikes in specific fingerprints)
- Maintain compliance while preserving observability
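Because fingerprints are deterministic, ordinary counting works on them. As a minimal sketch of the anomaly-detection idea, the function below compares fingerprint counts between two time windows and flags sudden spikes. The function name, the thresholds, and the placeholder labels are illustrative assumptions, not ScryData behavior.

```python
from collections import Counter


def find_spikes(previous, current, factor=5.0, min_count=10):
    """Return fingerprints seen at least `min_count` times in `current`
    and at least `factor` times more often than in `previous`."""
    before = Counter(previous)
    now = Counter(current)
    spikes = []
    for fp, count in now.items():
        baseline = max(before[fp], 1)  # treat unseen fingerprints as a baseline of 1
        if count >= min_count and count / baseline >= factor:
            spikes.append(fp)
    return sorted(spikes)


# Placeholder labels stand in for real fingerprint strings.
previous_window = ["fp_a"] * 20 + ["fp_b"] * 5
current_window = ["fp_a"] * 22 + ["fp_b"] * 60
print(find_spikes(previous_window, current_window))  # ['fp_b']
```

The analysis never touches the underlying values: it only needs equal inputs to map to equal fingerprints.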
Hot Data Detection
ScryData tracks frequently accessed fingerprints using a Count-Min Sketch plus a Top-K heap. Query the debug endpoint to see the current leaders:
curl http://localhost:9090/debug/hot_data
{
  "top_k": [
    {"fingerprint": "blake3:a1b2c3d4...", "access_count": 15234},
    {"fingerprint": "blake3:f6e5d4c3...", "access_count": 8901}
  ]
}
Use Cases:
- Cache Optimization: Cache results for hot fingerprints
- Performance Tuning: Optimize queries that access hot data
- Security: Detect unusual access patterns (e.g., credential stuffing)
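For readers unfamiliar with the technique, here is a small sketch of hot-data tracking with a Count-Min Sketch. It illustrates the general idea only: the class names, the sketch dimensions, and the use of a plain dict with `heapq.nlargest` in place of a maintained heap are all simplifying assumptions, and ScryData's internals may differ.

```python
import hashlib
import heapq


class CountMinSketch:
    """Fixed-size frequency estimator; it may overcount but never undercounts."""

    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _indexes(self, key):
        # One hash per row, derived by salting blake2b with the row number.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                key.encode("utf-8"), digest_size=8, salt=bytes([row])
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, key):
        """Increment the key's counters and return its new estimated count."""
        estimate = None
        for row, col in self._indexes(key):
            self.rows[row][col] += 1
            value = self.rows[row][col]
            estimate = value if estimate is None else min(estimate, value)
        return estimate


class HotDataTracker:
    def __init__(self, k=10):
        self.k = k
        self.sketch = CountMinSketch()
        self.top = {}  # fingerprint -> latest estimate, at most k entries

    def record(self, fp):
        estimate = self.sketch.add(fp)
        if fp in self.top or len(self.top) < self.k:
            self.top[fp] = estimate
            return
        coldest = min(self.top, key=self.top.get)
        if estimate > self.top[coldest]:
            # Evict the coldest candidate in favor of the hotter newcomer.
            del self.top[coldest]
            self.top[fp] = estimate

    def top_k(self):
        return heapq.nlargest(self.k, self.top.items(), key=lambda item: item[1])


tracker = HotDataTracker(k=2)
for fp in ["fp_hot"] * 50 + ["fp_warm"] * 20 + [f"fp_cold_{i}" for i in range(100)]:
    tracker.record(fp)
print([fp for fp, _ in tracker.top_k()])
```

The sketch uses constant memory regardless of how many distinct fingerprints pass through, at the cost of occasionally overestimating a count when hashes collide.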
Supported SQL
Statements
- SELECT, INSERT, UPDATE, DELETE
- WHERE clauses, JOIN conditions
- HAVING clauses, VALUES lists
- Function arguments, Subqueries
Literal Types
- Numbers (integers, decimals)
- Strings (single-quoted)
- Booleans (TRUE, FALSE)
- NULL
Preserved Elements
- Table names (needed for query analysis)
- Column names (needed for query analysis)
- Function names, operators, SQL keywords
Examples
SELECT Query
-- Original
SELECT name, email FROM users WHERE id = 12345
-- Anonymized
SELECT name, email FROM users WHERE id = ?
-- Fingerprints: ["blake3:abc123..."]
INSERT Query
-- Original
INSERT INTO users (name, email, age) VALUES ('Alice', 'alice@example.com', 30)
-- Anonymized
INSERT INTO users (name, email, age) VALUES (?, ?, ?)
-- Fingerprints: ["blake3:aaa...", "blake3:bbb...", "blake3:ccc..."]
Complex Query
-- Original
SELECT o.id FROM orders o
JOIN users u ON o.user_id = u.id
WHERE o.status IN ('pending', 'processing')
AND o.total > 100.00
-- Anonymized
SELECT o.id FROM orders o
JOIN users u ON o.user_id = u.id
WHERE o.status IN (?, ?)
AND o.total > ?
-- Fingerprints: ["blake3:111...", "blake3:222...", "blake3:333..."]
Configuration
[publisher]
anonymize = true # Enable anonymization (default)
Or set the equivalent environment variable:
export SCRY_PUBLISHER__ANONYMIZE=true
Development: Set anonymize = false to see actual query values during debugging. Raw values, including any PII, will then appear in logs, so keep this setting off in production.
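The variable name SCRY_PUBLISHER__ANONYMIZE suggests a common convention: a `SCRY_` prefix, with a double underscore separating the config section from the key. The sketch below illustrates that mapping. It is inferred from the variable name alone; the precedence of environment over file and the accepted boolean spellings are assumptions, and both function names are hypothetical.

```python
import os


def env_overrides(environ, prefix="SCRY_", separator="__"):
    """Map SCRY_PUBLISHER__ANONYMIZE=true to {"publisher": {"anonymize": "true"}}."""
    overrides = {}
    for name, value in environ.items():
        if not name.startswith(prefix):
            continue
        path = name[len(prefix):].lower().split(separator)
        node = overrides
        for part in path[:-1]:
            node = node.setdefault(part, {})
        node[path[-1]] = value
    return overrides


def anonymize_enabled(file_config, environ):
    # Assumption: an environment variable, when present, wins over the file.
    raw = env_overrides(environ).get("publisher", {}).get("anonymize")
    if raw is None:
        # Anonymization is on by default, as stated in the config comment.
        return bool(file_config.get("publisher", {}).get("anonymize", True))
    # Assumption: these spellings count as "true"; anything else is false.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


file_config = {"publisher": {"anonymize": True}}
print(anonymize_enabled(file_config, dict(os.environ)))
```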
Maintain Compliance Without Sacrificing Observability
ScryData's query anonymization keeps your logs compliant while preserving analysis capabilities.