PII Scanning¶
Automatically detect personally identifiable information in your synced data — email addresses, phone numbers, credit card numbers, and more.
Prerequisites
- At least one data source synced
- Presidio and spaCy are included in the `dango` package dependencies; no extra install needed
Quick Start¶
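1. Run a PII scan, for example `dango governance pii-report --source stripe`.
2. Review findings. Each finding shows the source, table, column, entity type, and confidence.
3. Mark false positives with `dango governance pii-set ... --status not_pii`.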
Detailed Steps¶
How PII Scanning Works¶
PII scanning uses Presidio (Microsoft's data protection SDK) with spaCy (an NLP library) to analyze text columns for patterns that look like personal information.
flowchart LR
A[Select table] --> B[Sample up to 100 distinct values per column]
B --> C[Presidio analyzes each value]
C --> D{Score >= 0.5?}
D -->|Yes| E[Apply heuristics]
D -->|No| F[Skip]
E --> G[Report finding]
For each string column in your data, Dango:
- Samples up to 100 distinct values from the column
- Runs each value through Presidio's NLP-based entity recognition
- Filters results by a confidence score threshold of 0.5
- Applies additional heuristics to reduce false positives
- Reports findings with entity type, confidence, and match count
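The steps above can be sketched in a few lines of Python. This is a minimal illustration, not Dango's implementation: `toy_analyzer` is a hypothetical stand-in for Presidio (a crude email regex with a fixed score), and `sample_distinct` and `scan_column` are names invented for this example. Only the 0.5 threshold and the 100-value sample size come from the text above.

```python
from __future__ import annotations

import re
from collections import Counter
from typing import Callable, Iterable, NamedTuple

SCORE_THRESHOLD = 0.5  # from the docs: results below 0.5 are discarded
SAMPLE_SIZE = 100      # from the docs: up to 100 distinct values per column


class Detection(NamedTuple):
    entity_type: str
    score: float


# Hypothetical stand-in for Presidio: a crude email pattern with a fixed score.
_EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def toy_analyzer(value: str) -> list[Detection]:
    return [Detection("EMAIL_ADDRESS", 1.0)] if _EMAIL.fullmatch(value) else []


def sample_distinct(values: Iterable[str], limit: int = SAMPLE_SIZE) -> list[str]:
    """Return up to `limit` distinct values, preserving first-seen order."""
    seen: dict[str, None] = {}
    for value in values:
        if value not in seen:
            seen[value] = None
            if len(seen) == limit:
                break
    return list(seen)


def scan_column(
    values: Iterable[str],
    analyzer: Callable[[str], list[Detection]] = toy_analyzer,
) -> dict:
    """Count sampled values per entity type that score at or above the threshold."""
    sample = sample_distinct(values)
    counts: Counter = Counter()
    for value in sample:
        # Count each entity type at most once per sampled value.
        counts.update(
            {d.entity_type for d in analyzer(value) if d.score >= SCORE_THRESHOLD}
        )
    return {"sampled": len(sample), "matches": dict(counts)}


column = ["a@example.com", "b@example.com", "a@example.com", "n/a"]
print(scan_column(column))  # {'sampled': 3, 'matches': {'EMAIL_ADDRESS': 2}}
```

The duplicate value is sampled only once, so the match count reflects distinct values rather than row counts.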
Detected Entity Types¶
Dango scans for 7 types of PII:
| Entity Type | Examples | Notes |
|---|---|---|
| `EMAIL_ADDRESS` | user@example.com | High precision, very few false positives |
| `PHONE_NUMBER` | +1-555-0123, (555) 867-5309 | May match formatted numbers |
| `CREDIT_CARD` | 4111-1111-1111-1111 | Validated with the Luhn algorithm |
| `US_SSN` | 123-45-6789 | US Social Security Numbers |
| `IP_ADDRESS` | 192.168.1.1, 2001:db8::1 | IPv4 and IPv6 |
| `IBAN_CODE` | DE89 3704 0044 0532 0130 00 | International Bank Account Numbers |
| `PERSON` | John Smith, Maria Garcia | Uses spaCy NER; higher false positive rate (see below) |
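For readers unfamiliar with the Luhn check mentioned in the table, here is a small standalone sketch of the checksum. The function name `luhn_valid` is made up for this example, and the code is not taken from Dango or Presidio; it only illustrates why a random 16-digit number is unlikely to be reported as a credit card.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the ASCII digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isascii() and ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


print(luhn_valid("4111-1111-1111-1111"))  # True
print(luhn_valid("4111-1111-1111-1112"))  # False
```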
PERSON entity has a higher false positive rate
The PERSON entity type uses spaCy's named entity recognition, which can flag product names, codes, or abbreviations as person names. Dango mitigates this with:
- 30% match ratio threshold: At least 30% of sampled values must match to flag a column
- Structured data heuristic: Columns with long values (avg > 100 chars) containing JSON/array delimiters are excluded — these are typically structured data, not names
Confidence Scoring¶
- Threshold: 0.5 (on a 0–1 scale). Results below this are discarded.
- PERSON entity: Requires at least 30% of sampled values to match (vs. any single match for other types). This keeps a handful of name-like values from flagging an entire column.
- Structured data filter: If a column's average value length exceeds 100 characters and values contain JSON/array delimiters (`[`, `]`, `{`, `\n`), the PERSON entity is suppressed for that column.
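The two PERSON rules can be expressed as a short decision function. This is a sketch under assumptions: the function names are hypothetical, and treating "values contain delimiters" as "any sampled value contains one" is a guess, since the text does not say how many values must contain a delimiter. The constants are the ones stated above.

```python
PERSON_MATCH_RATIO = 0.30                      # from the docs
LONG_VALUE_AVG_CHARS = 100                     # from the docs
STRUCTURED_DELIMITERS = ("[", "]", "{", "\n")  # from the docs


def looks_structured(values: list[str]) -> bool:
    """Long average length plus at least one delimiter (assumed: in any value)."""
    if not values:
        return False
    avg_len = sum(len(v) for v in values) / len(values)
    has_delims = any(d in v for v in values for d in STRUCTURED_DELIMITERS)
    return avg_len > LONG_VALUE_AVG_CHARS and has_delims


def should_flag(entity_type: str, values: list[str], match_count: int) -> bool:
    """Decide whether a column is reported for the given entity type."""
    if match_count == 0 or not values:
        return False
    if entity_type != "PERSON":
        return True  # any single match is enough for other types
    if looks_structured(values):
        return False  # PERSON is suppressed for structured data
    return match_count / len(values) >= PERSON_MATCH_RATIO


names = ["John Smith", "Maria Garcia", "SKU-123", "N/A"]
print(should_flag("PERSON", names, match_count=2))  # True (2 of 4 matched)
```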
Viewing Reports¶
Managing False Positives¶
If the scanner flags a column incorrectly, mark it as not_pii:
# Mark a column as not PII
dango governance pii-set stripe customer business_name --status not_pii \
--reason "Contains company names, not person names"
# Mark a column as confirmed PII
dango governance pii-set stripe customer email --status pii \
--reason "Customer email addresses"
# List all overrides
dango governance pii-list
Overrides are stored in .dango/pii-overrides.yml:
# .dango/pii-overrides.yml (auto-managed by CLI)
overrides:
  - source: stripe
    table: customer
    column: business_name
    status: not_pii
    set_by: admin
    reason: "Contains company names, not person names"
    updated_at: "2026-05-15T10:30:00+00:00"
Configuration Reference¶
pii-overrides.yml Format¶
overrides:
  - source: stripe              # Source name
    table: customer             # Table name within the source
    column: business_name       # Column name
    status: not_pii             # "pii" or "not_pii"
    set_by: admin               # Username who set the override
    reason: "Company names"     # Human-readable reason
    updated_at: "2026-05-15T10:30:00+00:00"
Override Fields¶
| Field | Type | Description |
|---|---|---|
| `source` | string | Source name (e.g., `stripe`) |
| `table` | string | Table name (e.g., `customer`) |
| `column` | string | Column name (e.g., `business_name`) |
| `status` | string | `pii` (confirmed PII) or `not_pii` (false positive) |
| `set_by` | string | Username of the person who created the override |
| `reason` | string | Explanation for the override |
| `updated_at` | datetime | When the override was last modified |
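To make the override fields concrete, the sketch below shows one way overrides keyed by source, table, and column could be applied to a list of findings. It is illustrative only: the `Override` dataclass mirrors the fields in the table, but `apply_overrides`, the shape of a finding, and the `confirmed` flag are assumptions made for this example, not Dango's internal API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Override:
    source: str
    table: str
    column: str
    status: str  # "pii" or "not_pii"
    set_by: str = ""
    reason: str = ""
    updated_at: str = ""


def apply_overrides(findings: list[dict], overrides: list[Override]) -> list[dict]:
    """Drop findings overridden as not_pii and mark those confirmed as pii."""
    by_column = {(o.source, o.table, o.column): o for o in overrides}
    kept = []
    for finding in findings:
        key = (finding["source"], finding["table"], finding["column"])
        override = by_column.get(key)
        if override is not None and override.status == "not_pii":
            continue  # known false positive: suppress it
        confirmed = override is not None and override.status == "pii"
        kept.append({**finding, "confirmed": confirmed})
    return kept


overrides = [
    Override("stripe", "customer", "business_name", "not_pii", reason="Company names"),
    Override("stripe", "customer", "email", "pii", reason="Customer email addresses"),
]
findings = [
    {"source": "stripe", "table": "customer", "column": "business_name", "entity_type": "PERSON"},
    {"source": "stripe", "table": "customer", "column": "email", "entity_type": "EMAIL_ADDRESS"},
]
for finding in apply_overrides(findings, overrides):
    print(finding["column"], finding["confirmed"])  # email True
```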
Presidio + spaCy internals
Dango uses Presidio's AnalyzerEngine with spaCy's en_core_web_sm language model for named entity recognition. The model is downloaded automatically the first time PII scanning runs.
- Engine initialization: Lazy — Presidio and spaCy are loaded on first scan, not at startup
- Language model: `en_core_web_sm` (English, small model), balanced between accuracy and speed
- Scan scope: Only string/text columns are scanned (VARCHAR, TEXT, STRING, CHAR, BPCHAR)
- Sample size: 100 distinct values per column — balances coverage with performance
- Caching: Scan results are cached in SQLite. Re-running a scan for the same source/table returns cached results unless new data has been synced.
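The caching behaviour described above can be sketched with the standard `sqlite3` module. Everything about the schema here is hypothetical: the table name `pii_scan_cache`, its columns, and the use of a sync timestamp to detect new data are assumptions chosen for illustration, not Dango's actual cache layout.

```python
import json
import sqlite3


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a cache database with a hypothetical schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS pii_scan_cache (
               source TEXT NOT NULL,
               table_name TEXT NOT NULL,
               synced_at TEXT NOT NULL,
               findings TEXT NOT NULL,
               PRIMARY KEY (source, table_name)
           )"""
    )
    return conn


def cached_scan(conn, source, table, synced_at, scan):
    """Return cached findings unless the table was synced after they were stored."""
    row = conn.execute(
        "SELECT synced_at, findings FROM pii_scan_cache"
        " WHERE source = ? AND table_name = ?",
        (source, table),
    ).fetchone()
    if row is not None and row[0] == synced_at:
        return json.loads(row[1])  # cache hit: no new sync since last scan
    findings = scan(source, table)  # cache miss or stale entry: rescan
    conn.execute(
        "INSERT OR REPLACE INTO pii_scan_cache VALUES (?, ?, ?, ?)",
        (source, table, synced_at, json.dumps(findings)),
    )
    conn.commit()
    return findings
```

Calling `cached_scan` twice with the same `synced_at` value runs the `scan` callable once; passing a newer timestamp triggers a fresh scan and replaces the stored entry.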
Verification¶
After running a PII scan, verify the results:
# Run a scan for a specific source
dango governance pii-report --source stripe
# Check that overrides are applied
dango governance pii-list
Verify that:
- High-confidence findings (>0.9) on columns like `email` and `phone` are correctly flagged
- Low-confidence findings on non-PII columns are marked as false positives
- Overrides correctly suppress known false positives in subsequent scans
Troubleshooting¶
Too many false positives
- Use `dango governance pii-set` to mark false positives as `not_pii`
- The `PERSON` entity type is the most common source of false positives; it flags product names, codes, and abbreviations
- Columns with structured data (JSON, arrays) can trigger false PERSON matches; the structured data heuristic should catch most of these
PERSON entity noise
- The 30% match ratio threshold filters out columns where only a few values look like names
- If a column consistently produces false PERSON matches, mark it with `pii-set ... not_pii`
- Consider whether the column actually contains person names embedded in other text
spaCy model not found
- The `en_core_web_sm` model is downloaded automatically on first use
- If the download fails (e.g., no internet), run it manually: `python -m spacy download en_core_web_sm`
- Verify the installation: `python -c "import spacy; spacy.load('en_core_web_sm')"`
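If you want to check for the model from a script before kicking off a scan, a lightweight option is to test whether the model package is importable. spaCy models installed with `spacy download` are regular Python packages, so the standard library can look them up without loading spaCy itself. The helper name `spacy_model_installed` is made up for this example.

```python
import importlib.util


def spacy_model_installed(name: str = "en_core_web_sm") -> bool:
    """Return True if the spaCy model package can be found on the import path."""
    return importlib.util.find_spec(name) is not None


if not spacy_model_installed():
    print("Model missing; run: python -m spacy download en_core_web_sm")
```

This only confirms the package is present. Loading it with `spacy.load` remains the definitive check.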
Scan taking too long
- PII scanning samples up to 100 distinct values per column, so a single table should take no more than a few seconds
- If a source has many tables with many string columns, scans can take longer
- Use `--source` to scan a specific source instead of all sources at once
Next Steps¶
- Schema Drift: another governance feature that protects data quality
- Webhook Notifications: `pii_detected` event type for automated alerts
- Data Catalog: PII flags are visible in the catalog's column metadata