A collaboration between Wordcab and Knowledgator to produce production-grade PII, PHI, and PCI detection for open use.
This model is fine-tuned from knowledgator/gliner-multitask-large-v0.5 for comprehensive PII detection across various domains.
# Using uv (recommended)
uv pip install gliner
# Using pip
pip install gliner

from gliner import GLiNER
model = GLiNER.from_pretrained("wordcab/wordcab-pii-detection-large-v0.3")
text = "John Smith called from 415-555-1234 to discuss his account number 12345678."
labels = ["name", "phone number", "account number"]
entities = model.predict_entities(text, labels, threshold=0.3)
for entity in entities:
    print(f"{entity['text']} => {entity['label']}")

- name - Full names
- name given - First/given names
- name family - Last/family names
- name medical professional - Healthcare provider names
- dob - Date of birth
- age - Age information
- gender - Gender identifiers
- marital status - Marital status

- email address - Email addresses
- phone number - Phone numbers
- location address - Street addresses
- location address street - Street names
- location city - City names
- location state - State/province names
- location country - Country names
- location zip - ZIP/postal codes
- location - General location references
- county - County names
- address - General address information
- zip - ZIP codes

- credit card - Credit card numbers
- credit card expiration - Card expiration dates
- cvv - CVV/security codes
- account number - Bank account numbers
- accounts - Account references
- ssn - Social Security Numbers
- pin - PIN codes
- money - Monetary amounts

- condition - Medical conditions
- medical process - Medical procedures
- test result - Medical test results
- organization medical facility - Healthcare facility names
- discharge date - Hospital discharge dates

- passport number - Passport numbers
- policy number - Insurance policy numbers
- confirmation number - Confirmation/reference numbers
- esidno - ESI numbers

- organization - Organization names
- occupation - Job titles/occupations
- date - General dates
- date interval - Date ranges
- time - Time references
- duration - Time durations
- month - Month references
- origin - Ethnic/national origin
- language - Language information
- physical attribute - Physical descriptions
- numerical pii - Other numerical identifiers
- password - Passwords
- filename - File names
- planduration - Plan durations
- rate - Rates/percentages
- number - General numbers
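
Because the label list is long, it can help to keep the labels grouped in code and pick only the groups a given document needs. The sketch below is one possible way to do that; the group names (`personal`, `contact_location`, and so on) and the `labels_for` helper are hypothetical conveniences introduced here, not part of the model or the gliner API.

```python
# Hypothetical grouping of the labels listed above. The group names are
# illustrative only; the model itself just receives a flat list of labels.
LABEL_GROUPS = {
    "personal": [
        "name", "name given", "name family", "name medical professional",
        "dob", "age", "gender", "marital status",
    ],
    "contact_location": [
        "email address", "phone number", "location address",
        "location address street", "location city", "location state",
        "location country", "location zip", "location", "county",
        "address", "zip",
    ],
    "financial": [
        "credit card", "credit card expiration", "cvv", "account number",
        "accounts", "ssn", "pin", "money",
    ],
    "medical": [
        "condition", "medical process", "test result",
        "organization medical facility", "discharge date",
    ],
    "identifiers": [
        "passport number", "policy number", "confirmation number", "esidno",
    ],
    "other": [
        "organization", "occupation", "date", "date interval", "time",
        "duration", "month", "origin", "language", "physical attribute",
        "numerical pii", "password", "filename", "planduration", "rate",
        "number",
    ],
}


def labels_for(*groups):
    """Return an order-preserving, de-duplicated label list for the given groups."""
    seen = {}
    for group in groups:
        for label in LABEL_GROUPS[group]:
            seen.setdefault(label, None)
    return list(seen)


# Example: labels for a clinical note (personal details plus medical labels).
phi_labels = labels_for("personal", "medical")
# entities = model.predict_entities(text, phi_labels, threshold=0.3)
```

The resulting list can be passed as the `labels` argument in the same way as the hand-written lists in the examples.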
text = "Please send the invoice to [email protected]"
labels = ["email address"]
entities = model.predict_entities(text, labels, threshold=0.3)
# Output: [email protected] => email address

text = "Patient Mary Johnson, DOB 01/15/1980, was discharged on March 10, 2024 from St. Mary's Hospital"
labels = ["name", "dob", "discharge date", "organization medical facility"]
entities = model.predict_entities(text, labels, threshold=0.3)
# Output:
# Mary Johnson => name
# 01/15/1980 => dob
# March 10, 2024 => discharge date
# St. Mary's Hospital => organization medical facility

text = "Card ending in 4532, CVV 123, expires 12/25. Account: 9876543210"
labels = ["credit card", "cvv", "credit card expiration", "account number"]
entities = model.predict_entities(text, labels, threshold=0.3)

# Pass a broad label list for comprehensive detection
# (the list below is a subset of the supported labels)
all_labels = [
    'name', 'name given', 'name family', 'name medical professional',
    'phone number', 'email address', 'ssn', 'credit card', 'cvv',
    'credit card expiration', 'location address', 'location city',
    'location state', 'location country', 'location zip', 'dob', 'age',
    'gender', 'account number', 'organization', 'occupation', 'passport number',
    'policy number', 'condition', 'medical process', 'organization medical facility'
]
text = "Your sensitive document text here..."
entities = model.predict_entities(text, all_labels, threshold=0.3)

threshold: Confidence threshold for entity detection (0.0-1.0). Default: 0.3
- Lower values (0.2-0.3): Higher recall, may include more false positives
- Higher values (0.5-0.7): Higher precision, may miss some entities
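
The trade-off above can also be explored after inference, since each entity returned by `predict_entities` carries a `score` field. The sketch below assumes the model was run once with a low threshold and then tightens the cut-off in plain Python; `filter_by_score` is a hypothetical helper, and the sample entities and their scores are made up for illustration rather than real model output.

```python
def filter_by_score(entities, min_score):
    """Keep only entities whose confidence score is at least min_score."""
    return [e for e in entities if e["score"] >= min_score]


# Made-up entities shaped like predict_entities output (real results also
# include "start" and "end" character offsets). The scores are illustrative.
sample_entities = [
    {"text": "John Smith", "label": "name", "score": 0.92},
    {"text": "415-555-1234", "label": "phone number", "score": 0.61},
    {"text": "12345678", "label": "account number", "score": 0.34},
]

# Low cut-off: higher recall, keeps the uncertain account number.
high_recall = filter_by_score(sample_entities, 0.3)

# Higher cut-off: higher precision, drops the low-confidence match.
high_precision = filter_by_score(sample_entities, 0.6)

for entity in high_precision:
    print(f"{entity['text']} => {entity['label']} ({entity['score']:.2f})")
```

Filtering this way makes it possible to compare several cut-offs on the same predictions without re-running the model for each one.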
texts = ["Text 1 with PII", "Text 2 with PII", "Text 3 with PII"]
labels = ["name", "phone number", "email address"]
results = model.run(texts, labels, threshold=0.3, batch_size=8)
for text_idx, entities in enumerate(results):
    print(f"Text {text_idx + 1}:")
    for entity in entities:
        print(f" {entity['text']} => {entity['label']}")

@misc{smechov2025wordcabpii,
title={Wordcab-PII: Production-ready PII/PHI/PCI detection based on GLiNER multi-task},
author={Aleksandr Smechov and Ihor Stepanov},
year={2025},
eprint={2406.12925},
archivePrefix={arXiv},
primaryClass={cs.LG}
}

Apache 2.0