PII Masking
Overview
PII masking is the process of obscuring Personally Identifiable Information (such as names, IDs, and phone numbers) within a text. In Generative AI applications, it helps prevent sensitive user data from being exposed to third-party LLM providers or leaking into training data, and it supports compliance with data privacy regulations.
At GDP Labs, we use the gllm-privacy library to handle this workflow.
gllm-privacy detects and anonymizes sensitive data. It provides standard detection for global entities (email, phone, etc.) and specialized support for Indonesian entities (KTP, NPWP, BPJS, etc.). It also integrates Named Entity Recognition (NER) models to detect unstructured entities such as names, organizations, and locations, so you can add privacy protection to your GenAI applications.
Installation
Using pip
Step 1: Install keyring for authentication
pip install keyring keyrings.gl-artifactregistry-auth
Step 2: Install the package
pip install gllm-privacy-binary --index-url https://glsdk.gdplabs.id/gen-ai/simple
Using Poetry
Step 1: Add the gen-ai source to your pyproject.toml
poetry source add --priority=explicit gen-ai https://glsdk.gdplabs.id/gen-ai/simple
Step 2: Configure the authentication
poetry config http-basic.gen-ai oauth2accesstoken "$(gcloud auth print-access-token)"
Step 3: Add the package to your project
poetry add --source gen-ai gllm-privacy-binary
For development purposes, you can install directly from the Git repository:
poetry add "git+ssh://git@github.com/GDP-ADMIN/gen-ai-internal.git#subdirectory=libs/gllm-privacy"
Running Your First Anonymization
In this tutorial, we will detect and anonymize sensitive data using the default recognizers (Regex-based).
Create a script called privacy_quickstart.py.
Run the script:
python privacy_quickstart.py
The script prints the anonymized text with the sensitive values replaced, then restores the original values.
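To make the anonymize-then-restore round trip concrete, here is a minimal, library-independent sketch using only Python's standard library. It does not use or represent the gllm-privacy API. The two patterns, the `<ENTITY_n>` placeholder format, and the `anonymize` and `deanonymize` function names are assumptions chosen for illustration.

```python
import re

# Illustrative patterns only (assumptions); real recognizers are typically stricter.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE_NUMBER": re.compile(r"\+62\d{8,12}"),
}


def anonymize(text: str) -> tuple[str, dict[str, str]]:
    """Replace each match with a numbered placeholder and return the mapping."""
    mapping: dict[str, str] = {}  # placeholder -> original value
    seen: dict[str, str] = {}     # original value -> placeholder

    for entity, pattern in PATTERNS.items():
        def replace(match: re.Match, entity: str = entity) -> str:
            value = match.group()
            if value not in seen:
                # Number placeholders per entity type, starting at 1.
                count = sum(1 for p in mapping if p.startswith(f"<{entity}_")) + 1
                placeholder = f"<{entity}_{count}>"
                seen[value] = placeholder
                mapping[placeholder] = value
            # A repeated value reuses its existing placeholder.
            return seen[value]

        text = pattern.sub(replace, text)
    return text, mapping


def deanonymize(text: str, mapping: dict[str, str]) -> str:
    """Restore original values by substituting placeholders back."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text


if __name__ == "__main__":
    masked, mapping = anonymize("Contact budi@example.com or +6281234567890.")
    print(masked)
    print(deanonymize(masked, mapping))
```

Keeping the placeholder-to-value mapping outside the masked text is what makes restoration possible: only the masked string needs to leave your system, while the mapping stays local.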
Anonymizer with NER
While regex patterns are efficient for certain data like IDs or Phone Numbers, they cannot effectively detect unstructured entities like Names, Organizations, and Locations.
For these use cases, you can use the GDP Labs NER Service integrated directly into gllm-privacy. The GDPLabsNerApiRemoteRecognizer sends text to a remote endpoint for detection. In this tutorial, we use the GDP Labs NER Service in the staging environment.
Import the Remote Recognizer
Configure the Recognizer
Initialize the recognizer with the API URL and credentials. We will create recognizers for both Indonesian (id) and English (en).
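One way to avoid hard-coding the API URL and credentials in a script is to read them from the environment and build one settings object per language. The sketch below is an assumption-laden illustration: the environment variable names `NER_API_URL` and `NER_API_KEY`, the `NerSettings` class, and the `load_ner_settings` helper are hypothetical, and the actual constructor arguments of GDPLabsNerApiRemoteRecognizer are not shown here.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class NerSettings:
    """Hypothetical container for the values a remote recognizer would need."""
    api_url: str
    api_key: str
    language: str


def load_ner_settings(
    languages: tuple[str, ...] = ("id", "en"),
    env: Mapping[str, str] = os.environ,
) -> dict[str, NerSettings]:
    """Build one settings object per language from environment variables."""
    # NER_API_URL and NER_API_KEY are assumed names, not defined by the library.
    api_url = env.get("NER_API_URL", "").strip()
    api_key = env.get("NER_API_KEY", "").strip()
    missing = [
        name
        for name, value in (("NER_API_URL", api_url), ("NER_API_KEY", api_key))
        if not value
    ]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {lang: NerSettings(api_url, api_key, lang) for lang in languages}
```

Each resulting settings object could then be passed to whatever recognizer constructor the library provides, one per language.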
Run Analysis
Inject the remote recognizers into the TextAnalyzer and run detection to see what entities are found.
Example Output:
Run Anonymization
Once detection is confirmed, use the TextAnonymizer to mask the sensitive data.
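As a library-independent illustration of what masking detected entities involves, the sketch below assumes detections are available as character spans with an entity type and a confidence score. The `Detection` tuple and the `drop_overlaps` and `mask` functions are hypothetical and do not represent the TextAnonymizer API. Overlapping spans are resolved by keeping the higher-scoring one, and replacements are applied from right to left so earlier offsets stay valid.

```python
from typing import NamedTuple


class Detection(NamedTuple):
    """Hypothetical detection result: a typed character span with a score."""
    entity_type: str
    start: int
    end: int
    score: float


def drop_overlaps(detections: list[Detection]) -> list[Detection]:
    """Keep the best-scoring detection among overlapping spans."""
    kept: list[Detection] = []
    # Prefer higher scores, then longer spans.
    for det in sorted(detections, key=lambda d: (-d.score, -(d.end - d.start))):
        if all(det.end <= k.start or det.start >= k.end for k in kept):
            kept.append(det)
    return sorted(kept, key=lambda d: d.start)


def mask(text: str, detections: list[Detection]) -> str:
    """Replace each kept span with an <ENTITY_TYPE> tag, right to left."""
    for det in reversed(drop_overlaps(detections)):
        text = text[:det.start] + f"<{det.entity_type}>" + text[det.end:]
    return text


if __name__ == "__main__":
    text = "Budi Santoso works at GDP Labs in Jakarta."

    def span(entity_type: str, value: str, score: float) -> Detection:
        start = text.index(value)
        return Detection(entity_type, start, start + len(value), score)

    detections = [
        span("PERSON", "Budi Santoso", 0.95),
        span("PERSON", "Budi", 0.60),  # overlaps the longer span and is dropped
        span("ORGANIZATION", "GDP Labs", 0.90),
        span("LOCATION", "Jakarta", 0.85),
    ]
    print(mask(text, detections))
```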
Example Output:
Supported Entities
gllm-privacy comes with built-in support for the following entities:
Indonesia Specific:
ID_KTP (Kartu Tanda Penduduk)
ID_NPWP (Tax ID)
ID_BPJS_NUMBER (Social Security)
FAMILY_CARD_NUMBER (Kartu Keluarga)
BANK_ACCOUNT (Local Bank Accounts)
PHONE_NUMBER (+62 format)
Global:
EMAIL_ADDRESS
CREDIT_CARD
IP_ADDRESS
URL
IBAN_CODE
AI-Detected (via Transformers or Remote NER):
PERSON
LOCATION
ORGANIZATION
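To show how a regex-based recognizer for a structured entity such as ID_KTP might work, here is a minimal sketch. How gllm-privacy implements this is not described here, so the pattern and the `looks_like_nik` and `find_nik_candidates` helpers are assumptions. The sketch relies on the NIK (the number printed on a KTP) being 16 digits, with digits 7 to 12 encoding the date of birth as DDMMYY and 40 added to the day for women.

```python
import re

# Exactly 16 digits, not embedded in a longer digit run.
NIK_PATTERN = re.compile(r"(?<![0-9])[0-9]{16}(?![0-9])")


def looks_like_nik(candidate: str) -> bool:
    """Cheap structural check: 16 digits with a plausible day and month."""
    if not re.fullmatch(r"[0-9]{16}", candidate):
        return False
    day = int(candidate[6:8])
    month = int(candidate[8:10])
    if day > 40:  # the day is offset by 40 for women
        day -= 40
    return 1 <= day <= 31 and 1 <= month <= 12


def find_nik_candidates(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, value) for each 16-digit run that passes the check."""
    return [
        (m.start(), m.end(), m.group())
        for m in NIK_PATTERN.finditer(text)
        if looks_like_nik(m.group())
    ]
```

Adding a structural check on top of the digit pattern reduces false positives from other 16-digit numbers, although it cannot confirm that a given number was actually issued.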