PII Masking

Overview

PII Masking is the process of obscuring Personally Identifiable Information (such as names, IDs, and phone numbers) within a text. In the context of Generative AI, this is critical for preventing sensitive user data from being exposed to third-party LLM providers or leaking into training data, ensuring compliance with data privacy regulations.

At GDP Labs, we use the gllm-privacy library to handle this workflow.

gllm-privacy is designed to robustly detect and anonymize sensitive data. It provides standard detection for global entities (Email, Phone, etc.) and specialized support for Indonesian entities (KTP, NPWP, BPJS, etc.). Additionally, it integrates Named Entity Recognition (NER) models to detect unstructured entities such as Names, Organizations, and Locations, allowing you to integrate comprehensive privacy protection seamlessly into your GenAI applications.
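The mechanics of reversible masking can be illustrated with a small, self-contained Python sketch. This is not the gllm-privacy API, just the underlying idea: detect entities (here with two illustrative regexes), substitute stable placeholders, and keep a mapping so the original values can be restored later.

```python
import re

# Illustrative patterns only -- real recognizers are far more robust.
PATTERNS = {
    "EMAIL_ADDRESS": r"[\w.+-]+@[\w-]+\.[\w.]+",
    "PHONE_NUMBER": r"\+62[\d-]{8,15}",
}

def anonymize(text: str) -> tuple[str, dict[str, str]]:
    """Replace each detected entity with a stable placeholder like <EMAIL_ADDRESS_1>."""
    mapping: dict[str, str] = {}
    for entity, pattern in PATTERNS.items():
        # dict.fromkeys dedups matches while preserving order, so the same
        # value always maps to the same placeholder.
        for i, value in enumerate(dict.fromkeys(re.findall(pattern, text)), start=1):
            placeholder = f"<{entity}_{i}>"
            mapping[placeholder] = value
            text = text.replace(value, placeholder)
    return text, mapping

def deanonymize(text: str, mapping: dict[str, str]) -> str:
    """Restore the original values from the placeholder mapping."""
    for placeholder, value in mapping.items():
        text = text.replace(placeholder, value)
    return text

masked, mapping = anonymize("Hubungi budi@example.com atau +62812-3456-7890.")
print(masked)  # Hubungi <EMAIL_ADDRESS_1> atau <PHONE_NUMBER_1>.
```

Because the placeholder-to-value mapping is kept, the masked text can be sent to a third-party LLM and the response deanonymized before it reaches the user.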

Prerequisites

Before installing, make sure you have:

  1. gcloud CLI - required because gllm-privacy is a private library hosted in a GDPLabs private Google Cloud repository.

After installing the gcloud CLI, run:

gcloud auth login

This authorizes gcloud with your Google user credentials, which is required to access and download the package during installation.

Installation

Step 1: Install keyring for authentication

pip install keyring keyrings.gl-artifactregistry-auth

Step 2: Install the package

pip install gllm-privacy-binary --index-url https://glsdk.gdplabs.id/gen-ai/simple

Running Your First Anonymization

In this tutorial, we will detect and anonymize sensitive data using the default recognizers (Regex-based).

Step 1: Create a script called privacy_quickstart.py
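A sketch of what privacy_quickstart.py might contain, based on the class names mentioned in this guide (TextAnalyzer, TextAnonymizer). The import path and the method names (analyze, anonymize, deanonymize) are assumptions, not the verified gllm-privacy API; check the library documentation for the exact calls.

```python
# privacy_quickstart.py -- illustrative sketch; import path and method
# names are assumptions, not the verified gllm-privacy API.
from gllm_privacy import TextAnalyzer, TextAnonymizer

text = "Nama saya Budi, NIK 3171234567890001, telepon +62812-3456-7890."

# Detect PII using the default regex-based recognizers.
text_analyzer = TextAnalyzer()
for result in text_analyzer.analyze(text=text, language="id"):
    print(result)

# Replace each detected entity with a reversible placeholder (e.g., <ID_KTP_1>).
anonymizer = TextAnonymizer(text_analyzer)
anonymized = anonymizer.anonymize(text)
print(anonymized)

# Restore the original values from the placeholder mapping.
print(anonymizer.deanonymize(anonymized))
```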

Step 2: Run the script

python privacy_quickstart.py

Step 3: Review the output

The script prints the anonymized text with placeholder values, then restores the original values.

Default Behavior: gllm-privacy uses reversible placeholders (e.g., <ID_KTP_1>) by default. The same entity is always replaced by the same placeholder, which allows accurate deanonymization later. To generate fake data instead (e.g., a fake KTP number rather than a placeholder), initialize the anonymizer with: TextAnonymizer(text_analyzer, add_default_faker_operators=True)

Anonymizer with NER

While regex patterns are efficient for structured data such as ID numbers and phone numbers, they cannot reliably detect unstructured entities like Names, Organizations, and Locations.

For these use cases, you can use the GDP Labs NER Service, which is integrated directly into gllm-privacy via the GDPLabsNerApiRemoteRecognizer, a recognizer that sends text to a remote endpoint. In this tutorial, we will use the GDP Labs NER Service in the staging environment.

Prerequisites
  • API Key: For authenticating with the NER Endpoint (passed via the x-api-key header). Ask your manager or Infra team for the key.

Step 1: Import the Remote Recognizer
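The import below assumes a module path; verify the actual path in the gllm-privacy documentation:

```python
# Module path is an assumption -- confirm against the library's docs.
from gllm_privacy.recognizers import GDPLabsNerApiRemoteRecognizer
```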

Step 2: Configure the Recognizer

Initialize the recognizer with the API URL and credentials. We will create recognizers for both Indonesian (id) and English (en).
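A hedged configuration sketch. The constructor parameter names (api_url, api_key, language) and the endpoint URL are assumptions, not the verified signature; substitute the real staging endpoint and the key you obtained from your Infra team.

```python
import os

# Placeholder endpoint -- ask your manager or the Infra team for the real staging URL.
NER_API_URL = "https://<staging-ner-endpoint>"
NER_API_KEY = os.environ["NER_API_KEY"]  # sent as the x-api-key header

# Parameter names are assumptions; check the GDPLabsNerApiRemoteRecognizer signature.
recognizer_id = GDPLabsNerApiRemoteRecognizer(
    api_url=NER_API_URL, api_key=NER_API_KEY, language="id"
)
recognizer_en = GDPLabsNerApiRemoteRecognizer(
    api_url=NER_API_URL, api_key=NER_API_KEY, language="en"
)
```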

Step 3: Run Analysis

Inject the remote recognizers into the TextAnalyzer and run detection to see what entities are found.
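A sketch of the analysis step, reusing the recognizers from the previous step. The TextAnalyzer constructor argument and the analyze signature are assumptions based on the names used in this guide.

```python
# Inject the remote recognizers into the analyzer (signature assumed).
text_analyzer = TextAnalyzer(recognizers=[recognizer_id, recognizer_en])

text = "Budi Santoso bekerja di GDP Labs, Jakarta."
for result in text_analyzer.analyze(text=text, language="id"):
    print(result)  # expect PERSON, ORGANIZATION, and LOCATION detections
```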


Step 4: Run Anonymization

Once detection is confirmed, use the TextAnonymizer to mask the sensitive data.
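A sketch of the anonymization step, reusing the analyzer from the previous step (constructor and method names assumed):

```python
# Mask the detected entities with reversible placeholders (API assumed).
anonymizer = TextAnonymizer(text_analyzer)
anonymized_text = anonymizer.anonymize(text)
print(anonymized_text)  # entities replaced with placeholders such as <PERSON_1>
```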


Supported Entities

gllm-privacy comes with built-in support for the following entities:

  • Indonesia-Specific:

    • ID_KTP (Kartu Tanda Penduduk)

    • ID_NPWP (Tax ID)

    • ID_BPJS_NUMBER (Social Security)

    • FAMILY_CARD_NUMBER (Kartu Keluarga)

    • BANK_ACCOUNT (Local Bank Accounts)

    • PHONE_NUMBER (+62 format)

  • Global:

    • EMAIL_ADDRESS

    • CREDIT_CARD

    • IP_ADDRESS

    • URL

    • IBAN_CODE

  • AI-Detected (via Transformers or Remote NER):

    • PERSON

    • LOCATION

    • ORGANIZATION
