PII Masking

Overview

PII Masking is the process of obscuring Personally Identifiable Information (such as names, ID numbers, and phone numbers) within a text. In Generative AI applications, masking keeps sensitive user data from being exposed to third-party LLM providers or leaking into training data, which supports compliance with data privacy regulations.

gllm-privacy detects and anonymizes sensitive data. It provides standard detection for global entities (Email, Phone, etc.) and specialized support for Indonesian entities (KTP, NPWP, BPJS, etc.). It also integrates Named Entity Recognition (NER) models to detect unstructured entities such as Names, Organizations, and Locations, so you can add privacy protection to your GenAI applications.

Prerequisites

Before installing, make sure you have:

  1. gcloud CLI - required because gllm-privacy is a private library hosted in a private GDPLabs Google Cloud repository.

After installing, please run:

gcloud auth login

to authorize gcloud to access the Cloud Platform with Google user credentials.

You need to authenticate via gcloud CLI to access and download the package during installation.

Installation

pip install gllm-privacy-binary 

Running Your First Anonymization

In this tutorial, we detect and anonymize sensitive data using the default regex-based recognizers.

1

Create a script called privacy_quickstart.py.

import asyncio
from gllm_privacy.pii_detector import TextAnalyzer, TextAnonymizer
from gllm_privacy.pii_detector.constants import Entities
from gllm_privacy.pii_detector.anonymizer import Operation

async def main():
    # 1. Initialize the Analyzer & Anonymizer
    text_analyzer = TextAnalyzer()
    text_anonymizer = TextAnonymizer(text_analyzer)

    # 2. Define input text containing mixed PII
    text = "Halo, nama saya Budi. Nomor KTP saya 3525011212941001. Hubungi budi@example.com atau +628123456789"

    # 3. Define target entities
    entities = [Entities.KTP, Entities.EMAIL_ADDRESS, Entities.PHONE_NUMBER]

    # 4. Run Anonymization
    print("--- Anonymizing ---")
    anonymized_text = await text_anonymizer.run(
        text=text,
        entities=entities,
        operation=Operation.ANONYMIZE
    )
    print(anonymized_text)

    # 5. Run Deanonymization (Restore original values)
    print("\n--- Deanonymizing ---")
    deanonymized_text = await text_anonymizer.run(
        text=anonymized_text,
        operation=Operation.DEANONYMIZE
    )
    print(deanonymized_text)

if __name__ == "__main__":
    asyncio.run(main())
2

Run the script.

python privacy_quickstart.py
3

The script will output the anonymized text with replaced values, and then restore the original values.

--- Anonymizing ---
Halo, nama saya Budi. Nomor KTP saya <ID_KTP_1>. Hubungi <EMAIL_ADDRESS_1> atau <PHONE_NUMBER_1>

--- Deanonymizing ---
Halo, nama saya Budi. Nomor KTP saya 3525011212941001. Hubungi budi@example.com atau +628123456789

Default Behavior: gllm-privacy uses reversible placeholders (e.g., <ID_KTP_1>) by default. The same value is always replaced by the same placeholder, which allows accurate deanonymization later. To use fake data instead (e.g., generating a fake KTP number rather than a placeholder), initialize the anonymizer with: TextAnonymizer(text_analyzer, add_default_faker_operators=True)
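
To make the idea of reversible placeholders concrete, here is a minimal standard-library sketch. It is not gllm-privacy's implementation: the PlaceholderVault class, its methods, and the two regex patterns are hypothetical names chosen for illustration. It only shows how a stable value-to-placeholder mapping makes deanonymization possible.

```python
import re

# Hypothetical patterns for illustration only; these are NOT gllm-privacy's recognizers.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "ID_KTP": re.compile(r"\b\d{16}\b"),
}


class PlaceholderVault:
    """Keeps a two-way mapping between original values and numbered placeholders."""

    def __init__(self):
        self.value_to_placeholder = {}  # (entity, value) -> "<ENTITY_n>"
        self.placeholder_to_value = {}  # "<ENTITY_n>" -> value
        self.counters = {}              # entity -> last number used

    def placeholder_for(self, entity, value):
        key = (entity, value)
        if key not in self.value_to_placeholder:
            number = self.counters.get(entity, 0) + 1
            self.counters[entity] = number
            placeholder = f"<{entity}_{number}>"
            self.value_to_placeholder[key] = placeholder
            self.placeholder_to_value[placeholder] = value
        # A value seen before reuses its existing placeholder.
        return self.value_to_placeholder[key]

    def anonymize(self, text):
        for entity, pattern in PATTERNS.items():
            text = pattern.sub(
                lambda m, e=entity: self.placeholder_for(e, m.group()), text
            )
        return text

    def deanonymize(self, text):
        for placeholder, value in self.placeholder_to_value.items():
            text = text.replace(placeholder, value)
        return text


vault = PlaceholderVault()
original = "KTP 3525011212941001, email budi@example.com, sekali lagi budi@example.com"
masked = vault.anonymize(original)
print(masked)
# KTP <ID_KTP_1>, email <EMAIL_ADDRESS_1>, sekali lagi <EMAIL_ADDRESS_1>
print(vault.deanonymize(masked) == original)
# True
```

Because the repeated email address maps to the same placeholder, a downstream LLM can still tell that both mentions refer to the same value, and the mapping is enough to restore the original text afterwards.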

Enhanced Anonymization with NER

While Regex patterns are highly efficient for structured data like IDs or phone numbers, they struggle with unstructured entities such as Names, Organizations, and Locations, which depend on context. To address this, gllm-privacy supports Hugging Face models for deep-learning-based PII detection.
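
The limitation can be seen in a small standard-library experiment. Both patterns below are assumptions written for this illustration, not the patterns gllm-privacy ships: a 16-digit pattern finds the KTP number reliably, while a naive "capitalized word" pattern for names also flags ordinary sentence-initial words.

```python
import re

# Illustrative patterns only (assumptions, not gllm-privacy's recognizers).
KTP_PATTERN = re.compile(r"\b\d{16}\b")             # structured: fixed-length digit run
NAIVE_NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+\b")  # unstructured: any capitalized word

text = "Halo, nama saya Budi. Nomor KTP saya 3525011212941001."

ktp_hits = KTP_PATTERN.findall(text)
name_hits = NAIVE_NAME_PATTERN.findall(text)

print(ktp_hits)   # ['3525011212941001']
print(name_hits)  # ['Halo', 'Budi', 'Nomor']
```

Only "Budi" is a name; "Halo" and "Nomor" are false positives that no purely lexical rule can exclude without understanding the sentence, which is the gap an NER model is meant to fill.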

1

Installation

To enable NER capabilities, install gllm-privacy with the transformers extra:

2

Configure the Transformer Recognizer

We use the TransformersRecognizer class to bridge Hugging Face models with our privacy pipeline. In this example, we utilize cahya/NusaBert-ner

3

Execute Entity Analysis

Next, we inject the transformer recognizer into the TextAnalyzer.

Example Output:

4

Run Anonymization

Once the entities are accurately identified, use the TextAnonymizer to mask the sensitive values. This replaces the detected text with secure placeholders.
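
As a library-independent illustration of this masking step, the sketch below replaces detected character spans with numbered placeholders. The Span type, the find_span helper (a stand-in for real detector output), the mask_spans function, and the PERSON / ORGANIZATION / LOCATION labels are all hypothetical; gllm-privacy's own classes and labels may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    entity: str  # hypothetical label such as "PERSON"


def find_span(text, value, entity):
    """Stand-in for a detector: locate the first occurrence of a known value."""
    start = text.index(value)
    return Span(start, start + len(value), entity)


def mask_spans(text, spans):
    """Replace non-overlapping spans with placeholders; return (masked_text, mapping)."""
    ordered = sorted(spans, key=lambda s: s.start)  # number placeholders in reading order
    counters = {}
    seen = {}      # (entity, value) -> placeholder
    labelled = []
    for span in ordered:
        value = text[span.start:span.end]
        key = (span.entity, value)
        if key not in seen:
            counters[span.entity] = counters.get(span.entity, 0) + 1
            seen[key] = f"<{span.entity}_{counters[span.entity]}>"
        labelled.append((span, seen[key]))

    # Apply replacements right to left so earlier offsets stay valid.
    masked = text
    for span, placeholder in reversed(labelled):
        masked = masked[:span.start] + placeholder + masked[span.end:]

    mapping = {placeholder: value for (_, value), placeholder in seen.items()}
    return masked, mapping


text = "Budi bekerja di PT Maju Jaya di Jakarta."
spans = [
    find_span(text, "Jakarta", "LOCATION"),
    find_span(text, "Budi", "PERSON"),
    find_span(text, "PT Maju Jaya", "ORGANIZATION"),
]
masked_text, mapping = mask_spans(text, spans)
print(masked_text)
# <PERSON_1> bekerja di <ORGANIZATION_1> di <LOCATION_1>.
```

Replacing from the end of the string backwards is the key design choice: each substitution changes the text length, so working left to right would invalidate the offsets of every later span.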

Example Output:
