PII Masking

Overview

PII Masking is the process of obscuring Personally Identifiable Information (such as names, IDs, and phone numbers) within a text. In the context of Generative AI, this is critical for preventing sensitive user data from being exposed to third-party LLM providers or leaking into training data, ensuring compliance with data privacy regulations.

At GDP Labs, we use the gllm-privacy library to handle this workflow.

gllm-privacy is designed to robustly detect and anonymize sensitive data. It provides standard detection for global entities (Email, Phone, etc.) and specialized support for Indonesian entities (KTP, NPWP, BPJS, etc.). Additionally, it integrates Named Entity Recognition (NER) models to detect unstructured entities such as Names, Organizations, and Locations, allowing you to integrate comprehensive privacy protection seamlessly into your GenAI applications.
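The mechanics of reversible masking can be illustrated with a small, self-contained Python sketch. This is not the gllm-privacy API, just the underlying idea: detect entities (here with two illustrative regexes), substitute stable placeholders, and keep a mapping so the original values can be restored later.

```python
import re

# Illustrative patterns only -- real recognizers are far more robust.
PATTERNS = {
    "EMAIL_ADDRESS": r"[\w.+-]+@[\w-]+\.[\w.]+",
    "PHONE_NUMBER": r"\+62[\d-]{8,15}",
}

def anonymize(text: str) -> tuple[str, dict[str, str]]:
    """Replace each detected entity with a stable placeholder like <EMAIL_ADDRESS_1>."""
    mapping: dict[str, str] = {}
    for entity, pattern in PATTERNS.items():
        # dict.fromkeys dedups matches while preserving order, so the same
        # value always maps to the same placeholder.
        for i, value in enumerate(dict.fromkeys(re.findall(pattern, text)), start=1):
            placeholder = f"<{entity}_{i}>"
            mapping[placeholder] = value
            text = text.replace(value, placeholder)
    return text, mapping

def deanonymize(text: str, mapping: dict[str, str]) -> str:
    """Restore the original values from the placeholder mapping."""
    for placeholder, value in mapping.items():
        text = text.replace(placeholder, value)
    return text

masked, mapping = anonymize("Hubungi budi@example.com atau +62812-3456-7890.")
print(masked)  # Hubungi <EMAIL_ADDRESS_1> atau <PHONE_NUMBER_1>.
```

Because the placeholder-to-value mapping is kept, the masked text can be sent to a third-party LLM and the response deanonymized before it reaches the user.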

Prerequisites

Before installing, make sure you have:

  1. gcloud CLI - required because gllm-privacy is a private library hosted in a GDPLabs private Google Cloud repository.

After installing the gcloud CLI, run:

gcloud auth login

This authorizes gcloud with your Google user credentials, which is required to access and download the package during installation.

Installation

Step 1: Install keyring for authentication

pip install keyring keyrings.gl-artifactregistry-auth

Step 2: Install the package

pip install gllm-privacy-binary --index-url https://glsdk.gdplabs.id/gen-ai/simple

Running Your First Anonymization

In this tutorial, we will detect and anonymize sensitive data using the default recognizers (Regex-based).

Step 1: Create a script called privacy_quickstart.py
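A sketch of what privacy_quickstart.py might contain, based on the class names mentioned in this guide (TextAnalyzer, TextAnonymizer). The import path and the method names (analyze, anonymize, deanonymize) are assumptions, not the verified gllm-privacy API; check the library documentation for the exact calls.

```python
# privacy_quickstart.py -- illustrative sketch; import path and method
# names are assumptions, not the verified gllm-privacy API.
from gllm_privacy import TextAnalyzer, TextAnonymizer

text = "Nama saya Budi, NIK 3171234567890001, telepon +62812-3456-7890."

# Detect PII using the default regex-based recognizers.
text_analyzer = TextAnalyzer()
for result in text_analyzer.analyze(text=text, language="id"):
    print(result)

# Replace each detected entity with a reversible placeholder (e.g., <ID_KTP_1>).
anonymizer = TextAnonymizer(text_analyzer)
anonymized = anonymizer.anonymize(text)
print(anonymized)

# Restore the original values from the placeholder mapping.
print(anonymizer.deanonymize(anonymized))
```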

Step 2: Run the script

python privacy_quickstart.py

Step 3: Review the output

The script prints the anonymized text with placeholder values, then restores the original values.

Default Behavior: gllm-privacy uses reversible placeholders (e.g., <ID_KTP_1>) by default. The same entity is always replaced by the same placeholder, which allows accurate deanonymization later. To generate fake data instead (e.g., a fake KTP number rather than a placeholder), initialize the anonymizer with: TextAnonymizer(text_analyzer, add_default_faker_operators=True)

Anonymizer with NER

While regex patterns are efficient for structured data such as ID numbers and phone numbers, they cannot reliably detect unstructured entities like Names, Organizations, and Locations.

For these use cases, you can use the GDP Labs NER Service, which is integrated directly into gllm-privacy via the GDPLabsNerApiRemoteRecognizer, a recognizer that sends text to a remote endpoint. In this tutorial, we will use the GDP Labs NER Service in the staging environment.

Prerequisites
  • API Key: For authenticating with the NER Endpoint (passed via the x-api-key header). Ask your manager or Infra team for the key.

Step 1: Import the Remote Recognizer
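The import below assumes a module path; verify the actual path in the gllm-privacy documentation:

```python
# Module path is an assumption -- confirm against the library's docs.
from gllm_privacy.recognizers import GDPLabsNerApiRemoteRecognizer
```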

Step 2: Configure the Recognizer

Initialize the recognizer with the API URL and credentials. We will create recognizers for both Indonesian (id) and English (en).
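A hedged configuration sketch. The constructor parameter names (api_url, api_key, language) and the endpoint URL are assumptions, not the verified signature; substitute the real staging endpoint and the key you obtained from your Infra team.

```python
import os

# Placeholder endpoint -- ask your manager or the Infra team for the real staging URL.
NER_API_URL = "https://<staging-ner-endpoint>"
NER_API_KEY = os.environ["NER_API_KEY"]  # sent as the x-api-key header

# Parameter names are assumptions; check the GDPLabsNerApiRemoteRecognizer signature.
recognizer_id = GDPLabsNerApiRemoteRecognizer(
    api_url=NER_API_URL, api_key=NER_API_KEY, language="id"
)
recognizer_en = GDPLabsNerApiRemoteRecognizer(
    api_url=NER_API_URL, api_key=NER_API_KEY, language="en"
)
```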

Step 3: Run Analysis

Inject the remote recognizers into the TextAnalyzer and run detection to see what entities are found.
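A sketch of the analysis step, reusing the recognizers from the previous step. The TextAnalyzer constructor argument and the analyze signature are assumptions based on the names used in this guide.

```python
# Inject the remote recognizers into the analyzer (signature assumed).
text_analyzer = TextAnalyzer(recognizers=[recognizer_id, recognizer_en])

text = "Budi Santoso bekerja di GDP Labs, Jakarta."
for result in text_analyzer.analyze(text=text, language="id"):
    print(result)  # expect PERSON, ORGANIZATION, and LOCATION detections
```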


Step 4: Run Anonymization

Once detection is confirmed, use the TextAnonymizer to mask the sensitive data.
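A sketch of the anonymization step, reusing the analyzer from the previous step (constructor and method names assumed):

```python
# Mask the detected entities with reversible placeholders (API assumed).
anonymizer = TextAnonymizer(text_analyzer)
anonymized_text = anonymizer.anonymize(text)
print(anonymized_text)  # entities replaced with placeholders such as <PERSON_1>
```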


Supported Entities

gllm-privacy comes with built-in support for the following entities:

  • Indonesia-Specific:

    • ID_KTP (Kartu Tanda Penduduk)

    • ID_NPWP (Tax ID)

    • ID_BPJS_NUMBER (Social Security)

    • FAMILY_CARD_NUMBER (Kartu Keluarga)

    • BANK_ACCOUNT (Local Bank Accounts)

    • PHONE_NUMBER (+62 format)

  • Global:

    • EMAIL_ADDRESS

    • CREDIT_CARD

    • IP_ADDRESS

    • URL

    • IBAN_CODE

  • AI-Detected (via Transformers or Remote NER):

    • PERSON

    • LOCATION

    • ORGANIZATION
