Text Normalization

This guide explains Unicode text normalization and how to use it with gllm-intl to ensure consistent text processing in multilingual applications.

Installation

# you can use a Conda environment
pip install --extra-index-url "https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/" python-dotenv gllm-intl

What is Text Normalization?

Text normalization is the process of converting text into a standard, canonical form. In Unicode, the same visual character can be represented in multiple ways using different byte sequences. Normalization ensures these equivalent representations are converted to a single consistent form.

Example: The Word "café"

The character "é" (e with acute accent) can be represented in two ways:

  1. Precomposed (single character): é (U+00E9)

  2. Decomposed (base + combining mark): e + ´ (U+0065 + U+0301)

Both look identical to humans, but computers see them as different byte sequences:

cafe_precomposed = "café"  # \u00e9
cafe_decomposed = "cafe\u0301"  # e + combining acute

print(cafe_precomposed == cafe_decomposed)  # False! 😱

Text normalization solves this problem by converting both representations to the same canonical form.
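
For example, Python's built-in unicodedata module shows the fix directly:

import unicodedata

cafe_precomposed = "café"       # \u00e9
cafe_decomposed = "cafe\u0301"  # e + combining acute

# After NFC normalization both strings share one canonical byte sequence.
print(unicodedata.normalize("NFC", cafe_precomposed)
      == unicodedata.normalize("NFC", cafe_decomposed))  # True 🎉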


Why Normalize Text?

1. String Comparison & Equality

Without normalization, visually identical strings may not match:
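
A minimal sketch, assuming normalize_text is importable from a gllm_intl module (the exact import path is an assumption):

from gllm_intl import normalize_text  # import path assumed

user_input = "cafe\u0301"   # decomposed, e.g. from macOS keyboard input
stored_value = "caf\u00e9"  # precomposed, e.g. from a web form

print(user_input == stored_value)  # False: different byte sequences
print(normalize_text(user_input, form="NFC")
      == normalize_text(stored_value, form="NFC"))  # True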

2. Database Storage & Retrieval

Ensure consistent storage to prevent duplicate entries:
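
A sketch under the same import assumption; the db object and its insert method are hypothetical stand-ins for your data layer:

from gllm_intl import normalize_text  # import path assumed

def save_username(db, username: str) -> None:
    # Normalize before writing so "José" is stored one way only,
    # regardless of how the client encoded it. (db is hypothetical.)
    db.insert("users", {"username": normalize_text(username, form="NFC")})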

3. Search & Indexing

Make search results predictable:
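
A sketch of an accent- and case-insensitive search key, assuming normalize_and_strip is importable from gllm_intl:

from gllm_intl import normalize_and_strip  # import path assumed

def search_key(text: str) -> str:
    # Strip accents and fold case so "Crème" matches "creme".
    return normalize_and_strip(text).lower()

documents = ["Crème brûlée recipe", "Café au lait guide"]
index = {search_key(doc): doc for doc in documents}

query = search_key("creme brulee")
print([doc for key, doc in index.items() if query in key])  # ['Crème brûlée recipe']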

4. Text Processing Pipelines

Ensure consistent input for downstream operations:
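
A sketch using the list support described later in this guide (import path assumed):

from gllm_intl import normalize_text  # import path assumed

def preprocess(texts: list[str]) -> list[str]:
    # Normalize once at the pipeline entrance so tokenization,
    # hashing, and comparisons all see canonical NFC text.
    return normalize_text(texts, form="NFC")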


Understanding Unicode Normalization Forms

Unicode defines four normalization forms. The gllm-intl library supports all of them through the NormalizationForm enum.

NFC (Canonical Composition)

Combines base characters with combining marks into precomposed forms when possible.

Use when:

  • ✅ Storing text in databases

  • ✅ Displaying text to users

  • ✅ General-purpose text processing

NFD (Canonical Decomposition)

Decomposes precomposed characters into base + combining marks.

Use when:

  • ✅ Removing diacritics (decompose first, then strip combining marks)

  • ✅ Linguistic analysis

  • ✅ Sorting algorithms

NFKC (Compatibility Composition)

Like NFC, but also converts compatibility characters (ligatures, width variants) to standard forms.

Use when:

  • ✅ Search functionality (normalizes ligatures, width variants)

  • ✅ Case-insensitive comparisons

  • ⚠️ Be careful: loses distinction between variants (e.g., full-width vs. half-width)

NFKD (Compatibility Decomposition)

Like NFD, but also decomposes compatibility characters.

Use when:

  • ✅ Text analysis requiring maximum decomposition

  • ✅ Preparing text for ASCII conversion

Comparison Table

| Form | Canonical | Compatibility | Composed | Common Use |
|------|-----------|---------------|----------|------------|
| NFC  | ✅ | ❌ | ✅ | General storage & display |
| NFD  | ✅ | ❌ | ❌ | Diacritic removal, analysis |
| NFKC | ❌ | ✅ | ✅ | Search, case-insensitive ops |
| NFKD | ❌ | ✅ | ❌ | Text analysis, ASCII prep |
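
To see the four forms side by side, the standard-library unicodedata module is enough:

import unicodedata

text = "ﬁ\u00e9"  # "fi" ligature (U+FB01) + precomposed é (U+00E9)
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    result = unicodedata.normalize(form, text)
    print(form, result, [f"U+{ord(c):04X}" for c in result])

# NFC/NFD keep the ligature; NFKC/NFKD expand it to plain "fi".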


Quick Start

Normalize text in 2 lines:
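
A minimal sketch, assuming the package exposes normalize_text at the top level (import path is an assumption):

from gllm_intl import normalize_text  # import path assumed

print(normalize_text("cafe\u0301", form="NFC"))  # "café" (precomposed form)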


Normalization Functions

normalize_text() - Main Normalization Function

Normalize single strings or lists of strings:
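
A usage sketch under the same import assumption:

from gllm_intl import normalize_text  # import path assumed

# Single string in, single string out
print(normalize_text("cafe\u0301", form="NFC"))  # "café"

# List in, list out: each element normalized
print(normalize_text(["cafe\u0301", "nai\u0308ve"], form="NFC"))  # ["café", "naïve"]

# None passes through unchanged
print(normalize_text(None, form="NFC"))  # None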

Parameters:

  • text: Single string, list of strings, or None

  • form: Normalization form ("NFC", "NFD", "NFKC", "NFKD")

Returns:

  • Same type as input: str → str, list → list, None → None


Removing Diacritics

Diacritics (accent marks) can be removed for accent-insensitive search and comparison.

remove_diacritics() - Strip Accent Marks
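
A usage sketch, assuming remove_diacritics is exported alongside the other helpers:

from gllm_intl import remove_diacritics  # import path assumed

print(remove_diacritics("café"))       # "cafe"
print(remove_diacritics("naïve"))      # "naive"
print(remove_diacritics("São Paulo"))  # "Sao Paulo"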

normalize_and_strip() - Normalize + Remove Diacritics

Convenience function that combines both operations:
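
A sketch (single-argument call assumed from the description above):

from gllm_intl import normalize_and_strip  # import path assumed

# One call instead of normalize_text(...) followed by remove_diacritics(...)
print(normalize_and_strip("cafe\u0301"))  # "cafe"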

Why combine normalization + stripping?

Diacritic removal works by:

  1. Decomposing characters (NFD)

  2. Filtering out combining marks

  3. Recomposing to NFC

The normalize_and_strip() function does this efficiently in one call.
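
For reference, here are the same three steps written out with the standard-library unicodedata module:

import unicodedata

def strip_diacritics_manual(text: str) -> str:
    decomposed = unicodedata.normalize("NFD", text)  # 1. decompose
    stripped = "".join(c for c in decomposed
                       if not unicodedata.combining(c))  # 2. drop combining marks
    return unicodedata.normalize("NFC", stripped)  # 3. recompose

print(strip_diacritics_manual("crème brûlée"))  # "creme brulee"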


Common Use Cases

Use Case 2: Database Unique Constraints
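
A sketch; db and its methods are hypothetical placeholders for your data layer:

from gllm_intl import normalize_text  # import path assumed

def register_username(db, username: str) -> None:
    # Normalize before the uniqueness check so the precomposed and
    # decomposed spellings of "José" cannot create two accounts.
    canonical = normalize_text(username, form="NFC")
    if db.exists("users", username=canonical):  # db is hypothetical
        raise ValueError("Username already taken")
    db.insert("users", {"username": canonical})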

Use Case 3: Slug Generation
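
A sketch combining normalize_and_strip with a regex cleanup:

import re

from gllm_intl import normalize_and_strip  # import path assumed

def slugify(title: str) -> str:
    # Accent-free, lowercase, hyphen-separated URL slug.
    text = normalize_and_strip(title).lower()
    return re.sub(r"[^a-z0-9]+", "-", text).strip("-")

print(slugify("Crème Brûlée Recipe"))  # "creme-brulee-recipe"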

Use Case 4: Email Address Normalization
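
A sketch of one common policy (NFC plus lowercasing; accented characters are kept, not ASCII-folded):

from gllm_intl import normalize_text  # import path assumed

def normalize_email(email: str) -> str:
    # NFC + lowercase; internationalized local parts stay intact.
    return normalize_text(email.strip(), form="NFC").lower()

print(normalize_email("  Jose\u0301@Example.COM "))  # "josé@example.com"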

Use Case 5: Text Deduplication
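
A sketch that deduplicates on the normalized form while keeping the first original spelling:

from gllm_intl import normalize_text  # import path assumed

comments = ["café", "cafe\u0301", "cafe"]

seen, unique = set(), []
for comment in comments:
    key = normalize_text(comment, form="NFC")
    if key not in seen:
        seen.add(key)
        unique.append(comment)

print(unique)  # ["café", "cafe"]; the decomposed duplicate is dropped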

Use Case 6: Batch Processing
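
A sketch using the list support of normalize_text:

from gllm_intl import normalize_text  # import path assumed

records = ["cafe\u0301", "nai\u0308ve", "re\u0301sume\u0301"]

# One call handles the whole list; no per-string Python loop needed.
print(normalize_text(records, form="NFC"))  # ["café", "naïve", "résumé"]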


Best Practices

1. Always Use NFC for Storage

NFC (Canonical Composition) is the recommended form for storing and displaying text:
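
A sketch; db is a hypothetical stand-in for your storage layer:

from gllm_intl import normalize_text  # import path assumed

def store_comment(db, text: str) -> None:
    # Normalize on every write path so the database only ever
    # contains one representation per string. (db is hypothetical.)
    db.insert("comments", {"body": normalize_text(text, form="NFC")})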

2. Normalize at System Boundaries

Normalize text as early as possible (at input):
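
A sketch of normalizing at the request boundary:

from gllm_intl import normalize_text  # import path assumed

def handle_form_submission(fields: dict[str, str]) -> dict[str, str]:
    # Normalize at the edge; everything downstream can assume NFC.
    return {name: normalize_text(value, form="NFC")
            for name, value in fields.items()}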

3. Use Batch Processing for Performance

Process lists instead of individual strings:
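
For example:

from gllm_intl import normalize_text  # import path assumed

names = ["cafe\u0301"] * 10_000

# Preferred: one batched call.
normalized = normalize_text(names, form="NFC")

# Avoid: a per-item Python loop.
# normalized = [normalize_text(name, form="NFC") for name in names]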

5. Be Consistent Across Your Application

Choose one normalization strategy and apply it everywhere:
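
One way to enforce this is a single shared helper (module and names below are illustrative):

# text_utils.py: the one place the normalization strategy is defined
from gllm_intl import normalize_text  # import path assumed

CANONICAL_FORM = "NFC"

def canonicalize(text: str) -> str:
    return normalize_text(text, form=CANONICAL_FORM)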

6. Document Your Normalization Strategy

Be explicit in your API documentation:
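
For example, state the policy in docstrings (illustrative function):

def create_user(username: str) -> None:
    """Create a new user account.

    Args:
        username: Display name. Stored NFC-normalized, so callers
            do not need to normalize before calling.
    """
    ...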

7. Test with Real Multilingual Data
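
A pytest-style sketch covering scripts beyond Latin (import path assumed, as above):

import pytest

from gllm_intl import normalize_text  # import path assumed

@pytest.mark.parametrize(("raw", "expected"), [
    ("cafe\u0301", "caf\u00e9"),    # combining acute folds into precomposed é
    ("nai\u0308ve", "na\u00efve"),  # combining diaeresis folds into ï
    ("한국어", "한국어"),             # precomposed Hangul is already NFC
    ("ｶﾀｶﾅ", "ｶﾀｶﾅ"),               # half-width kana: unchanged by NFC (only NFKC folds it)
])
def test_nfc(raw: str, expected: str) -> None:
    assert normalize_text(raw, form="NFC") == expected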
