Text Normalization
This guide explains Unicode text normalization and how to use it with gllm-intl to ensure consistent text processing in multilingual applications.
Installation
macOS/Linux:

```sh
# you can use a Conda environment
pip install --extra-index-url "https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/" python-dotenv gllm-intl
```

Windows (cmd):

```bat
FOR /F "tokens=*" %T IN ('gcloud auth print-access-token') DO pip install --extra-index-url "https://oauth2accesstoken:%T@glsdk.gdplabs.id/gen-ai-internal/simple/" python-dotenv gllm-intl
```

What is Text Normalization?
Text normalization is the process of converting text into a standard, canonical form. In Unicode, the same visual character can be represented in multiple ways using different byte sequences. Normalization ensures these equivalent representations are converted to a single consistent form.
Example: The Word "café"
The character "é" (e with acute accent) can be represented in two ways:
Precomposed (single character): é → U+00E9
Decomposed (base + combining mark): e + ´ → U+0065 + U+0301
Both look identical to humans, but computers see them as different byte sequences:
```python
cafe_precomposed = "café"       # \u00e9
cafe_decomposed = "cafe\u0301"  # e + combining acute

print(cafe_precomposed == cafe_decomposed)  # False! 😱
```

Text normalization solves this problem by converting both representations to the same canonical form.
Why Normalize Text?
1. String Comparison & Equality
Without normalization, visually identical strings may not match:
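For instance, using Python's standard `unicodedata` module (which implements the same Unicode normalization forms):

```python
import unicodedata

a = "café"        # precomposed \u00e9
b = "cafe\u0301"  # decomposed e + combining acute

print(a == b)  # False: different code-point sequences
# After NFC normalization, both compare equal:
print(unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b))  # True
```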
2. Database Storage & Retrieval
Ensure consistent storage to prevent duplicate entries:
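A minimal sketch of the idea, using a plain dict as a stand-in for a database table (`store_username` is an illustrative name, not a library function):

```python
import unicodedata

def store_username(db: dict, name: str) -> None:
    # Normalize to NFC before using the name as a key, so the
    # precomposed and decomposed spellings map to one entry.
    db[unicodedata.normalize("NFC", name)] = True

db = {}
store_username(db, "café")        # precomposed
store_username(db, "cafe\u0301")  # decomposed
print(len(db))  # 1 entry, not 2
```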
3. Search & Indexing
Make search results predictable:
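A sketch of search-term folding with the stdlib (`index_term` is an illustrative helper): NFKC folds the "ﬁ" ligature and full-width characters so queries typed either way match.

```python
import unicodedata

docs = ["ﬁle system basics", "ｆｕｌｌ－ｗｉｄｔｈ text"]

def index_term(s: str) -> str:
    # NFKC maps "ﬁ" -> "fi" and full-width letters/punctuation to ASCII
    return unicodedata.normalize("NFKC", s).casefold()

index = [index_term(d) for d in docs]
print("file" in index[0])  # True
print("full" in index[1])  # True
```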
4. Text Processing Pipelines
Ensure consistent input for downstream operations:
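One common pattern is a single ingestion function at the head of the pipeline (`ingest` is an illustrative name), so tokenization, hashing, and comparison downstream never see mixed forms:

```python
import unicodedata

def ingest(raw: str) -> str:
    # Normalize once, up front; every later stage sees one canonical form.
    return unicodedata.normalize("NFC", raw.strip())

tokens = ingest("  cafe\u0301 menu ").split()
print(tokens)  # ['café', 'menu'] with the precomposed é
```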
Understanding Unicode Normalization Forms
Unicode defines four normalization forms. The gllm-intl library supports all of them through the NormalizationForm enum.
NFC (Canonical Composition) - Recommended Default
Combines base characters with combining marks into precomposed forms when possible.
Use when:
✅ Storing text in databases
✅ Displaying text to users
✅ General-purpose text processing
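A quick stdlib illustration of composition — note the string gets shorter because base + combining mark collapse into one code point:

```python
import unicodedata

decomposed = "cafe\u0301"  # 5 code points: c a f e + combining acute
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # 5 4
print(composed == "caf\u00e9")         # True
```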
NFD (Canonical Decomposition)
Decomposes precomposed characters into base + combining marks.
Use when:
✅ Removing diacritics (decompose first, then strip combining marks)
✅ Linguistic analysis
✅ Sorting algorithms
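The diacritic-removal use case looks like this with the stdlib (`strip_accents` is an illustrative helper, not a gllm-intl function):

```python
import unicodedata

def strip_accents(s: str) -> str:
    # 1) decompose with NFD, 2) drop combining marks (category Mn)
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(strip_accents("café naïve"))  # 'cafe naive'
```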
NFKC (Compatibility Composition)
Like NFC, but also converts compatibility characters (ligatures, width variants) to standard forms.
Use when:
✅ Search functionality (normalizes ligatures, width variants)
✅ Case-insensitive comparisons
⚠️ Be careful: loses distinction between variants (e.g., full-width vs. half-width)
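Two stdlib examples of what compatibility composition folds, and loses:

```python
import unicodedata

print(unicodedata.normalize("NFKC", "ﬁle"))    # 'file'  — ligature expanded
print(unicodedata.normalize("NFKC", "１２３"))  # '123'  — full-width digits folded
# Caution: after NFKC the full-width/half-width distinction is gone for good.
```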
NFKD (Compatibility Decomposition)
Like NFD, but also decomposes compatibility characters.
Use when:
✅ Text analysis requiring maximum decomposition
✅ Preparing text for ASCII conversion
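ASCII preparation typically pairs NFKD with a lossy encode (`to_ascii` is an illustrative helper):

```python
import unicodedata

def to_ascii(s: str) -> str:
    # NFKD maximally decomposes; dropping non-ASCII bytes then yields plain ASCII
    return unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")

print(to_ascii("ﬁancée"))  # 'fiancee'
```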
Comparison Table
| Form | Canonical | Compatibility | Composed | Best for |
| --- | --- | --- | --- | --- |
| NFC | ✓ | ✗ | ✓ | General storage & display |
| NFD | ✓ | ✗ | ✗ | Diacritic removal, analysis |
| NFKC | ✓ | ✓ | ✓ | Search, case-insensitive ops |
| NFKD | ✓ | ✓ | ✗ | Text analysis, ASCII prep |
Quick Start
Normalize text in 2 lines:
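Assuming `normalize_text` is importable from `gllm_intl` with the parameters documented below, the quick start is shown in the comment; the equivalent standard-library call runs the same normalization:

```python
# With gllm-intl (import path assumed from this guide):
#   from gllm_intl import normalize_text
#   clean = normalize_text("cafe\u0301", form="NFC")
# Equivalent standard-library call:
import unicodedata
clean = unicodedata.normalize("NFC", "cafe\u0301")
print(clean == "café")  # True
```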
Normalization Functions
normalize_text() - Main Normalization Function
Normalize single strings or lists of strings:
Parameters:
text: Single string, list of strings, or None
form: Normalization form ("NFC", "NFD", "NFKC", "NFKD")
Returns:
Same type as input:
str→str,list→list
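A stdlib sketch of the documented behavior (this is not the library's source, just an illustration of the str/list/None contract using `unicodedata`):

```python
import unicodedata

def normalize_text(text, form="NFC"):
    """Sketch: str -> str, list -> list; None passes through unchanged."""
    if text is None:
        return None
    if isinstance(text, list):
        return [unicodedata.normalize(form, t) for t in text]
    return unicodedata.normalize(form, text)

print(normalize_text("cafe\u0301"))               # 'café'
print(normalize_text(["cafe\u0301", "naïve"]))    # list in, list out
```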
Removing Diacritics
Diacritics (accent marks) can be removed for accent-insensitive search and comparison.
remove_diacritics() - Strip Accent Marks
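A stdlib sketch of what a diacritic stripper does (not the library's source): decompose, drop combining marks, recompose.

```python
import unicodedata

def remove_diacritics(text: str) -> str:
    nfd = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in nfd if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

print(remove_diacritics("Crème brûlée"))  # 'Creme brulee'
```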
normalize_and_strip() - Normalize + Remove Diacritics
Convenience function that combines both operations:
Why combine normalization + stripping?
Diacritic removal works by:
1. Decomposing characters (NFD)
2. Filtering out combining marks
3. Recomposing to NFC
The normalize_and_strip() function does this efficiently in one call.
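Those steps can be sketched with the stdlib as follows (a hypothetical stand-in matching the described behavior, with the final form as a parameter; not the library's source):

```python
import unicodedata

def normalize_and_strip(text: str, form: str = "NFC") -> str:
    nfd = unicodedata.normalize("NFD", text)                       # decompose
    base = "".join(ch for ch in nfd if not unicodedata.combining(ch))  # drop marks
    return unicodedata.normalize(form, base)                       # recompose

print(normalize_and_strip("Ångström"))  # 'Angstrom'
```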
Common Use Cases
Use Case 1: Case-Insensitive Search
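One way to fold case, width variants, and accents for matching (`fold` and `fold_accentless` are illustrative helpers):

```python
import unicodedata

def fold(s: str) -> str:
    # NFKC handles width/ligature variants; casefold handles case
    return unicodedata.normalize("NFKC", s).casefold()

def fold_accentless(s: str) -> str:
    # Additionally drop combining marks for accent-insensitive matching
    nfd = unicodedata.normalize("NFD", fold(s))
    return "".join(ch for ch in nfd if not unicodedata.combining(ch))

print(fold_accentless("José García") == fold_accentless("JOSE GARCIA"))  # True
```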
Use Case 2: Database Unique Constraints
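A sketch of the canonical-key pattern: store a normalized, casefolded key alongside the raw value and put the unique constraint on the key (`canonical_key` is an illustrative name):

```python
import unicodedata

def canonical_key(value: str) -> str:
    return unicodedata.normalize("NFC", value).casefold()

existing = {canonical_key("René")}
print(canonical_key("RENE\u0301") in existing)  # True: duplicate caught
```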
Use Case 3: Slug Generation
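A compact slug generator built on NFKD + ASCII conversion (`slugify` is an illustrative helper):

```python
import re
import unicodedata

def slugify(title: str) -> str:
    # NFKD + ASCII-ignore drops accents; then lowercase and hyphenate
    ascii_text = unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")

print(slugify("Crème Brûlée Recipe!"))  # 'creme-brulee-recipe'
```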
Use Case 4: Email Address Normalization
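A hedged sketch (`normalize_email` is illustrative): NFC for Unicode consistency plus lowercasing the domain, which is case-insensitive in DNS. Whether to also casefold the local part is a policy decision, since it is case-sensitive by the letter of the email RFCs.

```python
import unicodedata

def normalize_email(email: str) -> str:
    local, _, domain = unicodedata.normalize("NFC", email.strip()).rpartition("@")
    return f"{local}@{domain.lower()}"

print(normalize_email("  Jose\u0301@Example.COM "))  # 'José@example.com'
```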
Use Case 5: Text Deduplication
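Deduplication by normalized key, sketched with the stdlib — the three spellings of "café" collapse to one entry:

```python
import unicodedata

entries = ["café", "cafe\u0301", "Café", "tea"]

seen, unique = set(), []
for entry in entries:
    key = unicodedata.normalize("NFC", entry).casefold()
    if key not in seen:
        seen.add(key)
        unique.append(entry)

print(unique)  # ['café', 'tea']
```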
Use Case 6: Batch Processing
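Batch processing is a single pass over a list; per the parameters documented above, gllm-intl's `normalize_text` accepts a list directly for the same effect. The stdlib equivalent:

```python
import unicodedata

batch = ["cafe\u0301", "re\u0301sume\u0301", "plain"]
normalized = [unicodedata.normalize("NFC", s) for s in batch]
print(normalized)  # ['café', 'résumé', 'plain']
```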
Best Practices
1. Always Use NFC for Storage
NFC (Canonical Composition) is the recommended form for storing and displaying text:
2. Normalize at System Boundaries
Normalize text as early as possible (at input):
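For example, normalizing every incoming string once at the edge of the system (`handle_signup` is an illustrative name):

```python
import unicodedata

def handle_signup(form_data: dict) -> dict:
    # Boundary of the system: normalize here so downstream code
    # never sees mixed normalization forms.
    return {
        k: unicodedata.normalize("NFC", v) if isinstance(v, str) else v
        for k, v in form_data.items()
    }

clean = handle_signup({"name": "Jose\u0301", "age": 30})
print(clean["name"] == "José")  # True
```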
3. Use Batch Processing for Performance
Process lists instead of individual strings:
4. Combine Normalization with Diacritic Removal for Search
5. Be Consistent Across Your Application
Choose one normalization strategy and apply it everywhere:
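One common way to enforce this is a single module-level constant that storage, search, and comparison all share (names here are illustrative):

```python
import unicodedata

CANONICAL_FORM = "NFC"  # the one form used everywhere in the application

def canonicalize(s: str) -> str:
    return unicodedata.normalize(CANONICAL_FORM, s)

print(canonicalize("cafe\u0301") == canonicalize("caf\u00e9"))  # True
```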
6. Document Your Normalization Strategy
Be explicit in your API documentation:
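For instance, a docstring note like this (the `save_comment` function is hypothetical):

```python
def save_comment(text: str) -> None:
    """Persist a user comment.

    Note:
        ``text`` is normalized to NFC before storage; callers should not
        rely on code-point-level round-tripping of decomposed input.
    """
    ...
```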
7. Test with Real Multilingual Data