Text Normalization
Installation
# you can use a Conda environment
pip install --extra-index-url "https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/" python-dotenv gllm-intl
Windows Command Prompt:
FOR /F "tokens=*" %T IN ('gcloud auth print-access-token') DO pip install --extra-index-url "https://oauth2accesstoken:%T@glsdk.gdplabs.id/gen-ai-internal/simple/" python-dotenv gllm-intl
What is Text Normalization?
Example: The Word "café"
cafe_precomposed = "café" # \u00e9
cafe_decomposed = "cafe\u0301" # e + combining acute
print(cafe_precomposed == cafe_decomposed) # False! 😱
Why Normalize Text?
1. String Comparison & Equality
2. Database Storage & Retrieval
3. Search & Indexing
4. Text Processing Pipelines
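As a minimal sketch of the first two motivations above, the snippet below uses Python's standard `unicodedata` module (not the gllm-intl functions, whose signatures are not shown on this page) to make the two spellings of "café" compare equal and resolve to the same dictionary key. The helper name `nfc` is hypothetical.

```python
import unicodedata

cafe_precomposed = "caf\u00e9"   # single code point for é
cafe_decomposed = "cafe\u0301"   # e followed by a combining acute accent


def nfc(text: str) -> str:
    """Return the NFC (canonical composition) form of text."""
    return unicodedata.normalize("NFC", text)


# Raw comparison fails; comparison after normalization succeeds.
raw_equal = cafe_precomposed == cafe_decomposed
normalized_equal = nfc(cafe_precomposed) == nfc(cafe_decomposed)

# Keys behave the same way: normalize before storing and before lookup.
index = {nfc(cafe_precomposed): "menu item 1"}
lookup = index.get(nfc(cafe_decomposed))

print(raw_equal, normalized_equal, lookup)
```

The same idea carries over to search indexes and pipelines: both the stored value and the incoming value need to pass through the same normalization step.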
Understanding Unicode Normalization Forms
NFC (Canonical Composition) - Recommended Default
NFD (Canonical Decomposition)
NFKC (Compatibility Composition)
NFKD (Compatibility Decomposition)
Comparison Table
Form | Canonical | Compatibility | Composed | Common Use
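To see how the four forms differ in practice, here is a small sketch that runs one sample string through each of them with the standard library's `unicodedata.normalize`. The sample string is an assumption chosen for illustration: it contains a ligature (affected only by the compatibility forms) and an accented letter (affected by composition versus decomposition).

```python
import unicodedata

# "ﬁancé": the fi ligature (U+FB01) plus a precomposed é (U+00E9)
sample = "\ufb01anc\u00e9"

forms = {
    form: unicodedata.normalize(form, sample)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, value in forms.items():
    # The code point count shows whether characters were split or merged.
    print(f"{form}: {value!r} ({len(value)} code points)")
```

NFC and NFD keep the ligature and differ only in whether é is one code point or two. NFKC and NFKD additionally replace the ligature with the plain letters "f" and "i".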
Quick Start
Normalization Functions
normalize_text() - Main Normalization Function
Removing Diacritics
remove_diacritics() - Strip Accent Marks
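The general technique behind stripping accent marks can be sketched with the standard library alone. This is not the gllm-intl implementation; `strip_marks_sketch` is a hypothetical stand-in that decomposes the text, drops nonspacing marks, and recomposes what remains.

```python
import unicodedata


def strip_marks_sketch(text: str) -> str:
    """Remove combining accent marks (hypothetical helper, stdlib only)."""
    # Decompose so that each accent becomes a separate combining character.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop nonspacing marks (Unicode category "Mn"), keep everything else.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Recompose so the result is in NFC like the rest of the data.
    return unicodedata.normalize("NFC", stripped)


print(strip_marks_sketch("Crème Brûlée"))
```

Letters such as "ø" have no canonical decomposition, so this approach leaves them unchanged.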
normalize_and_strip() - Normalize + Remove Diacritics
Common Use Cases
Use Case 1: Case-Insensitive Search
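One way to approach this use case, sketched with the standard library rather than the gllm-intl functions: build a comparison key by case-folding, decomposing, and dropping accent marks, then match on the keys. The names `search_key` and `matches` are hypothetical.

```python
import unicodedata


def search_key(text: str) -> str:
    """Build a case- and accent-insensitive comparison key (sketch)."""
    # casefold() is a more aggressive lower(); for example "ß" becomes "ss".
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    # Remove the combining marks exposed by the decomposition.
    return "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )


def matches(query: str, candidate: str) -> bool:
    """Return True if query occurs in candidate, ignoring case and accents."""
    return search_key(query) in search_key(candidate)


print(matches("CAFE", "Le Café de Paris"))
```

The original text stays untouched for display; only the derived key is used for matching.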
Use Case 2: Database Unique Constraints
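The following sketch shows why normalization matters for unique constraints, using an in-memory SQLite database and the standard `unicodedata` module. The table name `users` and the helper `insert_user` are assumptions for illustration only.

```python
import sqlite3
import unicodedata

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT UNIQUE)")


def insert_user(name: str) -> None:
    """Insert a user after normalizing the name to NFC (hypothetical helper)."""
    normalized = unicodedata.normalize("NFC", name)
    conn.execute("INSERT INTO users (name) VALUES (?)", (normalized,))


insert_user("Jos\u00e9")          # precomposed é

duplicate_rejected = False
try:
    insert_user("Jose\u0301")     # decomposed é, visually identical
except sqlite3.IntegrityError:
    # The UNIQUE constraint fires because both names normalize to the same text.
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(duplicate_rejected, row_count)
```

Without the normalization step, the database would store both byte sequences as distinct values and the constraint would not catch the duplicate.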
Use Case 3: Slug Generation
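A common pattern for slugs, sketched here with the standard library only: decompose with NFKD, discard anything that is not ASCII, then collapse runs of non-alphanumeric characters into hyphens. The function name `slugify_sketch` is hypothetical, and dropping all non-ASCII characters is a simplifying assumption that will not suit non-Latin scripts.

```python
import re
import unicodedata


def slugify_sketch(title: str) -> str:
    """Turn a title into a lowercase ASCII slug (illustrative sketch)."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", title)
    # Encoding to ASCII with "ignore" drops the marks and other non-ASCII.
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    # Replace every run of non-alphanumeric characters with one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_text.lower())
    return slug.strip("-")


print(slugify_sketch("Crème Brûlée Recipe!"))
```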
Use Case 4: Email Address Normalization
Use Case 5: Text Deduplication
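Deduplication can be sketched as "normalize, then compare". The example below uses NFC from the standard library and keeps the first occurrence of each distinct value; `dedupe_sketch` is a hypothetical helper, not part of gllm-intl.

```python
import unicodedata
from typing import Iterable, List


def dedupe_sketch(items: Iterable[str]) -> List[str]:
    """Return items with Unicode-equivalent duplicates removed, order kept."""
    seen = set()
    unique = []
    for item in items:
        key = unicodedata.normalize("NFC", item)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


names = ["caf\u00e9", "cafe\u0301", "tea", "caf\u00e9"]
result = dedupe_sketch(names)
print(result)
```

If accent-insensitive or case-insensitive deduplication is wanted instead, the key can be built with a stronger transformation while the stored value stays in NFC.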
Use Case 6: Batch Processing
Best Practices
1. Always Use NFC for Storage
2. Normalize at System Boundaries
3. Use Batch Processing for Performance
4. Combine Normalization with Diacritic Removal for Search
5. Be Consistent Across Your Application
6. Document Your Normalization Strategy
7. Test with Real Multilingual Data