Transliteration
This guide explains transliteration and how to use gllm-intl to convert text between writing systems for multilingual applications.
Installation
In a shell that supports $(...) command substitution (for example bash or PowerShell):
# you can use a Conda environment
pip install --extra-index-url "https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/" python-dotenv gllm-intl
In Windows Command Prompt:
FOR /F "tokens=*" %T IN ('gcloud auth print-access-token') DO pip install --extra-index-url "https://oauth2accesstoken:%T@glsdk.gdplabs.id/gen-ai-internal/simple/" python-dotenv gllm-intl
What is Transliteration?
Transliteration is the process of converting text from one writing system (script) to another while preserving pronunciation. Unlike translation (which converts meaning), transliteration converts the sounds or characters of words.
Examples:
Москва (Cyrillic) → Moskva (Latin)
北京 (Han, Chinese) → Beijing (Latin)
Привет (Cyrillic) → Privet (Latin)
مرحبا (Arabic) → mrḥbạ (Latin)
こんにちは (Hiragana) → Kon'nichiha (Latin)
The gllm-intl library uses ICU (International Components for Unicode) via PyICU for accurate, standards-based transliteration across multiple writing systems.
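To make the definition concrete, here is a toy character-mapping sketch in plain Python. It does not use gllm-intl or ICU. The table and the name `romanize_demo` are illustrative assumptions, and the table covers only the two Cyrillic words from the examples above.

```python
# Toy Cyrillic-to-Latin table: the character at position i of the first
# string maps to the character at position i of the second string.
# It covers only the letters of the two demo words; real transliterators
# such as ICU's ship complete, standards-based rule sets.
DEMO_TABLE = str.maketrans("МоскваПривет", "MoskvaPrivet")


def romanize_demo(text: str) -> str:
    """Replace each mapped character; anything not in the table is kept."""
    return text.translate(DEMO_TABLE)


print(romanize_demo("Москва"))  # Moskva
print(romanize_demo("Привет"))  # Privet
```

The sounds of each word are carried over letter by letter. The meaning is never consulted, which is the difference from translation.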
Why Transliterate Text?
1. Search & Indexing
Enable searches in Latin characters for non-Latin content:
2. URL Slugs & Identifiers
Create readable, ASCII-safe URLs from any script:
3. Data Integration
Convert names and addresses to a common script for processing:
4. Display Fallbacks
Show transliterated text when original script fonts are unavailable:
5. Cross-Script Communication
Enable users to type in their preferred script:
Quick Start
Transliterate in 2 lines:
Convert to ASCII:
Supported Scripts
The gllm-intl library supports transliteration between these scripts via the SupportedScripts enum:
Supported Scripts:
Latin ("Hello"): Latin/Roman alphabet (a-z)
Cyrillic ("Привет"): Cyrillic alphabet (Russian, etc.)
Arabic ("مرحبا"): Arabic script
Greek ("Γειά"): Greek alphabet
Han ("你好"): Chinese characters (Hanzi)
Hebrew ("שלום"): Hebrew script
Hiragana ("こんにちは"): Japanese Hiragana
Katakana ("コンニチハ"): Japanese Katakana
Common Script Pairs:
ICU provides optimized transliterators for these pairs:
Cyrillic → Latin: Russian, Ukrainian, etc. to Roman letters
Arabic → Latin: Arabic script to Roman letters
Greek → Latin: Greek alphabet to Roman letters
Han → Latin: Chinese to Pinyin
Hebrew → Latin: Hebrew script to Roman letters
Hiragana ↔ Katakana: Japanese script conversion
Hiragana/Katakana → Latin: Japanese to Romanization (Romaji)
Any → Latin: Auto-detect source script, convert to Latin
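The pairs above correspond to ICU transform IDs of the form `Source-Target`, such as `Cyrillic-Latin` or `Any-Latin`. The sketch below shows how such an ID could be assembled from an enum. `DemoScript` and `icu_transform_id` are hypothetical names, and whether gllm-intl builds its IDs this way internally is an assumption.

```python
from enum import Enum
from typing import Optional


class DemoScript(Enum):
    """Hypothetical stand-in for the library's SupportedScripts enum."""
    LATIN = "Latin"
    CYRILLIC = "Cyrillic"
    ARABIC = "Arabic"
    GREEK = "Greek"
    HAN = "Han"
    HEBREW = "Hebrew"
    HIRAGANA = "Hiragana"
    KATAKANA = "Katakana"


def icu_transform_id(target: DemoScript, source: Optional[DemoScript] = None) -> str:
    """Build an ICU transform ID; a missing source becomes ICU's "Any"."""
    source_name = source.value if source is not None else "Any"
    return f"{source_name}-{target.value}"


print(icu_transform_id(DemoScript.LATIN, DemoScript.CYRILLIC))     # Cyrillic-Latin
print(icu_transform_id(DemoScript.KATAKANA, DemoScript.HIRAGANA))  # Hiragana-Katakana
print(icu_transform_id(DemoScript.LATIN))                          # Any-Latin
```

With PyICU installed, an ID like these is what `icu.Transliterator.createInstance()` accepts.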
Basic Transliteration
transliterate() - Main Function
Convert text between any supported scripts:
Specifying Source Script
For better accuracy, specify the source script explicitly:
Japanese Script Conversion
Convert between Hiragana, Katakana, and Latin:
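The Hiragana and Katakana blocks in Unicode are laid out in parallel, so the basic conversion can be illustrated with a fixed code point offset. This is a stdlib-only sketch with hypothetical function names, not the library's implementation. ICU's own rules also cover cases this simple offset does not attempt, such as long vowel marks.

```python
# Hiragana U+3041..U+3096 and Katakana U+30A1..U+30F6 are 0x60 code points apart.
KANA_OFFSET = 0x60


def hiragana_to_katakana(text: str) -> str:
    """Shift Hiragana letters into the Katakana block; keep everything else."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET) if "\u3041" <= ch <= "\u3096" else ch
        for ch in text
    )


def katakana_to_hiragana(text: str) -> str:
    """Shift Katakana letters into the Hiragana block; keep everything else."""
    return "".join(
        chr(ord(ch) - KANA_OFFSET) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )


print(hiragana_to_katakana("こんにちは"))  # コンニチハ
print(katakana_to_hiragana("コンニチハ"))  # こんにちは
```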
Unicode Characters Preserved
Characters without transliteration mappings remain unchanged:
ASCII Conversion
to_ascii() - Convert Any Script to ASCII
The to_ascii() function provides a fallback mechanism to convert any Unicode text to ASCII-safe characters, useful for systems that only support ASCII.
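For intuition about what an ASCII fallback involves, the following stdlib-only sketch decomposes characters, drops combining marks, and substitutes a placeholder for anything still outside ASCII. `ascii_fallback` and its `placeholder` parameter are hypothetical; this is not how to_ascii() is implemented, and unlike a real transliterator it cannot romanize non-Latin scripts.

```python
import unicodedata


def ascii_fallback(text: str, placeholder: str = "?") -> str:
    """Approximate ASCII folding without any transliteration rules."""
    decomposed = unicodedata.normalize("NFKD", text)
    out = []
    for ch in decomposed:
        if unicodedata.combining(ch):
            continue  # drop accents that NFKD split off from their base letter
        out.append(ch if ord(ch) < 128 else placeholder)
    return "".join(out)


print(ascii_fallback("Café Zürich"))  # Cafe Zurich
print(ascii_fallback("Москва"))       # ??????
```

The second call shows why a script-aware transliteration step is needed before ASCII folding: without it, Cyrillic input is lost entirely.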
Case Preservation
Control case handling with preserve_case:
How Case Preservation Works
Advanced Features
Reusable Transliterators
For repeated operations, create and cache transliterators:
Note: Transliterators are automatically cached per thread, so calling transliterate() multiple times with the same scripts reuses the cached instance.
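A per-thread cache of this kind can be sketched with `threading.local`. The class and function names below are hypothetical stand-ins, and the library's actual cache may be organized differently.

```python
import threading

_local = threading.local()


class DemoTransliterator:
    """Hypothetical stand-in for an object that is expensive to construct."""

    def __init__(self, source: str, target: str) -> None:
        self.transform_id = f"{source}-{target}"


def get_transliterator(source: str, target: str) -> DemoTransliterator:
    """Return this thread's cached instance for the pair, creating it once."""
    cache = getattr(_local, "cache", None)
    if cache is None:
        cache = _local.cache = {}
    key = (source, target)
    if key not in cache:
        cache[key] = DemoTransliterator(source, target)
    return cache[key]


first = get_transliterator("Cyrillic", "Latin")
second = get_transliterator("Cyrillic", "Latin")
print(first is second)  # True
```

Because the cache lives in thread-local storage, each thread builds its own instance and no locking is needed around the lookup.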
Thread Safety
Transliterators use thread-local storage, making them safe for concurrent use:
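The pattern looks roughly like this when work is spread over a thread pool. The toy `romanize` function stands in for a real transliteration call and is an assumption, as is the pool size.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import List

_state = threading.local()


def romanize(text: str) -> str:
    """Toy conversion; each worker thread lazily builds its own table."""
    if not hasattr(_state, "table"):
        _state.table = str.maketrans("Москва", "Moskva")
    return text.translate(_state.table)


def romanize_all(texts: List[str], workers: int = 4) -> List[str]:
    """Convert many strings concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(romanize, texts))


print(romanize_all(["Москва", "Москва"]))  # ['Moskva', 'Moskva']
```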
Common Use Cases
Use Case 1: URL Slug Generation
Generate SEO-friendly slugs from any script:
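A slug pipeline usually has three steps: romanize, fold to ASCII, then collapse everything that is not a letter or digit into hyphens. The sketch below uses only the standard library. The `romanize` parameter is a hypothetical hook where a real transliteration function would be passed; it defaults to doing nothing.

```python
import re
import unicodedata
from typing import Callable


def slugify(text: str, romanize: Callable[[str], str] = lambda s: s) -> str:
    """Build a lowercase, hyphen-separated ASCII slug."""
    latin = romanize(text)
    # Split accents from base letters, then drop every non-ASCII code point.
    folded = unicodedata.normalize("NFKD", latin).encode("ascii", "ignore").decode("ascii")
    # Replace each run of other characters with one hyphen and trim the ends.
    return re.sub(r"[^a-z0-9]+", "-", folded.lower()).strip("-")


toy_table = str.maketrans("Москва", "Moskva")  # toy stand-in for a transliterator

print(slugify("Crème Brûlée Recipe!"))                              # creme-brulee-recipe
print(slugify("Москва 2024", romanize=lambda s: s.translate(toy_table)))  # moskva-2024
```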
Use Case 2: Search Index Creation
Build searchable indexes for non-Latin content:
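One simple approach is to index every title under both its original form and its romanized form, so a query typed in either script finds it. In this sketch `demo_romanize` is a toy stand-in for a real transliteration call, and `build_index` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Set

# Toy stand-in for a real transliterator; it covers one demo word only.
_DEMO_TABLE = str.maketrans("Москва", "Moskva")


def demo_romanize(text: str) -> str:
    return text.translate(_DEMO_TABLE)


def build_index(titles: Iterable[str], romanize: Callable[[str], str]) -> Dict[str, Set[str]]:
    """Map case-folded original and romanized keys to the original titles."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for title in titles:
        for key in (title.casefold(), romanize(title).casefold()):
            index[key].add(title)
    return index


index = build_index(["Москва", "Berlin"], demo_romanize)
print(index["moskva"])  # {'Москва'}
```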
Use Case 3: Name Normalization
Normalize names from various scripts for consistency:
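A comparison key for names can be sketched as: romanize, strip accents, case-fold, and collapse whitespace. `name_key` is a hypothetical helper, and its `romanize` hook defaults to a no-op, so as written it only handles names that are already in Latin script.

```python
import unicodedata
from typing import Callable


def name_key(name: str, romanize: Callable[[str], str] = lambda s: s) -> str:
    """Return a key under which spelling variants of a name compare equal."""
    decomposed = unicodedata.normalize("NFKD", romanize(name))
    # Remove combining marks so "é" and "e" produce the same key.
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Case-fold and collapse runs of whitespace into single spaces.
    return " ".join(without_marks.casefold().split())


print(name_key("  José   GARCÍA "))                          # jose garcia
print(name_key("José García") == name_key("JOSE GARCIA"))    # True
```

Keep the original spelling alongside the key. The key is for matching only and is not suitable for display.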
Use Case 4: Multi-Script Form Validation
Validate input across different writing systems:
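One way to validate a field is to check which scripts its letters belong to. The sketch below approximates the script of a character from the first word of its Unicode name, which is a stdlib heuristic rather than the true Unicode Script property (Han characters, for instance, are reported as "CJK"). Both function names are hypothetical.

```python
import unicodedata
from typing import Set


def scripts_used(text: str) -> Set[str]:
    """Collect the name prefix (e.g. LATIN, CYRILLIC) of every letter."""
    found = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                found.add(name.split()[0])
    return found


def is_single_script(text: str) -> bool:
    """True when all letters in the text share one script prefix."""
    return len(scripts_used(text)) <= 1


print(sorted(scripts_used("Hello Москва")))  # ['CYRILLIC', 'LATIN']
print(is_single_script("Москва 2024"))       # True
```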
Use Case 5: Display with Romanization
Show original text with romanized version for clarity:
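A small formatting helper illustrates the idea. `with_romanization` is hypothetical, and the lambda passed to it is a toy stand-in for a real transliteration call.

```python
from typing import Callable


def with_romanization(original: str, romanize: Callable[[str], str]) -> str:
    """Append the romanized form in parentheses when it differs."""
    romanized = romanize(original)
    if romanized == original:
        return original  # already Latin, nothing to add
    return f"{original} ({romanized})"


toy_table = str.maketrans("Москва", "Moskva")  # toy mapping for the demo only


def toy(text: str) -> str:
    return text.translate(toy_table)


print(with_romanization("Москва", toy))  # Москва (Moskva)
print(with_romanization("Berlin", toy))  # Berlin
```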
Use Case 6: File Name Sanitization
Create safe file names from any Unicode input:
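After text has been romanized, the remaining work is to remove characters that file systems reject. The sketch below is stdlib only. `safe_filename`, the reserved-character set, and the `untitled` fallback are assumptions chosen for illustration.

```python
import re
import unicodedata

# Characters rejected by common file systems, plus control chars and whitespace.
_UNSAFE = re.compile(r'[<>:"/\\|?*\x00-\x1f\s]+')


def safe_filename(name: str, max_length: int = 255) -> str:
    """Fold to ASCII, replace unsafe runs with "_", and never return ""."""
    folded = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    cleaned = _UNSAFE.sub("_", folded).strip("._")
    return cleaned[:max_length] or "untitled"


print(safe_filename("Résumé: final?.pdf"))  # Resume_final_.pdf
print(safe_filename("Москва"))              # untitled
```

The second call falls back to `untitled` because no transliteration was applied first; romanizing the input beforehand would preserve the name.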
Best Practices
1. Specify Source Script for Accuracy
When you know the source script, specify it explicitly:
2. Use to_ascii() for System Compatibility
When dealing with legacy systems that only support ASCII:
3. Cache Transliterators for Performance
For bulk operations, reuse transliterator instances:
4. Combine with Normalization
Normalize before transliterating for consistency:
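The same visible character can be stored as one code point or as a base letter plus a combining mark, and rules keyed on one form may miss the other. The sketch below uses `unicodedata.normalize` to compose text first. Choosing NFC here is an assumption, and `prepare` is a hypothetical helper.

```python
import unicodedata


def prepare(text: str) -> str:
    """Compose base letters and combining marks into single code points."""
    return unicodedata.normalize("NFC", text)


# Cyrillic "й" written as U+0438 followed by a combining breve (U+0306).
decomposed = "\u0438\u0306"
composed = prepare(decomposed)

print(len(decomposed), len(composed))  # 2 1
print(composed == "\u0439")            # True
```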
5. Handle Mixed Scripts Gracefully
Not all characters can be transliterated - handle them appropriately:
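One way to detect leftovers is to scan the output for letters that are still not Latin. The check below relies on Unicode character names from the standard library. `non_latin_letters` is a hypothetical helper, and the input strings stand in for real transliteration output.

```python
import unicodedata
from typing import List


def non_latin_letters(text: str) -> List[str]:
    """Return the distinct letters whose Unicode name does not start with LATIN."""
    leftovers = {
        ch for ch in text
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN")
    }
    return sorted(leftovers)


print(non_latin_letters("Privet 北京"))  # two Han characters remain
print(non_latin_letters("mrḥbạ"))        # []
```

Latin letters with diacritics count as converted, so romanized Arabic such as the second input is not flagged. What to do with flagged characters (drop, replace, or reject the input) is an application decision.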
6. Test with Real Data
Test transliteration with actual text in target languages:
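A table of known input and expected output pairs keeps such checks easy to extend. In this sketch the two cases come from the examples earlier in this guide, `demo_romanize` is a toy stand-in for the function under test, and `failures` is a hypothetical helper.

```python
from typing import Callable, Iterable, List, Tuple

# Toy stand-in for the function under test; it covers the demo words only.
_DEMO_TABLE = str.maketrans("МоскваПривет", "MoskvaPrivet")


def demo_romanize(text: str) -> str:
    return text.translate(_DEMO_TABLE)


CASES = [
    ("Москва", "Moskva"),
    ("Привет", "Privet"),
]


def failures(
    romanize: Callable[[str], str], cases: Iterable[Tuple[str, str]]
) -> List[Tuple[str, str, str]]:
    """Return (input, expected, actual) for every case that does not match."""
    return [(src, want, romanize(src)) for src, want in cases if romanize(src) != want]


print(failures(demo_romanize, CASES))  # []
```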
7. Document Transliteration Scheme
Be clear about which transliteration standard you're using: