Transliteration

This guide explains transliteration and how to use gllm-intl to convert text between writing systems for multilingual applications.

Installation

# you can use a Conda environment
pip install --extra-index-url "https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/" python-dotenv gllm-intl

What is Transliteration?

Transliteration is the process of converting text from one writing system (script) to another while preserving pronunciation. Unlike translation (which converts meaning), transliteration converts the sounds or characters of words.

Examples:

| Original | Script | Transliterated | Target Script |
| --- | --- | --- | --- |
| Москва | Cyrillic | Moskva | Latin |
| 北京 | Han (Chinese) | Beijing | Latin |
| Привет | Cyrillic | Privet | Latin |
| مرحبا | Arabic | mrḥbạ | Latin |
| こんにちは | Hiragana | Kon'nichiha | Latin |

The gllm-intl library uses ICU (International Components for Unicode) via PyICU for accurate, standards-based transliteration across multiple writing systems.
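As a rough mental model, a transliterator can be pictured as a character mapping. The sketch below is a hand-written toy that covers only the letters in the two Cyrillic examples above; it is not how ICU works internally (ICU applies context-sensitive rules rather than a flat table), and `toy_transliterate` is a made-up name, not part of gllm-intl.

```python
# Toy Cyrillic-to-Latin table: covers only the letters needed for the demo words.
CYRILLIC_TO_LATIN = str.maketrans({
    "М": "M", "П": "P",
    "а": "a", "в": "v", "е": "e", "и": "i", "к": "k",
    "о": "o", "р": "r", "с": "s", "т": "t",
})


def toy_transliterate(text: str) -> str:
    """Map each known Cyrillic letter to a Latin letter; leave the rest as-is."""
    return text.translate(CYRILLIC_TO_LATIN)


if __name__ == "__main__":
    print(toy_transliterate("Москва"))  # Moskva
    print(toy_transliterate("Привет"))  # Privet
```

The output is a respelling of the same word, not its English translation ("Moscow", "Hello").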


Why Transliterate Text?

1. Search & Indexing

Let users search non-Latin content with Latin characters, for example matching "Москва" on the query "Moskva".

2. URL Slugs & Identifiers

Create readable, ASCII-safe URLs from text in any script.

3. Data Integration

Convert names and addresses to a common script before processing.

4. Display Fallbacks

Show transliterated text when fonts for the original script are unavailable.

5. Cross-Script Communication

Let users type in their preferred script.


Quick Start

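The two main entry points are transliterate(), which converts text between scripts, and to_ascii(), which converts text in any script to ASCII. Both are described in the sections below.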


Supported Scripts

The gllm-intl library supports transliteration between these scripts via the SupportedScripts enum:

| Script | Example | Description |
| --- | --- | --- |
| Latin | "Hello" | Latin/Roman alphabet (a-z) |
| Cyrillic | "Привет" | Cyrillic alphabet (Russian, etc.) |
| Arabic | "مرحبا" | Arabic script |
| Greek | "Γειά" | Greek alphabet |
| Han | "你好" | Chinese characters (Hanzi) |
| Hebrew | "שלום" | Hebrew script |
| Hiragana | "こんにちは" | Japanese Hiragana |
| Katakana | "コンニチハ" | Japanese Katakana |

Common Script Pairs:

ICU provides optimized transliterators for these pairs:

  • Cyrillic → Latin: Russian, Ukrainian, etc. to Roman letters

  • Arabic → Latin: Arabic script to Roman letters

  • Greek → Latin: Greek alphabet to Roman letters

  • Han → Latin: Chinese to Pinyin

  • Hebrew → Latin: Hebrew script to Roman letters

  • Hiragana ↔ Katakana: Japanese script conversion

  • Hiragana/Katakana → Latin: Japanese to Romanization (Romaji)

  • Any → Latin: Auto-detect source script, convert to Latin
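For orientation, ICU itself names each of these pairs with a transform ID of the form `Source-Target`, and several IDs can be chained with `;`. The sketch below builds such IDs and, when PyICU is installed, hands them to `icu.Transliterator.createInstance`. The helper names are illustrative only, and how gllm-intl maps its SupportedScripts values onto ICU IDs is an assumption not covered by this guide.

```python
def transform_id(source: str, target: str) -> str:
    """Build an ICU transform ID such as 'Cyrillic-Latin' or 'Any-Latin'."""
    return f"{source}-{target}"


def chain(*ids: str) -> str:
    """Chain transforms so they run left to right, e.g. 'Any-Latin; Latin-ASCII'."""
    return "; ".join(ids)


def icu_transliterate(text: str, compound_id: str) -> str:
    """Run an ICU transform directly. Requires PyICU, imported lazily here."""
    import icu

    return icu.Transliterator.createInstance(compound_id).transliterate(text)


if __name__ == "__main__":
    print(transform_id("Hiragana", "Katakana"))
    print(chain(transform_id("Any", "Latin"), transform_id("Latin", "ASCII")))
```

Chaining `Any-Latin` with `Latin-ASCII` is a common way to go from an arbitrary script to plain ASCII in a single ICU call.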


Basic Transliteration

transliterate() - Main Function

Convert text between any supported scripts.

Specifying Source Script

For better accuracy, specify the source script explicitly when you know it.

Japanese Script Conversion

Convert between Hiragana, Katakana, and Latin:
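The Hiragana ↔ Katakana direction is mechanical enough to sketch with the standard library: the two Unicode blocks are laid out in parallel, 0x60 code points apart. This only illustrates what the conversion does; it is not the library's ICU-backed implementation, it ignores details such as the prolonged sound mark ー, and the function names are hypothetical. Romanization needs real transliteration rules and is not attempted here.

```python
HIRAGANA_START, HIRAGANA_END = 0x3041, 0x3096  # ぁ .. ゖ
KANA_OFFSET = 0x60  # the Katakana block sits 0x60 code points above Hiragana


def hiragana_to_katakana(text: str) -> str:
    """Shift Hiragana letters into the Katakana block; leave other characters alone."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET) if HIRAGANA_START <= ord(ch) <= HIRAGANA_END else ch
        for ch in text
    )


def katakana_to_hiragana(text: str) -> str:
    """Shift Katakana letters (ァ .. ヶ) back into the Hiragana block."""
    lo, hi = HIRAGANA_START + KANA_OFFSET, HIRAGANA_END + KANA_OFFSET
    return "".join(
        chr(ord(ch) - KANA_OFFSET) if lo <= ord(ch) <= hi else ch
        for ch in text
    )


if __name__ == "__main__":
    print(hiragana_to_katakana("こんにちは"))  # コンニチハ
    print(katakana_to_hiragana("コンニチハ"))  # こんにちは
```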

Unicode Characters Preserved

Characters without transliteration mappings remain unchanged in the output.


ASCII Conversion

to_ascii() - Convert Any Script to ASCII

The to_ascii() function provides a fallback mechanism to convert any Unicode text to ASCII-safe characters, useful for systems that only support ASCII.
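To show what an ASCII fallback involves, here is a standard-library sketch. `ascii_fallback` and its `placeholder` parameter are hypothetical names, not part of gllm-intl, and the behavior is deliberately cruder than to_ascii(): it only strips diacritics from Latin text and replaces everything else, whereas an ICU-based conversion can romanize non-Latin scripts first.

```python
import unicodedata


def ascii_fallback(text: str, placeholder: str = "?") -> str:
    """Strip diacritics from Latin text; replace any other non-ASCII character."""
    decomposed = unicodedata.normalize("NFKD", text)
    out = []
    for ch in decomposed:
        if ch.isascii():
            out.append(ch)
        elif unicodedata.combining(ch):
            continue  # drop combining accents exposed by the decomposition
        else:
            out.append(placeholder)  # no ASCII equivalent without real transliteration
    return "".join(out)


if __name__ == "__main__":
    print(ascii_fallback("Café"))   # Cafe
    print(ascii_fallback("mrḥbạ"))  # mrhba
    print(ascii_fallback("北京"))    # ??
```

The last line shows why a plain normalization pass is not enough for non-Latin scripts: without a transliteration step, the characters are simply lost.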

Case Preservation

Control case handling with the preserve_case option.

How Case Preservation Works


Advanced Features

Reusable Transliterators

For repeated operations, create a transliterator once and reuse it.

Note: Transliterators are automatically cached per thread, so calling transliterate() multiple times with the same scripts reuses the cached instance.

Thread Safety

Transliterators use thread-local storage, making them safe for concurrent use:
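The per-thread caching pattern described above can be sketched with `threading.local`. This is a generic illustration under assumptions, not gllm-intl's actual code: `get_cached` and `make_stub` are hypothetical names, and the stub factory stands in for whatever builds a real transliterator.

```python
import threading

_local = threading.local()  # each thread sees its own attributes on this object


def get_cached(source: str, target: str, factory):
    """Return this thread's instance for (source, target), creating it on first use."""
    cache = getattr(_local, "cache", None)
    if cache is None:
        cache = _local.cache = {}
    key = (source, target)
    if key not in cache:
        cache[key] = factory(source, target)
    return cache[key]


def make_stub(source: str, target: str):
    """Stand-in factory; a real one would construct a transliterator."""
    return object()


if __name__ == "__main__":
    first = get_cached("Cyrillic", "Latin", make_stub)
    second = get_cached("Cyrillic", "Latin", make_stub)
    print(first is second)  # True: same thread, same cached instance
```

Because no instance is ever shared between threads, no locking is needed around its use; the cost is one instance per thread per script pair.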


Common Use Cases

Use Case 1: URL Slug Generation

Generate SEO-friendly slugs from any script:
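A minimal sketch of the slug pipeline follows. The romanization step is passed in as a callable so the example runs without the library; `demo_romanize` is a tiny lookup-table stand-in for a real transliteration call, and `slugify` is a hypothetical helper, not a gllm-intl function.

```python
import re
import unicodedata
from typing import Callable

DEMO = {"Привет": "Privet", "Москва": "Moskva"}


def demo_romanize(text: str) -> str:
    """Stand-in for a real transliterator: word-by-word table lookup."""
    return " ".join(DEMO.get(word, word) for word in text.split())


def slugify(text: str, romanize: Callable[[str], str]) -> str:
    """Romanize, fold to ASCII, lowercase, and join word runs with hyphens."""
    latin = romanize(text)
    folded = unicodedata.normalize("NFKD", latin).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", folded.lower()).strip("-")


if __name__ == "__main__":
    print(slugify("Привет Москва", demo_romanize))   # privet-moskva
    print(slugify("Café du Monde!", demo_romanize))  # cafe-du-monde
```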

Use Case 2: Search Index Creation

Build searchable indexes for non-Latin content:
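One simple way to do this, sketched under assumptions, is to index every title under both its original and its romanized form. `build_index`, `search`, and the lookup-table `demo_romanize` are illustrative names; a real index would call the library's transliteration function instead of the stand-in.

```python
from collections import defaultdict

DEMO = {"Москва": "Moskva", "北京": "Beijing"}


def demo_romanize(text: str) -> str:
    """Stand-in for a real transliterator: exact-match table lookup."""
    return DEMO.get(text, text)


def build_index(titles, romanize):
    """Map case-folded original and romanized keys to the original titles."""
    index = defaultdict(set)
    for title in titles:
        index[title.casefold()].add(title)
        index[romanize(title).casefold()].add(title)
    return index


def search(index, query: str):
    """Exact-match lookup, case-insensitive."""
    return sorted(index.get(query.casefold(), set()))


if __name__ == "__main__":
    idx = build_index(["Москва", "北京"], demo_romanize)
    print(search(idx, "moskva"))   # ['Москва']
    print(search(idx, "Beijing"))  # ['北京']
```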

Use Case 3: Name Normalization

Normalize names written in different scripts to a single consistent form.

Use Case 4: Multi-Script Form Validation

Validate input across different writing systems:
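One plausible validation rule is "a field may use any allowed script, but only one at a time", which also catches look-alike characters mixed into Latin text. The sketch below approximates script detection from Unicode character names in the standard library; it is a heuristic for illustration, the function names are hypothetical, and ICU exposes proper script properties that a production check should prefer.

```python
import unicodedata

# First word of the Unicode character name -> script label used in this guide.
SCRIPT_PREFIXES = {
    "LATIN": "Latin", "CYRILLIC": "Cyrillic", "ARABIC": "Arabic", "GREEK": "Greek",
    "CJK": "Han", "HEBREW": "Hebrew", "HIRAGANA": "Hiragana", "KATAKANA": "Katakana",
}


def scripts_in(text: str) -> set:
    """Return the set of scripts used by the letters in text."""
    found = set()
    for ch in text:
        if not ch.isalpha():
            continue  # ignore digits, spaces, punctuation
        prefix = unicodedata.name(ch, "").split(" ", 1)[0]
        found.add(SCRIPT_PREFIXES.get(prefix, "Other"))
    return found


def is_single_script(text: str, allowed) -> bool:
    """True if text uses exactly one script and that script is allowed."""
    found = scripts_in(text)
    return len(found) == 1 and found <= set(allowed)


if __name__ == "__main__":
    print(is_single_script("北京", ["Han", "Latin"]))    # True
    print(is_single_script("P\u0430ypal", ["Latin"]))   # False: U+0430 is Cyrillic
```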

Use Case 5: Display with Romanization

Show original text with romanized version for clarity:
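A small formatting helper is usually all this needs. In the sketch below, `with_romanization` is a hypothetical name and `demo_romanize` is a lookup-table stand-in for a real transliteration call; the romanized form is omitted when it would just repeat the original.

```python
DEMO = {"Москва": "Moskva", "北京": "Beijing"}


def demo_romanize(text: str) -> str:
    """Stand-in for a real transliterator: exact-match table lookup."""
    return DEMO.get(text, text)


def with_romanization(text: str, romanize) -> str:
    """Return 'original (romanized)', or just the original if nothing changed."""
    latin = romanize(text)
    return text if latin == text else f"{text} ({latin})"


if __name__ == "__main__":
    print(with_romanization("北京", demo_romanize))   # 北京 (Beijing)
    print(with_romanization("Paris", demo_romanize))  # Paris
```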

Use Case 6: File Name Sanitization

Create safe file names from any Unicode input:
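The sketch below shows one conservative approach: romanize, fold to ASCII, replace anything outside a small allow-list, and fall back to a default name if nothing survives. `safe_filename`, its `fallback` parameter, and `demo_romanize` are hypothetical; the stand-in would be replaced by a real transliteration call.

```python
import os
import re
import unicodedata

DEMO = {"Москва": "Moskva", "北京": "Beijing"}


def demo_romanize(text: str) -> str:
    """Stand-in for a real transliterator: word-by-word table lookup."""
    return " ".join(DEMO.get(word, word) for word in text.split())


def safe_filename(name: str, romanize, fallback: str = "file") -> str:
    """Build an ASCII-only file name, keeping the extension when present."""
    def clean(part: str) -> str:
        latin = romanize(part)
        folded = unicodedata.normalize("NFKD", latin).encode("ascii", "ignore").decode("ascii")
        # Replace runs of disallowed characters, then trim leading/trailing dots and underscores.
        return re.sub(r"[^A-Za-z0-9._-]+", "_", folded).strip("._")

    stem, ext = os.path.splitext(name)
    stem_clean = clean(stem) or fallback
    ext_clean = clean(ext)
    return f"{stem_clean}.{ext_clean}" if ext_clean else stem_clean


if __name__ == "__main__":
    print(safe_filename("Москва 2024.txt", demo_romanize))   # Moskva_2024.txt
    print(safe_filename("../../etc/passwd", demo_romanize))  # etc_passwd
```

Stripping leading dots and path separators also neutralizes directory-traversal input, which is why the second example collapses to a flat name.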


Best Practices

1. Specify Source Script for Accuracy

When you know the source script, specify it explicitly.

2. Use to_ascii() for System Compatibility

Use to_ascii() when dealing with legacy systems that only support ASCII.

3. Cache Transliterators for Performance

For bulk operations, reuse transliterator instances.

4. Combine with Normalization

Normalize before transliterating for consistency:
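The reason is that visually identical strings can differ at the code-point level. The sketch below uses `unicodedata.normalize` from the standard library; choosing NFC is an assumption made for the example, and `normalize_first` and the lookup-table `demo_transliterate` are hypothetical stand-ins for a real transliteration call.

```python
import unicodedata

composed = "caf\u00e9"     # é as a single code point
decomposed = "cafe\u0301"  # e followed by a combining acute accent

LOOKUP = {"caf\u00e9": "cafe"}  # stand-in table that only knows the composed form


def demo_transliterate(text: str) -> str:
    """Stand-in for a real transliterator: exact-match table lookup."""
    return LOOKUP.get(text, text)


def normalize_first(text: str, transliterate_fn) -> str:
    """Apply NFC normalization, then transliterate."""
    return transliterate_fn(unicodedata.normalize("NFC", text))


if __name__ == "__main__":
    print(composed == decomposed)                           # False
    print(normalize_first(decomposed, demo_transliterate))  # cafe
```

Without the normalization step, the decomposed input misses the table entry and comes back unchanged.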

5. Handle Mixed Scripts Gracefully

Not all characters can be transliterated - handle them appropriately:
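When the target is plain ASCII, one way to handle this is to inspect the result and report what was left over instead of silently accepting it. The helpers below are a hypothetical sketch using only built-in string methods; the check does not apply when the target is Latin with diacritics, where non-ASCII output such as "mrḥbạ" is expected.

```python
def untransliterated(text: str) -> list:
    """Return the distinct non-ASCII characters still present, in first-seen order."""
    seen = []
    for ch in text:
        if not ch.isascii() and ch not in seen:
            seen.append(ch)
    return seen


def report(original: str, result: str) -> dict:
    """Summarize whether a conversion to ASCII was complete."""
    leftovers = untransliterated(result)
    return {
        "original": original,
        "result": result,
        "complete": not leftovers,
        "leftovers": leftovers,
    }


if __name__ == "__main__":
    print(report("Привет 👋", "Privet 👋"))
    print(report("Москва", "Moskva"))
```

A caller can then decide per use case whether to drop the leftovers, replace them, or reject the input.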

6. Test with Real Data

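Test transliteration with real text in each target language, not only synthetic samples.

7. Document Transliteration Scheme

Be clear about which transliteration standard you are using, since a single script often has several competing romanization schemes.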
