Language Detection
This tutorial explains language detection and how to use the Lingua-powered language detector in gllm-intl.
Installation
Linux/macOS:
# you can use a Conda environment
pip install --extra-index-url "https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/" python-dotenv gllm-intl
Windows (Command Prompt):
FOR /F "tokens=*" %T IN ('gcloud auth print-access-token') DO pip install --extra-index-url "https://oauth2accesstoken:%T@glsdk.gdplabs.id/gen-ai-internal/simple/" python-dotenv gllm-intl
What is Language Detection?
Language detection (also called language identification) is the process of automatically determining which natural language a piece of text is written in. A detector analyzes linguistic patterns, character frequencies, and word structures, and reports the identified language together with a confidence score.
For example:
"Hello, how are you?" → English (en)
"Bonjour, comment allez-vous?" → French (fr)
"Halo, apa kabar?" → Indonesian (id)
The gllm-intl library uses Lingua, a state-of-the-art language detection engine that:
Supports 75+ languages
Provides confidence scores for each detection
Returns alternative language candidates when uncertain
Handles mixed-language and short text scenarios
Works offline (no API calls required)
Why Use Language Detection?
Language detection is essential for building multilingual applications:
1. Automatic Content Routing
Detect user input language and route to appropriate handlers:
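A minimal sketch of the routing step, assuming the language code (for example "en" or "id") has already been obtained from the detector. The handler functions and the `route` helper are hypothetical names made up for illustration; they are not part of gllm-intl.

```python
from typing import Callable, Dict


# Hypothetical handlers; a real application would call its own services here.
def handle_english(text: str) -> str:
    return f"[en handler] {text}"


def handle_indonesian(text: str) -> str:
    return f"[id handler] {text}"


def handle_default(text: str) -> str:
    return f"[default handler] {text}"


# Map ISO 639-1 language codes to their handlers.
HANDLERS: Dict[str, Callable[[str], str]] = {
    "en": handle_english,
    "id": handle_indonesian,
}


def route(language_code: str, text: str) -> str:
    """Dispatch text to the handler for its language, or to the default handler."""
    handler = HANDLERS.get(language_code, handle_default)
    return handler(text)
```

A dictionary lookup with a default keeps unsupported languages from raising an error, which matters because a detector can return any of its supported languages.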
2. Dynamic Translation Selection
Choose translation direction automatically:
3. Content Filtering & Moderation
Filter content by language:
4. Analytics & Insights
Understand your user base:
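As a rough illustration, the sketch below aggregates language codes that have already been detected into percentage shares. The `language_share` helper and the sample data are assumptions for this example only; the detection step itself is not shown.

```python
from collections import Counter
from typing import Dict, Iterable


def language_share(codes: Iterable[str]) -> Dict[str, float]:
    """Return each language's share of all detections, as a percentage."""
    counts = Counter(codes)
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders languages from most to least frequent.
    return {code: round(100 * n / total, 1) for code, n in counts.most_common()}


# Sample data standing in for codes produced by earlier detection calls.
detected = ["en", "id", "id", "fr", "id", "en", "id", "en"]
shares = language_share(detected)
print(shares)
```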
Quick Start
Detect language in just 2 lines:
Understanding Detection Results
Every detection returns a DetectionResult object with three key components:
1. Primary Language
The most likely language with its confidence score:
2. Alternative Candidates
Other possible languages ranked by confidence:
3. Fallback Status
Indicates if the result came from fallback logic (low confidence or empty text):
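To make the three components concrete, here is a stand-in data structure with the same shape. The class and field names (`ResultSketch`, `Candidate`, `primary`, `alternatives`, `is_fallback`) are assumptions chosen for this sketch and may not match the actual attributes of DetectionResult; check the library reference for the real names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Candidate:
    language: str      # ISO 639-1 code, e.g. "en"
    confidence: float  # score between 0.0 and 1.0


@dataclass(frozen=True)
class ResultSketch:
    """Illustrative stand-in, NOT the library's DetectionResult."""
    primary: Candidate
    alternatives: List[Candidate] = field(default_factory=list)
    is_fallback: bool = False


def describe(result: ResultSketch) -> str:
    """Summarize the primary language, the alternatives, and the fallback status."""
    parts = [f"{result.primary.language} ({result.primary.confidence:.2f})"]
    if result.alternatives:
        alts = ", ".join(
            f"{c.language} ({c.confidence:.2f})" for c in result.alternatives
        )
        parts.append(f"alternatives: {alts}")
    if result.is_fallback:
        parts.append("fallback")
    return "; ".join(parts)
```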
Single Text Detection
Basic Detection
Handling Empty or Invalid Input
Use the fallback_language parameter to handle edge cases:
Setting Confidence Thresholds
Enforce minimum confidence with confidence_threshold:
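The fallback and threshold rules described in this section can be sketched as a single decision function. This is an illustration of the logic under stated assumptions, not the library's implementation: `resolve_language` is a hypothetical helper, and the default values 0.5 and "en" are placeholders rather than gllm-intl defaults. Only the parameter names `fallback_language` and `confidence_threshold` come from the text above.

```python
from typing import Optional, Tuple


def resolve_language(
    text: str,
    detected: Optional[str],
    confidence: float,
    confidence_threshold: float = 0.5,  # placeholder default (assumption)
    fallback_language: str = "en",      # placeholder default (assumption)
) -> Tuple[str, bool]:
    """Return (language, used_fallback) for one detection outcome."""
    # Empty or whitespace-only input cannot be detected reliably.
    if not text.strip() or detected is None:
        return fallback_language, True
    # A detection below the threshold is replaced by the fallback.
    if confidence < confidence_threshold:
        return fallback_language, True
    return detected, False
```

Returning the boolean alongside the language lets callers distinguish a genuine detection from a fallback, mirroring the fallback status described earlier.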
Accessing Alternative Languages
Batch Detection
Process multiple texts efficiently with batch detection:
Basic Batch Detection
Batch with Fallback
Processing Large Datasets
Batch detection automatically chunks large inputs for optimal performance:
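The library handles chunking internally, so application code does not need to do this. The sketch below only illustrates the general idea. The `detect_many` callable stands in for any batch detection function, and `chunk_size=1000` is an arbitrary value chosen for the example, not a documented default.

```python
from typing import Callable, Iterator, List, Sequence


def chunks(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def detect_in_chunks(
    texts: Sequence[str],
    detect_many: Callable[[Sequence[str]], List[str]],
    chunk_size: int = 1000,  # arbitrary example value (assumption)
) -> List[str]:
    """Run a batch detector chunk by chunk and preserve the input order."""
    results: List[str] = []
    for chunk in chunks(texts, chunk_size):
        results.extend(detect_many(chunk))
    return results
```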
Advanced Configuration
Use DetectionConfig for fine-grained control:
Configuration Options
Using Custom Configuration
Reusable Detector Instance
For repeated detections, create a LanguageDetector instance:
Per-Call Overrides
Override configuration for specific calls:
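One common way to model per-call overrides is to keep an immutable base configuration and derive a modified copy for a single call. The sketch below shows that pattern with a hypothetical `ConfigSketch` class; it is not the library's DetectionConfig, and its default values are placeholders. Only the field names `confidence_threshold` and `fallback_language` are taken from this tutorial.

```python
from dataclasses import dataclass, replace
from typing import Any


@dataclass(frozen=True)
class ConfigSketch:
    """Illustrative stand-in, NOT the library's DetectionConfig."""
    confidence_threshold: float = 0.5  # placeholder default (assumption)
    fallback_language: str = "en"      # placeholder default (assumption)


def effective_config(base: ConfigSketch, **overrides: Any) -> ConfigSketch:
    """Return a copy of `base` with per-call overrides applied.

    Overrides set to None are ignored, so callers can pass optional
    arguments straight through.
    """
    changes = {key: value for key, value in overrides.items() if value is not None}
    return replace(base, **changes)
```

Because the base object is frozen, an override for one call cannot leak into later calls that share the same configuration.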
Best Practices
1. Use Appropriate Text Length
Language detection accuracy improves with longer text:
Recommendations:
Minimum: 10-20 characters for reliable detection
Optimal: 50+ characters for best accuracy
Short text: Use higher confidence thresholds or fallbacks
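The recommendations above can be turned into a simple rule that demands more confidence from shorter text. The length cutoffs (20 and 50 characters) follow the recommendations in this section; the threshold values 0.9, 0.7, and 0.5 are illustrative assumptions that should be tuned against real data.

```python
def threshold_for(text: str) -> float:
    """Pick a confidence threshold based on how much text is available."""
    length = len(text.strip())
    if length < 20:
        return 0.9  # very short text: require high confidence (assumed value)
    if length < 50:
        return 0.7  # medium-length text (assumed value)
    return 0.5      # 50+ characters: detection is most reliable (assumed value)
```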
2. Always Set Fallback Language
Prevent unexpected behavior with empty or ambiguous input:
3. Use Batch Detection for Multiple Texts
More efficient than individual calls:
4. Validate Confidence Scores
Don't blindly trust low-confidence detections:
5. Consider Alternative Candidates
For ambiguous cases, show alternatives to users:
6. Handle Mixed-Language Content
For documents with multiple languages, detect per section:
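A minimal sketch of per-section detection, assuming sections are separated by blank lines. The `detect` argument stands in for any function that maps a string to a language code; both helper names are hypothetical and not part of gllm-intl.

```python
import re
from typing import Callable, List, Tuple


def split_sections(document: str) -> List[str]:
    """Split a document on blank lines and drop empty sections."""
    return [part.strip() for part in re.split(r"\n\s*\n", document) if part.strip()]


def detect_per_section(
    document: str,
    detect: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Return (language_code, section_text) pairs in document order."""
    return [(detect(section), section) for section in split_sections(document)]
```

Other split rules (headings, sentences, fixed-size windows) fit the same shape; the trade-off is that smaller sections give finer granularity but less text per detection.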
7. Reuse Detector Instances
For better performance in long-running applications:
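One way to guarantee a single shared instance is to cache the constructor call. In the sketch below, `ExpensiveDetector` is a hypothetical placeholder for a detector that is costly to build (for example, because it loads language models); in a real application the body of `get_detector` would construct the library's detector instead.

```python
from functools import lru_cache


class ExpensiveDetector:
    """Hypothetical placeholder for a detector that is costly to construct."""

    instances_built = 0

    def __init__(self) -> None:
        # Count constructions so the caching effect is observable.
        ExpensiveDetector.instances_built += 1

    def detect(self, text: str) -> str:
        return "en"  # placeholder result, no real detection happens here


@lru_cache(maxsize=1)
def get_detector() -> ExpensiveDetector:
    """Build the detector on first use and return the same instance afterwards."""
    return ExpensiveDetector()
```

Calling `get_detector()` from request handlers or worker loops then reuses one instance for the lifetime of the process.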
Common Use Cases
Web Application: Auto-Detect User Language
Content Management: Classify Documents
Chat Application: Route to Language-Specific Handlers