Language Detection

This tutorial explains language detection and how to use the Lingua-powered language detector in gllm-intl.

Installation

# optional: install inside a Conda or virtual environment
pip install --extra-index-url "https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/" python-dotenv gllm-intl

What is Language Detection?

Language detection (also called language identification) is the process of automatically determining which natural language a piece of text is written in. It analyzes linguistic patterns, character frequencies, and word structures to identify the language with a certain confidence level.

For example:

  • "Hello, how are you?" → English (en)

  • "Bonjour, comment allez-vous?" → French (fr)

  • "Halo, apa kabar?" → Indonesian (id)

The gllm-intl library uses Lingua, a state-of-the-art language detection engine that:

  • Supports 75+ languages

  • Provides confidence scores for each detection

  • Returns alternative language candidates when uncertain

  • Handles mixed-language and short text scenarios

  • Works offline (no API calls required)

Why Use Language Detection?

Language detection is essential for building multilingual applications:

1. Automatic Content Routing

Detect user input language and route to appropriate handlers:
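The routing pattern can be sketched in plain Python. The detector call below is a hard-coded stand-in (the real gllm-intl API is not shown here); only the routing logic around it is the point:

```python
from typing import Callable

# Stand-in for the gllm-intl detector; the real API may differ.
# Detection results are hard-coded purely to illustrate routing.
def detect_language_code(text: str) -> str:  # hypothetical helper
    known = {"Hello, how are you?": "en", "Halo, apa kabar?": "id"}
    return known.get(text, "en")

# Map detected language codes to language-specific handlers.
handlers: dict[str, Callable[[str], str]] = {
    "en": lambda text: f"[EN handler] {text}",
    "id": lambda text: f"[ID handler] {text}",
}

def route(text: str) -> str:
    code = detect_language_code(text)
    handler = handlers.get(code, handlers["en"])  # default to English
    return handler(text)

reply = route("Halo, apa kabar?")  # routed to the Indonesian handler
```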

2. Dynamic Translation Selection

Choose translation direction automatically:
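The selection logic is simple once a language code is available. A minimal sketch (the function name is illustrative, not part of gllm-intl):

```python
def choose_translation_pair(detected: str, ui_language: str = "en") -> tuple[str, str]:
    """Translate detected-language content into the user's UI language,
    returning an identity pair when no translation is needed."""
    if detected == ui_language:
        return (detected, detected)
    return (detected, ui_language)

# French input, English UI -> translate fr -> en
pair = choose_translation_pair("fr", ui_language="en")
```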

3. Content Filtering & Moderation

Filter content by language:
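Assuming detection has already produced a language code per text, filtering is a straightforward comprehension:

```python
# (text, detected_language_code) pairs, e.g. from a prior detection pass
texts_with_langs = [
    ("Hello, how are you?", "en"),
    ("Bonjour, comment allez-vous?", "fr"),
    ("Halo, apa kabar?", "id"),
]

allowed = {"en", "id"}  # languages your moderation pipeline supports
kept = [text for text, lang in texts_with_langs if lang in allowed]
```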

4. Analytics & Insights

Understand your user base:
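For example, a language distribution over detected codes can be tallied with a `Counter`:

```python
from collections import Counter

# Detected language codes collected from user messages
detected_codes = ["en", "en", "id", "fr", "en", "id"]

distribution = Counter(detected_codes)
top_language, count = distribution.most_common(1)[0]
# top_language == "en", count == 3
```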

Quick Start

Detect language in just 2 lines:

Understanding Detection Results

Every detection returns a DetectionResult object with three key components:

1. Primary Language

The most likely language with its confidence score:

2. Alternative Candidates

Other possible languages ranked by confidence:

3. Fallback Status

Indicates if the result came from fallback logic (low confidence or empty text):
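The exact shape of `DetectionResult` is not reproduced here; a minimal sketch of the three components described above, with assumed field names (`language`, `confidence`, `candidates`, `is_fallback` are hypothetical), might look like:

```python
from dataclasses import dataclass, field

@dataclass
class DetectionResultSketch:
    """Illustrative stand-in for gllm-intl's DetectionResult."""
    language: str              # 1. primary language code, e.g. "en"
    confidence: float          # confidence of the primary detection, 0.0-1.0
    candidates: list[tuple[str, float]] = field(default_factory=list)
                               # 2. alternatives, ranked by confidence
    is_fallback: bool = False  # 3. True when fallback logic produced the result

result = DetectionResultSketch(
    language="en",
    confidence=0.92,
    candidates=[("nl", 0.05), ("de", 0.03)],
)
```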


Single Text Detection

Basic Detection

Handling Empty or Invalid Input

Use the fallback_language parameter to handle edge cases:

Setting Confidence Thresholds

Enforce minimum confidence with confidence_threshold:

Accessing Alternative Languages


Batch Detection

Process multiple texts efficiently with batch detection:

Basic Batch Detection

Batch with Fallback

Processing Large Datasets

Batch detection automatically chunks large inputs for optimal performance:
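The chunking idea is the standard fixed-size split; a self-contained sketch (gllm-intl's internal chunk size and mechanics may differ):

```python
from typing import Iterator

def chunked(items: list[str], chunk_size: int) -> Iterator[list[str]]:
    """Yield fixed-size chunks so each batch stays a manageable size."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]

texts = [f"text {i}" for i in range(10)]
batches = list(chunked(texts, chunk_size=4))
# 3 batches of sizes 4, 4, and 2
```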


Advanced Configuration

Use DetectionConfig for fine-grained control:

Configuration Options

Using Custom Configuration

Reusable Detector Instance

For repeated detections, create a LanguageDetector instance:

Per-Call Overrides

Override configuration for specific calls:


Best Practices

1. Use Appropriate Text Length

Language detection accuracy improves with longer text:

Recommendations:

  • Minimum: 10-20 characters for reliable detection

  • Optimal: 50+ characters for best accuracy

  • Short text: Use higher confidence thresholds or fallbacks

2. Always Set Fallback Language

Prevent unexpected behavior with empty or ambiguous input:

3. Use Batch Detection for Multiple Texts

More efficient than individual calls:

4. Validate Confidence Scores

Don't blindly trust low-confidence detections:

5. Consider Alternative Candidates

For ambiguous cases, show alternatives to users:

6. Handle Mixed-Language Content

For documents with multiple languages, detect per section:
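One way to do per-section detection is to split on blank lines and detect each piece independently. The detector passed in below is a toy stand-in so the sketch runs on its own:

```python
from typing import Callable

def detect_per_section(document: str,
                       detect: Callable[[str], str]) -> list[tuple[str, str]]:
    """Split a document on blank lines and detect each section separately."""
    sections = [s.strip() for s in document.split("\n\n") if s.strip()]
    return [(section, detect(section)) for section in sections]

doc = "Hello, how are you?\n\nBonjour, comment allez-vous?"
# Toy detector for the sketch; use the real detector here in practice.
results = detect_per_section(doc, detect=lambda s: "fr" if "Bonjour" in s else "en")
```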

7. Reuse Detector Instances

For better performance in long-running applications:

Common Use Cases

Web Application: Auto-Detect User Language

Content Management: Classify Documents

Chat Application: Route to Language-Specific Handlers
