Compressor

What's a Compressor?

In Retrieval-Augmented Generation (RAG), you often retrieve many passages and pack them into a single prompt, which can exceed the model's context limit or add latency and cost. A Compressor reduces the prompt's token count while trying to retain the content most relevant to the query. For example, with a compression rate of 0.5, a roughly 4,000-token context is trimmed to about 2,000 tokens.

Prerequisites

This example specifically requires:

  1. Completion of all setup steps listed on the Prerequisites page.

  2. A compatible PyTorch/CUDA setup for GPU usage.

  3. Enough CPU or GPU memory to host the model used during compression.

Installation

# you can use a Conda environment
pip install --extra-index-url https://oauth2accesstoken:$(gcloud auth print-access-token)@glsdk.gdplabs.id/gen-ai-internal/simple/ "gllm-misc[llmlingua]"
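
To confirm that the extra installed correctly, you can try importing the compressor class used in the quickstart below (the import path follows the example there):

# Sanity check: this import should succeed after installation
from gllm_generation.compressor import LLMLinguaCompressor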

Quickstart

Currently, only the LLMLingua compressor is supported. This quickstart shows how to use it.

Note: If this is your first time using the Compressor with this model, Hugging Face will download the model weights for you, which can take a while.

Note: A GPU is recommended, since CPU inference can be slow.
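
If you prefer to select the device automatically rather than hard-coding it, a minimal sketch (assuming PyTorch is available, per the prerequisites):

import torch

# Use the GPU when one is visible to PyTorch, otherwise fall back to CPU
device_map = "cuda" if torch.cuda.is_available() else "cpu"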

import asyncio

from gllm_generation.compressor import LLMLinguaCompressor

def main() -> None:
    # Choose device_map="cuda" for GPU, or "cpu" if no GPU
    compressor = LLMLinguaCompressor(
        model_name="microsoft/phi-2",
        device_map="cpu",
        rate=0.5,                      # default compression rate (keep ~50%)
        target_token=-1,               # -1 = no strict target; you can set e.g., 800
        use_sentence_level_filter=False,
        use_context_level_filter=True,
        use_token_level_filter=True,
        rank_method="longllmlingua",   # recommended
    )

    instruction = "Answer the question using the provided context."
    context = (
        "Document 1: ... long text ...\n"
        "Document 2: ... long text ...\n"
        "Document 3: ... long text ..."
    )
    query = "What are the main differences between approach A and B?"

    # Optionally override defaults at call time
    options = {
        "rate": 0.4,                   # compress further to ~40%
        # "target_token": 800,         # alternatively, target a specific token count
        # "use_sentence_level_filter": True,
        # "rank_method": "longllmlingua",
    }

    # compressor.run is a coroutine, so drive it with asyncio.run
    compressed = asyncio.run(compressor.run(
        context=context,
        query=query,
        instruction=instruction,
        options=options,
    ))

    print("Original length:", len(context))
    print("Compressed length:", len(compressed))
    print("Compressed preview:\n", compressed[:500])

if __name__ == "__main__":
    main()
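
When choosing between rate and target_token, it can help to measure how many tokens the original context actually uses. A minimal sketch using the Hugging Face transformers tokenizer for the same model (this assumes transformers is installed, which it typically is, since the model above is downloaded through Hugging Face):

from transformers import AutoTokenizer

context = "Document 1: ... long text ..."  # same context as in the quickstart

# Tokenize with the compression model's own tokenizer to estimate savings
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
n_tokens = len(tokenizer.encode(context))
print(f"Context is {n_tokens} tokens; rate=0.4 keeps roughly {int(n_tokens * 0.4)}")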
