
LLMLingua

[EMNLP'23, ACL'24] LLMLingua compresses prompts and the KV cache to speed up LLM inference and sharpen the model's perception of key information, achieving up to 20x compression with minimal performance loss.

Install / Use

/learn @microsoft/LLMLingua
About this skill

Supported Platforms

Universal

README

<div style="display: flex; align-items: center;"> <div style="width: 100px; margin-right: 10px; height:auto;" align="left"> <img src="images/LLMLingua_logo.png" alt="LLMLingua" width="100" align="left"> </div> <div style="flex-grow: 1;" align="center"> <h2 align="center">LLMLingua Series | Effectively Deliver Information to LLMs via Prompt Compression</h2> </div> </div> <p align="center"> | <a href="https://llmlingua.com/"><b>Project Page</b></a> | <a href="https://aclanthology.org/2023.emnlp-main.825/"><b>LLMLingua</b></a> | <a href="https://aclanthology.org/2024.acl-long.91/"><b>LongLLMLingua</b></a> | <a href="https://aclanthology.org/2024.findings-acl.57/"><b>LLMLingua-2</b></a> | <a href="https://huggingface.co/spaces/microsoft/LLMLingua"><b>LLMLingua Demo</b></a> | <a href="https://huggingface.co/spaces/microsoft/LLMLingua-2"><b>LLMLingua-2 Demo</b></a> | </p>

https://github.com/microsoft/LLMLingua/assets/30883354/eb0ea70d-6d4c-4aa7-8977-61f94bb87438

News

  • 🍩 [24/12/13] We are excited to announce the release of our KV cache-centric analysis work, SCBench, which evaluates long-context methods from a KV cache perspective.
  • 👘 [24/09/16] We are pleased to announce the release of our KV cache offloading work, RetrievalAttention, which accelerates long-context LLM inference via vector retrieval.
  • 🌀 [24/07/03] We're excited to announce the release of MInference, which speeds up long-context LLM inference, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy on 1M-token prompts! For more information, check out our paper and visit the project page.
  • 🧩 LLMLingua has been integrated into Prompt flow, a streamlined tool framework for LLM-based AI applications.
  • 🦚 We're excited to announce the release of LLMLingua-2, boasting a 3x-6x speed improvement over LLMLingua! For more information, check out our paper, visit the project page, and explore our demo.
  • 👾 LLMLingua has been integrated into LangChain and LlamaIndex, two widely-used RAG frameworks.
  • 🤳 Talk slides are available in AI Time Jan, 24.
  • 🖥 EMNLP'23 slides are available in Session 5 and BoF-6.
  • 📚 Check out our new blog post discussing RAG benefits and cost savings through prompt compression. See the script example here.
  • 🎈 Visit our project page for real-world case studies in RAG, Online Meetings, CoT, and Code.
  • 👨‍🦯 Explore our './examples' directory for practical applications, including LLMLingua-2, RAG, Online Meeting, CoT, Code, and RAG using LlamaIndex.

TL;DR

LLMLingua utilizes a compact, well-trained language model (e.g., GPT2-small, LLaMA-7B) to identify and remove non-essential tokens in prompts. This approach enables efficient inference with large language models (LLMs), achieving up to 20x compression with minimal performance loss.
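The pruning loop can be sketched in a few lines. Note this is only an illustration of the idea: the real method uses a compact causal LM to compute per-token perplexities, whereas the toy `surprisal` scorer below is a stand-in based on a made-up unigram table.

```python
import math

# Stand-in "small LM": a toy unigram frequency table. In LLMLingua proper,
# a compact LM (e.g., GPT2-small) supplies per-token perplexities.
TOY_FREQ = {
    "the": 0.07, "a": 0.05, "of": 0.04, "is": 0.03, "and": 0.03,
    "to": 0.03, "in": 0.02, "that": 0.02,
}

def surprisal(token: str) -> float:
    # Rare tokens get high surprisal and are treated as informative.
    p = TOY_FREQ.get(token.lower(), 0.001)
    return -math.log(p)

def compress(tokens: list[str], keep_ratio: float) -> list[str]:
    """Drop the lowest-surprisal tokens until only keep_ratio of the
    prompt remains, preserving the original token order."""
    budget = max(1, round(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)),
                    key=lambda i: surprisal(tokens[i]), reverse=True)
    keep = sorted(ranked[:budget])
    return [tokens[i] for i in keep]

prompt = "the answer to the question is that gravity bends light".split()
print(compress(prompt, keep_ratio=0.5))
# → ['answer', 'question', 'gravity', 'bends', 'light']
```

Filler words with low surprisal are dropped first, which is why the compressed prompt keeps the content-bearing tokens.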

LongLLMLingua mitigates the 'lost in the middle' issue in LLMs, enhancing long-context information processing. It reduces costs and boosts efficiency with prompt compression, improving RAG performance by up to 21.4% using only 1/4 of the tokens.
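One ingredient of this approach, document reordering, can be sketched as follows. In the paper the relevance of each retrieved document is scored question-aware (via the conditional perplexity of the question given the document); the `relevance` list below is a hypothetical stand-in for those scores.

```python
def reorder_by_relevance(docs: list[str], relevance: list[float]) -> list[str]:
    """Place higher-relevance documents earlier in the prompt, countering
    the 'lost in the middle' positional bias of long-context LLMs."""
    order = sorted(range(len(docs)), key=lambda i: relevance[i], reverse=True)
    return [docs[i] for i in order]

docs = ["doc A", "doc B", "doc C"]
scores = [0.2, 0.9, 0.5]  # stand-in question-aware relevance scores
print(reorder_by_relevance(docs, scores))
# → ['doc B', 'doc C', 'doc A']
```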

LLMLingua-2, a small-size yet powerful prompt compression method trained via data distillation from GPT-4 for token classification with a BERT-level encoder, excels in task-agnostic compression. It surpasses LLMLingua in handling out-of-domain data, offering 3x-6x faster performance.
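The token-classification framing can be sketched like this. In LLMLingua-2 a BERT-size encoder, trained on keep/drop labels distilled from GPT-4, predicts a keep probability per token; the `keep_prob` heuristic below is a toy stand-in for that learned model.

```python
def keep_prob(token: str) -> float:
    # Toy stand-in for the learned classifier: treat longer tokens as
    # more informative. The real model is a trained BERT-level encoder.
    return min(1.0, len(token) / 8)

def compress_by_classification(tokens: list[str],
                               threshold: float = 0.5) -> list[str]:
    """Keep every token the (toy) classifier scores above threshold."""
    return [t for t in tokens if keep_prob(t) >= threshold]

prompt = "please summarize the quarterly revenue figures".split()
print(compress_by_classification(prompt))
# → ['please', 'summarize', 'quarterly', 'revenue', 'figures']
```

Because the classifier is bidirectional and needs no question or target task as input, the same compressed prompt can serve any downstream task, which is what makes the method task-agnostic.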

SecurityLingua is a safety guardrail model that uses security-aware prompt compression to reveal the malicious intent behind jailbreak attacks, enabling LLMs to detect attacks and generate safe responses. Because the prompt compression is highly efficient, the defense adds negligible overhead and uses 100x fewer tokens than state-of-the-art LLM guardrail approaches.

🎥 Overview

Background

  • Ever encountered the token limit when asking ChatGPT to summarize lengthy texts?
  • Frustrated with ChatGPT forgetting previous instructions after extensive fine-tuning?
  • Experienced high costs using GPT3.5/4 API for experiments despite excellent results?

While Large Language Models like ChatGPT and GPT-4 excel in generalization and reasoning, they often face challenges like prompt length limits and prompt-based pricing schemes.

Motivation for LLMLingua

Now you can use LLMLingua, LongLLMLingua, and LLMLingua-2!

These tools offer an efficient solution to compress prompts by up to 20x, enhancing the utility of LLMs.

  • 💰 Cost Savings: Reduces both prompt and generation lengths with minimal overhead.
  • 📝 Extended Context Support: Enhances support for longer contexts, mitigates the "lost in the middle" issue, and boosts overall performance.
  • ⚖️ Robustness: No additional training needed for LLMs.
  • 🕵️ Knowledge Retention: Maintains original prompt information like ICL and reasoning.
  • 📜 KV-Cache Compression: Accelerates inference process.
  • 🪃 Comprehensive Recovery: GPT-4 can recover all key information from compressed prompts.

Framework of LLMLingua

Framework of LongLLMLingua

Framework of LLMLingua-2

PS: This demo is based on the alt-gpt project. Special thanks to @Livshitz for their valuable contribution.

If you find this repo helpful, please cite the following papers:

@inproceedings{jiang-etal-2023-llmlingua,
    title = "{LLML}ingua: Compressing Prompts for Accelerated Inference of Large Language Models",
    author = "Huiqiang Jiang and Qianhui Wu and Chin-Yew Lin and Yuqing Yang and Lili Qiu",
    editor = "Bouamor, Houda  and
      Pino, Juan  and
      Bali, Kalika",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.825",
    doi = "10.18653/v1/2023.emnlp-main.825",
    pages = "13358--13376",
}
@inproceedings{jiang-etal-2024-longllmlingua,
    title = "{L}ong{LLML}ingua: Accelerating and Enhancing {LLM}s in Long Context Scenarios via Prompt Compression",
    author = "Huiqiang Jiang and Qianhui Wu and Xufang Luo and Dongsheng Li and Chin-Yew Lin and Yuqing Yang and Lili Qiu",
    editor = "Ku, Lun-Wei  and
      Martins, Andre  and
      Srikumar, Vivek",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-long.91",
    pages = "1658--1677",
}
@inproceedings{pan-etal-2024-llmlingua,
    title = "{LLML}ingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression",
    author = "Zhuoshi Pan and Qianhui Wu and Huiqiang Jiang and Menglin Xia and Xufang Luo and Jue Zhang and Qingwei Lin and Victor Ruhle and Yuqing Yang and Chin-Yew Lin and H. Vicky Zhao and Lili Qiu and Dongmei Zhang",
    editor = "Ku, Lun-Wei  and
      Martins, Andre  and
      Srikumar, Vivek",
    booktitle = "Findings of the Association for Computational Linguistics ACL 2024",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand and virtual meeting",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-acl.57",
}