# Ada-LEval
**The official implementation of ["Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks"](https://arxiv.org/abs/2404.06480)**
**Ada-LEval** is a pioneering benchmark for assessing the long-context capabilities of LLMs with length-adaptable questions. It comprises two challenging tasks: **TSort**, which involves arranging text segments into the correct order, and **BestAnswer**, which requires choosing the best answer to a question from multiple candidates.
Both tasks feature the following advantages:
1. **Controllable Test Cases**: The length of each test case can be finely tuned by adjusting the number and length of text segments in TSort and the number of distractor options in BestAnswer (a construction sketch follows this list).
2. **Necessity for Full-Text Comprehension**: Successful completion of both tasks mandates complete reading and understanding of the provided text.
3. **Precise Accuracy Measurement**: The design of these tasks allows for unambiguous accuracy calculation. TSort has a definitive 'correct' order, while in BestAnswer, the annotated responses by the questioner serve as definitive answers.
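To make the length-adaptability concrete, here is a minimal sketch of how a TSort-style case could be assembled, with its length controlled purely by the number and size of segments; a BestAnswer-style case would be tuned analogously by varying the number of appended distractor answers. Function and parameter names here are illustrative, not taken from this repository's code:

```python
import random

def make_tsort_case(text: str, n_segments: int = 4, segment_words: int = 500,
                    seed: int = 0) -> dict:
    """Illustrative builder for a length-adaptable ordering test case.

    The overall case length is roughly n_segments * segment_words words
    (word count used as a rough proxy for tokens), so it can be dialed
    up or down to hit a target context size (2k, 4k, ...).
    """
    words = text.split()
    needed = n_segments * segment_words
    if len(words) < needed:
        raise ValueError("source text is too short for the requested setting")

    # Cut consecutive, non-overlapping segments so the correct order is known.
    segments = [" ".join(words[i * segment_words:(i + 1) * segment_words])
                for i in range(n_segments)]

    # Shuffle and record how to undo the shuffle (the ground-truth answer).
    order = list(range(n_segments))
    random.Random(seed).shuffle(order)
    shuffled = [segments[i] for i in order]
    answer = sorted(range(n_segments), key=order.__getitem__)

    return {"segments": shuffled, "answer": answer}
```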
## 🛠️QuickStart
In this repo, we implement the evaluation of Ada-LEval on GPT-4-Turbo-0125 (an example for API models) and InternLM2-[7B/20B] (an example for open-source LLMs). You can follow our implementation to evaluate your own LLMs on Ada-LEval.
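The exact model interface used by `run.py` may differ, but conceptually a custom LLM only needs to expose a prompt-in, completion-out call. Below is a minimal sketch of such a wrapper; the `MyCustomLLM` class name and `generate` signature are assumptions for illustration, not this repo's API:

```python
class MyCustomLLM:
    """Hypothetical wrapper: adapt it to the model interface expected by run.py."""

    def __init__(self, max_new_tokens: int = 512):
        self.max_new_tokens = max_new_tokens
        # Load your model or initialize your API client here.

    def generate(self, prompt: str) -> str:
        """Return the model's completion for one long-context prompt."""
        # Replace this stub with a call to your own backend or API.
        raise NotImplementedError("plug your model call in here")
```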
1. **Preparation**
   1. Installation and data preparation

      ```bash
      cd Ada-LEval
      pip install -e .
      bash fetch_data.sh
      ```

   2. For evaluating GPT-4, please set the environment variable: `export OPENAI_API_KEY=sk-xxxxx`
      - Cost Estimation for GPT-4-Turbo-0125: `setting (2k, 4k, etc.) * n_samples * $0.01 / 1000` (a small cost helper is sketched after these steps)
   3. For evaluating InternLM2-7B, please follow the [official guide](https://github.com/InternLM/lmdeploy) to install LMDeploy.
2. **Evaluate GPT-4-Turbo-0125**: `python run.py --data {dataset_name} --model gpt-4-0125`
3. **Evaluate InternLM2-7B**: `bash run.sh --data {dataset_name} --model internlm2-7b`
\* `dataset_name` can be `stackselect_{setting}` (for **BestAnswer**) or `textsort_{setting}` (for **TSort**). For example, `stackselect_16k`, `textsort_2k`, etc.
\** `run.sh` detects the number of available GPUs and runs the evaluation in data-parallel mode.
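For budgeting API runs, the cost formula above can be written out as a small helper. It counts prompt tokens only, at the $0.01 / 1k input rate quoted above; completion tokens would add to the real spend:

```python
def estimate_input_cost(setting_tokens: int, n_samples: int,
                        usd_per_1k_tokens: float = 0.01) -> float:
    """README formula: setting * n_samples * $0.01 / 1000 (input tokens only)."""
    return setting_tokens * n_samples * usd_per_1k_tokens / 1000

# Example: the 16k setting with 200 samples -> 16000 * 200 * 0.01 / 1000 = $32.
print(f"${estimate_input_cost(16_000, 200):.2f}")  # -> $32.00
```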
## 📊Evaluation Results
Here are the evaluation results for the TSort and BestAnswer benchmarks under **long-context** and **ultra-long-context** settings. We also provide a 'random guess' baseline for each task (see the sketch below).
**Definitions:** long-context: context window < 32k; ultra-long-context: context window >= 32k.
**Number of Evaluation Samples:** API models: 200 (long-context), 50 (ultra-long-context); open-source models: 1000 (long-context), 200 (ultra-long-context).
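The 'random guess' rows below can be reproduced with simple counting. For TSort, a uniform random ordering of n shuffled segments is entirely correct with probability 1/n!; the 4.2% baseline in the table matches n = 4 (the segment count is inferred from that figure, not read from the code). For BestAnswer, guessing uniformly among the candidates succeeds with probability roughly 1 divided by the number of candidate answers, which is why the baseline falls as longer settings pack in more distractors:

```python
from math import factorial

def tsort_random_guess(n_segments: int) -> float:
    """Probability that a uniform random ordering of n segments is fully correct."""
    return 1.0 / factorial(n_segments)

def bestanswer_random_guess(n_candidates: int) -> float:
    """Probability of picking the annotated best answer uniformly at random."""
    return 1.0 / n_candidates

print(f"TSort, 4 segments: {tsort_random_guess(4):.1%}")                # ~4.2%, as in the table
print(f"BestAnswer, 10 candidates: {bestanswer_random_guess(10):.1%}")  # illustrative candidate count
```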
#### TL;DR:
1. **TSort is an extremely challenging benchmark:** We observe positive results (significantly better than random guess) only when evaluating SOTA API models (GPT-4 series) under short context settings (< 8k).
2. **BestAnswer is a challenging and discriminative long-context benchmark:** With a 32k context, GPT-4-Turbo-0125 still obtains a decent 30% accuracy, while other models lag far behind. When the context window is 64k or longer, models fail on almost all of the questions.
#### TSort Evaluation Results
Blank cells indicate that the corresponding setting was not evaluated.
| TSort | 2k | 4k | 8k | 16k | 32k | 64k | 128k |
| -------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| GPT-4-Turbo-0125 | 15.5 | 16.5 | 8.5 | 5.5 | 2.0 | 4.0 | 2.0 |
| GPT-4-Turbo-1106 | 18.5 | 15.5 | 7.5 | 3.5 | 6.0 | 6.0 | 6.0 |
| GPT-3.5-Turbo-1106 | 4.0 | 4.5 | 4.5 | 5.5 | | | |
| Claude-2 | 5.0 | 5.0 | 4.5 | 3.0 | 0.0 | 0.0 | |
| LongChat-7b-v1.5-32k | 5.3 | 5.0 | 3.1 | 2.5 | | | |
| ChatGLM2-6B-32k | 0.9 | 0.7 | 0.2 | 0.9 | | | |
| ChatGLM3-6B-32k | 2.3 | 2.4 | 2.0 | 0.7 | | | |
| Vicuna-7b-v1.5-16k | 5.3 | 2.2 | 2.3 | 1.7 | | | |
| Vicuna-13b-v1.5-16k | 5.4 | 5.0 | 2.4 | 3.1 | | | |
| InternLM2-7b | 5.1 | 3.9 | 5.1 | 4.3 | | | |
| Random Guess | 4.2 | 4.2 | 4.2 | 4.2 | 4.2 | 4.2 | 4.2 |
#### BestAnswer Evaluation Results
Blank cells indicate that the corresponding setting was not evaluated.
| BestAnswer | 1k | 2k | 4k | 6k | 8k | 12k | 16k | 32k | 64k | 128k |
| -------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| GPT-4-Turbo-0125 | 73.5 | 73.5 | 65.5 | 63.0 | 56.5 | 52.0 | 44.5 | 30.0 | 0.0 | 0.0 |
| GPT-4-Turbo-1106 | 74.0 | 73.5 | 67.5 | 59.5 | 53.5 | 49.5 | 44.0 | 16.0 | 0.0 | 0.0 |
| GPT-3.5-Turbo-1106 | 61.5 | 48.5 | 41.5 | 29.5 | 17.0 | 2.5 | 2.5 | | | |
| Claude-2 | 65.0 | 43.5 | 23.5 | 15.0 | 17.0 | 12.0 | 11.0 | 4.0 | 0.0 | |
| LongChat-7b-v1.5-32k | 32.4 | 10.7 | 5.7 | 3.1 | 1.9 | 1.6 | 0.8 | | | |
| ChatGLM2-6B-32k | 31.2 | 10.9 | 4.5 | 1.6 | 1.6 | 0.0 | 0.3 | | | |
| ChatGLM3-6B-32k | 39.8 | 18.8 | 9.0 | 5.0 | 3.4 | 0.9 | 0.5 | | | |
| Vicuna-7b-v1.5-16k | 37.0 | 11.1 | 5.8 | 3.2 | 1.8 | 1.9 | 1.0 | | | |
| Vicuna-13b-v1.5-16k | 53.4 | 29.2 | 13.1 | 4.3 | 2.2 | 1.4 | 0.9 | | | |
| InternLM2-7b | 58.6 | 49.5 | 33.9 | 12.3 | 13.4 | 2.0 | 0.8 | 0.5 | 0.5 | 0.0 |
| Random Guess | 26.7 | 10.1 | 4.5 | 3.0 | 2.3 | 1.4 | 1.1 | 0.6 | 0.3 | 0.1 |
## 🖊️Citation
```bib
@misc{wang2024adaleval,
title={Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks},
author={Chonghua Wang and Haodong Duan and Songyang Zhang and Dahua Lin and Kai Chen},
year={2024},
eprint={2404.06480},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```