# DEA-SQL

**Repository Path**: cntony/DEA-SQL

## Basic Information

- **Project Name**: DEA-SQL
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-02-02
- **Last Updated**: 2025-02-13

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# Decomposition for Enhancing Attention: Improving LLM-based Text-to-SQL through Workflow Paradigm

### 🔥🔥 2024.05: DEA-SQL is accepted to Findings of ACL 2024!

Based on the idea of **D**ecomposition for **E**nhancing **A**ttention, we propose a workflow paradigm method named DEA-SQL with five major steps, as shown in the figure below. Check out our [paper](https://arxiv.org/abs/2402.10671) for more information.

![model](./docs/model.png)

## Set Up

### Environment

```bash
# 1. Clone the repo
git clone https://github.com/FlyingFeather/DEA-SQL.git
cd DEA-SQL && mkdir data

# 2. Make a conda environment
conda create -n deasql python=3.9
conda activate deasql

# 3. Install requirements
pip install -r requirements.txt
python nltk_downloader.py
```

### Dataset

Download the dataset from the [Spider official website](https://yale-lily.github.io/spider), then unzip it into the `data` folder under `DEA-SQL`. We also provide the data on [Google Drive](https://drive.google.com/file/d/15tQXB7ilPXBxuJU7ynPeNk_nLqtg8XgK/view?usp=sharing) in case the dataset cannot be downloaded from the official website.

```bash
mkdir data
unzip spider.zip -d data
```

The directory structure should be as follows:

```
.
├── argsparser.py
├── common
├── correct_sql.py
├── data
│   └── spider
│       ├── ...
│       └── database
├── data_preprocess.py
├── docs
├── evaluation
├── fewshot
├── filter_characters.py
├── gen_sql.py
├── get_ner.py
├── hardness_eval.py
├── __init__.py
├── LICENSE
├── llm
├── logger.py
├── main.py
├── nltk_downloader.py
├── outputs
├── prompt
├── README.md
├── requirements.txt
└── single_eval.py
```

## Usage

Please modify the OpenAI configuration in `common/static_config.py` and set the relevant environment variables for the Azure OpenAI API (see the sketch after the parameter list below).

Several important parameters:

- **dataset**: The name of the dataset.
- **few_shot_mode**: The few-shot retrieval method; one of [random, ques_tim, masked_ques_sim] (a similarity-retrieval sketch follows this list).
- **few_shot_data**: The data pool used for few-shot retrieval; one of [train_merge_v1, train_merge_v5].
- **insert_value**: The number of value rows inserted into the database prompt.
- **embedding_base_model**: The base embedding model used in the few-shot retrieval step.
- **sc_filter_nums**: The number of information filter layers.
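As a minimal sketch of the Azure OpenAI setup (the variable names below are common Azure OpenAI conventions, not confirmed by this repo; `common/static_config.py` defines the exact keys the code reads):

```bash
# Hypothetical variable names for illustration only -- check
# common/static_config.py for the keys DEA-SQL actually reads.
export AZURE_OPENAI_API_KEY="your-api-key"
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
export AZURE_OPENAI_API_VERSION="2023-05-15"
```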
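To illustrate what the similarity-based `few_shot_mode` options do, here is a minimal sketch: the input question is embedded (after entity masking, in the `masked_ques_sim` mode) and the training examples with the most similar question embeddings become the few-shot demonstrations. This is an illustrative sketch with hypothetical names, not the repository's implementation.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve_few_shot(question_vec, pool_vecs, pool_examples, k=3):
    """Return the k pool examples whose question embeddings are most
    similar to the input question's embedding."""
    scores = np.array([cosine_sim(question_vec, v) for v in pool_vecs])
    top_idx = np.argsort(scores)[::-1][:k]
    return [pool_examples[i] for i in top_idx]

# Toy demo with random stand-in vectors; real embeddings would come from
# the model selected via --embedding_base_model, applied to the question
# after entity masking (the "masked" part of masked_ques_sim).
rng = np.random.default_rng(0)
pool = [{"question": f"q{i}", "sql": f"SELECT {i};"} for i in range(8)]
vecs = [rng.normal(size=16) for _ in pool]
print(retrieve_few_shot(rng.normal(size=16), vecs, pool, k=3))
```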
## Quick Start

### Prediction on the Spider dev set

```bash
python main.py --save_file_name "dea-sql.txt" --dataset "spider" --mode "dev" \
    --sample "False" --few_shot_mode "masked_ques_sim" --insert_value 3 \
    --embedding_base_model "openai" --sc_filter_nums 3 --few_shot_data "train_merge_v5"
```

### Evaluation on the Spider dev set

Before the first evaluation, run `python nltk_downloader.py`.

```bash
python evaluation/test-suite-sql-eval/evaluation.py \
    --gold "evaluation/gold_files/spider_dev_gold.sql" \
    --pred "outputs/spider/dea-sql.txt" \
    --db ./data/spider/database \
    --print_file_name "outputs/spider/spider-dea-sql.txt" \
    --table './data/spider/tables.json' \
    --etype exec
```

## Citing DEA-SQL

```bibtex
@article{xie2024decomposition,
  title={Decomposition for Enhancing Attention: Improving LLM-based Text-to-SQL through Workflow Paradigm},
  author={Yuanzhen Xie and Xinzhou Jin and Tao Xie and MingXiong Lin and Liang Chen and Chenyun Yu and Lei Cheng and ChengXiang Zhuo and Bo Hu and Zang Li},
  journal={arXiv preprint arXiv:2402.10671},
  year={2024}
}
```