# AnaMeta Repo

This repo contains training, prediction, and evaluation code for the paper [_AnaMeta: A Table Understanding Dataset of Field Metadata Knowledge Shared by Multi-dimensional Data Analysis Tasks_](https://aclanthology.org/2023.findings-acl.604/).

You can run this repo within a Docker container; see the [dockerfile](dockerfile).

## KDF Framework

### Pre-trained model embedding

The paper involves two pre-trained model embeddings: TAPAS and TABBIE. You can generate TAPAS embeddings with [Hugging Face](https://github.com/huggingface) (a minimal sketch follows the code overview below). For TABBIE, you can generate embeddings from the [TABBIE repo](TODO).

### Metadata code

The code is organized as follows:

+ [data](data): Load necessary data.
+ [metadata/evaluations](metadata/evaluations): Evaluation metrics for each task in the paper.
+ [metadata/measure_type](metadata/measure_type): Map measure types from different datasets to the measure types in the paper.
+ [metadata/metadata_data](metadata/metadata_data): Construct batches of input data for metadata.
+ [metadata/predict](metadata/predict): Evaluate the model.
+ [metadata/train](metadata/train): Train the model.
+ [metadata/find_perform.py](metadata/find_perform.py): Find useful evaluation metrics in logs.
+ [metadata/run_train.py](metadata/run_train.py): Entry point for metadata training.
+ [metadata/run_predict.py](metadata/run_predict.py): Entry point for metadata inference.
+ [model](model): Model of metadata.
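As a minimal sketch of generating TAPAS embeddings with the Hugging Face `transformers` library: the checkpoint name (`google/tapas-base`), the toy table, and the query below are illustrative assumptions, and this repo's own feature pipeline may pre-process tables differently.

```python
# Sketch only: extract TAPAS embeddings via Hugging Face transformers.
# The checkpoint, toy table, and query are assumptions for illustration.
import pandas as pd
import torch
from transformers import TapasModel, TapasTokenizer

tokenizer = TapasTokenizer.from_pretrained("google/tapas-base")
model = TapasModel.from_pretrained("google/tapas-base")

# TAPAS expects a table of strings plus a query.
table = pd.DataFrame(
    {"Year": ["2020", "2021"], "Sales": ["100", "150"]}
).astype(str)
inputs = tokenizer(
    table=table, queries=["What were the sales in 2021?"], return_tensors="pt"
)

with torch.no_grad():
    outputs = model(**inputs)

# Token-level embeddings; pool per cell/field as the downstream task requires.
print(outputs.last_hidden_state.shape)  # (1, seq_len, hidden_size)
```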
#### Train

Run one of the following scripts to train the metadata model:

```shell
# TABBIE
python -m metadata.run_train --model_size=customize --features=metadata-tabbie --model_name metadata2 --train_batch_size=64 --valid_batch_size=96 --msr_pos_weight=0.8 --sum_pos_weight=0.4 --avg_neg_weight=0.3 --both_neg_weight=0.5 --no_label_weight=0.2 --train_epochs=10 --save_model_fre=5 --num_layers=60 --num_hidden=128 --lang=en --num_workers=0 --corpus all --chart --pivot --vendor --t2d --semtab --tf1_layers 2 --tf2_layers 2 --df_subset 1 2 3 4 5 --mode general --use_emb --use_entity --entity_type transe100 --entity_emb_path --entity_recognition semtab --use_df

# TAPAS
python -m metadata.run_train --model_size=customize --features=metadata-tapas_display --model_name metadata2 --train_batch_size=64 --valid_batch_size=96 --msr_pos_weight=0.8 --sum_pos_weight=0.4 --avg_neg_weight=0.3 --both_neg_weight=0.5 --no_label_weight=0.2 --train_epochs=10 --save_model_fre=5 --num_layers=60 --num_hidden=128 --lang=en --num_workers=0 --corpus all --chart --pivot --vendor --t2d --semtab --tf1_layers 2 --tf2_layers 2 --df_subset 1 2 3 4 5 --mode general --use_emb --use_entity --entity_type transe100 --entity_emb_path --entity_recognition semtab --use_df
```

Notes:

+ If you don't want to use data features, remove `--use_df`.
+ If you don't want to use knowledge graph information, remove `--use_entity`.

#### Inference

Run one of the following scripts for metadata inference:

```shell
# TABBIE
python -m metadata.run_predict --model_size=customize --features=metadata-tabbie --valid_batch_size=96 --mode general --num_layers=3 --num_hidden=128 --lang=en --model_load_path --eval_dataset test --corpus all --chart --pivot --vendor --t2d --semtab --use_emb --tf1_layers 2 --tf2_layers 2 --model_name metadata2 --df_subset 1 2 3 4 5 --use_entity --entity_type transe100 --entity_emb_path --entity_recognition semtab --use_df --num_workers 0

# TAPAS
python -m metadata.run_predict --model_size=customize --features=metadata-tapas_display --valid_batch_size=96 --mode general --num_layers=3 --num_hidden=128 --lang=en --model_load_path --eval_dataset test --corpus all --chart --pivot --vendor --t2d --semtab --use_emb --tf1_layers 2 --tf2_layers 2 --model_name metadata2 --df_subset 1 2 3 4 5 --use_entity --entity_type transe100 --entity_emb_path --entity_recognition semtab --use_df --num_workers 0
```

Note:

+ Keep the same settings as in training.

## Interfaces

For the metadata IDs and sentences interface, generate them with [run_predict.py](metadata/run_predict.py) plus `--require_records --records4downstream_task`.

For the metadata embeddings interface, generate them with [download_embedding.py](metadata/download_embedding.py).

For training and evaluating the Table2Charts model, we follow [this repo](https://github.com/microsoft/Table2Charts).

For TableQA tasks, we follow [this repo](https://github.com/HKUNLP/UnifiedSKG).

## Benchmarks

Training and evaluation for traditional machine learning baselines are in the [traditional_ml](traditional_ml) folder (an illustrative sketch follows this section).

Training and testing for TURL follow the [TURL repo](https://github.com/sunlab-osu/TURL).

Training and testing for TAPAS and TABBIE can leverage the KDF framework without distribution fusion and knowledge fusion.
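As a minimal illustration of the traditional-ML baseline style, here is a sketch of a measure/dimension field classifier over hand-crafted features. The feature names, synthetic data, and model choice are hypothetical; see the [traditional_ml](traditional_ml) folder for the actual implementation.

```python
# Illustrative sketch only: a traditional-ML baseline for the measure/dimension
# task. Features, data, and model choice are hypothetical placeholders.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical per-field features:
# [numeric_ratio, distinct_ratio, mean_char_len, header_has_digit]
X = rng.random((1000, 4))
# Hypothetical labels: 1 = measure, 0 = dimension.
y = (X[:, 0] > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
clf = GradientBoostingClassifier().fit(X_train, y_train)
print("F1:", f1_score(y_test, clf.predict(X_test)))
```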
## Data

The paper involves five datasets: Chart, Pivot, T2D, TURL, and SemTab.

For the T2D dataset, you can download it from this [url](http://webdatacommons.org/webtables/goldstandard.html#toc2).

For the TURL dataset, you can download it from this [url](https://github.com/sunlab-osu/TURL).

For the SemTab dataset, you can download it from this [url](http://www.cs.ox.ac.uk/isg/challenges/sem-tab/).

Dataset quality inspection is in the [dataset_quality](dataset_quality) folder.

## Citation

If you find our work helpful, please use the following citation.

```
@inproceedings{he-etal-2023-anameta,
    title = "{A}na{M}eta: A Table Understanding Dataset of Field Metadata Knowledge Shared by Multi-dimensional Data Analysis Tasks",
    author = "He, Xinyi and Zhou, Mengyu and Zhou, Mingjie and Xu, Jialiang and Lv, Xiao and Li, Tianle and Shao, Yijia and Han, Shi and Yuan, Zejian and Zhang, Dongmei",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.604/",
    doi = "10.18653/v1/2023.findings-acl.604",
    pages = "9471--9492",
}
```

## Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.

## Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow [Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general). Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.