# FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Cues

Official implementation of **FusionAudio-1.2M**: Towards Fine-grained Audio Captioning with Multimodal Contextual Cues

* **Authors**: [Shunian Chen*](https://github.com/Shunian-Chen), [Xinyuan Xie*](https://github.com/satsuki2486441738), [Zheshu Chen*](https://github.com/kawagebo12), [Liyan Zhao](https://github.com/Apostasi0225cuhksz), [Owen Lee](https://github.com/KaiTheSkyWalker), [Zhan Su](https://scholar.google.com/citations?user=VzEpVpoAAAAJ), [Qilin Sun](https://scholar.google.com/citations?user=igqPS8sAAAAJ), [Benyou Wang](https://scholar.google.com.hk/citations?user=Jk4vJU8AAAAJ)
* **Institutions**: The Chinese University of Hong Kong, Shenzhen
* **Resources**: [Paper](https://arxiv.org/abs/2506.01111) [🤗 Dataset](https://huggingface.co/datasets/SatsukiVie/FusionAudio)
* **Models**: [🤗 FusionAudio](https://huggingface.co/SatsukiVie/FusionAudio)

## Highlights

* **Large-scale, high-quality** audio captioning dataset **FusionAudio-1.2M**
* **Multimodal context fusion** for more fine-grained audio understanding
* **SOTA performance** on multiple audio understanding benchmarks

## News

**\[2025/06/01\]** Our paper [FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Cues](https://arxiv.org/abs/2506.01111) is available!

**\[2025/05/16\]** Released the FusionAudio-1.2M [dataset](https://huggingface.co/datasets/SatsukiVie/FusionAudio), [model](https://huggingface.co/SatsukiVie/FusionAudio/tree/main), and code!

## Quick Start

### Environment Setup

```bash
# Create conda environment
conda create -n FusionAudio python=3.10
conda activate FusionAudio

# Install dependencies
pip install -r requirements.txt
pip install -e src/GAMA/hf-dev-train/transformers-main
pip install -e src/GAMA/peft-main
```

### Quick Inference

We provide an easy-to-use inference script `quick_inference.py` that supports both command-line and Python API usage.

#### Command Line Usage

```bash
python quick_inference.py \
    --base_model /path/to/Llama-2-7b-chat-hf-qformer \
    --model_path /path/to/fusionaudio_checkpoint.pth \
    --audio /path/to/your/audio.wav \
    --question "Please describe this audio in detail."
```

#### Python API Usage

```python
from quick_inference import FusionAudioInference

# Initialize the inferencer
inferencer = FusionAudioInference(
    base_model_path="/path/to/Llama-2-7b-chat-hf-qformer",
    model_path="/path/to/fusionaudio_checkpoint.pth",
    device="cuda:0"
)

# Audio captioning
response = inferencer.predict(
    audio_path="/path/to/your/audio.wav",
    question="Please describe this audio in detail."
)
print(f"Audio description: {response}")
```

For detailed parameter descriptions, run `python quick_inference.py --help`.

## Dataset

### FusionAudio-1.2M

We constructed a large-scale dataset containing 1.2 million high-quality audio-text pairs.

**Caption & QA Dataset Download Link**: [🤗 Hugging Face](https://huggingface.co/datasets/SatsukiVie/FusionAudio)

**Video Download**:

```bash
# Preparation:
# 1. Save your Google account cookies to a txt file
# 2. Edit lines 56, 116, and 118 of VideoDownload.py to point to the cookie txt file and the video download path
# 3. Run VideoDownload.py
# 4. If the download speed is slow, you can download the videos on cloud servers such as AWS
cd data
python VideoDownload.py
```

#### Data Format

```json
[
    {
        "audio_id": "path_to_audio_file",
        "instruction": "Question",
        "input": "",
        "dataset": "dataset_name",
        "task": "type_of_task",
        "output": "correct_answer"
    }
]
```
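As a quick illustration, here is a minimal Python sketch for reading annotation files in this format. It is not part of the repository; the file name `fusionaudio_train.json` is a placeholder for whichever annotation file you downloaded.

```python
# Minimal loading sketch (illustrative only). "fusionaudio_train.json" is a
# placeholder for an annotation file in the format shown above.
import json
from collections import Counter

with open("fusionaudio_train.json", "r", encoding="utf-8") as f:
    samples = json.load(f)  # a list of dicts with the fields shown above

# Count how many samples of each task type are present
print(Counter(sample["task"] for sample in samples))

# Build (audio path, question, expected answer) triples, e.g. for evaluation
triples = [(s["audio_id"], s["instruction"], s["output"]) for s in samples]
print(f"Loaded {len(triples)} samples")
```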
## Training

### Preprocessing

1. Download the Llama-2-7b-chat-hf-qformer model (refer to the [GAMA README](https://github.com/Sreyan88/GAMA))
2. Update the model path in `src/GAMA/gama_finetune.py` at lines 96 and 101

### Start Training

```bash
conda activate FusionAudio
cd scripts/train/
bash train.sh
```

## Evaluation

### Classification Task Evaluation

```bash
cd scripts/eval
bash eval_cls.sh
```

### Captioning Evaluation

```bash
cd scripts/eval
bash infer.sh
```

### Retrieval Task Evaluation

```bash
# Environment preparation (refer to the WavCaps repository)
# 1. Configure the environment according to https://github.com/XinhaoMei/WavCaps/tree/master/retrieval
# 2. Set ckpt_path in inference.yaml
# 3. Put eval_retrieval.py into the downloaded retrieval folder
cd scripts
python eval_retrieval.py
```

## Data Statistics

## Model Downloads

| Model Name | Purpose | Download Link |
|------------|---------|---------------|
| FusionAudio-25k / FusionAudio-25k-high | General audio understanding | [🤗 HuggingFace](https://huggingface.co/SatsukiVie/FusionAudio) |
| FusionAudio-Retrieval | Audio retrieval | [🤗 HuggingFace](https://huggingface.co/Zheshu/FusionAudio-Retrieval) |

## Acknowledgments

* **GAMA**: Thanks for providing the excellent infrastructure
* **WavCaps**: Thanks for the pioneering work in audio captioning
* **Llama**: Thanks for providing a powerful language model foundation
* **AudioSet**: Thanks for providing the large-scale audio dataset and ontology

## Citation

If our work is helpful for your research, please consider giving us a star and citing our paper:

```bibtex
@misc{chen2025fusionaudio12mfinegrainedaudiocaptioning,
      title={FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion},
      author={Shunian Chen and Xinyuan Xie and Zheshu Chen and Liyan Zhao and Owen Lee and Zhan Su and Qilin Sun and Benyou Wang},
      year={2025},
      eprint={2506.01111},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2506.01111},
}
```

## License

**Usage License**: The dataset and models are intended for research use only. Their use is also restricted to purposes that comply with the license agreements of LLaMA, Vicuna, and other related models. The dataset is released under CC BY-NC 4.0 (non-commercial use only), and models trained on this dataset should not be used outside of research.

---