# slowfast_nfnets **Repository Path**: mirrors_deepmind/slowfast_nfnets ## Basic Information - **Project Name**: slowfast_nfnets - **Description**: No description available - **Primary Language**: Unknown - **License**: Apache-2.0 - **Default Branch**: main - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2022-06-24 - **Last Updated**: 2025-10-20 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # Towards Learning Universal Audio Representations In [Towards Learning Universal Audio Representations] (to appear at [ICASSP 2022]), we introduce a Holistic Audio Representation Evaluation Suite (HARES), containing 12 downstream tasks spanning the speech, music, and environmental sound domains, with the hope that this will spur research on developing better models for universal audio representations. Together with the benchmark, we also propose a new Slowfast NFNet architecture in the paper. ## HARES tasks Below is a summary of all 12 HARES tasks, with the links to obtaining these freely available datasets. Note that the lables of original test sets of [Birdsong] and [TUT18] are not publicly availabe - therefore we use the splits created by the authors of [Pre-Training Audio Representations with Self-Supervision] ([download link]), which is based on the original training subset. For more details about how to assemble these tasks, please refer to Appendix A of the arXiv version of [our paper]. | Dataset | Task | #Samples | #Classes | Domain | |----------|:-------------|------:|------:|:------| | [AudioSet] | audio tagging | 1.9m | 527 | environment | | [Birdsong] | animal sound | 36k | 2 | environment | | [TUT18] | acoustic scenes | 8.6k | 10 | environment | | [ESC-50] | acoustic scenes | 2.0k | 50 | environment | | [Speech Commands v1] | keyword | 90k | 12 | speech | | [Speech Commands v2] | keyword | 96k | 35 | speech | | [Fluent Speech Commands] | intention | 27k | 31 | speech | | [VoxForge] | languge id | 145k | 6 | speech | | [VoxCeleb] | speaker id | 147k | 1251 | speech | | [NSynth-instrument] | instrument id | 293k | 11 | music | | [NSynth-pitch] | pitch estimation | 293k | 128 | music | | [MagnaTagATune] | music tagging | 26k | 50 | music | ## Audio Slowfast NFNets, a JAX implementation We provide a [JAX]/[Haiku] implementation of the Slowfast NfNet-F0. This convolutional neural network combines Slowfast networks' ability to model both transient and long-range signals in audio, and NFNets' strong performance optimized for hardware accelerators. It achieves the state-of-the-art score on the HARES benchmark. You may use our unit tests to test your development environment and to know more about the usage of the models, which can be executed using `pytest`: ```bash $ pip install -r requirements.txt $ python -m pytest [-n ] slowfast_nfnets ``` ### Usage The unit tests provided together with the model shows a few use cases of how the model can be run. ## Citing this work BibTex for citing the paper: ```bibtex @inproceedings{wang2022towards, title={Towards Learning Universal Audio Representations}, author={Wang, Luyu and Luc, Pauline and Wu, Yan and Recasens, Adria and Smaira, Lucas and Brock, Andrew and Jaegle, Andrew and Alayrac, Jean-Baptiste and Dieleman, Sander and Carreira, Joao and van den Oord, Aaron}, booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, pages={4593--4597}, year={2022}, organization={IEEE} } ``` ## Disclaimer This is not an official Google product. [ICASSP 2022]: https://2022.ieeeicassp.org/ [JAX]: https://github.com/google/jax "JAX on GitHub" [Haiku]: https://github.com/deepmind/dm-haiku [Towards Learning Universal Audio Representations]: https://arxiv.org/abs/2111.12124 [AudioSet]: http://research.google.com/audioset/ [Birdsong]: http://dcase.community/challenge2018/task-bird-audio-detection [TUT18]: http://dcase.community/challenge2018/task-acoustic-scene-classification [ESC-50]: http://github.com/karolpiczak/ESC-50 [Speech Commands v1]: http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz [Speech Commands v2]: http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz [Fluent Speech Commands]: http://fluent.ai/research/fluent-speech-commands/ [VoxForge]: http://tensorflow.org/datasets/catalog/voxforge [VoxCeleb]: http://tensorflow.org/datasets/catalog/voxceleb [NSynth-instrument]: http://tensorflow.org/datasets/catalog/nsynth [NSynth-pitch]: http://tensorflow.org/datasets/catalog/nsynth [MagnaTagATune]: http://mirg.city.ac.uk/codeapps/the-magnatagatune-dataset [Pre-Training Audio Representations with Self-Supervision]: https://ieeexplore.ieee.org/abstract/document/9060816 [our paper]: https://arxiv.org/abs/2111.12124 [download link]: https://drive.google.com/drive/folders/1VXExUxPkUgcBgCLBd9fX8R-X6BR-g3gu?usp=sharing