# NSC

**Repository Path**: zhoub86/NSC

## Basic Information

- **Project Name**: NSC
- **Description**: Neural Speech Codec
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-03-18
- **Last Updated**: 2021-03-18

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Efficient And Scalable Neural Residual Waveform Coding With Collaborative Quantization

[![LICENSE](https://img.shields.io/badge/license-MIT-green)](https://github.com/cocosci/pam-nac-v2/master/LICENSE) [![Python](https://img.shields.io/badge/Python-3.6-purple)](https://www.python.org/) [![TensorFlow](https://img.shields.io/badge/TensorFlow-2.0-orange)](https://www.tensorflow.org/) [![Paper](https://img.shields.io/badge/PDF-IEEEXplore-blue)](https://ieeexplore.ieee.org/document/9054347/)

Scalability and efficiency are desired in neural speech codecs, which should support a wide range of bitrates for applications on various devices. We propose a collaborative quantization (CQ) scheme to jointly learn the codebook of LPC coefficients and the corresponding residuals. CQ does not simply shoehorn LPC into a neural network; rather, it bridges the computational capacity of advanced neural network models with traditional, yet efficient and domain-specific, digital signal processing methods in an integrated manner. We demonstrate that CQ achieves much higher quality than its predecessor at 9 kbps with even lower model complexity. We also show that CQ can scale up to 24 kbps, where it outperforms AMR-WB and Opus. As a neural waveform codec, CQ models contain fewer than 1 million parameters, significantly fewer than many other generative models.

Please consider citing our papers if this helps.

```
@inproceedings{zhen2020cq,
  author={Kai Zhen and Mi Suk Lee and Jongmo Sung and Seungkwon Beack and Minje Kim},
  title={{Efficient And Scalable Neural Residual Waveform Coding with Collaborative Quantization}},
  year=2020,
  booktitle={Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2020},
  doi={10.1109/ICASSP40776.2020.9054347},
  url={https://ieeexplore.ieee.org/document/9054347}
}

@inproceedings{Zhen2019,
  author={Kai Zhen and Jongmo Sung and Mi Suk Lee and Seungkwon Beack and Minje Kim},
  title={{Cascaded Cross-Module Residual Learning Towards Lightweight End-to-End Speech Coding}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={3396--3400},
  doi={10.21437/Interspeech.2019-1816},
  url={http://dx.doi.org/10.21437/Interspeech.2019-1816}
}
```

# Demos

- Project Page - I: https://saige.sice.indiana.edu/research-projects/neural-audio-coding/
- Project Page - II: http://kaizhen.us/collaborative-quantization

![model architecture](https://github.com/cocosci/NSC/blob/master/figure/model_3.png)

# The Code Structure

- utilities.py: supporting functions for Hann windowing, waveform segmentation, and objective measure calculation
- lpc_utilities.py: LPC analyzer, synthesizer, and related functions implemented in Python (a minimal sketch of the underlying idea follows this list)
- neural_speech_coding_module.py: model configuration, training, and evaluation for one neural codec
- cmrl.py: model training and evaluation with multiple cascaded neural codecs (CMRL)
- loss_terms_and_measures: loss functions and routines for computing objective measures such as PESQ
- nn_core_operator.py: fundamental operations such as convolution and quantization
- constants.py: definitions of the frame size, sample rate, and other initializations
- main.py: the entry file
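To make the role of lpc_utilities.py concrete: LPC analysis whitens each frame into a prediction residual, and in CQ it is this residual that the neural codec encodes, while the LPC coefficients are quantized with a learned codebook. Below is a minimal sketch of plain LPC analysis and synthesis, assuming `librosa` and `scipy` are available; the function names and the LPC order are our choices for illustration, not the repository's settings.

```python
# Illustrative LPC analysis/synthesis round trip; the repository's actual
# implementation lives in lpc_utilities.py. The order of 16 is an arbitrary
# choice for this demo, not a value taken from constants.py.
import numpy as np
import librosa
from scipy.signal import lfilter

def lpc_analysis(frame, order=16):
    """Return LPC coefficients a (with a[0] == 1) and the prediction residual."""
    a = librosa.lpc(frame, order=order)   # inverse-filter coefficients of A(z)
    residual = lfilter(a, [1.0], frame)   # e[n] = A(z) x[n]
    return a, residual

def lpc_synthesis(a, residual):
    """Reconstruct the frame by running the residual through 1 / A(z)."""
    return lfilter([1.0], a, residual)

# Round trip on a random "frame": reconstruction should match the input.
x = np.random.randn(512)
a, e = lpc_analysis(x)
x_hat = lpc_synthesis(a, e)
assert np.allclose(x, x_hat, atol=1e-6)
```

With CQ, learned quantizers would sit between analysis and synthesis: the residual is coded by the neural codec and the coefficients by a jointly trained codebook, rather than being passed through losslessly as in this sketch.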
# The Dataset

The experiments are conducted on the TIMIT corpus: https://catalog.ldc.upenn.edu/LDC93S1

# Run The Code

## Training

```
python main.py
  --learning_rate_tanh 0.0002                          # the learning rate for the 1st codec
  --learning_rate_greedy_followers '0.00002 0.000002'  # the learning rates for the added codecs and finetuning
  --epoch_tanh 200                                     # the number of epochs for the 1st codec
  --epoch_greedy_followers '50 50'                     # the numbers of epochs for the added codecs and finetuning
  --batch_size 128
  --num_resnets 2                                      # number of neural codecs involved
  --training_mode 4                                    # see main.py for specifications
  --base_model_id '1993783'                            # used for finetuning and evaluation
  --from_where_step 2                                  # used for finetuning and evaluation
  --suffix '_greedy_all_'                              # the suffix of the name of the model to be saved
  --bottleneck_kernel_and_dilation '9 9 100 20 1 2'    # configuration of the ResNet block
  --save_unique_mark 'follower_all'                    # the name of the model to be saved
  --the_strides '2'                                    # the stride value for the downsampling CNN layer
  --coeff_term '60 10 10 0'                            # coefficients for the loss terms
  --res_scalar 1.0
  --pretrain_step 2                                    # number of pretraining steps with no quantization
  --target_entropy 2.2                                 # target entropy
  --num_bins_for_follower '32 32'                      # numbers of quantization bins
  --is_cq 1                                            # whether collaborative quantization is enabled
```

## Evaluation

```
python main.py
  --training_mode 0  # base_model_id needs to be set correctly; other settings do not need to be changed
```

# References

Our work is built upon several recent publications on end-to-end speech coding, trainable quantizers, and LPCNet.

- [1] D. O'Shaughnessy, "Linear predictive coding," IEEE Potentials, vol. 7, no. 1, pp. 29–32, 1988.
- [2] J.-M. Valin and J. Skoglund, "LPCNet: Improving neural speech synthesis through linear prediction," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019.
- [3] S. Kankanahalli, "End-to-end optimized speech coding with deep neural networks," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018.
- [4] E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. V. Gool, "Soft-to-hard vector quantization for end-to-end learning compressible representations," in Advances in Neural Information Processing Systems (NIPS), 2017, pp. 1141–1151.

Some of the code is borrowed from https://github.com/sri-kankanahalli/autoencoder-speech-compression
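On the quantizer itself: the trainable quantization of [4], which the borrowed repository also builds on, uses soft-to-hard assignment, where each latent value is softly assigned to learnable bin centers during training so that gradients flow through the codebook. The following is a minimal TensorFlow sketch of that idea under our own naming (`soft_to_hard_quantize`, the `alpha` sharpness parameter, and the example shapes are assumptions for illustration, not the actual API in nn_core_operator.py):

```python
# Sketch of soft-to-hard scalar quantization [4] with a straight-through
# estimator. All names here are ours, for illustration only; the repository's
# quantizer lives in nn_core_operator.py.
import tensorflow as tf

def soft_to_hard_quantize(z, centers, alpha=300.0):
    """Map each latent value in z to one of the learnable bin centers.

    z:       tensor of latent values, any shape
    centers: (num_bins,) learnable codebook, e.g. of the size set by
             --num_bins_for_follower
    alpha:   softmax sharpness; larger means closer to hard assignment
    """
    dist = tf.abs(tf.expand_dims(z, -1) - centers)   # |z - c_j| per bin
    soft = tf.nn.softmax(-alpha * dist, axis=-1)     # soft assignment
    hard = tf.one_hot(tf.argmin(dist, axis=-1),
                      depth=tf.shape(centers)[0],
                      dtype=soft.dtype)              # nearest-bin assignment
    # Straight-through: hard values on the forward pass, soft gradients back.
    assign = soft + tf.stop_gradient(hard - soft)
    return tf.reduce_sum(assign * centers, axis=-1)

# Example: quantize a batch of latents against a 32-bin codebook.
codebook = tf.Variable(tf.linspace(-1.0, 1.0, 32))
latents = tf.random.normal([128, 256])
quantized = soft_to_hard_quantize(latents, codebook)
```

At test time the hard assignment yields discrete symbols, and during training the entropy of the soft assignments can be regularized toward a target value to control the bitrate, which is presumably what the `--target_entropy` flag above governs.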