![PyPI](https://img.shields.io/pypi/v/warp-rnnt.svg) [![Downloads](https://pepy.tech/badge/warp-rnnt)](https://pepy.tech/project/warp-rnnt)

# CUDA-Warp RNN-Transducer

A GPU implementation of the RNN Transducer loss (Graves [2012](https://arxiv.org/abs/1211.3711), [2013](https://arxiv.org/abs/1303.5778)). This code is ported from the [reference implementation](https://github.com/awni/transducer/blob/master/ref_transduce.py) (by Awni Hannun) and fully utilizes the CUDA warp mechanism.

The main bottleneck in the loss is the forward/backward pass, which is based on a dynamic programming algorithm. In particular, there is a nested loop that populates a lattice of shape (T, U), and each value in this lattice depends on the two neighboring cells, one from each dimension (see, e.g., the [forward pass](https://github.com/awni/transducer/blob/6b37e98c21551c7ed2181e2f526053bae8ae94d2/ref_transduce.py#L56) of the reference implementation; a minimal Python sketch of this recursion is given below, after the profiler comparison).

CUDA executes threads in groups of 32 parallel threads called [warps](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#simt-architecture). Full efficiency is realized when all 32 threads of a warp agree on their execution path. This is exactly the property exploited to optimize the RNN Transducer. The lattice is split into warps along the T dimension. Within each warp, values are exchanged between threads using fast intra-warp operations. As soon as the current warp fills its last value, the next two warps, (t+32, u) and (t, u+1), start running. A schematic of the forward pass is shown in the figure below, where T is the number of frames, U is the number of labels, and W is the warp size. A similar procedure for the backward pass runs in parallel.

![](lattice.gif)

## Performance

The NVIDIA Profiler shows the advantage of the _warp_ implementation over the _non-warp_ implementation.

This warp implementation:

![](warp-rnnt.nvvp.png)

Non-warp implementation [warp-transducer](https://github.com/HawkAaron/warp-transducer):

![](warp-transducer.nvvp.png)

Unfortunately, in practice this advantage disappears because memory operations take much longer, especially if memory is synchronized on each iteration.
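Before the benchmark numbers, here is a minimal sketch of the forward recursion described above, written in log space with NumPy and following the structure of the reference implementation linked above. It is a simplified, CPU-only illustration of the lattice fill (one sequence, no batching), not the CUDA kernel; the function and variable names are illustrative.

```python
import numpy as np

def forward_pass(log_probs, labels, blank=0):
    """Fill the (T, U) lattice of forward variables (alphas) in log space.

    log_probs: array of shape (T, U, V) with log-softmax scores,
               where U = len(labels) + 1.
    Returns the alphas and the log-likelihood of the label sequence.
    """
    T, U, _ = log_probs.shape
    alphas = np.zeros((T, U))

    # First column: only blank emissions advance along the time axis.
    for t in range(1, T):
        alphas[t, 0] = alphas[t - 1, 0] + log_probs[t - 1, 0, blank]
    # First row: only label emissions advance along the label axis.
    for u in range(1, U):
        alphas[0, u] = alphas[0, u - 1] + log_probs[0, u - 1, labels[u - 1]]
    # Each interior cell depends on the cell at (t-1, u) via a blank
    # emission and the cell at (t, u-1) via a label emission.
    for t in range(1, T):
        for u in range(1, U):
            no_emit = alphas[t - 1, u] + log_probs[t - 1, u, blank]
            emit = alphas[t, u - 1] + log_probs[t, u - 1, labels[u - 1]]
            alphas[t, u] = np.logaddexp(emit, no_emit)

    # The path ends with a final blank emitted from the last cell.
    loglike = alphas[T - 1, U - 1] + log_probs[T - 1, U - 1, blank]
    return alphas, loglike
```

The CUDA kernels parallelize exactly this nested loop: the lattice is split into warps along the T axis, and the counts array mentioned in the notes below is used to schedule warps as their dependencies become ready.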
|                         | warp_rnnt (gather=False) | warp_rnnt (gather=True) | [warprnnt_pytorch](https://github.com/HawkAaron/warp-transducer/tree/master/pytorch_binding) | [transducer (CPU)](https://github.com/awni/transducer) |
| :---------------------- | -----------------------: | ----------------------: | --------------------: | ----------------: |
| **T=150, U=40, V=28**   |                          |                         |                       |                   |
| N=1                     | 0.50 ms                  | 0.54 ms                 | 0.63 ms               | 1.28 ms           |
| N=16                    | 1.79 ms                  | 1.72 ms                 | 1.85 ms               | 6.15 ms           |
| N=32                    | 3.09 ms                  | 2.94 ms                 | 2.97 ms               | 12.72 ms          |
| N=64                    | 5.83 ms                  | 5.54 ms                 | 5.23 ms               | 23.73 ms          |
| N=128                   | 11.30 ms                 | 10.74 ms                | 9.99 ms               | 47.93 ms          |
| **T=150, U=20, V=5000** |                          |                         |                       |                   |
| N=1                     | 0.95 ms                  | 0.80 ms                 | 1.74 ms               | 21.18 ms          |
| N=16                    | 8.74 ms                  | 6.24 ms                 | 16.20 ms              | 240.11 ms         |
| N=32                    | 17.26 ms                 | 12.35 ms                | 31.64 ms              | 490.66 ms         |
| N=64                    | out-of-memory            | out-of-memory           | out-of-memory         | 944.73 ms         |
| N=128                   | out-of-memory            | out-of-memory           | out-of-memory         | 1894.93 ms        |
| **T=1500, U=300, V=50** |                          |                         |                       |                   |
| N=1                     | 5.89 ms                  | 4.99 ms                 | 10.02 ms              | 121.82 ms         |
| N=16                    | 95.46 ms                 | 78.88 ms                | 76.66 ms              | 732.50 ms         |
| N=32                    | out-of-memory            | 157.86 ms               | 165.38 ms             | 1448.54 ms        |
| N=64                    | out-of-memory            | out-of-memory           | out-of-memory         | 2767.59 ms        |

[Benchmarked](pytorch_binding/benchmark.py) on a GeForce RTX 2070 Super GPU and an Intel i7-10875H CPU @ 2.30GHz.

## Note

- This implementation assumes that the input is the output of a log_softmax.
- In addition to the alphas/betas arrays, a counts array of shape (N, U * 2) is allocated and used as a scheduling mechanism.
- [core_gather.cu](core_gather.cu) is a memory-efficient version that expects log_probs of shape (N, T, U, 2), containing only the blank and label values. It shows excellent performance with a large vocabulary.
- Do not expect this implementation to greatly reduce the training time of an RNN Transducer model. Most likely, the main bottleneck will be the trainable joint network with an output of shape (N, T, U, V).
- There is also a restricted version, called [Recurrent Neural Aligner](https://github.com/1ytic/warp-rna), which assumes that the length of the input sequence is greater than or equal to the length of the target sequence.

## Install

There are two bindings for the core algorithm (a minimal usage sketch for the PyTorch binding is given at the end of this README):

- [pytorch_binding](pytorch_binding)
- [tensorflow_binding](tensorflow_binding)

## Reference

- Awni Hannun, [transducer](https://github.com/awni/transducer)
- Mingkun Huang, [warp-transducer](https://github.com/HawkAaron/warp-transducer)
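As an illustration of how the loss is typically called from PyTorch, here is a minimal sketch. The function name `rnnt_loss`, the argument names, and the default values are assumptions about the [pytorch_binding](pytorch_binding) API and should be checked against the binding itself and [benchmark.py](pytorch_binding/benchmark.py); shapes follow the notes above (log_probs over the full (N, T, U+1, V) lattice, int32 length tensors).

```python
import torch
from warp_rnnt import rnnt_loss  # assumed entry point of the pytorch_binding

N, T, U, V = 4, 150, 40, 28      # batch, frames, target length, vocabulary (incl. blank)

# The loss expects log-softmax scores for every (t, u) cell of the lattice.
logits = torch.randn(N, T, U + 1, V, device="cuda", requires_grad=True)
log_probs = torch.log_softmax(logits, dim=-1)

labels = torch.randint(1, V, (N, U), dtype=torch.int32, device="cuda")   # no blank (0) in targets
frames_lengths = torch.full((N,), T, dtype=torch.int32, device="cuda")
labels_lengths = torch.full((N,), U, dtype=torch.int32, device="cuda")

# gather=True corresponds to the memory-efficient core_gather.cu variant (see the notes above).
loss = rnnt_loss(log_probs, labels, frames_lengths, labels_lengths,
                 reduction="mean", blank=0, gather=True)
loss.backward()
```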