# cltest

**Repository Path**: cqchu/cltest

## Basic Information

- **Project Name**: cltest
- **Description**: CL Test
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 1
- **Forks**: 2
- **Created**: 2020-08-28
- **Last Updated**: 2021-04-01

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README


眨眼间两个月的实习已经步入尾声，仅以此文记录一下实习期间的一些工作。

实习期间可以说主要做了三件事，首先在实习开始阶段比较细致的阅读了`Tflite`的源码，相关的笔记可以参考[这里](https://gitee.com/cqchu/cltest/tree/master/tflite_notes)，后续如果有其他同事想阅读相关源码的话可以参考，这里就先不多赘述了。第二件事是对Winogard算法的探索与实现，由于这个内容也已经和大家分享过了，所以此处也不多说。这里主要介绍一下第三件事，对Mali GPU上进行量化推理的探索。

#### ARM Int8 Dot Product

Mali在其最近的几款GPU上（目前来看主要包括`G52, G76, G57, G77, G68, G78`这几款）新增了一个硬件模块`int8 dot product`，从硬件上加速量化神经网络的推理。关于底层硬件相关的信息ARM给的很少，可以说几乎没有。软件上ARM是以OpenCL Extension的方式支持了这个特性，[Arm Mali Bifrost and Valhall OpenCL Developer Guide](https://developer.arm.com/documentation/101574/0400/OpenCL-extensions/OpenCL-extensions-supported-by-the-Mali-GPU-OpenCL-driver?lang=en)中列举了这几个Extension，其可以以如下的方式被Enable：

```C++
#pragma OPENCL EXTENSION cl_arm_integer_dot_product_int8 : enable
#pragma OPENCL EXTENSION cl_arm_integer_dot_product_accumulate_int8 : enable
#pragma OPENCL EXTENSION cl_arm_integer_dot_product_accumulate_int16 : enable
// cl_arm_integer_dot_product_accumulate_saturate_int8只有从G77开始才会被支持。
#pragma OPENCL EXTENSION cl_arm_integer_dot_product_accumulate_saturate_int8 : enable
```

这几个extension分别会引入几个可以做内积的函数，这几个函数在编译时应该会被OpenCL编译成Mali上关于`int8 dot product`模块的指令，其中`cl_arm_integer_dot_product_int8`引入的函数为

```C++
int arm_dot(char4 a, char4 b);
uint arm_dot(uchar4 a, uchar4 b);
```

这个函数的返回值就是`(a.x * b.x) + (a.y * b.y) + (a.z * b.z) + (a.w * b.w)`，即对两个`char4/uchar4`的输入求内积并返回，其一般用法是：

```C++
int accum = 0;
for (/* conditions */) {
	char4 lhs, rhs;
    accum += arm_dot(lhs, rhs);
}
```

`cl_arm_integer_dot_product_accumulate_int8/cl_arm_integer_dot_product_accumulate_int16`中引入的函数为

```C++
// cl_arm_integer_dot_product_accumulate_int8
int arm_dot_acc(char4 a, char4 b, int acc);
uint arm_dot_acc(uchar4 a, uchar4 b, uint acc);

// cl_arm_integer_dot_product_accumulate_int16
int arm_dot_acc(short2 a, short2 b, int acc);
uint arm_dot_acc(ushort2 a, ushort2 b, uint acc);
```

其中前两者返回值是`acc + (a.x * b.x) + (a.y * b.y) + (a.z * b.z) + (a.w * b.w)`，后两者返回值是`acc + (a.x * b.x) + (a.y * b.y)`，典型用法如下

```C++
int accum = 0;
for (/* conditions */) {
	char4 lhs, rhs;
    accum = arm_dot_acc(lhs, rhs, accum);
}
```

`cl_arm_integer_dot_product_accumulate_saturate_int8`中引入的函数为

```C++
int arm_dot_acc_sat(char4 a, char4 b, int acc);
uint arm_dot_acc_sat(uchar4 a, uchar4 b, uint acc);
```

其返回的结果是`satutate(acc + (a.x * b.x) + (a.y * b.y) + (a.z * b.z) + (a.w * b.w))`

更具体的信息可以参考[这里](https://www.khronos.org/registry/OpenCL/extensions/arm/cl_arm_integer_dot_product.txt)

一般来说，这些函数的使用可以这样子

```C++
#ifdef cl_arm_integer_dot_product_int8
	#pragma OPENCL EXTENSION cl_arm_integer_dot_product_int8 : enable
	define ARM_DOT(lhs, rhs, acc) (acc) += arm_dot((lhs), (rhs))
#elif defined cl_arm_integer_dot_product_accumulate_int8
	#pragma OPENCL EXTENSION cl_arm_integer_dot_product_accumulate_int8 : enable
    define ARM_DOT(lhs, rhs, acc) (acc) = arm_dot_acc((lhs), (rhs), (acc))
#else
    // 与浮点数的实现相同
    define ARM_DOT(lhs, rhs, acc) (acc) += dot((lhs), (rhs))
#endif
```

更多使用的例子可以参考[ACL](https://github.com/ARM-software/ComputeLibrary/blob/master/src/core/CL/cl_kernels/depthwise_convolution_quantized.cl#L57)

#### 各家对GPU Int8的支持

|                             框架                             | 支持 Int8 CL Kernel |                    备注                    |
| :----------------------------------------------------------: | :-----------------: | :----------------------------------------: |
| [Paddle-Lite](https://github.com/PaddlePaddle/Paddle-Lite/tree/develop/mobile/src/operators/kernel/cl/cl_kernel) |         否          |                                            |
| [Tflite](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/delegates/gpu/cl/kernels) |         否          | 其会把输入的量化模型转为浮点模型，然后去算 |
| [MNN](https://github.com/alibaba/MNN/tree/master/source/backend/opencl/execution/cl) |        是？         |  使用mad24()写了conv与depthconv的int8版本  |
| [TNN](https://github.com/Tencent/TNN/tree/master/source/tnn/device/opencl/cl) |         否          |                                            |
| [NCNN](https://github.com/Tencent/ncnn/tree/master/src/layer/vulkan) |         否          |    无OpenCL实现，Vulkan后端应该也没实现    |
| [ACL](https://github.com/ARM-software/ComputeLibrary/tree/master/src/core/CL/cl_kernels) |         是          |                                            |
| [Mace](https://github.com/XiaoMi/mace/tree/master/mace/ops/opencl) |         否          |                                            |

#### 其他的内积函数

OpenCL中还支持一些原生的内积函数，这里列出做简要介绍，以便于后续对比`arm_dot()`函数和这些内积函数的性能差异

##### dot()

```C++
float  dot (gentype p0, gentype p1);
half   dot (gentype p0, gentype p1);	// if half extension (cl_khr_fp16) enabled 
double dot (gentype p0, gentype p1);    // if double extension (cl_khr_fp64) enabled 
```

其中`gentype`可以是`float/half/double`，`float2/half2/double2`，`float4/half4/double4`，以`float`为例其计算的是`float res = dot(lhs, rhs)`即等价于`float res = (lhs.x*rhs.x) + (lhs.y*rhs.y) + (lhs.z*rhs.z) + (lhs.w*rhs.w)`。

##### mad()

```C++
gentype mad(gentype a, gentype b, gentype c)
```

其中`gentype`可以是`float/half/double`，`float2/half2/double2`，`float4/half4/double4`，其计算过程是`a * b + c`，以`float4`为例，`float4 res = mad(lhs, rhs, val)`，则等价于`res.x = lhs.x*rhs.x+val.x`，`res.y = lhs.y*rhs.y+val.y`，在`res.z`和`res.w`上也会做这样的计算，所以如果要求内积的话，还需要再进行`acc = res.x + res.y + res.z + res.w`。

##### mad24()

```C++
gentype mad24 (gentype x, gentype y, gentype z)
```

其将两个`24bit`的整数`x`和`y`相乘，再将乘的结果和`32bit`的`z`相加，得到一个最终的结果，其他的和`mad`是类似的。

#### 量化性能探索

为了分析`int8`相对于`fp32`的性能变化，此处使用`matmul`算子上的运行时间作为比较的对象。具体来说，这里测试的是一个`512*512`和一个`512*512`的矩阵相乘的性能，其中用作对比的是`fp32`和`int8`的kernel实现方式一模一样，只是在输入输出的数据类型，以及最核心的那个向量内积求的方式有区别。具体测试的平台使用的是Mate20手机，其上的Mali GPU型号是G76。

##### Case 1

`fp32`和`int8`的输入输出均使用`buffer`，然后输入的右操作数均做了转置以使访存连续。这两个kernel分别对应了`matmul_fp32_lbrbobntnp`和`matmul_int8_lbrbobntnp`，`lbrbobntnp`表示`lhs-buffer(lb), rhs-buffer(rb), out-buffer(ob), no tiling, no packing`。考虑到矩阵乘会使结果超出`int8`的表示空间，此处会让累加结果对素数求余，以使最终结果可以在`int8`的表示空间。

`fp32`的算子分别基于这三种`dot`，`mad`，`manual`即手动去乘实现，代码如下：

```C++
#pragma OPENCL EXTENSION cl_arm_printf : enable
__kernel void matmul_fp32_lbrbobntnp(__global float *out_buf, __global float4 *lhs_buf,
                                   __global float4 *rhs_buf, int4 hwkt) {
    int height = hwkt.x, width = hwkt.y, kslice = hwkt.z/4;
    int h = get_global_id(0);
    int w = get_global_id(1);    

    /****************** dot ******************/
    float acc = 0;
    for(int i=0; i<kslice; i+=4) {
        acc += dot(lhs_buf[h*kslice + i + 0], rhs_buf[w*kslice + i + 0]);
        // ...
    }
    out_buf[h*width + w] = acc;
    
    /****************** manual *******************/
    float acc = 0;
    for(int i=0; i<kslice; i+=4) {
        acc += lhs_buf[h*kslice + i + 0].x * rhs_buf[w*kslice + i + 0].x;
        // ...
    }
    out_buf[h*width + w] = acc;
    
    /****************** mad *******************/
    float4 acc = 0;
    for(int i=0; i<kslice; i+=4) {
        acc = mad(lhs_buf[h*kslice + i + 0], rhs_buf[w*kslice + i + 0], acc);
        // ...
    }
    out_buf[h*width + w] = acc.x + acc.y + acc.z + acc.w;
}
```

`int8`算子分别基于`arm_dot`，`mad24`，`dot`来实现

```C++
#pragma OPENCL EXTENSION cl_arm_printf : enable
#pragma OPENCL EXTENSION cl_arm_integer_dot_product_int8 : enable

__kernel void matmul_int8_lbrbobntnp(__global char * out_buf, __global char4 *lhs_buf,
                                     __global char4 *rhs_buf, int4 hwkt) {
    int height = hwkt.x, width = hwkt.y, kslice = hwkt.z/4;
    int h = get_global_id(0);
    int w = get_global_id(1);

    // arm_dot
    int acc = 0;
    for(int i=0; i<kslice; i+=4) {
        acc += arm_dot(lhs_buf[h*kslice + i + 0], rhs_buf[w*kslice + i + 0]);
		// ...
    }
    out_buf[h*width + w] = convert_char(acc % 53);

    // manual
    int acc = 0;
    for(int i=0; i<kslice; i+=4) {
        acc += lhs_buf[h*kslice + i + 0].x * rhs_buf[w*kslice + i + 0].x;
        // ...
    }
    out_buf[h*width + w] = convert_char(acc % 53);

   	// mad24
    int4 acc = 0;
    for(int i=0; i<kslice; i+=4) {
        acc = mad24(convert_int4(lhs_buf[h*kslice + i + 0]), convert_int4(rhs_buf[w*kslice + i + 0]), acc);
        // ...
    }
    out_buf[h*width + w] = convert_char((acc.x+acc.y+acc.z+acc.w) % 53);

    // dot
    float acc = 0;
    for(int i=0; i<kslice; i+=4) {
        acc += dot(convert_float4(lhs_buf[h*kslice + i + 0]), convert_float4(rhs_buf[w*kslice + i + 0]));
        // ...
    }
    out_buf[h*width + w] = convert_char((int)(acc) % 53);
}
```

这几种情况各自执行时间如下：

|  Local   | int8-arm-dot  |   int8-dot    |  int8-mad24   |  int8-manual  |   fp32-dot    |   fp32-mad    |  fp32-manual  |
| :------: | :-----------: | :-----------: | :-----------: | :-----------: | :-----------: | :-----------: | :-----------: |
| (16, 16) |   3.33692ms   |   4.99925ms   |   4.44402ms   |   5.89832ms   |   40.4384ms   |   39.105ms    |   36.1468ms   |
| (16,  8) |   6.34827ms   |   5.67264ms   |   5.37374ms   |   6.09452ms   |   38.855ms    |   39.622ms    |   38.2853ms   |
| ( 8, 16) |   2.73936ms   |   4.5971ms    |   3.76884ms   |   3.80634ms   |   33.1867ms   |   32.3976ms   |   32.0624ms   |
| ( 8,  8) |   5.89299ms   |   4.55334ms   |   4.48045ms   |   4.85107ms   |   39.3126ms   |   36.0992ms   |   38.2589ms   |
| ( 8,  4) |   8.31789ms   |   8.19391ms   |   9.1291ms    |   9.46768ms   |   43.9243ms   |   40.8333ms   |   42.6117ms   |
| ( 4,  8) |   2.85678ms   | **4.14336ms** |   3.72654ms   | **3.77767ms** |   24.6318ms   |   22.5776ms   |   24.1126ms   |
| ( 4,  4) |   4.92346ms   |   4.72122ms   |   5.03256ms   |   5.29631ms   |   25.4268ms   |   23.7878ms   |   25.225ms    |
| ( 4,  2) |   5.88595ms   |   7.21479ms   |   6.74998ms   |   7.02894ms   |   26.4515ms   |   25.3086ms   |   25.6682ms   |
| ( 2,  4) | **2.55701ms** |   4.29158ms   | **3.23214ms** |   3.90662ms   | **16.2442ms** | **17.5204ms** | **15.3311ms** |
| ( 2,  2) |   3.90658ms   |   6.04247ms   |   4.19128ms   |   5.14344ms   |   23.2523ms   |   25.6814ms   |   22.1414ms   |

从整体上来看，`int8`的场景下，`arm_dot > mad24 > manual > dot`，主要`dot`只支持浮点数的计算，用作`int8`乘法时会需要一个显式的数据类型转换。然后这几种方式无论如何都比`fp32`快接近四倍，其中`arm_dot`会比`fp32`快6倍。

备注：之前用天玑跑实验时，`arm_dot, mad24, manual, dot`间速度差异很小，在`1024*1024`和`1024*1024`情况的矩阵相乘情况下，`arm_dot`最快可以比其他实现快`25%`左右，然后天玑`G77`上跑`int8`性能其实和Mate20差不多，但是`FP32`的性能比Mate20好很多，所以天玑上这个加速比一般在4倍左右。

##### Case 2

相对于case1，case2中将右操作数的存储类型换成了`image2d`，以降低`buffer`的访存压力

|  Local   | int8-arm-dot  |   int8-dot    |  int8-mad24   |   fp32-dot    |   fp32-mad   |
| :------: | :-----------: | :-----------: | :-----------: | :-----------: | :----------: |
| (16, 16) |   4.38292ms   |   4.47575ms   |   4.03735ms   |   36.0364ms   |  27.0738ms   |
| (16,  8) |   4.40281ms   |   4.73731ms   |   4.39425ms   |   40.5976ms   |  36.1419ms   |
| ( 8, 16) |   4.40189ms   |   4.75617ms   |   4.47263ms   |   24.4539ms   |  22.9527ms   |
| ( 8,  8) |   4.15579ms   |   5.03797ms   |    4.838ms    |   30.1811ms   |  33.8364ms   |
| ( 8,  4) |   3.90697ms   |   5.30463ms   |   5.41718ms   |   41.3982ms   |  39.3226ms   |
| ( 4,  8) |   3.9354ms    |   4.76769ms   |   4.41875ms   |   13.7882ms   |  12.1441ms   |
| ( 4,  4) |   3.92877ms   |   4.5415ms    |   4.17637ms   |   20.0423ms   |  18.6582ms   |
| ( 4,  2) |   5.87929ms   |   7.38848ms   |   7.17786ms   |   24.0229ms   |  21.5546ms   |
| ( 2,  4) | **3.05147ms** | **4.09124ms** | **3.78036ms** | **8.08149ms** | **7.8989ms** |
| ( 2,  2) |   3.61766ms   |   5.93934ms   |   4.3334ms    |   15.0426ms   |  14.1541ms   |

令人意外的情况发生了，在这种情况下，`int8`算子相比全用`buffer`反而变得更慢了，而`fp32`算子速度反而提升了非常多，在最差情况下`int8`比`fp32`大约快2.6倍，不过在大多数情况下，`int8`算子比`fp32`还是要快3.5~6倍的。