Overview

1. NPU Hardware Overview

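The NPU is a processor designed specifically for AI workloads in Internet of Things (IoT) devices. It accelerates neural network computation and addresses the low efficiency of conventional chips on such workloads. GX8002 is an ultra-low-power AI chip from NationalChip (国芯微) that offers small size, low power consumption, and low cost, and it integrates a high-performance, low-power NPU. This NPU contains submodules for matrix multiply-accumulate, convolution, general-purpose computation, copy, and decompression.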

2. Toolchain and API

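The NationalChip neural network processing unit compiler (Guo-Xin Neural network Processing Unit Compiler, gxnpuc for short) is the model conversion tool for the heterogeneous computing architecture. It runs on Linux and converts network models from open-source frameworks into offline model files adapted to NationalChip AI processors.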

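Developers can use the NPU APIs provided by the LPV framework to deploy offline models for inference on the GX8002 computing platform and to build applications such as speech recognition and object detection.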

3. Usage Steps

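Step 1: On a PC, generate the open-source framework network model file in the required format.

The front-end deep learning frameworks currently supported are TensorFlow and PyTorch. For the conversion procedure for each required model format, see the model compilation examples: TensorFlow example, PyTorch example.
Step 2: Write a model conversion configuration file, then use the gxnpuc model conversion tool to compile the open-source framework network model into an offline model file supported by the NPU.
Step 3: Call the NPU APIs to load resources, transfer data, run model inference, and release resources, and build the application on top of these operations.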

4. TensorFlow Operators Supported by the GX8002 NPU

Op name	Restrictions
Abs
Add
AddV2
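AvgPool	Data format must be NHWC; pooling window H and W must each be in the range 1-15; pooling window H and W cannot both be 1; pooling window H must equal stride_h and pooling window W must equal stride_w
BatchMatMulV2	For the 2nd argument (weights), H and W are each rounded up to a multiple of 32, and the product of the rounded values must be less than 65536
BatchToSpaceND	Supported only when it is converted into a convolution with dilation > 1
BiasAdd
Concat	Dimension information must be known at compile time
ConcatV2	Dimension information must be known at compile time
Const
Conv2D	The values of the 2nd argument (weights) must be known at compile time; data format must be NHWC; when the input channel count is 1: only VALID padding, kernel H*W <= 49, H <= 11, W <= 11, stride <= 4; when the input channel count is not 1: kernel H <= 15, W <= 15, stride <= 15
DepthwiseConv2dNative	The values of the 2nd argument (weights) must be known at compile time; data format must be NHWC; only VALID padding, kernel H*W <= 49, H <= 11, W <= 11, stride <= 4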
Div
Exp
ExpandDims	The value of the 2nd argument must be known at compile time
FusedBatchNorm
FusedBatchNormV2
FusedBatchNormV3
Identity
Log
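MatMul	For the 2nd argument (weights), H and W are each rounded up to a multiple of 32, and the product of the rounded values must be less than 65536
MaxPool	Data format must be NHWC; pooling window H and W must each be in the range 1-15; pooling window H and W cannot both be 1; pooling window H must equal stride_h and pooling window W must equal stride_w
Mean	Dimension information must be known at compile time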
Mul
Neg
Pack
Pad
Placeholder
Pow	The value of the 2nd argument (exponent) must be known at compile time; the 1st argument (data) must be greater than 0
RealDiv
Reciprocal
Relu
Relu6
Reshape	The value of the 2nd argument must be known at compile time
Rsqrt
Selu
Shape
Sigmoid
Slice	Dimension information must be known at compile time
SpaceToBatchND	Supported only when it is converted into a convolution with dilation > 1
Split
Sqrt
Square
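SquaredDifference	The two input tensors must have the same shape
Squeeze
StridedSlice	Dimension information must be known at compile time
Sub
Sum	Dimension information must be known at compile time
Tanh
Transpose	The value of the 2nd argument must be known at compile time; only 2-D transposes, or operations that can be treated as a 2-D transpose, are supported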
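
Several of the restrictions in the table above are simple arithmetic checks, so they can be tested before a model is handed to the compiler. The following is a minimal sketch, not part of gxnpuc: the function names are hypothetical, and "rounded up to a multiple of 32" is this sketch's reading of the MatMul / BatchMatMulV2 rule.

```python
def round_up(value: int, multiple: int) -> int:
    """Round value up to the nearest multiple of `multiple`."""
    return -(-value // multiple) * multiple  # ceiling division


def matmul_weight_ok(h: int, w: int) -> bool:
    """MatMul / BatchMatMulV2: rounded-up H times rounded-up W must be < 65536."""
    return round_up(h, 32) * round_up(w, 32) < 65536


def pool_ok(kernel_h: int, kernel_w: int, stride_h: int, stride_w: int) -> bool:
    """AvgPool / MaxPool: window in 1..15, not 1x1, window equal to stride."""
    in_range = 1 <= kernel_h <= 15 and 1 <= kernel_w <= 15
    not_unit = (kernel_h, kernel_w) != (1, 1)
    return in_range and not_unit and kernel_h == stride_h and kernel_w == stride_w


def conv2d_ok(in_channels: int, kernel_h: int, kernel_w: int,
              stride: int, padding: str) -> bool:
    """Conv2D: the limits depend on whether the input has a single channel."""
    if in_channels == 1:
        return (padding == "VALID" and kernel_h * kernel_w <= 49
                and kernel_h <= 11 and kernel_w <= 11 and stride <= 4)
    return kernel_h <= 15 and kernel_w <= 15 and stride <= 15


if __name__ == "__main__":
    # 250 -> 256 and 200 -> 224 after rounding; 256 * 224 = 57344
    print(matmul_weight_ok(250, 200))
    # 255 -> 256 after rounding; 256 * 256 = 65536, which is not < 65536
    print(matmul_weight_ok(256, 255))
    print(pool_ok(2, 2, 2, 2))
    print(conv2d_ok(1, 3, 3, 1, "SAME"))
```

Note that a 256 x 255 weight has fewer than 65536 elements but still fails the check, because the limit applies to the rounded dimensions.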

5. PyTorch Operators Supported by the GX8002 NPU

Op type	Supported torch API	Limitations
Conv2d	1. torch.nn.Conv2d 2. torch.nn.functional.conv2d	kernel_h and kernel_w must be <= 15; stride_h and stride_w must be <= 15; dilation_h and dilation_w must be <= 15
DepthwiseConv2d	1. torch.nn.Conv2d 2. torch.nn.functional.conv2d	kernel_h and kernel_w must be <= 11; kernel_h * kernel_w must be <= 49; stride_h and stride_w must be <= 4; dilation_h and dilation_w must be 1; padding is not supported
Conv1d	1. torch.nn.Conv1d 2. torch.nn.functional.conv1d	stride must be <= 15; dilation must be <= 15; kernel size must be <= 15
DepthwiseConv1d	1. torch.nn.Conv1d 2. torch.nn.functional.conv1d	stride must be <= 4; dilation must be 1; kernel size must be <= 11; padding is not supported
MaxPool2d	1. torch.nn.MaxPool2d 2. torch.nn.functional.max_pool2d	kernel_h and kernel_w must be <= 15; kernel_h and kernel_w must not both be 1; kernel_h must equal stride_h; kernel_w must equal stride_w; input height must be divisible by kernel_h; input width must be divisible by kernel_w; dilation_h and dilation_w must be 1
AvgPool2d	1. torch.nn.AvgPool2d 2. torch.nn.functional.avg_pool2d	kernel_h and kernel_w must be <= 15; kernel_h and kernel_w must not both be 1; kernel_h must equal stride_h; kernel_w must equal stride_w; input height must be divisible by kernel_h; input width must be divisible by kernel_w; dilation_h and dilation_w must be 1
Relu	1. torch.nn.ReLU 2. torch.nn.functional.relu
Relu6	1. torch.nn.ReLU6 2. torch.nn.functional.relu6
PRelu	1. torch.nn.PReLU 2. torch.nn.functional.prelu
Selu	1. torch.nn.SELU 2. torch.nn.functional.selu
HardTanh	1. torch.nn.Hardtanh 2. torch.nn.functional.hardtanh	min_val must be 0
Sigmoid	1. torch.nn.Sigmoid 2. torch.nn.functional.sigmoid
Tanh	1. torch.nn.Tanh 2. torch.nn.functional.tanh
Flatten	1. torch.nn.Flatten 2. torch.flatten	Only reshaping the input tensor into a one-dimensional tensor is supported
Linear	1. torch.nn.Linear 2. torch.nn.functional.linear
Permute	1. torch.permute 2. Tensor.permute 3. torch.transpose 4. Tensor.transpose
BatchNorm2d	1. torch.nn.BatchNorm2d 2. torch.nn.functional.batch_norm
BatchNorm1d	1. torch.nn.BatchNorm1d 2. torch.nn.functional.batch_norm
Pad	1. torch.nn.ZeroPad2d 2. torch.nn.ConstantPad2d 3. torch.nn.ConstantPad1d	The input tensor must have at most 4 dimensions; the pad value must be 0
Reshape	1. torch.reshape 2. Tensor.reshape
Concat	1. torch.concatenate 2. torch.concat 3. torch.cat
Squeeze	1. torch.squeeze
UnSqueeze	1. torch.unsqueeze
Add	1. torch.add 2. + operator	The output tensor must have at most 5 dimensions
Mul	1. torch.mul 2. * operator 3. torch.multiply	The output tensor must have at most 5 dimensions
Sub	1. torch.sub 2. - operator 3. torch.subtract	The output tensor must have at most 5 dimensions
Div	1. torch.div 2. / operator 3. torch.divide	The output tensor must have at most 5 dimensions
Slice	1. Tensor[x0:y0, ..., xn:yn]
ReduceSum	1. torch.sum
ReduceMean	1. torch.mean
Exp	1. torch.exp
Log	1. torch.log	Returns a new tensor with the natural logarithm of the elements of the input
Sqrt	1. torch.sqrt
Square	1. torch.square
Reciprocal	1. torch.reciprocal
Neg	1. torch.neg 2. torch.negative
Rsqrt	1. torch.rsqrt
Abs	1. torch.abs 2. torch.absolute
Pow	1. torch.pow
UpSample	1. torch.nn.functional.upsample 2. torch.nn.functional.upsample_nearest	Only the scale_factor parameter is supported; scale_h and scale_w must be equal; the input tensor must have 4 dimensions
Split	1. torch.split
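
The convolution and pooling limits in the table above can be collected into small validation helpers that report every violated rule at once. This is a hedged sketch in plain Python (no torch import needed); the helper names and the tuple-based parameters are hypothetical and are not part of gxnpuc or the LPV API.

```python
from typing import List, Tuple

Pair = Tuple[int, int]  # (height, width)


def check_conv2d(kernel: Pair, stride: Pair, dilation: Pair = (1, 1)) -> List[str]:
    """Return the Conv2d limits from the table that the arguments violate."""
    problems = []
    if max(kernel) > 15:
        problems.append("kernel_h and kernel_w must be <= 15")
    if max(stride) > 15:
        problems.append("stride_h and stride_w must be <= 15")
    if max(dilation) > 15:
        problems.append("dilation_h and dilation_w must be <= 15")
    return problems


def check_depthwise_conv2d(kernel: Pair, stride: Pair,
                           dilation: Pair = (1, 1),
                           padding: Pair = (0, 0)) -> List[str]:
    """Return the DepthwiseConv2d limits that the arguments violate."""
    problems = []
    if max(kernel) > 11:
        problems.append("kernel_h and kernel_w must be <= 11")
    if kernel[0] * kernel[1] > 49:
        problems.append("kernel_h * kernel_w must be <= 49")
    if max(stride) > 4:
        problems.append("stride_h and stride_w must be <= 4")
    if dilation != (1, 1):
        problems.append("dilation_h and dilation_w must be 1")
    if padding != (0, 0):
        problems.append("padding is not supported")
    return problems


def check_pool2d(kernel: Pair, stride: Pair, input_hw: Pair,
                 dilation: Pair = (1, 1)) -> List[str]:
    """Return the MaxPool2d / AvgPool2d limits that the arguments violate."""
    problems = []
    if max(kernel) > 15:
        problems.append("kernel_h and kernel_w must be <= 15")
    if kernel == (1, 1):
        problems.append("kernel_h and kernel_w must not both be 1")
    if kernel[0] != stride[0]:
        problems.append("kernel_h must equal stride_h")
    if kernel[1] != stride[1]:
        problems.append("kernel_w must equal stride_w")
    if input_hw[0] % kernel[0] != 0:
        problems.append("input height must be divisible by kernel_h")
    if input_hw[1] % kernel[1] != 0:
        problems.append("input width must be divisible by kernel_w")
    if dilation != (1, 1):
        problems.append("dilation_h and dilation_w must be 1")
    return problems


if __name__ == "__main__":
    print(check_conv2d((3, 3), (1, 1)))
    print(check_depthwise_conv2d((5, 11), (1, 1), padding=(1, 1)))
    print(check_pool2d((2, 2), (2, 2), input_hw=(9, 8)))
```

Returning a list of messages rather than a single boolean makes it easier to see which layer parameter has to change when a model is rejected.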

6. Operators Supported by the GX8010/GX8009/GX8008 NPU

Op name	Restrictions
Abs
Add
AddN
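All	The value of the 2nd argument must be known at compile time
Any	The value of the 2nd argument must be known at compile time
Assert
AvgPool	More efficient when the data format is NCHW; stride <= 63
BatchMatMul	The values of the 2nd argument (weights) must be known at compile time
BatchToSpaceND	The values of the 2nd and 3rd arguments must be known at compile time
BiasAdd
Cast	Supported only when it can be evaluated at compile time
Concat	Dimension information must be known at compile time
ConcatV2	Dimension information must be known at compile time
Const
Conv2D	The values of the 2nd argument (weights) must be known at compile time; more efficient when the data format is NCHW; kernel H <= 11, W <= 11, with higher efficiency when H and W are equal and odd; stride <= 63
Conv2DBackpropInput	The values of the 1st and 2nd arguments must be known at compile time; data format must be NCHW; kernel H <= 11, W <= 11, with higher efficiency when H and W are equal and odd; stride <= 63
DepthwiseConv2dNative	The values of the 2nd argument (weights) must be known at compile time; data format must be NCHW; kernel H <= 11, W <= 11, with higher efficiency when H and W are equal and odd; stride <= 63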
Div
Enter
Equal
Exit
Exp
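ExpandDims	The value of the 2nd argument must be known at compile time
Fill	Supported only when it can be evaluated at compile time
FloorDiv
FloorMod
Gather	The value of the 2nd argument must be known at compile time
GatherV2	The values of the 2nd and 3rd arguments must be known at compile time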
GreaterEqual
Identity
Less
LessEqual
ListDiff	Supported only when it can be evaluated at compile time
Log
LogSoftmax	Supported only when it can be evaluated at compile time
LogicalAnd
LogicalNot
LogicalOr
LogicalXor
LoopCond
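MatMul	The values of the 2nd argument (weights) must be known at compile time
Max	The value of the 2nd argument must be known at compile time
MaxPool	More efficient when the data format is NCHW; stride <= 63
Maximum
Mean	The value of the 2nd argument must be known at compile time
Merge
Min	The value of the 2nd argument must be known at compile time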
Minimum
Mul
Neg
NextIteration
Pack
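Pad	The value of the 2nd argument must be known at compile time
Placeholder
Pow	The value of the 2nd argument must be known at compile time
Print
Prod	Supported only when it can be evaluated at compile time
Range	Supported only when it can be evaluated at compile time
Rank	Supported only when it can be evaluated at compile time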
RealDiv
Reciprocal
Relu
Relu6
Reshape	The value of the 2nd argument must be known at compile time
ReverseV2
Rsqrt
Select
Selu
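Shape	Supported only when it can be evaluated at compile time
Sigmoid
Slice	The values of the 2nd and 3rd arguments must be known at compile time
Softmax
SpaceToBatchND	The values of the 2nd and 3rd arguments must be known at compile time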
Split
Sqrt
Square
SquaredDifference
Squeeze
StopGradient
StridedSlice	Dimension information must be known at compile time
Sub
Sum	Dimension information must be known at compile time
Switch
Tanh
TensorArrayGatherV3
TensorArrayReadV3
TensorArrayScatterV3
TensorArraySizeV3
TensorArrayV3
TensorArrayWriteV3
Tile	The value of the 2nd argument must be known at compile time
Transpose	The value of the 2nd argument must be known at compile time
Unpack
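
Because the GX8002 and GX8010/GX8009/GX8008 lists differ, it can help to compare the op types in a graph against the list for the target chip before compiling. The sketch below is illustrative only: the two sets are copied from the op-name columns of the tables in this document, while the `unsupported_ops` helper and the sample op lists are hypothetical. It checks op names only, not the per-op restrictions.

```python
# Op names copied from the GX8002 TensorFlow table.
GX8002_TF_OPS = frozenset("""
Abs Add AddV2 AvgPool BatchMatMulV2 BatchToSpaceND BiasAdd Concat ConcatV2
Const Conv2D DepthwiseConv2dNative Div Exp ExpandDims FusedBatchNorm
FusedBatchNormV2 FusedBatchNormV3 Identity Log MatMul MaxPool Mean Mul Neg
Pack Pad Placeholder Pow RealDiv Reciprocal Relu Relu6 Reshape Rsqrt Selu
Shape Sigmoid Slice SpaceToBatchND Split Sqrt Square SquaredDifference
Squeeze StridedSlice Sub Sum Tanh Transpose
""".split())

# Op names copied from the GX8010/GX8009/GX8008 table.
GX8010_OPS = frozenset("""
Abs Add AddN All Any Assert AvgPool BatchMatMul BatchToSpaceND BiasAdd Cast
Concat ConcatV2 Const Conv2D Conv2DBackpropInput DepthwiseConv2dNative Div
Enter Equal Exit Exp ExpandDims Fill FloorDiv FloorMod Gather GatherV2
GreaterEqual Identity Less LessEqual ListDiff Log LogSoftmax LogicalAnd
LogicalNot LogicalOr LogicalXor LoopCond MatMul Max MaxPool Maximum Mean
Merge Min Minimum Mul Neg NextIteration Pack Pad Placeholder Pow Print Prod
Range Rank RealDiv Reciprocal Relu Relu6 Reshape ReverseV2 Rsqrt Select Selu
Shape Sigmoid Slice Softmax SpaceToBatchND Split Sqrt Square
SquaredDifference Squeeze StopGradient StridedSlice Sub Sum Switch Tanh
TensorArrayGatherV3 TensorArrayReadV3 TensorArrayScatterV3 TensorArraySizeV3
TensorArrayV3 TensorArrayWriteV3 Tile Transpose Unpack
""".split())


def unsupported_ops(model_ops, supported):
    """Return the sorted op types used by the model that are not in `supported`."""
    return sorted(set(model_ops) - set(supported))


if __name__ == "__main__":
    # Hypothetical op types collected from a small classifier graph.
    model_ops = ["Placeholder", "Conv2D", "BiasAdd", "Relu", "MatMul", "Softmax"]
    print(unsupported_ops(model_ops, GX8002_TF_OPS))  # ['Softmax']
    print(unsupported_ops(model_ops, GX8010_OPS))     # []
```

In this example Softmax appears only in the GX8010/GX8009/GX8008 list, so a graph that targets GX8002 would need the softmax computed outside the NPU or removed from the exported graph.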