NPU Compiler Usage*
gxnpuc can convert open-source framework network models into offline model files that are compatible with Guoxin NPU processors.
Before using the NPU compiler gxnpuc, please carefully read the following two technical documents:
1. Overview of gxnpuc Toolchain Functions and Corresponding Parameters*
1.1 General Function Parameters*
--help
- Parameter Description: Print the parameter information for the gxnpuc toolchain.
- Usage Example:
$ gxnpuc --help
usage: gxnpuc [-h] [--cmpt] [--list] [-c {LEO,APUS,GRUS,V100,V120,V150}]
              [-f {TF,PT}] [-V] [-v] [-m] [-w] [-s] [-q]
              [config_filename]

NPU Compiler

positional arguments:
  config_filename       config file

optional arguments:
  -h, --help            show this help message and exit
  --cmpt                get version compatibility information between npu-core, python, and frameworks
  --list                list supported ops
  -c {LEO,APUS,GRUS,V100,V120,V150}, --core_name {LEO,APUS,GRUS,V100,V120,V150}
                        subparameter of --list, specify NPU Core for listing supported ops
  -f {TF,PT}, --framework {TF,PT}
                        subparameter of --list, specify Deep Learning Framework for listing supported ops
  -V, --version         show program's version number and exit
  -v, --verbose         verbosely list the processed ops
  -m, --meminfo         verbosely list memory info of ops
  -w, --weights         print compressed weights (GRUS only)
  -s, --save_hist       save histograms of weights value to 'npu_jpgs' directory (GRUS only)
  -q, --quant           inference and generate quant file
--version
- Parameter Description: Display the current compiler version information.
- Usage Example:
gxnpuc --version
--list
- Parameter Description: List the operators supported by the current NPU compiler.
- Associated Sub-parameters (only effective together with --list, and not mandatory):
  -c  Specify the chip version. Allowed values: {LEO, APUS, GRUS, V100, V120, V150}
  -f  Specify the front-end deep learning framework. Allowed values: {TF, PT}
- Usage Examples:
List all the operators supported by the current NPU compiler:
gxnpuc --list
List all the operators supported by the current NPU GRUS compiler:
gxnpuc --list -c GRUS
List all PyTorch operators supported by the current NPU GRUS compiler:
gxnpuc --list -c GRUS -f PT
--cmpt
- Parameter Description: Print the compatibility information between the compiler's supported Python versions, chip models, and front-end DL framework versions.
- Usage Example:
gxnpuc --cmpt
1.2 Model Compilation Parameters*
config_filename
- Parameter Description: Specify the path and filename of the compilation configuration file; model conversion and optimization read this configuration file.
- Associated Sub-parameters (only effective when config_filename is correctly configured):
  -v  Print the NPU model structure information.
  -m  Print the memory status of each operator node in the NPU model.
  -w  Print the compressed weights in the NPU model (GRUS only).
  -s  Save the histograms of the model weights to the 'npu_jpgs' folder (GRUS only).
- Usage Example:
Use the NPU GRUS compiler with config.yaml as the conversion configuration file and enable all sub-parameter functions:
gxnpuc config.yaml -v -m -w -s
Notes on gxnpuc Function Parameter Usage
The four groups of parameters --version/-V, --list, --cmpt, config_filename are mutually exclusive and cannot be used simultaneously.
2. Model Compilation Configuration File Explanation*
2.1 TensorFlow Configuration Items*
Configuration Item | Parameter Values | Parameter Description |
---|---|---|
CORENAME | GRUS | Chip model |
NPU_UNIT | NPU32 | Specify the NPU model |
FRAMEWORK | TF | Specify the front-end DL framework type for the model to be converted |
MODEL_FILE | Model file name and path, e.g., ./model.pb | Specify the file name and path of the model to be converted |
OUTPUT_TYPE | c_code | Specify the format of the NPU file output by the compiler |
OUTPUT_FILE | NPU file name, e.g., npu.h | Specify the name of the NPU file output by the compiler |
INPUT_OPS | op_name: shape | Specify information about all input nodes in the NPU model |
OUTPUT_OPS | [output_name, ...] | Specify information about all output nodes in the NPU model |
FP16_OUT_OPS | [out_state_name, ...] | Specify the output nodes in the NPU model output in Float16 format (FP16) |
FUSE_BN | true / false (default: false) | Enable or disable batch normalization (BN) parameter fusion |
COMPRESS | true / false (default: false) | Enable or disable fully connected layer weight quantization compression |
CONV2D_COMPRESS | true / false (default: false) | Enable or disable convolution weight quantization compression |
EXCLUDE_COMPRESS_OPS | [weight_op_name, ...] | Specify weight nodes that can be excluded from quantization compression |
WEIGHT_MIN_MAX | weight_op_name: [min, max] | Specify the minimum and maximum values for weight node quantization compression |
WEIGHT_CACHE_SIZE | Specific allocated memory value, e.g., 10240 | Specify the size of memory allocated in SRAM to store weights |
Notes
- The original model file to be converted must be in FrozenPB format.
INPUT_OPS
- Parameter Format: op_name: shape
op_name —— the name of an input node in the model
shape —— the shape of the input named op_name during inference
- Example:
In the TensorFlow framework, placeholders are usually defined as inputs to the computation graph when building a model, and users need to assign specific identifiers to the placeholders as input names.
In this example code, the model defines four placeholders (model inputs) and assigns "Feats", "State_c0", "State_c1", "State_c2" as their input names.
Here state0_in, state1_in, state2_in are the output state values from the previous frame (all-zero tensors for the initial frame).

inputs = tf.placeholder(tf.float32, [1, 1, 64], name="Feats")
state0_in = tf.placeholder(tf.float32, [1, 3, 64], name="State_c0")
state1_in = tf.placeholder(tf.float32, [1, 4, 64], name="State_c1")
state2_in = tf.placeholder(tf.float32, [1, 5, 64], name="State_c2")
Therefore, the INPUT_OPS parameter in the model configuration file can be configured as follows:
config.yaml
INPUT_OPS:
  Feats: [1, 1, 64]
  State_c0: [1, 3, 64]
  State_c1: [1, 4, 64]
  State_c2: [1, 5, 64]
OUTPUT_OPS
- Parameter Format: [output_name, ...]
output_name is the name of an output node in the model
- Example:
In the TensorFlow framework, for ease of NPU compilation configuration, users can use the tf.identity interface to copy and rename output tensors.
In this example code, the identifiers "Result", "State_c0_out", "State_c1_out", "State_c2_out" are assigned to the four output tensors.
Here state0_out, state1_out, state2_out are the output state values for the current frame.

outputs, states = fsmn_layer(...)
result_out = tf.identity(outputs, name="Result")
state0_out = tf.identity(states[0], name="State_c0_out")
state1_out = tf.identity(states[1], name="State_c1_out")
state2_out = tf.identity(states[2], name="State_c2_out")
Therefore, the OUTPUT_OPS parameter in the model configuration file can be configured as follows:
config.yaml
OUTPUT_OPS: [State_c0_out, State_c1_out, State_c2_out, Result]
Notes
- The output state nodes must be placed before the predicted output nodes.
FP16_OUT_OPS
- Related Overview:
The NPU performs internal computations in FP16 format, and both input and output tensors are in FP16 format.
The NPU supports FP16_TO_FP32 format conversion but does not support FP32_TO_FP16 format conversion.
In the processing flow of recurrent neural networks, at each time step the network receives the current input and the hidden state of the previous time step, and from them generates the hidden state and the corresponding prediction result for the current time step.
In practical applications, the output tensor corresponding to the prediction result is first converted to FP32 format and then used for prediction, while the output tensor corresponding to the hidden state is fed directly back as the next frame's input in FP16 format.
Users therefore need to divide the output nodes into FP32 and FP16 ones according to the specific model structure, and can use the FP16_OUT_OPS parameter to specify which output nodes are in FP16 format.
- Parameter Format: [output_state_name, ...]
output_state_name is the name of an output node corresponding to a hidden state of the model
- Example:
Continuing from the OUTPUT_OPS example, the FP16_OUT_OPS parameter in the model configuration file can be configured as follows:
config.yaml
FP16_OUT_OPS: [State_c0_out, State_c1_out, State_c2_out]
EXCLUDE_COMPRESS_OPS
- Related Overview:
After enabling the NPU compiler's weight quantization feature, if the model's inference performance is poor, users can use the weight distribution histograms to analyze whether the statistics of the data distribution range of each weight node are reasonable. This parameter allows users to exclude specified weight nodes from quantization compression.
- Parameter Format: [weight_op_name, ...]
weight_op_name is the name of a weight node for which quantization compression should be disabled.
- Usage Example:
In this example, assume that quantizing some convolution weights in the model leads to poor overall performance, so quantization is not applied to those weights.
The EXCLUDE_COMPRESS_OPS parameter in the configuration file can be configured as follows:
config.yaml
EXCLUDE_COMPRESS_OPS: [conv2d_5/Conv2D/ReadVariableOp/_74__cf__74, conv2d_6/Conv2D/ReadVariableOp/_75__cf__75]
WEIGHT_MIN_MAX
- Related Overview:
The NPU compiler uses post-training quantization (PTQ) and adopts the MinMax method to quantize weights.
After enabling the NPU compiler's weight quantization feature, if the model's inference performance is poor, users can use the weight distribution histograms to analyze whether the statistics of the data distribution range of each weight node are reasonable. This parameter allows users to directly configure the data distribution range for specified weight nodes.
- Parameter Format: weight_op_name: [min, max]
weight_op_name —— name of a specific weight node in the model
min, max —— the minimum and maximum values to use for the tensor named weight_op_name
- Usage Example:
In this example, the quantization ranges for two convolution weight nodes are specified. The WEIGHT_MIN_MAX parameter in the configuration file can be configured as follows:
config.yaml
WEIGHT_MIN_MAX:
  conv2d_5/Conv2D/ReadVariableOp/_74__cf__74: [min0, max0]
  conv2d_6/Conv2D/ReadVariableOp/_75__cf__75: [min1, max1]
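As one illustrative way (not from the original document) to pick [min, max] values, the value range of a weight node can be read directly from the FrozenPB file; the file path and node name below are assumptions taken from the examples above.

import numpy as np
import tensorflow as tf

# Read a constant weight node from the frozen graph and print its value range
# (illustrative; file and node names are assumptions based on the examples above).
graph_def = tf.GraphDef()
with tf.gfile.GFile("./model.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

for node in graph_def.node:
    if node.name == "conv2d_5/Conv2D/ReadVariableOp/_74__cf__74":
        weights = tf.make_ndarray(node.attr["value"].tensor)
        print(node.name, float(weights.min()), float(weights.max()),
              np.percentile(weights, [1, 99]))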
For specific usage scenarios and methods of using EXCLUDE_COMPRESS_OPS and WEIGHT_MIN_MAX parameters, please refer to NPU Quantization Accuracy Debugging.
2.2 PyTorch Configuration*
Notes
- To compile and convert a PyTorch model, the NPU compiler version must be 1.6.0b0 or later, and Python 3.7 must be used.
- The original model to be converted must be in the jit.ScriptModule format.
Configuration Item | Parameter Values | Parameter Description |
---|---|---|
CORENAME | GRUS | Chip model |
NPU_UNIT | NPU32 | Specify NPU model |
FRAMEWORK | PT | Specify the front-end DL framework type for the model to be converted |
MODEL_FILE | Model file name and path, e.g., ./model.pt | Specify the file name and path of the model to be converted |
OUTPUT_TYPE | c_code | Specify the format of the NPU file output by the compiler |
OUTPUT_FILE | NPU file name, e.g., npu.h | Specify the name of the NPU file output by the compiler |
INPUT_OPS | input_index: shape | Specify information about all input nodes of the NPU model |
INPUT_NCX_TO_NXC | [input_index, ...] | Specify the NPU model input tensors whose data layout format should be converted |
FP16_OUT_OPS | [out_state_index, ...] | Specify the output nodes in the NPU model output in Float16 format (FP16) |
FUSE_BN | true / false (default false) | Enable BN parameter fusion functionality |
COMPRESS | true / false (default false) | Enable quantization compression functionality for fully connected layers |
CONV2D_COMPRESS | true / false (default false) | Enable quantization compression functionality for convolutional layers |
EXCLUDE_COMPRESS_OPS | [weight_op_name, ...] | Specify weight nodes that can be excluded from quantization compression |
WEIGHT_MIN_MAX | weight_op_name: [min, max] | Specify the minimum and maximum values when quantizing compressed weight nodes |
WEIGHT_CACHE_SIZE | Allocated memory value (e.g., 10240) | Specify the size of the memory allocated in SRAM to store weights |
INPUT_OPS
- Parameter Format: input_index: shape
input_index —— index of the input node
shape —— the inference shape of the corresponding input tensor
- Example:
Because the PyTorch framework uses dynamic computation graphs, operator node names are generated automatically, so identifiers cannot be used to configure the inputs.
When converting PyTorch models with the NPU compiler, the INPUT_OPS parameter in the configuration file therefore uses index values to map the input tensors.
In this example, a custom PyTorch model is built by inheriting from the nn.Module base class, and the configuration of the INPUT_OPS parameter is explained:

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3)
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)
        self.conv2 = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3)
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)
        self.adaptive_pool = nn.AdaptiveMaxPool2d((1, 1))
        self.flatten = nn.Flatten()
        self.linear1 = nn.Linear(64, 32)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(32, 1)

    def forward(self, x, y):
        x = self.conv1(x)
        x = self.pool1(x)
        y = self.conv2(y)
        y = self.pool2(y)
        z = torch.concat([x, y], dim=1)
        z = self.adaptive_pool(z)
        z = self.flatten(z)
        z = self.linear1(z)
        z = self.relu(z)
        y = self.linear2(z)
        return y

net = Net()
input0_tensor = torch.randn([1, 3, 32, 32])
input1_tensor = torch.randn([1, 1, 32, 32])
output_tensor = net(input0_tensor, input1_tensor)
From the forward method, it can be seen that the model defines two inputs x and y, where input x has index 0 and input y has index 1.
Therefore, the INPUT_OPS parameter in the configuration file for this model can be configured as follows:
config.yaml
INPUT_OPS:
  0: [1, 3, 32, 32]
  1: [1, 1, 32, 32]
- Constraints:
When converting PyTorch models, the inputs must be tensors; they cannot be lists or tuples of tensors.
The NPU compiler does not support these input formats, so users need to split tensor lists and tensor tuples into individual tensor inputs; one way to do this is sketched after the example below.
In this example model, the input uses a tensor list / tensor tuple (unlike the model in the usage example above):

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3)
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)
        self.conv2 = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3)
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)
        self.adaptive_pool = nn.AdaptiveMaxPool2d((1, 1))
        self.flatten = nn.Flatten()
        self.linear1 = nn.Linear(64, 32)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(32, 1)

    def forward(self, xy):
        x = self.conv1(xy[0])
        x = self.pool1(x)
        y = self.conv2(xy[1])
        y = self.pool2(y)
        z = torch.concat([x, y], dim=1)
        z = self.adaptive_pool(z)
        z = self.flatten(z)
        z = self.linear1(z)
        z = self.relu(z)
        y = self.linear2(z)
        return y

net = Net()
input0_tensor = torch.randn([1, 3, 32, 32])
input1_tensor = torch.randn([1, 1, 32, 32])
list_tensor = [input0_tensor, input1_tensor]
tuple_tensor = (input0_tensor, input1_tensor)
# Input is a tensor list
output_tensor = net(list_tensor)
# Input is a tensor tuple
output_tensor = net(tuple_tensor)
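The following is a minimal sketch, not from the original document, of one way to satisfy the constraint: a hypothetical NetWrapper re-exposes the list-input model through separate tensor arguments, so the traced ScriptModule takes plain tensors as inputs.

import torch
import torch.nn as nn

class NetWrapper(nn.Module):
    # Hypothetical wrapper: takes two plain tensors and rebuilds the list only
    # inside forward, so the exported module's inputs are tensors.
    def __init__(self, net):
        super(NetWrapper, self).__init__()
        self.net = net

    def forward(self, x, y):
        return self.net([x, y])

wrapped = NetWrapper(net)
example_inputs = (torch.randn([1, 3, 32, 32]), torch.randn([1, 1, 32, 32]))
script_module = torch.jit.trace(wrapped, example_inputs)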
INPUT_NCX_TO_NXC
- Related Overview:
In deep learning, multi-dimensional tensors are typically used for data transfer between model operator nodes; for example, the feature maps of convolutional neural networks are usually stored in four-dimensional tensors.
The dimensions of a four-dimensional tensor can be denoted N (batch), H (height), W (width), and C (channels). Because data is stored linearly in memory, different dimension orders produce different memory layouts: the PyTorch framework uses the NCHW order, while the TensorFlow framework uses the NHWC order. These orders are referred to as the data format.
The NPU's compute-intensive operators store data in the NHWC and NLC layouts. If the input tensor of a compute-intensive operator is in the NCHW or NCL layout, the compiler inserts a transpose node before the operator to convert the data format.
- Parameter Description:
When the data format of an original model input tensor is NCHW or NCL, this parameter determines whether, during NPU model inference, that input tensor keeps the original PyTorch data layout or is supplied in the already-converted layout (NCHW -> NHWC or NCL -> NLC).
- Parameter Format: [input_index, ...]
input_index —— index of an input node that needs data layout format conversion
Notes
When enabling data layout format conversion for a specific input tensor, it is necessary to adjust the shape parameter corresponding to the input in the INPUT_OPS parameter. See the usage example for details.
- Usage Example:

import torch
import torch.nn as nn

class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.gru = torch.nn.GRU(128, 128, batch_first=True, bias=True)
        self.conv = torch.nn.Conv1d(128, 128, 1, 1)

    def forward(self, x, h):
        x = self.conv(x)
        x = x.permute(0, 2, 1)
        z = self.gru(x, h)
        return z

model = Model()
batch = 1
seq_length = 32
channel = 128
input_tensor = torch.randn([batch, channel, seq_length])
input_state = torch.randn([1, batch, channel])
output_tensor, output_state = model(input_tensor, input_state)
In the above model, during PyTorch inference the input_state data is not in the NCL layout, so no data layout format conversion is needed for it.
The input_tensor data is in the NCL layout, so the user can choose whether to enable data layout format conversion before feeding it to the NPU model.
The parameters for the enabled and disabled cases are described below:
- Enabled parameter: input_tensor undergoes format conversion
The original model input tensor input_tensor has a shape of [1, 128, 32]. After enabling the format conversion function, the shape actually required by the NPU model is [1, 32, 128].
The model configuration file's INPUT_NCX_TO_NXC and INPUT_OPS parameters can be configured as follows:
config.yaml
INPUT_NCX_TO_NXC: [0]
INPUT_OPS:
  0: [1, 32, 128]
  1: [1, 1, 128]
The model structure diagram is as follows:
- Disabled parameter: input_tensor maintains the original data layout format
The model configuration file's INPUT_NCX_TO_NXC and INPUT_OPS parameters can be configured as follows:
config.yaml
INPUT_NCX_TO_NXC: []
INPUT_OPS:
  0: [1, 128, 32]
  1: [1, 1, 128]
The model structure diagram is as follows:
By comparing the two NPU model structure diagrams above, it can be seen that enabling the format conversion function optimizes away the transpose nodes that would otherwise be inserted on the input side.
This parameter provides more flexible configuration of the data format of input tensors based on user requirements.
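For illustration only (this sketch is not from the original document): when conversion is enabled for input 0 as above, the host-side code that prepares the NPU input would permute the tensor from NCL to NLC before inference; the tensor names follow the usage example above.

import torch

input_tensor = torch.randn([1, 128, 32])                   # original NCL layout from the PyTorch model
npu_input0 = input_tensor.permute(0, 2, 1).contiguous()    # NLC layout, shape [1, 32, 128], as required when conversion is enabled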
FP16_OUT_OPS
- Related Overview:
The NPU performs internal computations in FP16 format, and both input and output tensors are in FP16 format.
The NPU supports FP16_TO_FP32 format conversion but does not support FP32_TO_FP16 format conversion.
In the processing flow of recurrent neural networks, at each time step the network receives the current input and the hidden state from the previous time step, and from them generates the hidden state and the corresponding prediction result for the current time step.
In practical applications, the output tensor corresponding to the prediction result is first converted to FP32 format and then used for prediction, while the output tensor corresponding to the hidden state is used directly as the next frame's input in FP16 format.
Users need to divide the output nodes of the model into FP32 and FP16 ones according to the specific model structure, and can specify which output nodes are in FP16 format using the FP16_OUT_OPS parameter.
- Parameter Values: [out_state_index, ...]
out_state_index —— index of an output node corresponding to a hidden state of the model
- Usage Example:
PyTorch model with multiple outputs:

import torch
import torch.nn as nn

class TestModel(torch.nn.Module):
    def __init__(self):
        super(TestModel, self).__init__()
        self.gru = nn.GRU(32, 32, batch_first=True)
        self.conv = nn.Conv1d(32, 32, 1, 1)

    def forward(self, x, h):
        t = torch.split(x, [1, 2, 3, 4], dim=1)
        y = self.gru(x, h)
        z = torch.sigmoid(y[0])
        z = z.permute(0, 2, 1)
        o = self.conv(z)
        return t, y, o

net = TestModel()
input_tensor = torch.randn([1, 10, 32])
state_tensor = torch.randn([1, 1, 32])
output_t, output_y, output_o = net(input_tensor, state_tensor)
From the forward method, it can be seen that the model defines three outputs t, y, and o, where outputs t and y are tensor lists and output o is a tensor.
The NPU compiler expands tensor-list outputs into individual tensor outputs. For the above model, the NPU compiler therefore sees 7 outputs: t[0], t[1], t[2], t[3], y[0], y[1], o.
The model output y[1] is the output node corresponding to the hidden state of the GRU module; it must be fed back to the GRU module as input for the next frame of inference, so it should not undergo FP16 -> FP32 conversion.
The FP16_OUT_OPS parameter in the model configuration file can be configured as follows:
config.yaml
FP16_OUT_OPS: [5]
3. Compile Model*
3.1 Model File Preparation*
The NPU compiler strictly enforces model file formats; models from different frameworks must be exported in the specific ways described below.
TensorFlow
- Prepare CKPT and PB files generated by TensorFlow or model files generated in saved_model format.
- Use the freeze_graph.py script provided by TensorFlow to generate a FROZEN_PB file, as sketched below.
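A minimal TF1-style freezing sketch, not from the original document; the checkpoint file names and output node names are assumptions based on the earlier OUTPUT_OPS example, and the freeze_graph.py script referenced above achieves the same result from the command line.

import tensorflow as tf

# Freeze a TF1 checkpoint into a FrozenPB file (illustrative; file and node
# names are assumptions).
with tf.Session() as sess:
    saver = tf.train.import_meta_graph("model.ckpt.meta")
    saver.restore(sess, "model.ckpt")
    frozen_graph = tf.graph_util.convert_variables_to_constants(
        sess, sess.graph_def,
        ["Result", "State_c0_out", "State_c1_out", "State_c2_out"])
    with tf.gfile.GFile("model_frozen.pb", "wb") as f:
        f.write(frozen_graph.SerializeToString())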
PyTorch
- After training the model, export the model weight file.
- Build a PyTorch inference model script and generate a PyTorch Module instance.
- Convert the custom PyTorch Module to a Torch ScriptModule and serialize the ScriptModule instance for the compiler, as sketched below.
For specific steps, refer to the PyTorch Model Conversion Example.
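A minimal export sketch, not from the original document; the Net class from Section 2.2 and the file names are assumptions.

import torch

# Export a traced ScriptModule for the compiler (illustrative; Net and the
# file names are assumptions).
net = Net()
net.load_state_dict(torch.load("net_weights.pth", map_location="cpu"))
net.eval()

example_inputs = (torch.randn([1, 3, 32, 32]), torch.randn([1, 1, 32, 32]))
script_module = torch.jit.trace(net, example_inputs)
torch.jit.save(script_module, "model.pt")   # path referenced by MODEL_FILE in config.yaml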
3.2 Write Configuration File*
- Write a YAML configuration file, including the model file name, output file name, output file type, compression status, input node names and dimensions, output node names, etc.
3.3 Compile and Generate Model File*
Compile the model using the following command:
$ gxnpuc config.yaml
Note
When the NPU toolchain compiles model files from different deep learning frameworks, the corresponding framework's runtime environment must be installed. Refer to the NPU Model Format Specification for information on the generated model file format.
4. Explanation of Some Ops*
4.1 Softmax*
NPU cannot directly support Softmax, but under certain conditions, you can modify the model to make NPU support Softmax computation.
The conditions are:
- The input tensor of softmax must be 2-dimensional and the batch size must equal 1.
The Softmax function in the model needs to be replaced with the following function:
TensorFlow
import tensorflow as tf

def factorize(n):
    for i in range(1, 16):
        if n % i == 0 and n // i <= 15:
            return (n // i, i)
    return ()

def split_and_factorize(n):
    result = []
    while not factorize(n):
        for i in range(n - 1, 0, -1):
            if factorize(i):
                result.append(i)
                n -= i
                break
    result.append(n)
    return result

def npu_softmax(x, name=None):
    """NPU Softmax
    Args:
        x: A non-empty `Tensor`.
        name: A name for the operation (optional).
    Returns:
        A `Tensor`.
    """
    # x' = x - max(x)
    # y = exp(x') / sum(exp(x'))
    assert len(x.shape) == 2 and x.shape[0] == 1
    partitions = split_and_factorize(x.shape[1])
    if len(partitions) == 1:
        a, b = factorize(partitions[0])
        pool_shape = [1, a, b, 1]
        x = tf.reshape(x, pool_shape)
        max_ = tf.nn.max_pool(x, ksize=pool_shape, strides=pool_shape, padding='VALID')
    else:
        cnt = 0
        tmp_max_list = []
        for p in partitions:
            a, b = factorize(p)
            pool_shape = [1, a, b, 1]
            tmp_x = tf.reshape(x[:, cnt:cnt + p], pool_shape)
            tmp_max = tf.nn.max_pool(tmp_x, ksize=pool_shape, strides=pool_shape, padding='VALID')
            tmp_max_list.append(tmp_max)
            cnt += p
        tmp_max_len = len(tmp_max_list)
        assert tmp_max_len <= 15
        max_ = tf.concat(tmp_max_list, axis=2)
        max_ = tf.nn.max_pool(max_, ksize=[1, 1, tmp_max_len, 1], strides=[1, 1, tmp_max_len, 1], padding='VALID')
    x = tf.reshape(x, [-1, 1])
    x = tf.math.subtract(x, max_)
    exp_x = tf.exp(x)
    exp_x = tf.reshape(exp_x, [1, -1])
    return tf.math.divide(exp_x, tf.math.reduce_sum(exp_x, axis=-1), name=name)
Important
During training, use TensorFlow's softmax, and before exporting CKPT, replace it with npu_softmax.
PyTorch
import torch

def factorize(n):
    for i in range(1, 16):
        if n % i == 0 and n // i <= 15:
            return (n // i, i)
    return ()

def split_and_factorize(n):
    result = []
    while not factorize(n):
        for i in range(n - 1, 0, -1):
            if factorize(i):
                result.append(i)
                n -= i
                break
    result.append(n)
    return result

def npu_softmax(x, partitions):
    """NPU Softmax
    Args:
        x: A non-empty `Tensor`. Limitation: len(x.shape) == 2 and x.shape[0] == 1
        partitions: A list of integers specifying the partition sizes for the softmax operation.
    Returns:
        A `Tensor` representing the real softmax output.
    """
    if len(partitions) == 1:
        a, b = factorize(partitions[0])
        pool_shape = [1, 1, a, b]
        x_ = x.reshape(pool_shape)
        max_ = torch.nn.functional.max_pool2d(x_, (a, b), (a, b))
    else:
        cnt = 0
        tmp_max_list = []
        for p in partitions:
            a, b = factorize(p)
            pool_shape = [1, 1, a, b]
            x_ = x[:, cnt:cnt + p]
            tmp_x = x_.reshape(pool_shape)
            tmp_max = torch.nn.functional.max_pool2d(tmp_x, (a, b), (a, b))
            tmp_max_list.append(tmp_max)
            cnt += p
        tmp_max_len = len(tmp_max_list)
        max_ = torch.concat(tmp_max_list, dim=3)
        max_ = torch.nn.functional.max_pool2d(max_, (1, tmp_max_len), (1, tmp_max_len))
    max_ = max_.squeeze(0).squeeze(0)
    x = x - max_
    exp_x = torch.exp(x)
    sum_x = torch.sum(exp_x, dim=-1)
    return exp_x / sum_x
Important
During training, use PyTorch's softmax, and when building the inference script, replace it with npu_softmax.
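As a quick, illustrative sanity check (not part of the original document), the NPU-friendly function can be compared numerically against PyTorch's own softmax:

import torch

# Sanity check (illustrative): npu_softmax should numerically match torch.softmax.
x = torch.randn([1, 10])
ref = torch.softmax(x, dim=-1)
out = npu_softmax(x, split_and_factorize(10))
print(torch.allclose(ref, out, atol=1e-5))   # expected: True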
4.2 LogSoftmax*
NPU cannot directly support LogSoftmax, but under certain conditions, you can modify the model to make NPU support LogSoftmax computation.
The conditions are:
- The input tensor of log_softmax must be 2-dimensional and the batch size must equal 1.
The LogSoftmax function in the model needs to be replaced with the following function:
TensorFlow
import tensorflow as tf

def factorize(n):
    for i in range(1, 16):
        if n % i == 0 and n // i <= 15:
            return (n // i, i)
    return ()

def split_and_factorize(n):
    result = []
    while not factorize(n):
        for i in range(n - 1, 0, -1):
            if factorize(i):
                result.append(i)
                n -= i
                break
    result.append(n)
    return result

def npu_log_softmax(x, name=None):
    """NPU LogSoftmax
    Args:
        x: A non-empty `Tensor`.
        name: A name for the operation (optional).
    Returns:
        A `Tensor`.
    """
    # x' = x - max(x)
    # y = x' - log(sum(exp(x')))
    assert len(x.shape) == 2 and x.shape[0] == 1
    partitions = split_and_factorize(x.shape[1])
    if len(partitions) == 1:
        a, b = factorize(partitions[0])
        pool_shape = [1, a, b, 1]
        x = tf.reshape(x, pool_shape)
        max_ = tf.nn.max_pool(x, ksize=pool_shape, strides=pool_shape, padding='VALID')
    else:
        cnt = 0
        tmp_max_list = []
        for p in partitions:
            a, b = factorize(p)
            pool_shape = [1, a, b, 1]
            tmp_x = tf.reshape(x[:, cnt:cnt + p], pool_shape)
            tmp_max = tf.nn.max_pool(tmp_x, ksize=pool_shape, strides=pool_shape, padding='VALID')
            tmp_max_list.append(tmp_max)
            cnt += p
        tmp_max_len = len(tmp_max_list)
        assert tmp_max_len <= 15
        max_ = tf.concat(tmp_max_list, axis=2)
        max_ = tf.nn.max_pool(max_, ksize=[1, 1, tmp_max_len, 1], strides=[1, 1, tmp_max_len, 1], padding='VALID')
    x = tf.reshape(x, [-1, 1])
    x = tf.math.subtract(x, max_)
    x = tf.reshape(x, [1, -1])
    exp_x = tf.exp(x)
    exp_sum = tf.math.reduce_sum(exp_x, axis=-1)
    exp_sum_log = tf.log(exp_sum)
    return tf.math.subtract(x, exp_sum_log, name=name)
Important
During training, use TensorFlow's log_softmax, and before exporting CKPT, replace it with npu_log_softmax.
PyTorch
import torch

def factorize(n):
    for i in range(1, 16):
        if n % i == 0 and n // i <= 15:
            return (n // i, i)
    return ()

def split_and_factorize(n):
    result = []
    while not factorize(n):
        for i in range(n - 1, 0, -1):
            if factorize(i):
                result.append(i)
                n -= i
                break
    result.append(n)
    return result

def npu_log_softmax(x, partitions):
    """NPU LogSoftmax
    Args:
        x: A non-empty `Tensor`. Limitation: len(x.shape) == 2 and x.shape[0] == 1
        partitions: A list of integers specifying the partition sizes for the log_softmax operation.
    Returns:
        A `Tensor` representing the real log_softmax output.
    """
    if len(partitions) == 1:
        a, b = factorize(partitions[0])
        pool_shape = [1, 1, a, b]
        x_ = x.reshape(pool_shape)
        max_ = torch.nn.functional.max_pool2d(x_, (a, b), (a, b))
    else:
        cnt = 0
        tmp_max_list = []
        for p in partitions:
            a, b = factorize(p)
            pool_shape = [1, 1, a, b]
            x_ = x[:, cnt:cnt + p]
            tmp_x = x_.reshape(pool_shape)
            tmp_max = torch.nn.functional.max_pool2d(tmp_x, (a, b), (a, b))
            tmp_max_list.append(tmp_max)
            cnt += p
        tmp_max_len = len(tmp_max_list)
        max_ = torch.concat(tmp_max_list, dim=3)
        max_ = torch.nn.functional.max_pool2d(max_, (1, tmp_max_len), (1, tmp_max_len))
    max_ = max_.squeeze(0).squeeze(0)
    x = x - max_
    exp_x = torch.exp(x)
    exp_sum = torch.sum(exp_x, dim=-1)
    exp_sum_log = torch.log(exp_sum)
    return x - exp_sum_log
Important
During training, use PyTorch's log_softmax, and when building the inference script, replace it with npu_log_softmax.
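As a quick, illustrative sanity check (not part of the original document), the NPU-friendly function can be compared numerically against PyTorch's own log_softmax:

import torch

# Sanity check (illustrative): npu_log_softmax should numerically match torch.log_softmax.
x = torch.randn([1, 10])
ref = torch.log_softmax(x, dim=-1)
out = npu_log_softmax(x, split_and_factorize(10))
print(torch.allclose(ref, out, atol=1e-5))   # expected: True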
4.3 BatchNorm*
You can set the FUSE_BN configuration item to choose whether to merge BatchNorm parameters into the convolution.
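For intuition only (this sketch is not from the original document), BN folding rewrites the convolution weights and bias using the BN statistics; the helper below assumes an OIHW weight layout and per-output-channel BN parameters.

import numpy as np

# y = gamma * (conv(x, w) + b - mean) / sqrt(var + eps) + beta
#   = conv(x, w * scale) + (b - mean) * scale + beta, with scale = gamma / sqrt(var + eps)
def fuse_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    scale = gamma / np.sqrt(var + eps)             # per-output-channel scale
    w_fused = w * scale.reshape(-1, 1, 1, 1)       # assumes OIHW weight layout
    b_fused = (b - mean) * scale + beta
    return w_fused, b_fused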
Note
Merging BatchNorm parameters into the convolution may lead to a significant drop in accuracy if convolution weight compression is enabled.