NPU Compiler Usage*

Before using the NPU compiler gxnpuc, please carefully read the following two technical documents:

gxnpuc can convert open-source framework network models into offline model files that are compatible with Guoxin NPU processors.

1. Overview of gxnpuc Toolchain Functions and Corresponding Parameters*

1.1 General Function Parameters*

--help

  • Parameter Description:

    Print relevant parameter information for the gxnpuc toolchain.

  • Usage Example:

    $ gxnpuc --help
    
    usage: gxnpuc [-h] [--cmpt] [--list] [-c {LEO,APUS,GRUS,V100,V120,V150}]
                  [-f {TF,PT}] [-V] [-v] [-m] [-w] [-s] [-q]
                  [config_filename]
    
    NPU Compiler
    
    positional arguments:
      config_filename       config file
    
    optional arguments:
      -h, --help            show this help message and exit
      --cmpt                get version compatibility information between npu-
                            core, python, and frameworks
      --list                list supported ops
      -c {LEO,APUS,GRUS,V100,V120,V150}, --core_name {LEO,APUS,GRUS,V100,V120,V150}
                            subparameter of --list, specify NPU Core for listing
                            supported ops
      -f {TF,PT}, --framework {TF,PT}
                            subparameter of --list, specify Deep Learning
                            Framework for listing supported ops
      -V, --version         show program's version number and exit
      -v, --verbose         verbosely list the processed ops
      -m, --meminfo         verbosely list memory info of ops
      -w, --weights         print compressed weights (GRUS only)
      -s, --save_hist       save histograms of weights value to 'npu_jpgs'
                            directory (GRUS only)
      -q, --quant           inference and generate quant file
    

--version

  • Parameter Description:

    Display current compiler version information.

  • Usage Example:

    gxnpuc --version
    

--list

  • Parameter Description:

    List information about the operators supported by the current NPU compiler.

  • Associated Sub-parameters:

    Sub-parameters are only effective when the --list parameter is used and are not mandatory.

    -c

    Specify the chip version.
    
    Valid values: {LEO, APUS, GRUS, V100, V120, V150}
    

    -f

    Specify the front-end deep learning framework.
    
    Valid values: {TF, PT}
    
  • Usage Examples:

    List all the operators supported by the current NPU compiler:

    gxnpuc --list
    

    List all the operators supported by the current NPU GRUS compiler:

    gxnpuc --list -c GRUS
    

    List all PyTorch operators supported by the current NPU GRUS compiler:

    gxnpuc --list -c GRUS -f PT
    

--cmpt

  • Parameter Description:

    Print compatibility information between the compiler's supported Python versions, chip models, and front-end DL framework versions.

  • Usage Example:

    gxnpuc --cmpt
    

1.2 Model Compilation Parameters*

config_filename

  • Parameter Description:

    Specify the path and filename of the compilation configuration file; the model conversion and optimization will read this configuration file.

  • Associated Sub-parameters (only effective when config_filename is correctly configured):

    -v

    Print the NPU model structure information.
    

    -m

    Print the memory status of each operator node in the NPU model.
    

    -w

    Print compressed weights in the NPU model (GRUS only).
    

    -s

    Save the histogram of model weights to the 'npu_jpgs' folder (GRUS only).
    
  • Usage Example:

    Use the NPU GRUS compiler with config.yaml as the conversion configuration file and enable all sub-parameter functions:

    gxnpuc config.yaml -v -m -w -s
    

Notes on gxnpuc Function Parameter Usage

The four groups of parameters --version/-V, --list, --cmpt, config_filename are mutually exclusive and cannot be used simultaneously.

2. Model Compilation Configuration File Explanation*

2.1 TensorFlow Configuration Items*

| Configuration Item | Parameter Values | Parameter Description |
|---|---|---|
| CORENAME | GRUS | Chip model |
| NPU_UNIT | NPU32 | Specify the NPU model |
| FRAMEWORK | TF | Specify the front-end DL framework type of the model to be converted |
| MODEL_FILE | Model file name and path, e.g., ./model.pb | Specify the file name and path of the model to be converted |
| OUTPUT_TYPE | c_code | Specify the format of the NPU file output by the compiler |
| OUTPUT_FILE | NPU file name, e.g., npu.h | Specify the name of the NPU file output by the compiler |
| INPUT_OPS | op_name: shape | Specify information about all input nodes of the NPU model |
| OUTPUT_OPS | [output_name, ...] | Specify information about all output nodes of the NPU model |
| FP16_OUT_OPS | [out_state_name, ...] | Specify the output nodes of the NPU model whose outputs stay in Float16 (FP16) format |
| FUSE_BN | true / false (default: false) | Enable or disable batch normalization (BN) parameter fusion |
| COMPRESS | true / false (default: false) | Enable or disable fully connected weight quantization compression |
| CONV2D_COMPRESS | true / false (default: false) | Enable or disable convolution weight quantization compression |
| EXCLUDE_COMPRESS_OPS | [weight_op_name, ...] | Specify weight nodes to be excluded from quantization compression |
| WEIGHT_MIN_MAX | weight_op_name: [min, max] | Specify the minimum and maximum values used for weight node quantization compression |
| WEIGHT_CACHE_SIZE | Allocated memory size, e.g., 10240 | Specify the size of the memory allocated in SRAM to store weights |

Notes

  • The original model file to be converted must be in FrozenPB format.
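
For an at-a-glance view of how these items fit together, below is a minimal example configuration; the values are illustrative (taken from the examples later in this section), and only the items a given model actually needs have to be set:

    config.yaml
    CORENAME: GRUS
    NPU_UNIT: NPU32
    FRAMEWORK: TF
    MODEL_FILE: ./model.pb
    OUTPUT_TYPE: c_code
    OUTPUT_FILE: npu.h
    INPUT_OPS:
        Feats:    [1, 1, 64]
        State_c0: [1, 3, 64]
        State_c1: [1, 4, 64]
        State_c2: [1, 5, 64]
    OUTPUT_OPS: [State_c0_out, State_c1_out, State_c2_out, Result]
    FP16_OUT_OPS: [State_c0_out, State_c1_out, State_c2_out]
    FUSE_BN: false
    COMPRESS: true
    CONV2D_COMPRESS: false
    WEIGHT_CACHE_SIZE: 10240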

INPUT_OPS

  • Parameter Format

    op_name: shape

    op_name —— The name of the input node in the model
    
    shape   —— The shape of the input with the node name op_name during inference
    
  • Example

    In the TensorFlow framework, when building a model, placeholders are usually defined as input to the computation graph, and users need to specify specific identifiers for the placeholders as input names.

    In this example code, the model defines four sets of placeholders (model inputs) and assigns "Feats", "State_c0", "State_c1", "State_c2" as identifiers for input names.

    Where state0_in, state1_in, state2_in are the output state values from the previous frame (all-zero tensor for the initial frame).

    inputs    = tf.placeholder(tf.float32, [1, 1, 64], name="Feats")
    state0_in = tf.placeholder(tf.float32, [1, 3, 64], name="State_c0")
    state1_in = tf.placeholder(tf.float32, [1, 4, 64], name="State_c1")
    state2_in = tf.placeholder(tf.float32, [1, 5, 64], name="State_c2")
    

    Therefore, the INPUT_OPS parameter in the model configuration file can be configured as follows:

    config.yaml
    INPUT_OPS:
        Feats:    [1, 1, 64]
        State_c0: [1, 3, 64]
        State_c1: [1, 4, 64]
        State_c2: [1, 5, 64]
    

OUTPUT_OPS

  • Parameter Format

    [output_name, ...]

    output_name is the name of the output node in the model
    
  • Example

    In the TensorFlow framework, for ease of NPU compilation configuration, users can use the tf.identity interface to copy and rename output tensors.

    In this example code, identifiers "Result", "State_c0_out", "State_c1_out", "State_c2_out" are assigned to four sets of output tensors.

    Where state0_out, state1_out, state2_out are the output state values for the current frame.

    outputs, states = fsmn_layer(...)
    
    result_out = tf.identity(outputs,   name="Result")
    state0_out = tf.identity(states[0], name="State_c0_out")
    state1_out = tf.identity(states[1], name="State_c1_out")
    state2_out = tf.identity(states[2], name="State_c2_out")
    

    Therefore, the OUTPUT_OPS parameter in the model configuration file can be configured as follows:

    config.yaml
    OUTPUT_OPS: [State_c0_out, State_c1_out, State_c2_out, Result]
    

    Notes

    • The output state nodes must be placed before the predicted output nodes.

FP16_OUT_OPS

  • Related Overview

    NPU performs internal computations using data in the FP16 format, and both input and output tensors are in FP16 format.

    NPU supports the FP16_TO_FP32 format conversion feature but does not support the FP32_TO_FP16 format conversion feature.

    In the processing flow of recurrent neural networks, at each time step the network receives the current input and the hidden state from the previous time step. With this information, it generates the hidden state and the corresponding prediction result for the current time step.

    In practical applications, the output tensor corresponding to the prediction result will be converted to FP32 format first and then processed for prediction. The output tensor corresponding to the hidden state will be directly used as the input for the next frame in FP16 format.

    Users need to divide the data format of output nodes into FP32 and FP16 according to the specific model structure.

    Users can use the FP16_OUT_OPS parameter to specify which output nodes are in FP16 format.

  • Parameter Format

    [output_state_name, ...]

    output_state_name is the name of the output node corresponding to the model hidden state
    
  • Example

    Continuing from the OUTPUT_OPS Example

    The FP16_OUT_OPS parameter in the model configuration file can be configured as follows:

    config.yaml
    FP16_OUT_OPS: [State_c0_out, State_c1_out, State_c2_out]
    

EXCLUDE_COMPRESS_OPS

  • Related Overview

    After enabling the NPU compiler's weight quantization feature, if the model's inference accuracy is poor, users can use the weight distribution histograms to check whether the statistically determined data range of each weight node is reasonable. This parameter allows users to exclude the specified weight nodes from quantization compression altogether.

  • Parameter Format

    [weight_op_name, ...]

    weight_op_name is the name of the weight node for which quantization processing needs to be disabled.
    

  • Usage Example

    In this example, assume that quantizing some convolution weights in the model leads to overall poor performance. Therefore, quantization is not applied to these weights.

    The configuration file's EXCLUDE_COMPRESS_OPS parameter can be configured as follows:

    config.yaml
    EXCLUDE_COMPRESS_OPS: [conv2d_5/Conv2D/ReadVariableOp/_74__cf__74,
                           conv2d_6/Conv2D/ReadVariableOp/_75__cf__75]
    

WEIGHT_MIN_MAX

  • Related Overview

    The NPU compiler uses Post-Training Quantization (PTQ) and adopts the MinMax method to quantize weights.

    After enabling the NPU compiler's weight quantization feature, if the model's inference accuracy is poor, users can use the weight distribution histograms to check whether the statistically determined data range of each weight node is reasonable. This parameter allows users to directly configure the data range used to quantize specified weight nodes.

  • Parameter Format

    weight_op_name: [min, max]

    weight_op_name —— Name of a specific weight node in the model.
    
    min, max       —— Specify the minimum and maximum values for the tensor named weight_op_name.
    

  • Usage Example

    In this example, the quantization ranges for two convolutions are specified. The configuration file's WEIGHT_MIN_MAX parameter can be configured as follows:

    config.yaml
    WEIGHT_MIN_MAX:
        conv2d_5/Conv2D/ReadVariableOp/_74__cf__74: [min0, max0]
        conv2d_6/Conv2D/ReadVariableOp/_75__cf__75: [min1, max1]
    

For specific usage scenarios and methods of using EXCLUDE_COMPRESS_OPS and WEIGHT_MIN_MAX parameters, please refer to NPU Quantization Accuracy Debugging.
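
In addition to the histogram analysis enabled by -s, the actual value range of a weight node can be read directly from the frozen graph to help choose WEIGHT_MIN_MAX values. Below is a minimal sketch, assuming TensorFlow 1.x, a frozen graph named model.pb, and weights stored as Const nodes; the node name is taken from the example above:

    import numpy as np
    import tensorflow as tf
    from tensorflow.python.framework import tensor_util

    # Load the frozen graph and print the min/max of one weight node,
    # e.g. as a starting point for a WEIGHT_MIN_MAX entry.
    graph_def = tf.GraphDef()
    with open("model.pb", "rb") as f:
        graph_def.ParseFromString(f.read())

    target = "conv2d_5/Conv2D/ReadVariableOp/_74__cf__74"
    for node in graph_def.node:
        if node.op == "Const" and node.name == target:
            w = tensor_util.MakeNdarray(node.attr["value"].tensor)
            print(target, float(np.min(w)), float(np.max(w)))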

2.2 PyTorch Configuration*

Notes

  • If the user needs to compile and convert a PyTorch model, the NPU compiler version must be 1.6.0b0 or later, and Python 3.7 must be used.

  • The original model to be converted must be in the jit.ScriptModule format.

| Configuration Item | Parameter Values | Parameter Description |
|---|---|---|
| CORENAME | GRUS | Chip model |
| NPU_UNIT | NPU32 | Specify the NPU model |
| FRAMEWORK | PT | Specify the front-end DL framework type of the model to be converted |
| MODEL_FILE | Model file name and path, e.g., ./model.pb | Specify the file name and path of the model to be converted |
| OUTPUT_TYPE | c_code | Specify the format of the NPU file output by the compiler |
| OUTPUT_FILE | NPU file name, e.g., npu.h | Specify the name of the NPU file output by the compiler |
| INPUT_OPS | input_index: shape | Specify information about all input nodes of the NPU model |
| INPUT_NCX_TO_NXC | [input_index, ...] | Specify which NPU model input tensors should have their data layout format converted |
| FP16_OUT_OPS | [out_state_index, ...] | Specify the output nodes of the NPU model whose outputs stay in Float16 (FP16) format |
| FUSE_BN | true / false (default: false) | Enable or disable batch normalization (BN) parameter fusion |
| COMPRESS | true / false (default: false) | Enable or disable fully connected weight quantization compression |
| CONV2D_COMPRESS | true / false (default: false) | Enable or disable convolution weight quantization compression |
| EXCLUDE_COMPRESS_OPS | [weight_op_name, ...] | Specify weight nodes to be excluded from quantization compression |
| WEIGHT_MIN_MAX | weight_op_name: [min, max] | Specify the minimum and maximum values used when quantizing the specified weight node |
| WEIGHT_CACHE_SIZE | Allocated memory value, e.g., 10240 | Specify the size of the memory allocated in SRAM to store weights |
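
For an at-a-glance view, below is a minimal example configuration for a PyTorch model; the values, including the ./model.pt path, are illustrative and follow the INPUT_OPS example later in this section:

    config.yaml
    CORENAME: GRUS
    NPU_UNIT: NPU32
    FRAMEWORK: PT
    MODEL_FILE: ./model.pt
    OUTPUT_TYPE: c_code
    OUTPUT_FILE: npu.h
    INPUT_OPS:
        0: [1, 3, 32, 32]
        1: [1, 1, 32, 32]
    INPUT_NCX_TO_NXC: []
    FUSE_BN: false
    COMPRESS: false
    CONV2D_COMPRESS: false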

INPUT_OPS

  • Parameter Format

    input_index: shape

    input_index —— Index of the input node
    
    shape       —— Inference shape of the corresponding input tensor
    
  • Usage Example

    Due to the dynamic computation graph used in the PyTorch framework, the names of the operator nodes are automatically generated, making it impossible to use identifiers for input configurations.

    When converting PyTorch models with the NPU compiler, the INPUT_OPS parameter in the configuration file uses index values for mapping input tensors.

    In this example, a custom PyTorch model is built by inheriting from the nn.Module base class, and the configuration of the INPUT_OPS parameter is explained:

    import torch
    import torch.nn as nn

    class Net(nn.Module):
        def __init__(self):
            super(Net, self).__init__()
            self.conv1 = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3)
            self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)
            self.conv2 = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3)
            self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)
            self.adaptive_pool = nn.AdaptiveMaxPool2d((1,1))
            self.flatten = nn.Flatten()
            self.linear1 = nn.Linear(64,32)
            self.relu    = nn.ReLU()
            self.linear2 = nn.Linear(32,1)
    
        def forward(self, x, y):
            x = self.conv1(x)
            x = self.pool1(x)
            y = self.conv2(y)
            y = self.pool2(y)
            z = torch.concat([x,y], dim=1)
            z = self.adaptive_pool(z)
            z = self.flatten(z)
            z = self.linear1(z)
            z = self.relu(z)
            y = self.linear2(z)
            return y
    
    net           = Net()
    input0_tensor = torch.randn([1, 3, 32, 32])
    input1_tensor = torch.randn([1, 1, 32, 32])
    
    output_tensor = net(input0_tensor, input1_tensor)
    

    From the forward method, it can be seen that the model defines two sets of inputs x and y, where the index of input x is 0, and the index of input y is 1.

    Therefore, the INPUT_OPS parameter in the configuration file for this model can be configured as follows:

    config.yaml
    INPUT_OPS:
        0: [1, 3, 32, 32]
        1: [1, 1, 32, 32]
    
  • Constraints

    When converting PyTorch models, it is necessary to ensure that the input is a tensor and cannot be a list or tuple of tensors.

    The NPU compiler does not support the input formats mentioned above, so users need to split tensor lists and tensor tuples into tensors for input.

    In this example model, the input uses a tensor list/tensor tuple (which is different from the model in the usage example):

    import torch
    import torch.nn as nn

    class Net(nn.Module):
        def __init__(self):
            super(Net, self).__init__()
            self.conv1 = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3)
            self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)
            self.conv2 = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3)
            self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)
            self.adaptive_pool = nn.AdaptiveMaxPool2d((1,1))
            self.flatten = nn.Flatten()
            self.linear1 = nn.Linear(64,32)
            self.relu    = nn.ReLU()
            self.linear2 = nn.Linear(32,1)
    
        def forward(self, xy):
            x = self.conv1(xy[0])
            x = self.pool1(x)
            y = self.conv2(xy[1])
            y = self.pool2(y)
            z = torch.concat([x, y], dim=1)
            z = self.adaptive_pool(z)
            z = self.flatten(z)
            z = self.linear1(z)
            z = self.relu(z)
            y = self.linear2(z)
            return y
    
    net           = Net()
    input0_tensor = torch.randn([1, 3, 32, 32])
    input1_tensor = torch.randn([1, 1, 32, 32])
    list_tensor   = [input0_tensor, input1_tensor]
    tuple_tensor  = (input0_tensor, input1_tensor)
    
    # Input is a tensor list
    output_tensor = net(list_tensor)
    
    # Input is a tensor tuple
    output_tensor = net(tuple_tensor)
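
    One possible adaptation, sketched below under the assumption that the model is exported by tracing, is to wrap the original model in a helper Module (the NetWrapper name is illustrative) whose forward receives plain tensors and re-packs them internally:

    class NetWrapper(nn.Module):
        def __init__(self, net):
            super(NetWrapper, self).__init__()
            self.net = net

        def forward(self, x, y):
            # Re-pack the two plain tensor inputs into the list the
            # original model expects; tracing resolves the list indexing.
            return self.net([x, y])

    wrapped       = NetWrapper(net)
    output_tensor = wrapped(input0_tensor, input1_tensor)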
    

INPUT_NCX_TO_NXC

  • Related Overview

    In the field of deep learning, multi-dimensional tensors are typically used for data transmission between model operator nodes. For example, the feature maps of convolutional neural networks are usually stored in four-dimensional tensors.

    The dimensions of a four-dimensional tensor can be represented as follows: N (batch), H (height), W (width), C (channels). As data is stored in memory linearly, changing the order of dimension access will result in different memory layouts. In the PyTorch framework, the NCHW order is used, while in the TensorFlow framework, the NHWC order is used. These two access orders can be referred to as the data format.

    NPU compute-intensive operators use the NHWC and NLC data layout formats. If the input tensor of such an operator is in the NCHW or NCL data layout format, the compiler will insert a transpose node before the operator to convert the data format.

  • Parameter Description

    When the data format of the original model's input tensor is NCHW or NCL, this parameter determines whether, during NPU model inference, the input tensor keeps the original PyTorch data layout or is supplied already converted to the target layout (NCHW -> NHWC or NCL -> NLC).

  • Parameter Format

    [input_index, ...]

    input_index —— Index of the input nodes that need data layout format conversion
    

    Notes

    When enabling data layout format conversion for a specific input tensor, it is necessary to adjust the shape parameter corresponding to the input in the INPUT_OPS parameter. See the usage example for details.

  • Usage Example

    import torch
    import torch.nn as nn
    
    class Model(torch.nn.Module):
        def __init__(self):
            super(Model, self).__init__()
            self.gru  = torch.nn.GRU(128, 128, batch_first=True, bias=True)
            self.conv = torch.nn.Conv1d(128, 128, 1, 1)
    
        def forward(self, x, h):
            x = self.conv(x)
            x = x.permute(0, 2, 1)
            z = self.gru(x, h)
    
            return z
    
    model        = Model()
    batch        = 1
    seq_length   = 32
    channel      = 128
    input_tensor = torch.randn([batch, channel, seq_length])
    input_state  = torch.randn([1, batch, channel])
    
    output_tensor, output_state = model(input_tensor, input_state)
    

    In the above model, during PyTorch inference, the input_state data has a non-NCL data layout format, so data layout format conversion is not required.

    The input_tensor data has an NCL data layout format, so it can choose whether to enable data layout format conversion before providing it to the NPU model.

    Below are descriptions of the parameters when the conversion is enabled and disabled:

    • Enabled parameter: input_tensor undergoes format conversion

      The original model input tensor input_tensor has a shape of [1, 128, 32]. After enabling the format conversion function, the actual shape required by the NPU model is [1, 32, 128].

      The model configuration file's INPUT_NCX_TO_NXC and INPUT_OPS parameters can be configured as follows:

      config.yaml
      INPUT_NCX_TO_NXC: [0]
      
      INPUT_OPS:
          0: [1, 32, 128]
          1: [1, 1, 128]
      

      The model structure diagram is as follows:

    • Disabled parameter: input_tensor maintains the original data layout format

      The model configuration file's INPUT_NCX_TO_NXC and INPUT_OPS parameters can be configured as follows:

      config.yaml
      INPUT_NCX_TO_NXC: []
      
      INPUT_OPS:
          0: [1, 128, 32]
          1: [1, 1, 128]
      

      The model structure diagram (INPUT_NCX_TO_NXC disabled) is as follows:

    By comparing the two NPU model structure diagrams above, it can be seen that enabling the format conversion function can optimize the inserted transpose nodes on the input side.

    This parameter provides more flexible configuration of the data format of input tensors based on user requirements.
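
    Note that when the conversion is enabled, the data fed to the NPU model at inference time must already be arranged in the converted layout. A minimal sketch of the host-side rearrangement for the example above (the variable names are illustrative):

    # Original PyTorch input in NCL layout: [1, 128, 32]
    input_tensor = torch.randn([1, 128, 32])

    # With INPUT_NCX_TO_NXC: [0], the NPU model expects NLC layout: [1, 32, 128]
    input_nlc = input_tensor.permute(0, 2, 1).contiguous()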

FP16_OUT_OPS

  • Related Overview

    NPU performs internal computations using data in the FP16 format, and both input and output tensors are in FP16 format.

    NPU supports the FP16_TO_FP32 format conversion function but does not support the FP32_TO_FP16 format conversion function.

    In the processing flow of recurrent neural networks, at each time step, the network receives input for the current time and the hidden state from the previous time step. With this information, it generates the hidden state for the current time and the corresponding prediction result.

    In practical applications, the output tensor corresponding to the prediction result is converted to FP32 format first and then processed for prediction. The output tensor corresponding to the hidden state is directly used as input for the next frame in FP16 format.

    Users need to divide the output nodes of the model into FP32 and FP16 according to the specific model structure. Users can specify which output nodes are in FP16 format using the FP16_OUT_OPS parameter.

  • Parameter Values

    [out_state_index, ...]

    out_state_index —— Index of the output nodes corresponding to the hidden state of the model
    
  • Usage Example

    PyTorch model with multiple outputs:

    import torch
    import torch.nn as nn
    
    class TestModel(torch.nn.Module):
        def __init__(self):
            super(TestModel, self).__init__()
            self.gru  = nn.GRU(32, 32, batch_first=True)
            self.conv = nn.Conv1d(32, 32, 1, 1)
    
        def forward(self, x, h):
            t = torch.split(x, [1, 2, 3, 4], dim=1)
            y = self.gru(x, h)
            z = torch.sigmoid(y[0])
            z = z.permute(0, 2, 1)
            o = self.conv(z)
    
            return t, y, o
    
    net          = TestModel()
    input_tensor = torch.randn([1, 10, 32])
    state_tensor = torch.randn([1, 1, 32])
    
    output_t, output_y, output_o = net(input_tensor, state_tensor)
    

    From the forward method, it can be seen that the model defines three sets of outputs t, y, and o, where outputs t and y are tensor lists, and output o is a tensor.

    The NPU compiler will expand tensor list outputs into tensor outputs. For the above model, the NPU compiler will consider that there are 7 outputs: t[0], t[1], t[2], t[3], y[0], y[1], o.

    The model output y[1] is the output node corresponding to the hidden state of the gru module, which needs to be used as the input to the gru module for the next frame of inference. This output should not undergo FP16 -> FP32 conversion.

    The FP16_OUT_OPS parameter in the model configuration file can be configured as follows:

    config.yaml
    FP16_OUT_OPS: [5]
    

3. Compile Model*

3.1 Model File Preparation*

The NPU compiler strictly enforces model file formats, and different frameworks need to export them in specific ways.

TensorFlow

  • Prepare CKPT and PB files generated by TensorFlow or model files generated in saved_model format.
  • Use the freeze_graph.py script provided by TensorFlow to generate a FROZEN_PB file.
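
The freeze_graph.py script is the standard route; as an equivalent, freezing can also be done directly in Python. A minimal sketch (TensorFlow 1.x; checkpoint paths and output node names are illustrative, reusing the node names from the FSMN example in Section 2.1):

    import tensorflow as tf
    from tensorflow.python.framework import graph_util

    with tf.Session() as sess:
        # Restore the trained graph and variables from the checkpoint
        saver = tf.train.import_meta_graph("model.ckpt.meta")
        saver.restore(sess, "model.ckpt")

        # Convert variables to constants, keeping only the listed output nodes
        frozen = graph_util.convert_variables_to_constants(
            sess, sess.graph_def,
            ["State_c0_out", "State_c1_out", "State_c2_out", "Result"])

        # Write the FrozenPB file consumed by gxnpuc
        tf.train.write_graph(frozen, ".", "model.pb", as_text=False)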

PyTorch

  • After training the model, export the model weight file.
  • Build a PyTorch inference model script and generate a PyTorch Module instance.
  • Convert the custom PyTorch Module to Torch ScriptModule and serialize the ScriptModule instance for the compiler.

    For specific steps, refer to the PyTorch Model Conversion Example
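
A minimal sketch of these steps, assuming the model is exported by tracing; the weight file name and the Net class (from the INPUT_OPS example in Section 2.2) are illustrative:

    import torch

    # Build the inference Module and load the trained weights
    net = Net()
    net.load_state_dict(torch.load("model_weights.pth"))
    net.eval()

    # Convert the Module to a ScriptModule by tracing with example inputs
    example_inputs = (torch.randn([1, 3, 32, 32]), torch.randn([1, 1, 32, 32]))
    script_module  = torch.jit.trace(net, example_inputs)

    # Serialize the ScriptModule; this file is referenced by MODEL_FILE
    script_module.save("model.pt")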

3.2 Write Configuration File*

  • Write a YAML configuration file, including the model file name, output file name, output file type, compression status, input node names and dimensions, output node names, etc.

3.3 Compile and Generate Model File*

Compile the model using the following command:

$ gxnpuc config.yaml

Note

When the NPU toolchain compiles model files for different deep learning frameworks, the corresponding framework's runtime environment must be installed. Refer to the NPU Model Format Specification for information on the generated model file format.

4. Explanation of Some Ops*

4.1 Softmax*

NPU cannot directly support Softmax, but under certain conditions, you can modify the model to make NPU support Softmax computation.

The conditions are:

  • The input tensor of softmax must be 2-dimensional with a batch size of 1.

The Softmax function in the model needs to be replaced with the following function:

TensorFlow

def factorize(n):
    for i in range(1, 16):
        if n % i == 0 and n // i <= 15:
            return (n // i, i)
    return ()

def split_and_factorize(n):
    result = []
    while not factorize(n):
        for i in range(n-1, 0, -1):
            if factorize(i):
                result.append(i)
                n -= i
                break
    result.append(n)
    return result

def npu_softmax(x, name=None):
    """ NPU Softmax
    Args:
      x: A non-empty `Tensor`.
      name: A name for the operation (optional).
    Returns
      A `Tensor`.
    """ 
    # x' = x - max(x)
    # y = exp(x') / sum(exp(x'))
    assert len(x.shape) == 2 and x.shape[0] == 1
    partitions = split_and_factorize(x.shape[1])
    if len(partitions) == 1:
        a, b = factorize(partitions[0])
        pool_shape = [1, a, b, 1]
        x = tf.reshape(x, pool_shape)
        max_ = tf.nn.max_pool(x, ksize=pool_shape, strides=pool_shape, padding='VALID')
    else:
        cnt = 0
        tmp_max_list = []
        for p in partitions:
            a, b = factorize(p)
            pool_shape = [1, a, b, 1]
            tmp_x = tf.reshape(x[:,cnt:cnt+p], pool_shape)
            tmp_max = tf.nn.max_pool(tmp_x, ksize=pool_shape, strides=pool_shape, padding='VALID')
            tmp_max_list.append(tmp_max)
            cnt += p
        tmp_max_len = len(tmp_max_list)
        assert tmp_max_len <= 15
        max_ = tf.concat(tmp_max_list, axis=2)
        max_ = tf.nn.max_pool(max_, ksize=[1,1,tmp_max_len,1], strides=[1,1,tmp_max_len,1], padding='VALID')
    x = tf.reshape(x, [-1, 1])
    x = tf.math.subtract(x, max_)
    exp_x = tf.exp(x)
    exp_x = tf.reshape(exp_x, [1, -1])
    return tf.math.divide(exp_x, tf.math.reduce_sum(exp_x, axis=-1), name=name)

Important

During training, use TensorFlow's softmax, and before exporting CKPT, replace it with npu_softmax.

PyTorch

def factorize(n):
    for i in range(1, 16):
        if n % i == 0 and n // i <= 15:
            return (n // i, i)
    return ()

def split_and_factorize(n):
    result = []
    while not factorize(n):
        for i in range(n-1, 0, -1):
            if factorize(i):
                result.append(i)
                n -= i
                break
    result.append(n)
    return result

def npu_softmax(x, partitions):
    """ NPU Softmax
    Args:
      x: A non-empty `Tensor`. limitation: len(x.shape) == 2 and x.shape[0] == 1
      partitions: A list of integers specifying the partition sizes for the softmax operation.
    Returns
      A `Tensor` representing the real softmax output.
    """ 

    if len(partitions) == 1:
        a, b       = factorize(partitions[0])
        pool_shape = [1, 1, a, b]
        x_   = x.reshape(pool_shape)
        max_ = torch.nn.functional.max_pool2d(x_, (a,b), (a,b))
    else:
        cnt = 0 
        tmp_max_list = []

        for p in partitions:
            a, b = factorize(p)
            pool_shape = [1, 1, a, b]

            x_      = x[:,cnt:cnt+p]
            tmp_x   = x_.reshape(pool_shape)
            tmp_max = torch.nn.functional.max_pool2d(tmp_x, (a,b), (a,b))
            tmp_max_list.append(tmp_max)

            cnt += p

        tmp_max_len = len(tmp_max_list)

        max_ = torch.concat(tmp_max_list, axis=3)
        max_ = torch.nn.functional.max_pool2d(max_, (1,tmp_max_len), (1,tmp_max_len))

    max_  = max_.squeeze(0).squeeze(0)
    x     = x - max_
    exp_x = torch.exp(x)
    sum_x = torch.sum(exp_x, dim=-1)

    return exp_x / sum_x
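
A minimal usage sketch: unlike the TensorFlow version, the PyTorch npu_softmax takes the partition list as an argument, so it can be computed once with split_and_factorize in the inference script (the input shape is illustrative; npu_log_softmax in Section 4.2 is used the same way):

import torch

x          = torch.randn([1, 227])
partitions = split_and_factorize(x.shape[1])   # 227 -> [225, 2]
y          = npu_softmax(x, partitions)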

Important

During training, use PyTorch's softmax, and when building the inference script, replace it with npu_softmax.

4.2 LogSoftmax*

NPU cannot directly support LogSoftmax, but under certain conditions, you can modify the model to make NPU support LogSoftmax computation.

The conditions are:

  • The input tensor of log_softmax must be 2-dimensional with a batch size of 1.

The LogSoftmax function in the model needs to be replaced with the following function:

TensorFlow

def factorize(n):
    for i in range(1, 16):
        if n % i == 0 and n // i <= 15:
            return (n // i, i)
    return ()

def split_and_factorize(n):
    result = []
    while not factorize(n):
        for i in range(n-1, 0, -1):
            if factorize(i):
                result.append(i)
                n -= i
                break
    result.append(n)
    return result

def npu_log_softmax(x, name=None):
    """ NPU LogSoftmax
    Args:
      x: A non-empty `Tensor`.
      name: A name for the operation (optional).
    Returns
      A `Tensor`.
    """
    # x' = x - max(x)
    # y  = x' - log(sum(exp(x')))
    assert len(x.shape) == 2 and x.shape[0] == 1
    partitions = split_and_factorize(x.shape[1])
    if len(partitions) == 1:
        a, b = factorize(partitions[0])
        pool_shape = [1, a, b, 1]
        x = tf.reshape(x, pool_shape)
        max_ = tf.nn.max_pool(x, ksize=pool_shape, strides=pool_shape, padding='VALID')
    else:
        cnt = 0
        tmp_max_list = []
        for p in partitions:
            a, b = factorize(p)
            pool_shape = [1, a, b, 1]
            tmp_x = tf.reshape(x[:,cnt:cnt+p], pool_shape)
            tmp_max = tf.nn.max_pool(tmp_x, ksize=pool_shape, strides=pool_shape, padding='VALID')
            tmp_max_list.append(tmp_max)
            cnt += p
        tmp_max_len = len(tmp_max_list)
        assert tmp_max_len <= 15
        max_ = tf.concat(tmp_max_list, axis=2)
        max_ = tf.nn.max_pool(max_, ksize=[1,1,tmp_max_len,1], strides=[1,1,tmp_max_len,1], padding='VALID')
    x = tf.reshape(x, [-1, 1])
    x = tf.math.subtract(x, max_)

    x = tf.reshape(x, [1, -1])

    exp_x       = tf.exp(x)
    exp_sum     = tf.math.reduce_sum(exp_x, axis=-1)
    exp_sum_log = tf.log(exp_sum)

    return tf.math.subtract(x, exp_sum_log, name=name)

Important

During training, use TensorFlow's log_softmax, and before exporting CKPT, replace it with npu_log_softmax.

PyTorch

def factorize(n):
    for i in range(1, 16):
        if n % i == 0 and n // i <= 15:
            return (n // i, i)
    return ()

def split_and_factorize(n):
    result = []
    while not factorize(n):
        for i in range(n-1, 0, -1):
            if factorize(i):
                result.append(i)
                n -= i
                break
    result.append(n)
    return result

def npu_log_softmax(x, partitions):
    """ NPU LogSoftmax
    Args:
      x: A non-empty `Tensor`. limitation: len(x.shape) == 2 and x.shape[0] == 1
      partitions: A list of integers specifying the partition sizes for the log_softmax operation.
    Returns
      A `Tensor` representing the real log_softmax output.
    """

    if len(partitions) == 1:
        a, b       = factorize(partitions[0])
        pool_shape = [1, 1, a, b]
        x_   = x.reshape(pool_shape)
        max_ = torch.nn.functional.max_pool2d(x_, (a,b), (a,b))
    else:
        cnt = 0
        tmp_max_list = []

        for p in partitions:
            a, b = factorize(p)
            pool_shape = [1, 1, a, b]

            x_      = x[:,cnt:cnt+p]
            tmp_x   = x_.reshape(pool_shape)
            tmp_max = torch.nn.functional.max_pool2d(tmp_x, (a,b), (a,b))
            tmp_max_list.append(tmp_max)

            cnt += p

        tmp_max_len = len(tmp_max_list)

        max_ = torch.concat(tmp_max_list, axis=3)
        max_ = torch.nn.functional.max_pool2d(max_, (1,tmp_max_len), (1,tmp_max_len))

    max_  = max_.squeeze(0).squeeze(0)
    x     = x - max_

    exp_x       = torch.exp(x)
    exp_sum     = torch.sum(exp_x, dim=-1)
    exp_sum_log = torch.log(exp_sum)

    return x - exp_sum_log

Important

During training, use PyTorch's log_softmax, and when building the inference script, replace it with npu_log_softmax.

4.3 BatchNorm*

You can set the FUSE_BN configuration item to choose whether to merge BatchNorm parameters into the convolution.

Note

Merging BatchNorm parameters into the convolution may lead to a significant drop in accuracy if convolution weight compression is enabled.
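
As an illustration, below is a sketch of only the relevant configuration items when BN fusion is enabled while convolution weight compression stays disabled (values are illustrative; other items are omitted):

    config.yaml
    FUSE_BN: true
    CONV2D_COMPRESS: false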