NPU Compiler Usage*

Before using the NPU compiler gxnpuc, please carefully read the following two technical documents:

gxnpuc can convert open-source framework network models into offline model files that are compatible with Guoxin NPU processors.

1. Overview of gxnpuc Toolchain Functions and Corresponding Parameters*

1.1 General Function Parameters*

--help

  • Parameter Description:

    Print relevant parameter information for the gxnpuc toolchain.

  • Usage Example:

    $ gxnpuc --help
    
    usage: gxnpuc [-h] [--cmpt] [--list] [-c {LEO,APUS,GRUS,V100,V120,V150}]
                  [-f {TF,PT}] [-V] [-v] [-m] [-w] [-s] [-q]
                  [config_filename]
    
    NPU Compiler
    
    positional arguments:
      config_filename       config file
    
    optional arguments:
      -h, --help            show this help message and exit
      --cmpt                get version compatibility information between npu-
                            core, python, and frameworks
      --list                list supported ops
      -c {LEO,APUS,GRUS,V100,V120,V150}, --core_name {LEO,APUS,GRUS,V100,V120,V150}
                            subparameter of --list, specify NPU Core for listing
                            supported ops
      -f {TF,PT}, --framework {TF,PT}
                            subparameter of --list, specify Deep Learning
                            Framework for listing supported ops
      -V, --version         show program's version number and exit
      -v, --verbose         verbosely list the processed ops
      -m, --meminfo         verbosely list memory info of ops
      -w, --weights         print compressed weights (GRUS only)
      -s, --save_hist       save histograms of weights value to 'npu_jpgs'
                            directory (GRUS only)
      -q, --quant           inference and generate quant file
    

--version

  • Parameter Description:

    Display current compiler version information.

  • Usage Example:

    gxnpuc --version
    

--list

  • Parameter Description:

    List information about the operators supported by the current NPU compiler.

  • Associated Sub-parameters:

    Sub-parameters are only effective when the --list parameter is used and are not mandatory.

    -c

    Specify the chip version.
    
    Valid values: {LEO, APUS, GRUS, V100, V120, V150}
    

    -f

    Specify the front-end deep learning framework.
    
    Valid values: {TF, PT}
    
  • Usage Examples:

    List all the operators supported by the current NPU compiler:

    gxnpuc --list
    

    List all the operators supported by the current NPU GRUS compiler:

    gxnpuc --list -c GRUS
    

    List all PyTorch operators supported by the current NPU GRUS compiler:

    gxnpuc --list -c GRUS -f PT
    

--cmpt

  • Parameter Description:

    Print compatibility information between the compiler's supported Python versions, chip models, and front-end DL framework versions.

  • Usage Example:

    gxnpuc --cmpt
    

1.2 Model Compilation Parameters*

config_filename

  • Parameter Description:

    Specify the path and filename of the compilation configuration file; the model conversion and optimization will read this configuration file.

  • Associated Sub-parameters (only effective when config_filename is correctly configured):

    -v

    Print the NPU model structure information.
    

    -m

    Print the memory status of each operator node in the NPU model.
    

    -w

    Print compressed weights in the NPU model (GRUS only).
    

    -s

    Save the histogram of model weights to the 'npu_jpgs' folder (GRUS only).
    
  • Usage Example:

    Use the NPU GRUS compiler with config.yaml as the conversion configuration file and enable all sub-parameter functions:

    gxnpuc config.yaml -v -m -w -s
    

Notes on gxnpuc Function Parameter Usage

The four groups of parameters --version/-V, --list, --cmpt, config_filename are mutually exclusive and cannot be used simultaneously.

2. Model Compilation Configuration File Explanation*

2.1 TensorFlow Configuration Items*

| Configuration Item | Parameter Values | Parameter Description |
|---|---|---|
| CORENAME | GRUS | Chip model |
| NPU_UNIT | NPU32 | Specify the NPU model |
| FRAMEWORK | TF | Specify the front-end DL framework type of the model to be converted |
| MODEL_FILE | Model file name and path, e.g., ./model.pb | Specify the file name and path of the model to be converted |
| OUTPUT_TYPE | c_code | Specify the format of the NPU file output by the compiler |
| OUTPUT_FILE | NPU file name, e.g., npu.h | Specify the name of the NPU file output by the compiler |
| INPUT_OPS | op_name: shape | Specify information about all input nodes of the NPU model |
| OUTPUT_OPS | [output_name, ...] | Specify information about all output nodes of the NPU model |
| FP16_OUT_OPS | [out_state_name, ...] | Specify the output nodes of the NPU model whose outputs stay in Float16 (FP16) format |
| FUSE_BN | true / false (default: false) | Enable or disable batch normalization (BN) parameter fusion |
| COMPRESS | true / false (default: false) | Enable or disable fully connected weight quantization compression |
| CONV2D_COMPRESS | true / false (default: false) | Enable or disable convolution weight quantization compression |
| EXCLUDE_COMPRESS_OPS | [weight_op_name, ...] | Specify weight nodes to be excluded from quantization compression |
| WEIGHT_MIN_MAX | weight_op_name: [min, max] | Specify the minimum and maximum values used for weight node quantization compression |
| WEIGHT_CACHE_SIZE | Allocated memory size, e.g., 10240 | Specify the size of the memory allocated in SRAM to store weights |

Notes

  • The original model file to be converted must be in FrozenPB format.
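
For an at-a-glance view of how these items fit together, below is a minimal example configuration; the values are illustrative (taken from the examples later in this section), and only the items a given model actually needs have to be set:

    config.yaml
    CORENAME: GRUS
    NPU_UNIT: NPU32
    FRAMEWORK: TF
    MODEL_FILE: ./model.pb
    OUTPUT_TYPE: c_code
    OUTPUT_FILE: npu.h
    INPUT_OPS:
        Feats:    [1, 1, 64]
        State_c0: [1, 3, 64]
        State_c1: [1, 4, 64]
        State_c2: [1, 5, 64]
    OUTPUT_OPS: [State_c0_out, State_c1_out, State_c2_out, Result]
    FP16_OUT_OPS: [State_c0_out, State_c1_out, State_c2_out]
    FUSE_BN: false
    COMPRESS: true
    CONV2D_COMPRESS: false
    WEIGHT_CACHE_SIZE: 10240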

INPUT_OPS

  • Parameter Format

    op_name: shape

    op_name —— The name of the input node in the model
    
    shape   —— The shape of the input with the node name op_name during inference
    
  • Example

    In the TensorFlow framework, when building a model, placeholders are usually defined as input to the computation graph, and users need to specify specific identifiers for the placeholders as input names.

    In this example code, the model defines four sets of placeholders (model inputs) and assigns "Feats", "State_c0", "State_c1", "State_c2" as identifiers for input names.

    Where state0_in, state1_in, state2_in are the output state values from the previous frame (all-zero tensor for the initial frame).

    inputs    = tf.placeholder(tf.float32, [1, 1, 64], name="Feats")
    state0_in = tf.placeholder(tf.float32, [1, 3, 64], name="State_c0")
    state1_in = tf.placeholder(tf.float32, [1, 4, 64], name="State_c1")
    state2_in = tf.placeholder(tf.float32, [1, 5, 64], name="State_c2")
    

    Therefore, the INPUT_OPS parameter in the model configuration file can be configured as follows:

    config.yaml
    INPUT_OPS:
        Feats:    [1, 1, 64]
        State_c0: [1, 3, 64]
        State_c1: [1, 4, 64]
        State_c2: [1, 5, 64]
    

OUTPUT_OPS

  • Parameter Format

    [output_name, ...]

    output_name is the name of the output node in the model
    
  • Example

    In the TensorFlow framework, for ease of NPU compilation configuration, users can use the tf.identity interface to copy and rename output tensors.

    In this example code, identifiers "Result", "State_c0_out", "State_c1_out", "State_c2_out" are assigned to four sets of output tensors.

    Where state0_out, state1_out, state2_out are the output state values for the current frame.

    outputs, states = fsmn_layer(...)
    
    result_out = tf.identity(outputs,   name="Result")
    state0_out = tf.identity(states[0], name="State_c0_out")
    state1_out = tf.identity(states[1], name="State_c1_out")
    state2_out = tf.identity(states[2], name="State_c2_out")
    

    Therefore, the OUTPUT_OPS parameter in the model configuration file can be configured as follows:

    config.yaml
    OUTPUT_OPS: [State_c0_out, State_c1_out, State_c2_out, Result]
    

    Notes

    • The output state nodes must be placed before the predicted output nodes.

FP16_OUT_OPS

  • Related Overview

    NPU performs internal computations using data in the FP16 format, and both input and output tensors are in FP16 format.

    NPU supports the FP16_TO_FP32 format conversion feature but does not support the FP32_TO_FP16 format conversion feature.

    In the processing flow of recurrent neural networks, at each time step the network receives the current input and the hidden state from the previous time step. With this information, it generates the hidden state and the corresponding prediction result for the current time step.

    In practical applications, the output tensor corresponding to the prediction result will be converted to FP32 format first and then processed for prediction. The output tensor corresponding to the hidden state will be directly used as the input for the next frame in FP16 format.

    Users need to divide the data format of output nodes into FP32 and FP16 according to the specific model structure.

    Users can use the FP16_OUT_OPS parameter to specify which output nodes are in FP16 format.

  • Parameter Format

    [output_state_name, ...]

    output_state_name is the name of the output node corresponding to the model hidden state
    
  • Example

    Continuing from the OUTPUT_OPS Example

    The FP16_OUT_OPS parameter in the model configuration file can be configured as follows:

    config.yaml
    FP16_OUT_OPS: [State_c0_out, State_c1_out, State_c2_out]
    

EXCLUDE_COMPRESS_OPS

  • Related Overview

    After enabling the NPU compiler's weight quantization feature, if the model's inference accuracy is poor, users can use the weight distribution histograms to check whether the statistically determined data range of each weight node is reasonable. This parameter allows users to exclude the specified weight nodes from quantization compression altogether.

  • Parameter Format

    [weight_op_name, ...]

    weight_op_name is the name of the weight node for which quantization processing needs to be disabled.
    

  • Usage Example

    In this example, assume that quantizing some convolution weights in the model leads to overall poor performance. Therefore, quantization is not applied to these weights.

    The configuration file's EXCLUDE_COMPRESS_OPS parameter can be configured as follows:

    config.yaml
    EXCLUDE_COMPRESS_OPS: [conv2d_5/Conv2D/ReadVariableOp/_74__cf__74,
                           conv2d_6/Conv2D/ReadVariableOp/_75__cf__75]
    

WEIGHT_MIN_MAX

  • Related Overview

    The NPU compiler uses Post-Training Quantization (PTQ) and adopts the MinMax method to quantize weights.

    After enabling the NPU compiler's weight quantization feature, if the model's inference accuracy is poor, users can use the weight distribution histograms to check whether the statistically determined data range of each weight node is reasonable. This parameter allows users to directly configure the data range used to quantize specified weight nodes.

  • Parameter Format

    weight_op_name: [min, max]

    weight_op_name —— Name of a specific weight node in the model.
    
    min, max       —— Specify the minimum and maximum values for the tensor named weight_op_name.
    

  • Usage Example

    In this example, the quantization ranges for two convolutions are specified. The configuration file's WEIGHT_MIN_MAX parameter can be configured as follows:

    config.yaml
    WEIGHT_MIN_MAX:
        conv2d_5/Conv2D/ReadVariableOp/_74__cf__74: [min0, max0]
        conv2d_6/Conv2D/ReadVariableOp/_75__cf__75: [min1, max1]
    

For specific usage scenarios and methods of using EXCLUDE_COMPRESS_OPS and WEIGHT_MIN_MAX parameters, please refer to NPU Quantization Accuracy Debugging.
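
In addition to the histogram analysis enabled by -s, the actual value range of a weight node can be read directly from the frozen graph to help choose WEIGHT_MIN_MAX values. Below is a minimal sketch, assuming TensorFlow 1.x, a frozen graph named model.pb, and weights stored as Const nodes; the node name is taken from the example above:

    import numpy as np
    import tensorflow as tf
    from tensorflow.python.framework import tensor_util

    # Load the frozen graph and print the min/max of one weight node,
    # e.g. as a starting point for a WEIGHT_MIN_MAX entry.
    graph_def = tf.GraphDef()
    with open("model.pb", "rb") as f:
        graph_def.ParseFromString(f.read())

    target = "conv2d_5/Conv2D/ReadVariableOp/_74__cf__74"
    for node in graph_def.node:
        if node.op == "Const" and node.name == target:
            w = tensor_util.MakeNdarray(node.attr["value"].tensor)
            print(target, float(np.min(w)), float(np.max(w)))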

2.2 PyTorch Configuration*

Notes

  • If the user needs to compile and convert a PyTorch model, the NPU compiler version must be 1.6.0b0 or later, and Python 3.7 must be used.

  • The original model to be converted must be in the jit.ScriptModule format.

| Configuration Item | Parameter Values | Parameter Description |
|---|---|---|
| CORENAME | GRUS | Chip model |
| NPU_UNIT | NPU32 | Specify the NPU model |
| FRAMEWORK | PT | Specify the front-end DL framework type of the model to be converted |
| MODEL_FILE | Model file name and path, e.g., ./model.pb | Specify the file name and path of the model to be converted |
| OUTPUT_TYPE | c_code | Specify the format of the NPU file output by the compiler |
| OUTPUT_FILE | NPU file name, e.g., npu.h | Specify the name of the NPU file output by the compiler |
| INPUT_OPS | input_index: shape | Specify information about all input nodes of the NPU model |
| INPUT_NCX_TO_NXC | [input_index, ...] | Specify which NPU model input tensors should have their data layout format converted |
| FP16_OUT_OPS | [out_state_index, ...] | Specify the output nodes of the NPU model whose outputs stay in Float16 (FP16) format |
| FUSE_BN | true / false (default: false) | Enable or disable batch normalization (BN) parameter fusion |
| COMPRESS | true / false (default: false) | Enable or disable fully connected weight quantization compression |
| CONV2D_COMPRESS | true / false (default: false) | Enable or disable convolution weight quantization compression |
| EXCLUDE_COMPRESS_OPS | [weight_op_name, ...] | Specify weight nodes to be excluded from quantization compression |
| WEIGHT_MIN_MAX | weight_op_name: [min, max] | Specify the minimum and maximum values used when quantizing the specified weight node |
| WEIGHT_CACHE_SIZE | Allocated memory value, e.g., 10240 | Specify the size of the memory allocated in SRAM to store weights |
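
For an at-a-glance view, below is a minimal example configuration for a PyTorch model; the values, including the ./model.pt path, are illustrative and follow the INPUT_OPS example later in this section:

    config.yaml
    CORENAME: GRUS
    NPU_UNIT: NPU32
    FRAMEWORK: PT
    MODEL_FILE: ./model.pt
    OUTPUT_TYPE: c_code
    OUTPUT_FILE: npu.h
    INPUT_OPS:
        0: [1, 3, 32, 32]
        1: [1, 1, 32, 32]
    INPUT_NCX_TO_NXC: []
    FUSE_BN: false
    COMPRESS: false
    CONV2D_COMPRESS: false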

INPUT_OPS

  • Parameter Format

    input_index: shape

    input_index —— Index of the input node
    
    shape       —— Inference shape of the corresponding input tensor
    
  • Usage Example

    Due to the dynamic computation graph used in the PyTorch framework, the names of the operator nodes are automatically generated, making it impossible to use identifiers for input configurations.

    When converting PyTorch models with the NPU compiler, the INPUT_OPS parameter in the configuration file uses index values for mapping input tensors.

    In this example, a custom PyTorch model is built by inheriting from the nn.Module base class, and the configuration of the INPUT_OPS parameter is explained:

    import torch
    import torch.nn as nn

    class Net(nn.Module):
        def __init__(self):
            super(Net, self).__init__()
            self.conv1 = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3)
            self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)
            self.conv2 = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3)
            self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)
            self.adaptive_pool = nn.AdaptiveMaxPool2d((1,1))
            self.flatten = nn.Flatten()
            self.linear1 = nn.Linear(64,32)
            self.relu    = nn.ReLU()
            self.linear2 = nn.Linear(32,1)
    
        def forward(self, x, y):
            x = self.conv1(x)
            x = self.pool1(x)
            y = self.conv2(y)
            y = self.pool2(y)
            z = torch.concat([x,y], dim=1)
            z = self.adaptive_pool(z)
            z = self.flatten(z)
            z = self.linear1(z)
            z = self.relu(z)
            y = self.linear2(z)
            return y
    
    net           = Net()
    input0_tensor = torch.randn([1, 3, 32, 32])
    input1_tensor = torch.randn([1, 1, 32, 32])
    
    output_tensor = net(input0_tensor, input1_tensor)
    

    From the forward method, it can be seen that the model defines two sets of inputs x and y, where the index of input x is 0, and the index of input y is 1.

    Therefore, the INPUT_OPS parameter in the configuration file for this model can be configured as follows:

    config.yaml
    INPUT_OPS:
        0: [1, 3, 32, 32]
        1: [1, 1, 32, 32]
    
  • Constraints

    When converting PyTorch models, it is necessary to ensure that the input is a tensor and cannot be a list or tuple of tensors.

    The NPU compiler does not support the input formats mentioned above, so users need to split tensor lists and tensor tuples into tensors for input.

    In this example model, the input uses a tensor list/tensor tuple (which is different from the model in the usage example):

    import torch
    import torch.nn as nn

    class Net(nn.Module):
        def __init__(self):
            super(Net, self).__init__()
            self.conv1 = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3)
            self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)
            self.conv2 = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3)
            self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)
            self.adaptive_pool = nn.AdaptiveMaxPool2d((1,1))
            self.flatten = nn.Flatten()
            self.linear1 = nn.Linear(64,32)
            self.relu    = nn.ReLU()
            self.linear2 = nn.Linear(32,1)
    
        def forward(self, xy):
            x = self.conv1(xy[0])
            x = self.pool1(x)
            y = self.conv2(xy[1])
            y = self.pool2(y)
            z = torch.concat([x, y], dim=1)
            z = self.adaptive_pool(z)
            z = self.flatten(z)
            z = self.linear1(z)
            z = self.relu(z)
            y = self.linear2(z)
            return y
    
    net           = Net()
    input0_tensor = torch.randn([1, 3, 32, 32])
    input1_tensor = torch.randn([1, 1, 32, 32])
    list_tensor   = [input0_tensor, input1_tensor]
    tuple_tensor  = (input0_tensor, input1_tensor)
    
    # Input is a tensor list
    output_tensor = net(list_tensor)
    
    # Input is a tensor tuple
    output_tensor = net(tuple_tensor)
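
    One possible adaptation, sketched below under the assumption that the model is exported by tracing, is to wrap the original model in a helper Module (the NetWrapper name is illustrative) whose forward receives plain tensors and re-packs them internally:

    class NetWrapper(nn.Module):
        def __init__(self, net):
            super(NetWrapper, self).__init__()
            self.net = net

        def forward(self, x, y):
            # Re-pack the two plain tensor inputs into the list the
            # original model expects; tracing resolves the list indexing.
            return self.net([x, y])

    wrapped       = NetWrapper(net)
    output_tensor = wrapped(input0_tensor, input1_tensor)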
    

INPUT_NCX_TO_NXC

  • Related Overview

    In the field of deep learning, multi-dimensional tensors are typically used for data transmission between model operator nodes. For example, the feature maps of convolutional neural networks are usually stored in four-dimensional tensors.

    The dimensions of a four-dimensional tensor can be represented as follows: N (batch), H (height), W (width), C (channels). As data is stored in memory linearly, changing the order of dimension access will result in different memory layouts. In the PyTorch framework, the NCHW order is used, while in the TensorFlow framework, the NHWC order is used. These two access orders can be referred to as the data format.

    NPU compute-intensive operators use the NHWC and NLC data layout formats. If the input tensor of such an operator is in the NCHW or NCL data layout format, the compiler will insert a transpose node before the operator to convert the data format.

  • Parameter Description

    When the data format of the original model's input tensor is NCHW or NCL, this parameter determines whether, during NPU model inference, the input tensor keeps the original PyTorch data layout or is supplied already converted to the target layout (NCHW -> NHWC or NCL -> NLC).

  • Parameter Format

    [input_index, ...]

    input_index —— Index of the input nodes that need data layout format conversion
    

    Notes

    When enabling data layout format conversion for a specific input tensor, it is necessary to adjust the shape parameter corresponding to the input in the INPUT_OPS parameter. See the usage example for details.

  • Usage Example

    import torch
    import torch.nn as nn
    
    class Model(torch.nn.Module):
        def __init__(self):
            super(Model, self).__init__()
            self.gru  = torch.nn.GRU(128, 128, batch_first=True, bias=True)
            self.conv = torch.nn.Conv1d(128, 128, 1, 1)
    
        def forward(self, x, h):
            x = self.conv(x)
            x = x.permute(0, 2, 1)
            z = self.gru(x, h)
    
            return z
    
    model        = Model()
    batch        = 1
    seq_length   = 32
    channel      = 128
    input_tensor = torch.randn([batch, channel, seq_length])
    input_state  = torch.randn([1, batch, channel])
    
    output_tensor, output_state = model(input_tensor, input_state)
    

    In the above model, during PyTorch inference, the input_state data has a non-NCL data layout format, so data layout format conversion is not required.

    The input_tensor data has an NCL data layout format, so it can choose whether to enable data layout format conversion before providing it to the NPU model.

    Below are descriptions of the parameters when the conversion is enabled and disabled:

    • Enabled parameter: input_tensor undergoes format conversion

      The original model input tensor input_tensor has a shape of [1, 128, 32]. After enabling the format conversion function, the actual shape required by the NPU model is [1, 32, 128].

      The model configuration file's INPUT_NCX_TO_NXC and INPUT_OPS parameters can be configured as follows:

      config.yaml
      INPUT_NCX_TO_NXC: [0]
      
      INPUT_OPS:
          0: [1, 32, 128]
          1: [1, 1, 128]
      

      The model structure diagram is as follows:

    • Disabled parameter: input_tensor maintains the original data layout format

      The model configuration file's INPUT_NCX_TO_NXC and INPUT_OPS parameters can be configured as follows:

      config.yaml
      INPUT_NCX_TO_NXC: []
      
      INPUT_OPS:
          0: [1, 128, 32]
          1: [1, 1, 128]
      

      The model structure diagram (INPUT_NCX_TO_NXC disabled) is as follows:

    By comparing the two NPU model structure diagrams above, it can be seen that enabling the format conversion function can optimize the inserted transpose nodes on the input side.

    This parameter provides more flexible configuration of the data format of input tensors based on user requirements.
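
    Note that when the conversion is enabled, the data fed to the NPU model at inference time must already be arranged in the converted layout. A minimal sketch of the host-side rearrangement for the example above (the variable names are illustrative):

    # Original PyTorch input in NCL layout: [1, 128, 32]
    input_tensor = torch.randn([1, 128, 32])

    # With INPUT_NCX_TO_NXC: [0], the NPU model expects NLC layout: [1, 32, 128]
    input_nlc = input_tensor.permute(0, 2, 1).contiguous()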

FP16_OUT_OPS

  • Related Overview

    NPU performs internal computations using data in the FP16 format, and both input and output tensors are in FP16 format.

    NPU supports the FP16_TO_FP32 format conversion function but does not support the FP32_TO_FP16 format conversion function.

    In the processing flow of recurrent neural networks, at each time step, the network receives input for the current time and the hidden state from the previous time step. With this information, it generates the hidden state for the current time and the corresponding prediction result.

    In practical applications, the output tensor corresponding to the prediction result is converted to FP32 format first and then processed for prediction. The output tensor corresponding to the hidden state is directly used as input for the next frame in FP16 format.

    Users need to divide the output nodes of the model into FP32 and FP16 according to the specific model structure. Users can specify which output nodes are in FP16 format using the FP16_OUT_OPS parameter.

  • Parameter Values

    [out_state_index, ...]

    out_state_index —— Index of the output nodes corresponding to the hidden state of the model
    
  • Usage Example

    PyTorch model with multiple outputs:

    import torch
    import torch.nn as nn
    
    class TestModel(torch.nn.Module):
        def __init__(self):
            super(TestModel, self).__init__()
            self.gru  = nn.GRU(32, 32, batch_first=True)
            self.conv = nn.Conv1d(32, 32, 1, 1)
    
        def forward(self, x, h):
            t = torch.split(x, [1, 2, 3, 4], dim=1)
            y = self.gru(x, h)
            z = torch.sigmoid(y[0])
            z = z.permute(0, 2, 1)
            o = self.conv(z)
    
            return t, y, o
    
    net          = TestModel()
    input_tensor = torch.randn([1, 10, 32])
    state_tensor = torch.randn([1, 1, 32])
    
    output_t, output_y, output_o = net(input_tensor, state_tensor)
    

    From the forward method, it can be seen that the model defines three sets of outputs t, y, and o, where outputs t and y are tensor lists, and output o is a tensor.

    The NPU compiler will expand tensor list outputs into tensor outputs. For the above model, the NPU compiler will consider that there are 7 outputs: t[0], t[1], t[2], t[3], y[0], y[1], o.

    The model output y[1] is the output node corresponding to the hidden state of the gru module, which needs to be used as the input to the gru module for the next frame of inference. This output should not undergo FP16 -> FP32 conversion.

    The FP16_OUT_OPS parameter in the model configuration file can be configured as follows:

    config.yaml
    FP16_OUT_OPS: [5]
    

3. Compile Model*

3.1 Model File Preparation*

The NPU compiler strictly enforces model file formats, and different frameworks need to export them in specific ways.

TensorFlow

  • Prepare CKPT and PB files generated by TensorFlow or model files generated in saved_model format.
  • Use the freeze_graph.py script provided by TensorFlow to generate a FROZEN_PB file.
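
The freeze_graph.py script is the standard route; as an equivalent, freezing can also be done directly in Python. A minimal sketch (TensorFlow 1.x; checkpoint paths and output node names are illustrative, reusing the node names from the FSMN example in Section 2.1):

    import tensorflow as tf
    from tensorflow.python.framework import graph_util

    with tf.Session() as sess:
        # Restore the trained graph and variables from the checkpoint
        saver = tf.train.import_meta_graph("model.ckpt.meta")
        saver.restore(sess, "model.ckpt")

        # Convert variables to constants, keeping only the listed output nodes
        frozen = graph_util.convert_variables_to_constants(
            sess, sess.graph_def,
            ["State_c0_out", "State_c1_out", "State_c2_out", "Result"])

        # Write the FrozenPB file consumed by gxnpuc
        tf.train.write_graph(frozen, ".", "model.pb", as_text=False)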

PyTorch

  • After training the model, export the model weight file.
  • Build a PyTorch inference model script and generate a PyTorch Module instance.
  • Convert the custom PyTorch Module to Torch ScriptModule and serialize the ScriptModule instance for the compiler.

    For specific steps, refer to the PyTorch Model Conversion Example
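
A minimal sketch of these steps, assuming the model is exported by tracing; the weight file name and the Net class (from the INPUT_OPS example in Section 2.2) are illustrative:

    import torch

    # Build the inference Module and load the trained weights
    net = Net()
    net.load_state_dict(torch.load("model_weights.pth"))
    net.eval()

    # Convert the Module to a ScriptModule by tracing with example inputs
    example_inputs = (torch.randn([1, 3, 32, 32]), torch.randn([1, 1, 32, 32]))
    script_module  = torch.jit.trace(net, example_inputs)

    # Serialize the ScriptModule; this file is referenced by MODEL_FILE
    script_module.save("model.pt")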

3.2 Write Configuration File*

  • Write a YAML configuration file, including the model file name, output file name, output file type, compression status, input node names and dimensions, output node names, etc.

3.3 Compile and Generate Model File*

Compile the model using the following command:

$ gxnpuc config.yaml

Note

When the NPU toolchain compiles model files for different deep learning frameworks, the corresponding framework's runtime environment must be installed. Refer to the NPU Model Format Specification for information on the generated model file format.

4. Explanation of Some Ops*

4.1 Softmax*

NPU cannot directly support Softmax, but under certain conditions, you can modify the model to make NPU support Softmax computation.

The conditions are:

  • The input tensor of softmax must be 2-dimensional with a batch size of 1.

The Softmax function in the model needs to be replaced with the following function:

TensorFlow

def factorize(n):
    for i in range(1, 16):
        if n % i == 0 and n // i <= 15:
            return (n // i, i)
    return ()

def split_and_factorize(n):
    result = []
    while not factorize(n):
        for i in range(n-1, 0, -1):
            if factorize(i):
                result.append(i)
                n -= i
                break
    result.append(n)
    return result

def npu_softmax(x, name=None):
    """ NPU Softmax
    Args:
      x: A non-empty `Tensor`.
      name: A name for the operation (optional).
    Returns
      A `Tensor`.
    """ 
    # x' = x - max(x)
    # y = exp(x') / sum(exp(x'))
    assert len(x.shape) == 2 and x.shape[0] == 1
    partitions = split_and_factorize(x.shape[1])
    if len(partitions) == 1:
        a, b = factorize(partitions[0])
        pool_shape = [1, a, b, 1]
        x = tf.reshape(x, pool_shape)
        max_ = tf.nn.max_pool(x, ksize=pool_shape, strides=pool_shape, padding='VALID')
    else:
        cnt = 0
        tmp_max_list = []
        for p in partitions:
            a, b = factorize(p)
            pool_shape = [1, a, b, 1]
            tmp_x = tf.reshape(x[:,cnt:cnt+p], pool_shape)
            tmp_max = tf.nn.max_pool(tmp_x, ksize=pool_shape, strides=pool_shape, padding='VALID')
            tmp_max_list.append(tmp_max)
            cnt += p
        tmp_max_len = len(tmp_max_list)
        assert tmp_max_len <= 15
        max_ = tf.concat(tmp_max_list, axis=2)
        max_ = tf.nn.max_pool(max_, ksize=[1,1,tmp_max_len,1], strides=[1,1,tmp_max_len,1], padding='VALID')
    x = tf.reshape(x, [-1, 1])
    x = tf.math.subtract(x, max_)
    exp_x = tf.exp(x)
    exp_x = tf.reshape(exp_x, [1, -1])
    return tf.math.divide(exp_x, tf.math.reduce_sum(exp_x, axis=-1), name=name)

Important

During training, use TensorFlow's softmax, and before exporting CKPT, replace it with npu_softmax.

PyTorch

def factorize(n):
    for i in range(1, 16):
        if n % i == 0 and n // i <= 15:
            return (n // i, i)
    return ()

def split_and_factorize(n):
    result = []
    while not factorize(n):
        for i in range(n-1, 0, -1):
            if factorize(i):
                result.append(i)
                n -= i
                break
    result.append(n)
    return result

def npu_softmax(x, partitions):
    """ NPU Softmax
    Args:
      x: A non-empty `Tensor`. limitation: len(x.shape) == 2 and x.shape[0] == 1
      partitions: A list of integers specifying the partition sizes for the softmax operation.
    Returns
      A `Tensor` representing the real softmax output.
    """ 

    if len(partitions) == 1:
        a, b       = factorize(partitions[0])
        pool_shape = [1, 1, a, b]
        x_   = x.reshape(pool_shape)
        max_ = torch.nn.functional.max_pool2d(x_, (a,b), (a,b))
    else:
        cnt = 0 
        tmp_max_list = []

        for p in partitions:
            a, b = factorize(p)
            pool_shape = [1, 1, a, b]

            x_      = x[:,cnt:cnt+p]
            tmp_x   = x_.reshape(pool_shape)
            tmp_max = torch.nn.functional.max_pool2d(tmp_x, (a,b), (a,b))
            tmp_max_list.append(tmp_max)

            cnt += p

        tmp_max_len = len(tmp_max_list)

        max_ = torch.concat(tmp_max_list, axis=3)
        max_ = torch.nn.functional.max_pool2d(max_, (1,tmp_max_len), (1,tmp_max_len))

    max_  = max_.squeeze(0).squeeze(0)
    x     = x - max_
    exp_x = torch.exp(x)
    sum_x = torch.sum(exp_x, dim=-1)

    return exp_x / sum_x
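
A minimal usage sketch: unlike the TensorFlow version, the PyTorch npu_softmax takes the partition list as an argument, so it can be computed once with split_and_factorize in the inference script (the input shape is illustrative; npu_log_softmax in Section 4.2 is used the same way):

import torch

x          = torch.randn([1, 227])
partitions = split_and_factorize(x.shape[1])   # 227 -> [225, 2]
y          = npu_softmax(x, partitions)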

Important

During training, use PyTorch's softmax, and when building the inference script, replace it with npu_softmax.

4.2 LogSoftmax*

NPU cannot directly support LogSoftmax, but under certain conditions, you can modify the model to make NPU support LogSoftmax computation.

The conditions are:

  • The input tensor of log_softmax must be 2-dimensional with a batch size of 1.

The LogSoftmax function in the model needs to be replaced with the following function:

TensorFlow

def factorize(n):
    for i in range(1, 16):
        if n % i == 0 and n // i <= 15:
            return (n // i, i)
    return ()

def split_and_factorize(n):
    result = []
    while not factorize(n):
        for i in range(n-1, 0, -1):
            if factorize(i):
                result.append(i)
                n -= i
                break
    result.append(n)
    return result

def npu_log_softmax(x, name=None):
    """ NPU LogSoftmax
    Args:
      x: A non-empty `Tensor`.
      name: A name for the operation (optional).
    Returns
      A `Tensor`.
    """
    # x' = x - max(x)
    # y  = x' - log(sum(exp(x')))
    assert len(x.shape) == 2 and x.shape[0] == 1
    partitions = split_and_factorize(x.shape[1])
    if len(partitions) == 1:
        a, b = factorize(partitions[0])
        pool_shape = [1, a, b, 1]
        x = tf.reshape(x, pool_shape)
        max_ = tf.nn.max_pool(x, ksize=pool_shape, strides=pool_shape, padding='VALID')
    else:
        cnt = 0
        tmp_max_list = []
        for p in partitions:
            a, b = factorize(p)
            pool_shape = [1, a, b, 1]
            tmp_x = tf.reshape(x[:,cnt:cnt+p], pool_shape)
            tmp_max = tf.nn.max_pool(tmp_x, ksize=pool_shape, strides=pool_shape, padding='VALID')
            tmp_max_list.append(tmp_max)
            cnt += p
        tmp_max_len = len(tmp_max_list)
        assert tmp_max_len <= 15
        max_ = tf.concat(tmp_max_list, axis=2)
        max_ = tf.nn.max_pool(max_, ksize=[1,1,tmp_max_len,1], strides=[1,1,tmp_max_len,1], padding='VALID')
    x = tf.reshape(x, [-1, 1])
    x = tf.math.subtract(x, max_)

    x = tf.reshape(x, [1, -1])

    exp_x       = tf.exp(x)
    exp_sum     = tf.math.reduce_sum(exp_x, axis=-1)
    exp_sum_log = tf.log(exp_sum)

    return tf.math.subtract(x, exp_sum_log, name=name)

Important

During training, use TensorFlow's log_softmax, and before exporting CKPT, replace it with npu_log_softmax.

PyTorch

def factorize(n):
    for i in range(1, 16):
        if n % i == 0 and n // i <= 15:
            return (n // i, i)
    return ()

def split_and_factorize(n):
    result = []
    while not factorize(n):
        for i in range(n-1, 0, -1):
            if factorize(i):
                result.append(i)
                n -= i
                break
    result.append(n)
    return result

def npu_log_softmax(x, partitions):
    """ NPU LogSoftmax
    Args:
      x: A non-empty `Tensor`. limitation: len(x.shape) == 2 and x.shape[0] == 1
      partitions: A list of integers specifying the partition sizes for the log_softmax operation.
    Returns
      A `Tensor` representing the real log_softmax output.
    """

    if len(partitions) == 1:
        a, b       = factorize(partitions[0])
        pool_shape = [1, 1, a, b]
        x_   = x.reshape(pool_shape)
        max_ = torch.nn.functional.max_pool2d(x_, (a,b), (a,b))
    else:
        cnt = 0
        tmp_max_list = []

        for p in partitions:
            a, b = factorize(p)
            pool_shape = [1, 1, a, b]

            x_      = x[:,cnt:cnt+p]
            tmp_x   = x_.reshape(pool_shape)
            tmp_max = torch.nn.functional.max_pool2d(tmp_x, (a,b), (a,b))
            tmp_max_list.append(tmp_max)

            cnt += p

        tmp_max_len = len(tmp_max_list)

        max_ = torch.concat(tmp_max_list, axis=3)
        max_ = torch.nn.functional.max_pool2d(max_, (1,tmp_max_len), (1,tmp_max_len))

    max_  = max_.squeeze(0).squeeze(0)
    x     = x - max_

    exp_x       = torch.exp(x)
    exp_sum     = torch.sum(exp_x, dim=-1)
    exp_sum_log = torch.log(exp_sum)

    return x - exp_sum_log

Important

During training, use PyTorch's log_softmax, and when building the inference script, replace it with npu_log_softmax.

4.3 BatchNorm*

You can set the FUSE_BN configuration item to choose whether to merge BatchNorm parameters into the convolution.

Note

Merging BatchNorm parameters into the convolution may lead to a significant drop in accuracy if convolution weight compression is enabled.
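
As an illustration, below is a sketch of only the relevant configuration items when BN fusion is enabled while convolution weight compression stays disabled (values are illustrative; other items are omitted):

    config.yaml
    FUSE_BN: true
    CONV2D_COMPRESS: false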