NPU Quantization Accuracy Debugging

To reduce the storage size of model weights, the NPU compiler can be configured to quantize weights to int8. At runtime the NPU dequantizes the int8 weight data to fp16 for computation. Because quantization can cause a loss of model accuracy, the compiler provides several parameters and configuration options for debugging precision issues.
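
As a rough illustration of where the accuracy loss comes from, the sketch below quantizes a weight tensor to int8 with a simple min/max (asymmetric) scheme and dequantizes it back, then reports the error. The scale/zero-point arithmetic, the helper names, and the fake weight tensor are all assumptions for illustration; the NPU compiler's actual quantization scheme may differ.

    import numpy as np

    def quantize_int8(w):
        # Min/max (asymmetric) int8 quantization -- illustrative only.
        w_min, w_max = float(w.min()), float(w.max())
        scale = (w_max - w_min) / 255.0           # size of one int8 step
        zero_point = round(-w_min / scale) - 128  # maps w_min to about -128
        q = np.clip(np.round(w / scale + zero_point), -128, 127).astype(np.int8)
        return q, scale, zero_point

    def dequantize_to_fp16(q, scale, zero_point):
        # At runtime the NPU dequantizes int8 weights back to fp16 for compute.
        return ((q.astype(np.float32) - zero_point) * scale).astype(np.float16)

    w = (np.random.randn(3, 40, 208) * 0.05).astype(np.float32)  # fake weight tensor
    q, scale, zp = quantize_int8(w)
    w_hat = dequantize_to_fp16(q, scale, zp).astype(np.float32)
    print("one quantization step   :", scale)
    print("max abs quantization err:", np.abs(w - w_hat).max())

The maximum error is on the order of half a quantization step, which is why weights whose values span a wide range quantize worse than compact ones.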

Configuration Options

Configuration Option    Values                   Description
COMPRESS                true / false             Enable or disable fully-connected (FC) weight compression
CONV2D_COMPRESS         true / false             Enable or disable convolution weight compression (default: false)
FUSE_BN                 true / false             Fuse BN parameters into the convolution weights (default: false; 1.5.2rc6 and above)
EXCLUDE_COMPRESS_OPS    [op_name, ...]           Weights to exclude from quantization (1.5.6 and above)
WEIGHT_MIN_MAX          op_name: [min, max] ...  Minimum and maximum values used when quantizing the named weight (1.5.7rc0 and above)

Parameters

Parameter       Description
-w / --weight   Print the names of quantized weights during compilation
-s / --save     Save weight distribution histograms to the npu_jpgs directory

Example of Quantization Precision Debugging

Scenario 1

The user sets both COMPRESS and CONV2D_COMPRESS to true, and the tested model's accuracy differs significantly from the TensorFlow result.
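
Before changing any options, it helps to make "differs significantly" measurable by running the same input through both the TensorFlow model and the NPU and comparing the two outputs. The helper below is a hypothetical sketch; how tf_out and npu_out are obtained depends on your own test harness.

    import numpy as np

    def compare_with_tensorflow(tf_out, npu_out):
        # Compare the TensorFlow reference output with the NPU output
        # for the same input sample (both flattened to float arrays).
        tf_out = np.asarray(tf_out, dtype=np.float32).ravel()
        npu_out = np.asarray(npu_out, dtype=np.float32).ravel()
        max_abs_err = float(np.abs(tf_out - npu_out).max())
        cos_sim = float(np.dot(tf_out, npu_out) /
                        (np.linalg.norm(tf_out) * np.linalg.norm(npu_out) + 1e-12))
        return max_abs_err, cos_sim

A cosine similarity close to 1.0 with a small maximum error is what "roughly consistent with TensorFlow" means in the steps below.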

Debugging steps:

  1. Set both COMPRESS and CONV2D_COMPRESS to false, test the model, and verify that the accuracy is roughly consistent with TensorFlow. This confirms that the accuracy loss is caused by weight quantization.
  2. Set COMPRESS to true and CONV2D_COMPRESS to false, and test again. If the results are still consistent with TensorFlow, the fully-connected weights quantize well and the problem lies in convolution weight quantization.
  3. Set CONV2D_COMPRESS to true and FUSE_BN to false, so that BN parameters are not fused into the convolution weights. If the results still differ significantly, continue debugging the convolution weight quantization itself.
  4. With CONV2D_COMPRESS set to true, compile with the -w parameter to print the names of the quantized weights, and with the -s parameter to output their distribution histograms.

    $ gxnpuc -w config.yaml
    tdnn1.affine/conv1d/ExpandDims_1/_7__cf__7: [1, 3, 40, 208]
    prefinal-l/conv1d/ExpandDims_1/_5__cf__5: [1, 1, 208, 120]
    ...
    
    $ gxnpuc -s config.yaml
    

Check the histograms in the npu_jpgs directory. If the histogram of the weight prefinal-l/conv1d/ExpandDims_1/_5__cf__5 does not show a long-tail distribution, quantization has limited impact on accuracy for this weight.

If the histogram of the weight tdnn1.affine/conv1d/ExpandDims_1/_7__cf__7 shows a long-tail distribution, that weight is not well suited to quantization; try excluding it from quantization.
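
If you want a numeric check in addition to the generated JPGs, a simple heuristic is to compare the full min/max range of a weight against the range that covers most of its values (for example the 1st to 99th percentile). The function below is not part of the gxnpuc toolchain; its name, the percentile thresholds, and the synthetic data are assumptions, and w stands for one layer's weight array however you extract it.

    import numpy as np

    def long_tail_ratio(w, low_pct=1.0, high_pct=99.0):
        # Ratio of the full min/max range to the range covering the central
        # 98% of values. A ratio near 1 means a compact, quantization-friendly
        # histogram; a much larger ratio means a few outliers stretch the
        # quantization range, so most values share only a small fraction of
        # the 256 int8 levels (the long-tail case).
        w = np.asarray(w, dtype=np.float32).ravel()
        full_range = float(w.max() - w.min())
        core_range = float(np.percentile(w, high_pct) - np.percentile(w, low_pct))
        return full_range / (core_range + 1e-12)

    compact = np.random.randn(10000) * 0.02                   # compact weights
    tailed = np.concatenate([compact, [0.37, -0.28]])         # two outliers added
    print(long_tail_ratio(compact), long_tail_ratio(tailed))  # roughly 1.7 vs 7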

To exclude this weight from quantization, add the following configuration option:

    EXCLUDE_COMPRESS_OPS: [tdnn1.affine/conv1d/ExpandDims_1/_7__cf__7]

Recompile and test. If the results are roughly consistent with TensorFlow, the debugging is complete.

  5. Based on the weight histogram, the values of this layer are mainly distributed between -0.06 and 0.06, but the quantization range is -0.28 to 0.37, which leaves many values with too little quantization precision. Now set the minimum and maximum quantization values for this weight to -0.06 and 0.06 respectively, saturating values outside this range, by adding the following configuration option:

    WEIGHT_MIN_MAX:
        tdnn1.affine/conv1d/ExpandDims_1/_7__cf__7: [-0.06, 0.06]
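
The benefit of clipping can be read directly off the quantization step size. Assuming a simple linear mapping over 256 int8 levels, the arithmetic below compares one step over the original range (-0.28 to 0.37) with one step over the clipped range (-0.06 to 0.06); the trade-off is that the few values outside the clipped range are saturated.

    # One int8 step = (max - min) / 255
    step_full    = (0.37 - (-0.28)) / 255   # ~0.00255 over the original range
    step_clipped = (0.06 - (-0.06)) / 255   # ~0.00047 over the clipped range
    print(step_full, step_clipped, step_full / step_clipped)  # roughly 5x finer steps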

Recompile and check the regenerated histogram: the quantization range of this weight is now [-0.06, 0.06], with values outside it saturated to the boundaries.

After recompiling and testing the model, the results are now similar to TensorFlow, and the debugging process is concluded.