# NPU Quantization Accuracy Debugging
To reduce the storage size of model weights, the NPU compiler can be configured to quantize model weights to int8. At runtime, the NPU dequantizes the int8 weight data back to fp16 for computation. Because quantization can degrade model accuracy, the compiler provides several parameters and configuration options for debugging precision issues.
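To make the mechanism concrete, the following minimal sketch (Python/NumPy; the min/max scale-and-zero-point formulation is an assumption used for illustration, not necessarily the compiler's exact scheme) quantizes an fp32 weight tensor to int8 and dequantizes it back to fp16 the way the runtime would, so the round-trip error introduced by quantization can be observed:

```python
import numpy as np

def quantize_int8(w, w_min=None, w_max=None):
    """Asymmetric min/max int8 quantization (illustrative only)."""
    w_min = float(np.min(w)) if w_min is None else w_min
    w_max = float(np.max(w)) if w_max is None else w_max
    scale = (w_max - w_min) / 255.0            # step size: full range spread over 256 int8 codes
    zero_point = np.round(-128.0 - w_min / scale)
    q = np.clip(np.round(w / scale + zero_point), -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize_fp16(q, scale, zero_point):
    """Dequantize int8 codes back to fp16, as the NPU runtime does before computing."""
    return ((q.astype(np.float32) - zero_point) * scale).astype(np.float16)

# A toy weight tensor; real weights come from the TensorFlow model.
w = (np.random.randn(208, 120) * 0.05).astype(np.float32)
q, scale, zp = quantize_int8(w)
w_hat = dequantize_fp16(q, scale, zp)
print("max round-trip error:", np.max(np.abs(w - w_hat.astype(np.float32))))
```

The wider the min/max range is relative to where most of the values actually lie, the larger the quantization step and the larger the round-trip error, which is exactly what the options and parameters below help diagnose.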
## Quantization-Related Configuration Options

### Configuration Options
Configuration Option | Values | Description |
---|---|---|
`COMPRESS` | true / false | Enable or disable quantization (compression) of fully connected weights |
`CONV2D_COMPRESS` | true / false | Enable or disable quantization (compression) of convolution weights (default: false) |
`FUSE_BN` | true / false | Fuse BN parameters into the convolution weights (default: false) (1.5.2rc6 and above) |
`EXCLUDE_COMPRESS_OPS` | [op_name, ...] | Specify weights that should not be quantized (1.5.6 and above) |
`WEIGHT_MIN_MAX` | op_name: [min, max] ... | Specify the minimum and maximum values used when quantizing the given weight (1.5.7rc0 and above) |
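For reference, these options are written into the compiler's config.yaml together with the rest of the model configuration. The snippet below is only an illustrative sketch: the exact placement and the surrounding keys depend on your compiler version and project setup, and the weight name is the one used in the example later in this document.

```yaml
COMPRESS: true              # quantize fully connected weights
CONV2D_COMPRESS: false      # leave convolution weights unquantized
FUSE_BN: true               # fold BN parameters into convolution weights (1.5.2rc6+)
EXCLUDE_COMPRESS_OPS: [tdnn1.affine/conv1d/ExpandDims_1/_7__cf__7]   # do not quantize this weight (1.5.6+)
WEIGHT_MIN_MAX:             # clamp the quantization range of a weight (1.5.7rc0+)
    tdnn1.affine/conv1d/ExpandDims_1/_7__cf__7: [-0.06, 0.06]
```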
### Parameters
Parameter | Description |
---|---|
`-w` / `--weight` | Print the names of quantized weights during compilation |
`-s` / `--save` | Save weight distribution histograms to the `npu_jpgs` directory |
## Example of Quantization Precision Debugging

### Scenario 1
The user sets the configuration options `COMPRESS` and `CONV2D_COMPRESS` to `true`, and the measured model accuracy differs significantly from TensorFlow.
Debugging steps:
1. Set both `COMPRESS` and `CONV2D_COMPRESS` to `false`, test the model, and verify that the accuracy is roughly consistent with TensorFlow. This suggests that the accuracy loss is caused by quantization.
2. Set `COMPRESS` to `true` and `CONV2D_COMPRESS` to `false`, and test the model again. If the results are still consistent with TensorFlow, this suggests that the problem is caused by convolution weight quantization.
3. Set `CONV2D_COMPRESS` to `true` and `FUSE_BN` to `false`, so that BN parameters are not fused into the convolution weights. If the test results still show significant differences, continue debugging the convolution weight quantization issue.
4. With `CONV2D_COMPRESS` set to `true`, compile with the `-w` parameter to print the names of the quantized weights, and with the `-s` parameter to output the weight distribution histograms:

    ```
    $ gxnpuc -w config.yaml
    tdnn1.affine/conv1d/ExpandDims_1/_7__cf__7: [1, 3, 40, 208]
    prefinal-l/conv1d/ExpandDims_1/_5__cf__5: [1, 1, 208, 120]
    ...
    $ gxnpuc -s config.yaml
    ```
    Check the histograms in the `npu_jpgs` directory. If the histogram of the weight `prefinal-l/conv1d/ExpandDims_1/_5__cf__5` does not show a long-tail distribution (as shown above), quantization has limited impact on its accuracy. If the histogram of the weight `tdnn1.affine/conv1d/ExpandDims_1/_7__cf__7` shows a long-tail distribution (as shown above), it is not well suited to quantization; try excluding it from quantization. To do so, add the following configuration option:

    ```yaml
    EXCLUDE_COMPRESS_OPS: [tdnn1.affine/conv1d/ExpandDims_1/_7__cf__7]
    ```

    Recompile and test. If the results are roughly consistent with TensorFlow, the debugging is complete.
5. Based on the weight histogram, the values of this weight are mainly distributed between `-0.06` and `0.06`, but the quantization range is `-0.28` to `0.37`. Spreading the 256 int8 steps over this much wider range leaves too little resolution where most of the values actually lie, so many of them are quantized with insufficient precision. Set the minimum and maximum quantization values for this weight to `-0.06` and `0.06`, respectively, and saturate values outside this range (a minimal sketch of this kind of range analysis is given after these steps). Add the following configuration option:

    ```yaml
    WEIGHT_MIN_MAX:
        tdnn1.affine/conv1d/ExpandDims_1/_7__cf__7: [-0.06, 0.06]
    ```
    Recompile. Now the histogram of this weight looks like the one shown above. After recompiling and testing the model, the results are close to TensorFlow, and the debugging process is complete.
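The range analysis used in steps 4 and 5 can also be reproduced offline on the weight values themselves. The following Python sketch is illustrative only: the quantization formula is an assumption rather than the compiler's internal scheme, and the weight array is synthesized to resemble the long-tailed distribution described above instead of being read from the actual model. It checks for a long tail by comparing percentiles with the full min/max range, then compares the int8 round-trip error when quantizing over the full range versus the clipped range `[-0.06, 0.06]`.

```python
import numpy as np

def int8_roundtrip_error(w, w_min, w_max):
    """Quantize w to int8 over [w_min, w_max] (saturating outliers), dequantize, return mean abs error."""
    scale = (w_max - w_min) / 255.0
    zero_point = np.round(-128.0 - w_min / scale)
    q = np.clip(np.round(w / scale + zero_point), -128, 127)
    w_hat = (q - zero_point) * scale
    return float(np.mean(np.abs(w - w_hat)))

# Synthetic stand-in for the long-tailed weight described above: most values lie in
# [-0.06, 0.06], with a few outliers stretching the range to roughly [-0.28, 0.37].
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=24960)          # 24960 = 1 * 3 * 40 * 208
w[:20] = rng.uniform(-0.28, 0.37, size=20)     # outliers forming the long tail

# Long-tail check: the bulk of the values (1st-99th percentiles) covers a much
# narrower range than the absolute min/max.
p1, p99 = np.percentile(w, [1, 99])
print("full range:   [%.3f, %.3f]" % (w.min(), w.max()))
print("1%%-99%% range: [%.3f, %.3f]" % (p1, p99))

# Compare the round-trip error when quantizing over the full range vs the clipped
# range used for WEIGHT_MIN_MAX; outliers saturate at the clipped limits.
print("mean abs error, full range   :", int8_roundtrip_error(w, w.min(), w.max()))
print("mean abs error, [-0.06, 0.06]:", int8_roundtrip_error(w, -0.06, 0.06))
```

If the error over the clipped range is clearly smaller, the clipped bounds are good candidates for `WEIGHT_MIN_MAX`; if the error stays large even after clipping, excluding the weight via `EXCLUDE_COMPRESS_OPS` is the safer choice.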