Numeric Representation (Quantization)
PUMA does not handle floating-point numbers.
Most accelerators for NNs work only with fixed point, dynamic fixed point, or proprietary variations (like Intel Flexpoint). Several studies have shown that NNs can tolerate the loss of precision caused by quantization, provided some care is taken.
For simple models, like MLPs and small RNNs, the simplest quantization, converting float directly to fixed point, barely affects accuracy. Complex models, like CNNs, need some calibration during the process (a rich topic of study).
Today PUMA works specifically with 16-bit fixed-point numbers in sign-magnitude format, and it assumes a Q1.15 representation.
This means:
1 integer bit, which is actually the sign bit, plus 15 fractional bits.
So it can represent numbers in the open interval (-1, +1), in steps of 2^-15.
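The bit layout described above can be illustrated with a minimal sketch. This is not code from the PUMA simulator: the function names `encode_q1_15` and `decode_q1_15` are hypothetical, and the rounding mode and the clamping of values that round up to 1.0 are assumptions made for the example.

```python
FRAC_BITS = 15              # fractional bits in Q1.15
SCALE = 1 << FRAC_BITS      # 32768, the weight of the (absent) integer bit
MAX_MAG = SCALE - 1         # largest 15-bit magnitude, 32767


def encode_q1_15(x: float) -> int:
    """Encode a float in (-1, +1) as a 16-bit sign-magnitude word (assumed layout)."""
    if not -1.0 < x < 1.0:
        raise ValueError(f"{x} is outside the open interval (-1, +1)")
    sign = 1 if x < 0 else 0
    # Assumption: round to nearest, then clamp values that would round up to 1.0.
    magnitude = min(round(abs(x) * SCALE), MAX_MAG)
    return (sign << FRAC_BITS) | magnitude


def decode_q1_15(word: int) -> float:
    """Decode a 16-bit sign-magnitude word back to a float."""
    sign = (word >> FRAC_BITS) & 1
    magnitude = word & MAX_MAG
    value = magnitude / SCALE
    return -value if sign else value
```

Under these assumptions, `encode_q1_15(0.5)` gives `0x4000` and `encode_q1_15(-0.5)` gives `0xC000`: the same magnitude bits with the top (sign) bit set. The largest representable magnitude is 32767/32768, which is why both endpoints of the interval are excluded.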
In summary, the simulator expects to receive values in the interval (-1, +1). It quantizes them (converts to fixed point), performs the computations, and at the end dequantizes the memory contents (converts back to floating point) and dumps the output.txt file.
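The quantize, compute, dequantize flow can be sketched as follows. This is only an illustration under stated assumptions, not the simulator's implementation: the helper names `quantize`, `dequantize`, and `fixed_dot` are hypothetical, values are held as signed Python integers rather than 16-bit words, the dot product stands in for whatever computation the simulator runs, and writing output.txt is not modeled.

```python
SCALE = 1 << 15        # Q1.15 scale factor (2**15)
MAX_Q = SCALE - 1      # largest representable magnitude, in steps of 2**-15


def quantize(values):
    """Convert floats in (-1, +1) to signed integer counts of 2**-15 steps."""
    out = []
    for v in values:
        if not -1.0 < v < 1.0:
            raise ValueError(f"{v} is outside the open interval (-1, +1)")
        q = round(v * SCALE)                    # assumption: round to nearest
        out.append(max(-MAX_Q, min(MAX_Q, q)))  # clamp to the representable range
    return out


def dequantize(values):
    """Convert integer step counts back to floats."""
    return [q / SCALE for q in values]


def fixed_dot(a, b):
    """Dot product of two quantized vectors, rescaled back to Q1.15."""
    acc = sum(x * y for x, y in zip(a, b))  # each product carries 30 fractional bits
    q = acc >> 15                           # drop 15 bits (rounds toward -infinity)
    return max(-MAX_Q, min(MAX_Q, q))       # assumption: saturate on overflow


inputs = quantize([0.5, -0.25, 0.125])
weights = quantize([0.5, 0.5, -0.5])
result = dequantize([fixed_dot(inputs, weights)])
print(result)  # [0.0625]
```

All values in this example are exact multiples of 2^-15, so the fixed-point result matches the floating-point one exactly. For arbitrary inputs the result would differ by a small rounding error, and any input outside (-1, +1) would have to be rescaled before quantization.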