In this work, we present an approach to neural network quantization that uses a second-order Taylor approximation of the loss function to predict quantization error. Specifically, we estimate the Hessian of the loss through second-order directional derivatives and use the resulting error model to cast bit allocation as a linear programming problem. Standard solvers can then find the optimal bit assignment for each layer or group of layers. Unlike previous approaches that rely on heuristics, our method computes the Hessian terms directly and accounts for both inter-layer and intra-layer relationships. Because it relies on second-order directional derivatives, the computation is feasible within minutes on typical machine learning GPUs. This enables effective mixed-precision quantization of weights at bit widths from 2 to 8. While the approach shows promising capabilities, further work remains to fully explore its potential.
Adrián Gras López (Wed,) studied this question.
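The central quantity in the abstract, a second-order Taylor prediction of the loss increase caused by quantization, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the toy quadratic loss (`toy_loss`, `W_STAR`, `A`), the uniform symmetric quantizer (`quantize`), the finite-difference estimator, and the premise that the gradient is approximately zero at the trained weights, so that the predicted increase reduces to 0.5 * d^T H d for a quantization perturbation d. The second-order directional derivative d^T H d is estimated without ever forming the Hessian H.

```python
# Toy illustration only: the loss, weights, and quantizer are hypothetical
# stand-ins, not the model or quantizer used in the paper.

W_STAR = [0.31, -0.72, 0.55]   # "trained" weights, the minimum of the toy loss
A = [[2.0, 0.5, 0.0],          # symmetric curvature matrix; the off-diagonal
     [0.5, 1.0, 0.3],          # entries couple the weights to each other
     [0.0, 0.3, 1.5]]


def toy_loss(w):
    """Quadratic loss 0.5 * (w - W_STAR)^T A (w - W_STAR)."""
    d = [wi - si for wi, si in zip(w, W_STAR)]
    n = len(d)
    return 0.5 * sum(A[i][j] * d[i] * d[j] for i in range(n) for j in range(n))


def quantize(w, bits, w_max=1.0):
    """Uniform symmetric quantizer (assumed): snap to a grid on [-w_max, w_max]."""
    levels = 2 ** (bits - 1) - 1          # number of positive grid steps
    step = w_max / levels
    return [max(-levels, min(levels, round(wi / step))) * step for wi in w]


def second_directional_derivative(f, w, v, eps=1e-2):
    """Central finite-difference estimate of v^T H v at w, without forming H."""
    plus = [wi + eps * vi for wi, vi in zip(w, v)]
    minus = [wi - eps * vi for wi, vi in zip(w, v)]
    return (f(plus) - 2.0 * f(w) + f(minus)) / eps ** 2


def predicted_loss_increase(f, w, w_q):
    """Second-order Taylor prediction, assuming the gradient at w is ~0."""
    delta = [qi - wi for qi, wi in zip(w_q, w)]
    return 0.5 * second_directional_derivative(f, w, delta)


if __name__ == "__main__":
    for bits in (2, 4, 8):
        w_q = quantize(W_STAR, bits)
        predicted = predicted_loss_increase(toy_loss, W_STAR, w_q)
        actual = toy_loss(w_q) - toy_loss(W_STAR)
        print(f"{bits} bits: predicted {predicted:.6f}, actual {actual:.6f}")
```

Because the toy loss is exactly quadratic, the prediction and the measured increase agree up to floating-point error; for a real network the Taylor expansion is only an approximation, and automatic differentiation would normally replace the finite-difference estimate. Per-layer predictions of this kind are the sort of cost that a bit-allocation solver could then trade off against a size budget.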