March 3, 2026 | Open Access

Mixed precision neural networks with second-order Taylor for the bit assignment

Key Points

The approach finds the optimal bit assignment for each layer or group of layers, so neural networks can be quantized more efficiently.
The method predicts quantization error using a second-order Taylor approximation of the loss function.
Computing the Hessian terms with second-order directional derivatives lets the bit assignment be posed as a linear programming problem and solved with standard solvers.
The approach shows promising capabilities, but further work is needed to explore its full potential.
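
The second and third points can be written out as a short worked formula. This is a sketch in notation assumed here for illustration and not taken from the study: L is the loss, w the trained weights, Q_b(w) the weights quantized to b bits, and H the Hessian of L at w.

```latex
% Assumed notation: \Delta w = Q_b(w) - w is the perturbation caused by quantization.
\begin{align}
L(w + \Delta w) - L(w)
  &\approx \nabla L(w)^{\top} \Delta w
   + \tfrac{1}{2}\, \Delta w^{\top} H \, \Delta w, \\
\Delta w^{\top} H \, \Delta w
  &= \left. \frac{d^{2}}{d\varepsilon^{2}}\,
     L(w + \varepsilon\, \Delta w) \right|_{\varepsilon = 0}.
\end{align}
```

The second identity shows why a second-order directional derivative along the quantization perturbation is enough to evaluate the quadratic term without building H explicitly. If the network is assumed to sit near a minimum of the loss, the gradient term is small and the quadratic term dominates the predicted error.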

Abstract

In this work, we present an approach to neural network quantization that uses a second-order Taylor approximation of the loss function to predict quantization error. Specifically, we calculate the Hessian of the cost function using second-order directional derivatives, which lets us model the bit assignment as a linear programming problem. We can then solve it with standard solvers to find the optimal bit assignment for each layer or group of layers. Unlike previous approaches that rely on heuristics, our method accurately computes the Hessian, considering both inter-layer and intra-layer relationships. For efficiency, the computation relies on second-order directional derivatives, which makes it feasible on typical machine learning GPUs within minutes. This enables effective mixed-precision quantization with weight bit widths ranging from 2 to 8 bits. While our approach demonstrates promising capabilities, further work remains to fully explore its potential.
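
The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Everything in it is an assumption made for illustration: the two-"layer" toy quadratic loss, the uniform symmetric quantizer, the finite-difference estimate of the directional derivative, and the names `TARGET`, `quantize`, `loss`, `second_directional_derivative`, `predicted_error`, and `best_assignment`. A brute-force search over bit widths stands in for the linear programming solver that the abstract describes.

```python
import itertools

# Hypothetical trained weights of two small "layers"; the toy loss is minimal here.
TARGET = [[0.9, -0.35, 0.12], [0.6, -0.48, 0.05, 0.33]]

def quantize(ws, bits):
    """Uniform symmetric quantization of a list of floats to `bits` bits (assumed scheme)."""
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in ws) / levels
    return [round(w / scale) * scale for w in ws]

def loss(layers):
    """Toy quadratic loss: intra-layer terms plus one term coupling the two layers."""
    a, b = layers
    intra = sum((x - t) ** 2 for x, t in zip(a, TARGET[0]))
    intra += 2.0 * sum((y - t) ** 2 for y, t in zip(b, TARGET[1]))
    inter = 0.5 * (a[1] - TARGET[0][1]) * (b[1] - TARGET[1][1])
    return intra + inter

def second_directional_derivative(f, w, d, eps=1e-3):
    """Central finite-difference estimate of d^T H d at w, without forming H."""
    plus = [[x + eps * dx for x, dx in zip(lw, ld)] for lw, ld in zip(w, d)]
    minus = [[x - eps * dx for x, dx in zip(lw, ld)] for lw, ld in zip(w, d)]
    return (f(plus) - 2.0 * f(w) + f(minus)) / eps ** 2

def predicted_error(w, bits_per_layer):
    """Second-order Taylor prediction of the loss increase; gradient assumed zero at w."""
    q = [quantize(lw, b) for lw, b in zip(w, bits_per_layer)]
    d = [[qx - x for qx, x in zip(lq, lw)] for lq, lw in zip(q, w)]
    return 0.5 * second_directional_derivative(loss, w, d), q

def best_assignment(w, choices, budget_bits):
    """Brute-force stand-in for the LP: lowest predicted error within a total bit budget."""
    best = None
    for bits in itertools.product(choices, repeat=len(w)):
        size = sum(b * len(lw) for b, lw in zip(bits, w))
        if size > budget_bits:
            continue
        err, _ = predicted_error(w, bits)
        if best is None or err < best[1]:
            best = (bits, err)
    return best

if __name__ == "__main__":
    w = [list(layer) for layer in TARGET]
    pred, q = predicted_error(w, (4, 4))
    print("predicted loss increase at 4/4 bits:", pred)
    print("measured loss increase at 4/4 bits: ", loss(q) - loss(w))
    print("best assignment within 32 bits:", best_assignment(w, (2, 4, 8), 32))
```

Because the toy loss is exactly quadratic and the weights sit at its minimum, the predicted and measured loss increases agree up to rounding here. On a real network the Taylor prediction is only an approximation. The joint perturbation across both layers is evaluated in a single directional derivative, so the coupling term between layers is included in the prediction.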
