Compute and memory requirements can be lowered by using reduced-precision representations and arithmetic. On many GPUs, half-precision (FP16) arithmetic throughput can be much higher than single-precision (FP32) throughput.
In mixed-precision training, a master copy of the weights is stored in FP32. Loss scaling is used to prevent small gradient values from becoming zero in FP16, and FP32 is used selectively for specific operations (e.g., vector dot products, reductions, pointwise operations).
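A minimal sketch of one mixed-precision training step, using PyTorch's automatic mixed precision utilities; the model, data shapes, and hyperparameters below are illustrative placeholders rather than part of the technique itself.

<syntaxhighlight lang="python">
# Sketch of a mixed-precision training loop with PyTorch AMP.
# Model, data, and hyperparameters are placeholders for illustration.
import torch
from torch import nn

model = nn.Linear(1024, 10).cuda()           # parameters kept as an FP32 master copy
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()         # performs loss scaling automatically

inputs = torch.randn(32, 1024, device="cuda")
targets = torch.randint(0, 10, (32,), device="cuda")

for _ in range(100):
    optimizer.zero_grad()
    # Autocast runs selected operations in FP16 while keeping
    # precision-sensitive operations (e.g., reductions) in FP32.
    with torch.cuda.amp.autocast():
        loss = nn.functional.cross_entropy(model(inputs), targets)
    # Scale the loss so that small gradient values do not flush to zero in FP16.
    scaler.scale(loss).backward()
    scaler.step(optimizer)   # unscales gradients, then updates the FP32 weights
    scaler.update()          # adjusts the scale factor for the next iteration
</syntaxhighlight>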
Mixed-precision training has been shown to work for a variety of architectures, including [[convolutional neural network]]s, [[recurrent neural network]]s, and [[transformer]]s.