Knowledge distillation is the process by which a smaller student model learns from a larger teacher model. Methods include (a short sketch of each follows the list):
- supervised finetuning: train the student on examples generated or annotated by the teacher model
- divergence and similarity: reduce the divergence between the teacher's and student's output probability distributions, or increase the similarity between their hidden states
- reinforcement learning: first train a reward model, then optimize the student model against it with reinforcement learning
- rank similarity: train the student to rank candidate outputs similarly to the teacher
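
Supervised finetuning is the simplest route: the teacher produces (or labels) training text and the student is finetuned on it with the ordinary next-token loss. A minimal sketch, assuming a Hugging Face-style causal LM pair; the model names, prompt, and hyperparameters are illustrative placeholders, not from the source:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model choices; any teacher/student causal LM pair would do.
teacher = AutoModelForCausalLM.from_pretrained("gpt2-large").eval()
student = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token  # GPT-2 has no pad token by default

prompts = ["Explain knowledge distillation in one sentence."]  # toy prompt set

# 1. The teacher generates (or annotates) training examples.
with torch.no_grad():
    batch = tok(prompts, return_tensors="pt", padding=True)
    generated = teacher.generate(**batch, max_new_tokens=64,
                                 pad_token_id=tok.eos_token_id)
texts = tok.batch_decode(generated, skip_special_tokens=True)

# 2. The student is finetuned on those examples with standard cross-entropy
#    (in a real run, padded label positions should be set to -100).
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)
inputs = tok(texts, return_tensors="pt", padding=True)
loss = student(**inputs, labels=inputs["input_ids"].clone()).loss
loss.backward()
optimizer.step()
```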
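
For the divergence/similarity route, a common concrete instantiation (an assumption here; the list above does not pin one down) is the temperature-scaled KL divergence between teacher and student token distributions, optionally combined with a hidden-state similarity term:

```python
import torch.nn.functional as F

def kl_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soft-label KL divergence; both tensors are (batch, ..., vocab_size).
    # The temperature**2 factor keeps gradient scale comparable across temperatures.
    student_logp = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_p = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(student_logp, teacher_p, reduction="batchmean") * temperature ** 2

def hidden_similarity_loss(student_hidden, teacher_hidden):
    # Push student hidden states toward the teacher's via cosine similarity.
    # Assumes matching dimensions; in practice a learned projection often bridges them.
    return 1.0 - F.cosine_similarity(student_hidden, teacher_hidden, dim=-1).mean()
```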
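
For the reinforcement learning route, a heavily simplified sketch: it assumes a reward model already trained (e.g. on teacher feedback) and exposed as a callable `reward_fn` that scores full sequences, and it uses a plain REINFORCE update for brevity rather than whatever RL algorithm a given setup would actually use (PPO is more common in practice):

```python
import torch.nn.functional as F

def rl_distillation_step(student, tokenizer, reward_fn, prompts, optimizer):
    batch = tokenizer(prompts, return_tensors="pt", padding=True)
    # Sample completions from the current student policy.
    samples = student.generate(**batch, do_sample=True, max_new_tokens=32,
                               pad_token_id=tokenizer.eos_token_id)
    rewards = reward_fn(samples)  # assumed shape: (batch,)
    # Log-probabilities the student assigns to its own sampled tokens.
    logits = student(samples).logits[:, :-1, :]
    log_probs = F.log_softmax(logits, dim=-1)
    token_logp = log_probs.gather(-1, samples[:, 1:].unsqueeze(-1)).squeeze(-1)
    # REINFORCE: weight the sequence log-likelihood by its reward (prompt and
    # padding positions should be masked in a real implementation).
    loss = -(rewards.unsqueeze(-1) * token_logp).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```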
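
Rank similarity can be instantiated in several ways; one simple choice (again an assumption, not named above) is a ListNet-style listwise loss that matches the student's ranking distribution over a shared set of candidate outputs to the teacher's:

```python
import torch.nn.functional as F

def rank_similarity_loss(student_scores, teacher_scores):
    # Both tensors: (batch, num_candidates) scores over the same candidate outputs.
    # Treat each model's scores as a distribution over candidates and minimize
    # cross-entropy so the student ranks candidates the way the teacher does.
    teacher_dist = F.softmax(teacher_scores, dim=-1)
    student_log_dist = F.log_softmax(student_scores, dim=-1)
    return -(teacher_dist * student_log_dist).sum(dim=-1).mean()
```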