Minimal single-script implementation of knowledge distillation in LLMs: GPT-2 (124M) as the student and GPT-2 Medium (355M) as the teacher, trained with reverse Kullback-Leibler (KL) divergence on a small chunk of OpenWebText.
Code: https://github.com/Jaykef/ai-algorithms/blob/main/llm_knowledge_distillation.ipynb
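Rough sketch of the idea (not the exact notebook code): the student is pushed toward the teacher's next-token distribution by minimizing reverse KL, i.e. KL(student || teacher). Model names come from Hugging Face transformers; the temperature, learning rate, and helper names below are illustrative assumptions.

```python
# Sketch of reverse-KL distillation between GPT-2 (student) and GPT-2 Medium (teacher).
# Hyperparameters and the toy batch are placeholders, not the notebook's settings.
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
student = GPT2LMHeadModel.from_pretrained("gpt2").to(device)                 # 124M params
teacher = GPT2LMHeadModel.from_pretrained("gpt2-medium").to(device).eval()   # ~355M params

def reverse_kl_loss(student_logits, teacher_logits, temperature=1.0):
    # Reverse KL: KL(q_student || p_teacher) = sum_v q(v) * (log q(v) - log p(v)),
    # averaged over all token positions in the batch.
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    t_logp = F.log_softmax(teacher_logits / temperature, dim=-1)
    q = s_logp.exp()
    return (q * (s_logp - t_logp)).sum(dim=-1).mean() * temperature**2

optimizer = torch.optim.AdamW(student.parameters(), lr=3e-4)

def train_step(input_ids):
    # Teacher logits are computed without gradients; only the student is updated.
    with torch.no_grad():
        teacher_logits = teacher(input_ids).logits
    student_logits = student(input_ids).logits
    loss = reverse_kl_loss(student_logits, teacher_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage: OpenWebText chunks would be tokenized and batched the same way.
batch = tokenizer(
    ["Knowledge distillation transfers the teacher's distribution to the student."],
    return_tensors="pt",
).input_ids.to(device)
print(train_step(batch))
```

Unlike forward KL, reverse KL is mode-seeking: the student concentrates on the teacher's high-probability tokens rather than spreading mass over the whole distribution.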