Cross-Architecture Knowledge Distillation

Cross-architecture knowledge distillation (CAKD) aims to transfer knowledge from a computationally expensive "teacher" model (e.g., a transformer) to a more efficient "student" model of a different architecture (e.g., a CNN), improving the student's performance. Current research focuses on aligning features between disparate architectures, often employing techniques such as feature projection, receptive field mapping, and adaptive loss functions to bridge the gap. This approach is significant because it lets the strengths of powerful, complex models be transferred to students that can run on resource-constrained devices, or be used to improve the efficiency of existing models, across applications including image processing, speech recognition, and graph classification.
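As a concrete illustration, the sketch below shows one common way such a pipeline can be set up in PyTorch: the student's pooled features are mapped through a learned linear projection into the teacher's feature dimension, and a combined loss mixes soft-label distillation on the logits with an MSE alignment term on the projected features. This is a minimal sketch under simplifying assumptions (features already pooled to vectors, a single projection layer); the names FeatureProjector and cakd_loss, and the specific loss weighting, are illustrative and do not correspond to any particular paper's method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureProjector(nn.Module):
    """Maps student features into the teacher's feature space so they can be compared."""

    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)


def cakd_loss(student_logits, teacher_logits,
              student_feats, teacher_feats, projector,
              temperature: float = 4.0, alpha: float = 0.5, beta: float = 1.0):
    """Combined distillation loss: soft-label KD on logits plus feature alignment.

    Assumes student_feats and teacher_feats are already pooled to shape [B, dim].
    """
    # Soft-label distillation: match softened teacher and student distributions.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Feature alignment: project the student (e.g., CNN) features into the
    # teacher (e.g., transformer) embedding space before comparing.
    feat = F.mse_loss(projector(student_feats), teacher_feats)

    return alpha * kd + beta * feat


if __name__ == "__main__":
    # Toy usage with random tensors; dimensions are arbitrary placeholders.
    B, s_dim, t_dim, num_classes = 8, 512, 768, 100
    projector = FeatureProjector(s_dim, t_dim)
    loss = cakd_loss(
        torch.randn(B, num_classes), torch.randn(B, num_classes),
        torch.randn(B, s_dim), torch.randn(B, t_dim), projector,
    )
    print(loss.item())
```

In practice the projector's parameters would be trained jointly with the student, and this distillation loss is typically added to the ordinary cross-entropy loss on ground-truth labels; adaptive loss functions in the literature further reweight or reshape these terms during training.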

Papers