Academic Journal

Gradient Methods for Optimizing Metaparameters in the Knowledge Distillation Problem.

Bibliographic Details
Title: Gradient Methods for Optimizing Metaparameters in the Knowledge Distillation Problem.
Authors: Gorpinich, M.¹ (gorpinich.m@phystech.edu), Bakhteev, O. Yu.² (bakhteev@phystech.edu), Strijov, V. V.² (strijov@gmail.com)
Source: Automation & Remote Control. Oct 2022, Vol. 83 Issue 10, p1544-1554. 11p.
Subject Terms: *MACHINE learning, *TRAJECTORY optimization, *DEEP learning, *INFORMATION modeling
Abstract: The paper investigates the distillation problem for deep learning models. Knowledge distillation is treated as a metaparameter optimization problem in which information from a model with a more complex structure, called the teacher model, is transferred to a model with a simpler structure, called the student model. The paper proposes a generalization of the distillation problem to the case where the metaparameters are optimized by gradient methods. Metaparameters are the parameters of the distillation optimization problem. The loss function for this problem is the sum of a classification term and the cross-entropy between the responses of the student model and the teacher model. Assigning optimal metaparameters to the distillation loss function is computationally difficult. The properties of the optimization problem are investigated so that the metaparameter update trajectory can be predicted. The trajectory of the gradient optimization of the metaparameters is analyzed, and their values are predicted using linear functions. The proposed approach is illustrated by computational experiments on the CIFAR-10 and Fashion-MNIST datasets as well as on synthetic data. [ABSTRACT FROM AUTHOR]
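The loss described in the abstract, a classification term plus a cross-entropy between the student's and teacher's responses, together with the idea of predicting the metaparameter trajectory with a linear function, can be sketched as follows. This is a minimal illustration assuming a PyTorch-style setup; the mixing weight lam, the temperature T, and the predict_linear helper are hypothetical names for this sketch and are not taken from the paper.

import numpy as np
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, lam, T):
    # Classification term: cross-entropy against the true labels.
    ce = F.cross_entropy(student_logits, labels)
    # Distillation term: cross-entropy between the temperature-softened
    # teacher and student responses.
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    distill = -(soft_teacher * log_soft_student).sum(dim=-1).mean()
    # lam weights the two terms; as a tensor with requires_grad=True it is a
    # metaparameter that can itself be updated by gradient steps.
    return (1.0 - lam) * ce + lam * distill

# Metaparameters as differentiable tensors (illustrative initial values).
lam = torch.tensor(0.5, requires_grad=True)
T = torch.tensor(2.0, requires_grad=True)

# Toy batch standing in for real data.
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))

loss = distillation_loss(student_logits, teacher_logits, labels, lam, T)
loss.backward()  # gradients now flow to lam and T as well as to the logits

def predict_linear(history, steps_ahead):
    # Fit a linear function to a recorded metaparameter trajectory and
    # extrapolate it, in the spirit of the paper's linear prediction of
    # metaparameter values (the exact scheme is not given in the abstract).
    t = np.arange(len(history))
    slope, intercept = np.polyfit(t, np.asarray(history, dtype=float), 1)
    return intercept + slope * (len(history) - 1 + steps_ahead)

print(predict_linear([0.50, 0.52, 0.55], steps_ahead=5))

In a full bilevel setup, lam and T would be updated from a validation loss while the model parameters are trained on the training loss; extrapolating the metaparameter trajectory with a linear fit then allows some of the costly metaparameter gradient steps to be skipped.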
Copyright of Automation & Remote Control is the property of Springer Nature and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database: Academic Search Premier