Seminar: Learning Theory of Transformers: Generalization and Optimization of In-Context Learning by Prof. Taiji Suzuki
Abstract: We introduce recent theoretical developments that elucidate the learning capabilities of Transformers, focusing on in-context learning as the main subject. First, regarding statistical efficiency and approximation ability, we show that Transformers can achieve minimax optimality for in-context learning and are superior to non-pretrained methods. Next, in terms of optimization theory, we demonstrate that nonlinear feature learning for in-context learning can be carried out with optimization guarantees. More concretely, the objective becomes strict-saddle in a mean-field setting, and if the target is a single-index model, then its computational efficiency can be evaluated based on the information exponent of the true function.
Biography: Taiji Suzuki is currently a full professor in the Department of Mathematical Informatics at the University of Tokyo. He also serves as the team leader of the “Deep learning theory” team at AIP-RIKEN. He received his Ph.D. degree in information science and technology from the University of Tokyo in 2009. He worked as an assistant professor in the Department of Mathematical Informatics at the University of Tokyo between 2009 and 2013, and then as an associate professor in the Department of Mathematical and Computing Science at Tokyo Institute of Technology between 2013 and 2017. After that, he was an associate professor in the Department of Mathematical Informatics at the University of Tokyo between 2017 and 2024.
He has served as an area chair of premier conferences such as NeurIPS, ICML, ICLR, and
AISTATS, a program chair of ACML 2019, and an action editor of the Annals of
Statistics. He received the Outstanding Paper Award at ICLR in 2021, the MEXT
Young Scientists’ Prize, and the Outstanding Achievement Award in 2017 from the
Japan Statistical Society. He is interested in deep learning theory,
nonparametric statistics, high-dimensional statistics, and stochastic
optimization. In particular, he is mainly working on deep learning theory from
several aspects such as representation ability, generalization ability and
optimization ability. He has also worked on stochastic optimization to accelerate
large-scale machine learning problems, including variance reduction methods,
Nesterov’s acceleration, federated learning, and non-convex noisy optimization.