ESE Seminar | Marco Mondelli


Friday, November 7, 2025, 10 AM to 11 AM

Learning and Data in the Age of LLMs: Theoretical Insights from High-Dimensional Regression

The availability of powerful models pre-trained on a vast corpus of data has spurred research on alternative training paradigms. This talk presents three vignettes that give theoretical insights through the lens of high-dimensional regression.

The first vignette concerns knowledge distillation, where the output of a surrogate model is used as labels to supervise the training of a target model. I will be particularly interested in the phenomenon of weak-to-strong generalization, in which a strong student outperforms the weak teacher from which the task is learned. More precisely, I will provide a sharp characterization of the risk of the target model when the surrogate model is either arbitrary or obtained via empirical risk minimization. This shows that weak-to-strong training, with the surrogate as the weak model, provably outperforms training with strong labels under the same data budget, but it is unable to improve the data scaling law.

The second vignette concerns test-time training (TTT), where the weights of a model are explicitly updated to adapt to the specific test instance. I will investigate a gradient-based TTT algorithm for in-context learning, in which a linear transformer model is trained via a single gradient step on the in-context demonstrations provided in the test prompt. This shows how TTT can significantly reduce the sample size required for in-context learning, and it delineates the role of the alignment between the pre-training distribution and the target task.

Finally, the third vignette concerns synthetic data selection, where data obtained from a generative model are used to augment the training dataset. I will prove that, for linear models, the covariance shift between the target distribution and the distribution of the synthetic data affects the generalization error but, surprisingly, the mean shift does not. Remarkably, selecting synthetic data that match the covariance of the target distribution is optimal not only theoretically, in the context of linear models, but also empirically: this procedure performs well against the state of the art across training paradigms, architectures, datasets, and generative models used for augmentation.
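To make the distillation setup of the first vignette concrete, here is a minimal sketch of surrogate-label (weak-to-strong) training in ridge regression. Everything in it (dimensions, noise level, the ridge penalty, and the use of ridge as a stand-in for empirical risk minimization) is an assumption of this toy, not the talk's setting. It illustrates the pipeline rather than the risk characterization: whether the student beats the strong-label baseline depends on the regime, which is exactly what a sharp analysis resolves.

```python
# Toy weak-to-strong pipeline in ridge regression (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
d, n_weak, n_budget = 50, 40, 200            # assumed dimensions/budgets
w_star = rng.normal(size=d) / np.sqrt(d)     # ground-truth regressor

def ridge(X, y, lam=1e-2):
    """Ridge estimator, a stand-in for empirical risk minimization."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Weak teacher: fit on a small strongly-labeled sample.
Xw = rng.normal(size=(n_weak, d))
yw = Xw @ w_star + 0.5 * rng.normal(size=n_weak)
w_teacher = ridge(Xw, yw)

# Strong student: trained on n_budget inputs labeled BY THE TEACHER.
Xs = rng.normal(size=(n_budget, d))
w_student = ridge(Xs, Xs @ w_teacher)

# Baseline: the same data budget, but with true (strong) labels.
y_true = Xs @ w_star + 0.5 * rng.normal(size=n_budget)
w_baseline = ridge(Xs, y_true)

# Excess risk for isotropic features is the squared parameter error.
for name, w in [("teacher", w_teacher), ("student", w_student),
                ("strong-label baseline", w_baseline)]:
    print(f"{name:>22s} risk: {np.sum((w - w_star) ** 2):.4f}")
```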

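For the second vignette, the talk analyzes a linear transformer; as a simplified stand-in, the sketch below performs gradient-based TTT with a plain linear predictor. Starting from hypothetical pretrained weights (zero-initialized here for simplicity), it takes a single gradient step on the squared loss over the in-context demonstrations and then answers the query. All names, sizes, and the learning rate are illustrative assumptions.

```python
# Simplified analogue of one-step test-time training for in-context
# regression (a plain linear predictor replaces the linear transformer).
import numpy as np

rng = np.random.default_rng(1)
d, n_ctx = 20, 40
w_task = rng.normal(size=d) / np.sqrt(d)   # task drawn at test time

# In-context demonstrations (the test prompt) and a query point.
X_ctx = rng.normal(size=(n_ctx, d))
y_ctx = X_ctx @ w_task
x_query = rng.normal(size=d)

w_pre = np.zeros(d)                        # hypothetical pretrained weights

# One gradient step on the squared loss over the demonstrations.
lr = 1.0
grad = X_ctx.T @ (X_ctx @ w_pre - y_ctx) / n_ctx
w_ttt = w_pre - lr * grad

# Compare the query prediction before and after adaptation.
for name, w in [("no adaptation", w_pre), ("one-step TTT", w_ttt)]:
    err = (x_query @ w - x_query @ w_task) ** 2
    print(f"{name:>14s} query error: {err:.4f}")
```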
 
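The third vignette suggests a simple recipe: pick synthetic points whose empirical covariance matches the target's. The sketch below implements a naive greedy version of that idea under a Frobenius-norm criterion; the greedy rule, the pool's construction, and all constants are invented for illustration and are not claimed to be the talk's procedure. The pool also carries a mean shift, which the result says should not affect generalization for linear models.

```python
# Toy covariance-matching selection of synthetic data (greedy, illustrative).
import numpy as np

rng = np.random.default_rng(2)
d, n_pool, n_select = 5, 400, 80

# Target covariance (non-isotropic, so matching it is non-trivial).
A = rng.normal(size=(d, d))
cov_target = A @ A.T / d

# Synthetic pool: per-coordinate rescaling (a covariance shift) plus a
# mean shift of +1, which the theory says should be harmless.
pool = rng.normal(size=(n_pool, d)) * rng.uniform(0.5, 2.0, size=d) + 1.0

def mismatch(idx):
    """Frobenius distance between the subset's covariance and the target's."""
    return np.linalg.norm(np.cov(pool[idx], rowvar=False) - cov_target)

selected = [0, 1]                          # seed with two arbitrary pool points
for _ in range(n_select - 2):
    candidates = [i for i in range(n_pool) if i not in selected]
    selected.append(min(candidates, key=lambda i: mismatch(selected + [i])))

random_subset = list(rng.choice(n_pool, size=n_select, replace=False))
print(f"covariance mismatch, greedy selection: {mismatch(selected):.3f}")
print(f"covariance mismatch, random subset:    {mismatch(random_subset):.3f}")
```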
