ESE Seminar | Marco Mondelli
Friday, November 7, 2025, 10 AM to 11 AM
About this Event
Learning and Data in the Age of LLMs: Theoretical Insights from High-Dimensional Regression
The availability of powerful models pre-trained on a vast corpus of data has spurred research on alternative training paradigms, and this talk presents three vignettes giving theoretical insights through the lens of high-dimensional regression.

The first vignette is about knowledge distillation, where one uses the output of a surrogate model as labels to supervise the training of a target model. I will be particularly interested in the phenomenon of weak-to-strong generalization, in which a strong student outperforms the weak teacher from which the task is learned. More precisely, I will provide a sharp characterization of the risk of the target model when the surrogate model is either arbitrary or obtained via empirical risk minimization. This shows that weak-to-strong training, with the surrogate as the weak model, provably outperforms training with strong labels under the same data budget, but it is unable to improve the data scaling law.

The second vignette is about test-time training (TTT), where one explicitly updates the weights of a model to adapt to the specific test instance. I will investigate a gradient-based TTT algorithm for in-context learning, where a linear transformer model is trained via a single gradient step on the in-context demonstrations provided in the test prompt. This shows how TTT can significantly reduce the sample size required for in-context learning, and it delineates the role of the alignment between the pre-training distribution and the target task.

Finally, the third vignette is about synthetic data selection, where one uses data obtained from a generative model to augment the training dataset. I will prove that, for linear models, the covariance shift between the target distribution and the distribution of the synthetic data affects the generalization error but, surprisingly, the mean shift does not. Remarkably, selecting synthetic data that match the covariance of the target distribution is optimal not only theoretically in the context of linear models, but also empirically: this procedure performs well against the state of the art across training paradigms, architectures, datasets, and generative models used for augmentation.
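To make the first vignette concrete, here is a minimal numpy sketch of weak-to-strong training in linear regression: a weak teacher is fit by ridge-regularized least squares on its own sample, its predictions serve as surrogate labels for a student trained on fresh inputs, and the student's excess risk is compared with that of a student trained on the true (strong) labels. The dimensions, sample sizes, noise level, and ridge penalty below are illustrative assumptions, not the setting analyzed in the talk.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n_teacher, n_student, n_test = 50, 200, 400, 2000
    sigma = 0.5                                  # label noise level (illustrative)

    beta_star = rng.normal(size=d) / np.sqrt(d)  # ground-truth regression vector

    def sample(n):
        X = rng.normal(size=(n, d))
        y = X @ beta_star + sigma * rng.normal(size=n)
        return X, y

    def ridge(X, y, lam=1e-2):
        # ridge-regularized least squares (empirical risk minimization)
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    # Weak teacher: fit the surrogate model on its own labeled sample.
    X_t, y_t = sample(n_teacher)
    beta_teacher = ridge(X_t, y_t)

    # Strong student: train on fresh inputs, once with teacher-generated
    # (weak) labels and once with the true (strong) labels.
    X_s, y_s_strong = sample(n_student)
    y_s_weak = X_s @ beta_teacher
    beta_weak_to_strong = ridge(X_s, y_s_weak)
    beta_strong_labels = ridge(X_s, y_s_strong)

    # Compare the excess test risk of the two students.
    X_te, _ = sample(n_test)
    for name, b in [("weak-to-strong", beta_weak_to_strong),
                    ("strong labels", beta_strong_labels)]:
        risk = np.mean((X_te @ (b - beta_star)) ** 2)
        print(f"{name:15s} excess risk: {risk:.4f}")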
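For the second vignette, the sketch below assumes one common parameterization of single-layer linear attention for in-context regression, f(x; Gamma) = x' Gamma (X' y / n), and performs test-time training as a single gradient step on the squared loss over the in-context demonstrations in the prompt. The identity initialization of Gamma, the step size, and the prompt sizes are placeholders rather than the pre-trained model studied in the talk.

    import numpy as np

    rng = np.random.default_rng(1)
    d, n_ctx, lr = 20, 40, 0.1        # dimension, context length, TTT step size (illustrative)

    # A test prompt: in-context demonstrations (x_i, y_i) and a query x_q from one task.
    beta_task = rng.normal(size=d) / np.sqrt(d)
    X_ctx = rng.normal(size=(n_ctx, d))
    y_ctx = X_ctx @ beta_task
    x_q = rng.normal(size=d)

    # Linear-attention predictor: f(x; Gamma) = x' Gamma (X_ctx' y_ctx / n_ctx).
    Gamma = np.eye(d)                 # stand-in for the pre-trained weights (assumption)
    v = X_ctx.T @ y_ctx / n_ctx

    def predict(G, x):
        return x @ G @ v

    # Test-time training: one gradient step on the mean squared loss
    # evaluated on the in-context demonstrations themselves.
    residual = X_ctx @ Gamma @ v - y_ctx
    grad = 2.0 / n_ctx * np.outer(X_ctx.T @ residual, v)
    Gamma_ttt = Gamma - lr * grad

    print("prediction before TTT:", predict(Gamma, x_q))
    print("prediction after  TTT:", predict(Gamma_ttt, x_q))
    print("target value:         ", x_q @ beta_task)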
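For the third vignette, the sketch below illustrates covariance-matched selection of synthetic data with a simple proxy score: center the synthetic pool, whiten it with the target covariance, and keep the points whose squared Mahalanobis radius is closest to its target-typical value d. Both the simulated generative-model distortion (mean shift plus inflated covariance) and this selection score are illustrative assumptions, not the procedure from the talk.

    import numpy as np

    rng = np.random.default_rng(2)
    d, n_pool, n_keep = 10, 5000, 500      # illustrative sizes

    # Target distribution: zero mean with covariance Sigma_t.
    A = rng.normal(size=(d, d))
    Sigma_t = A @ A.T / d + np.eye(d)
    L = np.linalg.cholesky(Sigma_t)

    # Synthetic pool from a "generative model": shifted mean and inflated covariance.
    X_pool = 3.0 * np.ones(d) + np.sqrt(2.0) * rng.normal(size=(n_pool, d)) @ L.T

    # Crude covariance-matching proxy: keep points whose whitened squared radius
    # is closest to d, the typical value under the target distribution.
    Z = np.linalg.solve(L, (X_pool - X_pool.mean(axis=0)).T).T
    score = np.abs((Z ** 2).sum(axis=1) - d)
    keep = np.argsort(score)[:n_keep]
    X_sel = X_pool[keep]

    def cov_gap(X):
        # Frobenius distance between the empirical covariance and Sigma_t
        return np.linalg.norm(np.cov(X, rowvar=False) - Sigma_t, ord="fro")

    print("covariance mismatch, full pool :", cov_gap(X_pool))
    print("covariance mismatch, selection :", cov_gap(X_sel))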
Event Details
Dial-In Information
https://wustl.zoom.us/j/92486983543?pwd=i3G2PIOMpM1Z9U9RaT1A4xF4SZuTbD.1
Meeting ID: 924 8698 3543
Passcode: 628094