Pietro Lesci


I am a PhD student in Computer Science at the University of Cambridge, advised by Prof Andreas Vlachos. I have a background in economics and 3+ years of experience across research labs, consulting firms, and international institutions, training custom models and developing data science solutions. Find my CV here.

I am currently interested in understanding how training data influences a model's behaviour. To study this question, I draw on methods from econometrics. I aim to build models that can autonomously select and craft their own training data to continually acquire new capabilities as needed, thus blurring the line between data curation and learning. My research sits at the intersection of causal methods, active learning, tokenisation, and pre-training.

My work has been presented at major machine learning conferences such as ICLR, ACL, NAACL, and EMNLP. I received a Best Paper Award at ACL 2024 and funding from the Translated Imminent Research Grant for my research contributions. You can find the complete list of publications and related links on the /research page. I also keep my Hugging Face profile updated: each paper is linked to a collection listing all the relevant artefacts.

news

Jan 22, 2025 The papers PolyPythias: Stability and Outliers across Fifty Language Model Pre-Training Runs (first author) and Self-Training Large Language Models for Tool-Use Without Demonstrations have been accepted, respectively, at ICLR 2025 and NAACL 2025 (Findings) πŸŽ‰
Nov 14, 2024 I have been recognised as one of the Outstanding Reviewers for EMNLP 2024!
Nov 06, 2024 🌟 Our β€œLarge Language Model Memorization (L2M2)” workshop proposal has been accepted at ACL 2025 πŸŽ‰ Jointly proposed with Robin Jia, Verna Dankers, Johnny Tian-Zheng Wei, Pratyush Maini, Yangsibo Huang, Eric Wallace, and Tiago Pimentel.
Oct 01, 2024 πŸ”¬ Our work studying the challenges of training small language models has been accepted at EMNLP 2024 (Findings)! Joint work with @richarddm1 and Paula Buttery.
Aug 15, 2024 πŸš€ Our work on estimating memorisation in language models from only observational data, Causal Estimation of Memorisation Profiles, has won the Best Paper Award at ACL 2024 (main)! Joint work with @clara__meister, Thomas Hofmann, @vlachos_nlp, and @tpimentelms. Details in post.
Mar 15, 2024 πŸ“š Happy to share that our work AnchorAL: Computationally Efficient Active Learning for Large and Imbalanced Datasets has been accepted to NAACL 2024 (main)! Joint work with my supervisor @vlachos_nlp. Details in post.
Jun 15, 2023 I am happy to share that my internship work and first-ever paper, Diable: Efficient Dialogue State Tracking as Operations on Tables, has been accepted at ACL 2023 (Findings)! πŸš€
Sep 12, 2022 Happy to join Amazon AWS AI Labs in Barcelona, working on efficient dialogue state tracking with Lluis Marquez, Yoshinari Fujinuma, and the fantastic AWS team in Barcelona, NYC, and Seattle!
Jan 04, 2022 Wordify 2.0 is out! ✨
Oct 01, 2021 I joined the University of Cambridge as a PhD student in Computer Science!