Pietro Lesci


I am a PhD student in Computer Science at the University of Cambridge, advised by Prof Andreas Vlachos. I have a background in economics and 3+ years of experience across research labs, consulting firms, and international institutions, training custom models and developing data science solutions. Find my CV here.

I am currently interested in understanding how training data influences a model's behaviour. To study this question, I draw on methods from econometrics. I aim to build models that can autonomously select and craft their own training data to continually acquire new capabilities as needed, thus blurring the line between data curation and learning. My research sits at the intersection of causal methods, active learning, tokenisation, and pre-training.

My work has been presented at major machine learning conferences such as ICLR, ACL, NAACL, and EMNLP. I received a Best Paper Award at ACL 2024 and funding from the Translated Imminent Research Grant for my research contributions. You can find the complete list of publications and related links on the /research page. I also keep my Hugging Face profile updated: each paper is linked to a collection listing all the relevant artefacts.

news

Jan 22, 2025 The papers PolyPythias: Stability and Outliers across Fifty Language Model Pre-Training Runs (first author) and Self-Training Large Language Models for Tool-Use Without Demonstrations have been accepted, respectively, at ICLR 2025 and NAACL 2025 (Findings) πŸŽ‰
Nov 14, 2024 I have been recognised as one of the Outstanding Reviewers for EMNLP 2024!
Nov 06, 2024 🌟 Our β€œLarge Language Model Memorization (L2M2)” workshop proposal has been accepted at ACL 2025 πŸŽ‰ Jointly proposed with Robin Jia, Verna Dankers, Johnny Tian-Zheng Wei, Pratyush Maini, Yangsibo Huang, Eric Wallace, and Tiago Pimentel.
Oct 01, 2024 πŸ”¬ Our work studying the challenges of training small language models has been accepted at EMNLP 2024 (Findings)! Joint work with @richarddm1 and Paula Buttery.
Aug 15, 2024 πŸš€ Our work on estimating memorisation in language models from only observational data, Causal Estimation of Memorisation Profiles, has won the Best Paper Award at ACL 2024 (main)! Joint work with @clara__meister, Thomas Hofmann, @vlachos_nlp, and @tpimentelms. Details in post.
Mar 15, 2024 πŸ“š Happy to share that our work AnchorAL: Computationally Efficient Active Learning for Large and Imbalanced Datasets has been accepted to NAACL 2024 (main)! Joint work with my supervisor @vlachos_nlp. Details in post.
Jun 15, 2023 I am happy to share that my internship work and first-ever paper, Diable: Efficient Dialogue State Tracking as Operations on Tables, has been accepted at ACL 2023 (Findings)! πŸš€
Sep 12, 2022 Happy to join Amazon AWS AI Labs in Barcelona, working on efficient dialogue state tracking with Lluis Marquez, Yoshinari Fujinuma, and the fantastic AWS team in Barcelona, NYC, and Seattle!
Jan 04, 2022 Wordify 2.0 is out! ✨
Oct 01, 2021 I joined the University of Cambridge as a PhD student in Computer Science!