I am a Young Investigator (Postdoc) at the Allen Institute for AI, where I work with Noah Smith and Hannaneh Hajishirzi. Previously, I was a DPhil (PhD) student at the University of Oxford and a researcher at LMU Munich, advised by Janet Pierrehumbert and Hinrich Schütze. During my DPhil, I also spent time as a research intern at DeepMind and as a visiting scholar at Stanford University.
The two key questions driving my research are: What can computational models like GPT-4 reveal about the cognitive and social mechanisms that shape human language? And conversely, how can insights from linguistics help identify and address limitations in language technology? I explore these questions through the lenses of morphology and sociolinguistics, with a focus on large language models.
DPhil (PhD) in Linguistics
University of Oxford, 2024
MSc in Computational Linguistics and Computer Science
LMU Munich, 2020
MSt in Linguistics
University of Oxford, 2018
2024
AI generates covertly racist decisions about people based on their dialect
Valentin Hofmann, Pratyusha Ria Kalluri, Dan Jurafsky, and Sharese King
Nature
MAGNET: Improving the multilingual fairness of language models with adaptive gradient-based tokenization
Orevaoghene Ahia, Sachin Kumar, Hila Gonen, Valentin Hofmann, Tomasz Limisiewicz, Yulia Tsvetkov, and Noah Smith
NeurIPS 2024
Paloma: A benchmark for evaluating language model fit
Ian Magnusson, Akshita Bhagia, Valentin Hofmann, Luca Soldaini, Ananya Harsh Jha, Oyvind Tafjord, Dustin Schwenk, Evan Pete Walsh, Yanai Elazar, Kyle Lo, Dirk Groeneveld, Iz Beltagy, Hannaneh Hajishirzi, Noah Smith, Kyle Richardson, and Jesse Dodge
NeurIPS 2024
Political compass or spinning arrow? Towards more meaningful evaluations for values and opinions in large language models
Paul Röttger*, Valentin Hofmann*, Valentina Pyatkin, Musashi Hinck, Hannah Kirk, Hinrich Schütze, and Dirk Hovy
ACL 2024
🏆 Outstanding Paper Award
Dolma: An open corpus of three trillion tokens for language model pretraining research
Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Harsh Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew Peters, Abhilasha Ravichander, Kyle Richardson, Zejiang Shen, Emma Strubell, Nishant Subramani, Oyvind Tafjord, Pete Walsh, Luke Zettlemoyer, Noah Smith, Hannaneh Hajishirzi, Iz Beltagy, Dirk Groeneveld, Jesse Dodge, and Kyle Lo
ACL 2024
🏆 Best Resource Paper Award
Graph-enhanced large language models in asynchronous plan reasoning
Fangru Lin, Emanuele La Malfa, Valentin Hofmann, Elle Michelle Yang, Anthony Cohn, and Janet Pierrehumbert
ICML 2024
Geographic adaptation of pretrained language models
Valentin Hofmann, Goran Glavaš, Nikola Ljubešić, Janet Pierrehumbert, and Hinrich Schütze
TACL
2023
Counting the bugs in ChatGPT's wugs: A multilingual investigation into the morphological capabilities of a large language model
Leonie Weissweiler*, Valentin Hofmann*, Anjali Kantharuban, Anna Cai, Ritam Dutt, Amey Hengle, Anubha Kabra, Atharva Kulkarni, Abhishek Vijayakumar, Haofei Yu, Hinrich Schütze, Kemal Oflazer, and David Mortensen
EMNLP 2023
Explaining pretrained language models' understanding of linguistic structures using construction grammar
Leonie Weissweiler, Valentin Hofmann, Abdullatif Köksal, and Hinrich Schütze
Frontiers in Artificial Intelligence
2022
The better your syntax, the better your semantics? Probing pretrained language models for the English comparative correlative
Leonie Weissweiler, Valentin Hofmann, Abdullatif Köksal, and Hinrich Schütze
EMNLP 2022
Unsupervised detection of contextualized embedding bias with application to ideology
Valentin Hofmann, Janet Pierrehumbert, and Hinrich Schütze
ICML 2022
Modeling ideological salience and framing in polarized online groups with graph neural networks and structured sparsity
Valentin Hofmann, Xiaowen Dong, Janet Pierrehumbert, and Hinrich Schütze
NAACL 2022 (Findings)
The Reddit Politosphere: A large-scale text and network resource of online political discourse
Valentin Hofmann, Hinrich Schütze, and Janet Pierrehumbert
ICWSM 2022
An embarrassingly simple method to mitigate undesirable properties of pretrained language model tokenizers
Valentin Hofmann, Hinrich Schütze, and Janet Pierrehumbert
ACL 2022
CaMEL: Case marker extraction without labels
Leonie Weissweiler, Valentin Hofmann, Masoud Jalili Sabet, and Hinrich Schütze
ACL 2022
2021
Superbizarre is not superb: Derivational morphology improves BERT's interpretation of complex words
Valentin Hofmann, Janet Pierrehumbert, and Hinrich Schütze
ACL 2021
Dynamic contextualized word embeddings
Valentin Hofmann, Janet Pierrehumbert, and Hinrich Schütze
ACL 2021
2020
DagoBERT: Generating derivational morphology with a pretrained language model
Valentin Hofmann, Janet Pierrehumbert, and Hinrich Schütze
EMNLP 2020
Predicting the growth of morphological families from social and linguistic factors
Valentin Hofmann, Janet Pierrehumbert, and Hinrich Schütze
ACL 2020
A graph auto-encoder model of derivational morphology
Valentin Hofmann, Hinrich Schütze, and Janet Pierrehumbert
ACL 2020