Valentin Hofmann

Young Investigator (Postdoc)

Allen Institute for AI

About

I am a Young Investigator (Postdoc) at the Allen Institute for AI, where I work with Noah Smith and Hannaneh Hajishirzi on the AllenNLP team. Previously, I was a PhD student at the University of Oxford and a researcher at LMU Munich, co-advised by Janet Pierrehumbert and Hinrich Schütze. During my PhD, I also spent time as a research intern at DeepMind and as a visiting scholar at Stanford University. My research focuses on the link between natural language processing and the social and cognitive sciences.

Interests
  • NLP and Language Variation
  • Language Models
  • Tokenization
  • Computational Morphology

News

  • 02/2024: Giving a talk at the University of Cambridge’s Language Technology Lab
  • 01/2024: Paper accepted to TACL 2024
  • 10/2023: Paper accepted to EMNLP 2023
  • 10/2023: Starting a postdoc at the Allen Institute for AI
  • 02/2023: Giving a talk at the University of Oxford’s OxfordXML Seminar Series
  • 01/2023: Giving a talk at the University of Toronto’s Computational Linguistics Group
  • 11/2022: Selected as a Rising Star in Data Science at the University of Chicago
  • 11/2022: Giving a talk at UC Berkeley’s NLP Seminar
  • 10/2022: Paper accepted to EMNLP 2022
  • 09/2022: Starting a research visit to the Stanford NLP Group
  • 08/2022: Giving a talk at the University of Würzburg’s Natural Language Processing Group
  • 07/2022: Giving a talk at MilaNLP Lab
  • 05/2022: Paper accepted to ICML 2022
  • 05/2022: Giving a talk at the University of Mannheim’s Data and Web Science Group
  • 04/2022: Paper accepted to NAACL 2022 (Findings)
  • 03/2022: Paper accepted to ICWSM 2022
  • 03/2022: Starting an internship in DeepMind’s Language Team
  • 02/2022: Two papers accepted to ACL 2022

Publications

2024

Dolma: An open corpus of three trillion tokens for language model pretraining research
Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Harsh Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew Peters, Abhilasha Ravichander, Kyle Richardson, Zejiang Shen, Emma Strubell, Nishant Subramani, Oyvind Tafjord, Pete Walsh, Luke Zettlemoyer, Noah Smith, Hannaneh Hajishirzi, Iz Beltagy, Dirk Groeneveld, Jesse Dodge, and Kyle Lo
arXiv:2402.00159

Geographic adaptation of pretrained language models
Valentin Hofmann, Goran Glavaš, Nikola Ljubešić, Janet Pierrehumbert, and Hinrich Schütze
TACL 2024

2023

Paloma: A benchmark for evaluating language model fit
Ian Magnusson, Akshita Bhagia, Valentin Hofmann, Luca Soldaini, Ananya Harsh Jha, Oyvind Tafjord, Dustin Schwenk, Evan Pete Walsh, Yanai Elazar, Kyle Lo, Dirk Groeneveld, Iz Beltagy, Hannaneh Hajishirzi, Noah Smith, Kyle Richardson, and Jesse Dodge
arXiv:2312.10523

Counting the bugs in ChatGPT’s wugs: A multilingual investigation into the morphological capabilities of a large language model
Leonie Weissweiler*, Valentin Hofmann*, Anjali Kantharuban, Anna Cai, Ritam Dutt, Amey Hengle, Anubha Kabra, Atharva Kulkarni, Abhishek Vijayakumar, Haofei Yu, Hinrich Schütze, Kemal Oflazer, and David Mortensen (*equal contribution)
EMNLP 2023

Explaining pretrained language models’ understanding of linguistic structures using construction grammar
Leonie Weissweiler, Valentin Hofmann, Abdullatif Köksal, and Hinrich Schütze
Frontiers in Artificial Intelligence 2023

2022

The better your syntax, the better your semantics? Probing pretrained language models for the English comparative correlative
Leonie Weissweiler, Valentin Hofmann, Abdullatif Köksal, and Hinrich Schütze
EMNLP 2022

Unsupervised detection of contextualized embedding bias with application to ideology
Valentin Hofmann, Janet Pierrehumbert, and Hinrich Schütze
ICML 2022

Modeling ideological salience and framing in polarized online groups with graph neural networks and structured sparsity
Valentin Hofmann, Xiaowen Dong, Janet Pierrehumbert, and Hinrich Schütze
NAACL 2022 (Findings)

The Reddit Politosphere: A large-scale text and network resource of online political discourse
Valentin Hofmann, Hinrich Schütze, and Janet Pierrehumbert
ICWSM 2022

An embarrassingly simple method to mitigate undesirable properties of pretrained language model tokenizers
Valentin Hofmann, Hinrich Schütze, and Janet Pierrehumbert
ACL 2022

CaMEL: Case marker extraction without labels
Leonie Weissweiler, Valentin Hofmann, Masoud Jalili Sabet, and Hinrich Schütze
ACL 2022

2021

Superbizarre is not superb: Derivational morphology improves BERT’s interpretation of complex words
Valentin Hofmann, Janet Pierrehumbert, and Hinrich Schütze
ACL 2021

Dynamic contextualized word embeddings
Valentin Hofmann, Janet Pierrehumbert, and Hinrich Schütze
ACL 2021

2020

DagoBERT: Generating derivational morphology with a pretrained language model
Valentin Hofmann, Janet Pierrehumbert, and Hinrich Schütze
EMNLP 2020

Predicting the growth of morphological families from social and linguistic factors
Valentin Hofmann, Janet Pierrehumbert, and Hinrich Schütze
ACL 2020

A graph auto-encoder model of derivational morphology
Valentin Hofmann, Hinrich Schütze, and Janet Pierrehumbert
ACL 2020