Brian Thompson

<firstname>@alumni.caltech.edu
<firstname>jt@amazon.com

About Me

I am currently a Senior Applied Scientist in Amazon's Artificial General Intelligence (AGI) org, where I work on large language model (LLM) training. I have previously worked at Apple, Johns Hopkins University (where I also completed my PhD), MIT Lincoln Laboratory, and Rincon Research Corporation, on topics including machine translation (MT), automatic dubbing, text-to-speech (TTS), data curation and filtering, MT evaluation, multilingual modeling, paraphrasing, cross-language information retrieval, domain adaptation, and digital signal processing.

My recent work exploring the impact of machine translation on the web has been covered by Politico, The Atlantic, Slator, Vice, TechInsider, Futurism, and others.

Open Source Projects

I developed Vecalign for the ParaCrawl parallel data acquisition project. Vecalign is an accurate sentence alignment algorithm based on multilingual sentence embeddings, and it runs in time and space linear in the number of sentences being aligned. In conjunction with a multilingual sentence embedding model like LASER or LaBSE, Vecalign makes it easy to perform sentence alignment in about 100 languages (i.e., about 100^2 language pairs), without the need for a machine translation system or lexicon. At the time of writing, Vecalign has the best reported performance on the test set released with Bleualign.

I also developed Prism, an automatic MT metric which uses a sequence-to-sequence paraphraser to score MT system outputs conditioned on their respective human references. Prism uses a multilingual neural MT model as a zero-shot paraphraser, which eliminates the need for synthetic paraphrase data and results in a single model which works in many languages (we release a model in 39 languages). At the time of publication, Prism outperformed or statistically tied with all metrics submitted to the WMT 2019 metrics shared task at segment-level human correlation. I developed bitext filtering code to preprocess the data used to train Prism, but the code is general enough to use for any MT training and is released here.

Education

The Johns Hopkins University

PhD, Computer Science, Center for Language and Speech Processing (CLSP)
Work completed while I was a Research Scientist at the JHU Human Language Technology Center of Excellence (HLTCOE)
I was advised by Philipp Koehn and funded by a National Defense Science and Engineering Graduate (NDSEG) Fellowship
Thesis: Sentence Similarity and Machine Translation

California Institute of Technology

MS, Electrical Engineering

Rose-Hulman Institute of Technology

BS, Electrical Engineering

Publications

Note: Google Scholar may be more up-to-date.

A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism
Brian Thompson, Mehak Preet Dhaliwal, Peter Frisch, Tobias Domhan, Marcello Federico
arXiv preprint
[pdf] [BibTeX] [code]

Findings of the WMT 2023 Shared Task on Parallel Data Curation
Steve Sloto, Brian Thompson, Huda Khayrallah, Tobias Domhan, Thamme Gowda, Philipp Koehn
Proceedings of the Eighth Conference on Machine Translation (WMT 2023)
[pdf] [BibTeX]

Results of WMT23 Metrics Shared Task: Metrics Might Be Guilty but References Are Not Innocent
Markus Freitag, Nitika Mathur, Chi-kiu Lo, Eleftherios Avramidis, Ricardo Rei, Brian Thompson, Tom Kocmi, Frederic Blain, Daniel Deutsch, Craig Stewart, Chrysoula Zerva, Sheila Castilho, Alon Lavie, George Foster
Proceedings of the Eighth Conference on Machine Translation (WMT 2023)
[pdf] [BibTeX]

Findings of the IWSLT 2023 Evaluation Campaign
Milind Agarwal, Sweta Agrawal, Antonios Anastasopoulos, Luisa Bentivogli, Ondřej Bojar, Claudia Borg, Marine Carpuat, Roldano Cattoni, Mauro Cettolo, Mingda Chen, William Chen, Khalid Choukri, Alexandra Chronopoulou, Anna Currey, Thierry Declerck, Qianqian Dong, Kevin Duh, Yannick Estève, Marcello Federico, Souhir Gahbiche, Barry Haddow, Benjamin Hsu, Phu Mon Htut, Hirofumi Inaguma, Dávid Javorský, John Judge, Yasumasa Kano, Tom Ko, Rishu Kumar, Pengwei Li, Xutai Ma, Prashant Mathur, Evgeny Matusov, Paul McNamee, John P. McCrae, Kenton Murray, Maria Nadejde, Satoshi Nakamura, Matteo Negri, Ha Nguyen, Jan Niehues, Xing Niu, Atul Kr. Ojha, John E. Ortega, Proyag Pal, Juan Pino, Lonneke van der Plas, Peter Polák, Elijah Rippeth, Elizabeth Salesky, Jiatong Shi, Matthias Sperber, Sebastian Stüker, Katsuhito Sudoh, Yun Tang, Brian Thompson, Kevin Tran, Marco Turchi, Alex Waibel, Mingxuan Wang, Shinji Watanabe, Rodolfo Zevallos
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)
[pdf] [BibTeX]

End-to-End Single-Channel Speaker-Turn Aware Conversational Speech Translation
Juan Pablo Zuluaga-Gomez, Zhaocheng Huang, Xing Niu, Rohit Paturi, Sundararajan Srinivasan, Prashant Mathur, Brian Thompson, Marcello Federico
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023)
[pdf] [BibTeX]

Speaker Diarization of Scripted Audiovisual Content
Yogesh Virkar, Brian Thompson, Rohit Paturi, Sundararajan Srinivasan, Marcello Federico
arXiv preprint
[pdf] [BibTeX]

Improving Isochronous Machine Translation with Target Factors and Auxiliary Counters
Proyag Pal, Brian Thompson, Yogesh Virkar, Prashant Mathur, Alexandra Chronopoulou, Marcello Federico
Proceedings of INTERSPEECH 2023
[pdf] [BibTeX] [code]

Jointly Optimizing Translations and Speech Timing to Improve Isochrony in Automatic Dubbing
Alexandra Chronopoulou, Brian Thompson, Prashant Mathur, Yogesh Virkar, Surafel M. Lakew, Marcello Federico
arXiv preprint
[pdf] [BibTeX]

Improving Robustness of Retrieval Augmented Translation via Shuffling of Suggestions
Cuong Hoang, Devendra Sachan, Prashant Mathur, Brian Thompson, Marcello Federico
arXiv preprint
[pdf] [BibTeX]

Improving Retrieval Augmented Neural Machine Translation by Controlling Source and Fuzzy-Match Interactions
Cuong Hoang, Devendra Sachan, Prashant Mathur, Brian Thompson, Marcello Federico
Findings of the Association for Computational Linguistics: EACL 2023
[pdf] [BibTeX]

Dubbing in Practice: A Large Scale Study of Human Localization With Insights for Automatic Dubbing
William Brannon, Yogesh Virkar, Brian Thompson
Transactions of the Association for Computational Linguistics (TACL 2023)
[pdf] [BibTeX]

Embarrassingly Easy Document-Level MT Metrics: How to Convert Any Pretrained Metric into a Document-Level Metric
Giorgos Vernikos, Brian Thompson, Prashant Mathur, Marcello Federico
Proceedings of the Seventh Conference on Machine Translation (WMT 2022)
[pdf] [BibTeX] [code]

Improving Arabic Diacritization by Learning to Diacritize and Translate
Brian Thompson and Ali Alshehri
Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022)
[pdf] [BibTeX] [talk]

Paraphrase Generation as Zero-Shot Multilingual Translation: Disentangling Semantic Similarity from Lexical and Syntactic Diversity
Brian Thompson and Matt Post
Proceedings of the Fifth Conference on Machine Translation (WMT 2020)
[code] [pdf] [BibTeX] [talk]

Automatic Machine Translation Evaluation in Many Languages via Zero-Shot Paraphrasing
Brian Thompson and Matt Post
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
[code] [pdf] [BibTeX] [talk]

Exploiting Sentence Order in Document Alignment
Brian Thompson and Philipp Koehn
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
[pdf] [BibTeX] [talk]

Simulated Multiple Reference Training Improves Low-Resource Machine Translation
Huda Khayrallah, Brian Thompson, Matt Post, and Philipp Koehn
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
[code] [pdf] [BibTeX] [talk]

ParaCrawl: Web-Scale Acquisition of Parallel Corpora
Marta Bañón, Pinzhen Chen, Barry Haddow, Kenneth Heafield, Hieu Hoang, Miquel Esplà-Gomis, Mikel Forcada, Amir Kamran, Faheem Kirefu, Philipp Koehn, Sergio Ortiz-Rojas, Leopoldo Pla, Gema Ramírez-Sánchez, Elsa Sarrías, Marek Strelec, Brian Thompson, William Waites, Dion Wiggins, and Jaume Zaragoza
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020)
[website] [pdf] [BibTeX] [talk]

Benchmarking Neural and Statistical Machine Translation on Low-Resource African Languages
Kevin Duh, Paul McNamee, Matt Post, and Brian Thompson
Proceedings of The 12th Language Resources and Evaluation Conference (LREC 2020)
[pdf] [BibTeX]

Vecalign: Improved Sentence Alignment in Linear Time and Space
Brian Thompson and Philipp Koehn
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
[code] [pdf] [BibTeX]

HABLex: Human Annotated Bilingual Lexicons for Experiments in Machine Translation
Brian Thompson, Rebecca Knowles, Xuan Zhang, Huda Khayrallah, Kevin Duh, and Philipp Koehn
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
[dataset] [pdf] [BibTeX]

Overcoming Catastrophic Forgetting During Domain Adaptation of Neural Machine Translation
Brian Thompson, Jeremy Gwinnup, Huda Khayrallah, Kevin Duh, and Philipp Koehn
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)
[pdf] [BibTeX] [talk]

Freezing Subnetworks to Analyze Domain Adaptation in Neural Machine Translation
Brian Thompson, Huda Khayrallah, Antonios Anastasopoulos, Arya D. McCarthy, Kevin Duh, Rebecca Marvin, Paul McNamee, Jeremy Gwinnup, Tim Anderson, and Philipp Koehn
Proceedings of the Third Conference on Machine Translation (WMT) 2018
[pdf] [BibTeX]

Regularized Training Objective for Continued Training for Domain Adaptation in Neural Machine Translation
Huda Khayrallah, Brian Thompson, Kevin Duh, and Philipp Koehn
Proceedings of the Workshop on Neural Machine Translation (WNMT) 2018
[pdf] [BibTeX]

The JHU Machine Translation Systems for WMT 2018
Philipp Koehn, Kevin Duh, and Brian Thompson
Proceedings of the Third Conference on Machine Translation (WMT) 2018: Shared Task Papers
[pdf] [BibTeX]

The AFRL-MITLL WMT17 Systems: Old, New, Borrowed, BLEU
Jeremy Gwinnup, Timothy Anderson, Grant Erdmann, Katherine Young, Michaeel Kazi, Elizabeth Salesky, Brian Thompson, and Jonathan Taylor
Proceedings of the Second Conference on Machine Translation (WMT) 2017
[pdf] [BibTeX]

Implicitly-Defined Neural Networks for Sequence Labeling
Michaeel Kazi and Brian Thompson
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL) 2017
[pdf] [BibTeX] [talk]

The MITLL-AFRL IWSLT 2016 Systems
Michaeel Kazi, Elizabeth Salesky, Brian Thompson, Jonathan Taylor, Jeremy Gwinnup, Timothy Anderson, Grant Erdmann, Eric Hansen, Brian Ore, Katherine Young, and Michael Hutt
Proceedings of the 13th International Workshop on Spoken Language Translation (IWSLT) 2016
[pdf] [BibTeX]

The AFRL-MITLL WMT16 News-Translation Task Systems
Jeremy Gwinnup, Tim Anderson, Grant Erdmann, Katherine Young, Michaeel Kazi, Elizabeth Salesky, and Brian Thompson
Proceedings of the First Conference on Machine Translation (WMT) 2016
[pdf] [BibTeX]

The MITLL-AFRL IWSLT 2015 System
Michaeel Kazi, Brian Thompson, Elizabeth Salesky, Timothy Anderson, Grant Erdmann, Eric Hansen, Brian Ore, Katherine Young, Jeremy Gwinnup, Michael Hutt, and Christina May
Proceedings of the International Workshop on Spoken Language Translation (IWSLT) 2015
[pdf] [BibTeX]

The AFRL-MITLL WMT15 System: There’s More than One Way to Decode It!
Jeremy Gwinnup, Tim Anderson, Grant Erdmann, Katherine Young, Christina May, Michaeel Kazi, Elizabeth Salesky, and Brian Thompson
Proceedings of the Tenth Workshop on Statistical Machine Translation (WMT) 2015
[pdf] [BibTeX]

The MITLL-AFRL IWSLT 2014 MT System
Michaeel Kazi, Elizabeth Salesky, Brian Thompson, Jessica Ray, Michael Coury, Wade Shen, Tim Anderson, Grant Erdmann, Jeremy Gwinnup, Katherine Young, Brian Ore, and Michael Hutt
Proceedings of the International Workshop on Spoken Language Translation (IWSLT) 2014
[pdf] [BibTeX]

Comparing a High and Low-Level Deep Neural Network Implementation for Automatic Speech Recognition
Jessica Ray, Brian Thompson, and Wade Shen
2014 First Workshop for High Performance Technical Computing in Dynamic Languages
[pdf] [BibTeX]

Discrimination Between Singing and Speech in Real-World Audio
Brian Thompson
IEEE Spoken Language Technology Workshop (SLT) 2014
[pdf] [BibTeX]