Abstract
Since their inception, LLMs have been evaluated on a wide range of natural language tasks, including reading comprehension, question answering, reasoning, and text generation. While LLMs have shown promising results, particularly as general-purpose models, their capabilities vary with their architecture, their training data, and the nature of the task. We briefly define these natural language tasks and give an overview of the current state of the art of LLMs on each.
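To make the evaluation setting concrete, the sketch below shows the pattern shared by many of these benchmarks: a model produces an answer for each example, and an automatic metric scores it against a reference. This is a minimal sketch, not any specific benchmark's implementation; it uses a simplified variant of the answer normalization common in SQuAD-style question answering (the full version also strips articles), and `ask_model` is a hypothetical stand-in for whatever LLM is under test.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace (simplified SQuAD-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return re.sub(r"\s+", " ", text).strip()


def exact_match(prediction: str, reference: str) -> bool:
    """True if the normalized prediction equals the normalized reference."""
    return normalize(prediction) == normalize(reference)


def evaluate(benchmark, ask_model) -> float:
    """Return the fraction of benchmark questions answered exactly right.

    `benchmark` is a list of (question, reference_answer) pairs;
    `ask_model` is any callable mapping a question string to an answer
    string -- a placeholder for the LLM being evaluated.
    """
    hits = sum(exact_match(ask_model(q), ref) for q, ref in benchmark)
    return hits / len(benchmark)


# Toy usage with a trivial stand-in "model":
benchmark = [("What is the capital of France?", "Paris")]
print(evaluate(benchmark, lambda q: "  paris. "))  # 1.0
```

Generation tasks such as summarization or translation follow the same loop but swap exact match for an overlap-based metric (e.g., BLEU or ROUGE) or human judgment, since many distinct outputs can be equally correct.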