Abstract
Large Language Models (LLMs) can memorize training data and, when specifically prompted, reproduce or leak information from their training data. Information leakage has been observed for all types of machine-learning models, but the threat is considerably larger for LLMs because of their wide range of applications as generative AI. This chapter relates the threat of information leakage to other adversarial threats, provides an overview of the current state of research on the mechanisms behind memorization in LLMs, and discusses adversarial attacks that aim to extract memorized information from LLMs.
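To make the prompting-based leakage scenario concrete, the following is a minimal sketch of a prefix-continuation extraction probe against an open causal language model, assuming the Hugging Face transformers API. The model name, the prefix, and the checked "canary" string are illustrative assumptions, not a specific attack from the chapter.

# Minimal sketch of a prompt-based extraction probe (assumptions: Hugging Face
# transformers is installed; "gpt2" is a placeholder model; the prefix and the
# expected continuation are hypothetical examples, not real training data).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any causal LM can be probed the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical prefix taken from a document suspected to be in the training set.
prefix = "John Doe's email address is"
inputs = tokenizer(prefix, return_tensors="pt")

# Greedy decoding keeps the probe deterministic; memorized sequences tend to be
# reproduced verbatim under greedy or low-temperature decoding.
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=False)
continuation = tokenizer.decode(outputs[0], skip_special_tokens=True)

# If the continuation reproduces the exact string from the suspected training
# document, that is evidence of memorization-based leakage.
print(continuation)

In practice, such probes are repeated over many prefixes and combined with a scoring step (for example, comparing the model's likelihood of the generated sequence against a reference model) to separate memorized content from generic completions.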