Mary McGlohon, Google
A machine learning (ML) model is only as good as the data it trains on. However, as Large Language Models (LLMs) become more sophisticated and commonplace, the scale and complexity of ML training data and model output have dramatically increased. This presents new challenges to the already-difficult work of ensuring high-quality, reliable ML data. But we need not fear: in addressing these challenges, many foundational SRE principles still apply, such as managing tradeoffs between flexibility and stability, accounting for human operations within a system, and defining clear reliability requirements between system owners.
In this talk we describe common data reliability challenges in ML, and how they manifest in LLMs compared to "classic" supervised ML systems. Drawing from experience and SRE principles, we recommend best practices for assessing and managing ML data risks in your production systems.
Mary McGlohon is a Site Reliability Engineer at Google, who has worked on large-scale ML systems for the past 6 years. Prior to that, her career included data mining research, software development, and distributed pipeline systems. She completed a B.S. in computer science from the University of Tulsa and a Ph.D. in machine learning from Carnegie Mellon University. She is interested in making ML data higher quality, more observable, and easier to debug.
author = {Mary McGlohon},
title = {Reliable Data for Large {ML} Models: Principles and Practices},
year = {2023},
address = {Dublin},
publisher = {USENIX Association},
month = oct
}