Measuring and Managing Machine Learning Performance

An Introduction to Site Reliability Engineering

Typically, we develop software because we have identified a specific user need and we believe that we can satisfy that need with our software. As part of the process, we must be able to describe what users should expect from using our software, and we must be able to measure whether those expectations are met. This is true especially - but not exclusively - when we ask our users to pay us in exchange for the opportunity to use our software. This principle also applies irrespective of whether our users are people (e.g., we are developing an application) or other software (e.g., we are developing a service).

Consider, for example, Amazon SageMaker, Amazon’s cloud Machine Learning (ML) platform. Like other cloud service providers, Amazon identified a specific user need: the need for a platform to build and deploy ML models in the cloud. So, it built one. Paying users of Amazon SageMaker - who in this case are themselves developers - have expectations about their interactions with the platform. At a minimum, they expect that the platform will be “up” and operational when they need it to build and deploy their models. They would be happiest if the platform were operational 100% of the time. More realistically, they understand that incidents and outages are bound to happen, and that Amazon SageMaker - like any other platform - will unexpectedly cease to work from time to time (rarely, hopefully). Even so, these users still expect Amazon to make a reasonable effort to keep the “uptime” of SageMaker as close as possible to 100%: should SageMaker experience too much “downtime”, its users will likely become unhappy, ask Amazon for a refund, and look elsewhere for a more reliable platform.

Besides uptime, users of Amazon SageMaker have expectations about other measurable characteristics as well. For example, if they are using it to build real-time ML models, they might have expectations about the platform’s latency (how fast is it to process requests for real-time model inference?), the platform’s throughput (how many real-time model inference requests can be handled in a given unit of time?), and so on.

This illustrates the fact that - as developers and system owners - we should always challenge ourselves with questions such as the following:

  • What promises should we make to our users regarding important characteristics of our systems such as uptime, latency, throughput, etc.?

  • What practices can we adopt to help ourselves design, develop, and manage systems that are capable of fulfilling these promises?

  • How can we measure whether our systems are in fact fulfilling these promises?

  • What are the implications of our failure to operate our systems in such a way that these promises are fulfilled?

The importance of these questions is underscored by the existence of a dedicated set of principles and practices - Site Reliability Engineering (SRE) - that provides effective tools to answer them.

SRE originated at Google in 2003, and its birth is credited to Benjamin Treynor Sloss. For the first time, a team of software engineers was explicitly tasked with making Google services more reliable, more efficient, and more scalable. These practices were later adopted and expanded by other large technology companies and ultimately came to influence the entire software industry.

SRE teams support software organizations in other ways as well. For example, they help define how code is deployed and monitored, they build and use software to manage systems and infrastructure, they create solutions that standardize and automate repetitive tasks (reducing the time developers spend on manual operations and freeing them up to build new user features), and more. In Google’s own words, “SRE is what you get when you treat operations as if it’s a software problem.”

Today, most user experiences are powered by intelligent data-driven systems. This is why, now more than ever, being familiar with some SRE principles is relevant for the development and the management of ML systems.

SLAs, SLOs, and SLIs

If you are building ML systems - or any system, really - getting familiar with some SRE concepts is time well spent. SLAs, SLOs, and SLIs are a great starting point because they provide a mental framework that we can adopt to think about what promises we are making to our users, and how we are going to measure whether these promises are being fulfilled.

In SRE jargon, a Service-Level Agreement (SLA) is an agreement between the owners of a system and its users about measurable characteristics of that system. The agreement may also specify how the system owners will compensate users if they fail to satisfy its terms (in which case it is often referred to as a financially-backed SLA).

For example, let’s consider Amazon SageMaker’s online inference function, which allows client applications to get inferences from ML models deployed using Amazon SageMaker. The SLA for this function formalizes the promise between Amazon and its paying users with respect to the function’s monthly uptime (i.e., the percentage of time in a month in which the function is operational and able to fulfill its intended task). At the time of writing, Amazon commits to ensuring that this function is available and operational for at least 99.95% of the time in any given month. The SLA also describes how Amazon will compensate its users if this commitment is not honored. For instance, if the monthly uptime of the function falls below 95% in a given month, users are entitled to a 100% refund of the charges that they paid to Amazon to use the function in that same month. Similarly, if the monthly uptime of the function is at least 95% but lower than 99%, users are entitled to a 25% refund, and if it is at least 99% but lower than 99.95%, users are entitled to a 10% refund.

Refund Policy from the Amazon SageMaker SLA for the Online Inference Function.
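
To make the tiers concrete, here is a minimal Python sketch of the refund schedule described above (the function name is ours, and the boundary handling is simplified; the actual SLA text is authoritative):

```python
def sagemaker_refund_percentage(monthly_uptime: float) -> int:
    """Return the refund percentage owed for a given monthly uptime (in percent),
    following the tiers described in the SLA above."""
    if monthly_uptime >= 99.95:
        return 0    # SLA met: no refund due
    elif monthly_uptime >= 99.0:
        return 10   # below 99.95% but at least 99%
    elif monthly_uptime >= 95.0:
        return 25   # below 99% but at least 95%
    else:
        return 100  # below 95%: full refund

print(sagemaker_refund_percentage(99.97))  # 0
print(sagemaker_refund_percentage(98.50))  # 25
print(sagemaker_refund_percentage(94.80))  # 100
```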

From a superficial read, one might think that 95% uptime in a given month is not that bad. To put things in perspective, however, less than 95% monthly uptime corresponds to more than 1.5 days of downtime (see Google’s handy availability table). For Amazon SageMaker users whose ML models depend on the online inference function, having to plan for 1.5 days or more of unavailability every month would be unacceptable. Hence the full refund clause in the SLA.
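
To see where these numbers come from, here is a small Python sketch that converts an uptime percentage into the corresponding monthly downtime, assuming a 30-day month:

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month

def monthly_downtime_minutes(uptime_percent: float) -> float:
    """Downtime allowed in a month for a given uptime percentage."""
    return MINUTES_PER_MONTH * (1 - uptime_percent / 100)

for target in (95.0, 99.0, 99.95, 99.99):
    print(f"{target}% uptime -> {monthly_downtime_minutes(target):,.2f} minutes of downtime")
# 95.0%  -> 2,160.00 minutes (1.5 days)
# 99.0%  ->   432.00 minutes (7.2 hours)
# 99.95% ->    21.60 minutes
# 99.99% ->     4.32 minutes
```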

Writing an SLA for your system is never a bad idea: it forces you to think critically about what you can and cannot guarantee to your users. It also forces you to scrutinize the design of your system and to focus on the ways in which it may fail to deliver on its promises. Based on what you learn in the process, you will be in a better position to improve your design. Even when your users are other teams in your company (or their systems), writing SLAs provides a lot of value, because SLAs manage expectations and help everyone build more robust software.

System owners commonly break down an SLA into one or more Service-Level Objectives (SLOs). An SLO describes a precise numerical target for a given characteristic of a system that should be met or exceeded in order for the SLA to be fulfilled.

As we saw in its SLA, the target uptime of Amazon SageMaker’s online inference function is at least 99.95% in any given month (i.e., no more than 21.6 minutes of monthly downtime). We can deduce that Amazon most likely has an internal SLO for the monthly uptime of this function that is stricter than 99.95% (e.g., 99.99%, or no more than 4.32 minutes of monthly downtime). System owners typically set their SLOs to slightly higher standards than their SLAs to help ensure that the promises made in the SLAs can be fulfilled. This is a defensive mechanism: as a system owner, you want to avoid finding yourself in a position where the promises stated in your SLAs are not fulfilled and your organization has to compensate your users.

Finally, one or more Service-Level Indicators (SLIs) must be defined to measure how well a system complies with a given SLO. Ideally, SLIs make it possible to unequivocally determine whether or not a given SLO is met or exceeded. As a system owner, you will hold yourself accountable for making sure that your system’s SLIs hit the targets defined by your system’s SLOs. As part of your system monitoring and management, you will want to set up alarms to quickly detect when your system’s SLIs slide below their target SLOs, and you will want to have remediation plans in place that can be followed to bring your system back to a normal state as quickly as possible when an alarm is triggered.
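
As a toy illustration, a monitoring job might periodically compare the latest SLI measurement against its SLO target and raise an alarm when the target is missed. The sketch below is hypothetical: the SLO value is made up, and the print statements stand in for whatever paging or incident-management hook your monitoring stack provides.

```python
UPTIME_SLO = 99.99  # hypothetical internal target, stricter than the 99.95% promised in the SLA

def check_uptime_slo(measured_uptime: float) -> None:
    """Compare the latest uptime SLI measurement (in percent) against the SLO."""
    if measured_uptime < UPTIME_SLO:
        # In a real system this would page the on-call engineer / open an incident.
        print(f"ALARM: uptime SLI {measured_uptime:.3f}% is below the {UPTIME_SLO}% SLO")
    else:
        print(f"OK: uptime SLI {measured_uptime:.3f}% meets the {UPTIME_SLO}% SLO")

check_uptime_slo(99.996)  # OK
check_uptime_slo(99.981)  # ALARM
```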

In the example of Amazon SageMaker’s online inference function, the SLI for monthly uptime is defined in a precise way as the average of the availability of all 5-minute intervals in a monthly billing cycle. Availability is defined for each 5-minute interval as the percentage of requests processed by the function within that interval that do not fail because of Amazon’s own fault.

Definitions from the Amazon SageMaker SLA for the Online Inference Function.
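
In code, the uptime SLI described above might be computed along the following lines. This is a sketch under our own assumptions: the interval data is made up, and in practice the request and failure counts would come from your logs or metrics system, with failures attributable to the client excluded.

```python
# (total_requests, failed_requests) for each 5-minute interval in the billing cycle
intervals = [
    (1200, 0),
    (1150, 2),
    (1300, 0),
    # ... one entry per 5-minute interval in the month
]

def monthly_uptime_sli(intervals) -> float:
    """Average, over all 5-minute intervals, of the percentage of requests that did not fail."""
    availabilities = []
    for total, failed in intervals:
        # Treat an interval with no requests as fully available.
        availability = 100.0 if total == 0 else 100.0 * (total - failed) / total
        availabilities.append(availability)
    return sum(availabilities) / len(availabilities)

print(f"Monthly uptime SLI: {monthly_uptime_sli(intervals):.3f}%")  # 99.942%
```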

Application to Machine Learning Systems

We used an SLA from Amazon SageMaker as an example and we focused on one important characteristic of that system: uptime. However, we can (and should!) apply the principles behind the SLA-SLO-SLI framework to any characteristic of our systems that influences the interactions between our systems and their users.

In this regard, ML systems are particularly interesting because they influence the user experience through both system performance (e.g., uptime, latency, throughput) and model quality (e.g., accuracy, precision, recall, or other statistical measures pertinent to a particular type of model). Accordingly, users have expectations about both. To illustrate this, let’s consider a real-world example.

Imagine that you and your team are tasked with building an ML-based speech recognition system for a smart assistant (e.g., Amazon Alexa). For this system, the model quality metrics will be as important as the system performance metrics: your speech recognition system will need to correctly understand what users are saying (accuracy), and it will have to do so quickly (latency). Your users expect the smart assistant to understand what they are saying and to react accordingly, and they expect it to do so as close as possible to how a human would. If your speech recognition system can’t understand what users are saying, it will be worthless. And if it is too slow at understanding, it will make the smart assistant feel sluggish and lead to an awkward user experience. In this case, it is therefore important to set aggressive system performance SLOs as well as aggressive model quality SLOs.
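
One simple way to make such targets explicit is to write them down side by side. The sketch below is purely illustrative: the metric names and numbers are our own placeholders, not recommendations.

```python
# Hypothetical SLO targets for a speech recognition service, covering both
# system performance and model quality. All values are placeholders.
speech_recognition_slos = {
    # System performance
    "availability_monthly_percent": {"target": 99.95, "comparison": ">="},
    "latency_p99_milliseconds":     {"target": 300,   "comparison": "<="},
    # Model quality
    "word_error_rate_percent":      {"target": 8.0,   "comparison": "<="},
}

def slo_met(slo: dict, measured: float) -> bool:
    """Check a measured value against a single SLO entry."""
    if slo["comparison"] == ">=":
        return measured >= slo["target"]
    return measured <= slo["target"]

print(slo_met(speech_recognition_slos["latency_p99_milliseconds"], 275))  # True
print(slo_met(speech_recognition_slos["word_error_rate_percent"], 11.2))  # False
```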

It should be noted that it can be substantially harder to measure the model quality metrics of a live ML system than its system performance metrics. For example, you can certainly use historical speech data (with ground truth transcriptions) to measure the accuracy of your speech recognition system before making it live for your users. But measuring and monitoring its real-time accuracy on live traffic is a much harder task than measuring and monitoring its real-time latency (e.g., because you will not have ground truth transcriptions available in real time for the utterances spoken by your users). Keeping an eye on the accuracy of a live ML system is important: the accuracy of an ML model can degrade over time compared to the accuracy that the model achieved when it was trained, a problem known as “model drift”. This is a fascinating topic, and specific monitoring tools have been developed to help system owners detect drift in an ML system (see, e.g., Amazon SageMaker Model Monitor).
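
Because ground truth is not available in real time, drift detection typically relies on proxy signals. One common approach (shown below as a rough sketch, and not how Amazon SageMaker Model Monitor works internally) is to compare the distribution of a proxy metric, such as the model’s confidence scores on live traffic, against a baseline captured when the model was validated:

```python
import numpy as np
from scipy.stats import ks_2samp

# Simulated confidence scores; in practice these would come from your prediction logs.
rng = np.random.default_rng(0)
baseline_scores = rng.beta(8, 2, size=5_000)  # scores recorded at validation time
live_scores = rng.beta(5, 3, size=5_000)      # scores observed on recent live traffic

# Two-sample Kolmogorov-Smirnov test: a small p-value suggests that the live
# distribution has shifted away from the baseline, which may indicate model drift.
statistic, p_value = ks_2samp(baseline_scores, live_scores)
if p_value < 0.01:
    print(f"Possible drift detected (KS statistic = {statistic:.3f}, p = {p_value:.2g})")
else:
    print("No significant shift detected")
```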

Conclusion

Considering the growing importance of ML, there has never been a better time for ML developers to get familiar with SRE and to apply its principles.

If you are building a new ML system, some of your top priorities should be understanding what performance and accuracy goals your system needs to achieve in order to meet the needs of your users, how you are going to design your system so that these goals are achievable, how you are going to measure whether your system achieves them, and how you are going to react when your system enters a state in which it can no longer achieve them. These are hard challenges. The SLA-SLO-SLI framework can help you be more intentional and more successful in addressing them.

If you want to learn more about SRE, Google’s dedicated blog and books are great resources.


Are you trying to build Machine Learning systems? Data Captains can help you!
Get in touch with us at info@datacaptains.com or schedule a free exploratory call.
