Building and Load-Testing a Machine Learning Service

Introduction

The last two years have been a wild ride for the real estate market! This week we decided to try out some cool AWS technologies and build a real-time real estate price prediction service powered by Machine Learning (ML).

And yes, ahem, let’s pretend for a moment that the Zillow mishap never happened… 😬

In real-world applications, real-time ML services must be able to satisfy multiple prediction requests coming from a number of users at the same time. So, we also explored how to perform load tests using Python to test whether our ML service is capable of sustaining the desired load.

Curious to learn more about all this? Well, read on!
All the code associated with this post is available in a dedicated GitHub repository.

Building the ML Model

We start by building a simple linear regression model in a Jupyter notebook using the real estate valuation data set from the UCI Machine Learning repository. The dataset contains information about 414 houses in New Taipei City, Taiwan. For simplicity, in this post we only use one predictor: the logarithm of the distance between a house and the Mass Rapid Transit (MRT) station that is closest to it.

Note that we are intentionally oversimplifying the model building process and we are skipping all the usual best practices such as data splitting, model tuning, and evaluation of performance metrics. Our focus in this post is on how to deploy an ML model. The simplest ML model that we can build and play with will suffice. 😉
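For reference, the whole model-building step boils down to a few lines of scikit-learn. Below is a minimal sketch, assuming the UCI spreadsheet has been downloaded locally and its columns renamed to distance_to_mrt and unit_price (the file name and column names are illustrative, not the dataset’s originals):

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Load the data (the original UCI file is an Excel spreadsheet).
data = pd.read_excel("real_estate_valuation.xlsx")

# Single predictor: logarithm of the distance (in meters) to the closest MRT station.
X = np.log(data[["distance_to_mrt"]].to_numpy())
# Target: house unit price, in 10,000 New Taiwan Dollars per Ping.
y = data["unit_price"].to_numpy()

model = LinearRegression().fit(X, y)

# Save the fitted model so that it can later be shipped with the Docker image.
joblib.dump(model, "model.joblib")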

 
Linear regression of a house unit price (10,000 New Taiwan Dollars/Ping) on the logarithm of the distance (in meters) of the house from the closest Mass Rapid Transit station.

Deploying the ML Model

In the AWS realm, we can use Amazon SageMaker to deploy our ML models. SageMaker is a fully managed service to build, train, and deploy ML models. In our case, we already built and trained a model and we saved it to disk from the Jupyter notebook using the joblib library, so we won’t use SageMaker’s capabilities to build and train one. However, we can still use SageMaker to deploy it. To this end, we will use the AWS Cloud Development Kit (AWS CDK) to define all the necessary infrastructure.

At a high level, AWS CDK is a framework to define cloud infrastructure as code. It lets us define the AWS resources used by a cloud application using familiar programming languages such as TypeScript, JavaScript, Python, Java, or C#.

Now that we have pinned down a managed service to deploy our model (Amazon SageMaker) and a framework to describe and provision the necessary infrastructure in AWS (AWS CDK), all that’s left is to write the necessary code. We need to:

  1. write out the code that specifies the collection of AWS resources that make up our system (the “stack”)

  2. create a SageMaker-compatible Docker image for our ML model. If you are new to Docker, think of it as a tool to create standalone executable packages that include everything needed to run your application.

Writing Out the ML Stack

We use the TypeScript version of the AWS CDK to define our ML stack. If you prefer Python, remember that there is a Python version of the AWS CDK as well (a Python sketch follows the three snippets below). Our ML stack consists of three main components.

1. A SageMaker model, referencing a Docker image that contains the logic describing how to interact with our ML model (we will define this Docker image in a later step):

const sageMakerModel = new CfnModel(this, "SageMakerModel", {
  executionRoleArn: sageMakerRole.roleArn,
  primaryContainer: { image: sageMakerDockerImage.imageUri },
});

2. A SageMaker endpoint (i.e., the point of contact between SageMaker and our ML model) against which we will be able to make requests to obtain predictions from our model:

new CfnEndpoint(this, "SageMakerEndpoint", {
  endpointConfigName: sageMakerEndpointConfig.attrEndpointConfigName,
  endpointName: endpointOptions.endpointName,
});

3. A SageMaker endpoint configuration that includes information such as the type and the number of EC2 instances to which we want to deploy our ML model (if you are new to AWS EC2, think of EC2 instances as virtual machines representing the physical servers to which our model will be deployed):

const sageMakerEndpointConfig = new CfnEndpointConfig(
  this,
  "SageMakerEndpointConfig",
  {
    productionVariants: [
      {
        initialInstanceCount:
          endpointConfigurationOptions.initialInstanceCount,
        initialVariantWeight:
          endpointConfigurationOptions.initialVariantWeight,
        instanceType: endpointConfigurationOptions.instanceType,
        modelName: sageMakerModel.attrModelName,
        variantName: endpointConfigurationOptions.variantName,
      },
    ],
  }
);
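
As noted above, the AWS CDK is also available in Python. For reference, here is roughly what the first of these constructs (the SageMaker model) would look like with the Python CDK. This is a sketch assuming a CDK v2 stack, where self is the stack and sage_maker_role and sage_maker_docker_image are the Python counterparts of the objects used in the TypeScript snippets:

from aws_cdk import aws_sagemaker as sagemaker

# SageMaker model referencing the Docker image with our inference logic.
sage_maker_model = sagemaker.CfnModel(
    self,
    "SageMakerModel",
    execution_role_arn=sage_maker_role.role_arn,
    primary_container=sagemaker.CfnModel.ContainerDefinitionProperty(
        image=sage_maker_docker_image.image_uri
    ),
)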

Creating a SageMaker-Compatible Docker Image

For SageMaker to know how to serve the prediction requests that it receives, we must describe

  1. how our model should be loaded from disk into memory

  2. how the request data (i.e., the predictor values) should be formatted

  3. how the model should use the formatted request data to make predictions

  4. how the prediction results should be returned to the client.

We can use the sagemaker-inference-toolkit to configure an “inference handler” and accomplish all of the above.

import json

import joblib
import numpy as np
from sagemaker_inference import content_types
from sagemaker_inference.default_inference_handler import DefaultInferenceHandler
from sagemaker_inference.errors import UnsupportedFormatError


class InferenceHandler(DefaultInferenceHandler):
    def default_model_fn(self, model_dir):
        # Hard code the model artifact that ships
        # with the Docker container.
        model_path = "/opt/ml/model/model.joblib"
        return joblib.load(model_path)

    def default_input_fn(self, input_data, content_type):
        # Accept JSON, e.g. {"distance": 150.3}.
        # For simplicity, allow only scoring one
        # JSON object at a time.
        if content_type == content_types.JSON:
            data = json.loads(input_data)["distance"]
            return np.log(data).reshape(1, -1)
        raise UnsupportedFormatError(content_type)

    def default_predict_fn(self, data, model):
        # Invoke the sklearn model predict method.
        return model.predict(data)

    def default_output_fn(self, prediction, accept):
        # Create JSON response, e.g. {"cost": 25.7}.
        if accept == content_types.JSON:
            response = {"cost": round(prediction[0], 2)}
            return json.dumps(response)
        raise UnsupportedFormatError(accept)

According to the inference handler defined above, to satisfy a prediction request we:

  1. load the model from disk to memory using the joblib library

  2. format the request data by taking the logarithm of the predictor value

  3. call the predict method of our linear regression model

  4. return the model prediction as a JSON string.

Following the examples in the sagemaker-inference-toolkit documentation, we then configure a “handler service” and bundle it, together with the model artifact and the inference handler defined above, into a Docker image that SageMaker can use for real-time inference. Finally, we add this Docker image to our CDK stack:

const sageMakerDockerImage = new DockerImageAsset(
  this,
  "SageMakerDockerImageAsset",
  {
    directory: "lib/docker",
  }
);
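
For context, the “handler service” mentioned above is a thin wrapper that plugs our InferenceHandler into the toolkit’s request/response pipeline (the Docker image’s entrypoint then starts the model server pointing at it). Here is a minimal sketch modeled on the sagemaker-inference-toolkit examples; the file and module names are assumptions, so adapt them to your repository layout:

# handler_service.py
from sagemaker_inference.default_handler_service import DefaultHandlerService
from sagemaker_inference.transformer import Transformer

from inference_handler import InferenceHandler  # the inference handler defined above (module name assumed)


class HandlerService(DefaultHandlerService):
    def __init__(self):
        # Route model loading, input decoding, prediction, and output
        # encoding through our InferenceHandler.
        transformer = Transformer(default_inference_handler=InferenceHandler())
        super().__init__(transformer=transformer)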

After the stack is deployed (i.e. once the AWS resources that make up our service are successfully provisioned after executing the cdk deploy command), we can verify that the newly created SageMaker endpoint works by making a prediction request with the CLI version of the SageMaker runtime client. For instance, what is the predicted price of a house in New Taipei City that is located 150 meters away from the closest MRT station?

aws sagemaker-runtime invoke-endpoint \
  --endpoint-name real-estate-evaluation \
  --content-type application/json \
  --body $(echo '{"distance": 150}' | base64) \
  ~/Desktop/prediction.json

The response from the endpoint is

{"cost": 50.3}

which should be interpreted as 503,000 New Taiwan Dollars per Ping (1 Ping = 35.58 sq. ft.).
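
The same check can be performed from Python with the boto3 version of the SageMaker runtime client (the endpoint name below is the one configured in our stack):

import json

import boto3

runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName="real-estate-evaluation",
    ContentType="application/json",
    Body=json.dumps({"distance": 150}),
)
print(json.loads(response["Body"].read()))  # {"cost": 50.3}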

Building a REST API

The next step is to build a REST API to expose our service to its end users (e.g., real estate agents, buyers, homeowners, etc.). This can be accomplished using Amazon API Gateway. The most natural approach based on what we built so far would be to extend the current CDK stack with the necessary API Gateway components (we leave this as an exercise for the reader). For the purpose of exploration and learning, we instead decided to give AWS Chalice a try. AWS Chalice is a framework for writing serverless AWS applications in Python, and it can be used alongside the AWS CDK.

To build an AWS API Gateway REST API using Chalice, all we need to do is create a new Chalice application and define a route (/evaluate, in this example) in which we invoke our SageMaker endpoint using the Python version of the SageMaker runtime client:

@app.route("/evaluate", methods=["POST"])
def evaluate():
    response_headers = {"Content-Type": "application/json"}

    def make_code_message_body(code, message):
        return {
            "Code": code,
            "Message": message,
        }

    try:
        distance = Distance(**app.current_request.json_body)
        response = client.invoke_endpoint(
            EndpointName=options["sageMaker"]["endpointName"],
            ContentType="application/json",
            Body=json.dumps(distance.dict()),
        )
        return Response(
            body=response["Body"].read(),
            headers=response_headers,
            status_code=200,
        )
    except ValidationError as exception:
        return Response(
            body=make_code_message_body(
                "ValidationError (pydantic)", exception.errors()
            ),
            headers=response_headers,
            status_code=400,
        )
    except ...

Here you can find the full file for this Chalice application and here is the relevant AWS Chalice documentation.
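
For context, the route above relies on a pydantic model for input validation and on a boto3 SageMaker runtime client (Response and ValidationError come from chalice and pydantic, respectively). Here is a minimal sketch of what those definitions can look like, with the full Chalice file linked above as the reference:

import boto3
from pydantic import BaseModel


class Distance(BaseModel):
    # Distance (in meters) from the closest MRT station.
    distance: float


client = boto3.client("sagemaker-runtime")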

After deploying our Chalice application (the command is chalice deploy), we need to take note of the URL of the API. It will be printed to the terminal once the deployment completes, and its format will be https://<api-id>.execute-api.<aws-region>.amazonaws.com/api/.

Clients can now make real estate price prediction requests by means of this API. Front-end developers and app developers can use it to build a beautiful website or a mobile app that allows users to enter a value for the distance to the closest MRT station and get a prediction for the unit price of a house.

Here, we simply use the curl command to make a test request using the API that we just built:

curl -w "\n" \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{"distance": 1000}' \
  https://<api-id>.execute-api.<aws-region>.amazonaws.com/api/evaluate

The response that we get is

{"cost": 33.38}

which tells us that if we were to open a new Data Captains office in New Taipei City, 1 kilometer away from the closest MRT station, we should expect to pay 333,800 New Taiwan Dollars per Ping. One day, maybe!

Load-Testing the ML Service

Now that we have a live ML service and an API to interact with it, it’s a good idea to evaluate what kind of load the service can handle.

For the sake of the example, let’s assume that we forecast up to 1,000 requests per second from our users. Currently, we have deployed our model to a single c4.large EC2 instance, which features 2 vCPUs, 3.75 GiB of RAM, and “moderate” network performance (see the AWS documentation on EC2 instance types). Can this single instance handle 1,000 requests per second?

To answer this question we will load-test our ML service using locust, a convenient Python load-testing tool. Following the locust documentation, we write a locustfile containing the logic of our load test:

import random

from locust import constant_pacing, task
from locust.contrib.fasthttp import FastHttpUser

# `options` holds configuration values (e.g., the API URL) defined elsewhere in the repository.


class MyLocust(FastHttpUser):
    host = options["apiUrl"]
    wait_time = constant_pacing(1)

    @task
    def evaluate(self):
        self.client.post(
            "/evaluate",
            json={"distance": 1.0 + random.random() * 6499.0},
        )

The approach that we use here is to simulate a collection of users who are each making 1 request per second. Note that our imaginary users are making POST requests using the API that we previously built using Chalice (its URL is the value of the host class attribute in the code snippet above).

After launching locust, we can point our favorite browser to its UI (locally, it’s at http://0.0.0.0:8089).

Locust user interface.

There, we set the number of users to 1,000 so as to reach the target of 1,000 requests per second at peak. We also set the spawn rate parameter to 1 so that, starting from a single user, a new user is added every second, slowly generating more and more pressure on the system as the test progresses. Finally, we start the test by clicking the “Start swarming” button.

Here are the test results:

 
Top chart: requests per second and failed requests per second (single machine). Bottom chart: number of users.

From the top graph, we observe that our service begins to fail at about 600 requests per second (i.e. 600 users in this test). While this is not bad for a single EC2 instance, it’s clear that our service is not able to meet the target of 1,000 requests per second. The bottom graph confirms that we added 1 user every second until we reached 1,000 users.

So, what can we do? As the number of requests increases, we can distribute them across multiple EC2 instances that all serve our model. This way, no single instance gets overloaded.

Auto-Scaling

We can use AWS Auto Scaling with SageMaker and dynamically scale the fleet of EC2 instances that is available to our ML service depending on the number of requests received by the service. When more requests come in, the fleet will be “scaled out” (i.e., more instances will be added to the fleet). When traffic decreases and fewer requests come in, the fleet will be “scaled in” (i.e., some instances will be shut down and removed from the fleet).

This is a simple extension to our CDK stack, and AWS documents how to do it in one of its blog posts. All we need to do is declare SageMaker as a “scalable target” and set the minimum and maximum sizes allowed for the dynamically scalable EC2 fleet (we set the minimum to 1 and the maximum to 10 in our example). Additionally, we need to set up a “target tracking scaling policy” to dynamically scale in and scale out the fleet of EC2 instances based on the value of a target metric. In this case, the target metric is the average number of instance invocations per minute. In our example, we set its target value conservatively to 60 (i.e., additional instances will be added to the EC2 fleet as soon as the average number of requests processed by a single instance exceeds a rate of 1 request per second).

if (autoscalingOptions.useAutoScaling) {
  const resourceId = `endpoint/${endpointOptions.endpointName}/variant/${endpointConfigurationOptions.variantName}`;
  const sageMakerScalableTarget = new ScalableTarget(
    this,
    "SageMakerScalableTarget",
    {
      minCapacity: scalableTargetOptions.minCapacity,
      maxCapacity: scalableTargetOptions.maxCapacity,
      resourceId: resourceId,
      scalableDimension: scalableTargetOptions.scalableDimension,
      serviceNamespace: ServiceNamespace.SAGEMAKER,
    }
  );

  new TargetTrackingScalingPolicy(
    this,
    "SageMakerTargetTrakingScalingPolicy",
    {
      policyName: targetTrakingScalingPolicyOptions.policyName,
      predefinedMetric:
        PredefinedMetric.SAGEMAKER_VARIANT_INVOCATIONS_PER_INSTANCE,
      scaleInCooldown: Duration.seconds(
        targetTrakingScalingPolicyOptions.scaleInCooldown
      ),
      scaleOutCooldown: Duration.seconds(
        targetTrakingScalingPolicyOptions.scaleOutCooldown
      ),
      scalingTarget: sageMakerScalableTarget,
      targetValue: targetTrakingScalingPolicyOptions.targetValue,
    }
  );
}

Here is what we get after repeating the load test with this new configuration of the ML stack:

 
Top chart: requests per second and failed requests per second (with auto-scaling). Bottom chart: number of users.

Thanks to auto-scaling, our ML service can now handle 1,000 requests per second without breaking a sweat.

Remarks

It is worth taking a moment to briefly summarize the system that we built.

 
Machine Learning service schema.

In this post, we built a real-time ML service using Amazon SageMaker and AWS Auto Scaling. We built and deployed this part of the system using the AWS CDK. While we could have implemented an API for this service by extending the same CDK stack, we decided to try out AWS Chalice (which can also be used together with the AWS CDK). By means of a simple Python file, Chalice helped us implement a REST API for our ML service using Amazon API Gateway and its Lambda integration (if you are new to AWS Lambda, think of it in this case as a function that forwards the data received by API Gateway from the clients to the SageMaker endpoint). As a small side note, had we defined the API as part of the stack that we created with the AWS CDK, we would have had to be explicit about the API Gateway-Lambda integration; Chalice handled that for us behind the scenes.

We then used the Python library locust to load-test our service. For the load test, we generated traffic directly from our laptop, but for larger and more complex load tests one can also run locust in a distributed fashion (see, e.g., the locust-swarm and cdk-deployment-of-locust projects). Note that in this post we wanted to explore the use of AWS Auto Scaling, but another way to ensure that our service can handle a larger volume of requests is to select a single, more powerful EC2 instance. In a real-world setting, there are trade-offs, including cost, to weigh in order to identify the most convenient option. For example, the choice may depend on the traffic patterns: is the traffic steady at around 1,000 requests per second most of the time, or does it only reach this peak at certain times?

It is also worth pointing out that, besides REST APIs, API Gateway also offers HTTP APIs for applications with low-latency requirements. We didn’t focus on the latency characteristics of our service in this post, but locust reports latency information as part of the load test as well.

We didn’t use the capabilities that SageMaker offers to train ML models in this post, but we certainly invite interested readers to explore SageMaker in more depth. SageMaker also offers plenty of additional features beyond model building, training, and deployment. Finally, it is worth mentioning that if one is building systems in AWS and wants to estimate their costs, AWS offers a very convenient Pricing Calculator tool.

Conclusions

In this post, we showed how to deploy a pre-existing ML model using Amazon SageMaker and how to dynamically auto-scale SageMaker endpoints with AWS Auto Scaling. We used the AWS CDK to define the necessary AWS infrastructure as code. We also tried out AWS Chalice to build a REST API for our service. Finally, we described how to use the Python library locust to run load tests.


Are you trying to bring your ML models to production or need help developing ML infrastructure? Data Captains can help!
Get in touch with us at info@datacaptains.com or schedule a free exploratory call.

Edit: on April 21st, Amazon announced the GA launch of Amazon SageMaker Serverless Inference. It features up to 200 concurrent invocations per endpoint and built-in autoscaling.
