A Wise Approach to Endpoint Selection

Vikas Rai
Thomson Reuters Labs
6 min read · May 24, 2023

Authors: Vikas Rai, Kofi Wang, Carolyn Liang, Krish Balakrishnan

This post illustrates the practical experiments and analysis required to find the most suitable endpoint for a given set of production requirements. The optimal endpoint not only handles processing most effectively but can also save thousands, or even millions, of dollars for a use case.

Endpoint implementation decision factors:

  • Processing Time: What is the maximum acceptable response duration (latency)?
  • Payload Size: How large is the input payload?
  • Processing Mode: Batch, live, or at specific times/intervals?
  • Usage Pattern: How are requests distributed over the year, month, and day?
  • Cost: Many implementations may satisfy the above factors, but some may be more cost-effective than others.

Once we have a high-level understanding of all the points above, we can produce a rough cost estimate for Lambda (CPU) and SageMaker (CPU and GPU) endpoints across different instance types.

AWS Lambda is a serverless compute service, where cost depends on duration (from the time code begins executing until it returns or otherwise terminates) and memory allocated for execution.

An AWS SageMaker endpoint is a service with a fixed cost per instance-hour, independent of execution time or the memory allocated for execution.

The SageMaker Serverless endpoint was released in December 2021 but has several limitations, including a lack of VPC support.

Let us walk through a multi-level page classification use case to understand this a bit more and see how we decided which endpoint would be best suited in this case.

  • Processing Time: Within a few minutes
  • Payload Size: JSON containing 20 to 5,000 pages of OCR PDF output
  • Processing Mode: On-demand
  • Usage Pattern: 170,000 docs/year (roughly 14,500 docs/month), 25 pages/document on average, 4,250,000 pages/year
Figure 1: Page distribution for 1 million training documents
  • Cost: Let us see the outcome for Lambda and SageMaker endpoints.
Figure 2: Endpoint cost and runtime analysis

Lambda cost calculation assumption: runtime = 0.668 s/page. You are not charged for the time it takes Lambda to prepare the function (downloading your code and starting a new execution environment, i.e., the cold start), but it does add latency to the overall invocation duration. The calculations above were carried out for batch processing.
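As a sanity check, here is a minimal back-of-the-envelope sketch of the Lambda cost model in Python. The 0.668 s/page runtime and the yearly volume come from the figures above; the 4 GB memory allocation and the per-GB-second and per-request prices are assumptions (us-east-1 list prices at the time of writing) and should be replaced with your own numbers.

```python
# Back-of-the-envelope Lambda cost estimate.
# Assumptions: 4 GB memory allocation, us-east-1 x86 list prices; treat as placeholders.
PRICE_PER_GB_SECOND = 0.0000166667   # USD per GB-second (assumed)
PRICE_PER_MILLION_REQUESTS = 0.20    # USD per 1M invocations (assumed)

RUNTIME_PER_PAGE_S = 0.668           # measured runtime per page (see above)
PAGES_PER_YEAR = 4_250_000           # from the usage pattern above
DOCS_PER_YEAR = 170_000              # one invocation per document (assumption)
MEMORY_GB = 4                        # assumed memory allocation

compute_seconds = PAGES_PER_YEAR * RUNTIME_PER_PAGE_S      # total billed duration
gb_seconds = compute_seconds * MEMORY_GB
compute_cost = gb_seconds * PRICE_PER_GB_SECOND
request_cost = DOCS_PER_YEAR / 1_000_000 * PRICE_PER_MILLION_REQUESTS

print(f"~{compute_cost + request_cost:,.0f} USD/year")     # duration-driven cost dominates
```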

Looking at the analysis above, can we confirm that Lambda is the best-suited endpoint for our requirements? No. Look at the page distribution in Figure 1: around 99% of documents have a payload size of fewer than 80 pages and only 1% have more than 80 pages.

For inferences with large payload sizes or long processing times, both Lambda and SageMaker endpoints have hard runtime limits for a single request: 900 seconds (about 15 minutes) for Lambda and 60 seconds for SageMaker.

Due to these limits, the services cannot process extremely large documents. For Lambda and for SageMaker on a GPU instance, the ceiling is roughly 1,350 pages (900 s / 0.668 s/page ≈ 1,347 for Lambda; 60 s / 0.044 s/page ≈ 1,363 for the SageMaker endpoint).

SageMaker Asynchronous Inference uses SageMaker endpoints but queues incoming requests and processes them asynchronously.

Amazon SageMaker Asynchronous Inference is a new inference option in Amazon SageMaker that queues incoming requests and processes them asynchronously. This option is ideal for:

  • Inferences with large payload sizes (up to 1GB)
  • Long processing times (up to 15 minutes) that need to be processed as requests arrive.
  • Near real-time latency requirements.

Along with that, asynchronous inference enables you to save on costs by auto scaling the instance count to zero when there are no requests to process, so you only pay when your endpoint is processing requests.

Figure 3: Asynchronous inference

Amazon SageMaker provides the following options to deploy trained machine learning models for generating inferences on new data.

  • Real-time inference is suitable for workloads where payload sizes are up to 6 MB and need to be processed with low latency requirements in the order of milliseconds or seconds.
  • Batch transform inference is ideal for offline predictions on large batches of data that are available upfront.
  • The new asynchronous inference option is ideal for workloads where the request sizes are large (up to 1 GB) and inference processing times are in the order of minutes (up to 15 minutes).

For use cases that can tolerate a cold start penalty of a few minutes (up to 5 min), you can optionally scale down the endpoint instance count to zero when there are no outstanding requests and scale back up as new requests arrive so that you only pay for the duration that the endpoints are actively processing requests.

Create an Asynchronous Inference Endpoint: Creating an asynchronous inference endpoint is similar to creating a real-time endpoint. You can use your existing Amazon SageMaker models and only need to specify additional asynchronous-inference-specific configuration parameters while creating your endpoint configuration.

Figure 4: Asynchronous configuration
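For illustration, here is a minimal boto3 sketch of such an endpoint configuration. The model name, endpoint name, instance type, bucket, and SNS topic ARNs are hypothetical placeholders, not values from our production setup:

```python
import boto3

sagemaker = boto3.client("sagemaker")

# Hypothetical names; replace with your own model, bucket, and SNS topics.
sagemaker.create_endpoint_config(
    EndpointConfigName="page-classifier-async-config",
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": "page-classifier-model",
            "InstanceType": "ml.g4dn.xlarge",
            "InitialInstanceCount": 1,
        }
    ],
    AsyncInferenceConfig={
        "OutputConfig": {
            # Where SageMaker writes the inference results.
            "S3OutputPath": "s3://my-bucket/async-inference/output/",
            "NotificationConfig": {
                "SuccessTopic": "arn:aws:sns:us-east-1:123456789012:success-topic",
                "ErrorTopic": "arn:aws:sns:us-east-1:123456789012:error-topic",
            },
        },
        # Limit how many requests a single instance handles concurrently.
        "ClientConfig": {"MaxConcurrentInvocationsPerInstance": 4},
    },
)

sagemaker.create_endpoint(
    EndpointName="page-classifier-async",
    EndpointConfigName="page-classifier-async-config",
)
```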

To invoke the endpoint, you need to place the request payload in Amazon S3 and provide a pointer to the payload as a part of the invocation request. Upon invocation, Amazon SageMaker enqueues the request for processing and returns an output location as a response. Upon processing, Amazon SageMaker places the inference response in the previously returned Amazon S3 location.
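With boto3, the invocation looks roughly like the following, assuming the payload has already been uploaded to S3 and reusing the hypothetical endpoint name from the previous sketch:

```python
import boto3

runtime = boto3.client("sagemaker-runtime")

# The request payload already lives in S3; we only pass a pointer to it.
response = runtime.invoke_endpoint_async(
    EndpointName="page-classifier-async",   # hypothetical endpoint name
    InputLocation="s3://my-bucket/async-inference/input/document-ocr.json",
    ContentType="application/json",
)

# SageMaker returns immediately with the S3 location the result will be written to.
output_location = response["OutputLocation"]
print(output_location)
```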

Figure 5: Get the response back

You can optionally choose to receive success or error notifications via Amazon Simple Notification Service (SNS), in addition to retrieving the response from the output location.
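If you do not wire up SNS, one simple alternative is to poll the returned output location until the result object appears. The helper below is a hypothetical sketch, not part of any SageMaker SDK:

```python
import time
from urllib.parse import urlparse

import boto3
from botocore.exceptions import ClientError


def wait_for_result(output_location, timeout_s=900, poll_s=15):
    """Poll the async inference OutputLocation in S3 until the result exists."""
    s3 = boto3.client("s3")
    parsed = urlparse(output_location)               # e.g. s3://bucket/key
    bucket, key = parsed.netloc, parsed.path.lstrip("/")
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            obj = s3.get_object(Bucket=bucket, Key=key)
            return obj["Body"].read()                # raw inference response
        except ClientError as err:
            if err.response["Error"]["Code"] != "NoSuchKey":
                raise                                # real error, not "not ready yet"
        time.sleep(poll_s)
    raise TimeoutError(f"No result at {output_location} after {timeout_s}s")
```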

Auto-scale asynchronous endpoint: To auto-scale your asynchronous endpoint you must at a minimum:

  • Register a deployed model (production variant).
  • Define a scaling policy.
  • Apply the autoscaling policy.

Define a Scaling Policy that Scales to 0: The following shows how to register your endpoint variant with Application Auto Scaling and define its scaling policy using the AWS SDK for Python (Boto3). After creating a low-level client for Application Auto Scaling with Boto3, we use the RegisterScalableTarget API to register the production variant. We set MinCapacity to 0 because Asynchronous Inference enables you to auto-scale to zero when there are no requests to process.

Figure 6: Auto-scale register endpoint
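Here is a minimal boto3 sketch of both steps, reusing the hypothetical endpoint and variant names from the earlier sketches. The target-tracking policy on the ApproximateBacklogSizePerInstance metric is one common choice for Asynchronous Inference; the target value, capacities, and cooldowns are illustrative:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

endpoint_name = "page-classifier-async"          # hypothetical endpoint name
resource_id = f"endpoint/{endpoint_name}/variant/variant1"

# Step 1: register the production variant, allowing it to scale down to zero.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=0,      # Asynchronous Inference allows scaling all the way to zero
    MaxCapacity=5,
)

# Step 2: apply a target-tracking policy on the per-instance request backlog.
autoscaling.put_scaling_policy(
    PolicyName="async-backlog-scaling",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 5.0,  # aim for ~5 queued requests per instance (illustrative)
        "CustomizedMetricSpecification": {
            "MetricName": "ApproximateBacklogSizePerInstance",
            "Namespace": "AWS/SageMaker",
            "Dimensions": [{"Name": "EndpointName", "Value": endpoint_name}],
            "Statistic": "Average",
        },
        "ScaleInCooldown": 600,
        "ScaleOutCooldown": 300,
    },
)
```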

Here’s a summary of the available options with their pros and cons:

Figure 7: Summary report

Conclusion: The best endpoint solution depends on key factors like processing time, payload size, processing mode, usage pattern, and cost.

For our case, we opted for a multi-endpoint solution to optimize processing cost and deal with the timeout issue. On one hand, Lambda was the most cost-effective endpoint solution per the business requirements but failed to process large documents (>80 pages) due to the timeout issue. On the other hand, Async Inference was the clear winner for the large payloads that Lambda and the SageMaker real-time endpoint could not process, and we chose to auto-scale it to optimize the processing cost.

With a multi-endpoint design, we effectively get the best of both worlds while staying aligned with the production requirements.
