AI Business

How much can we save through compression?

Estimating the cost savings from model compression.

Semin Cheon
SqueezeBits Team Blog

--


Cost reduction in model serving is a major concern for companies that deliver AI-based products and services. The hefty price of AI deployment makes it difficult for companies to budget for inference-related expenses, most of which come from heavy use of GPU clusters. SqueezeBits offers cost-effective, practical solutions based on model compression that relieve this burden. This article illustrates the benefits of model compression by estimating the cost reduction it can achieve.

Even before deployment, the initial step of model training comes at an exorbitant price. According to a source on Chatbot Pricing, developing an in-house custom AI chatbot can cost up to $20,000 a month, and even $100,000 if the chatbot is extensively customized. Fortunately, breaking even on this hefty investment is achievable once revenue starts to flow from operating the AI model: over time, training costs are amortized against the profit generated by the model’s inferences. Yet to maximize profit further, it is not enough to recoup training expenses; serving costs must also be minimized to the fullest extent. Because inference recurs, serving costs accumulate and can become even more burdensome than training costs. Fortunately, model compression techniques can alleviate this burden. But how?

To illustrate this idea, the next section constructs a hypothetical company and its business environment to arrive at a ballpark figure for AI model serving costs. Business circumstances differ across industries, and any single prediction of how expenses will pan out is necessarily a generalization and an oversimplification. Still, the purpose of this speculative analysis is to give a basic sense of what expenditures look like before and after model compression is applied. By clarifying how costs can be optimized, we hope your business can devise a more detailed, comprehensive plan for model serving expenditure using compression technology.

A hypothetical bank intends to deploy a conversational AI chatbot that answers customers’ questions in its app. Clients send a query as a prompt and receive the generated output as an answer. We assume the following circumstances:

  • The chatbot conversations would cover general information on app features, details of the client’s savings account, personal financial advisory services, tax tracking, and more. The assumed input token count is 1,900 per request and the output token count is 160.
  • The app has 10 million monthly active users (MAU), and at least 1% of the MAU visit the app on a given day, so the app’s daily active users (DAU) number 100 thousand (100K).
  • Of the 100K people visiting the app daily, only about 20% use the AI chatbot (20K users). When interacting with the chatbot, users have around 3 transactions on average, where 1 transaction is 1 input sent and 1 output generated. This yields a total of 60K transactions (requests) per day, as worked out in the sketch below.
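As a quick sanity check, here is a minimal Python sketch of the traffic arithmetic above. All figures are the hypothetical assumptions stated in this post, not real usage data.

```python
# Hypothetical traffic assumptions from this post (not real usage data).
MAU = 10_000_000           # monthly active users
dau_ratio = 0.01           # at least 1% of MAU visit on a given day
chatbot_ratio = 0.20       # share of daily visitors who use the chatbot
transactions_per_user = 3  # average prompt/response pairs per chatbot user

dau = MAU * dau_ratio                                    # 100,000 daily active users
chatbot_users = dau * chatbot_ratio                      # 20,000 chatbot users per day
daily_requests = chatbot_users * transactions_per_user   # 60,000 requests per day

print(f"Daily requests: {daily_requests:,.0f}")          # Daily requests: 60,000
```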

Cost Estimations

To estimate model inference costs, we use OpenAI’s GPT-4 pricing as a reference: $30 per 1 million input tokens and $60 per 1 million output tokens. Given the daily request volume and token counts above, the daily cost comes to around $3,420 for input and $576 for output, totaling $3,996, close to $4K per day. That adds up to roughly $120K a month at a minimum.
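The same calculation in a short Python sketch, using the GPT-4 list prices quoted above and the hypothetical request volume and token counts from the previous section:

```python
# Back-of-the-envelope serving cost under the GPT-4 list prices quoted above.
# Request volume and token counts are the hypothetical assumptions of this post.
daily_requests = 60_000
input_tokens_per_request = 1_900
output_tokens_per_request = 160
input_price_per_million = 30.0    # USD per 1M input tokens
output_price_per_million = 60.0   # USD per 1M output tokens

daily_input_cost = daily_requests * input_tokens_per_request / 1e6 * input_price_per_million
daily_output_cost = daily_requests * output_tokens_per_request / 1e6 * output_price_per_million
daily_cost = daily_input_cost + daily_output_cost   # ~$3,996 per day
monthly_cost = daily_cost * 30                      # ~$119,880 per month

print(f"Daily: ${daily_cost:,.0f}, Monthly: ${monthly_cost:,.0f}")
```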

What would the cost reduction look like if compression were applied to the language model? Estimating post-compression expenses involves even more variables: the complexity of the network architecture, model size, compression method, hardware resources, deployment environment, desired accuracy level—the list goes on. Because conclusively pinning down the myriad factors at play in compression is difficult, we assume a conservative reduction of around 40% to keep accuracy losses minimal.

If model compression achieves a 40% reduction in computation, memory, and energy usage, cost savings of a comparable magnitude can be realized. Since model operation without compression costs about $120K per month, the ballpark figure for savings is roughly $48K per month. In a year, that grows to a whopping $576K.
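A final sketch applies the assumed 40% reduction to the monthly figure above; the percentage is an assumption of this analysis, not a measured result.

```python
# Savings under the assumed 40% cost reduction from compression (an assumption, not a measurement).
monthly_cost = 119_880             # pre-compression serving cost in USD (from the previous sketch)
compression_savings_ratio = 0.40   # assumed reduction in serving cost

monthly_savings = monthly_cost * compression_savings_ratio   # ~$47,952 per month
annual_savings = monthly_savings * 12                        # ~$575,424 per year

print(f"Monthly savings: ${monthly_savings:,.0f}, Annual savings: ${annual_savings:,.0f}")
```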

This article estimated the cost savings in AI operations to show how model compression can help companies financially. Operating an AI model is painfully expensive, and applying compression to streamline operations is now more of a requirement than an option. SqueezeBits offers affordable model compression solutions that facilitate AI deployment and operations for businesses. If you’re interested in compressing your model to maximize its potential, find us at the links below or contact us at info@squeezebits.com.
