Google-like Text-to-Speech API pricing with Project X

Dec 9, 2022

Overview

Google’s Text-to-Speech API is priced based on the number of characters sent to the API to be synthesized into audio each month.

Let’s take a look at their pricing table:

There are three features (essentially, three types of voices), each billed separately. A free tier covers 1 to 4 million synthesized characters, depending on the voice type. Overages are allowed and are priced differently for each voice type.
Characters include alphanumeric characters, punctuation, and whitespace.

The endpoint is called as follows:

POST https://api.tts.vendor.com/v1/text:synthesize
Authorization: Bearer <Access token>
Content-Type: application/json; charset=utf-8

{
  "input": {
    "text": "I've added the event to your calendar."
  },
  "voice": {
    "languageCode": "en-gb",
    "name": "en-GB-Standard-A",
    "ssmlGender": "FEMALE"
  },
  "audioConfig": {
    "audioEncoding": "MP3"
  }
}

Implementation

Many other billing solutions would give up at this point. The problem is that there is only one endpoint, and the distinguishing parameter (the voice name) is not easily accessible: it sits inside the JSON request body, and it is not even the name of the voice type itself.

However, Project X has all the necessary tooling to implement the exact pricing model.

First of all, we need to add the endpoint in the Project X UI:

Figure: TTS endpoint added in the Project X UI

Next, create a single Pay-as-you-go plan:

Figure: Pay-as-you-go subscription plan

Basic quota configuration

Now for the quotas. All of them apply to the same endpoint.

For Standard voices:

For WaveNet voices:

For Neural2 voices:

Note that the “Is hard limit” checkbox is unchecked for all three quotas, which makes them soft quotas that allow overages.
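The difference between a hard and a soft quota can be sketched as follows. This is an illustrative model only, not Project X code: the function `applyUsage`, the field names, and all the numbers are assumptions made up for the example.

```javascript
// Illustrative model of hard vs. soft quotas (not Project X code).
// All names and numbers here are hypothetical.
function applyUsage(quota, quantity) {
  const total = quota.used + quantity;
  if (quota.isHardLimit && total > quota.limit) {
    // Hard limit: the call is refused and recorded usage stays unchanged.
    return { accepted: false, used: quota.used, overage: 0 };
  }
  // Soft limit: the call goes through; anything above the limit is overage.
  return { accepted: true, used: total, overage: Math.max(0, total - quota.limit) };
}

const softQuota = { limit: 1000, used: 990, isHardLimit: false };
const hardQuota = { limit: 1000, used: 990, isHardLimit: true };

const softResult = applyUsage(softQuota, 38);
const hardResult = applyUsage(hardQuota, 38);

console.log(softResult);
console.log(hardResult);
```

In this model the soft quota accepts the call and reports the characters above the limit as overage, which is the part that would be billed at the overage price.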

The quota list after adding all three looks like this:

Advanced quota configuration

All API calls go to the same endpoint, but each request should consume only the quota that matches its voice type: the Standard quota for a Standard voice, the WaveNet quota for a WaveNet voice, and the Neural2 quota for a Neural2 voice.

With Project X, implementing this kind of behavior is trivial.
Two main parameters help here: quota quantity used by each API call and the condition for the quota to be used.

For Standard voices it would be as follows:
Used quantity per call — number of characters in the input text.
Condition — the quota is used if the voice belongs to Standard voices.

Project X lets you use the request body to decide how much of the quota should be consumed.

So the algorithm is to parse the request body as JSON, extract the needed value from it, and make a decision.

The tricky part is that no parameter tells the engine which type a particular voice belongs to.

Nevertheless, if we take a look at the voice name, we’ll see that it contains the voice type name.

"voice": {
    "name": "en-GB-Standard-A"
  }

Other examples of voice names are “cs-CZ-Wavenet-A” and “en-AU-Neural2-B”.

This means we can decide which quota to use based on whether the voice name contains the respective voice type name.

The implementation for the quota “Characters synthesized (Standard voices)” will look like:

Figure: Quota configuration for Standard voices

These fields must contain valid JavaScript expressions, and several variables are available to them.

Translated into plain language, the quantity expression can be read as
“Take the request body, decode it as JSON, take the value at the path input.text, and calculate its length; then consume that much of the quota”.

The condition expression can be read as
“Take the request body, decode it as JSON, take the value at the path voice.name, and, if it contains the substring ‘Standard’, use the quota”.
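As a rough sketch, the two expressions described above could look like the following in plain JavaScript. The variable name `requestBody` is an assumption made for this example; the actual variables that Project X exposes to these fields are not listed in the text and may be named differently.

```javascript
// Hypothetical stand-in for the raw request body seen by the quota fields.
// The real variable name in Project X may differ.
const requestBody = JSON.stringify({
  input: { text: "I've added the event to your calendar." },
  voice: { languageCode: "en-gb", name: "en-GB-Standard-A", ssmlGender: "FEMALE" },
  audioConfig: { audioEncoding: "MP3" },
});

// Quantity expression: the length of input.text.
const quantity = JSON.parse(requestBody).input.text.length;

// Condition expression: true when voice.name contains "Standard".
const condition = JSON.parse(requestBody).voice.name.includes("Standard");

console.log(quantity, condition); // prints: 38 true
```

Note that `length` counts UTF-16 code units, so a character outside the Basic Multilingual Plane (many emoji, for instance) counts as two. Whether that matches the vendor's own definition of a character is something to verify separately.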

Repeat the configuration for the other two quotas:

WaveNet

Neural2

Now each quota is used only when a call to the endpoint is for the respective type of voice.
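To check that the three conditions do not overlap, the routing can be simulated outside Project X. The sketch below is only a model of the behavior: the `markers` table, its keys, and the `matchingQuotas` helper are hypothetical names. One detail taken from the voice names quoted earlier is the casing: the name contains “Wavenet”, not “WaveNet”, so a case-sensitive substring check has to use the former.

```javascript
// Substring that identifies each voice type inside the voice name.
// Keys are hypothetical labels for the three quotas.
const markers = {
  standard: "Standard",
  wavenet: "Wavenet", // as in "cs-CZ-Wavenet-A", not "WaveNet"
  neural2: "Neural2",
};

// Returns the labels of all quotas whose condition holds for this request body.
function matchingQuotas(rawBody) {
  const voiceName = JSON.parse(rawBody).voice.name;
  return Object.keys(markers).filter((label) => voiceName.includes(markers[label]));
}

// The three voice names mentioned in the article.
const samples = ["en-GB-Standard-A", "cs-CZ-Wavenet-A", "en-AU-Neural2-B"];
const routed = samples.map((name) => matchingQuotas(JSON.stringify({ voice: { name } })));

console.log(routed); // each sample matches exactly one quota
```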

Additional rule

There is one non-obvious issue: if another type of voice is added to the API, customers subscribed to the current plan will be able to use it without any limits, since no quota tracks that usage.

For safety, it makes sense to add a rejection rule telling the Project X API Gateway to drop the request if its voice name matches none of the three voice types.

Here’s the rule:

Figure: Rejection rule to drop requests with unknown voice types

It checks whether the voice name contains one of the three voice types and if not, the request is dropped.
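The check itself can be sketched in plain JavaScript as below. This is an assumption about the shape of the logic, not the actual rule from the Project X UI: `knownVoiceTypes` and `shouldReject` are hypothetical names, and the voice name "en-US-FutureVoice-A" is made up to stand for a type that has no quota.

```javascript
// Voice type markers that have a quota in the plan.
const knownVoiceTypes = ["Standard", "Wavenet", "Neural2"];

// Returns true when the request should be dropped, that is, when the
// voice name contains none of the known voice type markers.
// This sketch assumes voice.name is present in the body.
function shouldReject(rawBody) {
  const voiceName = JSON.parse(rawBody).voice.name;
  return !knownVoiceTypes.some((type) => voiceName.includes(type));
}

const knownRequest = JSON.stringify({ voice: { name: "en-GB-Standard-A" } });
const unknownRequest = JSON.stringify({ voice: { name: "en-US-FutureVoice-A" } }); // made-up name

const rejectKnown = shouldReject(knownRequest);
const rejectUnknown = shouldReject(unknownRequest);

console.log(rejectKnown, rejectUnknown); // prints: false true
```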

Conclusion

That’s it. Customers can now use the API and be billed in strict accordance with the pricing model. The API vendor will be able to see the actual usage on the customer’s subscription dashboard.