AI-Powered Media Processing: 4 Steps To Automate Audio Story Annotation and Poster Creation

Andrew Zaikin
firstlineoutsourcing
11 min read · Jul 10, 2023

This article is the first in a series of two. Subscribe to our blog to be notified about new articles.

Artificial intelligence is dramatically reshaping the media industry, offering new methods for creating, editing, and managing media files. Advanced AI algorithms can automate tasks such as video editing, image enhancement, and audio mastering, improving efficiency and reducing the need for human intervention. Additionally, AI-driven tools enable innovative content generation and personalization, revolutionizing how media is produced and consumed.

We at First Line Outsourcing help media and broadcasting companies with challenges in their daily workflows. The case this article covers involves a large archive of audio stories: a collection of around 1,600 MP3 files of narrated text set against exceptional electronic music. We built a great mobile application that shows story titles, authors, annotations, tags, and other info.

MDS Collection App

A number of these stories have no annotation or image, and for some reason the original texts are not available to us. As a result, we have a perfect task for AI: fill in the annotation and image for each story where these fields are missing. We'll focus on a single file to start and figure out the steps we need to take to hit our target. So, what kind of tasks do we need to get done?

Transcription

First, we'll need to get our hands on the original narrative text from the audio file. To do this, we'll leverage a process known as transcription, or speech-to-text. There's a whole spectrum of services out there that offer this kind of tool, and a good chunk of them even provide APIs for smooth integration.

Now, each of these tales comes with some baggage — advertisements, intros, and outros that we’ll need to snip out of the text. We don’t want these to interfere with the plot, after all. For this job, we’re going to use OpenAI’s Speech to text feature. What about limits?

File uploads are currently limited to 25 MB and the following input file types are supported: mp3, mp4, mpeg, mpga, m4a, wav, and webm.

The files we have are in MP3 format, but their sizes vary, and the majority exceed 25 MB. We can split them and process the parts in parallel directly in the code. What about the price?

$0.006 / minute (rounded to the nearest second)

In this case, one hour of content costs $0.36. Compare that to Amazon Transcribe: $0.024 / minute for the first 250,000 minutes, which works out to $1.44 / hour.
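As a quick sanity check, here's the same math in code, using the prices quoted above:

const OPENAI_PER_MINUTE = 0.006; // OpenAI Speech to text
const AWS_PER_MINUTE = 0.024;    // Amazon Transcribe, first 250,000 minutes

console.log(`OpenAI: $${(OPENAI_PER_MINUTE * 60).toFixed(2)} / hour`); // $0.36
console.log(`Amazon: $${(AWS_PER_MINUTE * 60).toFixed(2)} / hour`);    // $1.44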

The OpenAI Node.js library has a createTranscription method on its client interface. Media files themselves are better handled with suitable tools, like ffmpeg. Let's split a file into 20-minute (1200-second) chunks.

await run(`ffmpeg -i ${filePath} -f segment -segment_time 1200 -c copy ${folderPath}/part%03d.mp3`);
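The run helper isn't shown in the article; a minimal version could simply promisify Node's exec (this assumes ffmpeg is installed and available on the PATH):

import { exec } from 'child_process';
import { promisify } from 'util';

// Minimal run() helper: executes a shell command and resolves when it finishes
const run = promisify(exec);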

Grab all the parts and process them in parallel with the openai client. By the way, what about API limits?

Pay-as-you-go users: 50 RPM

We'll keep this in mind, but we'll put it to use in the next article of this series. For now, let's make a createTranscription request to get the text.

// Collect the chunk files, skipping the source file and macOS metadata
const files: string[] = (await readdir(folderPath))
  .filter((file: string) => file !== 'source.mp3' && file !== '.DS_Store');

const textArray: string[] = await Promise.all(files.map(async (file: string): Promise<string> => {
  const response = await openai.createTranscription(
    createReadStream(`${folderPath}/${file}`) as any,
    'whisper-1', // model
    undefined,   // prompt
    'json',      // responseFormat
    0.7,         // temperature
    'en'         // language
  );
  return response.data.text;
}));
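With every part transcribed, the full transcript is just the per-part texts joined back together (to be safe, sort the file list by name first so the order matches the ffmpeg part numbering):

const fullText: string = textArray.join(' ');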

As you can see, we have to pass a number of arguments to createTranscription:

  • model: ID of the model to use. Only `whisper-1` is currently available.
  • prompt: An optional text to guide the model’s style or continue the previous audio segment. The prompt should be in English.
  • responseFormat: The format of the transcript output, in one of these options: json, text, srt, verbose_json, or vtt.
  • temperature: The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit.
  • language: The language to recognize. Whisper supports a large number of languages.

We forgot one thing here…

The openai client uses Axios under the hood, and Axios limits the request body size by default. Let's lift that limit by passing options as the last argument to the createTranscription method:

{
  maxContentLength: Infinity,
  maxBodyLength: Infinity
}
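With those options in place, the complete call looks like this (in the openai v3 Node client, the Axios config is the last parameter):

const response = await openai.createTranscription(
  createReadStream(`${folderPath}/${file}`) as any,
  'whisper-1',  // model
  undefined,    // prompt
  'json',       // responseFormat
  0.7,          // temperature
  'en',         // language
  { maxContentLength: Infinity, maxBodyLength: Infinity } // bypass the Axios body size limits
);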

The source MP3 file, with a duration of 47:13, was processed in 1:40. The resulting text is 34,118 characters long.

From time to time I run into Error: Request failed with status code 400; it's a good idea to catch it with a try-catch block and resend the request.
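A minimal retry wrapper could look like this (a sketch; the attempt count and delay are arbitrary, so tune them to your workload):

async function withRetry<T>(fn: () => Promise<T>, attempts = 3, delayMs = 1000): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= attempts) throw err;
      // Back off a little longer after each failed attempt
      await new Promise((resolve) => setTimeout(resolve, delayMs * attempt));
    }
  }
}

// Usage: wrap the transcription call
// const response = await withRetry(() => openai.createTranscription(/* ... */));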

Upon reviewing the usage report on the OpenAI API page, I discovered that it charged me $0.29.

Due to the nature of AI, you can't expect identical output on every run: results (text length, execution time, and cost) can vary.

It’s time to create a summary, annotation, and image for it.

Summary

According to the original academic paper from OpenAI, task decomposition essentially means splitting a complicated task into smaller, more manageable ones. In the context of book summarization, the task can be systematically segmented into a hierarchy of summarization tasks, where only the terminal (leaf) tasks directly engage with sections of the original book. The illustration below demonstrates this strategy in practice.

Thanks to Massimiliano Costacurta for his article. Our audio stories aren't long enough to require the full OpenAI approach, so we can use his algorithm:

  • In the initial iteration, we divide the input text into segments with a 20% overlap. This overlap is designed to ensure continuity and context preservation across segments. A summary is then generated for each segment.
  • In the subsequent iterations, we concatenate the summaries from the previous iteration and split this combined summary into new segments, again maintaining a 20% overlap.
A 20% overlap keeps context flowing from one chunk to the next.

Working with raw characters is tricky if you want to preserve a natural narrative while converting text into "tokens" for the OpenAI API. There are packages for this kind of task, such as natural, a general natural language facility for Node.js. Let's split the text into sentences and build chunks capped at 4096 tokens. Why?

The total number of tokens in an API call affects:

- How much your API call costs, as you pay per token

- How long your API call takes, as writing more tokens takes more time

- Whether your API call works at all, as total tokens must be below the model’s maximum limit (4096 tokens for gpt-3.5-turbo)

import { WordTokenizer, SentenceTokenizer } from 'natural';

export function splitTextIntoChunks(text: string, maxTokens = 4096, overlapRatio = 0): string[][] {
  const sentenceTokenizer = new SentenceTokenizer();
  const sentences = sentenceTokenizer.tokenize(text);
  const wordTokenizer = new WordTokenizer();
  const sentenceSets: string[][] = [];
  let currentSet: string[] = [];
  let currentSetLength = 0;
  const overlapCount = Math.floor(maxTokens * overlapRatio);

  // Prepend the tail of the previous chunk to the current one
  // until the desired overlap size is reached
  function addOverlapFromLastSet() {
    let tokensToAdd = 0;
    const lastSet = sentenceSets[sentenceSets.length - 1];
    if (lastSet) {
      let index = 1;
      // Guard on index so a previous chunk shorter than the overlap can't loop forever
      while (tokensToAdd < overlapCount && index <= lastSet.length) {
        const sentenceLength = lastSet[lastSet.length - index]?.length || 0;
        if (tokensToAdd + sentenceLength < overlapCount) {
          tokensToAdd += sentenceLength;
          currentSet.unshift(lastSet[lastSet.length - index]);
        } else {
          break;
        }
        index++;
      }
    }
  }

  for (const [index, sentence] of sentences.entries()) {
    const tokens = wordTokenizer.tokenize(sentence);
    // Approximate the token count with the character length of the tokenized sentence
    const sentenceLength = tokens.join(' ').length;
    if (currentSetLength + sentenceLength <= maxTokens * (1 - overlapRatio)) {
      currentSet.push(sentence);
      currentSetLength += sentenceLength;
    } else {
      // The current chunk is full: prepend the overlap, close it, start a new one
      addOverlapFromLastSet();
      sentenceSets.push(currentSet);
      currentSet = [sentence];
      currentSetLength = sentenceLength;
    }
    // Don't forget the last, possibly incomplete chunk
    if (index === sentences.length - 1) {
      addOverlapFromLastSet();
      sentenceSets.push(currentSet);
    }
  }
  return sentenceSets;
}
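For example, chunking the full transcript with the 20% overlap described above:

// fullText is the joined transcription from the first step
const chunks: string[][] = splitTextIntoChunks(fullText, 4096, 0.2);
console.log(`Split into ${chunks.length} chunks`);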

Alright, we have all the chunks. Time to summarize with the createChatCompletion method.

export async function openAISummarizer(openai: OpenAIApi, inputText: string, model: number = 3): Promise<string> {
  const prompt: string = "Write a summary of this text less than 250 words:\n\n";
  const completionsModel: string = model === 3 ? "gpt-3.5-turbo" : "gpt-4";
  const messages: any[] = [{ "role": "user", "content": prompt + inputText }];
  try {
    return (await openai.createChatCompletion({
      model: completionsModel,
      messages,
      temperature: 0.5
    })).data.choices[0].message.content.trim();
  } catch (err) {
    console.log(err);
    throw err;
  }
}

The temperature here is what sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.

As a result, the combined summaries may still exceed the 4096-token limit. Recursion solves this issue:

export async function iterativeSummarization(openai: OpenAIApi, text: string, maxToken: number): Promise<string> {
  // Split the input text into chunks with a maximum token limit and 20% overlap between chunks
  const chunks: string[][] = splitTextIntoChunks(text, maxToken, 0.2);
  // Summarize all chunks in parallel
  const summaries: string[] = await Promise.all(chunks.map(chunk => openAISummarizer(openai, chunk.join(' '))));
  // If the combined summary is still too long, run another summarization pass over it
  if (summaries.join(' ').length > maxToken) {
    return await iterativeSummarization(openai, summaries.join(' '), maxToken);
  }
  return summaries.join(' ');
}
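Kicking it off on the full transcript is a one-liner:

const summary: string = await iterativeSummarization(openai, fullText, 4096);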

The timing’s a bit bad right now, but fear not. We’ll fine-tune that in our next piece. We’ll be pulling back the curtain on the process architecture and strategizing ways to rev up the optimization engine.

Annotation

In our quest to craft a good synopsis, we need to carefully create a query for ChatGPT. The aim is to make the outcome interesting and exciting without giving away any important plot details. To find the best result, you might need to try different combinations of queries until you get exactly what you want. We settled on this prompt:

export async function makeAnnotation(openai: OpenAIApi, inputText: string, model: number = 3): Promise<string> {
  const prompt: string = "Shorten and paraphrase this summary of the story from 80 to 120 words. As a result, it should be when and where the action takes place, the image of the main character and why his fate will be interesting to the reader, the problem that the hero faces, the question of whether the hero can solve his problem.";
  const completionsModel: string = model === 3 ? "gpt-3.5-turbo" : "gpt-4";
  const messages: any[] = [{ "role": "user", "content": prompt + inputText }];
  try {
    // First pass: produce a draft annotation
    const response: string = (await openai.createChatCompletion({
      model: completionsModel,
      messages,
      temperature: 1
    })).data.choices[0].message.content.trim();
    // Second pass: feed the draft back and ask for a revision
    messages.push(
      { "role": "assistant", "content": response },
      { "role": "user", "content": "Revise this. More intrigue, less plot. From 80 to 120 words. Use questions related to the plot." }
    );
    return (await openai.createChatCompletion({
      model: completionsModel,
      messages,
      temperature: 1
    })).data.choices[0].message.content.trim();
  } catch (err) {
    console.log(err);
    throw err;
  }
}

As a result, we have:

“The boy spends time with his grandfather, who teaches him to appreciate the present moment. But an inevitable loss comes in his life: grandfather stayed only for two months. In the oppressive atmosphere of a dying city, the boy seeks to keep hope, telling his grandfather about the future. However, when he decides to travel to the future, he discovers a grim reality and faces danger. Questioning his ability to overcome obstacles, we’ll find out if the boy can find new hope in a shattered world. When meeting with the last survivor, the story takes a turn, and the boy realizes that only here, in this new world, can he find a new home and find happiness. This mysterious and emotional story captivates the reader, forcing them to ask questions about the meaning of life and the power of hope.”

Let’s move forward and generate a picture with this synopsis.

Picture

Frankly, DALL-E’s performance isn’t floating our boat. The test results have been super abstract and ugly.

https://openai.com/dall-e-2

Instead, we'll use Midjourney. It is a generative artificial intelligence program and service created by the San Francisco-based independent research lab Midjourney, Inc. The program generates images from natural language descriptions, or "prompts", similar to OpenAI's DALL-E and Stable Diffusion. If we want to generate an excellent picture, we must craft a prompt that fits perfectly.

async function makePromptForMJ(openai: OpenAIApi, inputText: string, model: number = 3): Promise<string> {
  const prompt: string = "Determine the genres of the story. Simply list 8 genres and key details that will help characterize it, separated by commas.";
  const completionsModel: string = model === 3 ? "gpt-3.5-turbo" : "gpt-4";
  const messages: any[] = [{ "role": "user", "content": prompt + inputText }];
  try {
    // First pass: extract genres and key details from the story
    const response: string = (await openai.createChatCompletion({
      model: completionsModel,
      messages,
      temperature: 0.5
    })).data.choices[0].message.content.trim();
    // Second pass: turn the genre list into an image-generation prompt
    messages.push(
      { "role": "assistant", "content": response },
      { "role": "user", "content": 'Write a request based on the answer you provided and the original text of the story to generate an image in DALL·E. Start with "Create an image depicting a scene from the story where"' }
    );
    return (await openai.createChatCompletion({
      model: completionsModel,
      messages,
      temperature: 0.5
    })).data.choices[0].message.content.trim();
  } catch (err) {
    console.log(err);
    throw err;
  }
}

Then we need to send the prompt to Midjourney. It doesn't have an official API, so we have to go through Discord for this integration. The midjourney client lets you connect to your Discord server and use the bot.

import { Midjourney } from "midjourney";

const client = new Midjourney({
  ServerId: env.SERVER_ID,
  ChannelId: env.CHANNEL_ID,
  SalaiToken: env.SALAI_TOKEN,
  Debug: true,
  Ws: true,
});

export async function createPicture(openai: OpenAIApi, summary: string): Promise<string> {
  const prompt = await makePromptForMJ(openai, summary);
  await client.init();
  // Generate a 2x2 grid of candidate images
  const Imagine = await client.Imagine(prompt);
  // Upscale the first candidate
  const upscale = await client.Upscale({
    content: Imagine.content,
    index: 1,
    flags: Imagine.flags,
    msgId: <string>Imagine.id,
    hash: <string>Imagine.hash
  });
  return upscale.uri;
}

Generated images for the audio story “I’m Lucky!”
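To recap, here is a sketch of the whole pipeline for a single story. The transcribe() call stands in for the ffmpeg splitting and createTranscription steps from the beginning of the article; it's a hypothetical wrapper, not a library function:

async function processStory(openai: OpenAIApi, filePath: string): Promise<{ annotation: string; imageUrl: string }> {
  const fullText = await transcribe(filePath);                          // 1. Transcription
  const summary = await iterativeSummarization(openai, fullText, 4096); // 2. Summary
  const annotation = await makeAnnotation(openai, summary);             // 3. Annotation
  const imageUrl = await createPicture(openai, summary);                // 4. Picture
  return { annotation, imageUrl };
}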

We now have all the components to fill the gaps in our stories, but how do we make this process fast, secure, cost-predictable, and reliable? I'll get back to that in the next article!

  • Did you enjoy this article? 👏 Clap for the story!
  • Do you have any thoughts about the article? 💬 Leave a comment!
  • Want to stay updated on future content like this? 🙌 Don’t forget to follow us on Medium to get notified about our latest articles and insights in AI, machine learning, and more.
  • Do you have a similar case to automate? ✉️ Mail us and we will help!

Andrew Zaikin

Founder & CEO at First Line Outsourcing https://flo.team | Mobile, Web and SaaS Development for Media Production, eGames & Tech | Adobe Technology Partner