Text to Speech from C# using XTTS v2 (Python), with Chains & CodeInterpreterThoughts

Jan 22, 2024

Photo by Dall-E 3 (https://bing.com/create).

Introduction

Previously, we looked at how to execute Python code from C# in "Executing Python Code in C# using CodeInterpreterThoughts" (Dean Martin, Medium, Jan 2024).

In this tutorial, we put that into practice by using C# and the XTTS v2 model in Python to synthesize speech from text. We will create a chain using the FrostAura.Libraries.Semantic.Core.Thoughts.Chains.Cognitive.TextToSpeechChain class that takes input text, synthesizes speech, saves it to a .wav file, and returns the path to that file.

Prerequisites
Before we begin, make sure you have the following installed:

- C# development environment (Visual Studio or .NET Core)
- Python 3.10
- The TTS library for Python (declared as a pip dependency in the chain)
- ffmpeg (declared as a conda dependency in the chain)
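
If you want to run the Python script by hand, outside the chain, it can help to confirm the prerequisites first. The following is a minimal sketch, not part of the original pipeline: the helper name check_prerequisites is hypothetical, and treating "Python 3.10 or newer" as acceptable is an assumption (the chain itself pins 3.10).

```python
import importlib.util
import shutil
import sys


def check_prerequisites(min_python=(3, 10), modules=("TTS",), executables=("ffmpeg",)):
    """Return a dict mapping each prerequisite to True (found) or False (missing)."""
    # Assumption: any interpreter at or above min_python is acceptable.
    report = {"python": sys.version_info[:2] >= tuple(min_python)}
    for name in modules:
        # find_spec returns None when a top-level module cannot be imported.
        report[f"module:{name}"] = importlib.util.find_spec(name) is not None
    for name in executables:
        # shutil.which returns None when the executable is not on PATH.
        report[f"executable:{name}"] = shutil.which(name) is not None
    return report


if __name__ == "__main__":
    for item, found in check_prerequisites().items():
        print(f"{item}: {'ok' if found else 'missing'}")
```

Any entry reported as missing can then be installed before invoking the chain.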

Problem Statement

Our goal is to write C# code that turns text into speech using the XTTS v2 model in Python. The C# code will invoke a Python script that performs the text-to-speech synthesis and saves the output as a .wav file.

Solution Overview

To achieve this, we will:

- Use the CodeInterpreterThoughts.InvokeAsync() method to construct a Python script that utilizes the XTTS model through the TTS library.
- Ensure that the required Python packages are installed by specifying the correct dependencies (pipDependencies and condaDependencies).
- Define two Python functions: “download_and_get_speaker_voice_wav_file_path()” and “synthesize(text: str)”.
  - The “download_and_get_speaker_voice_wav_file_path()” function downloads a pre-recorded voice file and saves it locally.
  - The “synthesize(text: str)” function utilizes the TTS library to synthesize the input text with the downloaded voice file.
- Invoke the “main()” function, passing in the input text to be synthesized.
- Retrieve the path of the synthesized speech .wav file from the Python script and return it to the C# code.

Step 1: Constructing the Python Script

We will use the CodeInterpreterThoughts.InvokeAsync() method to create the Python script. Here is the code for the Python script that we will generate:

def download_and_get_speaker_voice_wav_file_path() -> str:
    import requests

    # Pre-recorded voice sample used as the reference speaker.
    voice_file_download_url: str = 'https://github.com/neonbjb/tortoise-tts/raw/main/tortoise/voices/emma/1.wav'
    response = requests.get(voice_file_download_url)
    response.raise_for_status()  # fail early instead of saving an error page as audio
    output_file_name: str = 'voice_to_speak.wav'

    with open(output_file_name, 'wb') as f:
        f.write(response.content)

    return output_file_name

def synthesize(text: str) -> str:
    try:
        from TTS.api import TTS
        import uuid

        voice_file_path: str = download_and_get_speaker_voice_wav_file_path()
        tts: TTS = TTS('tts_models/multilingual/multi-dataset/xtts_v2')
        # Unique file name so repeated runs do not overwrite each other.
        output_file_path: str = f'{uuid.uuid4()}.wav'
        result: str = tts.tts_to_file(
            text=text,
            file_path=output_file_path,
            speaker_wav=voice_file_path,
            language="en",
            split_sentences=False
        )

        return result
    except Exception as e:
        print(e)
        raise

def main() -> str:
    # "$input" is a placeholder for the text to synthesize.
    return synthesize("$input")

The script defines two helper functions, “download_and_get_speaker_voice_wav_file_path()” and “synthesize(text: str)”, plus a “main()” entry point.

The “download_and_get_speaker_voice_wav_file_path()” function downloads a pre-recorded voice file from a specific URL and saves it locally.
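
Because the helper downloads the voice file on every call, repeated syntheses fetch the same sample again each time. One possible refinement, shown here only as a sketch, is to reuse a local copy when it already exists. The names get_cached_voice_file and fetch_with_requests are hypothetical and not part of the original script; the download function is passed in as a parameter so the caching logic can be exercised without network access.

```python
from pathlib import Path
from typing import Callable


def get_cached_voice_file(url: str, cache_path: str, fetch: Callable[[str], bytes]) -> str:
    """Return cache_path, calling fetch(url) only if the file is missing or empty."""
    target = Path(cache_path)
    if target.is_file() and target.stat().st_size > 0:
        return str(target)  # reuse the earlier download
    data = fetch(url)
    if not data:
        raise ValueError(f"No data received from {url}")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return str(target)


def fetch_with_requests(url: str) -> bytes:
    # Imported lazily so the caching helper itself has no third-party dependency.
    import requests

    response = requests.get(url, timeout=60)
    response.raise_for_status()  # fail loudly instead of caching an error page
    return response.content
```

With this in place, the script could call get_cached_voice_file(voice_file_download_url, 'voice_to_speak.wav', fetch_with_requests) in place of the unconditional download.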

The “synthesize(text: str)” function utilizes the TTS library to synthesize the input text with the downloaded voice file. It creates an instance of the TTS class, sets the output file path, and synthesizes the speech based on the input text.

The “main()” function calls the “synthesize()” function with the input text, which the script represents with the “$input” placeholder.
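
One thing to watch with a textual placeholder inside a quoted string: if the input contains a double quote, a backslash, or a line break, the generated script may no longer be valid Python. How the library substitutes “$input” is not shown in this article, so the following is only an illustration of the idea, written in Python for brevity. The template, the $input_literal placeholder, and render_script are all hypothetical; the sketch inserts a complete, escaped string literal instead of raw text.

```python
SCRIPT_TEMPLATE = '''def main() -> str:
    return synthesize($input_literal)
'''


def render_script(template: str, text: str) -> str:
    """Replace the hypothetical $input_literal placeholder with a Python string literal."""
    # repr() of a str is always a valid Python literal, with quotes,
    # backslashes, and newlines escaped as needed.
    return template.replace("$input_literal", repr(text))


if __name__ == "__main__":
    print(render_script(SCRIPT_TEMPLATE, 'She said "hello"\nand left.'))
```

The same principle applies on the C# side: escape the text before it is spliced into the script, or pass it through a channel that does not require quoting at all.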

Step 2: Invoking the Python Script from C#

Next, we will invoke the Python script from our C# code. We will use the CodeInterpreterThoughts.InvokeAsync() method to execute the Python script. Here is the C# code that invokes the Python script:

public class TextToSpeechChain : BaseChain
{
    // …
    public override List<Thought> ChainOfThoughts => new List<Thought>
    {
        new Thought
        {
            Action = $"{nameof(CodeInterpreterThoughts)}.{nameof(CodeInterpreterThoughts.InvokeAsync)}",
            Reasoning = "I will use my Python code interpreter to construct a script that can use the XTTS model via the TTS library and synthesize speech, and finally return the path of the file.",
            Arguments = new Dictionary<string, string>
            {
                { "pythonVersion", "3.10" },
                { "pipDependencies", "TTS" },
                { "condaDependencies", "ffmpeg" },
                { "code", /* Generated Python script */ }
            },
            OutputKey = "1"
        },
        new Thought
        {
            Action = $"{nameof(SystemThoughts)}.{nameof(SystemThoughts.OutputTextAsync)}",
            Reasoning = "I can simply proxy the response, as a direct response is appropriate for an exact transcription.",
            Arguments = new Dictionary<string, string>
            {
                { "output", "$1" }
            },
            OutputKey = "2"
        }
    };
    // …
}

In the thought chain, we define two thoughts:

- The first thought uses the CodeInterpreterThoughts.InvokeAsync() method to execute the Python script. We specify the required Python version, pipDependencies, condaDependencies, and the generated Python script. The output of this thought is stored under the key given by its OutputKey, “1”.
- The second thought uses the SystemThoughts.OutputTextAsync() method to output the result from the previous thought. It refers to that result as “$1”, the first thought's OutputKey prefixed with a dollar sign.
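
To make the output-key mechanism concrete, here is a small Python sketch of how a chain runner could pass results between thoughts. It is not the FrostAura implementation: run_chain, resolve, the dictionary layout, and the action names are all assumptions made for illustration. Each thought's result is stored under its output key, and later arguments can refer to it as "$key".

```python
import re
from typing import Callable, Dict, List


def resolve(value: str, context: Dict[str, str]) -> str:
    """Replace each $key reference with its context value in a single pass."""
    # Unknown keys are left untouched so that typos stay visible.
    return re.sub(r"\$(\w+)", lambda m: context.get(m.group(1), m.group(0)), value)


def run_chain(thoughts: List[dict], chain_input: str,
              actions: Dict[str, Callable[..., str]]) -> str:
    """Run thoughts in order, storing each result under its output key."""
    context = {"input": chain_input}
    result = ""
    for thought in thoughts:
        arguments = {name: resolve(raw, context)
                     for name, raw in thought["arguments"].items()}
        result = actions[thought["action"]](**arguments)
        context[thought["output_key"]] = result
    return result


if __name__ == "__main__":
    thoughts = [
        {"action": "interpret", "arguments": {"code": "synthesize('$input')"}, "output_key": "1"},
        {"action": "output_text", "arguments": {"output": "$1"}, "output_key": "2"},
    ]
    actions = {
        "interpret": lambda code: "example.wav",  # stand-in for running the script
        "output_text": lambda output: output,
    }
    print(run_chain(thoughts, "Hello world", actions))
```

The second thought receives whatever the first one returned, which mirrors how "$1" is used in the chain above.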

Step 3: Passing Input Text and Getting the Speech File Path

To use the TextToSpeechChain class, we can call the SpeakTextAndGetFilePathAsync() method and pass in the input text. The method will return the path to the synthesized speech .wav file. Here is an example of how to use the TextToSpeechChain class:

// Instantiate the TextToSpeechChain
var textToSpeechChain = new TextToSpeechChain(serviceProvider, semanticKernelLanguageModels, logger);

// Synthesize speech for the input text
string inputText = "This is a hello world example";
string filePath = await textToSpeechChain.SpeakTextAndGetFilePathAsync(inputText);

Console.WriteLine($"Synthesized speech file path: {filePath}");

In the example above, we create an instance of the TextToSpeechChain class and use the SpeakTextAndGetFilePathAsync() method to synthesize speech for the input text. The returned filePath variable contains the path to the synthesized speech .wav file.
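
Once you have the file path, a quick sanity check is to read the file's header and confirm that it contains audio of a plausible length. The sketch below uses Python's standard wave module and assumes the output is an uncompressed PCM .wav file; describe_wav is a hypothetical helper, not part of the chain.

```python
import wave


def describe_wav(path: str) -> dict:
    """Return basic properties of an uncompressed PCM .wav file."""
    with wave.open(path, "rb") as wav_file:
        frames = wav_file.getnframes()
        rate = wav_file.getframerate()
        return {
            "channels": wav_file.getnchannels(),
            "sample_rate_hz": rate,
            "duration_seconds": frames / rate,
        }
```

Calling describe_wav(filePath) on the returned path gives the channel count, sample rate, and duration, which is usually enough to spot an empty or truncated file.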

Conclusion

In this tutorial, we learned how to use C# and the XTTS v2 model in Python to synthesize speech from text. We created a chain in C# that invokes a Python script to handle the text-to-speech synthesis. We covered the necessary Python code and how to integrate it with C#. By following these steps, you can easily implement text-to-speech functionality in your C# applications.

If you have any questions or feedback, feel free to leave a comment below. Happy coding!

Disclaimer

This article was generated entirely autonomously by an AI pipeline that I built, using the code interpreter that I wrote earlier, so please excuse any oddities. We will improve over time. :)

#csharp #python #ai #ml #llm #medium #dalle3 #codeexecution #interoperability #integration #xtts #tts #texttospeech #speechsynthesis