Prototyping Open Source Speech Recognition and Translation

David Kolb
4 min read · Dec 3, 2023


Exploring OpenAI’s Whisper and Meta’s Llama2 7b integration on a MacBook Pro M1.

Illustration depicting the transformation of sound waves into digital text, symbolizing the speech-to-text conversion process in advanced AI technology.
Image: David Kolb via Midjourney

Speech-to-text technology has become vital for accessibility and efficiency. However, privacy remains a rising concern as reliance on cloud services grows. This heightens the need for secure solutions that uphold confidentiality over sensitive data.

Open-source large language models present a viable, secure alternative to traditional cloud services, prioritising data privacy and user control at comparable, if not better, performance.

This experiment tests the feasibility of using large language models for translation on a MacBook Pro.

I used OpenAI Whisper ‘tiny’ and Llama2 7b to match the MacBook’s performance capabilities. However, this comes with translation accuracy and depth trade-offs compared to the larger Llama2 70b and Whisper ‘large’ models, reflecting a necessary compromise in this experiment.

Use Cases

Non-Profit Organisations
Non-profits can benefit from models like Whisper and Llama2. By recognising speech and translating languages locally and privately, sensitive voice data remains under client control at a marginal cost. This allows non-profits to better serve diverse linguistic communities through enhanced, accessible tools.

Privacy-Centric Applications
For areas where privacy is paramount, like healthcare and legal, on-device large language models enable secure localisation of sensitive conversations — vital for compliance and safeguarding proprietary data. Firms maintain more granular control around data access policies, combining security for sensitive information with cost-effectiveness.

Approach

  • Set up ollama.ai.
  • Set up the Whisper and Llama2 7b models on a MacBook Pro M1.
  • Use the Whisper model to convert speech from a webinar snippet into text.
  • Use the Llama2 7b model to translate the transcribed text.
A flowchart showing a process from ‘Webinar’ to ‘Whisper’ to ‘Llama2’ at ‘ollama.ai,’ with a language transition from English to English and then to Spanish.
Webinar Translation Workflow from English to Spanish

ollama.ai

ollama.ai can be downloaded from https://github.com/jmorganca/ollama. Open the Ollama app, and it will start the Ollama server.

To download the Llama2 7b model, enter the command:

ollama pull llama2

For other versions, use llama2:13b or llama2:70b.
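With the server running, you can confirm which models have been pulled via Ollama’s tags endpoint. A minimal sketch using only the Python standard library, assuming the default port 11434:

```python
import json
import urllib.request

def tags_url(host="http://localhost:11434"):
    """URL of Ollama's local model-listing endpoint."""
    return f"{host}/api/tags"

def list_local_models(host="http://localhost:11434"):
    """Return the names of models already pulled to the local Ollama server."""
    with urllib.request.urlopen(tags_url(host)) as resp:
        data = json.loads(resp.read())
    return [m["name"] for m in data.get("models", [])]
```

After `ollama pull llama2`, calling `list_local_models()` should include an entry such as `llama2:latest`.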

Whisper

To install OpenAI Whisper:

pip install -U openai-whisper

OpenAI also recommends installing Rust build support, in case a prebuilt wheel is not available for your platform:

pip install setuptools-rust

Whisper transcription

Import the Whisper library

import whisper

Load the Whisper model, transcribe the audio content, and extract the transcription text.

For this example, I used the “tiny” Whisper model.

modelname = "tiny"
model = whisper.load_model(modelname)
# filename is the path to the audio file; fp16=False avoids half-precision on CPU
result = model.transcribe(filename, fp16=False, language="en")
transcription_text = result['text']
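Beyond the plain text, Whisper’s result dictionary also carries per-segment timestamps under `result['segments']`. A small helper, shown here against an illustrative Whisper-shaped dictionary rather than real model output, can format them:

```python
def format_segments(result):
    """Render Whisper-style segments as '[start-end] text' lines."""
    lines = []
    for seg in result.get("segments", []):
        lines.append(f"[{seg['start']:.1f}s-{seg['end']:.1f}s] {seg['text'].strip()}")
    return "\n".join(lines)

# Illustrative, Whisper-shaped sample (not real model output):
sample = {"segments": [
    {"start": 0.0, "end": 3.2, "text": " Welcome to the webinar."},
    {"start": 3.2, "end": 7.5, "text": " Today we discuss strategy."},
]}
print(format_segments(sample))
```

Timestamps like these are useful when the translated text needs to be aligned back to the original audio, for example for subtitles.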

Llama2 7b Translation

To initiate the translation, prepend an instruction to the transcribed text.

prompt = "translate this English text into Spanish: " + transcription_text
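The instruction can also be parameterised so the same pipeline targets other languages; a small hypothetical helper:

```python
def translation_prompt(text, target="Spanish"):
    """Build the instruction prompt sent to Llama2."""
    return f"translate this English text into {target}: {text}"

print(translation_prompt("Good morning", target="French"))
```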

Translate the text using Llama2 7b.

import json
import requests

llmmodel = 'llama2'
try:
    r = requests.post('http://localhost:11434/api/generate',
                      json={
                          'model': llmmodel,
                          'prompt': prompt,
                      },
                      stream=True)
    r.raise_for_status()

    for line in r.iter_lines():
        if line:  # skip keep-alive blank lines
            body = json.loads(line)
            response_part = body.get('response', '')
            # Print each token from the streamed response as it arrives
            print(response_part, end='', flush=True)

except requests.exceptions.RequestException as e:
    print(f"Error during API request: {e}")
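For short texts, a non-streaming call is simpler: when `stream` is false, Ollama returns a single JSON object whose `response` field holds the full completion. A sketch using only the standard library, assuming the same local server:

```python
import json
import urllib.request

def build_payload(model, prompt, stream=False):
    """Request body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": stream}

def translate(text, model="llama2", host="http://localhost:11434"):
    """Translate in one request; blocks until the full response is ready."""
    data = json.dumps(build_payload(
        model, "translate this English text into Spanish: " + text)).encode()
    req = urllib.request.Request(f"{host}/api/generate", data=data,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read()).get("response", "")
```

Streaming is still preferable for longer passages, since tokens appear as they are generated rather than after a long wait.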

Results

The Whisper model’s performance in transcribing English text is impressive, showcasing high accuracy and contextual understanding. The provided example effectively captures the nuances of a strategic planning discussion, maintaining the technical specificity and coherence of the original speech. This level of precision in the transcription process illustrates Whisper’s robust capability in speech-to-text conversion, particularly in handling professional and context-rich content.
Whisper output

The Llama2 7b model’s translation into Spanish preserves the core essence and strategic concepts of the original content, but lacks some of the linguistic fluidity and directness of the English version in places. This reflects the precision trade-off of using the smaller 7b parameter model rather than the more accurate but larger 70b model. Even so, it conveys the complex, technical ideas in Spanish reasonably well.
Llama2 output

Integrating Whisper and Llama2 7b combines accurate speech-to-text with promising translation functionality. Whisper transcribes spoken words adeptly, while Llama2 7b conveys complex ideas reasonably well despite limitations in capturing linguistic nuance. The experiment offers a snapshot of large language models whose precision, contextual understanding, and adaptability are approaching human-level performance.

Key Takeaways

Our experiment with OpenAI’s Whisper and Meta’s Llama2 7b on a MacBook Pro M1 has successfully demonstrated the integration of open-source speech-to-text and translation technologies.

  • Whisper exhibits exceptional accuracy in transcribing context-laden English speech, demonstrating its viability for professional applications.
  • The Llama2 7b model capably translates complex ideas but sacrifices linguistic fluidity and nuance, highlighting the need for refinement or a larger model.
  • On-device processing secures privacy-sensitive AI capabilities that otherwise rely on cloud solutions, underscoring rising user expectations of confidentiality.

These insights reflect the current capabilities and future trajectory of open-source speech and language AI. They underscore the necessity of sustained innovation in this fast-moving domain.

Interested in the evolving world of Generative AI? Join the conversation by sharing your thoughts in the comments below, or reach out to learn more.

Links to Code

ollama.ai https://github.com/jmorganca/ollama

Meta Llama2 https://ai.meta.com/llama/

OpenAI Whisper https://github.com/openai/whisper

David Kolb www.linkedin.com/in/david-kolb/
