Creating a Voice-Enabled Sales Report Auto-Generation App using LLM

Juan Serrano Soria
Published in The Web Tub
Aug 2, 2024

Introduction

The purpose of this post is to explain how to leverage speech recognition and LLMs like Gemini to create a sales report generator app, in which the user dictates the information with their voice and receives the final sales report as a PDF or TXT file.

Technologies used

For the development of this app, we used the following libraries:

"dependencies": {
"@capacitor-community/file-opener": "^6.0.0",
"@capacitor-community/speech-recognition": "^6.0.0",
"@capacitor/android": "^6.0.0",
"@capacitor/browser": "^6.0.0",
"@capacitor/core": "^6.0.0",
"@capacitor/filesystem": "^6.0.0",
"@capacitor/ios": "^6.0.0",
"@capacitor/splash-screen": "^6.0.0",
"@emotion/react": "^11.11.4",
"@emotion/styled": "^11.11.5",
"@google/generative-ai": "^0.12.0",
"@mui/icons-material": "^5.15.18",
"@mui/material": "^5.15.18",
"@vitejs/plugin-react": "^4.2.1",
"html2pdf.js": "^0.10.1",
"react": "^18.3.1",
"react-dom": "^18.3.1",
"react-router-dom": "^6.23.1",
"textversionjs": "^1.1.3"
},

The main libraries are:

- Capacitor and its community plugins (speech-recognition, filesystem, file-opener) for the native functionality on Android and iOS.
- @google/generative-ai, the client library for the Gemini API.
- React, Material UI and react-router-dom for the user interface and navigation.
- html2pdf.js and textversionjs for converting the generated HTML report into PDF and TXT files.

Development

The structure of the project is as follows.

First, obtain a Gemini API key following this tutorial.

Create a file env.js inside src/js with the following content, using your Gemini API key:

const env = {
  API_KEY: '...',
};

export default env;

To use Gemini in our app, create a module that initializes the model:

import { GoogleGenerativeAI } from "@google/generative-ai";
import env from "./env.js";

const API_KEY = env.API_KEY;
const genAI = new GoogleGenerativeAI(API_KEY);
const geminiModel = genAI.getGenerativeModel({ model: "gemini-pro"});

export default geminiModel;
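
To confirm the key and model are wired up correctly, the exported model can be called directly. A minimal sketch (the module file name gemini.js and the test prompt are illustrative assumptions, not from the original project):

// Assumed file name for the module above
import geminiModel from "./gemini.js";

// Send a test prompt and log the generated text
async function testGemini() {
  const result = await geminiModel.generateContent("Say hello in one sentence.");
  const response = await result.response;
  console.log(response.text());
}

testGemini();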

The two main screens are SpeechRecognition and ReportPreviewDownload.
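
Navigation between these two screens can be wired up with react-router-dom. A minimal sketch, assuming the screen components live under src/screens and a hash-based router is used (both assumptions; the article does not show this file):

import { HashRouter, Routes, Route } from 'react-router-dom';
import SpeechRecognition from './screens/SpeechRecognition';
import ReportPreviewDownload from './screens/ReportPreviewDownload';

// The recording screen is the entry point; 'report-preview' matches
// the navigate() call used after generating the report
export default function App() {
  return (
    <HashRouter>
      <Routes>
        <Route path="/" element={<SpeechRecognition />} />
        <Route path="/report-preview" element={<ReportPreviewDownload />} />
      </Routes>
    </HashRouter>
  );
}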

For the SpeechRecognition screen, the main code for starting and stopping voice recognition is:

const startRecording = async (continueRecording = false) => {
  setFirstTime(false);
  setContinueRecording(continueRecording);
  setCurrentTranscript('');
  if (Capacitor.platform === 'web') {
    setError('Speech recognition is not available here');
    setTranscript(['Speech recognition is not available here', 'sales are good']);
    return;
  }

  SpeechRecognition.start({
    language: 'en-US',
    maxResults: 2,
    prompt: 'Say something',
    partialResults: true,
    popup: false,
  });
  setIsRecording(true);

  timeoutId.current = null;
  if (Capacitor.platform === 'android') {
    // Set a timeout to stop the recording after 2 seconds
    timeoutId.current = setTimeout(stopRecording, 2 * 1000);
  }
};

const stopRecording = async () => {
  // Combine the current transcript with the previous ones
  if (continueRecording) {
    setTranscript([...transcript, currentTranscript]);
    setContinueRecording(false);
  } else {
    setTranscript([currentTranscript]);
  }
  setCurrentTranscript('');

  if (Capacitor.platform === 'ios') { // Android stops automatically
    await SpeechRecognition.stop();
  }
  setIsRecording(false);
};

Note that on iOS, speech recognition keeps running until we stop it manually, while on Android the native speech recognition stops automatically after a couple of seconds without detecting input.

This is also important when setting up the listeners in the useEffect.

useEffect(() => {
  if (Capacitor.platform === 'web') {
    setError('Speech recognition is not available here');
    setTranscript(['Speech recognition is not available here', 'sales are good']);
    return;
  }

  const requestPermissionsAndSetupListeners = async () => {
    const { speechRecognition } = await SpeechRecognition.requestPermissions();

    if (speechRecognition !== 'granted') {
      setError('Speech recognition permission was denied');
      return;
    }

    // Add listeners
    SpeechRecognition.addListener('partialResults', (data) => {
      setCurrentTranscript(data.matches[0]);

      if (Capacitor.platform === 'android') {
        // If a result is received, clear the timeout and set a new one
        clearTimeout(timeoutId.current);
        timeoutId.current = setTimeout(stopRecording, 2 * 1000);
      }
    });
  };

  requestPermissionsAndSetupListeners();

  return () => {
    // Remove listeners on component unmount
    SpeechRecognition.removeAllListeners();
  };
}, []);

The final step is calling the Gemini API with the input and requesting a sales report formatted in HTML, since HTML is flexible and rich, and easy to convert to PDF without losing formatting.

const handleGenerateReport = async () => {
  setIsLoading(true);
  try {
    const input = transcript.join(' ') + ". Generate a sales report with the previous input in html format. Only return the html";
    const result = await Promise.race([geminiModel.generateContent(input), timeout(30000)]); // 30 seconds timeout
    const response = await result.response;
    if (!response) {
      throw new Error('No response received from Gemini');
    }
    const text = response.text();
    const htmlContent = text.replace(/```html\n|\n```/g, '');
    navigate('report-preview', { state: { htmlContent, transcript } });
  } catch (err) {
    setError(err.message);
  } finally {
    setIsLoading(false);
    setTranscript(transcript);
  }
};
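
The timeout helper used in the Promise.race call above is not shown in the article; a minimal sketch of how it could be implemented (the error message is illustrative):

// Rejects after the given number of milliseconds, so Promise.race
// fails the report generation if Gemini takes too long to answer
const timeout = (ms) =>
  new Promise((_, reject) =>
    setTimeout(() => reject(new Error('Request timed out')), ms)
  );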

This screen has buttons to record, stop recording, resume recording or record from the beginning, plus a button to generate the sales report with the current input.
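
A minimal sketch of how those controls could be wired to the handlers above using MUI buttons (the labels and layout are illustrative, not the exact markup of the app):

import { Button, Stack } from '@mui/material';

// Inside the SpeechRecognition component's return statement;
// startRecording, stopRecording and handleGenerateReport are the handlers shown above
<Stack direction="row" spacing={2}>
  <Button variant="contained" onClick={() => startRecording(false)}>Record</Button>
  <Button variant="outlined" onClick={stopRecording} disabled={!isRecording}>Stop</Button>
  <Button variant="outlined" onClick={() => startRecording(true)}>Resume</Button>
  <Button variant="contained" onClick={handleGenerateReport}>Generate report</Button>
</Stack>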

For the Report Preview screen, the main functionality is divided into two components, each for one of the two supported file formats, pdf and txt.

Both components rely on @capacitor/filesystem and @capacitor-community/file-opener to save, preview, and open the generated files on both Android and iOS.

For example, the code to generate a PDF from the HTML is as follows:

const opt = {
  margin: 1,
  filename: 'sales_report.pdf',
  image: { type: 'jpeg', quality: 0.98 },
  html2canvas: { scale: 2 },
  jsPDF: { unit: 'in', format: 'letter', orientation: 'portrait' }
};

const element = document.createElement('div');
element.innerHTML = html;
const pdfOutput = await html2pdf().from(element).set(opt).outputPdf('arraybuffer');

// Convert the PDF output to a base64 string
const base64String = btoa(
  new Uint8Array(pdfOutput).reduce(
    (data, byte) => data + String.fromCharCode(byte),
    '',
  ),
);

try {
  // Save the PDF file to the device
  await Filesystem.writeFile({
    path: 'sales_report.pdf',
    data: base64String,
    directory: Directory.Documents,
  });

  // Get the URL of the file
  const fileUri = await Filesystem.getUri({
    path: 'sales_report.pdf',
    directory: Directory.Documents,
  });

  // Open the file
  await FileOpener.open({
    filePath: fileUri.uri,
    mimeType: 'application/pdf',
  });
} catch (error) {
  console.error('Error:', error);
}
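
The TXT component follows the same pattern. A minimal sketch, assuming textversionjs is used to strip the HTML markup before saving (here html holds the generated report, as in the PDF example; variable names are illustrative):

import textVersion from 'textversionjs';
import { Filesystem, Directory, Encoding } from '@capacitor/filesystem';
import { FileOpener } from '@capacitor-community/file-opener';

// Convert the generated HTML report to plain text
const plainText = textVersion(html);

try {
  // Save the TXT file to the device
  await Filesystem.writeFile({
    path: 'sales_report.txt',
    data: plainText,
    directory: Directory.Documents,
    encoding: Encoding.UTF8,
  });

  // Get the URL of the file and open it in the native viewer
  const fileUri = await Filesystem.getUri({
    path: 'sales_report.txt',
    directory: Directory.Documents,
  });

  await FileOpener.open({
    filePath: fileUri.uri,
    mimeType: 'text/plain',
  });
} catch (error) {
  console.error('Error:', error);
}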

Using these two components, the Report Preview and Download screen is organized into a display area for the generated HTML, a button to download the PDF, a button to download the TXT file, and a back button.

Pressing the PDF or TXT button opens a preview of the corresponding file in the native viewer on iOS or Android.

Conclusion

In this article we have shown how easy it is to create a hybrid application with Capacitor that integrates an LLM through the Gemini API together with native speech recognition.

You can find the code for this project here.
