Build your own SIP phone and use it as a virtual voice call assistant

Phong Vu
RingCentral Developers
15 min read · Jul 1, 2024

In this blog, I will walk you through the essential steps to set up and build your own SIP phone that works with your RingCentral phone number. You can turn the SIP phone into a virtual voice call assistant that can automatically answer incoming calls, interact with the caller, and transfer the call to an agent if requested. I will keep the demo features minimal and leave the framework expandable to support additional features. Once we’re finished, we’ll have an app that will do the following:

  • Answer incoming calls and detect the caller ID.
  • Identify whether the caller is a customer.
  • Detect whether the call comes from a human or a robocall, and terminate the call if it is a robocall.
  • Interact with the caller.
  • Transfer the call to an agent if the caller requests.

Prerequisites

  1. You must have a RingCentral production account and an available RingEX license.
  2. An OpenAI developer account.
  3. An IBM Watson developer account.

Set up and provision a SIP device with RingCentral

First of all, follow the instructions on this page to add a new user to your account. You must select the “RingEX” user license for the new user as this is required for a user to have a phone device. At step #7 where you need to select a phone, choose the “Bring your own device” option and assign a phone number.

Figure 1. Choose a phone device

To make it easier to follow along with this tutorial, name the user extension “Virtual Voice Assistant”. Complete the user creation process and set up the user’s login credentials.

After successfully adding the new user, browse to the account users list, select the newly added user, and select the “Devices & Numbers” option as shown below.

Figure 2. User phone device

Next, click on the “Existing Phone” link to open the device settings page, and change the name of the phone to “My SIP phone”. The name of the phone must be unique because you will use that name to identify and select the device later in the code.

Figure 3. Set the name of the selected device

That is all you need to set up and provision your SIP device for a user in your RingCentral account.

Now, use the login credentials of the “Virtual Voice Assistant” extension you created earlier to log in to the RingCentral developers portal, register a RingCentral app, and generate a JWT token for the “Virtual Voice Assistant” extension. If you are not familiar with this procedure, follow the steps in this getting started document to register your app, and follow the steps in this developer guide to create a JWT token.

I assume that you are familiar with getting IBM Watson IAM API keys and an instance Id for accessing Watson AI services, and with getting OpenAI API keys for accessing their services. If not, please click on the appropriate links above to learn how to obtain the API keys.

Once you have the RingCentral app credentials, the user JWT token, and the other service API keys, copy and paste them into the appropriate fields in the “dotenv” file in the project, then rename the file to “.env”.

RINGCENTRAL_CLIENT_ID=Your_App_Client_Id
RINGCENTRAL_CLIENT_SECRET=Your_App_Client_Secret
RINGCENTRAL_JWT=The-Virtual-Voice-Assistant-JWT
WATSON_SPEECH2TEXT_API_KEY=Your_Watson_Speech_To_Text_Api_Key
STT_INSTANCE_ID=Your_SpeechToText_Instance_Id
GPT_SERVER_URL=api.openai.com
GPT_API_KEY=Your-OpenAI-Appkey
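
With the credentials in place, loading them and signing in looks roughly like the sketch below. It is a minimal example using the RingCentral JavaScript SDK and the JWT auth flow; the demo project wraps the same steps inside its phone engine class.

require('dotenv').config()
const { SDK } = require('@ringcentral/sdk')

// Create the SDK instance from the .env values
const rcsdk = new SDK({
  server: 'https://platform.ringcentral.com',
  clientId: process.env.RINGCENTRAL_CLIENT_ID,
  clientSecret: process.env.RINGCENTRAL_CLIENT_SECRET
})
const platform = rcsdk.platform()

async function login() {
  // JWT flow: no username/password required
  await platform.login({ jwt: process.env.RINGCENTRAL_JWT })
}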

Now let’s start designing and implementing the virtual voice call assistant.

In this demo app, I use the following services and components to implement the features:

  • RingCentral Soft-Phone SDK: to handle incoming calls
  • RingCentral API platform: to transfer a call to an agent.
  • WaveFile library: to convert the audio format
  • IBM Watson Real-Time Speech-to-Text: to transcribe the caller’s speech
  • OpenAI ChatGPT: to make a conversation with the caller

Note: The code snippets shown in this article are shortened and just for illustration of essential parts. They may not work directly with copy/paste. I recommend you clone the entire project from my GitHub repo. Read the project README for details.

Register the Soft-Phone to handle incoming calls

The first and essential step is to register the soft-phone engine to connect to the provisioned SIP phone I set up earlier. To do this programmatically, I need to read the virtual voice assistant extension’s SIP device to retrieve the SIP phone credentials.

var resp = await platform.get('/restapi/v1.0/account/~/extension/~/device')
let jsonObj = await resp.json()
for (var device of jsonObj.records){
  if (device.name == 'My SIP phone'){
    resp = await platform.get(`/restapi/v1.0/account/~/device/${device.id}/sip-info`)
    let ivrPhone = await resp.json()
    let deviceInfo = {
      username: ivrPhone.userName,
      password: ivrPhone.password,
      authorizationId: ivrPhone.authorizationId,
    }
    this.softphone = new Softphone(deviceInfo);
    try {
      await this.softphone.register();
      ...
    } catch(e) {
      console.log("Failed to register the soft-phone", e.message)
    }
  }
}

The full and complete code for initializing the soft-phone is implemented in the ‘initializePhoneEngine()’ function.

If the soft-phone is registered successfully, it is ready to receive incoming calls. Before implementing the code, let’s define a data object that holds the necessary data of an active call. I will highlight only the most important data fields of the activeCall object.

var activeCall = {
  callSession: null,     // For controlling the call
  transcript: "",        // For keeping the transcript
  dtmf: "",              // For keeping the DTMF tones
  watsonEngine: null,    // For getting real-time transcript
  audioBuffer: null,     // For keeping incoming audio data
  assistantEngine: null, // For interacting with OpenAI services
  telSessionId: "",      // For transferring the call
  partyId: "",           // For transferring the call
  customerInfo: null,    // For keeping the customer info
  screeningStatus: "",   // For caller verification
  ...
}

For the full data object, see the ‘createActiveCall()’ function from the demo project. Now I continue to implement the code for answering incoming calls. Let’s break the code block into several sections so it’s easier to walk through the code.

Receive an incoming call and detect the caller Id (phone number)

When there is an incoming call, I will receive the SIP invite message event, where I can parse the SIP message headers and detect the caller’s phone number.

// detect inbound call
this.softphone.on('invite', async (sipMessage) => {
  console.log("SIP Invite for an incoming call")
  // parse the caller's phone number from the SIP message headers
  var header = sipMessage.headers['Contact']
  var fromNumber = header.substring(5, header.indexOf('@'))

Answer the call, get the call session, and create the active call object

I answer the call by calling the ‘answer()’ function which returns the call session of this call.

var callSession = await this.softphone.answer(sipMessage);
// detect blocked caller
if (this.isBlockedCaller(fromNumber)){
  console.log("blocked number => terminate the call immediately")
  callSession.hangup()
  return
}
// create an activeCall object to keep call and necessary info
var activeCall = await this.createActiveCall(callSession, fromNumber)

// Function to detect if a caller's number is blocked
isBlockedCaller: function(fromNumber){
  return blockedNumbers.includes(fromNumber)
}

Since I keep a list of blocked phone numbers, I check whether the caller’s phone number is on that list. If the incoming call should be blocked, I immediately hang up the call. Otherwise, I move on to call the ‘createActiveCall()’ function to create a new activeCall data object.

I’ll create the Watson engine and the OpenAI engine; these two engines will be discussed shortly in the next sections. I also use the caller’s phone number to identify whether the caller is a known customer. If the caller’s phone number is anonymous or unknown, I use a simple technique to decide whether it is a robocall; if it is, I terminate the call, otherwise I let the caller continue talking to the virtual assistant. If the caller is a customer, I load the customer’s essential information into the “customerInfo” data field. In this demo, I created a list of customers with simple customer info (the customer’s name, phone number, and the last 4 digits of their SSN) and saved it in a file called “customers.json”.
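
For illustration, a customer record and the lookup could be as simple as the sketch below. The field names here are my own example; check the “customers.json” file in the repo for the exact structure.

// customers.json -- illustrative content
// [
//   { "name": "John Doe", "phoneNumber": "+16505551234", "ssn": "6789" }
// ]
const customersList = require('./customers.json')

// Return the customer record, or undefined for an unknown caller
identifyCallerByPhoneNumber: function(fromNumber){
  return customersList.find(c => c.phoneNumber === fromNumber)
}

With that helper in mind, here is the ‘createActiveCall()’ function: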

createActiveCall: async function(callSession, fromNumber){
  let customer = this.identifyCallerByPhoneNumber(fromNumber)

  var activeCall = {
    callSession: callSession,
    transcript: "",
    dtmf: "",
    delayTimer: null,
    assignedAgent: null,
    speechStreamer: null,
    watsonEngine: new WatsonEngine(this, callSession.callId),
    watsonEngineReady: false,
    audioBuffer: null,
    assistantEngine: new OpenAIEngine(),
    fromNumber: fromNumber,
    telSessionId: "",
    partyId: "",
    screeningStatus: "verified",
    conversationStates: new ConversationStates(),
    passCode: "",
    customerInfo: customer,
    screeningFailedCount: 0,
    maxScreeningFailedCount: 3
  }
  if (customer){ // known number
    activeCall.screeningStatus = "verified"
    this._playInlineResponse(activeCall, "Thank you for your call …")
  }else{ // unknown number => turn on robocall_defend mode
    activeCall.screeningStatus = "robocall_defend"
    activeCall.passCode = makePassCode()
    this._playInlineResponse(activeCall, `please repeat the following number. ${activeCall.passCode}`)
  }

  await sleep(2000)
  await this.getCallInfo(activeCall)
  ...
  return activeCall
}

Within the ‘getCallInfo()’ function, I call the RingCentral APIs to get the call’s telephony session Id and the call party Id. These Ids will be used when I transfer the call to an agent.
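
One way to retrieve those Ids is to read the extension presence with detailed telephony state and match the active inbound call by the caller’s number. The sketch below shows that approach; the actual implementation in the repo may differ in its details.

getCallInfo: async function(activeCall){
  let resp = await platform.get('/restapi/v1.0/account/~/extension/~/presence?detailedTelephonyState=true&sipData=true')
  let jsonObj = await resp.json()
  // Find the inbound call that matches this caller
  for (var call of jsonObj.activeCalls){
    if (call.direction == 'Inbound' && call.from == activeCall.fromNumber){
      activeCall.telSessionId = call.telephonySessionId
      activeCall.partyId = call.partyId
      break
    }
  }
}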

Get the audio data packets and get the transcript

It’s worth mentioning that the RingCentral soft-phone SDK uses the G.711 audio codec, which delivers audio packets in mu-law wav (8-bit, 8kHz) format, with each audio packet being 160 bytes. So remember to set the ‘content-type’: `audio/mulaw;rate=8000;channels=1` together with the other required configurations after opening the Watson web socket connection.
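
A sketch of opening the Watson recognize socket and sending that configuration is shown below. The host and region in the URL are examples — use the URL of your own service instance — and `accessToken` is assumed to be an IAM token obtained with your Watson API key.

const WebSocket = require('ws')

let url = `wss://api.us-south.speech-to-text.watson.cloud.ibm.com/instances/${process.env.STT_INSTANCE_ID}/v1/recognize?access_token=${accessToken}`
this.ws = new WebSocket(url)
this.ws.on('open', () => {
  // Tell Watson what audio format to expect before streaming any packets
  this.ws.send(JSON.stringify({
    action: 'start',
    'content-type': 'audio/mulaw;rate=8000;channels=1',
    interim_results: true
  }))
})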

The audio buffer size should be tuned to balance transcription latency against accuracy. Each 160-byte packet carries 20 ms of audio, so the buffer size defined below (160 × 50 bytes) accumulates about one second of audio before each send; so far, that has been the best buffer size for this demo application.

In the implementation below, I push the received audio packets into a buffer, and when the buffer size reaches the max size, I send the audio buffer to the Watson engine for transcription.

let MAXBUFFERSIZE = 160 * 50

activeCall.callSession.on('audioPacket', (rtpPacket) => {
  if (activeCall.audioBuffer != null){
    activeCall.audioBuffer = Buffer.concat([activeCall.audioBuffer, Buffer.from(rtpPacket.payload)])
  }else{
    activeCall.audioBuffer = Buffer.from(rtpPacket.payload)
  }
  if (activeCall.audioBuffer.length >= MAXBUFFERSIZE){
    if (activeCall.watsonEngineReady){
      activeCall.watsonEngine.transcribe(activeCall.audioBuffer)
      activeCall.audioBuffer = null
    }
  }
});

When the transcript is ready, the Watson engine will call the callback function passing along the state and the transcript.

// From the WatsonEngine class
this.ws.on('message', function(evt) {
  var res = JSON.parse(evt)
  if (res.hasOwnProperty('results')){
    if (res.results[0].final){
      var transcript = res.results[0].alternatives[0].transcript
      transcript = transcript.trim().replace(/%HESITATION/g, "")
      callback("FINAL", transcript)
    }else{
      callback("INTERIM")
    }
  }
});

// From the Phone Engine class
activeCall.watsonEngine.createWatsonSocket((state, transcript) => {
  if (state == "READY") {
    activeCall.watsonEngineReady = true
  }else if (state == "ERROR"){
    console.log("WatsonSocket creation failed!")
  }else if (state == "INTERIM"){
    // cancel the delay timer when there are more transcribed texts
    if (activeCall.delayTimer != null){
      clearTimeout(activeCall.delayTimer)
      activeCall.delayTimer = null
    }
  }else if (state == "FINAL"){
    if (transcript.length == 0 && activeCall.transcript.length == 0)
      return
    // concatenate transcripts
    activeCall.transcript += `${transcript.toLowerCase()} `
    this.transcriptReady(activeCall)
  }
})

In this demo app, when I expect to hear a passcode or a yes/no short answer from the caller, I will cancel the delay timer (if it was set) and immediately call the processTranscript() function to analyze the caller’s speech. In other situations, I set the delay timer to wait for more transcribed text. If additional transcribed text is received within the delay period, I concatenate the transcripts and reset the delay timer. This process continues until the delay timer expires, at which point I call the processTranscript() function to analyze the entire transcript.

transcriptReady: function(activeCall){
  let subState = activeCall.conversationStates.getSubState()
  if (activeCall.screeningStatus == 'robocall_defend' ||
      subState == 'wait-for-transfer-decision'){
    // Cancel the delay timer if set
    if (activeCall.delayTimer != null){
      clearTimeout(activeCall.delayTimer)
      activeCall.delayTimer = null
    }
    // Process the transcript immediately
    this.processTranscript(activeCall)
  }else{
    var thisClass = this
    // Set a delay timer to wait for more transcript
    activeCall.delayTimer = setTimeout(function(){
      activeCall.delayTimer = null
      if (activeCall.transcript.length > 0){
        // Process the transcript if the timer expires
        thisClass.processTranscript(activeCall)
      }
    }, 2000)
  }
}

Analyze the transcript and create a conversation

Let’s deal first with an unknown caller, i.e. when the caller Id is an unknown or anonymous phone number. In this case, I generate a random 4-digit passcode and ask the caller to repeat it.

if (customer){ // known number
  activeCall.screeningStatus = "verified"
  this._playInlineResponse(activeCall, `Thank you for your call Mr. ${activeCall.customerInfo.name}! ...`)
}else{ // unknown number => turn on robocall_defend mode
  activeCall.screeningStatus = "robocall_defend"
  activeCall.passCode = makePassCode()
  this._playInlineResponse(activeCall, `Before we continue, please repeat or use the keypad to dial the following number. ${activeCall.passCode}`)
}
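
The ‘makePassCode()’ helper just needs to produce a random 4-digit string. A minimal version (my own sketch, not necessarily the repo’s implementation) could be:

// Return a random 4-digit passcode between "1000" and "9999"
function makePassCode(){
  return Math.floor(1000 + Math.random() * 9000).toString()
}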

I set the screeningStatus to “robocall_defend” and process the caller’s speech to check whether the caller repeats the passcode. To identify the passcode in the caller’s speech, I use the OpenAI completions endpoint to check whether the transcript contains a passcode that matches the one I generated.

// From the OpenAI Engine
getCodeVerification: async function(message, code){
  var prompt = `Find a number from the text between the triple dashes and compare it with the number from the text between the triple hashes ---${message}---, ###${code}###. `
  prompt += 'Provide in JSON format where the key is "matched" and the value is true if the numbers are equal or false if they are not equal. And the second key is "number" and the value is the detected number.'
  return await this.getClassifiedTask("JSON", prompt)
}

The above function will return a JSON response that looks like this:

{
  matched: true, // or false
  number: 1234
}
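
The ‘getClassifiedTask()’ function wraps the OpenAI chat completions endpoint. A simplified sketch is shown below; it assumes a model that supports JSON mode (e.g. “gpt-4o”) and omits error handling, so treat it as an outline rather than the repo’s exact implementation.

getClassifiedTask: async function(format, message){
  const resp = await fetch(`https://${process.env.GPT_SERVER_URL}/v1/chat/completions`, {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.GPT_API_KEY}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      model: 'gpt-4o', // any model with JSON mode support
      response_format: (format == "JSON") ? { type: 'json_object' } : undefined,
      messages: [{ role: 'user', content: message }]
    })
  })
  const data = await resp.json()
  // The model returns a JSON string in the message content
  return JSON.parse(data.choices[0].message.content)
}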

I set maxScreeningFailedCount and hang up the call if the caller fails to repeat the passcode that many times. If the caller repeats the passcode correctly, I set the screeningStatus to “verified” and let the caller talk to the assistant. I don’t treat a known customer differently from an unknown caller, except for greeting a known customer by the name found in the customer profile. In a real use case, you could provide different services to your customers or ask unknown callers different questions.

processTranscript: async function(activeCall){
  switch (activeCall.screeningStatus) {
    case "robocall_defend":
      ...
      var checkResult = await activeCall.assistantEngine.getCodeVerification(activeCall.transcript.trim(), activeCall.passCode)

      if (checkResult && checkResult.matched){
        activeCall.screeningStatus = "verified"
        await this._playInlineResponse(activeCall, "Thank you for your verification. How can I help you ...")
      }else{
        if (activeCall.screeningFailedCount >= activeCall.maxScreeningFailedCount){
          console.log("Reject and hangup")
          activeCall.callSession.hangup()
        }else{
          // ask the caller to retry
        }
      }
      activeCall.transcript = ""
      break
    case "customer_verification":
      // If you want to verify the customer's last 4-digit SSN,
      // you can set and handle this screeningStatus similar to the
      // passcode verification
      break
    ...
  }
}

I also capture DTMF tones so the caller can use the dial pad to enter the passcode.

// receive DTMF
activeCall.callSession.on('dtmf', (digit) => {
  this.handleDTMFResponse(activeCall, digit)
});


handleDTMFResponse: async function(activeCall, digit){
  switch (activeCall.screeningStatus) {
    case "robocall_defend":
      activeCall.dtmf += digit
      if (activeCall.dtmf.length >= 4){
        if (activeCall.dtmf == activeCall.passCode){
          activeCall.screeningFailedCount = 0
          activeCall.dtmf = ""
          activeCall.screeningStatus = "verified"
          await this._playInlineResponse(activeCall, "Thank you for your verification. How can I help you ...")
        }else{
          // wrong passcode => count the failure
          activeCall.screeningFailedCount++
          activeCall.dtmf = ""
          if (activeCall.screeningFailedCount >= activeCall.maxScreeningFailedCount){
            // reject this call after max failure times
            console.log("Reject and hangup")
            activeCall.callSession.hangup()
          }else{
            // ask the caller to retry
          }
        }
      }
      break
    default:
      break
  }
}

Before proceeding to implement a conversation with a caller, let’s discuss how the virtual assistant talks to the caller. To achieve this, I use OpenAI’s text-to-speech model “tts-1” to convert text messages into an audio buffer, which is then streamed to the SIP phone. However, OpenAI’s TTS does not directly support 8kHz mu-law audio output. Therefore, prior to streaming the audio to the SIP phone, it is necessary to convert the audio to the required format. The implementation for this conversion can be found in the OpenAIEngine class, specifically within the ‘_streamToBuf()’ and ‘_convertWave()’ functions.
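
To illustrate the idea, the sketch below requests WAV audio from OpenAI’s speech endpoint, then uses the WaveFile library to down-sample it to 8kHz and re-encode it as mu-law. This is a simplified stand-in for the repo’s ‘_streamToBuf()’/‘_convertWave()’ logic; the voice name and the non-streaming request are my own choices for brevity.

const { WaveFile } = require('wavefile')

async function textToMulaw(text){
  const resp = await fetch('https://api.openai.com/v1/audio/speech', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.GPT_API_KEY}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({ model: 'tts-1', voice: 'alloy', input: text, response_format: 'wav' })
  })
  const wav = new WaveFile(Buffer.from(await resp.arrayBuffer()))
  wav.toSampleRate(8000) // match the G.711 8kHz sample rate
  wav.toMuLaw()          // re-encode as 8-bit mu-law
  return Buffer.from(wav.data.samples)
}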

Handle call conversation

I define two conversation main states: the “chatting” state and the “call-transfer-request” state. A conversation starts in the “chatting” state, during which the assistant engages with the caller. I use the OpenAI “gpt-4” model to analyze the caller’s transcript and generate text for the conversation. If the caller requests to speak with a person, the main state changes to “call-transfer-request”. In this state, the assistant asks the caller to confirm before proceeding with the call transfer. Below is the sample code to switch between chatting with the caller and handling the call transfer request.

processConversation: async function(activeCall){
  let mainState = activeCall.conversationStates.getMainState()
  let subState = activeCall.conversationStates.getSubState()
  switch (mainState) {
    case 'chatting':
      this.handleChattingConversation(activeCall)
      break
    case 'call-transfer-request':
      if (subState == 'transfer-call'){
        console.log("transfer in progress => ignore this")
        return
      }
      this.handleCallTransferRequest(activeCall)
      break
    default:
      break
  }
}

Every time I call the OpenAI API to process caller transcripts, I ask it to determine the caller’s intention: whether it’s a “request”, a “question”, or an “answer”. Additionally, I ask it to analyze whether the topic pertains to “ordering”, “technical support”, or “billing”. I also request it to provide the following metadata:

  • If the caller’s intent is a “request,” classify the request as “call transfer,” “billing inquiry,” “order inquiry,” or “unclassified” if it does not fit any of the predefined categories.
  • If the caller’s intent is a “question,” provide the best short answer.
  • If the caller’s intent is an “answer,” provide the best utterance to continue the conversation.

The request above is implemented within the ‘getIntents()’ function in the ‘OpenAIEngine’ class, and the expected response would look like one of the following examples:

{
  intent: "request",
  topic: "ordering" or "technical support" or "billing",
  class: "call transfer" or "billing inquiry" or "order inquiry",
  ask: ""
}

{
  intent: "question",
  topic: "ordering" or "technical support" or "billing",
  answer: "best short answer" // an answer to the question
}

{
  intent: "answer",
  topic: "ordering" or "technical support" or "billing",
  ask: "short utterance" // an utterance to continue the conversation
}

Using the structured response above, I can check and decide how to respond to the caller.
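
For reference, the prompt inside ‘getIntents()’ can be composed along these lines. This is a paraphrased sketch built from the requirements above; see the repo for the exact wording.

getIntents: async function(transcript){
  let message = `Detect the intent of the text between the triple dashes ---${transcript}---. `
  message += 'Classify the intent as "request", "question" or "answer", and the topic as "ordering", "technical support" or "billing". '
  message += 'If the intent is a "request", classify the request as "call transfer", "billing inquiry", "order inquiry" or "unclassified". '
  message += 'If the intent is a "question", provide the best short answer under the key "answer". '
  message += 'If the intent is an "answer", provide a short utterance to continue the conversation under the key "ask". '
  message += 'Respond in JSON format with the keys "intent", "topic", "class", "answer" and "ask".'
  return await this.getClassifiedTask("JSON", message)
}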

Let’s assume that the caller’s intention is a call transfer request and that I have 3 dedicated agents handling customers’ ordering, billing, and technical support requests. The agents’ info is defined in the sample “agents.json” file and loaded into the ‘agentsList’. The code snippet below shows how to assign an agent and prepare for the call transfer by changing the conversation main state to “call-transfer-request” and the sub-state to “wait-for-transfer-decision”, then waiting for the caller’s confirmation.

var action = await activeCall.assistantEngine.getIntents(transcript)
switch (action.intent) {
  case 'request':
    var agent = null
    if (action.class == "call transfer"){
      if (action.topic == "ordering"){
        agent = this.agentsList.find(o => o.extensionNumber == "102")
      }else if (action.topic == "billing"){
        agent = this.agentsList.find(o => o.extensionNumber == "103")
      }else if (action.topic == "technical support"){
        agent = this.agentsList.find(o => o.extensionNumber == "104")
      }else{
        await this.playInlineResponse(activeCall, `${action.ask}`)
        return
      }
      activeCall.conversationStates.setMainState("call-transfer-request")
      activeCall.conversationStates.setSubState('wait-for-transfer-decision')
      var confirm = `Do you want me to transfer your call to the ${agent.name}?`
      activeCall.assignedAgent = agent
      await this.playInlineResponse(activeCall, confirm)
    }
    break
  ...
}

If the ‘intent’ is “question” or “answer”, I simply play the “answer” or the “ask” text generated by OpenAI.

case 'question':
  if (action.answer)
    await this.playInlineResponse(activeCall, action.answer)
  break
case 'answer':
  console.log("Check topic?", action.topic)
  if (action.ask)
    await this.playInlineResponse(activeCall, action.ask)
  break

When the conversation main state is “call-transfer-request”, I call the OpenAI API to analyze the caller’s transcript and determine whether it’s a “yes” or a “no” answer. The expected response is a JSON object with the key “answer” set to 1 for a “yes”, 0 for a “no”, or -1 if the answer is neutral.

If the answer is a “yes”, the assistant will tell the caller to stay on the line while the call is being transferred.

If the answer is a “no”, I set the conversation main state back to “chatting” and return to normal conversation mode.

Otherwise, I ask the caller to repeat the answer.
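
Following the same prompting pattern as ‘getCodeVerification()’, a sketch of ‘getYesOrNoAnswer()’ could look like this (paraphrased; see the repo for the exact prompt):

getYesOrNoAnswer: async function(transcript){
  let message = `Decide whether the text between the triple dashes is a yes or a no answer ---${transcript}---. `
  message += 'Provide in JSON format where the key is "answer" and the value is 1 for a yes, 0 for a no, or -1 if the answer is neutral.'
  return await this.getClassifiedTask("JSON", message)
}

The transfer-request handler then uses that decision as follows: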

handleCallTransferRequest: async function (activeCall){
  let decision = await activeCall.assistantEngine.getYesOrNoAnswer(activeCall.transcript)
  if (!decision){
    await this.playInlineResponse(activeCall, "Sorry, can you repeat it?")
    return
  }
  if (decision.answer == 1){ // it's a yes
    activeCall.conversationStates.setSubState('transfer-call')
    await this.playInlineResponse(activeCall, `Ok, let me transfer your call to ${activeCall.assignedAgent.name}. Please stay on the line.`)
    await sleep(5000)
    this.blindTransferCall(activeCall)
  }else if (decision.answer == 0){ // it's a no
    activeCall.conversationStates.setMainState('chatting')
    activeCall.conversationStates.setSubState('no-action')
    await this.playInlineResponse(activeCall, "Oh okay, how can I help you?")
  }else{ // neutral answer
    await this.playInlineResponse(activeCall, "Sorry, can you repeat it?")
  }
}

Handle call transfer

To transfer a call, I use the telephony session Id and the call party Id to make a blind transfer to the assigned agent.

blindTransferCall: async function(activeCall){
  var endpoint = '/restapi/v1.0/account/~/telephony/sessions/'
  endpoint += activeCall.telSessionId + '/parties/'
  endpoint += activeCall.partyId + '/transfer'

  try{
    let bodyParams = {
      extensionNumber: activeCall.assignedAgent.extensionNumber
    }
    await this.login()
    await platform.post(endpoint, bodyParams)
    activeCall.callSession.hangup()
  }catch(e){
    // if the transfer failed, reset the mainState and subState
    activeCall.conversationStates.setMainState('chatting')
    activeCall.conversationStates.setSubState('no-action')
    await this.playInlineResponse(activeCall, "Sorry, I can’t transfer your call right now ...")
    // Decide yourself what to do next in this situation
  }
}

It’s time to run the demo and see how it works. Check the instructions in the project README file to learn how to clone the project, set up the environment, and run the demo app.

Enhancement

I keep the demo features as simple as possible, but you can develop the app further if you want to:

  • Enhance the call transfer feature by adding the caller to a waiting list if the agent is busy. To do this, subscribe to the user presence event notification to get the agent’s presence status (see the sketch after this list). If the agent is busy, ask the caller if they can wait, then add the caller to the “waitingList” in the sample agent info object. When the agent becomes available, check the agent’s waitingList and make the call transfer.
  • Enhance the assistant services for customers. E.g., if a customer asks for their billing status, look up the billing info in the sample customer object and tell the customer the current and next billing info.
  • Build your own OpenAI Assistant object and use it with this voice call assistant.
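
For the first enhancement, a sketch of subscribing to an agent’s presence events with the @ringcentral/subscriptions package is shown below. The event filter is the documented presence notification; the waiting-list handling is left to you, and `agentExtensionId` is a placeholder for the agent’s extension Id.

const { Subscriptions } = require('@ringcentral/subscriptions')

async function watchAgentPresence(agentExtensionId){
  const subscriptions = new Subscriptions({ sdk: rcsdk })
  const subscription = subscriptions.createSubscription()
  subscription.setEventFilters([
    `/restapi/v1.0/account/~/extension/${agentExtensionId}/presence`
  ])
  subscription.on(subscription.events.notification, (msg) => {
    if (msg.body.telephonyStatus === 'NoCall'){
      // the agent is free => check the agent's waitingList and transfer the next call
    }
  })
  await subscription.register()
}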
