RealTime Agents APIs
As we start to build more agents that use real-time audio and video as input, I have been wondering what the right API to expose to developers would be.
Recently I wrote about existing RealTime Agents frameworks. All of them tend to use the same approach: the framework connects to an existing RTC platform and processes audio and video using pre-built extensions, or extensions you create yourself, to integrate with external services or tools.
This approach is very flexible, but it forces developers to deploy servers that handle media and to implement the orchestration logic of the agents with LLMs and other external systems. It also requires an additional RTC platform in the middle.
Also, a month ago OpenAI released its WebRTC API, which takes a completely different approach: there is no RTC platform in the middle, and control is delegated (at least during the Beta) to the client side.
The OpenAI WebRTC API is very simple and quick to integrate, but it doesn't allow proper control over what a user can or cannot do, because the configuration, prompts and tools are all controlled by the client. This will probably improve in future iterations, but for now it looks like a big limitation.
There is an alternative that splits media and control in a more traditional way, and I think it could be a good candidate for many use cases. With it, the server side of the agents only needs to handle the orchestration logic, or control, and stays out of the media path.
In this model an element provided by OpenAI or somebody else is responsible for the communication between the client application and the LLM backend, but delegates all control to an application service. That service decides the configuration of the session, the prompts, the tool calls, the integration with external services and any orchestration logic inside a session or between multiple sessions.
That interface could be a simple HTTP interface, which would allow deploying the agent code in any traditional environment, including lambda/serverless hosting for example.
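To make the idea a bit more concrete, here is a minimal sketch of what such an application service could look like. Everything in it is an assumption made for illustration: the event names (session.start, tool.call), the response fields (action, instructions, tools), the get_weather tool and the port are all hypothetical, and none of this is the API of OpenAI or of any existing proxy. It only shows that the control side can be a stateless function from one JSON event to one JSON decision, with no media handling at all.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def get_weather(city: str) -> dict:
    # Hypothetical tool; a real agent would call an external service here.
    return {"city": city, "forecast": "unknown"}


# Hypothetical tool registry: the application, not the client, decides what exists.
TOOLS = {"get_weather": get_weather}


def handle_control_event(event: dict) -> dict:
    """Map one control event sent by the media proxy to a decision."""
    kind = event.get("type")
    if kind == "session.start":
        # The server side picks the prompt and the allowed tools.
        return {
            "action": "configure",
            "instructions": "You are a helpful voice assistant.",
            "tools": sorted(TOOLS),
        }
    if kind == "tool.call":
        tool = TOOLS.get(event.get("name"))
        if tool is None:
            return {"action": "tool.error", "error": "unknown tool"}
        return {"action": "tool.result", "result": tool(**event.get("arguments", {}))}
    # Any other event needs no decision from the application.
    return {"action": "ignore"}


class ControlHandler(BaseHTTPRequestHandler):
    """Thin HTTP wrapper: one POST with a JSON event in, one JSON decision out."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            event = json.loads(self.rfile.read(length) or b"{}")
            status, body = 200, handle_control_event(event)
        except (ValueError, TypeError, AttributeError) as exc:
            # Malformed JSON, a non-object payload or bad tool arguments.
            status, body = 400, {"action": "error", "error": str(exc)}
        payload = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    # Blocks forever; call this from your own entry point.
    HTTPServer(("127.0.0.1", port), ControlHandler).serve_forever()
```

Because handle_control_event keeps no state between calls, the same function could sit behind a lambda/serverless handler instead of the small HTTP server shown here.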
It would be great if somebody exposed that, so we could see whether it is good enough for 95% of the use cases or not. I'm starting to prototype it in the open source LLM WebRTC proxy and hope to share a working version soon.
What do you think of this approach?