Generative APIs with Large Language Models

Justin Chen
12 min read · Mar 31, 2023


Rikin Marfatia*, Justin Chen†
* developed the Android implementation
† designed the API and system architecture

Current language user interfaces (LUIs) and accessibility features can only process one command at a time and struggle to handle ambiguity, resulting in unnatural language interactions. An LUI capable of incorporating context and managing complex, ambiguous instructions could significantly enhance usability and productivity, particularly for the visually impaired.

In this article, we present a Generative API (G-API) designed to address these limitations by offering dynamic functionality in response to user interactions. Harnessing OpenAI’s GPT-4, we have developed a proof-of-concept LUI for Android devices. Our G-API is developed and managed directly by GPT-4, unlike OpenAI’s Plugins, which rely on existing APIs, paving the way for tailor-made APIs for any application.

We explore the potential for advanced language-based interactions using large language models (LLMs), the differences between our LUI and traditional ones, as well as challenges, unresolved issues, and future directions. Furthermore, we discuss the implications of self-learning agents that can create and use their own tools, raising important questions about safety, privacy, and the development of autonomously innovative agents.

We encourage interested parties to access our code on GitHub.

Enabling more natural interactions with our devices

Our goal is to develop a robust natural-language-centric interface that users can engage with conversationally. As a first step, we set out to show that such an interface can effortlessly handle compound commands, for example, “Take a selfie to celebrate our demo, send it to my girlfriend, and upload it to Twitter and tag heyrikin.” This command requires the system to understand the context of each action and the order in which the actions must be executed. It must understand that a selfie is taken with the front-facing camera. It has to disambiguate who my girlfriend is. Then it has to know what controls it has access to in order to open Twitter, understand how to tag a user in a tweet, and know that it should input that string into Twitter’s text box.

Today’s language user interfaces such as Siri, Alexa, and Google Assistant immediately fail at this. We asked ourselves whether we were just building a worse Google Assistant, so as a sanity check we gave this exact command to Google Assistant (we encourage readers to try it on their own devices). After the first action, it opened the camera and stopped.

After posting this blog, we learned about a GPT-3-powered LUI developed to replace Siri for home automation [9]. The first difference between our approach and theirs is that GPT-4 developed most of our API itself with little guidance from us, whereas their implementation relies on a prompt hand-crafted by the developer. Additionally, their implementation only handles one command at a time, whereas our system handles compound commands with complex dependencies. Lastly, their work does not address imagined capabilities, which we discuss later.

Below is a video of our first live demo of an LLM-based LUI on Android. We think it is fair to say it performed beyond expectations.

Our first live demo of the LLM-powered Android LUI [6]
Tweet that was sent in the first live demo [7]

What separates this from OpenAI Plugins?

The same day we put our demo video online, OpenAI released Plugins [1], which enable ChatGPT to access existing APIs to gain up-to-date information and extend its capabilities [8].

We use GPT-4 [5] in a similar way to control the Android device. However, the primary difference is that the API GPT-4 interacts with here is one it created itself, specifically for Android, with some guidance from us. GPT-4 is the API. We took this approach because the Android SDK does not naturally support external programmatic control, so we needed a set of rules that an LLM could use to map commands onto the device.
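
To make this flow concrete, here is a minimal sketch, assuming OkHttp and Android’s bundled org.json classes, of the loop such a system runs: the user’s command plus the GPT-4-generated API specification goes to the chat completions endpoint, and the reply is expected to be a JSON action list for the device to execute. The class name GApiClient and the prompt wording are illustrative, not our actual implementation.

import okhttp3.MediaType.Companion.toMediaType
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.RequestBody.Companion.toRequestBody
import org.json.JSONArray
import org.json.JSONObject

// Illustrative sketch: send the user's command plus the GPT-4-generated API
// specification to the chat completions endpoint and return the model's reply,
// which is expected to contain an "actions" array.
class GApiClient(private val apiKey: String, private val apiSpec: String) {
    private val http = OkHttpClient()

    fun requestActions(command: String): String {
        val body = JSONObject()
            .put("model", "gpt-4")
            .put("messages", JSONArray()
                .put(JSONObject().put("role", "system").put("content", apiSpec))
                .put(JSONObject().put("role", "user")
                    .put("content", JSONObject().put("command", command).toString())))
            .toString()

        val request = Request.Builder()
            .url("https://api.openai.com/v1/chat/completions")
            .header("Authorization", "Bearer $apiKey")
            .post(body.toRequestBody("application/json".toMediaType()))
            .build()

        http.newCall(request).execute().use { response ->
            // Extract the assistant message, which should be a JSON object with an "actions" array.
            return JSONObject(response.body!!.string())
                .getJSONArray("choices").getJSONObject(0)
                .getJSONObject("message").getString("content")
        }
    }
}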

GPT-4 is the API

GPT-4 not only created its own syntax to communicate with Android but, given a few examples, also reasons about how to use its own invention.

Developing the API

Because of API request limits on GPT-4 at the time, we started by providing GPT-3.5 with an example and asking it to create a concise syntax for controlling an Android phone. The intention was to create a pattern that reduces token count and has a modular, device-agnostic syntax. The following is a sample from that chat:

Prompt:

Design a naming syntax for controlling all apis on an android phone. The
syntax should be simple and concise. For example, if you want to access the
front-facing camera to record a video for 5 seconds a command could be
"fr-camera-video-5s". Design a full API for all phone functionality.

Response:

Here's an example API syntax for controlling various features of an Android
phone:

General format:

<component>-<action>-<parameters>

Components:

bluetooth: Bluetooth functionality
camera: Camera functionality
...

Actions:

The actions available depend on the component being used.

Parameters:

The parameters available depend on the action being used.

Here are some examples of how this API syntax could be used:

bluetooth-on: Turns on Bluetooth
bluetooth-off: Turns off Bluetooth
...

Next, we gave GPT-3.5’s output to GPT-4 verbatim and asked it to refine the API. What surprised us was that GPT-4 established its own naming convention: for example, bt for bluetooth, stg for settings, rr for rear, and sns for sensors. It then took those new commands and explained what each was for.

Prompt:

Design a naming syntax for controlling all apis on an android phone. The
syntax should be simple and concise. For example, if you want to access the
front-facing camera to record a video for 5 seconds a command could be
"fr-camera-video-5s". Design a full API for all phone functionality.

GPT3.5 suggested the following and I would like you to improve and extend
this:"
General format:
<component>-<action>-<parameters>
...

Response:

To improve and extend the API syntax, I suggest we introduce a clearer
structure with more specific actions and parameters for certain components.
Additionally, I will provide more examples for each component. The updated
general format is:

<component>-<subcomponent>-<action>-<parameters>

Components and Subcomponents:

bluetooth
  bt-audio
  bt-device
camera
  cam-fr (front-facing)
  cam-rr (rear-facing)
settings
  stg-brightness
  stg-volume
...
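
As a purely hypothetical illustration of how thin the device-side mapping for this syntax could be, the sketch below splits a dash-delimited command into structured fields, assuming a subcomponent is always a two-token prefix drawn from the table above; the real grammar is whatever the generated API settles on.

// Purely illustrative parser for the refined syntax; the subcomponent prefixes
// come from the table above and are assumptions for this sketch.
data class ParsedCommand(
    val component: String,
    val subcomponent: String?,
    val action: String,
    val parameters: String?
)

private val subcomponentPrefixes = setOf(
    "bt-audio", "bt-device", "cam-fr", "cam-rr", "stg-brightness", "stg-volume"
)

fun parseCommand(raw: String): ParsedCommand {
    val parts = raw.split("-")
    val prefix = parts.take(2).joinToString("-")
    val hasSub = prefix in subcomponentPrefixes
    val rest = if (hasSub) parts.drop(2) else parts.drop(1)
    return ParsedCommand(
        component = parts.first(),
        subcomponent = if (hasSub) prefix else null,
        action = rest.firstOrNull() ?: "",
        parameters = rest.drop(1).joinToString("-").ifEmpty { null }
    )
}

// Example: parseCommand("cam-fr-video-5s") ==
//   ParsedCommand(component = "cam", subcomponent = "cam-fr", action = "video", parameters = "5s")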

After asking it to extend the API to handle user-installed applications, we asked it to give an example of its usage. To our surprise, it understood that the API it was creating would require us to develop the Android logic to execute the sequence of commands it returned, and that additional changes would be needed to handle a broader set of use cases. It also explained the assumptions made by that particular instantiation of the API.

“In a real-world implementation, you would need to chain these actions together programmatically, ensuring that each step is completed before proceeding to the next. Additionally, the API syntax may need to be extended to handle file references and app-specific actions more effectively.”
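
GPT-4’s point about chaining can be illustrated with a small Kotlin sketch in which every step is a suspend function, so each action completes before the next begins. The helper functions here are stubs for illustration, not calls from our Android implementation.

import kotlinx.coroutines.delay

// Minimal sketch of the chaining GPT-4 describes: each step is a suspend function,
// so an action only starts after the previous one has completed and produced the
// artifact (here, a photo path) the next step needs.
data class Action(val component: String, val action: String, val parameters: String = "", val appId: String? = null)

suspend fun takeFrontPhoto(): String { delay(100); return "/sdcard/DCIM/selfie.jpg" } // stub
suspend fun sendText(recipient: String, attachment: String) { delay(100) }            // stub
suspend fun uploadPhoto(appId: String, path: String) { delay(100) }                   // stub

suspend fun runSequentially(actions: List<Action>) {
    var photoPath: String? = null
    for (step in actions) {
        when (step.action) {
            "photo"        -> photoPath = takeFrontPhoto()
            "send"         -> sendText(step.parameters, requireNotNull(photoPath))
            "upload-photo" -> uploadPhoto(requireNotNull(step.appId), requireNotNull(photoPath))
            else           -> error("Unsupported action: ${step.action}")
        }
    }
}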

After a few more iterations of refinement, we also asked it to create the request and response data shapes for communicating with the device.

Prompt:

Now generate the prompt as a JSON object that will be sent to the language
model. For example:
"{
command: ''take a selfie and send it to X and upload it to Instagram and
Twitter",
state: {
...
}
}"

Assume the language model already knows the API syntax we drafted above.

Response:

Here's a JSON object that represents the prompt to be sent to the language
model, with the command and the current state of the phone:

{
  "command": "take a selfie and send it to X and upload it to Instagram and Twitter",
  "state": {
    "device": {
      "model": "Phone Model",
      "os_version": "OS Version",
      "hardware_specs": "Hardware Specifications"
    },
    "battery": {
      "level": "Battery Level",
      "charging_status": "Charging Status"
    },
    "connectivity": {
      "wifi_status": "Wi-Fi Status",
      "cellular_status": "Cellular Status",
      "signal_strength": "Signal Strength",
      "network_type": "Network Type"
    },
    "bluetooth": {
      "status": "Bluetooth Status",
      "paired_devices": "List of Paired Devices"
    },
    "active_apps": [
      "List of Currently Running Apps"
    ],
    "contacts": [
      "List of Contacts"
    ],
    "permissions": [
      "List of App Permissions"
    ],
    "files": [
      "List of File Paths and Metadata"
    ],
    "notifications": [
      "List of Recent Notifications"
    ]
  }
}

Prompt:

Now generate a JSON response using the API syntax for our previous example:
"take a selfie and send it to X and upload it to Instagram and Twitter"

Response:

To represent the sequence of API calls for the example "take a selfie and
send it to X and upload it to Instagram and Twitter" using JSON, you can use
the following structure:

{
  "actions": [
    {
      "component": "camera",
      "subcomponent": "cam-fr",
      "action": "photo",
      "parameters": "<photo-params>"
    },
    {
      "component": "text",
      "action": "send",
      "parameters": "<phone-number>-<photo>"
    },
    {
      "component": "app",
      "app_id": "com.instagram.android",
      "action": "upload-photo",
      "parameters": "<photo>"
    },
    {
      "component": "app",
      "app_id": "com.twitter.android",
      "action": "upload-photo",
      "parameters": "<photo>"
    }
  ]
}

...
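
For reference, the two data shapes above translate naturally into Kotlin types. The sketch below, using kotlinx.serialization, models only a subset of the state fields; the property names mirror the JSON keys in the examples, but the types themselves are assumptions rather than our exact classes.

import kotlinx.serialization.SerialName
import kotlinx.serialization.Serializable
import kotlinx.serialization.decodeFromString
import kotlinx.serialization.json.Json

// Sketch of the request and response shapes as Kotlin types (only a subset of
// the state fields is modelled here).
@Serializable
data class GApiRequest(val command: String, val state: DeviceState)

@Serializable
data class DeviceState(
    val device: Map<String, String> = emptyMap(),
    val battery: Map<String, String> = emptyMap(),
    val contacts: List<String> = emptyList(),
    val notifications: List<String> = emptyList()
)

@Serializable
data class GApiResponse(val actions: List<GApiAction>)

@Serializable
data class GApiAction(
    val component: String,
    val subcomponent: String? = null,
    @SerialName("app_id") val appId: String? = null,
    val action: String,
    val parameters: String? = null
)

// Decode the model's reply into executable actions, ignoring fields we don't model.
private val json = Json { ignoreUnknownKeys = true }
fun parseActions(raw: String): List<GApiAction> = json.decodeFromString<GApiResponse>(raw).actions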

In the demo video, we did not send the phone’s state along with the commands, to keep the tests simple. However, we hypothesize that providing the phone’s state gives the model useful context. For example, if the battery is low, the model may determine that a command is not feasible.

Simulating the API and what makes it generative

After creating the API, we primed a new GPT-4 chat with the API table, example inputs and outputs, and instructions on how to ask for more information when faced with ambiguity. Lastly, like a real API, we prompted it to respond with an HTTP 200 status.
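
As a rough illustration of what that priming can look like, the sketch below assembles a system message from the API table, a few worked examples, the instruction to ask for clarification, and the HTTP 200 acknowledgement. The wording is hypothetical, not our exact prompt.

// Illustrative sketch of assembling the priming message for a fresh GPT-4 chat.
// The exact wording of our prompt is not reproduced here.
fun buildPrimingPrompt(apiTable: String, examples: List<Pair<String, String>>): String = buildString {
    appendLine("You are an API for controlling an Android phone. Respond only with the syntax below.")
    appendLine(apiTable)
    examples.forEach { (request, response) ->
        appendLine("Example request:")
        appendLine(request)
        appendLine("Example response:")
        appendLine(response)
    }
    appendLine("If a command is ambiguous, ask a clarifying question before returning any actions.")
    appendLine("Acknowledge with \"HTTP 200\" once you are ready to receive requests.")
}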

Following the HTTP 200 message, we gave it the following to simulate an API call:

{
  "command": "i really need to focus for my midterm. turn off all notifications, volume, ignore calls, texts. But don't block my girlfriend.",
  "state": {
    "device": {
      "model": "Phone Model",
      "os_version": "OS Version",
      "hardware_specs": "Hardware Specifications"
    },
    "battery": {
      "level": "Battery Level",
      "charging_status": "Charging Status"
    },
    "connectivity": {
      "wifi_status": "Wi-Fi Status",
      "cellular_status": "Cellular Status",
      "signal_strength": "Signal Strength",
      "network_type": "Network Type"
    },
    "bluetooth": {
      "status": "Bluetooth Status",
      "paired_devices": "List of Paired Devices"
    },
    "active_apps": [
      "List of Currently Running Apps"
    ],
    "contacts": [
      "List of Contacts"
    ],
    "permissions": [
      "List of App Permissions"
    ],
    "files": [
      "List of File Paths and Metadata"
    ],
    "notifications": [
      "List of Recent Notifications"
    ]
  }
}

It asked who the user’s girlfriend is and, after receiving that information, replied with a sequence of actions to mute the phone. Notably, it understood that to faithfully execute the command it could not unblock the requested contact first and then block all contacts; it had to unblock that contact last.

{
  "actions": [
    {
      "component": "app",
      "action": "disable-notifications",
      "parameters": ""
    },
    {
      "component": "device",
      "action": "set-volume",
      "parameters": "0"
    },
    {
      "component": "call",
      "action": "block-all",
      "parameters": ""
    },
    {
      "component": "text",
      "action": "block-all",
      "parameters": ""
    },
    {
      "component": "call",
      "action": "unblock",
      "parameters": "555-555-5555"
    },
    {
      "component": "text",
      "action": "unblock",
      "parameters": "555-555-5555"
    }
  ]
}

This is where things got even more interesting. We asked it to save this command as “study mode”. However, saving commands was never part of the original API we created. Instead of replying that the command was not available, or returning an HTTP 404 status as a traditional API would, it imagined the action save-custom-mode. Although hallucinations are a well-known property of these models [3], we believe this capability, imagining API calls based on an API the LLM developed itself, opens possibilities for adaptive interactions with what we are calling Generative APIs (G-APIs).

{
  "actions": [
    {
      "component": "device",
      "action": "save-custom-mode",
      "parameters": "study-mode",
      "commands": [
        {
          "component": "app",
          "action": "disable-notifications",
          "parameters": ""
        },
        {
          "component": "device",
          "action": "set-volume",
          "parameters": "0"
        },
        {
          "component": "call",
          "action": "block-all",
          "parameters": ""
        },
        {
          "component": "text",
          "action": "block-all",
          "parameters": ""
        },
        {
          "component": "call",
          "action": "unblock",
          "parameters": "555-555-5555"
        },
        {
          "component": "text",
          "action": "unblock",
          "parameters": "555-555-5555"
        }
      ]
    }
  ]
}

This is very different from how programs typically interact with APIs. The Android side, which is the ground truth, expects neither this action nor this new data shape. As of the demo, the Android code could not handle imagined commands. Effectively updating the model’s understanding of the API and adapting the Android code remain open challenges.
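
One possible direction for the Android side, sketched below as an assumption rather than what the demo implements, is to execute the actions it recognizes and set aside imagined ones, like save-custom-mode, as candidates for functionality to be implemented later:

// Sketch of one way to keep imagined commands from breaking the executor: run the
// actions the device recognizes and set aside anything unknown for review or later
// code generation. Names here are hypothetical.
data class RawAction(val component: String, val action: String, val parameters: String? = null)

sealed interface DispatchResult {
    data class Executed(val action: RawAction) : DispatchResult
    data class NeedsImplementation(val action: RawAction) : DispatchResult
}

private val supportedActions = setOf("disable-notifications", "set-volume", "block-all", "unblock")

fun dispatch(actions: List<RawAction>): List<DispatchResult> = actions.map { action ->
    if (action.action in supportedActions) {
        // executeOnDevice(action) would call into the Android layer here.
        DispatchResult.Executed(action)
    } else {
        // Persist the imagined shape so a developer (or a code-generation step) can add it.
        DispatchResult.NeedsImplementation(action)
    }
}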

Challenges

The main challenge in developing this proof of concept was integrating with the Android SDK. The SDK is not naturally amenable to programmatically launching a sequence of applications in order. Additionally, each application has its own permission set, which makes a generalized solution difficult to build and to reason about. Rikin goes into the technical details in his Android vlog series [4].
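
For a sense of the glue involved, the sketch below resolves an app’s launcher intent through the PackageManager and starts it explicitly; knowing when that app has finished its part of a compound command is the hard, app-specific piece and is not shown.

import android.content.Context
import android.content.Intent

// Sketch: resolve an app's launcher intent and start it explicitly. Returns false
// if the package is not installed or exposes no launcher activity.
fun launchApp(context: Context, packageName: String): Boolean {
    val intent = context.packageManager.getLaunchIntentForPackage(packageName) ?: return false
    intent.addFlags(Intent.FLAG_ACTIVITY_NEW_TASK)
    context.startActivity(intent)
    return true
}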

Additionally, LLMs today have limited context lengths, which upper-bound the amount of context that can be provided. Because we used OpenAI’s paid API, careful use of the context window was also important for managing cost. At the current pace of research and demand, however, these models will likely offer much longer context lengths at a fraction of today’s cost.
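
One simple way to manage this, shown below as an illustrative sketch rather than something our demo does, is to drop the bulkier state fields before serializing the prompt until a crude character budget is met; a real token counter would be more accurate.

// Rough sketch: prune the bulkier state fields until a crude character budget is met.
// The budget and drop order are illustrative assumptions.
fun pruneState(state: MutableMap<String, Any>, maxChars: Int = 6000): Map<String, Any> {
    val dropOrder = listOf("files", "notifications", "contacts", "active_apps")
    for (key in dropOrder) {
        if (state.toString().length <= maxChars) break
        state.remove(key)
    }
    return state
}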

Discussions

Though still in its infancy, the technology behind GPT-4 is quite remarkable: it managed to generate its own API for operating an Android device from just a few prompts. Just as it was unexpected that next-token prediction could produce models with language mastery and reasoning, it is possible that properly framing the concept of innovation may lead to agents whose ingenuity rivals that of humans. One intriguing notion is that invention could be abstractly defined as an iterative process of developing APIs, through the addition, subtraction, and modification of parameters, that enable interaction with new artifacts.

Nevertheless, these ideas bring forth crucial questions that warrant further exploration. It is essential to conduct additional research in alignment, risk, and governance to establish safeguards for the safe deployment and experimentation of such technologies. Moreover, it is vital to enforce standards regarding the access and storage of data by these models. Input from not only experts but also the general public, who have contributed to the vast datasets used for training large language models and who will be significantly impacted by the resulting decisions, will be invaluable in these discussions.

Near-term Roadmap

Android’s native speech-to-text integration

Currently, Rikin is working on integrating Android’s native speech-to-text. To make these systems more usable, we also want to integrate small models that fit in memory, for example via llama.cpp [2], and that could be fine-tuned for this specific domain.
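
For the speech-to-text piece, the entry point would likely resemble the sketch below, which launches Android’s built-in recognizer and reads the top transcription back; the request code and prompt text are illustrative.

import android.app.Activity
import android.content.Intent
import android.speech.RecognizerIntent

// Sketch of a speech-to-text entry point using Android's built-in recognizer:
// launch the recognition intent, then read the top transcription in onActivityResult
// and hand it to the G-API request builder.
const val REQUEST_SPEECH = 1001

fun startListening(activity: Activity) {
    val intent = Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH).apply {
        putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL, RecognizerIntent.LANGUAGE_MODEL_FREE_FORM)
        putExtra(RecognizerIntent.EXTRA_PROMPT, "What would you like your phone to do?")
    }
    activity.startActivityForResult(intent, REQUEST_SPEECH)
}

// In onActivityResult:
//   val results = data?.getStringArrayListExtra(RecognizerIntent.EXTRA_RESULTS)
//   val command = results?.firstOrNull()  // pass to the G-API request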

I am currently developing an ETL pipeline to provide the model with greater personalized context and to automate continuous integration and deployment of functionality based on the LLM’s imagined API commands.

Lastly, we plan to integrate text-to-speech to make devices more accessible to the visually impaired, and we have an extended roadmap in the works.

Conclusions

Using large language models for language user interfaces opens the door to newly enhanced interactions. However, many challenges still face the development of these systems today. We have open-sourced our code for others to build on.

We hope that introducing this concept of Generative APIs encourages the development of LLM-friendly SDKs that further accessibility and co-evolutionary interaction with our devices.

Acknowledgements

Thank you to Michael Chang, Jordan Wick, Tony Wang, Ajay Jain and 4dimensionalcube for their time and incredible feedback that greatly improved the readability and quality of this blog.

References

  1. Agarwal, Sandhini, et al. “ChatGPT Plugins.” OpenAI, 23 Mar. 2023, https://openai.com/blog/chatgpt-plugins. Accessed 28 Mar. 2023.
  2. Gerganov, Georgi. “llama.cpp.” GitHub, https://github.com/ggerganov/llama.cpp. Accessed 30 Mar. 2023.
  3. Ji, Ziwei, et al. “Survey of hallucination in natural language generation.” ACM Computing Surveys 55.12 (2023): 1–38.
  4. Marfatia, Rikin. “Rikin Marfatia.” YouTube, https://www.youtube.com/c/rikinmarfatia. Accessed 30 Mar. 2023.
  5. “Mobile Operating System Market Share Worldwide.” StatCounter Global Stats, https://gs.statcounter.com/os-market-share/mobile/worldwide. Accessed 30 Mar. 2023.
  6. OpenAI. “GPT-4 Technical Report.” arXiv, 27 Mar. 2023, https://arxiv.org/abs/2303.08774v3. Accessed 28 Mar. 2023. Version 3.
  7. Rikin Marfatia [@heyrikin]. “Over the past week @ch3njus and I have been working on a GPT-4 powered Language User interface for Android devices. It’s REALLY cool and can change the way we use our phones, or at the very least it can post some tweets for you.
    #AndroidDev #GPT4” Twitter, 23 Mar. 2023, 9:03 p.m., https://twitter.com/heyrikin/status/1639070325949480961
  8. Rikin Marfatia [@heyrikin]. “📷 OMG! GPT-4 is taking over the world 🎉🎉 @ch3njus #GPT4 #AI #Revolution 💪💪” Twitter, 23 Mar. 2023, 8:45 p.m., https://twitter.com/heyrikin/status/1639065841936523265
  9. Marschalko, Mate. “ChatGPT in an iOS Shortcut — Worlds Smartest HomeKit Voice Assistant.” Medium, https://matemarschalko.medium.com/chatgpt-in-an-ios-shortcut-worlds-smartest-homekit-voice-assistant-9a33b780007a. Accessed 1 Apr. 2023.

How to cite our work

@article{marfatia2023generative,
  author = {Marfatia, Rikin and Chen, Justin},
  title  = {Generative APIs with Large Language Models},
  year   = {2023},
  note   = {https://medium.com/@ch3njust1n/generative-apis-with-large-language-models-987108f52d1f}
}

Appendix

The original and complete conversations that led to creating the API are shown below.
