Jarvis: an Experiment Leveraging the OpenAI API, Apple HomeKit, and Siri

Modular, tailored to my needs, and a whole lot more useful than Siri

Eric Peterson
Better Programming

--

Photo by Ryan Yoo on Unsplash

I’m a computer nerd, and I always have been, so when ChatGPT landed in the news, I immediately got excited about the possibilities. As an analyst and business owner, I wondered if it could be used to make my team's and clients’ lives easier … but I figured I needed some experience with the technology first.

Like everyone else, I started asking ChatGPT questions, mostly mundane stuff, and that was fun. Then I discovered how good the platform is at helping people write code!

Last year I built a Python and JavaScript system for keeping track of car collections called the Auto Asset Manager. It works fine, but since I was new to Python and my JavaScript was pretty rusty, a lot of the code was suboptimal. I used ChatGPT to fix some of it, and it was brilliant! I would just ask, “What is the best way to access information in a multidimensional array in Python?” and it would send back essentially the exact code I needed.

I loved it.

But then I saw this post on combining OpenAI with Apple HomeKit via Siri and was inspired. I use HomeKit a lot, but honestly, I don’t use Siri as part of it because, well, she’s not very sophisticated.

But Jarvis is.

I started with Mate’s shortcut and more or less ended up rewriting it to be a) more modular and b) more tailored to my needs. I ended up with a controller shortcut called “Get Jarvis” and three child shortcuts to handle commands, queries, and the translation of the responses different devices provide. I needed the translator because, like anyone with HomeKit gear, I run a plethora of systems through Homebridge, and each one seems to report state a different way (e.g., Meross says a switch is “on” or “off,” Philips says “yes” a light is on or “no” it’s not, and Remootio says “1” when my gate is open and “0” when it’s closed).
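
As a rough idea of what that translator does, here is a minimal Python sketch; the vendor formats in the comments are just the examples above, and the canonical state names are my own assumption, not the actual shortcut:

```python
# Normalize the state strings that different Homebridge plugins return
# into one canonical vocabulary the rest of the shortcuts can rely on.
def normalize_state(raw):
    """Map a vendor-specific state string to a canonical state value."""
    raw = str(raw).strip().lower()
    mapping = {
        "on": "on", "off": "off",    # e.g., Meross switches
        "yes": "on", "no": "off",    # e.g., Philips lights
        "1": "open", "0": "closed",  # e.g., Remootio gate
    }
    return mapping.get(raw, "unknown")
```

With this in place, `normalize_state("Yes")` and `normalize_state("on")` both come back as `"on"`, and anything unrecognized surfaces as `"unknown"` instead of silently breaking a shortcut.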

Here is a short video of what Jarvis is able to do for me (using Siri as a baseline for what is possible today):

Behind the scenes, there is quite a lot going on.

In a nutshell:

  • You activate Siri normally but immediately ask her to “Get Jarvis” which is the name of the controller shortcut;
  • If Jarvis is being called for the first time in a session he responds with a simple prompt or, if relevant, a follow-up question;
  • You make whatever request you have of Jarvis and that gets packaged up with the prompt to the OpenAI API (more on that later);
  • OpenAI returns a JSON object;
  • Jarvis parses the response into one of six categories: a command, a query, an answer to a non-home question, a weather report, a clarifying question, or a composed message;
  • Depending on the category, Jarvis either handles the request itself or calls one of the other shortcuts, which are kept separate just for cleaner code;
  • Jarvis does whatever was requested, provides feedback, and asks if you need additional help.
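
The routing step in the middle of that flow can be sketched in a few lines of Python; the handler bodies here are placeholders (the real handlers are Apple Shortcuts), and the category names follow the prompt described later in the post:

```python
import json

# Sketch of the controller logic: parse the OpenAI response and route it
# by its "action" field to a handler. Handler bodies are stand-ins only.
def dispatch(response_text):
    """Return (action, result) for a raw JSON response string."""
    data = json.loads(response_text)
    handlers = {
        "command": lambda d: f"executing on {len(d.get('devices', []))} device(s)",
        "query":   lambda d: f"checking {len(d.get('devices', []))} device(s)",
        "answer":  lambda d: d.get("answer", ""),
        "clarify": lambda d: d.get("question", ""),
        "message": lambda d: f"composing message to {d.get('recipient', '')}",
    }
    action = data.get("action", "clarify")
    handler = handlers.get(action, handlers["clarify"])
    return action, handler(data)
```

The unknown-action case falls back to asking for clarification, which mirrors how Jarvis responds when a request doesn’t parse cleanly.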

Some clever things Jarvis can currently do include:

  • Jarvis can run complex routines; for example, I can tell Jarvis “I am leaving,” and he will open my gate and garage door, turn off any lights that are on, and lock up the house. Conversely, when I get home, he can undo all of that, with the lights depending on the time of day;
  • Jarvis can open other applications as I need them. Shortcuts doesn’t have direct access to cameras (despite them being part of HomeKit), so if I need to see a camera, Jarvis opens the Home app for me;
  • Jarvis can start my Harmony TV setup and prep everything so that I can watch Netflix or listen to music on Apple TV. I can’t get Jarvis to start a program … yet … but I have him open my Harmony Remote which is a step in the right direction;
  • Jarvis can set one-click reminders for future events; for example, I can tell Jarvis “turn on the lights at 5:30 PM,” and he will create a reminder for that time that will pop up with a link I can click to turn on the lights;
  • Jarvis can compose short text messages to recipients in my address book and send them via the Messages app, or he can call or FaceTime the person directly;
  • Jarvis is able to help me with my car collection … thanks to another private API that informs Jarvis about mileage, age, maintenance issues, and the general value of each car. This, of course, requires a little more programming, but it also opens up a whole new world of opportunities, I think.
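
The “I am leaving” routine in the first bullet could be modeled as plain data that the command shortcut walks through step by step. This is purely an illustrative sketch; the phrases, device names, and states below are assumptions, not the actual shortcut contents:

```python
# Hypothetical routines: each trigger phrase maps to an ordered list of
# (device, state) steps. Names and states here are illustrative only.
ROUTINES = {
    "i am leaving": [
        ("gate", "opened"),
        ("garage door", "opened"),
        ("all lights", "off"),
        ("front door", "locked"),
    ],
    "i am home": [
        ("gate", "closed"),
        ("front door", "unlocked"),
    ],
}

def expand_routine(phrase):
    """Return the device steps for a routine phrase, or None if it isn't one."""
    return ROUTINES.get(phrase.strip().lower())
```

Keeping routines as data rather than logic makes it easy to add a new one, or to vary a step (like the time-of-day lights behavior) before executing it.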

What makes it work is something like this:

You have been asked to [request]

Respond to this request sent to a smart home only in JSON format, which will be interpreted by application code to execute the actions. These requests should be categorised into six groups:

- "command": change the state of an accessory (required properties in the response JSON: "action", "request", "devices", "scheduleTimeStamp")
- "query": get state of an accessory (required properties in the response JSON: "action", "confirmation", "devices")
- "answer": when the request has nothing to do with the smart home. Answer these to the best of your knowledge, and use this weather information below if appropriate (required properties in the response JSON: "action", "answer")

[weather]

- "weather": if the request is about the weather include the zip code for the location being asked about in a field "zipcode". If no location is specified use "98607" as a zipcode (required properties in the response JSON: "action", "weather", "zipcode")
- "clarify": when the action is not obvious and requires rephrasing the input from the user, ask the user to be more specific. This will be categorised into a "question" action. (required properties in the response JSON: "action", "question")
- "message": send a creative and thoughtful message to a recipient or recipients (required properties in the response JSON: "action", "message", "recipient")

Details about the response JSON:

The "action" property should be one and only one of the request categories: "command", "query", "answer", "weather", "clarify", or "message"
The "request" property should echo the request stripping out ALL information about timing. Do not include any date or time information in this property, ever
The "confirmation" property should describe whether we are trying to "confirm" the state of a device or "check" the state against a desired state
The "devices" property should include only the accessories matching the request, along with the state being requested, in lowercase. Each device entry should follow this exact JSON format:

{
  'device': (exact accessory name),
  'state': (state being requested)
}

If I indicate that a device is "called" something, please use that exact name in the device field.

Regarding "state":

- lights can be turned "on" or "off"
- blinds can be "opened" or "closed"
- human doors can be "locked" or "unlocked"
- garage doors can be "opened" or "closed"
- thermostats can tell you the "temperature" in the room but not make changes
- cameras can be "shown" in the Nest app
- "listen to music" or "watch tv" is a command
- any device can be "offline"
- if I ask if a device is "running" check to see if it is "on"

Only if the request is for a time in the future, the "scheduleTimeStamp" property should be the exact time the request has been made for, relative to now; otherwise this field should be blank. For example, if I say I am leaving "in 15 minutes," you would return an RFC 2822 date 15 minutes later than the current date
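
Outside of Shortcuts, filling the template’s [request] and [weather] slots before the API call might look like this minimal Python sketch; the template string here is abbreviated (the full text is the prompt above), and the resulting string is what gets sent to the OpenAI chat completions endpoint:

```python
# Abbreviated stand-in for the full prompt above; {request} and {weather}
# correspond to the [request] and [weather] slots in the template.
PROMPT_TEMPLATE = (
    "You have been asked to {request}\n\n"
    "Respond to this request sent to a smart home only in JSON format ...\n"
    "{weather}\n"
)

def build_prompt(request, weather=""):
    """Return the full prompt with the user's request and optional weather info."""
    return PROMPT_TEMPLATE.format(request=request, weather=weather)
```

Because the entire rule set rides along with every request, the prompt dominates the token count, which is exactly why trimming it matters for cost later on.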

It’s pretty involved, and because the prompt has to be substantial, each interaction ends up costing maybe five cents (the OpenAI API isn’t free), but at the same time I am describing my house and the devices in it the way I would to a (mostly) normal human being, and the API does the rest.

As an example, if I ask Jarvis to “check the lights in the bedroom”, the JSON response looks like this:

{
  "action": "command",
  "request": "check the lights in the bedroom",
  "devices": [
    {
      "device": "eric's light",
      "state": "on"
    },
    {
      "device": "amity's light",
      "state": "on"
    },
    {
      "device": "bathroom light",
      "state": "on"
    }
  ],
  "scheduleTimeStamp": ""
}
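
One plausible way a shortcut could consume that "devices" list is to compare each requested state against what HomeKit actually reports and build a spoken reply. In this sketch, `get_homekit_state` is a hypothetical stand-in for the real accessory lookup, which Shortcuts performs natively:

```python
# Compare requested device states with actual states and build a reply.
# get_homekit_state is a hypothetical callable: device name -> current state.
def summarize(devices, get_homekit_state):
    """Return a spoken-style summary for a list of {device, state} dicts."""
    lines = []
    for d in devices:
        actual = get_homekit_state(d["device"])
        if actual == d["state"]:
            lines.append(f"{d['device']} is {actual}")
        else:
            lines.append(f"{d['device']} is {actual}, not {d['state']}")
    return "; ".join(lines)
```

The point is that the API response is deliberately shaped so this loop needs no natural-language understanding at all; all the interpretation happened on OpenAI’s side.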

As you can see, I get an “action” I can parse to determine which sub-shortcut should handle the response, plus a list of the “devices” and “states” I asked about.

With this I have a bunch of Apple Shortcut “scripts” that do what they need to do. For example, here is some of the Jarvis command script:

Coding shortcuts is pretty tedious but it gets easier with every hundred lines of code …

None of this is perfect. OpenAI’s responses can be slow and sometimes get confused by the prompt, but for a first stab, it’s a whole lot more useful to me than Siri. The biggest challenge is getting the prompt to behave consistently enough for the JSON to be parsed, and that requires an awful lot of trial and error.

Here is my billing graph with OpenAI for the work so far:

This isn’t free: 1,000 tokens cost $0.02, and each of my requests runs around 2,000 tokens total
OpenAI shows me token usage in 5-minute buckets or in total, which is nice for debugging
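
The per-request arithmetic checks out as a back-of-envelope calculation:

```python
# Back-of-envelope check on the per-request cost quoted above.
PRICE_PER_1K_TOKENS = 0.02   # dollars per 1,000 tokens
TOKENS_PER_REQUEST = 2_000   # prompt + response, roughly

cost_per_request = TOKENS_PER_REQUEST / 1_000 * PRICE_PER_1K_TOKENS
# 2,000 / 1,000 * $0.02 = $0.04 per request, i.e. roughly the
# "maybe five cents" per interaction mentioned earlier
```

Since the prompt is the bulk of those tokens, shrinking it directly shrinks the bill.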

My immediate goals are to A) get the API to return very consistent results that give the shortcuts a fair chance of executing correctly, and B) reduce the prompt to the bare minimum needed to achieve “A,” which will help reduce costs.

Still, my estimate is that after testing, a robust use of Jarvis will probably cost a few dollars per month. Pretty cheap to have my own digital concierge!

What do you think?

Let me know in the comments below. Eventually, I hope to get back to the “make my people’s lives easier” use of AI … but for now I am happy to chat about home automation!

--

Old man in the Digital Analytics industry; young man in car collection and philanthropy; aged code nerd, amateur photographer, AI enthusiast.