Siri and OpenAI Integration: An Experimental First Step

Jairo Garcia
18 min read · Apr 6, 2024


I’ve been thinking about integrating the OpenAI API with Siri on my MacBook. I came across numerous guides on how to do that, but most only covered conversing with the API via Siri without actually leveraging it to perform tasks on a Mac. I wanted to go a step further and control my computer with Siri. In this blog, I’m going to share how I created a tool that brings Siri and the OpenAI API together. It lets you open PyCharm projects, open your browser with voice instructions, generate code and copy it directly to your clipboard, and even search Spotify with Siri. This is just the first iteration, so it’s not a powerful, polished tool. Still, I will talk about the mistakes I made, the strategies I used, and the discoveries I encountered, including its advantages and limitations.

Table of Contents

How to Run
Download the Shortcuts:
Capabilities
Siri Shortcuts
OpenAI API
Service
The Big problem
Costs
Improvements and conclusions

How to Run

In the repository, you can find instructions on how to run the service.

Download the Shortcuts:

  1. Hey Jarvis
  2. Check Clipboard
  3. Jarvis Call

Create the assistant on the OpenAI Dashboard:

You can create an assistant in the assistant dashboard on OpenAI. You can copy the prompt used and the function definitions shown in this blog.

Capabilities

Siri Shortcuts

So, I ran into the challenge of creating a chat flow where Siri could not only respond to questions and handle requests but also hold a natural conversation with the user, just like a chat. This is when I stumbled upon Siri Shortcuts. It’s a pretty cool tool that lets you set up your own shortcuts to do a bunch of things, and yes, you can trigger them with Siri. To be honest, I hadn’t played around with Siri Shortcuts before diving into this project.

I decided to give it a shot. My main goal was to enable Siri to interact with users by asking questions at the end of a conversation, similar to how ChatGPT does. You know, like when you hit ChatGPT with a “hi, what’s up?” and it comes back with “Hello! I’m here to help you with any questions or tasks you have in mind. What’s on your mind today?”.

I wanted to craft shortcuts that did more than ask a question and respond to a single request (a one-shot conversation). My goal was to enable a back-and-forth dialogue where Siri wouldn’t just answer a single query but would engage in a continuous exchange (multiple requests), asking follow-up questions related to the context, until the user felt they no longer needed assistance. To bring this idea to life, I sketched out a flow chart:

The algorithm starts with a loop that repeats ‘N’ times, where ‘N’ is a predefined number of iterations for the question session. However, I implemented a flow control mechanism through global variables that allows ending the cycle before reaching ’N’ if certain conditions are met. This is achieved by evaluating the variable finish_loop:

  1. If the service’s response indicates that the user does not require more help (detected by the service using keywords like “goodbye” or “I don’t need more help”), the finish_loop variable is set to True, leading to the termination of the cycle.
  2. In each iteration, variables are initialized to their default values (‘’ — empty field), and the user’s input is requested along with the initial question that will always be the same the first time.
  3. When a user provides their input using voice, the system will request the service and wait for a result. If the user wants to use the text that is currently in the clipboard, they must explicitly say ‘clipboard.’ With this command, the value from the clipboard will be passed to the service for transformation or other purposes.
  4. If a result exists, the necessary values from the response are extracted, and the variable question is updated with the next question to be asked. Moreover, the conversation can continue based on the logic of the obtained response.
  5. In case there is an API error or no result, finish_loop is set to True and the conversation ends in the next cycle.
  6. If the result is satisfactory, the system will convey the resulting message to the user through Siri. If the result involves the clipboard, the clipboard will be updated with the new value.
  7. Between each interaction, there is a 1-second pause to simulate a more natural conversation.
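The flow above can be sketched in Python as a simplified simulation. Everything here is hypothetical: `call_service` stands in for the real FastAPI endpoint, `get_user_input` for Siri’s dictation, and `speak` for Siri’s voice output.

```python
import time

MAX_TURNS = 5  # 'N' in the flow chart


def run_conversation(get_user_input, call_service, speak):
    """Simplified simulation of the 'Hey Jarvis' shortcut loop."""
    finish_loop = False
    question = "How can I help you today?"  # initial question, always the same

    for _ in range(MAX_TURNS):
        if finish_loop:
            break

        speak(question)
        user_text = get_user_input()

        result = call_service(user_text)  # hypothetical service call
        if result is None:  # API error or no result: end on the next cycle
            finish_loop = True
            speak("Sorry, something went wrong.")
            continue

        speak(result["message"])
        if result.get("conversation_closed") == "True":
            finish_loop = True
        else:
            question = result.get("question", "Anything else?")

        time.sleep(1)  # 1-second pause for a more natural conversation
```

The stubbed functions make the control flow testable without Siri or the API in the loop.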

⚠️ Sometimes Siri Shortcuts close at random; I’m not sure if it’s because my setup was too complex or just my old computer. But I found a workaround that made a huge difference: breaking down the main shortcut into smaller, more manageable ones. This helped keep those random shutdowns in check.

Shortcuts used for this app

Hey Jarvis:

The main shortcut oversees the auxiliary shortcuts. You can activate it with Siri by saying ‘Hey Jarvis.’ Additionally, if you prefer, you can rename the shortcut to something less geeky 😅.

Example flow conversation

Check Clipboard:

The system uses a regex to check if the word ‘clipboard’ exists in the user’s message. If it does, the system retrieves the value of the clipboard, which will then be passed to the service.
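The shortcut itself uses a ‘Match Text’ action, but the same check in Python would look roughly like this (illustrative only):

```python
import re


def wants_clipboard(message: str) -> bool:
    # Case-insensitive, whole-word match for 'clipboard'
    return re.search(r'\bclipboard\b', message, re.IGNORECASE) is not None
```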

Jarvis Call:

⚠️ Sometimes, Siri Shortcuts close unexpectedly after approximately 10 seconds, displaying the error message, “Sorry, I’m still having problems with the connection.” This usually happens when OpenAI responses are delayed. So far, I have not found an effective solution: regardless of the design of the shortcut, the error message seems unavoidable. This leads me to believe that Siri itself forces the shutdown after those 10 seconds, a setting I have not figured out how to alter. If anyone has ideas or suggestions on how to resolve this problem, I would greatly appreciate it if you shared them in the comments.

OpenAI API

I was exploring how to develop this assistant on top of the system I built with Shortcuts. Initially, I integrated the OpenAI API for generating chat responses. However, I then discovered the Assistants API (currently in beta), which supports multiple tools simultaneously, including Code Interpreter, Retrieval, and Function Calling. This is crucial because it means I can define functions that execute specific actions on my Mac, all triggered by the assistant. Another significant feature is persistent threads, which maintain a message history by simply appending messages. This capability is essential for my primary goal of enabling Siri to ask users questions at the end of a conversation, similar to ChatGPT, by providing a memory system, so I wanted to test it with the gpt-4-turbo-preview model.

Well… there are some problems that I found using it:

  • No support for streaming output.
  • No temperature parameter to control randomness.
  • No response_format parameter; I needed JSON output, but the assistant does not support it.
  • The response is appended automatically, so you have no control when your output is wrong or contains incorrect information, which is then stored in the thread history.
  • At times, the communication thread may become compromised: if you receive a response that is not valid JSON, or a message that doesn’t follow the instructions, the thread is deemed unusable and subsequent messages will receive incorrect answers. The only solution is to create a new thread.
  • If a function raises an API failure and the function output is not sent to the Run object, the next message cannot be appended, because you cannot add a new message while the Run object awaits the function’s response. So you may have to cancel the run and append the messages again.
  • A thread can grow as large as the model’s context length. If you don’t build a system to control the number of messages or how long a thread stays active, you will spend a lot of money on long threads (I will explain my strategy and results in the Costs section).
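Because a single invalid-JSON reply can leave a thread unusable, as described above, it helps to validate every response and flag the thread for replacement when parsing fails. A minimal sketch (the required keys here assume the JSON schema from the prompt; the actual thread reset would call the real API):

```python
import json

# Keys the assistant is instructed to return (assumption based on the prompt)
REQUIRED_KEYS = {"tool", "message", "clipboard_result", "conversation_closed"}


def validate_assistant_reply(raw: str):
    """Return (parsed_dict, thread_ok).

    thread_ok=False signals that the thread is compromised and a new
    one should be created before sending further messages.
    """
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None, False
    if not isinstance(parsed, dict) or not REQUIRED_KEYS.issubset(parsed):
        return None, False
    return parsed, True
```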

I will explain the system. It’s quite straightforward: a main assistant utilizes helper functions to execute actions on my Mac.

There are three main functions. The first one, open_pycharm_projects, has a very descriptive name: it simply opens PyCharm projects. If the project name does not have an exact match, the system will return the three top nearest matches, and the assistant will ask you which PyCharm project to open. The second function, search_web, is capable of searching Google, YouTube, and YouTube Music, and opens a new window in your browser with the search results. Moreover, you can open multiple windows with multiple searches. The third function is related to Spotify music; it searches for specific songs or plays a random song by an artist if you wish. Another feature used is the clipboard output. This allows you to modify text stored in your clipboard and transform it into whatever you desire, such as translating, enhancing, correcting spelling, or even generating code. To see how it works, let's look at the prompt.

YOUR MAIN TASK: As a Business Assistant with extensive experience, you aim to engage in conversation with the user as naturally and humanly as possible, fostering trust to effectively achieve your objectives. Remember to pose questions to keep the conversation flowing when it is not yet concluded. YOUR OUTPUT MUST BE A JSON ALWAYS

RULES:
- WHEN THE CLIPBOARD IS USED, DON'T USE FUNCTIONS SUCH AS search_web; just respond yourself with your knowledge
- Use the functions whenever you think it is necessary, but the output must be a JSON.
- Respond only once each time the user writes to you

JSON OUTPUT:
The output MUST contain the following keys:
{
  "tool": "conversational_agent",
  "message": "ONLY the conversational response message. When the clipboard is used, ALWAYS respond 'the [action] is in your clipboard now'. DON'T SHOW THE CLIPBOARD RESULT HERE; THE CLIPBOARD COULD BE CODE OR EXTENSIVE RESPONSES. HERE IS ONLY THE CONVERSATIONAL MESSAGE THAT THE USER WILL HEAR",
  "clipboard_result": "WHEN THE CLIPBOARD IS UTILIZED, THE RESULT SHOULD APPEAR HERE. You can take the clipboard content 'CLIPBOARD TEXT' and translate it, modify it, transform it, or even generate code based on the clipboard contents. IF THE USER REQUESTS CODE GENERATION OR ANY TEXT THAT IS AN OUTPUT, IT SHOULD BE DISPLAYED HERE ALWAYS. For example, when a user wants to create code and copy it to the clipboard, or when a user wants to translate the clipboard content into another language, only the value resides here.",
  "conversation_closed": "True"/"False" (must be a string). It's crucial to recognize when the conversation concludes. For instance, if the user utters any of the following: "bye," "goodbye," "I don't need more help," "I don't want to continue," "to leave," "to finish," "to end," or expresses disinterest in further conversation, the interaction is ALWAYS terminated. DON'T ASK ANY QUESTION OR ASK IF THE USER NEEDS MORE HELP.
}

THE ONLY VALID RESPONSE IS JSON

REMEMBER: Your output MUST BE A JSON. THIS is crucial for the correct fulfillment of your duties as a business assistant. DON'T CLOSE THE CONVERSATION UNTIL THE USER NO LONGER NEEDS HELP.

As you can see, ‘clipboard_result’ is an output parameter in the JSON created by the assistant. If the assistant detects that the user wants to utilize the clipboard, this parameter is used to identify the generated text and return it to the clipboard. Then, the user can simply paste the result. Here is the function definition:

open_pycharm_projects

{
  "name": "open_pycharm_projects",
  "description": "This tool is useful for opening PyCharm projects; if the user wants to open a project, you have to use this function ALWAYS.",
  "parameters": {
    "type": "object",
    "properties": {
      "pycharm_project": {
        "type": "string",
        "description": "The name of the PyCharm project to open."
      }
    },
    "required": [
      "pycharm_project"
    ]
  }
}

search_web

Search Web
{
  "name": "search_web",
  "description": "This function searches the web according to what the user needs; you can search on Google, YouTube, or YouTube Music.",
  "parameters": {
    "type": "object",
    "properties": {
      "url": {
        "type": "string",
        "enum": [
          "https://www.google.com/search?q={query}",
          "https://www.youtube.com/results?search_query={query}",
          "https://music.youtube.com/search?q={query}"
        ],
        "description": "Url that the user wants to search with. YOU MUST IDENTIFY WHICH URL TO USE FOR EACH SEARCH and the query to search."
      }
    },
    "required": [
      "url"
    ]
  }
}

play_spotify_music

Spotify
{
  "name": "play_spotify_music",
  "description": "This function is useful to search for songs on Spotify in general.",
  "parameters": {
    "type": "object",
    "properties": {
      "spotify_search": {
        "type": "string",
        "description": "Use the following search format: if the user wants to search for a song use 'track: [song name]', if the user wants to search for an artist use 'artist: [artist name]', if the user wants to search for an artist and a song specifically, use 'artist: [artist name] track: [song name]'. USE THIS FORMAT ALWAYS; use your knowledge of music to determine the closest search to what the user wants using the required format."
      },
      "artist_search": {
        "type": "string",
        "description": "ALWAYS PUT THE artist or band name here. If the user does not specify an artist or band name, use an empty string ('') as the default value for the search."
      },
      "song_search": {
        "type": "string",
        "description": "ALWAYS PUT THE song name here. If the user does not specify a song name, use an empty string ('') as the default value for the search."
      },
      "search_specific": {
        "type": "boolean",
        "description": "true/false (Boolean). Use your knowledge of music to determine whether the user wants a specific song. For example, 'Play on Spotify flashing lights' means a specific song, so it's True; 'play something of Metallica on Spotify' is a general search, so the parameter is False."
      }
    },
    "required": [
      "spotify_search",
      "artist_search",
      "song_search",
      "search_specific"
    ]
  }
}

Clipboard

Service

The service was created using FastAPI. It uses a database to store the thread ID and the Spotify token. This helps avoid repeated user authorization, because the Spotify token expires after a certain period.

The functions implemented are straightforward. The first function I created enables the opening of PyCharm projects. It searches through your files (within specified paths) to find the closest match. If an exact match isn’t found, the system presents the top 3 nearest matches and asks the user which project they would like to open.

async def open_pycharm_projects(func_params: FunctionPayload, **kwargs: dict) -> FunctionResult:
    try:
        project_paths = secrets['pycharm_directories']
        pycharm_project = func_params.function_params.get('pycharm_project', None)

        if pycharm_project is None:
            raise Exception('The pycharm_project is required')

        all_folders = {}
        for project in project_paths:
            for name in os.listdir(project):
                if os.path.isdir(os.path.join(project, name)) and not re.match(r'^\..+', name):
                    if name in all_folders:
                        all_folders[name].append(os.path.join(project, name))
                    else:
                        all_folders[name] = [os.path.join(project, name)]

        matches = find_top_project_matches(pycharm_project, all_folders)

        if len(list(matches.keys())) == 1:
            message = f'Opening pycharm project {list(matches.keys())[0]} successfully.'

            os.system(f'pycharm {list(matches.values())[0][0]}')

        elif len(list(matches.keys())) > 1:
            message = f'Found multiple projects for {pycharm_project}, please be more specific. Here are the nearest' \
                      f' matches: {list(matches.keys())}'
        else:
            message = f'No project found for {pycharm_project}'

        return FunctionResult(
            function_id=func_params.function_id,
            output={'message': message},
            metadata={},
            traceback=None
        )
    except Exception as e:
        # client.beta.threads.runs.cancel(
        #     thread_id=func_params.thread_id,
        #     run_id=func_params.run_id
        # )
        logging.error(f"Error in open_pycharm_projects: {e}, the run is closed")
        return FunctionResult(
            function_id=func_params.function_id,
            output={'message': 'Error in open_pycharm_projects'},
            metadata={},
            traceback=str(e)
        )

Search system:

def find_top_project_matches(input_str: str, folders_info: dict, top=3) -> dict:
    # Exact match wins immediately
    if input_str in folders_info:
        return {input_str: folders_info[input_str]}

    close_names = difflib.get_close_matches(input_str.lower(), folders_info.keys(), n=top, cutoff=0.5)

    closest_matches = {}
    for name in close_names:
        closest_matches[name] = folders_info[name]

    return closest_matches

When an error occurs, the system does not trigger an exception; rather, it sends a notification to GPT-4, indicating that an error was encountered while attempting to open the project. As a result, the user is informed with a message stating, ‘Sorry, I can’t open the project. Can I help you with something else?’ This approach allows the conversation to proceed smoothly. Furthermore, as I mentioned, the Run Object prevents the addition of new messages if there is no response from the functions for any reason. This limitation arises because the Run Object needs a Function Response Object for operation; without it, the API request will fail. Therefore, to enhance the system’s reliability and prevent potential failures, you can uncomment the code.

⚠️ Sending a message works flawlessly for everyday use. However, for development purposes, I recommend uncommenting the code.

The ‘search web’ function simply opens the URL that is passed as an argument from GPT.

"https://www.google.com/search?q={query}",
"https://www.youtube.com/results?search_query={query}",
"https://music.youtube.com/search?q={query}"

There is a query parameter that GPT automatically substitutes with your search request. In scenarios involving multiple searches, for instance, one search on Google and another on YouTube, the Run Object initiates multiple requests to the ‘search web’ function.
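For illustration only, the substitution GPT performs is roughly equivalent to URL-encoding the search text into the template (this helper is hypothetical, not part of the service):

```python
from urllib.parse import quote_plus

# One of the URL templates from the function definition
GOOGLE = "https://www.google.com/search?q={query}"


def build_search_url(template: str, search_text: str) -> str:
    # URL-encode the user's words before inserting them into the template
    return template.format(query=quote_plus(search_text))
```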

async def search_web(func_params: FunctionPayload, **kwargs: dict) -> FunctionResult:
    try:
        url = func_params.function_params.get('url', None)

        webbrowser.open(url, new=1, autoraise=True)

        return FunctionResult(
            function_id=func_params.function_id,
            output={'message': 'Successfully opened the browser'},
            metadata={},
            traceback=None
        )
    except Exception as e:
        # client.beta.threads.runs.cancel(
        #     thread_id=func_params.thread_id,
        #     run_id=func_params.run_id
        # )
        logging.error(f"Error in search_web: {e}, the run is closed")
        return FunctionResult(
            function_id=func_params.function_id,
            output={'message': 'Failed to open the browser'},
            metadata={},
            traceback=str(e)
        )

Finally, the ‘Spotify’ function searches only for tracks on Spotify. The system returns tracks based on their popularity. If the ‘artist_search’ parameter is passed from GPT, it will also be used to find the closest matching song when ‘search_specific’ is activated. For example, if you say, “Open ‘Uprising’ by Muse on Spotify,” this activates the specific search. However, if you say, “Play something by Muse,” the specific search system is not activated, and you will receive a random song from that artist.

def spotify_search(client: spotipy.Spotify, query: str, limit: int = 10):
    search_result = client.search(query, limit=limit, type='track')
    return search_result


def search_specific_song(
    search_result: dict,
    song_search: str,
    func_params: FunctionPayload,
    artist_search: str
) -> FunctionResult | dict:
    searched_list = []

    for type in search_result.keys():
        info_type = search_result[type]['items']

        if artist_search != '':
            filter_info = list(
                filter(lambda x:
                       SequenceMatcher(None, artist_search, x['album']['artists'][0]['name']).ratio() >= 0.6
                       and SequenceMatcher(None, song_search, x['name']).ratio() >= 0.6,
                       info_type
                       )
            )
        else:
            filter_info = list(
                filter(lambda x:
                       SequenceMatcher(None, song_search, x['name']).ratio() >= 0.6,
                       info_type
                       )
            )

        searched_list.extend(filter_info)

    if len(searched_list) == 0:
        return FunctionResult(
            function_id=func_params.function_id,
            output={'message': 'No search results found for the query'},
            metadata={},
            traceback=None
        )

    ordered_list = sorted(searched_list, key=lambda x: x.get('popularity', 0), reverse=True)[0]

    return ordered_list


async def play_spotify_music(func_params: FunctionPayload, **kwargs: dict) -> FunctionResult:
    try:
        sp_client = await get_spotify_client()
        if sp_client is None:
            return FunctionResult(
                function_id=func_params.function_id,
                output={'message': 'The spotify client is not available, '
                                   'you need to login first or get the credentials '
                                   'configured in the development.yaml file'},
                metadata={},
                traceback=None
            )

        sp_search = func_params.function_params.get('spotify_search', '')
        artist_search = func_params.function_params.get('artist_search', '')
        song_search = func_params.function_params.get('song_search', '')

        search_result = spotify_search(sp_client, sp_search)

        if func_params.function_params.get('search_specific') is False:
            # could filter the search result to get the best match (?)
            random_number = random.randint(0, len(search_result["tracks"]["items"]) - 1)
            web.open(search_result["tracks"]["items"][random_number]["uri"])

            if sp_client.current_playback() is not None:
                if sp_client.current_playback()['is_playing']:
                    # if you have spotify premium you can use this
                    # sp_client.start_playback(search_result["tracks"]["items"][random_number]["uri"])
                    sleep(1)
                    keyboard.press_and_release("enter")

        else:
            ordered_list = search_specific_song(
                search_result=search_result,
                song_search=song_search,
                artist_search=artist_search,
                func_params=func_params
            )

            if isinstance(ordered_list, FunctionResult):
                return ordered_list

            web.open(ordered_list["uri"])

            if sp_client.current_playback() is not None:
                if sp_client.current_playback()['is_playing']:
                    # if you have spotify premium you can use this
                    # sp_client.start_playback(search_result["tracks"]["items"][random_number]["uri"])
                    sleep(1)
                    keyboard.press_and_release("enter")

        return FunctionResult(
            function_id=func_params.function_id,
            output={'message': 'Successfully opened Spotify'},
            metadata={},
            traceback=None
        )
    except Exception as e:
        # client.beta.threads.runs.cancel(
        #     thread_id=func_params.thread_id,
        #     run_id=func_params.run_id
        # )
        logging.error(f"Error in play_music: {e}, the run is closed")
        return FunctionResult(
            function_id=func_params.function_id,
            output={'message': 'Failed to open Spotify'},
            metadata={},
            traceback=str(e)
        )

Example log of the service:

{
  "name": "root",
  "message": "",
  "role": "assistant",
  "content": {
    "tool": "conversational_agent",
    "message": "I'm just a digital assistant, so I don't have feelings, but thanks for asking!",
    "clipboard_result": "",
    "conversation_closed": "False",
    "question": "How can I assist you today?"
  },
  "created_at": "2024-03-30 19:22:07",
  "run_id": "gg",
  "thread_id": "gg",
  "timestamp": "2024-03-30T19:22:09.450830+00:00",
  "status": "INFO"
}

As you can see, the question is passed to the Shortcut and is asked at the end of the conversation. So, in this example, the first response is the message, ‘I’m just a digital assistant, so I don’t have feelings, but thanks for asking!’ After that, the question ‘How can I assist you today?’ is asked to the user to continue the conversation. Also, you’ll notice the ‘clipboard_result’ parameter; in this case, I don’t want to do anything with this, so the parameter is left empty.

The Big problem

If you have noticed, the service executes code directly on the system. This might lead you to ask: how is it possible, then, to deploy it in a container? This is a big issue with this design: the system does not externalize the execution of system actions, it manages them internally. As a result, it cannot be encapsulated in a container directly. However, it is possible to containerize the basic functions related to conversation and clipboard text transformation, since these do not require system-level execution.

To use the service with Docker, you must uncomment the app service definition in the Docker configuration file. Furthermore, it’s essential to replace the host value in the configuration with the name of the database service, in this case ‘siri_assistant_database’. This change is critical because Docker uses the service names defined in the docker-compose.yml file to facilitate communication between containers. By specifying 'siri_assistant_database' as the host, we are instructing our application to communicate with the database container over Docker's internal network.

database:
  driver:
    host: siri_assistant_database
    port: 5432
    database: siri_assistant
    user: gg
    password: 1234
version: '3'

services:
#  app:
#    container_name: siri_assistant_app
#    build:
#      context: .
#      dockerfile: Dockerfile
#    volumes:
#      - .:/app
#    ports:
#      - '8080:8080'
#    depends_on:
#      - siri_assistant_database
  siri_assistant_database:
    container_name: siri_assistant_db
    mem_limit: 100m
    cpuset: "0"
    image: arm64v8/postgres:15
    environment:
      - POSTGRES_DB=siri_assistant
      - POSTGRES_PASSWORD=1234
      - POSTGRES_USER=gg
    ports:
      - "5432:5432"
    volumes:
      - siri_assistant:/var/lib/postgresql/data
      - ./db_init:/docker-entrypoint-initdb.d
    restart: "no"

volumes:
  siri_assistant:

Costs

As you’ve read, I utilized the gpt-4-turbo-preview model with the Assistant API. At the time of this writing, OpenAI does not charge for creating, updating, or deleting assistants, threads, or executions (runs), so the only incurred charge is for the usage of the model. As noted, if a thread is lengthy, the cost is significantly higher. To attempt to reduce this cost, I employed a strategy of time expiration, where threads live for 10 minutes. If a user makes a new request to the service after this period, a new thread is created.
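The expiration check itself is simple. A sketch of the idea (the real service stores the thread ID and its creation time in the database; the names here are assumptions):

```python
from datetime import datetime, timedelta, timezone

THREAD_TTL = timedelta(minutes=10)  # threads live for 10 minutes


def get_active_thread(stored, now, create_thread):
    """Reuse the stored thread if it is younger than THREAD_TTL,
    otherwise create a fresh one, keeping long-thread costs bounded.

    `stored` is a dict like {"thread_id": ..., "created_at": ...} or None;
    `create_thread` stands in for the real API call.
    """
    if stored is not None and now - stored["created_at"] < THREAD_TTL:
        return stored
    return {"thread_id": create_thread(), "created_at": now}
```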

February 2024

This month, I started the project by conducting some tests with the Assistant API without the time expiration system. As you can see, on February 14th, the costs rose very quickly, and these were just tests. Therefore, I implemented the time expiration system after February 15th, setting the time expiration to 60 minutes. This significantly reduced the costs. However, on the last day of the month, I conducted many tests with Siri, and the price returned to a new peak.

This pattern suggests that cost control through the time expiration system is effective, as there is a general decrease in expenses following its implementation. Nonetheless, intensive testing phases, such as the one conducted at the end of the month, can result in considerable spikes in spending.

March 2024

In this month, I did not work extensively on the system, conducting only sporadic development. However, on March 14th, after finalizing the system, I performed some tests. I realized that the 60-minute time expiration setting was too long, so I reduced it to 10 minutes. The following day, I conducted a similar number of tests, and the cost was significantly lower.

This data indicates that reducing the time expiration setting from 60 minutes to 10 minutes had a positive impact on cost efficiency. Including a limit on the number of messages as part of the strategy could lead to further improvements.

Improvements and conclusions

  • Implementing a cost control system by combining timeout strategies and a limit on the number of messages per conversation can be effective, suggesting that efficiency could be further increased by exploring more economical OpenAI models, such as gpt-3.5-turbo.
  • Reducing latency between the assistant and Siri remains a crucial challenge to improve the user experience. Experimenting with faster OpenAI models, or even running models locally through LM Studio, are promising avenues to address this problem. However, not being able to adjust Siri’s timeout directly limits the ability to create more complex interactions.
  • The possibility of dockerizing the system opens up new opportunities to optimize the architecture and workflow. One viable strategy could be to minimize the use of functions in favor of telling the user which tool to use, allowing Siri Shortcuts to recognize and execute the corresponding flow without needing to return the response to OpenAI in the same request. This could reduce response times and also offer greater flexibility with Shortcuts, since community-developed Shortcuts could be used. This modular and decentralized approach could be a significant improvement on the current version.

👁‍🗨 Feel free to leave a comment or ask any questions, and I will try to respond as quickly as possible. I hope that you find this article helpful in finding a solution for your company or team. Thank you and see you soon!

If you like my articles and want to see my posts, follow me on:

- Medium
- LinkedIn
