<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Kamil Tustanowski on Medium]]></title>
        <description><![CDATA[Stories by Kamil Tustanowski on Medium]]></description>
        <link>https://medium.com/@kamil.tustanowski?source=rss-2c0314596f19------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/2*FjUyzZJvyo0P5vYya6C4Og.jpeg</url>
            <title>Stories by Kamil Tustanowski on Medium</title>
            <link>https://medium.com/@kamil.tustanowski?source=rss-2c0314596f19------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Fri, 15 May 2026 15:52:24 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@kamil.tustanowski/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[AI Agents 101: From Concept to Code (No Frameworks Required)]]></title>
            <link>https://medium.com/@kamil.tustanowski/ai-agents-101-from-concept-to-code-no-frameworks-required-2dfdaf66b6c1?source=rss-2c0314596f19------2</link>
            <guid isPermaLink="false">https://medium.com/p/2dfdaf66b6c1</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[agentic-ai]]></category>
            <category><![CDATA[ai-agent]]></category>
            <category><![CDATA[ai-agent-development]]></category>
            <category><![CDATA[python-programming]]></category>
            <dc:creator><![CDATA[Kamil Tustanowski]]></dc:creator>
            <pubDate>Sun, 08 Feb 2026 15:05:17 GMT</pubDate>
            <atom:updated>2026-02-09T07:24:26.906Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*UMX5ndKApMZYKt3c2Vh3sw.png" /></figure><p>Artificial Intelligence is evolving beyond simple chatbots. Today, we’re witnessing the rise of <strong>AI agents</strong> — autonomous systems that can perceive their environment, make decisions, and take actions to achieve specific goals. Unlike traditional Large Language Models that simply respond to prompts, agents can use tools, learn from interactions, and accomplish complex tasks independently.</p><p>In this post, I’ll explore <strong>what AI agents really are</strong> and show you a super-simple but ready-to-use practical example. But before we get to all of this good stuff:</p><h3>How to Read This Article</h3><p>This is a <strong>hands-on, from-first-principles guide</strong> to AI agents. It’s intentionally detailed, so you don’t need to read it all in one go.</p><p>Choose your path:</p><p><strong>⏱️ Short on time (10–15 minutes)</strong><br>Read:</p><ul><li><em>What is an AI Agent?</em></li><li><em>Why AI Agents Matter</em></li><li><em>Defining the Agent’s Rules (The System Prompt)</em></li><li><em>💡 Pro Insight: The ReAct Pattern</em></li><li><em>The Big Picture</em></li></ul><p>This gives you the <strong>core mental model</strong> without touching the code.</p><p><strong>🧠 Want to really understand agents (35–45 minutes)</strong><br>Read everything above, plus:</p><ul><li><em>Building a Simple Research Agent</em></li><li><em>The Agent Loop</em></li><li><em>Why This Pattern Matters</em></li></ul><p>You’ll understand <strong>how and why</strong> agentic systems work.</p><p><strong>🧑‍💻 Want to build it yourself (60–90 minutes)</strong><br>Read the article top-to-bottom and follow the code sections step by step. <br>This path is for readers who want a <strong>working mental model and a runnable agent</strong>.</p><p>You can safely skip or skim setup sections on the first pass and come back later — the article is structured so that each part builds on the previous one without hidden dependencies.</p><p>🤖 The fully functioning agent code is included at the bottom of the article.</p><h3>What is an AI Agent?</h3><p>An AI agent is a software entity that:</p><p>- <strong>Observes</strong> the current state through inputs and data<br>- <strong>Reasons</strong> about available actions and potential outcomes<br>- <strong>Decides</strong> which actions to take based on its goals<br>- <strong>Acts</strong> by executing those decisions, often using external tools<br>- <strong>Learns</strong> and adapts from the results</p><p>Think of an AI agent as a smart assistant that doesn’t just answer questions — it actively solves problems by using available resources.</p><h3>Why AI Agents Matter</h3><p>Traditional AI models are reactive. You ask, they answer. But AI agents are <strong>proactive and adaptive</strong>. They can:</p><p>✅ Complete multi-step tasks autonomously<br>✅ Adapt strategies based on feedback<br>✅ Use specialized tools to extend capabilities<br>✅ Handle complex, real-world scenarios</p><p>This shift represents a fundamental change in how we build intelligent systems.</p><h4>The Example</h4><p>More times than I can remember, I have spent hours searching the web while exploring interesting topics. I was looking for reliable sources: websites, blog posts, Reddit threads, etc., trying to select the best ones. Sometimes I found pure gold, and sometimes I just wasted time in rabbit holes. 
Back then it wasn’t possible to automate this process in a reliable way. For sure not end-to-end.</p><p>This changed in the age of AI. But let’s start from the beginning.</p><p>A brain alone can’t do much without the help of the senses. Let’s run a simple thought experiment and give this brain:</p><ul><li><strong>Legs</strong> — now it can move (note: it can also fall onto the ground)</li><li><strong>Hands</strong> — now it can grab food, rocks, defend itself</li><li><strong>Eyes </strong>— it stops falling into pits and bumping into obstacles, it’s easier to avoid danger, additionally it’s easier to find food and rocks for hands</li><li><strong>Ears</strong> — detecting danger becomes much easier</li><li><strong>Memory</strong> — knowing where to find food, which animals are dangerous and which can be turned into food, knowing where home is, no more Groundhog Day resets.</li></ul><p>This is a more-or-less (for sure less than more) functioning human. Now let’s consider an AI agent:</p><p>An LLM is a brain, but a brain alone can’t do much without the help of the senses. Let’s (again) run a simple thought experiment and give this brain:</p><ul><li><strong>Tools (hands)</strong> — now it can actually <em>do</em> things: call APIs, update databases, write files, send messages, trigger workflows. Without tools, it can only talk about actions.</li><li><strong>Execution environment (legs)</strong> — now it can move work forward: run code, schedule tasks, retry failures, loop, and adapt. It can also fail, get stuck, or crash if not guided properly.</li><li><strong>Observability (eyes)</strong> — logs, metrics, feedback from systems. The agent stops blindly retrying broken actions, notices errors, sees results, and adjusts instead of bumping into the same wall.</li><li><strong>Event signals (ears)</strong> — webhooks, alerts, user input, system events. The agent no longer waits to be prompted; it can react to changes, detect anomalies, and respond to danger early.</li><li><strong>Memory (short + long term)</strong> — context from the current task plus knowledge of past interactions. It remembers what worked, what failed, user preferences, and where “home” is — no more Groundhog Day resets on every prompt.</li></ul><h3>The Analysis</h3><p>We have all the needed bootstrapping, and the time has come to build the agent I foreshadowed at the beginning of this article: an agent that can research any requested topic and store any interesting findings on disk. Simple, yet useful and powerful if you need a versatile investigator at your service.</p><p>First, let’s take two steps back and think about what we actually want to achieve.</p><p>The Goal — I want an assistant that will browse the web researching topics I choose. It will do this autonomously, and the findings will be stored in files.</p><p>Knowing the goal helps with figuring out what we actually need.</p><p>There is no framework or service prescribed here. We will proceed without any of that help, to get a better understanding of the process and of what this enigmatic Agent really is.</p><p>With this out of the way, let’s think about what we would need to do to complete this research task ourselves. Imagine our boss says to us “Please find interesting news about dogs and save them on disk so I can read them”. 
How should we tackle this task?</p><ol><li>Think on how to achieve this task.</li><li>Open web browser and find current news about dogs.</li><li>Visit provided webpages.</li><li>Decide which are interesting and fetch details.</li><li>Save the found results onto disk.</li></ol><h3>Building a Simple Research Agent (No Frameworks, Just Python)</h3><p>There’s a lot of noise around “AI agents” right now. Frameworks, diagrams, buzzwords, promises of autonomy.</p><p>⚠️ But if you strip all of that away, an agent is actually a very simple thing:</p><blockquote><strong><em>A loop where a language model reasons, decides what to do next, and uses tools it cannot access directly.</em></strong></blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*uTD_QEJn1WxJS2znW7msbA.png" /></figure><p>Now we’ll build from scratch a <strong>small but complete research agent</strong> that can:</p><ul><li>reason about a task</li><li>decide when it needs external information</li><li>search the web</li><li>visit webpages</li><li>save results to disk</li><li>stop when the job is done</li></ul><p>Using Python, LLM, and a clean control loop.</p><h3>Why This Approach?</h3><p>Before jumping into code, it’s important to understand the philosophy here.</p><p>LLMs are extremely good at:</p><ul><li>reasoning</li><li>planning</li><li>summarizing</li><li>deciding <em>what</em> should be done next</li></ul><p>But they can’t:</p><ul><li>browse the web</li><li>write files</li><li>execute code</li><li>interact with the real world</li></ul><p>So instead of pretending the model is autonomous, we treat it like what it really is:</p><blockquote><strong><em>A brain that needs hands, eyes, and memory provided by us.</em></strong></blockquote><p>That’s what tools are.</p><h3>Tooling &amp; Project Setup</h3><p>I’m using uv as the dependency manager and virtual environment tool.</p><p>Why uv?</p><ul><li>extremely fast</li><li>replaces pip, pip-tools, virtualenv</li><li>deterministic installs</li><li>it’s rapidly becoming the new <em>gold standard</em> for Python project and package management</li></ul><h3>Install uv</h3><pre>curl -LsSf https://astral.sh/uv/install.sh | sh</pre><p>Initialize the Project</p><pre>uv init search-agent<br>cd search-agent</pre><p>This creates a new Python project named search-agent with a <strong>fresh virtual environment</strong> and basic setup files.</p><p>Currently the most interesting place in this directory is <strong>pyproject.toml</strong>:</p><pre>[project]<br>name = &quot;search-agent&quot;<br>version = &quot;0.1.0&quot;<br>description = &quot;Add your description here&quot;<br>readme = &quot;README.md&quot;<br>requires-python = &quot;&gt;=3.10&quot;<br>dependencies = []</pre><p>You can update the description. But I will leave it as is and go to the next step.</p><h4>Create a virtual environment</h4><pre>uv venv<br>source .venv/bin/activate</pre><p>A virtual environment keeps each Python project isolated, avoiding version conflicts and making the setup reproducible — so projects stay clean, independent, and predictable.</p><h4>Add Dependencies</h4><p>When we are safely in our own private environment, it’s time to add all the needed dependencies:</p><pre>uv add openai python-dotenv ddgs requests bs4</pre><p>This gives us:</p><ul><li>openai — LLM access</li><li>python-dotenv — environment variables</li><li>ddgs — Dux Distributed Global Search</li><li>requests — HTTP requests</li><li>bs4 — HTML/XML parser</li></ul><p>Don’t worry about this just yet. 
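</p><p>If you want to quickly confirm that everything installed correctly, a tiny throwaway check (purely optional, not part of the agent) could look like this:</p><pre># optional sanity check: all of these imports should succeed inside the activated .venv<br>import openai, dotenv, requests, bs4<br>from ddgs import DDGS<br><br>print(&quot;all dependencies are importable&quot;)</pre><p>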
We are doing this to have it out of the way so we won’t have to revisit this later.</p><p>Your project structure should look like this:</p><pre>search-agent/<br>├── main.py<br>├── pyproject.toml<br>└── .venv/</pre><p>If you open <strong>pyproject.toml</strong> now:</p><pre>[project]<br>name = &quot;search-agent&quot;<br>version = &quot;0.1.0&quot;<br>description = &quot;Add your description here&quot;<br>readme = &quot;README.md&quot;<br>requires-python = &quot;&gt;=3.10&quot;<br>dependencies = [<br>    &quot;bs4&gt;=0.0.2&quot;,<br>    &quot;ddgs&gt;=9.10.0&quot;,<br>    &quot;openai&gt;=1.107.3&quot;,<br>    &quot;python-dotenv&gt;=1.1.1&quot;,<br>    &quot;requests&gt;=2.32.5&quot;,<br>]</pre><p>You will see it’s listing all the dependencies needed for this project.</p><p>Additionally</p><ul><li>.venv – This is your project’s <strong>virtual environment</strong>. It isolates your Python packages from the system Python, preventing version conflicts between projects. Activate it to run project-specific dependencies.</li><li>pyproject.toml – The <strong>project configuration</strong> file. It defines metadata, dependencies, build tools, and scripts, serving as the modern replacement for setup.py and sometimes requirements.txt. uv uses it to manage your project’s packages and setup.</li></ul><h3>Environment Configuration</h3><p>One crucial piece is still missing. Where should we store all the credentials, keys, etc. that we need? The answer is the .env file.</p><p>A .env file is used to <strong>store configuration and secret information</strong> for your project outside of your code. This includes things like API keys, passwords, or other settings that might change depending on the environment (your computer, a server, or a teammate’s machine). By keeping this information in a .env file, your code stays clean, secure, and easy to share.</p><p>For this example we need to provide OPENAI_API_KEY to OpenAI.</p><p>🤝 To make it work you have to create you own .env and add your OPENAI_API_KEY. How to get key → <a href="https://help.openai.com/en/articles/4936850-where-do-i-find-my-openai-api-key">https://help.openai.com/en/articles/4936850-where-do-i-find-my-openai-api-key</a>, please let me know if you need any help!</p><pre>OPENAI_API_KEY=sk-...</pre><p>When all of this is done, we just use <strong>load_dotenv()</strong> to load all of the credentials listed in <strong>.env</strong> file.</p><pre>import os<br>from dotenv import load_dotenv</pre><pre>load_dotenv()</pre><pre>if not os.getenv(&quot;OPENAI_API_KEY&quot;):<br>    raise RuntimeError(&quot;OPENAI_API_KEY not found in environment variables&quot;)</pre><p>Fail fast. Silent failures here are painful to debug later.</p><pre>print(os.getenv(&quot;OPENAI_API_KEY&quot;))</pre><p>Make sure that if you add this line you are seeing the key you provided:</p><pre>sk-...</pre><p>If the answer is yes you are set to go 💪!</p><p><strong>⚠️ Why you shouldn’t commit your </strong><strong>.env file</strong><br>The .env file usually contains <strong>sensitive information</strong>, like API keys, passwords, or tokens. If you commit it to a public repository, anyone could see and misuse these secrets. Instead, keep .env local and add it to .gitignore so it stays private, while still allowing your code to read the variables safely.</p><h3>Initializing the OpenAI Client</h3><pre>from openai import OpenAI<br>client = OpenAI()</pre><p>Nothing fancy here — just initializing the official OpenAI client. 
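</p><p>If you ever need to point the client at a key stored under a different name, you can also pass it explicitly; this is just an equivalent, more verbose variant and is not required here:</p><pre>import os<br>from openai import OpenAI<br><br># equivalent to OpenAI(): the key still comes from the environment loaded by load_dotenv()<br>client = OpenAI(api_key=os.getenv(&quot;OPENAI_API_KEY&quot;))</pre><p>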
You don’t have to pass the OPENAI_API_KEY explicitly, because the client reads it from the OPENAI_API_KEY environment variable, which <strong>load_dotenv()</strong> has already populated.</p><h3>The LLM Helper Function</h3><p>Now we are getting serious. This function is our single gateway to the model. I’m using gpt-4o-mini because it is enough for our purpose. Also — it’s cheap. If needed, you can use any of the other models.</p><pre>def call_llm(messages: list[dict]) -&gt; str:<br>    response = client.chat.completions.create(<br>        model=&quot;gpt-4o-mini&quot;,<br>        messages=messages,<br>        temperature=0<br>    )<br>    return response.choices[0].message.content</pre><h4>Why temperature = 0?</h4><p>Temperature controls how creative or random an AI’s responses are: low makes it predictable and consistent, high makes it more varied and surprising.</p><p>Agents should be:</p><ul><li>predictable</li><li>repeatable</li><li>boring (in a good way)</li></ul><p>Creativity is great for writing stories.<br>Agents need <strong>control</strong>, not surprises.</p><h4>Why messages is a list of dicts?</h4><p>The chat API is designed to support <strong>multi-turn conversations</strong>, not just a single prompt.<br>Each element in the list represents one “message” in the conversation, and must have a <strong>role</strong> and <strong>content</strong>:</p><pre>{<br>    &quot;role&quot;: &quot;system&quot; | &quot;user&quot; | &quot;assistant&quot;,<br>    &quot;content&quot;: &quot;Text of the message&quot;<br>}</pre><p>The role field <strong>describes who is sending the message</strong>. Specifically:</p><ul><li><strong>system</strong> — Instructions for the model (“You are a helpful assistant” etc.)</li><li><strong>user</strong> — Input from the human</li><li><strong>assistant</strong> — Model responses from previous turns</li></ul><p>So the API sees the conversation as a <strong>sequence</strong>, allowing the model to reason across multiple messages.</p><p>Example:</p><pre>messages = [<br>    {&quot;role&quot;: &quot;system&quot;, &quot;content&quot;: &quot;You are a helpful assistant.&quot;},<br>    {&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: &quot;What&#39;s the capital of France?&quot;}<br>]</pre><pre>response = client.chat.completions.create(<br>    model=&quot;gpt-4o-mini&quot;,<br>    messages=messages<br>)</pre><pre>print(response.choices[0].message.content)<br># -&gt; &quot;The capital of France is Paris.&quot;</pre><p>If you wanted a follow-up question:</p><pre>messages.append({&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: &quot;And what about Germany?&quot;})</pre><p>The model <strong>remembers</strong> the previous question and answer because messages contains the whole conversation so far.</p><h4>Initial State (No conversation yet)</h4><p>The model starts with <strong>just the system prompt</strong>, if there is one. 
Example:</p><pre>[System]: You are a helpful assistant that explains things clearly.</pre><ul><li><strong>Context content:</strong> Only the system instruction.</li><li><strong>Purpose:</strong> Sets the model’s behavior and tone.</li></ul><h4>After User Message #1</h4><p>You type:</p><pre>[User]: What is the capital of France?</pre><ul><li>The model now <strong>sees both the system prompt and your first message</strong> in its context:</li></ul><pre>[System]: You are a helpful assistant that explains things clearly.<br>[User]: What is the capital of France?</pre><ul><li>The model generates a response using <strong>everything in the context</strong>.</li></ul><pre>[Assistant]: The capital of France is Paris.</pre><ul><li><strong>Context now includes the assistant’s reply</strong>:</li></ul><pre>[System]: You are a helpful assistant that explains things clearly.<br>[User]: What is the capital of France?<br>[Assistant]: The capital of France is Paris.</pre><p>✅ <strong>Observation:</strong> Both your input <em>and</em> the model’s reply are added.</p><h4>After User Message #2</h4><p>You ask a follow-up:</p><pre>[User]: And what about Germany?</pre><ul><li>The model’s <strong>context now includes all previous messages</strong>:</li></ul><pre>[System]: You are a helpful assistant that explains things clearly.<br>[User]: What is the capital of France?<br>[Assistant]: The capital of France is Paris.<br>[User]: And what about Germany?</pre><ul><li>The model generates:</li></ul><pre>[Assistant]: The capital of Germany is Berlin.</pre><ul><li><strong>Context grows</strong> to include this latest assistant reply:</li></ul><pre>[System]: You are a helpful assistant that explains things clearly.<br>[User]: What is the capital of France?<br>[Assistant]: The capital of France is Paris.<br>[User]: And what about Germany?<br>[Assistant]: The capital of Germany is Berlin.</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*z724070g5AelMSqscD7lNA.png" /></figure><h4>How Context Changes Over Time</h4><ul><li>Every new user message <strong>adds to the context</strong>.</li><li>Every new model response <strong>also adds to the context</strong>.</li><li>The context is <strong>a running transcript</strong> of the conversation that the model can see to maintain coherence.</li></ul><h4>Summary</h4><ul><li>The <strong>list</strong> represents the <strong>entire conversation history</strong>.</li><li>Each <strong>dict</strong> is a message with role + content.</li><li>This allows <strong>multi-turn reasoning</strong>, which is essential for building agents.</li></ul><h3>Multi-Turn Reasoning</h3><p>Multi-turn reasoning means that an AI doesn’t treat each prompt as an isolated question. Instead, it <strong>maintains context</strong>, <strong>tracks intermediate conclusions</strong>, and <strong>adjusts its next steps</strong> based on previous turns in the conversation.</p><p>Think of it as the difference between:</p><ul><li>Answering one question at a time and forgetting everything else</li><li>Participating in an ongoing discussion, where each response depends on what was already said and learned</li></ul><h3>Adding Tools</h3><p>Tools are add-ons that an agent can use to perform its tasks. In this example, the tools allow the agent to browse and read information on the internet, as well as store it to disk.</p><h4>Search Web Tool</h4><p>The LLM cannot browse the web on its own. We have to do it for it. 
In order to do this we use <strong>Dux Distributed Global Search</strong> <a href="https://github.com/deedy5/ddgs">https://github.com/deedy5/ddgs</a> . There are other more sophisticated and powerful services but we just need a simple web search functionality, and huge advantage of this one is — it’s simple and doesn’t require account &amp; API key. It’s perfect to use in this publication.</p><p>First we need to download the library, but since we did it already before we can just go ahead and jump to implementation. We import DDGS:</p><pre>from ddgs import DDGS</pre><p>…and create a short function:</p><pre>def search_web(query: str):<br>    results = []</pre><pre>    with DDGS() as ddgs:<br>        for r in ddgs.text(query, max_results=5):<br>            results.append({<br>                &quot;title&quot;: r[&quot;title&quot;],<br>                &quot;url&quot;: r[&quot;href&quot;],<br>                &quot;snippet&quot;: r[&quot;body&quot;],<br>            })<br>    return results</pre><p>This function lets an application query the web in real time to gather the information. It sends a search request to DDGS which collects selected set of relevant results (max <strong>5</strong> in this example). It will be stored in the context and later on passed to the page fetching tool.</p><p>It’s important to note that this tool only accepts a query and performs a web search. It simply produces a string containing the retrieved data — there is no reasoning or LLM involvement at this stage.</p><p>Example:</p><pre>🤖 LLM Reply SEARCH: &quot;most interesting events 2025&quot;<br>🛜 web_search - start<br>=============================<br>🔍 query &quot;most interesting events 2025&quot;<br>🧾 title 2025 - Wikipedia<br>🔗 url <a href="https://en.wikipedia.org/wiki/2025">https://en.wikipedia.org/wiki/2025</a><br>📋 snippet 8 hours ago - January 13 – C/2024 G3 (ATLAS), a non-periodic comet, reaches perihelion, and is dubbed The Great Comet of 2025 .<br>=============================<br>🔍 query &quot;most interesting events 2025&quot;<br>🧾 title 2025 in the United States - Wikipedia<br>🔗 url <a href="https://en.wikipedia.org/wiki/2025_in_the_United_States">https://en.wikipedia.org/wiki/2025_in_the_United_States</a><br>📋 snippet 3 days ago - Idaho Governor Brad Little formally signs a bill into law that permits the death penalty for rape and sexual abuse of children younger than 12 years of age . The bill will become effective on July 1, 2025. 
This law could ultimately challenge the precedent of Kennedy v.<br>=============================<br>🔍 query &quot;most interesting events 2025&quot;<br>🧾 title Ten Most Significant World Events in 2025 | Council on Foreign Relations<br>🔗 url <a href="https://www.cfr.org/articles/ten-most-significant-world-events-2025">https://www.cfr.org/articles/ten-most-significant-world-events-2025</a><br>📋 snippet As 2025 comes to a close, here are the ten most notable world events of the year.<br>=============================<br>🔍 query &quot;most interesting events 2025&quot;<br>🧾 title 2025: Facts &amp; Events That Happened in This Year - The Fact Site<br>🔗 url <a href="https://www.thefactsite.com/year/2025/">https://www.thefactsite.com/year/2025/</a><br>📋 snippet January 1, 2023 - 2025 was a year marked by major events, including Trump&#39;s return, the first American pope, climate crises, AI advancements, and inspiring global efforts .<br>=============================<br>🔍 query &quot;most interesting events 2025&quot;<br>🧾 title 2025: Year in Review | Pop Culture, Images, Current Events, News, &amp; Timeline | Britannica<br>🔗 url <a href="https://www.britannica.com/topic/2025-Year-in-Review">https://www.britannica.com/topic/2025-Year-in-Review</a><br>📋 snippet December 22, 2025 - A number of firsts also made headlines in 2025. Japan and Namibia inaugurated their first women heads of government —Sanae Takaichi and Netumbo Nandi-Ndaitwah, respectively. On the space front, the first successful soft landing of a commercial ...<br>🛜 web_search - finish - 5 results</pre><p>The LLM says that it want to search for phrase “<em>most interesting events 2025”</em>, we invoke the search and provide the response. Including 5 websites with potential answers.</p><h4>Fetch Webpage Tool</h4><p>When search_web<strong> </strong>tool is done searching for appropriate webpage fetch_webpage<strong> </strong>tool is used to read the contents of the webpage and get actual response.</p><pre>import requests<br>from bs4 import BeautifulSoup</pre><pre>def fetch_webpage(url):<br>    &quot;&quot;&quot;<br>    Fetch provided webpage<br>    &quot;&quot;&quot;<br>    html = requests.get(url).text<br>    soup = BeautifulSoup(html, &quot;html.parser&quot;)    <br>    return soup.get_text()</pre><p>Again, this is a super simple example. This is enough for the sake of this experiment but bear in mind that while working on production application you can and should <strong>do better</strong>.</p><p>⚠️ In a real scenario, this often dumps 5,000 or more words of navigation menus and footer links into the LLM context which is not great. I don’t want to complicate this example but please have this in mind and for production use a bit more sophisticated solution.</p><p>Example:</p><pre>🤖 LLM Reply FETCH: <a href="https://en.wikipedia.org/wiki/2025">https://en.wikipedia.org/wiki/2025</a><br>🌍 FETCH <a href="https://en.wikipedia.org/wiki/2025">https://en.wikipedia.org/wiki/2025</a></pre><p>The LLM says that it would like to use fetch tool to fetch contents of <a href="https://en.wikipedia.org/wiki/2025">https://en.wikipedia.org/wiki/2025</a> and <strong>we</strong> trigger the fetch_webpage function.</p><h4>Save-to-Disk Tool</h4><pre>def save_to_disk(text: str) -&gt; str:<br>    with open(&quot;search_results.txt&quot;, &quot;w&quot;) as f:<br>        f.write(text)</pre><p>This tool is a textbook example of the <strong>“Single Responsibility Principle”</strong> in agentic design. 
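</p><p>One detail to note: the open(&quot;search_results.txt&quot;, &quot;w&quot;) call overwrites the file on every SAVE, so if the model issues several separate SAVE commands only the last one survives. If you would rather accumulate everything the agent saves, a small variation (my tweak, not part of the original listing) is to append instead:</p><pre>def save_to_disk(text: str) -&gt; None:<br>    # &quot;a&quot; appends, so repeated SAVE commands accumulate in the file<br>    with open(&quot;search_results.txt&quot;, &quot;a&quot;) as f:<br>        f.write(text + &quot;\n\n&quot;)</pre><p>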
It performs one action: taking a text and persisting it to a local file.</p><p>Agents work best when tools do <em>one thing only</em>.</p><p>Example <strong>search_results.txt</strong>:</p><pre>&quot;In 2025, Cardinal Robert Prevost became Pope Leo XIV, marking the first time <br>a pope was from North America. Prevost, originally from Chicago, was elected <br>after the sudden death of Pope Francis. His election is significant as it <br>reflects a shift in the leadership of the Catholic Church and highlights the <br>increasing influence of North American clergy within the Vatican.&quot;</pre><pre>&quot;The United States and Israel launched a military operation against Iran&#39;s <br>nuclear facilities in June 2025, known as the Twelve-Day War. This conflict <br>was sparked by Israel&#39;s concerns over Iran&#39;s nuclear ambitions and resulted in<br>significant military engagements, including airstrikes and missile attacks. <br>The operation underscored the ongoing tensions in the Middle East and the <br>complexities of international relations involving nuclear proliferation.&quot;</pre><pre>&quot;In 2025, China demonstrated its economic power by weaponizing its dominance <br>over rare-earth minerals, crucial for various industries. Following U.S. <br>tariffs, China halted exports of these minerals to the U.S., showcasing its <br>ability to leverage economic resources in geopolitical conflicts. This event <br>marked a significant moment in the U.S.-China relationship and highlighted the<br>strategic importance of rare-earth materials in global trade.&quot;</pre><pre>The three most interesting and incredible confirmed events from 2025 include <br>the election of Cardinal Robert Prevost as Pope Leo XIV, the military <br>confrontation between the U.S., Israel, and Iran over nuclear facilities, and <br>China&#39;s strategic use of rare-earth minerals in international relations.</pre><h3>Defining the Agent’s Rules (The System Prompt)</h3><p>This is the most important part of the whole system.</p><pre>You are a research assistant.</pre><pre>If you need web data:<br>SEARCH:&lt;query&gt;<br>    <br>If you have a URL:<br>    <br>FETCH:&lt;url&gt;<br>IMPORTANT: When you receive search results with multiple URLs, you MUST try to FETCH all relevant URLs before doing another SEARCH.<br>FETCH only one URL at a time when done go to next URL.<br>Check all the URLs from the search results first. Only search again if none of<br>the URLs contain useful information. If a page can&#39;t be read or is irrelevant, <br>move on to the next URL from your search results. <br>Do not use URL with bad content again.<br>If URLs are provided, base your answer only on their content.    </pre><pre>Save useful info with:<br>SAVE:&lt;text&gt;</pre><pre>Final answer must be based only on verified sources.<br>After saving, provide the final answer based only on verified sources.<br>Do not provide final answer if answer wasn&#39;t saved.</pre><p>Every agent starts with a system prompt. This is not just a <em>“role description”</em> — it’s the agent’s <strong>operating rules</strong>.</p><p>The KEYWORD:value format isn’t arbitrary. LLMs have seen this pattern millions of times in markdown and config files. Compare to alternatives: — JSON like {“action”: “search”} — the model forgets quotes or adds unexpected fields — Function syntax like search(“dogs”) — confuses describing code vs outputting commands The colon creates a clear parsing boundary. 
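</p><p>To make that concrete, here is a minimal sketch of such a parser (an illustration of the idea; the agent loop later in the article uses a slightly simpler replace-based version):</p><pre>def payload(reply: str) -&gt; str:<br>    # keep everything after the first colon, even if extra words sneak in<br>    return reply.split(&quot;:&quot;, 1)[1].strip()<br><br>payload(&quot;SEARCH: let me look that up: dogs&quot;)<br># -&gt; &quot;let me look that up: dogs&quot;</pre><p>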
Even if the LLM adds text, it works: SEARCH: let me look that up: dogs → extracts everything after the first colon.</p><p>In our case, the system prompt defines four critical behaviors:</p><ul><li><strong>When to search</strong> for external information</li><li><strong>When to gather </strong>concrete information from found webpages</li><li><strong>When to save</strong> gathered knowledge</li><li><strong>When to stop</strong> and return a final answer</li></ul><p>This prompt <strong>forces the model to externalize its decisions</strong> in a structured way.</p><p>Instead of silently “thinking” about whether it needs more data, the model <strong>must</strong> explicitly say:</p><ul><li>SEARCH: ... , FETCH: ...→ I need more information</li><li>SAVE: ... → I’ve gathered something worth persisting</li><li>anything else → I’m done</li></ul><p>This is the core idea behind tool-using agents.</p><p>Notice what we’re doing <strong>and not doing</strong>:</p><p>❌ We do NOT describe Python functions<br>❌ We do NOT mention what we will use to search, fetch<br>❌ We do NOT expose implementation details</p><p>✅ We define <strong>communication protocols</strong></p><p>The LLM learns:</p><ul><li>how to ask for information</li><li>how to store results</li><li>when it is allowed to finish</li></ul><p>This keeps the agent <strong>flexible and model-agnostic</strong>.</p><h3>💡 Pro Insight: The ReAct Pattern</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*5wHy0GQDOVzVCceiSZOS8w.png" /></figure><p>What we are implementing here is a simplified version of the <strong>ReAct (Reason + Act)</strong> pattern. This is a foundational concept in AI engineering.</p><p>Instead of just “guessing” an answer, the model follows a cycle:</p><ol><li><strong>Thought:</strong> The LLM analyzes the task (e.g., “I need to find 2025 events”).</li><li><strong>Action:</strong> The LLM selects a tool via our protocol (e.g., SEARCH:2025 events).</li><li><strong>Observation:</strong> The LLM sees the tool’s output and updates its understanding.</li><li><strong>Repeat:</strong> The LLM looks at the results and decides its next move.</li></ol><p>By using <strong>Protocols</strong> (SEARCH:, FETCH:) instead of describing <strong>Python functions</strong>, we decouple the &quot;Brain&quot; from the &quot;Tools&quot;:</p><ul><li><strong>From:</strong> “Call the dogs_search function with a string argument.&quot; — Brittle &amp; confusing</li><li><strong>To:</strong> “If you need data, output SEARCH:&lt;query&gt;.&quot; — Clean &amp; robust</li></ul><p>This abstraction allows the agent to focus on <strong>logic</strong> while your code handles the <strong>heavy lifting</strong>.</p><h4>Why this works:</h4><ul><li><strong>Terminology:</strong> It introduces <em>ReAct</em>, which is a <em>gold star</em> keyword for anyone interviewing for AI roles or reading research papers.</li><li><strong>Visual Aid:</strong> The diagram tag helps the reader visualize the infinite loop you are about to show in the code.</li><li><strong>Clarity:</strong> It reinforces why you chose the specific syntax in your system prompt.</li></ul><h3>The Big Picture</h3><p>This is the mysterious, omnipotent <strong>loop</strong> that makes the agent “<em>tick</em>”. I will be explaining it step by step in a moment but I think it’s easier to unpack it knowing the full picture upfront.</p><pre>def research_agent(task: str) -&gt; str:<br>    &quot;&quot;&quot;<br>    A simple agent loop:</pre><pre>    1. The model reasons about the task<br>    2. 
Decides whether to use a tool<br>    3. Consumes tool output<br>    4. Stops when the goal is achieved<br>    &quot;&quot;&quot;</pre><pre>  system_prompt = &quot;&quot;&quot;<br>  ☝️ The system prompt shown above goes here...<br>  &quot;&quot;&quot;<br></pre><pre>  messages = [<br>      {&quot;role&quot;: &quot;system&quot;, &quot;content&quot;: system_prompt},<br>      {&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: task}<br>  ]</pre><pre>  while True:<br>      reply = call_llm(messages)<br>      messages.append({&quot;role&quot;: &quot;assistant&quot;, &quot;content&quot;: reply})<br>    <br>     if reply.startswith(&quot;SEARCH:&quot;):<br>          query = reply.replace(&quot;SEARCH:&quot;, &quot;&quot;).strip()<br>          search_result = search_web(query)</pre><pre>          messages.append({<br>              &quot;role&quot;: &quot;user&quot;,<br>              &quot;content&quot;: f&quot;Search result:\n{search_result}&quot;<br>           })</pre><pre>      elif reply.startswith(&quot;FETCH:&quot;):<br>          query = reply.replace(&quot;FETCH:&quot;, &quot;&quot;).strip()<br>          fetch_webpage_result = fetch_webpage(query)</pre><pre>          messages.append({<br>            &quot;role&quot;: &quot;user&quot;,<br>            &quot;content&quot;: f&quot;Fetch result:\n{fetch_webpage_result}&quot;<br>          })</pre><pre>      elif reply.startswith(&quot;SAVE:&quot;):<br>          text = reply.replace(&quot;SAVE:&quot;, &quot;&quot;).strip()<br>          save_to_disk(text)</pre><pre>          messages.append({<br>              &quot;role&quot;: &quot;user&quot;,<br>              &quot;content&quot;: f&quot;Save content:\n{reply}&quot;<br>          })</pre><pre>      else:<br>          return reply</pre><p>Now let’s go step by step and see what is going on here.</p><h3>Messages: The Agent’s Working Memory</h3><p>The messages list acts as the agent’s short-term memory.</p><p>The initial context consists of the system prompt and the task that we want agent to accomplish. It’s created right before the loop is started:</p><pre>messages = [<br>    {&quot;role&quot;: &quot;system&quot;, &quot;content&quot;: system_prompt},<br>    {&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: task}<br>]</pre><p>Later on it’s updated when actions are used:</p><ul><li>when search did finish</li></ul><pre>messages.append({<br>    &quot;role&quot;: &quot;user&quot;,<br>    &quot;content&quot;: f&quot;Search result:\n{search_result}&quot;<br>})</pre><ul><li>when fetch did finish</li></ul><pre>messages.append({<br>    &quot;role&quot;: &quot;user&quot;,<br>    &quot;content&quot;: f&quot;Fetch result:\n{fetch_webpage_result}&quot;<br>})</pre><ul><li>when findings are saved to disk</li></ul><pre>messages.append({<br>    &quot;role&quot;: &quot;user&quot;,<br>    &quot;content&quot;: f&quot;Save content:\n{reply}&quot;<br>})</pre><p>Every step — model replies, tool outputs, saved content — is appended back into this list.</p><p>This means:</p><ul><li>The model sees <strong>its own past decisions</strong></li><li>Tool outputs become <strong>part of the context</strong></li><li>Reasoning unfolds across turns</li></ul><p>This is what enables <strong>multi-step reasoning</strong> without any hidden state.</p><p>I described in detail how context works in “Why messages is a list of dicts?” section. 
Revisiting this piece will help if feeling lost.</p><h3>The Agent Loop</h3><p>At the heart of the agent is a simple loop:</p><pre>while True:<br>    reply = call_llm(messages)<br>    messages.append({&quot;role&quot;: &quot;assistant&quot;, &quot;content&quot;: reply})</pre><p>Each iteration follows the same pattern:</p><ol><li><strong>Decide</strong> — The model decides what to do next</li><li><strong>Act</strong> — A tool is invoked if requested</li><li><strong>Observe</strong> — Tool output is fed back to the model</li></ol><p>This mirrors the classic agent architecture:</p><blockquote>Think → Act → Observe → Repeat</blockquote><h4>Tool Invocation Without Magic</h4><p>Notice that there is <strong>no hardcoded logic</strong> deciding when to search, fetch or save. Instead, tool usage is driven entirely by the model’s output:</p><pre>if reply.startswith(&quot;SEARCH:&quot;):<br>    query = ...<br>    search_result = search_web(query)</pre><p>The agent does not “call tools automatically.”<br>It <strong>asks for them explicitly</strong>, using a format defined in the system prompt.</p><p>This makes the system:</p><ul><li>Transparent</li><li>Debuggable</li><li>Model-agnostic</li></ul><p>You can log every decision and see exactly <em>why</em> a tool was used.</p><h4>Feeding Reality Back Into the Model</h4><p>After a search, fetch or save operation, the result is injected back into the conversation:</p><pre>messages.append({<br>    &quot;role&quot;: &quot;user&quot;,<br>    &quot;content&quot;: f&quot;Search result:\n{search_result}&quot;<br>})</pre><p>This step is crucial.</p><p>From the model’s perspective:</p><ul><li>Tool output is just <strong>new context</strong></li><li>There is no distinction between “user input” and “tool output”</li><li>Everything becomes part of the reasoning space</li></ul><p>This is how agents ground themselves in reality.</p><h4>Knowing When to Stop</h4><p>Probably the most underrated part of agent design is <strong>termination</strong>.</p><p>In this implementation, the rule is simple:</p><pre>else:<br>    return reply</pre><p>If the model doesn’t request a tool, the agent assumes:</p><blockquote><em>“The goal has been achieved.”</em></blockquote><p>This works because the system prompt clearly defines:</p><ul><li><em>When tools should be used</em></li><li><em>What a final answer looks like</em></li></ul><p>A well-written system prompt eliminates the need for complex stopping logic.</p><h3>Why This Pattern Matters</h3><p>This small function demonstrates several core principles of agentic AI:</p><ul><li><strong>Explicit decision-making</strong></li><li><strong>Tool use via language, not APIs</strong></li><li><strong>State built entirely from context</strong></li><li><strong>Predictable, inspectable behavior</strong></li></ul><p>Once you understand this loop, you understand <strong>how agents actually work</strong>.</p><h3>Triggering</h3><p>On the top level we just feed question to the agent and print the results afterwards.</p><pre>if __name__ == &quot;__main__&quot;:</pre><pre>    result = research_agent(<br>        &quot;Find me the 3 most interesting and most incredible,&quot;<br>        &quot;but real and confirmed events from 2025.&quot;<br>        &quot;Save each fact with a short explanation&quot;<br>    )</pre><pre>    print(&quot;\nFinal answer:\n&quot;)<br>    print(result)</pre><p>📄 A nice next step would be to let the user pass the prompt via the command line when launching the agent. 
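</p><p>One possible shape for that change (a sketch, in case you want a starting point; feel free to ignore it and do the assignment your own way):</p><pre>import sys<br><br>if __name__ == &quot;__main__&quot;:<br>    # use everything after the script name as the task, or fall back to a default<br>    task = &quot; &quot;.join(sys.argv[1:]) or &quot;Find me the 3 most interesting events from 2025.&quot;<br>    print(research_agent(task))</pre><p>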
If you feel like tinkering a bit, try this as a small home assignment — totally optional, just a chance to get your hands dirty.</p><h3>Using the agent</h3><p>In order to use the agent we first have to go to the newly created folder and switch to the virtual environment configured by the <strong>uv</strong> (if not enabled already):</p><pre>source .venv/bin/activate</pre><p>And start the agent:</p><pre>uv run main.py</pre><p>When you are done using the agent remember to deactivate the environment:</p><pre>deactivate</pre><h3>Sample Search</h3><p>Now let’s see why do we even bother. When we ask plain LLM:</p><blockquote>Find me 3 the most interesting and most incredible, but real and confirmed events from 2025.<br>Save each fact with a short explanation</blockquote><p>thegpt-4o-mini responds:</p><blockquote>I’m sorry, but I can’t provide real events from 2025 as my training only includes information up until October 2023, and I don’t have access to future events or developments. However, I can help you brainstorm potential future scenarios or discuss trends that might lead to interesting events in 2025 based on current knowledge. Would you like to explore that instead?</blockquote><p>Like I said before, LLM is just a brain.</p><p>Now let’s see what our Search Agent can do when asked the same question:</p><pre>[<br>   {<br>      &quot;role&quot;:&quot;system&quot;,<br>      &quot;content&quot;:&quot;\n    You are a research assistant.\n    If you need web data:\n    SEARCH:&lt;query&gt;\n    \n    If you have a URL:\n    \n    FETCH:&lt;url&gt;\n    IMPORTANT: When you receive search results with multiple URLs, you MUST try to FETCH all relevant URLs before doing another SEARCH.\n    FETCH only one URL at a time when done go to next URL.\n    Check all the URLs from the search results first. Only search again if none of the URLs contain useful information.\n    If a page can&#39;t be read or is irrelevant, move on to the next URL from your search results. \n    Do not use URL with bad content again.\n\n    If URLs are provided, base your answer only on their content.    \n    Final answer must be based only on verified sources.\n    Save useful info with:\n    SAVE:&lt;text&gt;\n\n    After saving, provide the final answer based only on verified sources.\n    Do not provide final answer if answer wasn&#39;t saved.\n    &quot;<br>   },<br>   {<br>      &quot;role&quot;:&quot;user&quot;,<br>      &quot;content&quot;:&quot;Find me the 3 most interesting and most incredible,but real and confirmed events from 2025.Save each fact with a short explanation&quot;<br>   },<br>   {<br>      &quot;role&quot;:&quot;assistant&quot;,<br>      &quot;content&quot;:&quot;SEARCH: \&quot;most interesting events 2025\&quot;&quot;<br>   },<br>   {<br>      &quot;role&quot;:&quot;user&quot;,<br>      &quot;content&quot;:&quot;Search result:\n[{&#39;title&#39;: &#39;2025 - Wikipedia&#39;, &#39;url&#39;: &#39;https://en.wikipedia.org/wiki/2025&#39;, &#39;snippet&#39;: &#39;The year saw an escalation of major ongoing armed conflicts, including the Gaza war, marked by famine and a humanitarian crisis; the Russian invasion of Ukraine ... Events · January · March · April&#39;}, {&#39;title&#39;: &#39;Ten Most Significant World Events in 2025 - Council on Foreign Relations&#39;, &#39;url&#39;: &#39;https://www.cfr.org/articles/ten-most-significant-world-events-2025&#39;, &#39;snippet&#39;: &#39;Dec 23, 2025 · Ten Most Significant World Events in 2025 · 10.Cambodia and Thailand Clash · 9. 
Cardinal Robert Prevost Becomes Pope Leo XIV · 8.India and ...&#39;}, {&#39;title&#39;: &#39;2025: Year in Review | Pop Culture, Images, Current Events ... - Britannica&#39;, &#39;url&#39;: &#39;https://www.britannica.com/topic/2025-Year-in-Review&#39;, &#39;snippet&#39;: &#39;Dec 22, 2025 · In pop culture news, Taylor Swift became engaged to Travis Kelce, Anora was the big winner at the Oscars, and Beyoncé finally won the Grammy ...&#39;}, {&#39;title&#39;: &#39;Historical Events in 2025 - On This Day&#39;, &#39;url&#39;: &#39;https://www.onthisday.com/events/date/2025&#39;, &#39;snippet&#39;: &#39;Historical Events in 2025 · Youngest Ever Darts Champion · Presidential Medal of Freedom · Trudeau Promises to Resign · No Fact-Checking on Facebook and Instagram.&#39;}, {&#39;title&#39;: &#39;15 Events to Watch in 2025 | Opus Agency&#39;, &#39;url&#39;: &#39;https://www.opusagency.com/article/15-events-to-watch-in-2025&#39;, &#39;snippet&#39;: &#39;World Experience Summit (WXO)— London, April 29-May 1, 2025 · C2 Montréal — May 20-22, 2025 · Future of Everything Festival — NYC, May 28-29, 2025 · Web Summit ( ...&#39;}]&quot;<br>   },<br>   {<br>      &quot;role&quot;:&quot;assistant&quot;,<br>      &quot;content&quot;:&quot;FETCH: https://www.cfr.org/articles/ten-most-significant-world-events-2025&quot;<br>   },<br>   {<br>      &quot;role&quot;:&quot;user&quot;,<br>      &quot;content&quot;:&quot;Fetch result:\nTen Most Significant World Events in 2025 | Council on Foreign RelationsSkip to contentSupport CFRMember LoginRegionsTopicsResearchExpertsEventsArticlesThe Water&#39;s EdgeTen Most Significant World Events in 2025ShareDiplomacy and International InstitutionsTen Most Significant World Events in 2025As 2025 comes to a close, here are the ten most notable world events of the year.\\xa0\\xa0\nDisplaced Sudanese flee Zamzam displacement camp in North Darfur, Sudan, after attacks by the Rapid Support Forces, April 15, 2025.\nREUTERS/Stringer/File PhotoBy experts and staffPublishedDecember 23, 2025  9:48\u202fa.m.ExpertsBy James M. LindsayMary and David Boies Distinguished Senior Fellow in U.S. Foreign PolicyShareSign up to receive CFR President Mike Froman’s analysis on the most important foreign policy story of the week, delivered to your inbox every Friday. Subscribe to The World This Week.Anyone\\xa0hoping that 2025 would provide a break from what was\\xa0an exhausting 2024 on the world stage came away disappointed. The past twelve months have been a trying time for international cooperation, as the forces of conflict and contention grew stronger and the end of the American led world order more clearly came into view. Unlike 2024, when the pageantry of the Summer Olympics and beauty of the host city Paris reminded everyone of what cooperation and collaboration can accomplish, 2025 provided few instances of inspiration. One can only hope that 2026 will surprise us in a good way. But before we jump to the new year, here are my top ten most significant world events in 2025. You may want to read what follows closely. Many of these stories will continue to make news in 2026.\u202f\\xa010.Cambodia and Thailand ClashCountries can come to blows even when one is substantially larger and better equipped than the other. A case in point is the conflict that erupted in July between Cambodia and Thailand. With a population of just over 17 million, Cambodia is far smaller and militarily weaker than its neighbor Thailand, which has a population four times larger. 
The July fighting, the worst in decades, ostensibly was about a century-old border dispute. In particular, Cambodia and Thailand both claim ownership of Prasat Ta Muen Thom, an ancient temple perched in the forested Dangrek Mountains that divide the two countries. But complicated domestic politics on both sides of the border pushed those latent tensions into open conflict. The initial fighting killed several dozen people and displaced tens of thousands. President Donald Trump intervened and threatened to halt trade negotiations with Cambodia and Thailand unless the fighting stopped. Both sides then grudgingly agreed to a ceasefire. A formal deal was signed in late October in Kuala Lumpur on the margins of the annual ASEAN Summit. The agreement, however, did not stick. A landmine killed four Thai soldiers in November, prompting Bangkok to suspend parts of the agreement. The fighting escalated earlier this month when Thai fighter jets bombed targets in Cambodia. Thailand’s new prime minister, whose predecessor was\\xa0ousted for being too conciliatory\\xa0to\\xa0Phnom Penh,\\xa0has vowed that “if the fighting is to stop, Cambodia must follow the course of action set by Thailand.” That sounds like a recipe for continued fighting.\\xa0\\xa09. Cardinal Robert Prevost Becomes Pope Leo XIVThe Roman Catholic Church has had 267 popes. Until 2025, not one of them had been from North America, let alone the United States. That changed in May. On April 21, the ailing Pope Francis died suddenly after suffering a stroke. He had been pope for a dozen years. The Roman Catholic Church then entered a period it calls the\\xa0interregnum, which lasts until a new pope is selected. Following time-honored tradition, the 133 cardinals eligible to participate assembled in Rome for\\xa0a conclave, the formal process for selecting the next pope. The first three ballots saw black smoke emerge from the Sistine Chapel’s chimney, signifying that no one had achieved the\\xa0 necessary two-thirds support. Then on the fourth ballot, the smoke turned white. The conclave had selected Cardinal Robert Prevost. Born and raised on the South Side of Chicago and a graduate of Villanova University, he spent more than two decades in Peru, first doing missionary work and then as Bishop of Chiclayo. In 2023, he moved to Rome to become prefect of the\u202fDicastery for Bishops, which oversees the selection of most bishops,\u202fand was subsequently made a cardinal. Prevost became the first member of the Augustinian religious order, which was founded in 1244, to become pope. Upon being installed, he took the name Leo XIV to recall the example of Pope Leo XIII, the author of the 1891 encyclical\u202fRerum Novarum (“Revolutionary Change”) that called for improving the lot of the working class.\\xa0\\xa08.India and Pakistan ClashIn 2000, President Bill Clinton\\xa0called Kashmir “the most dangerous place in the world.” The simmering tensions between India and Pakistan over the region erupted into open conflict in May, two weeks after five terrorists killed twenty-six people near Pahalgam in Indian-administered Jammu and Kashmir. Lashkar-e-Taiba, the Pakistani-based terrorist organization that carried out the 2008\\xa0Mumbai\\xa0terrorist attack that killed 175 people, likely sponsored the Pahalgam attack. In 2008, India declined to retaliate. This time, however, it struck what\\xa0it\\xa0called “terrorist infrastructure” inside Pakistan. 
Pakistan responded with drone and missile strikes against Indian military bases and shot down two of India’s most advanced fighter jets. India in turn destroyed air defense systems around the Pakistani city of Lahore. Neither side gained a meaningful military advantage from their most intense fighting in half a century, and they agreed to a ceasefire after three days. President Donald Trump said he mediated the ceasefire, a claim that Pakistan’s Army Chief of Staff and de facto\\xa0leader Asim Munir supported\\xa0and Indian Prime Minister Narendra Modi\\xa0rejected. The immediate consequence of the fighting was a rift in U.S.-Indian relations, as Trump slapped steep tariffs on Indian exports partly in retaliation for Modi’s refusal to acknowledge his mediation efforts. But the underlying animus between India and Pakistan remains. One open issue is whether India’s decision to suspend its participation in the\u202fIndus Waters Treaty of 1960 means it will curtail the flow of water in the Indus River system that supplies 80 percent of Pakistani farms.\\xa0\\xa07. The AI Race IntensifiesRussian President Vladimir Putin\\xa0said of artificial intelligence\\xa0(AI)\\xa0back in 2017 that “whoever becomes the leader in this sphere will become the ruler of the world.” History may prove him wrong, but for now, countries are acting as if he might be right. The United States and China are the world’s two leading AI powers, and no one else is close. The Chinese firm DeepSeek captured global attention in January when it released an AI model that matched the best U.S. AI models without using advanced Nvidia chips that had been seen as essential to cutting-edge AI operations. Experts\\xa0challenged DeepSeek’s claims, but U.S. firms took them as reason to double down on their own AI investments. The importance of chips in AI development fueled the Trump administration’s initial decision to continue and then expand the restrictions the Biden administration placed on the export of advanced semiconductors made in the United States or with U.S. technology. Two weeks ago, however, Trump reversed course and authorized Nvidia to sell its powerful H200 chip to China. The decision triggered\\xa0biting\\xa0criticism that he was endangering U.S. national security. Lost in this debate is the fact that the AI race is not just about which country builds the best models but also who does a better job integrating AI into everyday operations. Here China may have the advantage. Meanwhile, critics ask whether AI might turn into an epic bust that shocks the economy\\xa0or an epic success that\\xa0erases jobs.\\xa0\\xa06. Sudan’s Grueling Civil War Continues“Hell on Earth” may be the best description of Sudan’s nearly three-year long civil war. The fighting\u202fpits\u202fthe Sudanese Armed Forces (SAF), led by Gen. Abdel Fattah al-Burhan, against the Rapid Support Forces (RSF), led by Mohamed Hamdan “Hemedti” Dagalo. The two men seized power in a coup in October 2021 but eventually had a falling out.\u202fNeither side has won a decisive breakthrough, and fighting continues\\xa0across multiple fronts. The SAF governs from Port Sudan on the Red Sea, controls the major cities in the east and north, and is recognized as Sudan’s legitimate government. Meanwhile, the RSF controls most of Darfur and other areas in central and western Sudan. Each side has foreign backers, with Egypt, Russia, and Turkey among the countries supporting the SAF and Chad, Ethiopia, and the United Arab Emirates among those supporting the RSF. 
The war’s human cost has been staggering. As many as 400,000 people\\xa0have died and more than twelve million have been displaced. Much of the country faces famine, and the demand for humanitarian aid far outpaces available resources. The fighting was particularly gruesome in the city of El Fasher, the last SAF stronghold in Darfur, which the RSF took in October after an eighteen-month siege. The blood and bodies from the resulting slaughter of the city’s inhabitants could be seen from space. The prospects for a mediated end to the civil war remain dim, and the country’s de facto partition is a possibility.\\xa0\\xa05. The United States Brokers a Gaza Peace PlanAfter\\xa0two years of brutal fighting,\\xa0Israel and Hamas\\xa0agreed in October to a ceasefire. It was the second of the year. The Biden administration negotiated a ceasefire in January that increased humanitarian aid to Gaza, led Hamas to release thirty-three\\xa0hostages,\\xa0and\\xa0prompted\\xa0Israel to free\\xa0nearly 2,000 Palestinian prisoners. The respite from the fighting ended in mid March after the two sides disagreed on the terms of an extension. President Donald Trump played a central role in pushing for the\\xa0October ceasefire, with Qatar, Turkey, and Egypt also helping mediate. The agreement laid out a three phase peace plan for Gaza: 1) an immediate ceasefire with Israeli security forces withdrawing to preset lines and an exchange of hostages and prisoners; 2) the demilitarization of Hamas and the dispatch of an international stabilization force to Gaza; and 3) the reconstitution of Palestinian governance and the reconstruction of Gaza. The United States considers phase one completed, though Israel insists that it will not move to phase two until the last Israeli hostage is released. The UN Security Council endorsed\\xa0the\\xa0Gaza peace plan\u202fin November, authorizing the deployment of the International Stabilization Force, and calling for the establishment of a\u202fPalestinian Committee\u202fto manage Gaza’s day-to-day governance. A lasting peace, however, remains elusive. Hamas shows no signs of disarming, Israel has resumed strikes\\xa0in Gaza,\\xa0and\\xa0no country has yet formally committed\\xa0to\\xa0contribute troops\\xa0to\\xa0the International Stabilization Force.\\xa0\\xa04. The War in Ukraine Grinds onThe war in Ukraine, now in its fourth year, ground on in 2025. Russia intensified its missile and drone campaigns, repeatedly striking Ukrainian cities, causing heavy civilian casualties and damaging major infrastructure. In March, Russia reclaimed its province of Kursk, which Ukraine seized in a surprise invasion in August 2024. Russian advances in Ukraine itself, though, were meager; Russia increased its control of Ukrainian territory by less than 1 percent in 2025. Those gains came at a frightening cost, with\\xa0Russia losing\\xa0roughly\\xa0one thousand soldiers every day. Ukraine’s losses are far lower, but its population is just a third of Russia’s. Ukraine stunned the world in June with Operation Spiderweb, a covert drone strike deep into Russia that hit five air bases. The attack, however, did not change the war’s basic dynamics. Experts debate how long Russia and Ukraine can continue fighting, though Ukraine’s position looks more precarious. President Donald Trump insists that Kyiv does not “have the cards” needed to win, and he remains opposed to increasing U.S. support. 
Europe is providing Kyiv with substantial financial and military support, and the European Union ended 2025 by agreeing to provide Kyiv with a $105 billion loan. That should cover two-thirds of\\xa0Ukraine’s financial needs\\xa0over the next two years. Trump is pushing Ukrainian President Volodymyr Zelenskyy to accept a ceasefire that many experts see as heavily tilted in Russia’s favor. Even so, Russian President Vladimir Putin looks likely to\\xa0hold out for more.\\xa0\\xa03. Israel and the United States Attack Iran’s Nuclear FacilitiesSome\\xa0wars are measured in years, others in days. The June confrontation between Israel and Iran falls in the latter category; it is now known as the Twelve-Day War. The clash was years in the making. Israel has long seen Iran’s nuclear ambitions and its support for anti-Israeli militias across the Middle East as an existential threat. After the October 7, 2023 Hamas attacks, Israel targeted Iranian proxies in Lebanon, Iraq, Syria, and Yemen. These operations crippled Tehran’s ability to use the so-called Axis of Resistance to deter an Israeli attack on Iran. In October 2024, Israeli airstrikes devastated\\xa0Iran’s missile\\xa0facilities and decimated its air defenses. In June 2025, as Iran\\xa0neared the point where it could quickly build a small number of crude nuclear weapons, Israel launched\\xa0Operation Rising Lion. It involved airstrikes on Iranian nuclear and ballistic-missile facilities, as well as on military bases and command nodes, and the assassination of politicians, military leaders, and nuclear scientists. Iran retaliated by firing\\xa0missiles\\xa0and\\xa0drones at Israel. Assisted by the U.S. military, Israel shot down most of the barrages. Then, on June 22, U.S. B2 bombers and Tomahawk missiles hit Iran’s nuclear facilities as part of Operation Midnight Hammer. President Donald Trump\\xa0claimed the\\xa0attack “totally obliterated” Iran’s nuclear program. However, Iran likely dispersed its stockpile of enriched uranium before the bombs fell. Israel and Iran agreed to a ceasefire on June 24. That halted the fighting but left the underlying political disputes unresolved.\\xa02. China Weaponizes Rare-Earth MineralsFuture historians may flag 2025 as the year China established itself as\\xa0 superpower equal to the United States. For decades, Washington used its economic dominance, particularly of international finance, to advance its foreign policy objectives. In April 2025 and again in October, Beijing showed that it could also weaponize its economic advantages. In its case, the leverage comes from rare-earth minerals essential to a wide range of industrial and military applications. China has worked for years to dominate\\xa0the rare-earth supply chain; it now controls roughly 60 percent of the world’s rare-earth mining and 90 percent of it refining capacity. China’s willingness to weaponize its dominance was no secret. Beijing curtailed rare-earth exports to Japan in 2010 during a territorial dispute over the Senkaku (or Diaoyu) islands. Despite the warning, the United States did little to protect itself. When Trump raised tariffs on China in April, Beijing halted\\xa0the export\\xa0of\\xa0magnets and\\xa0seven rare earth minerals to the United States. Within a month, Trump\\xa0scaled back his tariffs. In October, after Trump imposed additional restrictions on the export of advanced semiconductor chips and technology to China, Beijing imposed new regulations on the export of products using Chinese rare earth minerals. 
Faced with a potentially substantial disruption to the U.S. economy, Trump delayed implementation of his\\xa0export controls and abandoned several other measures aimed at China. Beijing in turn delayed implementation of its October regulations for one year, suggesting that the rare-earth threat continues to hang over the United States.\\xa0\\xa01.Donald Trump Disrupts U.S. Foreign PolicyIs making good on your word always a good thing? Donald Trump vowed in 2024 to overhaul U.S. foreign policy. He can now say, “Promises Made. Promises Kept.” His disruptions began even before he took office when he declared his desire to acquire Greenland, make Canada the fifty-first state, and reassert control over the Panama Canal. On Inauguration Day, he withdrew the United States from the Paris Agreement\\xa0and the\\xa0World Health Organization, restricted\\xa0refugee readmissions, and designated drug cartels as foreign terrorist organizations. In his first month in office, he began shuttering the U.S. Agency for International Development, ending independent oversight of major federal agencies, and slashing government payrolls. On April 2, he launched “Liberation Day,” imposing 10 percent tariffs on most imports along with additional country specific tariffs of up to 50 percent. Trump claimed credit for convincing NATO members to spend 5 percent of their GDP\\xa0on defense and for\\xa0ending eight foreign wars. He failed, however, to end the war in Ukraine, and to the surprise of his non-interventionist supporters, he threatened Venezuela with regime change. Earlier this month, he released a National Security Strategy that provided a strategic logic for his actions. Critics found much to fault in what Trump did: his tariffs\\xa0hurt U.S. consumers and producers, his Ukraine peace plan rewarded Russia for its aggression, many of the conflicts he claims\\xa0to have ended\\xa0continue, and his downsizing of the national security bureaucracy left the U.S.\\xa0government greatly weakened. One thing, however, has become clear: Trump has\\xa0ended\\xa0the\\xa0era of\\xa0Pax Americana. That reality brings to mind Joni Mitchell’s haunting lyrics: “You don’t know what you’ve got. Till it’s gone.”\\xa0\\xa0\\xa0Oscar Berry and Sofia Park Jadotte\\xa0assisted\\xa0in the preparation of this post.\u202f\\xa0Other posts in this series:\u202f\\xa0Ten Most Significant World Events in 2024\\xa0Ten Most Significant World Events in 2023\u202f\\xa0Ten Most Significant World Events in 2022\u202f\u202f\\xa0Ten Most Significant World Events in 2021\u202f\\xa0Ten Most Significant World Events in 2020\u202f\\xa0Ten Most Significant World Events in 2019\u202f\\xa0Ten Most Significant World Events in 2018\u202f\\xa0Ten Most Significant World Events in 2017\u202f\\xa0Ten Most Significant World Events in 2016\u202f\\xa0Ten Most Significant World Events in 2015\u202f\\xa0Ten Most Significant World Events in 2014\u202f\\xa0InformationFrom The Water&#39;s Edge and U.S. Foreign Policy ProgramCreative Commons: Some rights reserved.This work is licensed under Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) License.\nDiplomacy and International InstitutionsSudanTime Hasn&#39;t Clarified Trump&#39;s Venezuela StrategyBy Roxanna VigilFebruary 4, 2026The AI Bubble Is Getting Closer to PoppingBy Shannon K. 
O&#39;NeilJanuary 29, 2026Everything but Territory: Europe’s Response to Trump’s Greenland ThreatsBy Liana FixJanuary 16, 2026Council on Foreign RelationsAbout CFRMembershipSupport CFRNewslettersCareersCFR EducationFor MediaContact UsPrivacy PolicyGet CFR’s Daily News BriefSubscribe©2026 Council on Foreign Relations&quot;<br>   },<br>   {<br>      &quot;role&quot;:&quot;assistant&quot;,<br>      &quot;content&quot;:&quot;SAVE: \&quot;In 2025, Cardinal Robert Prevost became Pope Leo XIV, marking the first time a pope was from North America. His election followed the sudden death of Pope Francis and was significant for the Catholic Church, as Prevost aimed to address social issues and improve the church\\&#39;s engagement with contemporary challenges.\&quot;\n\nSAVE: \&quot;The Twelve-Day War between Israel and Iran occurred in June 2025, where Israel launched a series of airstrikes on Iranian nuclear facilities, leading to a significant military confrontation. This conflict underscored the ongoing tensions in the Middle East and the complexities surrounding nuclear proliferation.\&quot;\n\nSAVE: \&quot;In 2025, the United States and China intensified their competition in artificial intelligence, with both nations making significant advancements. The Chinese firm DeepSeek released a competitive AI model, prompting the U.S. to increase its investments in AI technology, highlighting the global race for technological supremacy.\&quot;\n\nFinal answer based on verified sources:\n\n1. **Cardinal Robert Prevost Becomes Pope Leo XIV**: In 2025, Cardinal Robert Prevost was elected as Pope Leo XIV, making history as the first pope from North America. His election followed the death of Pope Francis and was significant for the Catholic Church, as Prevost aimed to address social issues and improve the church\\&#39;s engagement with contemporary challenges.\n\n2. **The Twelve-Day War Between Israel and Iran**: In June 2025, Israel launched a series of airstrikes on Iranian nuclear facilities, leading to a military confrontation known as the Twelve-Day War. This conflict highlighted the ongoing tensions in the Middle East and the complexities surrounding nuclear proliferation.\n\n3. **Intensified AI Competition Between the U.S. and China**: The year 2025 saw a significant escalation in the competition between the United States and China in the field of artificial intelligence. The Chinese firm DeepSeek released a competitive AI model, prompting the U.S. to increase its investments in AI technology, underscoring the global race for technological supremacy.&quot;<br>   },<br>   {<br>      &quot;role&quot;:&quot;user&quot;,<br>      &quot;content&quot;:&quot;Save content:\nSAVE: \&quot;In 2025, Cardinal Robert Prevost became Pope Leo XIV, marking the first time a pope was from North America. His election followed the sudden death of Pope Francis and was significant for the Catholic Church, as Prevost aimed to address social issues and improve the church\\&#39;s engagement with contemporary challenges.\&quot;\n\nSAVE: \&quot;The Twelve-Day War between Israel and Iran occurred in June 2025, where Israel launched a series of airstrikes on Iranian nuclear facilities, leading to a significant military confrontation. This conflict underscored the ongoing tensions in the Middle East and the complexities surrounding nuclear proliferation.\&quot;\n\nSAVE: \&quot;In 2025, the United States and China intensified their competition in artificial intelligence, with both nations making significant advancements. 
The Chinese firm DeepSeek released a competitive AI model, prompting the U.S. to increase its investments in AI technology, highlighting the global race for technological supremacy.\&quot;\n\nFinal answer based on verified sources:\n\n1. **Cardinal Robert Prevost Becomes Pope Leo XIV**: In 2025, Cardinal Robert Prevost was elected as Pope Leo XIV, making history as the first pope from North America. His election followed the death of Pope Francis and was significant for the Catholic Church, as Prevost aimed to address social issues and improve the church\\&#39;s engagement with contemporary challenges.\n\n2. **The Twelve-Day War Between Israel and Iran**: In June 2025, Israel launched a series of airstrikes on Iranian nuclear facilities, leading to a military confrontation known as the Twelve-Day War. This conflict highlighted the ongoing tensions in the Middle East and the complexities surrounding nuclear proliferation.\n\n3. **Intensified AI Competition Between the U.S. and China**: The year 2025 saw a significant escalation in the competition between the United States and China in the field of artificial intelligence. The Chinese firm DeepSeek released a competitive AI model, prompting the U.S. to increase its investments in AI technology, underscoring the global race for technological supremacy.&quot;<br>   },<br>   {<br>      &quot;role&quot;:&quot;assistant&quot;,<br>      &quot;content&quot;:&quot;The content has been saved successfully. Here is the final answer based on verified sources:\n\n1. **Cardinal Robert Prevost Becomes Pope Leo XIV**: In 2025, Cardinal Robert Prevost was elected as Pope Leo XIV, making history as the first pope from North America. His election followed the death of Pope Francis and was significant for the Catholic Church, as Prevost aimed to address social issues and improve the church&#39;s engagement with contemporary challenges.\n\n2. **The Twelve-Day War Between Israel and Iran**: In June 2025, Israel launched a series of airstrikes on Iranian nuclear facilities, leading to a military confrontation known as the Twelve-Day War. This conflict highlighted the ongoing tensions in the Middle East and the complexities surrounding nuclear proliferation.\n\n3. **Intensified AI Competition Between the U.S. and China**: The year 2025 saw a significant escalation in the competition between the United States and China in the field of artificial intelligence. The Chinese firm DeepSeek released a competitive AI model, prompting the U.S. to increase its investments in AI technology, underscoring the global race for technological supremacy.&quot;<br>   }<br>]</pre><p>This is a log of the conversation.</p><p>When done it saves the final response in <strong>search_results.txt</strong> file:</p><blockquote>Final answer based on verified sources:</blockquote><blockquote>1. **Cardinal Robert Prevost Becomes Pope Leo XIV**: In 2025, Cardinal Robert Prevost was elected as Pope Leo XIV, making history as the first pope from North America. His election followed the death of Pope Francis and was significant for the Catholic Church, as Prevost aimed to address social issues and improve the church’s engagement with contemporary challenges.</blockquote><blockquote>2. **The Twelve-Day War Between Israel and Iran**: In June 2025, Israel launched a series of airstrikes on Iranian nuclear facilities, leading to a military confrontation known as the Twelve-Day War. 
This conflict highlighted the ongoing tensions in the Middle East and the complexities surrounding nuclear proliferation.</blockquote><blockquote>3. **Intensified AI Competition Between the U.S. and China**: The year 2025 saw a significant escalation in the competition between the United States and China in the field of artificial intelligence. The Chinese firm DeepSeek released a competitive AI model, prompting the U.S. to increase its investments in AI technology, underscoring the global race for technological supremacy.</blockquote><h3>The Conclusion</h3><p>The LLM alone wasn’t able to provide the information we required, but our agent was able to browse the web and save its findings on disk 🤯</p><p>Stay tuned: this is just the beginning 💪</p><p>Please find the full code below ⬇️</p><p>⚠️ <em>The purpose of this post is to keep things as simple as possible, providing an easy-to-understand overview of how AI Agents work, how they are built, and what their main components are. That said, unexpected issues and nuances may arise in practice, which are not covered here — this is a deliberate trade-off to prioritize clarity and accessibility.</em></p><p>☕️ A small coffee goes a long way! If you feel like supporting me, that’s awesome (go to <a href="http://buymeacoffee.com/ktustanowski">buymeacoffee.com/ktustanowski</a> or scan the QR code below) — but no worries at all if not. I hope you got something valuable here!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/700/1*arU38s0o3dvJ3gZ1k-0BIQ.png" /></figure><pre># ----------------------------<br># Imports<br># ----------------------------</pre><pre>import os<br>from dotenv import load_dotenv</pre><pre>from openai import OpenAI<br>from ddgs import DDGS<br>import requests<br>from bs4 import BeautifulSoup<br></pre><pre># ----------------------------<br># Environment setup<br># ----------------------------</pre><pre># Load environment variables from .env file<br>load_dotenv()</pre><pre># Expect OPENAI_API_KEY to be defined in .env<br>if not os.getenv(&quot;OPENAI_API_KEY&quot;):<br>    raise RuntimeError(&quot;OPENAI_API_KEY not found in environment variables&quot;)</pre><pre># Initialize OpenAI client (2026 API style)<br>client = OpenAI()<br></pre><pre># ----------------------------<br># LLM helper<br># ----------------------------</pre><pre>def call_llm(messages: list[dict]) -&gt; str:<br>    &quot;&quot;&quot;<br>    Send a conversation to the language model<br>    and return the assistant&#39;s response text.<br>    &quot;&quot;&quot;</pre><pre>    response = client.chat.completions.create(<br>        model=&quot;gpt-4o-mini&quot;,<br>        messages=messages,<br>        temperature=0<br>    )</pre><pre>    return response.choices[0].message.content<br></pre><pre># ----------------------------<br># Tools<br># ----------------------------</pre><pre>def search_web(query: str) -&gt; list[dict]:<br>    &quot;&quot;&quot;<br>    Search the web via DDGS and return a list of result dictionaries.<br>    &quot;&quot;&quot;<br>    print(&quot;🛜 web_search - start&quot;)<br>    results = []</pre><pre>    with DDGS() as ddgs:<br>        for r in ddgs.text(query, max_results=1):<br>            results.append({<br>                &quot;title&quot;: r[&quot;title&quot;],<br>                &quot;url&quot;: r[&quot;href&quot;],<br>                &quot;snippet&quot;: r[&quot;body&quot;],<br>            })</pre><pre>            print(&quot;=============================&quot;)<br>            print(f&quot;🔍 query {query}&quot;)<br>            print(f&quot;🧾 title {r[&#39;title&#39;]}&quot;)<br>            print(f&quot;🔗 url {r[&#39;href&#39;]}&quot;)<br>            print(f&quot;📋 
snippet {r[&#39;body&#39;]}&quot;)</pre><pre>    print(f&quot;🛜 web_search - finish - {len(results)} results&quot;)<br>    return results</pre><pre>def fetch_webpage(url):<br>    &quot;&quot;&quot;<br>    Fetch provided webpage<br>    &quot;&quot;&quot;<br>    print(f&quot;🌍 FETCH {url}&quot;)<br>    html = requests.get(url, timeout=30).text  # timeout keeps a slow page from hanging the agent<br>    soup = BeautifulSoup(html, &quot;html.parser&quot;)<br>    return soup.get_text()</pre><pre>def save_to_disk(text: str) -&gt; None:<br>    &quot;&quot;&quot;<br>    Write the text to search_results.txt (overwrites previous content).<br>    &quot;&quot;&quot;<br>    print(&quot;💾 Save to search_results.txt&quot;)<br>    with open(&quot;search_results.txt&quot;, &quot;w&quot;) as f:<br>        f.write(text)</pre><pre># ----------------------------<br># Agent logic<br># ----------------------------</pre><pre>def research_agent(task: str) -&gt; str:<br>    &quot;&quot;&quot;<br>    A simple agent loop:</pre><pre>    1. The model reasons about the task<br>    2. Decides whether to use a tool<br>    3. Consumes tool output<br>    4. Stops when the goal is achieved<br>    &quot;&quot;&quot;</pre><pre>    system_prompt = &quot;&quot;&quot;<br>    You are a research assistant.<br>    If you need web data:<br>    SEARCH:&lt;query&gt;<br>    If you have a URL:<br>    FETCH:&lt;url&gt;<br>    If a page can’t be read or is irrelevant, discard it and search again. <br>    Do not use URL with bad content again.<br>    If URLs are provided, base your answer only on their content.<br>    Save useful info with:<br>    SAVE:&lt;text&gt;<br>    After saving, provide the final answer based only on verified sources.    &quot;&quot;&quot;<br>    messages = [<br>        {&quot;role&quot;: &quot;system&quot;, &quot;content&quot;: system_prompt},<br>        {&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: task}<br>    ]</pre><pre>    while True:<br>        reply = call_llm(messages)<br>        print(f&quot;🤖 LLM Reply {reply}&quot;)<br>        messages.append({&quot;role&quot;: &quot;assistant&quot;, &quot;content&quot;: reply})<br>        if reply.startswith(&quot;SEARCH:&quot;):<br>            query = reply.replace(&quot;SEARCH:&quot;, &quot;&quot;).strip()<br>            search_result = search_web(query)</pre><pre>            messages.append({<br>                &quot;role&quot;: &quot;user&quot;,<br>                &quot;content&quot;: f&quot;Search result:\n{search_result}&quot;<br>            })</pre><pre>        elif reply.startswith(&quot;FETCH:&quot;):<br>            query = reply.replace(&quot;FETCH:&quot;, &quot;&quot;).strip()<br>            fetch_page_result = fetch_webpage(query)</pre><pre>            messages.append({<br>                &quot;role&quot;: &quot;user&quot;,<br>                &quot;content&quot;: f&quot;Fetch result:\n{fetch_page_result}&quot;<br>            })</pre><pre>        elif reply.startswith(&quot;SAVE:&quot;):<br>            text = reply.replace(&quot;SAVE:&quot;, &quot;&quot;).strip()<br>            save_to_disk(text)</pre><pre>            messages.append({<br>                &quot;role&quot;: &quot;user&quot;,<br>                &quot;content&quot;: f&quot;Save content:\n{reply}&quot;<br>            })</pre><pre>        else:<br>            return reply<br></pre><pre># ----------------------------<br># Entry point<br># ----------------------------</pre><pre>if __name__ == &quot;__main__&quot;:</pre><pre>    result = research_agent(<br>        &quot;Browse the web to find strange but true historical facts that sound fake. &quot;<br>        &quot;Save each fact with a short explanation.&quot;<br>    )</pre><pre>    print(&quot;\n✅ Final answer:\n&quot;)<br> 
   print(result)</pre>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[AI-Powered Mobile App Creation]]></title>
            <link>https://medium.com/@kamil.tustanowski/ai-powered-mobile-app-creation-2aaffe8e9ec4?source=rss-2c0314596f19------2</link>
            <guid isPermaLink="false">https://medium.com/p/2aaffe8e9ec4</guid>
            <dc:creator><![CDATA[Kamil Tustanowski]]></dc:creator>
            <pubDate>Sun, 10 Sep 2023 06:47:02 GMT</pubDate>
            <atom:updated>2023-09-21T11:48:26.210Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*VUX68v7o-_wG9A1d2a36Kg.png" /></figure><p>Creating an application is a multifaceted endeavor, often requiring a diverse skill set. A proficient developer may excel in coding but fall short in areas such as design, user interface/user experience (UI/UX), or business strategy. Conversely, a talented artist might lack the technical know-how to implement their ideas in code, and a successful entrepreneur might not possess the design acumen needed for crafting an appealing app. Even seemingly simple applications can pose challenges, as individuals may have expertise in some aspects (X and Y), but find themselves struggling when confronted with others (Z). Excelling in all these areas simultaneously can be a formidable task.</p><p>My interest in artificial intelligence (AI) led me to embark on an experiment aimed at transforming the way we approach application development. The focus of this experiment was to create a basic yet fully functional Minimum Viable Product (MVP) while minimizing the amount of code writing required.</p><p>To accomplish this, I leveraged several AI tools and models:<br>- ChatGPT 3.5<br>- Midjourney (Version 5.0 and above)<br>- Stable Diffusion XL<br>- Bard (Experiment), as of August 18, 2023</p><p>I was particularly mindful of cost efficiency throughout this project, allocating expenses primarily to Midjourney while utilizing other resources that were either freely available or offered free prompts.</p><p>In summary, the experiment I conducted sought to challenge the traditional approach to application development by harnessing the power of AI, thereby demonstrating the potential for a more accessible and efficient method of creating software solutions.</p><h3>The Idea</h3><p>In pursuit of a straightforward yet beneficial endeavor, I opted to develop a water intake logging application with the aim of enhancing personal hydration management. With a clear goal in mind, I proceeded to the next phase of the project.</p><h3>The Plan</h3><p>Typically, this was the stage where I would commence coding, engaging in a combination of experimentation and deliberation regarding the app’s UI design. Unfortunately, this approach often led to frequent code refactoring, as my development process revealed opportunities for improvement. Even when the application appeared theoretically complete, I found myself continuously adding features, as it became evident that without them, the app lacked a sense of fulfillment.</p><p>However, for this particular project, I decided to adopt a novel approach. I formulated a prompt and sought assistance from <strong>ChatGPT</strong> and <strong>Bard</strong>. Here’s the prompt I used:</p><blockquote>Act like iOS app designer and propose simple water driking app.</blockquote><h4>Bard</h4><blockquote>Sure, here is a simple water drinking app design for iOS:</blockquote><blockquote>Home Screen: The home screen should have a large water bottle icon in the center, with a progress bar underneath showing how much water the user has drunk today. There should also be a button to set a reminder to drink water, and a button to view the water calculator.</blockquote><blockquote>Water Reminder: The water reminder should be set to go off at regular intervals throughout the day. 
The user can choose how often they want to be reminded to drink water.</blockquote><blockquote>Water Calculator: The water calculator should calculate the amount of water the user needs to drink based on their weight, activity level, and climate. The user can also override the calculator’s recommendation if they want.</blockquote><blockquote>Medical Data: The app should be able to access the user’s medical data, such as their weight, if they have provided it. This data can be used to improve the accuracy of the water calculator.</blockquote><blockquote>Here are some additional features that could be added to the app:</blockquote><blockquote>Habit Tracking: The app could track the user’s water drinking habits over time. This could help the user to see how much progress they are making and stay motivated.</blockquote><blockquote>Social Sharing: The user could share their water drinking goals and progress with friends and family. This could help to create a sense of community and accountability.</blockquote><blockquote>Gameification: The app could gamify the water drinking experience by awarding points or badges for drinking water. This could make the experience more fun and engaging.</blockquote><blockquote>I hope this gives you a starting point for your water drinking app design.</blockquote><h4>ChatGPT</h4><blockquote>Title: HydrateMe — Your Personal Water Drinking Assistant</blockquote><blockquote>Description: HydrateMe is a simple yet powerful water drinking app designed to help you stay hydrated throughout the day. With its intuitive interface and advanced features, it ensures you never forget to drink water and guides you on how much water you need to consume based on your personal information. The app can even fetch your weight data from your medical records (with your permission) to calculate the appropriate daily water intake.</blockquote><blockquote>Main Features:</blockquote><blockquote>Personalized Water Intake Calculator:</blockquote><blockquote>The app will ask you to enter some basic information, such as your weight, age, gender, and activity level.</blockquote><blockquote>Utilizing this data, HydrateMe will calculate your recommended daily water intake based on widely accepted hydration guidelines.</blockquote><blockquote>Customizable Water Goals:</blockquote><blockquote>You can set your preferred daily water drinking goal. HydrateMe will keep track of your progress and provide timely reminders to ensure you achieve it.</blockquote><blockquote>Intelligent Water Intake Reminders:</blockquote><blockquote>HydrateMe will send you push notifications at regular intervals, reminding you to drink water.</blockquote><blockquote>The app will consider your schedule and preferences to set reminders at the most suitable times for you.</blockquote><blockquote>Quick Water Entry:</blockquote><blockquote>Logging your water intake is hassle-free. 
Simply tap on the app to record the amount of water you’ve consumed.</blockquote><blockquote>Visual Progress Tracking:</blockquote><blockquote>A visual representation of your daily water intake progress will motivate you to stay on track and reach your hydration goals.</blockquote><blockquote>Integration with Medical Data (Optional):</blockquote><blockquote>If you grant permission, the app can access your medical records to retrieve your weight data.</blockquote><blockquote>By using your weight data, HydrateMe ensures a more accurate calculation of your personalized water intake needs.</blockquote><blockquote>Achievements and Rewards:</blockquote><blockquote>Stay motivated with achievements and rewards for consistently achieving your daily water intake goals.</blockquote><blockquote>Hydration Tips and Information:</blockquote><blockquote>The app will provide useful hydration tips and information on the importance of staying hydrated.</blockquote><blockquote>Dark Mode Support:</blockquote><blockquote>HydrateMe offers a dark mode option for a comfortable and energy-efficient experience during nighttime usage.</blockquote><blockquote>Data Sync and Backup:</blockquote><blockquote>Your water intake data will be securely stored in the cloud, ensuring you can access it across multiple devices and never lose your progress.</blockquote><blockquote>HydrateMe is your reliable companion to help you develop and maintain healthy hydration habits. Stay refreshed, energized, and focused with this user-friendly water drinking app!</blockquote><p>I found <strong>ChatGPT’s</strong> response more appealing, as it was comprehensive and even suggested a potential app name. However, it was slightly too extensive for my needs. Therefore, I requested <strong>ChatGPT</strong> to:</p><blockquote>Simplify this just to mvp for now</blockquote><p>Response:</p><blockquote>Title: Hydration Buddy</blockquote><blockquote>Description: Hydration Buddy is a simple water drinking app designed to remind you to drink water and track your daily intake. With just a few essential features, it helps you stay on top of your hydration goals effortlessly.</blockquote><blockquote>Main Features:</blockquote><blockquote>Water Intake Reminder:<br>Set personalized reminders to drink water at regular intervals throughout the day.</blockquote><blockquote>Customizable Daily Goal:<br>Choose your preferred daily water intake goal.</blockquote><blockquote>Intuitive Water Entry:<br>Easily log your water consumption with a quick and straightforward interface.</blockquote><blockquote>Automatic Weight-Based Calculation:<br>Optionally, provide your weight, and the app will calculate your recommended daily water intake based on it.</blockquote><blockquote>Hydration Buddy simplifies the process of staying hydrated, keeping you refreshed and healthy every day.</blockquote><p>This streamlined approach proved efficient, and the longer version remained available for reference if needed, all accomplished within a matter of minutes.</p><h3>The Code</h3><p>Now that I have a clear vision of what I intend to construct and a well-defined plan in mind, it’s time to commence the coding phase. Nevertheless, it’s important to remember that AI can also play a pivotal role in this process. To harness its potential, I employed another prompt:</p><blockquote>Generate the code for iOS written in swift for these application</blockquote><p>I could do that because <strong>ChatGPT</strong> utilizes the context of the conversation. 
This implies that:</p><blockquote>In the realm of natural language processing (NLP), the term “context window” refers to the amount of preceding text that a language model can consider when generating a response. For ChatGPT, this context window is measured in tokens, which can be as short as one character or as long as one word.</blockquote><blockquote>The context window plays a pivotal role in shaping the conversation with a chatbot. It’s like a sliding window that moves with each new message, always keeping track of the most recent tokens up to its maximum size.</blockquote><p>Check <a href="https://ecoagi.ai/topics/ChatGPT/chatgpt-context-window">this article</a> for more info.</p><p>Note: This project is an experiment in which my goal was to utilize as much of the generated code as possible. While I did perform some refactoring, my curiosity led me to explore the outcomes of using the code in its original form.</p><p>🚨 It’s important to note that <strong>AI</strong> may not always provide the correct solution, as it strives to offer responses regardless of potential errors. These errors may be glaringly evident, causing code to fail to compile. In other cases, issues may not be immediately apparent. Therefore, it’s crucial to exercise caution and thoroughly comprehend the code you integrate into your application. 🚨</p><p>It’s time to proceed with the response:</p><pre>import UIKit<br><br>class HomeViewController: UIViewController {<br>    // Add outlets to display water intake status and progress on the UI<br><br>    override func viewDidLoad() {<br>        super.viewDidLoad()<br>        // Load user&#39;s water intake status and progress from data storage<br>        updateWaterIntakeStatus()<br>    }<br><br>    // Function to update the UI with water intake status<br>    func updateWaterIntakeStatus() {<br>        // Retrieve and display the current water intake status and progress on the UI<br>    }<br><br>    @IBAction func logWaterButtonTapped(_ sender: UIButton) {<br>        // Navigate to the Water Intake Log Screen<br>        let waterIntakeLogVC = WaterIntakeLogViewController()<br>        waterIntakeLogVC.delegate = self<br>        navigationController?.pushViewController(waterIntakeLogVC, animated: true)<br>    }<br>}<br><br>// Implement an extension to handle the delegate callback after logging water intake<br>extension HomeViewController: WaterIntakeLogDelegate {<br>    func didLogWaterIntake(amount: Int) {<br>        // Handle the logged water intake amount<br>        // Update the water intake status and progress on the UI<br>        updateWaterIntakeStatus()<br>    }<br>}<br><br>import UIKit<br><br>protocol WaterIntakeLogDelegate: AnyObject {<br>    func didLogWaterIntake(amount: Int)<br>}<br><br>class WaterIntakeLogViewController: UIViewController {<br>    weak var delegate: WaterIntakeLogDelegate?<br><br>    @IBAction func waterAmountButtonTapped(_ sender: UIButton) {<br>        // Get the water intake amount from the button title or tag<br>        let selectedAmount = Int(sender.title(for: .normal) ?? &quot;0&quot;) ?? 0<br><br>        // Call the delegate method to pass the logged water intake amount back to the HomeViewController<br>        delegate?.didLogWaterIntake(amount: selectedAmount)<br><br>        // Navigate back to the Home Screen<br>        navigationController?.popViewController(animated: true)<br>    }<br>}<br>Note: This code assumes that you have set up the UI elements (buttons, labels, etc.) in your storyboard or programmatically. 
Additionally, it&#39;s a simplified example, and you might need to adapt it to fit the specific structure and architecture of your iOS app.</pre><p>My apologies, I intended to request the code in SwiftUI.</p><blockquote>Generate the code for iOS written in swiftUI for these application</blockquote><pre><br>import SwiftUI<br><br>struct HomeView: View {<br>    @State private var waterIntakeStatus = 0 // Initialize with user&#39;s current water intake status<br>    @State private var dailyGoal = 2000 // Set your preferred daily water intake goal<br><br>    var body: some View {<br>        VStack {<br>            Text(&quot;Water Intake: \(waterIntakeStatus) ml&quot;)<br>                .font(.headline)<br>            <br>            Text(&quot;Daily Goal: \(dailyGoal) ml&quot;)<br>                .font(.subheadline)<br>            <br>            Button(&quot;Log Water Intake&quot;, action: showWaterIntakeLogScreen)<br>                .padding()<br>        }<br>    }<br><br>    func showWaterIntakeLogScreen() {<br>        let waterIntakeLogView = WaterIntakeLogView(delegate: self)<br>        // You can present this view in a navigation stack or as a sheet, depending on your app&#39;s navigation structure.<br>        // For example, using NavigationView:<br>        NavigationView {<br>            waterIntakeLogView<br>        }<br>    }<br>}<br><br>extension HomeView: WaterIntakeLogDelegate {<br>    func didLogWaterIntake(amount: Int) {<br>        // Update the water intake status and progress<br>        waterIntakeStatus += amount<br>        // Here, you can also update your data storage to save the new water intake status.<br>    }<br>}<br><br>import SwiftUI<br><br>protocol WaterIntakeLogDelegate: AnyObject {<br>    func didLogWaterIntake(amount: Int)<br>}<br><br>struct WaterIntakeLogView: View {<br>    weak var delegate: WaterIntakeLogDelegate?<br><br>    let waterAmounts = [100, 250, 500, 750, 1000] // Predefined water quantity options<br><br>    var body: some View {<br>        VStack {<br>            Text(&quot;Select Water Intake Amount&quot;)<br>                .font(.headline)<br>            <br>            ForEach(waterAmounts, id: \.self) { amount in<br>                Button(&quot;\(amount) ml&quot;, action: { logWaterIntake(amount: amount) })<br>                    .padding()<br>            }<br>        }<br>    }<br><br>    func logWaterIntake(amount: Int) {<br>        delegate?.didLogWaterIntake(amount: amount)<br>    }<br>}</pre><p>That’s better.</p><p>I passed the code to Xcode and:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*38XbbzXtfFtbDCtP12TRng.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-ZUA1oTKdSRk-0oBXe7Ozg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-FDFb4B0XszzMqskHha2lA.png" /></figure><p>A promising beginning on this lengthy journey! However, I couldn’t help but notice the absence of ViewModels in the current implementation. 
Rather than attempting to modify the code independently, though:</p><blockquote>Separate the logic for these code in the ViewModels</blockquote><pre>import SwiftUI<br><br>class HomeViewModel: ObservableObject {<br>    @Published var waterIntakeStatus = 0 // Initialize with user&#39;s current water intake status<br>    @Published var dailyGoal = 2000 // Set your preferred daily water intake goal<br><br>    func logWaterIntake(amount: Int) {<br>        waterIntakeStatus += amount<br>        // Here, you can also update your data storage to save the new water intake status.<br>    }<br>}<br><br>import SwiftUI<br><br>class WaterIntakeLogViewModel: ObservableObject {<br>    weak var delegate: WaterIntakeLogDelegate?<br><br>    let waterAmounts = [100, 250, 500, 750, 1000] // Predefined water quantity options<br><br>    func logWaterIntake(amount: Int) {<br>        delegate?.didLogWaterIntake(amount: amount)<br>    }<br>}<br><br>import SwiftUI<br><br>struct HomeView: View {<br>    @StateObject private var viewModel = HomeViewModel()<br><br>    var body: some View {<br>        VStack {<br>            Text(&quot;Water Intake: \(viewModel.waterIntakeStatus) ml&quot;)<br>                .font(.headline)<br>            <br>            Text(&quot;Daily Goal: \(viewModel.dailyGoal) ml&quot;)<br>                .font(.subheadline)<br>            <br>            Button(&quot;Log Water Intake&quot;, action: { showWaterIntakeLogScreen() })<br>                .padding()<br>        }<br>        .sheet(isPresented: $showWaterIntakeLogView) {<br>            WaterIntakeLogView(viewModel: WaterIntakeLogViewModel(delegate: viewModel))<br>        }<br>    }<br><br>    @State private var showWaterIntakeLogView = false<br><br>    func showWaterIntakeLogScreen() {<br>        showWaterIntakeLogView = true<br>    }<br>}<br><br>struct WaterIntakeLogView: View {<br>    @ObservedObject var viewModel: WaterIntakeLogViewModel<br><br>    var body: some View {<br>        VStack {<br>            Text(&quot;Select Water Intake Amount&quot;)<br>                .font(.headline)<br><br>            ForEach(viewModel.waterAmounts, id: \.self) { amount in<br>                Button(&quot;\(amount) ml&quot;, action: { viewModel.logWaterIntake(amount: amount) })<br>                    .padding()<br>            }<br>        }<br>    }<br>}</pre><p>When I pasted the code into Xcode, I encountered a minor issue. The ‘WaterIntakeLogDelegate’ was missing, even though it was present in the previous version of the code. Instead of manually copying and pasting it, I opted to communicate the issue to <strong>ChatGPT</strong>:</p><blockquote>WaterIntakeLogDelegate is missing</blockquote><p>And among other things I got:</p><pre>// WaterIntakeLogDelegate Protocol<br>protocol WaterIntakeLogDelegate: AnyObject {<br>    func didLogWaterIntake(amount: Int)<br>}</pre><p>The missing element was the ‘WaterIntakeLogDelegate.’ While this code snippet had been included in previous samples, it was notably absent in this instance. Fortunately, this discrepancy was swiftly addressed.</p><h3>The Design</h3><p>I have the first iteration of my app, just a very basic — yet functional — piece of code. A good start. 
But before going further, I would like to switch roles again and prepare a prototype so I know what I want to create as a developer later on.</p><p>I was quickly exploring various prompts, and when I found inspiration and clarity regarding my objectives, I formulated the following prompt:</p><p><strong>Note</strong>: Using the same prompt alone won’t result in the same image because:</p><blockquote>The Midjourney bot uses a seed number to create a field of visual noise, like television static, as a starting point to generate the initial image grids. Seed numbers are generated randomly for each image but can be specified with the --seed parameter. If you use the same seed number and prompt, you will get similar final images.</blockquote><p>Seed documentation in <a href="https://docs.midjourney.com/docs/seeds">midjourney</a> and <a href="https://getimg.ai/guides/guide-to-seed-parameter-in-stable-diffusion">stable diffusion xl</a></p><p>The prompt I used:</p><blockquote>plan the app design for iOS mobile app, iPhone, for water drinking app with landscape in the background which fades out when user is drinking more and more water</blockquote><p>Let’s see what generated <strong>Stable Diffusion XL</strong> first:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*abyuE8bD6I6NKkH4k1N6lQ.jpeg" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Z64bJpEvpAuGT45hm5UbjQ.jpeg" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8W8KLsmL-4qX0DnobvqK1A.jpeg" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*qu8et-R1iEApXvoaFuHDqw.jpeg" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*TQPIqH37g5rf2dRnbE5iGw.jpeg" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*E4QwAFmfhnnXqsEPyIO4fA.jpeg" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*F1GdyLUprNLPL-AeZnHXvw.jpeg" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*qBZuFhGMRuZjRtH1H1La7A.jpeg" /></figure><p>And now <strong>Midjourney</strong>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*e8dtglyn_z8dR7b5mb8ZsA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*FBMdMLQ26FKnaf9vohuyTQ.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*rfZxH-HzNA5IvdEITKbXUQ.png" /></figure><p>I found <strong>Midjourney’s</strong> suggestions to be slightly more captivating.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*CYRyrJSK0C__1JcFY6sOvg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Ugvc2EmuNVH6k5uaFxgr0w.png" /></figure><p>I plan to strive for the combination of both of these aspects.</p><p>Now, let’s enjoy the creative process and design the application icon:</p><blockquote>drink water everyday app ios icon, colorfull, show exotic places</blockquote><p>Again I will start from <strong>Stable Diffusion XL</strong>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*LEg1D0VxBNdxXM3m_vcSPQ.png" /></figure><p>And now <strong>Midjourney</strong>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*EX8pAmUHgu8XdBZstZWUyA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*M_iGk0O33W69POEHCF90uA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*BBMHS_3yam-Vq4_8FrHimg.png" /></figure><figure><img 
alt="" src="https://cdn-images-1.medium.com/max/1024/1*EkWWr7lLdhK8BQ-jDEHtmQ.png" /></figure><p><strong>Midjourney</strong> wins again:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*s0lL-xvoFEMJiyaYg3C0vw.png" /></figure><p>I have a clear vision for the UI direction I want to pursue, and I’ve also created the icon. Based on my past experiences, having these elements in place early on in the process is advantageous. And it looks nice.</p><h3>The Data</h3><p>Currently we do not store anything. When the app is closed it just vanishes and start over next time. This seems to be something both AI forgot. Maybe they don’t have this problem because all of this is stored by default. Since storing data is quite important for our app I asked <strong>ChatGPT</strong>:</p><blockquote>Add code to store the values of how much user drank each day in userdefaults</blockquote><p>Response:</p><pre>import Foundation<br><br>class DataManager: ObservableObject {<br>    static let shared = DataManager()<br><br>    private let userDefaults: UserDefaults<br><br>    private init() {<br>        #if DEBUG<br>        userDefaults = UserDefaults(suiteName: &quot;com.example.app.debug&quot;)!<br>        #else<br>        userDefaults = UserDefaults.standard<br>        #endif<br>    }<br><br>    func saveWaterIntake(_ intakeAmount: Int, forDate date: Date) {<br>        var data = loadWaterIntakeData()<br>        data[date] = intakeAmount<br>        saveWaterIntakeData(data)<br>    }<br><br>    func loadWaterIntakeData() -&gt; [Date: Int] {<br>        guard let data = userDefaults.dictionary(forKey: &quot;WaterIntakeData&quot;) as? [String: Int] else {<br>            return [:]<br>        }<br><br>        var result = [Date: Int]()<br>        for (key, value) in data {<br>            if let date = DateFormatterManager.shared.dateFormatter.date(from: key) {<br>                result[date] = value<br>            }<br>        }<br>        return result<br>    }<br><br>    private func saveWaterIntakeData(_ data: [Date: Int]) {<br>        var saveData = [String: Int]()<br>        for (key, value) in data {<br>            let dateString = DateFormatterManager.shared.dateFormatter.string(from: key)<br>            saveData[dateString] = value<br>        }<br>        userDefaults.set(saveData, forKey: &quot;WaterIntakeData&quot;)<br>    }<br>}<br><br>class DateFormatterManager {<br>    static let shared = DateFormatterManager()<br><br>    private init() {}<br><br>    lazy var dateFormatter: DateFormatter = {<br>        let formatter = DateFormatter()<br>        formatter.dateFormat = &quot;yyyy-MM-dd&quot;<br>        return formatter<br>    }()<br>}<br><br>import SwiftUI<br><br>class HomeViewModel: ObservableObject {<br>    @Published var waterIntakeStatus = 0 // Initialize with user&#39;s current water intake status<br>    @Published var dailyGoal = 2000 // Set your preferred daily water intake goal<br><br>// ... (previously defined code)<br><br>    func logWaterIntake(amount: Int) {<br>        waterIntakeStatus += amount<br>        DataManager.shared.saveWaterIntake(amount, forDate: Date())<br>    }<br>}<br><br>import SwiftUI<br><br>struct HomeView: View {<br>    @EnvironmentObject private var viewModel: HomeViewModel<br>  <br>// ... 
(previously defined code)<br><br>}</pre><p><strong>Note</strong>: I’m only providing the code section and not the entire response, as typically <strong>ChatGPT</strong> also supplies context, instructions on how to use the code, and additional information.</p><p>Overall, it’s quite impressive. To be completely honest, there are a few areas I might consider for improvement, but it’s satisfactory, and the speed is exceptional. I just noticed a small snippet that I believe I’ll require, so I requested it:</p><blockquote>Generate water intake getter function but for particular dat not all of them</blockquote><p>Response:</p><pre>import Foundation<br><br>class DataManager: ObservableObject {<br>    // ... (previously defined code)<br><br>    func getWaterIntake(forDate date: Date) -&gt; Int? {<br>        let waterIntakeData = loadWaterIntakeData()<br>        return waterIntakeData[date]<br>    }<br>}<br><br>import SwiftUI<br><br>class HomeViewModel: ObservableObject {<br>    // ... (previously defined code)<br><br>    func logWaterIntake(amount: Int) {<br>        waterIntakeStatus += amount<br>        DataManager.shared.saveWaterIntake(amount, forDate: Date())<br>    }<br><br>    func waterIntake(forDate date: Date) -&gt; Int? {<br>        return DataManager.shared.getWaterIntake(forDate: date)<br>    }<br>}<br><br>import SwiftUI<br><br>struct HomeView: View {<br>    @EnvironmentObject private var viewModel: HomeViewModel<br>    @State private var selectedDate = Date()<br><br>    var body: some View {<br>        VStack {<br>            DatePicker(&quot;Select Date&quot;, selection: $selectedDate, displayedComponents: .date)<br>                .padding()<br><br>            if let waterIntake = viewModel.waterIntake(forDate: selectedDate) {<br>                Text(&quot;Water Intake on \(selectedDate, formatter: DateFormatterManager.shared.dateFormatter): \(waterIntake) ml&quot;)<br>            } else {<br>                Text(&quot;No water intake recorded for \(selectedDate, formatter: DateFormatterManager.shared.dateFormatter)&quot;)<br>            }<br><br>            // ... (other UI elements)<br>        }<br>    }<br>}</pre><p>From now on I will refrain from sharing all the specific details of each interaction with <strong>ChatGPT</strong>. Working on this application led to numerous discussions with <strong>ChatGPT</strong>, and at times, it was more challenging for me to articulate my requirements effectively than to write the code myself.</p><p>This experiment reinforced a principle I was already familiar with: the quality of responses is heavily influenced by the context and prompts provided. It serves as a reminder that well-crafted context and prompts significantly enhance the quality of the generated responses.</p><h3>Add The Log Water Intake Screen</h3><blockquote>Intuitive Water Entry:<br>Easily log your water consumption with a quick and straightforward interface.</blockquote><p>By <strong>ChatGPT</strong> acting as an app designer</p><p>I initially had a basic water intake screen connected to the home screen. However, I wanted to display the water intake using two rows of glasses. Despite my efforts, I struggled to effectively convey this concept to <strong>ChatGPT</strong>. Consequently, I took it upon myself to update the user interface.</p><p>Once I had the code in place, my next requirement was images for the glasses. In this endeavor, <strong>Midjourney</strong> emerged as the preferred choice and provided the needed assets.</p><p>The process wasn’t without its challenges. 
In fact, the provided samples represent less than half of the images I generated in my quest to find the matching ones.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*aOprwI8IyXXrFl0S54lTcw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*TU5yn-min10IbM38OzrMSw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*F8jihLNvnCU5ReX3JwXmbA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9jwf7VYAW0SRQRByoLHZeA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*4N6eglszHc9pKmxAn5bJ0A.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*UPPPPC4rJCbWnduxkeEaRg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*LnN8YOwzmRKSeay2keXPvg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*wYqLB74lMLETb2nPoKIWpA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*nM93uLNa4Ua8om7gjJwSiQ.png" /></figure><p>Now, let’s explore the final screen before and after:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-ZUA1oTKdSRk-0oBXe7Ozg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*oCtmLdSzutSQGEJrVrI3_Q.png" /></figure><h3>Add The Settings Screen</h3><blockquote>Customizable Daily Goal:<br>Choose your preferred daily water intake goal.<br>Automatic Weight-Based Calculation:<br>Optionally, provide your weight, and the app will calculate your recommended daily water intake based on it.</blockquote><p>By <strong>ChatGPT</strong> acting as app designer</p><p>The “Settings” component is a pivotal element of this puzzle. I opted not to connect the app to health data at this stage, choosing instead to allow users to manually input the information required for water intake calculation. In this approach, users provide the necessary data, and the app takes care of calculating and storing the suggested water intake.</p><blockquote>Create a code for settings screen, on this screen user will calculate water intake needed for his body</blockquote><pre>import SwiftUI<br><br>struct SettingsView: View {<br>    @State private var weight: String = &quot;&quot;<br>    @State private var waterIntake: Double = 0.0<br><br>    var body: some View {<br>        NavigationView {<br>            Form {<br>                Section(header: Text(&quot;Calculate Needed Water Intake&quot;)) {<br>                    TextField(&quot;Enter your weight (kg)&quot;, text: $weight)<br>                        .keyboardType(.decimalPad)<br>                    <br>                    Button(&quot;Calculate&quot;) {<br>                        calculateWaterIntake()<br>                    }<br>                    <br>                    Text(&quot;Recommended Water Intake: \(waterIntake, specifier: &quot;%.1f&quot;) ml&quot;)<br>                        .foregroundColor(.accentColor)<br>                }<br>            }<br>            .navigationBarTitle(&quot;Settings&quot;)<br>        }<br>    }<br>    <br>    func calculateWaterIntake() {<br>        if let weightValue = Double(weight) {<br>            // Calculate water intake based on weight (example formula)<br>            waterIntake = weightValue * 30.0 // Adjust the formula as needed<br>        } else {<br>            waterIntake = 0.0<br>        }<br>    }<br>}</pre><p>This was the first from the many prompts I used to produce the “Settings” screen. 
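</p><p>One thing the snippet above does not show is how the calculated goal would get back into the rest of the app. The sketch below is my own addition rather than ChatGPT output; the ‘DailyGoalStore’ type and its UserDefaults key are hypothetical names I picked. It illustrates how the Settings screen could persist the suggested intake so the home screen can read it later:</p><pre>import Foundation<br><br>// Hypothetical helper (my naming, not from the generated code) that persists<br>// the calculated daily goal so other screens can read it back.<br>struct DailyGoalStore {<br>    // Assumed key name; pick whatever fits your app.<br>    private static let key = &quot;SuggestedDailyGoal&quot;<br><br>    static func save(_ milliliters: Int) {<br>        UserDefaults.standard.set(milliliters, forKey: key)<br>    }<br><br>    static func load() -&gt; Int {<br>        let stored = UserDefaults.standard.integer(forKey: key)<br>        return stored &gt; 0 ? stored : 2000 // fall back to the default goal used earlier<br>    }<br>}<br><br>// Example usage after calculateWaterIntake() runs in SettingsView:<br>// DailyGoalStore.save(Int(waterIntake))<br>// The home screen could then start from DailyGoalStore.load()</pre><p>Decisions like this one still fell to me rather than to the model.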
Piece by piece I was completing the puzzle.</p><p>I knew that there is a way to calculate suggested water intake. So I asked <strong>ChatGPT</strong> for it:</p><blockquote>Make me an equation how much water should a person drink</blockquote><p>Response:</p><blockquote>Sure, here’s a simple equation that you can use as a starting point to estimate daily water intake for adults. Keep in mind that individual hydration needs can vary based on factors such as age, weight, activity level, climate, and health conditions.</blockquote><blockquote>Water Intake (in milliliters) = Body Weight (in kilograms) * Hydration Factor</blockquote><blockquote>Where:</blockquote><blockquote>Body Weight is the person’s weight in kilograms.</blockquote><blockquote>Hydration Factor is a value that you can adjust based on individual factors. A common range is between 30–40 mL of water per kilogram of body weight.</blockquote><blockquote>For example, if someone weighs 70 kilograms and you use a hydration factor of 35 mL/kg:</blockquote><blockquote>Water Intake = 70 kg * 35 mL/kg = 2450 mL (approximately 2.45 liters)</blockquote><blockquote>Remember that this is a general guideline, and it’s important to consider other factors like activity level, climate, and personal preferences. Always consult with a healthcare professional or registered dietitian for personalized recommendations on water intake, especially if you have specific health conditions or dietary needs.</blockquote><p>After conducting tests on various water drinking apps and observing that many of them incorporated gender as a parameter in their intake calculation I did the same:</p><blockquote>Add gender parameter to the equation</blockquote><p>Response:</p><blockquote>Certainly, adding a gender parameter to the equation can help adjust the estimated water intake based on differences in body composition and physiological needs between genders. Here’s the equation with a gender parameter:</blockquote><blockquote>Water Intake (in milliliters) = Body Weight (in kilograms) * Gender Factor * Hydration Factor</blockquote><blockquote>Where:</blockquote><blockquote>Body Weight is the person’s weight in kilograms.</blockquote><blockquote>Gender Factor is a coefficient that considers the difference between genders. This factor can be adjusted based on the individual’s gender.</blockquote><blockquote>Hydration Factor is the amount of water per kilogram of body weight.</blockquote><blockquote>For example, if someone weighs 70 kilograms, identifies as male, and you use a hydration factor of 35 mL/kg:</blockquote><blockquote>If Gender Factor for males = 1: Water Intake = 70 kg * 1 * 35 mL/kg = 2450 mL (approximately 2.45 liters)</blockquote><blockquote>If Gender Factor for females = 0.8: Water Intake = 70 kg * 0.8 * 35 mL/kg = 1960 mL (approximately 1.96 liters)</blockquote><blockquote>Remember that the gender factor can vary based on individual differences and may not be a fixed number. Factors such as muscle mass, metabolism, and overall health can influence water needs. 
It’s a good idea to consult with a healthcare professional or registered dietitian for personalized recommendations, especially if you have specific health conditions or dietary needs.</blockquote><p>I had the basic requirements crystalised now and could start building the UI:</p><blockquote>Include gender in SettingsView for calculating suggested water intake</blockquote><p>Response:</p><pre>struct SettingsView: View {<br>    @State private var weight: Double = 70<br>    @State private var selectedGender: Gender = .male<br><br>    enum Gender {<br>        case male<br>        case female<br>    }<br><br>    var suggestedWaterIntake: Double {<br>        let hydrationFactor: Double = 35 // You can adjust this value<br>        let genderFactor: Double = selectedGender == .male ? 1.0 : 0.8<br><br>        return weight * genderFactor * hydrationFactor<br>    }<br><br>    var body: some View {<br>        VStack {<br>            Text(&quot;Calculate Suggested Water Intake&quot;)<br>                .font(.title)<br><br>            Picker(&quot;Select Gender&quot;, selection: $selectedGender) {<br>                Text(&quot;Male&quot;).tag(Gender.male)<br>                Text(&quot;Female&quot;).tag(Gender.female)<br>            }<br>            .pickerStyle(SegmentedPickerStyle())<br>            .padding()<br><br>            Slider(value: $weight, in: 1...150, step: 1) {<br>                Text(&quot;Weight: \(Int(weight)) kg&quot;)<br>            }<br>            .padding()<br><br>            Text(&quot;Suggested Water Intake: \(Int(suggestedWaterIntake)) mL&quot;)<br>                .font(.headline)<br>                .foregroundColor(.accentColor)<br>                .padding()<br>        }<br>    }<br>}</pre><p>Now, I needed to update the weight support functionality:</p><blockquote>include body weight</blockquote><p>Response:</p><pre>struct SettingsView: View {<br>    @State private var weightKg: Double = 70<br>    @State private var weightLbs: Double = 154 // Default value for 70 kg<br><br>    @State private var selectedGender: Gender = .male<br><br>    enum Gender {<br>        case male<br>        case female<br>    }<br><br>    var suggestedWaterIntake: Double {<br>        let hydrationFactor: Double = 35 // You can adjust this value<br>        let genderFactor: Double = selectedGender == .male ? 1.0 : 0.8<br>        let weightInKg = selectedUnit == .kg ? 
weightKg : weightLbs * 0.453592 // Convert lbs to kg<br><br>        return weightInKg * genderFactor * hydrationFactor<br>    }<br><br>    var body: some View {<br>        VStack {<br>            Text(&quot;Calculate Suggested Water Intake&quot;)<br>                .font(.title)<br><br>            Picker(&quot;Select Gender&quot;, selection: $selectedGender) {<br>                Text(&quot;Male&quot;).tag(Gender.male)<br>                Text(&quot;Female&quot;).tag(Gender.female)<br>            }<br>            .pickerStyle(SegmentedPickerStyle())<br>            .padding()<br><br>            Picker(&quot;Select Unit&quot;, selection: $selectedUnit) {<br>                Text(&quot;kg&quot;).tag(Unit.kg)<br>                Text(&quot;lbs&quot;).tag(Unit.lbs)<br>            }<br>            .pickerStyle(SegmentedPickerStyle())<br>            .padding()<br><br>            if selectedUnit == .kg {<br>                Slider(value: $weightKg, in: 1...150, step: 1) {<br>                    Text(&quot;Weight: \(Int(weightKg)) kg&quot;)<br>                }<br>            } else {<br>                Slider(value: $weightLbs, in: 1...330, step: 1) {<br>                    Text(&quot;Weight: \(Int(weightLbs)) lbs&quot;)<br>                }<br>            }<br>            .padding()<br><br>            Text(&quot;Suggested Water Intake: \(Int(suggestedWaterIntake)) mL&quot;)<br>                .font(.headline)<br>                .foregroundColor(.accentColor)<br>                .padding()<br>        }<br>    }<br><br>    enum Unit {<br>        case kg<br>        case lbs<br>    }<br><br>    @State private var selectedUnit: Unit = .kg<br>}</pre><p>After a few more iterations, I arrived at the following result:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*S3cBJY2eblgQvfS30jTqdg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*k7LwyfPJrtPUHFx0i6dsyQ.png" /></figure><h3>The Home Screen</h3><p>I made numerous requests to <strong>ChatGPT</strong> to shape the UI. While some responses were helpful and most were functional, they often didn’t align precisely with my intended design. I recognized that this particular screen involved a significant amount of graphics, which may have contributed to the challenges I encountered. <strong>ChatGPT</strong> tends to excel in generating code, but describing visual aspects precisely can be a more complex task.</p><p>Ultimately, I ended up writing most of the UI code myself, with <strong>ChatGPT</strong> assisting in specific cases, such as creating custom shapes or animations. The AI played a valuable role in providing code-building-blocks for me to incorporate.</p><p>Now, here’s how the home screen appears. 
The image becomes less blurred as the user consumes water:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*KJS9i1YV0lEkxoBd1plgPg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*pscBLGetzkiqjS3bq5LIKw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*3DISTYuYvCw7Ax9ByUur3g.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*fLHEEv__1bjEAgp7IJqG7w.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*YBYU_KieGaJXHW7-2g9nAA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*PhfUz0OiGzGVFtVXbul0OQ.png" /></figure><h3>The Colors</h3><p>In my quest to choose a color palette for the app, I sought assistance from my AI designer, opting for a more personalized approach rather than relying on online sources as I typically would:</p><blockquote>What main color would fit water drinking app with random nature images in the background?</blockquote><p>And I got:</p><blockquote>Aqua Blue or Turquoise: Shades of aqua blue or turquoise are often associated with water and ocean themes. They convey a sense of freshness, purity, and tranquility.</blockquote><blockquote>Leaf Green: Green is commonly associated with nature and health. A leafy green shade can convey the idea of natural hydration and wellness.</blockquote><blockquote>Sky Blue: Light blue shades reminiscent of the sky can evoke a feeling of refreshment and openness. They pair well with nature imagery.</blockquote><blockquote>Soft Teal: A soft teal color combines elements of blue and green, creating a soothing and calming hue that aligns with nature.</blockquote><blockquote>Cool Gray: A cool gray color can provide a neutral backdrop that complements the random nature images while giving a modern and clean look.</blockquote><blockquote>Mint Green: Mint green is a cool and refreshing color that can tie into both the water theme and the natural imagery.</blockquote><p>I made a mistake that fortunately could be easily corrected:</p><blockquote>Give me rgb of these colors</blockquote><p>Which finally produced something I could use:</p><blockquote>Aqua Blue / Turquoise: RGB: (0, 175, 230) or #00AFE6</blockquote><blockquote>Leaf Green: RGB: (46, 139, 87) or #2E8B57</blockquote><blockquote>Sky Blue: RGB: (135, 206, 235) or #87CEEB</blockquote><blockquote>Soft Teal: RGB: (90, 190, 175) or #5ABEAF</blockquote><blockquote>Cool Gray: RGB: (169, 169, 169) or #A9A9A9</blockquote><blockquote>Mint Green: RGB: (152, 255, 152) or #98FF98</blockquote><p>In the context of this simple app, I only required the main color for the Minimum Viable Product (MVP). However, it’s worth noting that <strong>ChatGPT</strong> possesses the capability to handle much more complex tasks and provide comprehensive solutions across a wide range of software development needs. Its potential extends far beyond basic color selection.</p><h3>The Notifications</h3><p>To implement reminders for the user to drink water, I decided to utilize local notifications. 
This time, I turned to my AI developer and requested:</p><blockquote>Generate the code to schedule 10 local notifications for drinking water</blockquote><p>Response:</p><pre>func scheduleNotifications() {<br>        let center = UNUserNotificationCenter.current()<br>        <br>        // Request permission for notifications<br>        center.requestAuthorization(options: [.alert, .badge, .sound]) { granted, error in<br>            if granted {<br>                for i in 1...10 {<br>                    let content = UNMutableNotificationContent()<br>                    content.title = &quot;Drink Water!&quot;<br>                    content.body = &quot;It&#39;s time to drink water.&quot;<br>                    content.sound = .default<br>                    <br>                    let trigger = UNTimeIntervalNotificationTrigger(timeInterval: TimeInterval(i * 3600), repeats: false)<br>                    <br>                    let request = UNNotificationRequest(identifier: &quot;water\(i)&quot;, content: content, trigger: trigger)<br>                    center.add(request) { error in<br>                        if let error = error {<br>                            print(&quot;Error scheduling notification: \(error)&quot;)<br>                        }<br>                    }<br>                }<br>            } else if let error = error {<br>                print(&quot;Error requesting authorization: \(error)&quot;)<br>            }<br>        }<br>    }</pre><p>I realized that I hadn’t formulated the request correctly.</p><blockquote>Schedule drinking water each hour from 10 to 20, for the whole wheek</blockquote><p>Response:</p><pre>func scheduleNotifications() {<br>        let center = UNUserNotificationCenter.current()<br>        <br>        // Request permission for notifications<br>        center.requestAuthorization(options: [.alert, .badge, .sound]) { granted, error in<br>            if granted {<br>                let daysOfWeek: [Int] = Array(1...7) // 1 is Sunday, 2 is Monday, and so on<br>                <br>                for day in daysOfWeek {<br>                    for hour in 10...20 {<br>                        let content = UNMutableNotificationContent()<br>                        content.title = &quot;Drink Water!&quot;<br>                        content.body = &quot;It&#39;s time to drink water.&quot;<br>                        content.sound = .default<br>                        <br>                        var dateComponents = DateComponents()<br>                        dateComponents.weekday = day<br>                        dateComponents.hour = hour<br>                        dateComponents.minute = 0 // Set the minute to 0<br>                        <br>                        let trigger = UNCalendarNotificationTrigger(dateMatching: dateComponents, repeats: true)<br>                        <br>                        let request = UNNotificationRequest(identifier: &quot;water\(day)\(hour)&quot;, content: content, trigger: trigger)<br>                        center.add(request) { error in<br>                            if let error = error {<br>                                print(&quot;Error scheduling notification: \(error)&quot;)<br>                            }<br>                        }<br>                    }<br>                }<br>            } else if let error = error {<br>                print(&quot;Error requesting authorization: \(error)&quot;)<br>            }<br>        }<br>    }</pre><p>Finally, I wanted to have more appealing text for the 
notifications:</p><blockquote>Generate title and subtitle for notification for drinking water app</blockquote><p>Response:</p><pre>func generateNotificationTitle() -&gt; String {<br>    let titles = [&quot;Stay Hydrated!&quot;, &quot;Time to Drink!&quot;, &quot;Stay Healthy!&quot;, &quot;Hydration Reminder&quot;]<br>    return titles.randomElement() ?? &quot;Drink Water!&quot;<br>}<br><br>func generateNotificationSubtitle() -&gt; String {<br>    let subtitles = [&quot;Your body needs water.&quot;, &quot;Stay refreshed and hydrated.&quot;, &quot;Keep up the good work!&quot;, &quot;Don&#39;t forget to drink.&quot;]<br>    return subtitles.randomElement() ?? &quot;It&#39;s time to drink water.&quot;<br>}</pre><h3>The Tests</h3><p>Writing the code is just the beginning of a software project. To ensure reliability and maintainability, thorough testing is crucial. This includes unit tests, integration tests, and more. Additionally, testing helps in safely adding new code, making refactors, and collaborating with other team members on the same codebase.</p><p><strong>ChatGPT</strong> can assist in creating test cases, testing strategies, and even explaining how to conduct various types of testing. It’s a valuable resource in the software development process beyond just code generation.</p><blockquote>Write unit tests for *I pasted the code here*</blockquote><p>Result:</p><pre>class SettingsViewModelTests: XCTestCase {<br><br>    // Create a mock DataManager for testing<br>    class MockDataManager: DataManager {<br>        // Implement mock functions used in your viewModel<br>        // You can override save and load methods to store and retrieve values in memory for testing<br>        // For simplicity, we&#39;ll just store values in memory using variables<br>        var settingsInfo: SettingsInfo?<br>        var requiredWaterIntake: Int?<br><br>        override func save(requiredWaterIntake: Int) {<br>            self.requiredWaterIntake = requiredWaterIntake<br>        }<br><br>        override func loadSettingsInfo() -&gt; SettingsInfo? 
{<br>            return settingsInfo<br>        }<br>    }<br><br>    // Create an instance of SettingsViewModel with the mock DataManager<br>    var viewModel: SettingsViewModel!<br>    var mockDataManager: MockDataManager!<br><br>    override func setUp() {<br>        super.setUp()<br>        mockDataManager = MockDataManager()<br>        viewModel = SettingsViewModel(dataManager: mockDataManager)<br>    }<br><br>    override func tearDown() {<br>        viewModel = nil<br>        mockDataManager = nil<br>        super.tearDown()<br>    }<br><br>    func testSuggestedWaterIntakeCalculation() {<br>        // Set up your viewModel with test data<br>        viewModel.selectedGender = .male<br>        viewModel.selectedUnit = .kg<br>        viewModel.weightKg = 70<br><br>        // Calculate the suggested water intake<br>        let suggestedWaterIntake = viewModel.suggestedWaterIntake<br><br>        // Assert that the calculation is correct based on your formula<br>        XCTAssertEqual(suggestedWaterIntake, 70 * Int(1.0) * 35) // Adjust the expected value based on your formula<br>    }<br><br>    func testSuggestedWaterIntakeSaved() {<br>        // Set up your viewModel with test data<br>        viewModel.selectedGender = .female<br>        viewModel.selectedUnit = .lbs<br>        viewModel.weightLbs = 154<br><br>        // Calculate the suggested water intake, which should also save it through the mock DataManager<br>        _ = viewModel.suggestedWaterIntake<br><br>        // Assert that the required water intake was saved<br>        XCTAssertEqual(mockDataManager.requiredWaterIntake, Int(154 * 0.453592 * 0.8 * 35))<br>    }<br>    <br>    // Add more test cases as needed<br>}</pre><p>While it may not be ideal, it’s undeniable that <strong>ChatGPT</strong> provides a valuable starting point, generating code swiftly within seconds. This can significantly expedite the development process and serve as a foundation for further refinement and optimization.</p><h3>The Conclusions</h3><p>AI indeed offers substantial benefits in the realm of business analysis, idea validation, feature planning, text generation, script preparation, and various text-related tasks.</p><p>For app development, AI proves highly valuable in expediting the initial phases by swiftly generating foundational code upon which to build.</p><p>Furthermore, AI excels in providing specific code segments, bug identification, and test case generation, with results improving when prompts are detailed and specific.</p><p>In terms of design, AI is exceptional at crafting prototype designs, app icons, and assets, although maintaining asset consistency and precise alignment with requirements can pose challenges. Nevertheless, the iterative process is notably rapid.</p><p>It’s important to emphasize that AI thrives on clear and detailed prompts to function efficiently. Even with this, a fundamental understanding is essential to effectively integrate all components into the development process.</p><p>I hold the belief that the synergy between a developer and AI surpasses the capabilities of an individual developer alone. Similarly, this holds true for designers and other roles. However, it’s important to underscore that AI remains a valuable <strong>tool</strong> at our disposal, augmenting our abilities and enabling us to achieve more efficient and effective results in various domains.</p><p><strong>Note</strong>: I primarily utilize <strong>ChatGPT</strong> for code-related tasks due to my familiarity with it. 
However, I also acknowledge the capabilities of <strong>Bard</strong>, which, despite a slower start, is a valuable tool that is connected to the internet by default (ChatGPT 3.5 is not…). I encourage you to explore both options to determine which best suits your needs.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=2aaffe8e9ec4" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Comparing images using the Vision framework]]></title>
            <link>https://medium.com/@kamil.tustanowski/comparing-images-using-the-vision-framework-ff13291901ff?source=rss-2c0314596f19------2</link>
            <guid isPermaLink="false">https://medium.com/p/ff13291901ff</guid>
            <dc:creator><![CDATA[Kamil Tustanowski]]></dc:creator>
            <pubDate>Sun, 23 Jul 2023 08:04:34 GMT</pubDate>
            <atom:updated>2023-07-23T08:04:34.875Z</atom:updated>
            <content:encoded><![CDATA[<p>Photography is one of my hobbies. There is something magical in taking pictures. It’s like stopping time at a particular moment. The only problem is that I’m no expert and not all of my photographs are great, if any, therefore often I take more than one photo. Just to be safe. “Being safe” in my case means I have a bunch of redundant pictures that serve no purpose. I should remove them but I always have a good reason not to. I have more interesting and important things to do or places to be.</p><p>Last time I was complaining that I’m terrible at selfies but managed to find a solution to my little problem thanks to technology. Check my <a href="https://medium.com/@kamil.tustanowski/detecting-face-capture-quality-using-the-vision-framework-e94af66f31e3">Detecting face capture quality using the Vision framework</a> for more details. This time is no different.</p><p>The Vision request I will introduce today is unique. VNGenerateImageFeaturePrintRequest behaves slightly different from the other requests. But we will get to that. For now, knowing that this inconspicuous request is able to compare images for us is enough. Let&#39;s begin with:</p><pre>import Vision</pre><p>Then create a request:</p><pre>let request = VNGenerateImageFeaturePrintRequest()</pre><p>And pass it to the request handler:</p><pre>guard let cgImage = image.cgImage else { return nil }<br>let request = VNGenerateImageFeaturePrintRequest()<br><br>let requestHandler = VNImageRequestHandler(cgImage: cgImage,<br>                                           orientation: .init(image.imageOrientation),<br>                                           options: [:])<br>do {<br>    try requestHandler.perform([request])<br>} catch {<br>    print(&quot;Can&#39;t make the request due to \(error)&quot;)<br>}</pre><p>This is explained in detail in my <a href="https://medium.com/@kamil.tustanowski/detecting-body-pose-using-vision-framework-caba5435796a">Detecting body pose using Vision framework</a> article.</p><p>As the result, we get a VNFeaturePrintObservation:</p><pre>guard let result = request.results?.first else { return nil }<br>return result</pre><p>The complete function:</p><pre>func process(_ image: UIImage) -&gt; VNFeaturePrintObservation? {<br>    guard let cgImage = image.cgImage else { return nil }<br>    let request = VNGenerateImageFeaturePrintRequest()<br>    <br>    let requestHandler = VNImageRequestHandler(cgImage: cgImage,<br>                                               orientation: .init(image.imageOrientation),<br>                                               options: [:])<br>    do {<br>        try requestHandler.perform([request])<br>    } catch {<br>        print(&quot;Can&#39;t make the request due to \(error)&quot;)<br>    }<br>    <br>    guard let result = request.results?.first else { return nil }<br>    return result<br>}</pre><p>Note: I’m working with this code in a playground and it makes things easier when the function is synchronous but the Vision requests will block the main thread of your application. Make sure to execute them i.e. on:</p><pre>private let visionQueue = DispatchQueue.global(qos: .userInitiated)</pre><p>But remember to return to the main queue to display the results.</p><p>And that’s it. The work is done.</p><p>This is the place where you should ask</p><blockquote>How it’s done? What about comparing the images?</blockquote><p>This is why this request is unique. VNFeaturePrintObservation holds the calculated feature print data. 
We can&#39;t parse it and use it to compare images in an easy way but the observation can do it for us. The way we can use it to compare images is to use observation to compute the distance between the feature prints.</p><p>To compare two images similarity we need to:</p><ul><li>Use the request on both of them to get the VNFeaturePrintObservation containing feature print data for each of them.</li><li>We use this function provided by the observations to compare the distance between the feature prints they hold:</li></ul><pre>open func computeDistance(_ outDistance: UnsafeMutablePointer&lt;Float&gt;, to featurePrint: VNFeaturePrintObservation) throws</pre><p>Notice the UnsafeMutablePointer&lt;Float&gt; which is our outDistance. Returning values is not the only way function can provide its result:</p><blockquote>You use instances of the UnsafeMutablePointer type to access data of a specific type in memory. The type of data that a pointer can access is the pointer’s Pointee type. UnsafeMutablePointer provides no automated memory management or alignment guarantees. You are responsible for handling the life cycle of any memory you work with through unsafe pointers to avoid leaks or undefined behavior.</blockquote><p><a href="https://developer.apple.com/documentation/swift/unsafemutablepointer">UnsafeMutablePointer Documentation</a></p><p>This is a pointer to a Float located somewhere in memory that our function will use to pass the result.</p><p>Pointers are not something we use in Swift on a daily basis. I remember pointers mainly from the C / C++ and... Objective-C days. You can find more information <a href="https://www.cplusplus.com/doc/tutorial/pointers/">here</a>.</p><p>We know now that with pointers we can access specific locations in the memory but this doesn’t explain what we need to do to get our distance.</p><p>First, we need to have something to compare. This code will produce observations for two images:</p><pre>let balloon1 = UIImage(named: &quot;balloon_1.jpg&quot;)!<br>let balloon1FeaturePrint = process(balloon1)!<br><br>let balloon2 = UIImage(named: &quot;balloon_2.jpg&quot;)!<br>let balloon2FeaturePrint = process(balloon2)!</pre><p>I use ! for simplicity. I don’t recommend this approach on production code.</p><p>Then we need to make a variable for our distance:</p><pre>var balloon1ToBallon2Distance: Float = .infinity</pre><p>It will hold the distance between feature prints of balloon_1 and balloon_2 images. The lesser the distance is the more the images are similar. Identical images will have a 0.0 distance between their features. That&#39;s why we use .infinity as the initial value.</p><p>Now it’s the time to face the UnsafeMutablePointer&lt;Float&gt;. We are used to providing parameters for functions but this is different. This time we need to provide a pointer which the function can later use to provide us the result. To do it we will use &amp; which in <a href="https://www.cplusplus.com/doc/tutorial/pointers/">C++ is called address-of operator</a>. Instead of providing balloon1ToBallon2Distance we will pass the &amp;balloon1ToBallon2Distance which is a pointer containing an address to a place in memory where this var is located:</p><pre>do {<br>    try balloon1FeaturePrint.computeDistance(&amp;balloon1ToBallon2Distance, to: balloon2FeaturePrint)<br>} catch {<br>    print(&quot;Couldn&#39;t compute the distance&quot;)<br>}</pre><p>And after the function is complete we will have the result in this variable. Pointers are powerful but also terrifying and dangerous. 
Behold their power:</p><pre>print(balloon1ToBallon2Distance)</pre><pre>11.10728</pre><p>This variable was .infinity a while ago and now it contains the distance between feature points of balloon_1 and balloon_2 images. They are pretty close which means they are similar.</p><p>See it for yourself. I made a few photographs of this year’s Christmas tree:</p><figure><img alt="" src="https://cdn-images-1.medium.com/proxy/1*OQ59ZpO_iv27N6CdTxJiDA.png" /></figure><p>As you can see I made four photos of a balloon, one of a heart. The plane photo is for demonstration purposes. The plane is a photo by <a href="https://unsplash.com/@nbb_photos">Lacie Slezak</a>.</p><p>Let’s imagine we are working on an application that will help us with removing the redundant photographs. The user makes a few photographs and we want to know whether they are similar or not. If we find similar photographs we can try to help the user to select the best one and automatically get rid of the rest.</p><p>First, we need to have these feature prints for each image:</p><pre>let balloon1 = UIImage(named: &quot;balloon_1.jpg&quot;)!<br>let balloon1FeaturePrint = process(balloon1)!<br><br>let balloon2 = UIImage(named: &quot;balloon_2.jpg&quot;)!<br>let balloon2FeaturePrint = process(balloon2)!<br><br>let balloon3 = UIImage(named: &quot;balloon_3.jpg&quot;)!<br>let balloon3FeaturePrint = process(balloon3)!<br><br>let balloon4 = UIImage(named: &quot;balloon_4.jpg&quot;)!<br>let balloon4FeaturePrint = process(balloon4)!<br><br>let heart = UIImage(named: &quot;heart.jpg&quot;)!<br>let heartFeaturePrint = process(heart)!<br><br>let plane = UIImage(named: &quot;plane.jpg&quot;)! // Original photo by https://unsplash.com/@nbb_photos<br>let planeFeaturePrint = process(plane)!</pre><p>Then we need to prepare variables for the distance:</p><pre>var balloon1ToBallon2Distance: Float = .infinity<br>var balloon1ToBallon3Distance: Float = .infinity<br>var balloon1ToBallon4Distance: Float = .infinity<br>var balloon1ToHeartDistance: Float = .infinity<br>var balloon1ToPlaneDistance: Float = .infinity</pre><p>The last piece is to calculate distances using observations:</p><pre>do {<br>    try balloon1FeaturePrint.computeDistance(&amp;balloon1ToBallon2Distance, to: balloon2FeaturePrint)<br>    try balloon1FeaturePrint.computeDistance(&amp;balloon1ToBallon3Distance, to: balloon3FeaturePrint)<br>    try balloon1FeaturePrint.computeDistance(&amp;balloon1ToBallon4Distance, to: balloon4FeaturePrint)<br>    try balloon1FeaturePrint.computeDistance(&amp;balloon1ToHeartDistance, to: heartFeaturePrint)<br>    try balloon1FeaturePrint.computeDistance(&amp;balloon1ToPlaneDistance, to: planeFeaturePrint)<br>} catch {<br>    print(&quot;Couldn&#39;t compute the distance&quot;)<br>}</pre><p>The idea is simple. We take the first image and compare it to the next images. 
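Whether two photographs count as similar then comes down to picking a distance threshold, as sketched below.</p><p>Here is a minimal sketch of such a check. The threshold of 15 is an arbitrary assumption on my part, not a value defined by Vision, so tune it against your own photo library:</p><pre>import Vision<br><br>func areSimilar(_ lhs: VNFeaturePrintObservation,<br>                _ rhs: VNFeaturePrintObservation,<br>                threshold: Float = 15) -&gt; Bool {<br>    var distance = Float.infinity<br>    // computeDistance writes the result into the pointer we pass in;<br>    // if it throws, distance stays at .infinity and the check fails<br>    try? lhs.computeDistance(&amp;distance, to: rhs)<br>    return distance &lt; threshold<br>}</pre><p>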
When the distance is small we can assume the images are similar therefore redundant.</p><p>Let’s see the results:</p><figure><img alt="" src="https://cdn-images-1.medium.com/proxy/1*UDeZBEAjmf16P-VHfXumVQ.png" /></figure><pre>11.10728</pre><figure><img alt="" src="https://cdn-images-1.medium.com/proxy/1*5ugkexB1r1s2gQS3S4AjUw.png" /></figure><pre>11.60783</pre><figure><img alt="" src="https://cdn-images-1.medium.com/proxy/1*v1V1OPhIucO7rUk1SAdf-g.png" /></figure><pre>10.46046</pre><figure><img alt="" src="https://cdn-images-1.medium.com/proxy/1*rW0HdM7kicgQDFhLH6MTRg.png" /></figure><pre>21.83002</pre><p>The Christmas tree is the same and the heart is located near the balloon but the distance has doubled. This is good because this photograph is different than the previous four. The background is similar but the salient object has changed.</p><figure><img alt="" src="https://cdn-images-1.medium.com/proxy/1*_na1peB5ow3glm2KbsAa1Q.png" /></figure><pre>28.88646</pre><p>The plane distance from the balloon_1 is almost three times larger than from the other balloons.</p><p>Thanks to this request you can group similar images together. When they are grouped together you can analyze them one by one trying to find the best looking, least blurry image of them all. You can apply other vision requests like saliency detection to find which one has objects positioned in a better way, the face capture quality request can find the one with better-captured faces.</p><p>You can find the code here:</p><pre>import UIKit<br>import Vision<br><br>func process(_ image: UIImage) -&gt; VNFeaturePrintObservation? {<br>    guard let cgImage = image.cgImage else { return nil }<br>    let request = VNGenerateImageFeaturePrintRequest()<br>    <br>    let requestHandler = VNImageRequestHandler(cgImage: cgImage,<br>                                               orientation: .init(image.imageOrientation),<br>                                               options: [:])<br>    do {<br>        try requestHandler.perform([request])<br>    } catch {<br>        print(&quot;Can&#39;t make the request due to \(error)&quot;)<br>    }<br>    <br>    guard let result = request.results?.first else { return nil }<br>    return result<br>}<br><br>let balloon1 = UIImage(named: &quot;balloon_1.jpg&quot;)!<br>let balloon1FeaturePrint = process(balloon1)!<br><br>let balloon2 = UIImage(named: &quot;balloon_2.jpg&quot;)!<br>let balloon2FeaturePrint = process(balloon2)!<br><br>let balloon3 = UIImage(named: &quot;balloon_3.jpg&quot;)!<br>let balloon3FeaturePrint = process(balloon3)!<br><br>let balloon4 = UIImage(named: &quot;balloon_4.jpg&quot;)!<br>let balloon4FeaturePrint = process(balloon4)!<br><br>let heart = UIImage(named: &quot;heart.jpg&quot;)!<br>let heartFeaturePrint = process(heart)!<br><br>let plane = UIImage(named: &quot;plane.jpg&quot;)! 
// Original photo by https://unsplash.com/@nbb_photos<br>let planeFeaturePrint = process(plane)!<br><br>var balloon1ToBallon2Distance: Float = .infinity<br>var balloon1ToBallon3Distance: Float = .infinity<br>var balloon1ToBallon4Distance: Float = .infinity<br>var balloon1ToHeartDistance: Float = .infinity<br>var balloon1ToPlaneDistance: Float = .infinity<br><br>do {<br>    try balloon1FeaturePrint.computeDistance(&amp;balloon1ToBallon2Distance, to: balloon2FeaturePrint)<br>    try balloon1FeaturePrint.computeDistance(&amp;balloon1ToBallon3Distance, to: balloon3FeaturePrint)<br>    try balloon1FeaturePrint.computeDistance(&amp;balloon1ToBallon4Distance, to: balloon4FeaturePrint)<br>    try balloon1FeaturePrint.computeDistance(&amp;balloon1ToHeartDistance, to: heartFeaturePrint)<br>    try balloon1FeaturePrint.computeDistance(&amp;balloon1ToPlaneDistance, to: planeFeaturePrint)<br>} catch {<br>    print(&quot;Couldn&#39;t compute the distance&quot;)<br>}</pre><p>If you want to play with Vision and see it for yourself, you can check the latest version of my vision demo application <a href="https://github.com/ktustanowski/visiondemo">here</a>. You can find the code used in this article <a href="https://github.com/ktustanowski/visiondemo/blob/main/VisionDemoApp/TheVision.playground/Pages/VNGenerateImageFeaturePrintRequest.xcplaygroundpage/Contents.swift">here</a>.</p><p>If you have any feedback, or just want to say hi, you are more than welcome to write me an <a href="mailto:kamil.tustanowski@gmail.com">e-mail</a> or <a href="https://twitter.com/intent/tweet?screen_name=tustanowskik&amp;ref_src=twsrc%5Etfw">tweet to @tustanowskik</a></p><p>If you want to be up to date and always be the first to know what I’m working on, tap <a href="https://twitter.com/tustanowskik?ref_src=twsrc%5Etfw">follow @tustanowskik on Twitter</a></p><p>Thank you for reading!</p><p>If you want to help me stay on my feet during the night when I’m working on my blog — now you can:</p><p><a href="https://www.buymeacoffee.com/ktustanowski">Kamil Tustanowski is an AI enthusiast, iOS Dev, and content creator</a></p><p>[This was first published on my blog]</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ff13291901ff" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Detecting face capture quality using the Vision framework]]></title>
            <link>https://medium.com/@kamil.tustanowski/detecting-face-capture-quality-using-the-vision-framework-e94af66f31e3?source=rss-2c0314596f19------2</link>
            <guid isPermaLink="false">https://medium.com/p/e94af66f31e3</guid>
            <dc:creator><![CDATA[Kamil Tustanowski]]></dc:creator>
            <pubDate>Sun, 23 Jul 2023 07:45:20 GMT</pubDate>
            <atom:updated>2023-07-23T07:53:44.512Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*bTJqvTN-0PTA1jfsHdwiZw.png" /></figure><p>Selfies. I’m terrible at selfies. Even my daughter laughs at me when I try. I guess:</p><blockquote>I’m too old for this 💩!</blockquote><p><a href="https://lethalweapon.fandom.com/wiki/Roger_Murtaugh">Roger Murtaugh — Lethal Weapon</a></p><p>Potentially.</p><p>What I’m certain of is that technology can help me with my little problem. I may even solve more than one problem at a time.</p><p>I like photography. I like to take pictures. The problem is that I don’t like to select the best pictures and I end up with redundant photos. What if I could assess the selfie quality? I could make fewer pictures if I knew the quality upfront or I could check multiple selfies quality at once and leave the best one.</p><p>This is what we will do today. Despite its mundane character, I think this is one of the features that are easy to apply and can bring a huge benefit to the user.</p><p>To make it happen we need VNDetectFaceCaptureQualityRequest this as a result produces a float number telling us how good the face quality is and a frame to indicate where the face is.</p><p>As usual, we start with:</p><pre>import Vision</pre><p>Then we create the request:</p><pre>let faceQualityRequest = VNDetectFaceCaptureQualityRequest()</pre><p>And we pass it to the request handler:</p><pre>guard let cgImage = image.cgImage else { return }<br>let faceQualityRequest = VNDetectFaceCaptureQualityRequest()<br>let requestHandler = VNImageRequestHandler(cgImage: cgImage,<br>                                           orientation: .init(image.imageOrientation),<br>                                           options: [:])<br>do {<br>    try requestHandler.perform([faceQualityRequest])<br>} catch {<br>    print(&quot;Can&#39;t make the request due to \(error)&quot;)<br>}</pre><p>This is explained in detail in my <a href="https://medium.com/@kamil.tustanowski/detecting-body-pose-using-vision-framework-caba5435796a">Detecting body pose using Vision framework</a> article.</p><p>When a request is performed we get the results:</p><pre>guard let results = faceQualityRequest.results else { return }</pre><p>The result type is VNFaceObservation. From iOS 15 we don&#39;t have to do typecasting anymore.</p><p>We know this type because we were working with it before. This time in VNFaceObservation we are interested in the faceCaptureQuality which will give us a float number:</p><blockquote>The value ranges from 0 to 1. Faces with quality closer to 1 are better lit, sharper, and more centrally positioned than faces with quality closer to 0.</blockquote><p><a href="https://developer.apple.com/documentation/vision/vndetectfacecapturequalityrequest">VNDetectFaceCaptureQualityRequest documentation</a></p><p>Additionally, there is a frame to indicate which info is for which face. More than one face can be verified at a time.</p><p>It’s time to prepare the results for the presentation. This is done in a few phases because of all the translations we need to do:</p><pre>let boxesAndNames = results<br>    .map { (box: $0.boundingBox.rectangle(in: image),<br>            name: &quot;\($0.faceCaptureQuality ?? 
0.0)&quot;) }</pre><p>First, we associate the face capture quality with the bounding box and we project the bounding box CGRect onto the image to get non-normalized values.</p><pre>let rectangles = boxesAndNames.map { $0.box }<br>    .map { CGRect(origin: $0.origin.translateFromCoreImageToUIKitCoordinateSpace(using: image.size.height - $0.size.height),<br>                  size: $0.size) }</pre><p>Then we translate the non-normalized CGRect to UIKit coordinate space and populate the DisplayableText we used in previous articles:</p><pre>let displayableTexts = zip(rectangles,<br>                           boxesAndNames.map { $0.name })<br>    .map { DisplayableText(frame: $0.0,<br>                           text: $0.1) }</pre><p>This is described in more detail in <a href="https://medium.com/@kamil.tustanowski/barcode-detection-using-vision-framework-b5ce55b82c4e">Barcode detection using Vision framework</a> and <a href="https://medium.com/@kamil.tustanowski/detecting-body-pose-using-vision-framework-caba5435796a">Detecting body pose using Vision framework</a>.</p><p>The last part is to prepare an updated image and set it in our user-facing image view:</p><pre>self?.imageView.image = image.draw(rectangles: rectangles,<br>                                   displayableTexts: displayableTexts)</pre><p>But to do that we need the draw function we remember from previous articles. First, we draw the rectangles indicating faces:</p><pre>extension UIImage {<br>    func draw(rectangles: [CGRect],<br>              displayableTexts: [DisplayableText],<br>              strokeColor: UIColor = .primary,<br>              lineWidth: CGFloat = 2) -&gt; UIImage? {<br>        let renderer = UIGraphicsImageRenderer(size: size)<br>        return renderer.image { context in<br>            draw(in: CGRect(origin: .zero, size: size))<br><br>            context.cgContext.setStrokeColor(strokeColor.cgColor)<br>            context.cgContext.setLineWidth(lineWidth)<br>            rectangles.forEach { context.cgContext.addRect($0) }<br>            context.cgContext.drawPath(using: .stroke)</pre><p>And then we display texts for each face:</p><pre>let textAttributes = [NSAttributedString.Key.font: UIFont.systemFont(ofSize: 20, weight: .bold),<br>                                  NSAttributedString.Key.foregroundColor: strokeColor,<br>                                  NSAttributedString.Key.backgroundColor: UIColor.black]<br>            <br>            displayableTexts.forEach { displayableText in<br>                displayableText.text.draw(with: displayableText.frame!,<br>                                          options: [],<br>                                          attributes: textAttributes,<br>                                          context: nil)<br>            }<br>        }<br>    }<br>}</pre><p>And this is it.</p><p>Friendly reminder: It’s good to make a separate queue because Vision requests will block the main thread.</p><p>Finally, the exciting part:</p><figure><img alt="" src="https://cdn-images-1.medium.com/proxy/1*Rxi4BRw4wf-jSPXWk_xymw.jpeg" /></figure><p>Original photo by <a href="https://unsplash.com/@aaronandrewang">Aaron Andrew Ang</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/proxy/1*vPWsZflQi5PseSF-qj_WXA.jpeg" /></figure><p>Original photo by <a href="https://unsplash.com/@benwhitephotography">Ben White</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/proxy/1*eFSVZ4wMyq2nmmsRyRZWjA.jpeg" /></figure><p>Original photo by <a href="https://unsplash.com/@speckfechta">x 
)</a></p><p>You can see a pattern here. A clear, sharp picture of a face watching towards the camera got a much better score.</p><p>This request can detect the quality of multiple faces:</p><figure><img alt="" src="https://cdn-images-1.medium.com/proxy/1*ysGixMtxlG3X0J1z9nzd4A.jpeg" /></figure><p>Original photy by <a href="https://unsplash.com/@jasongoodman_youxventures">Jason Goodman</a></p><p>I would like to show you one last photo:</p><figure><img alt="" src="https://cdn-images-1.medium.com/proxy/1*Jj_ZyJxmJIR_8UOCNvGLqw.jpeg" /></figure><p>Original photy by <a href="https://unsplash.com/@parkstreet">Audrey Fretz</a></p><p>It has the lowest score. But… I like it. My bet is that the rating went down because the face is obstructed, a bit blurry, eyes are closed, it’s too close. But… I like it.</p><p>This request doesn’t tell you whether the selfie is good or bad. It tells whether the face is properly lit, positioned, sharp, and so on. You will know whether, technically speaking, one photo did a better job at exposing the face, than the other. Not which photo is more beautiful.</p><p>⚠️ A word of advice. If you use this request to allow users to purge all the bad quality selfies make sure you ask the user for confirmation before you delete them.</p><p>Below you can find a complete code. You need to provide the image you want to use for the request and have an imageView you want to use to display updated image:</p><pre>let visionQueue = DispatchQueue.global(qos: .userInitiated)<br><br>func process(_ image: UIImage) {<br>    guard let cgImage = image.cgImage else { return }<br>    let faceQualityRequest = VNDetectFaceCaptureQualityRequest()<br>    <br>    let requestHandler = VNImageRequestHandler(cgImage: cgImage,<br>                                               orientation: .init(image.imageOrientation),<br>                                               options: [:])<br><br>    saveImageButton.isHidden = false<br>    visionQueue.async { [weak self] in<br>        do {<br>            try requestHandler.perform([faceQualityRequest])<br>        } catch {<br>            print(&quot;Can&#39;t make the request due to \(error)&quot;)<br>        }<br><br>        guard let results = faceQualityRequest.results else { return }<br>        <br>        let boxesAndNames = results<br>            .map { (box: $0.boundingBox.rectangle(in: image),<br>                    name: &quot;\($0.faceCaptureQuality ?? 0.0)&quot;) }<br>            <br>        let rectangles = boxesAndNames.map { $0.box }<br>            .map { CGRect(origin: $0.origin.translateFromCoreImageToUIKitCoordinateSpace(using: image.size.height - $0.size.height),<br>                          size: $0.size) }<br><br>        let displayableTexts = zip(rectangles,<br>                                   boxesAndNames.map { $0.name })<br>            .map { DisplayableText(frame: $0.0,<br>                                   text: $0.1) }<br>        <br>        DispatchQueue.main.async {<br>            self?.imageView.image = image.draw(rectangles: rectangles,<br>                                               displayableTexts: displayableTexts)<br>        }<br>    }<br>}<br><br>extension UIImage {<br>    func draw(rectangles: [CGRect],<br>              displayableTexts: [DisplayableText],<br>              strokeColor: UIColor = .primary,<br>              lineWidth: CGFloat = 2) -&gt; UIImage? 
{<br>        let renderer = UIGraphicsImageRenderer(size: size)<br>        return renderer.image { context in<br>            draw(in: CGRect(origin: .zero, size: size))<br><br>            context.cgContext.setStrokeColor(strokeColor.cgColor)<br>            context.cgContext.setLineWidth(lineWidth)<br>            rectangles.forEach { context.cgContext.addRect($0) }<br>            context.cgContext.drawPath(using: .stroke)<br><br>            let textAttributes = [NSAttributedString.Key.font: UIFont.systemFont(ofSize: 20, weight: .bold),<br>                                  NSAttributedString.Key.foregroundColor: strokeColor,<br>                                  NSAttributedString.Key.backgroundColor: UIColor.black]<br>            <br>            displayableTexts.forEach { displayableText in<br>                displayableText.text.draw(with: displayableText.frame!,<br>                                          options: [],<br>                                          attributes: textAttributes,<br>                                          context: nil)<br>            }<br>        }<br>    }<br>}</pre><p>⚠️ I made a playground version of this code first but noticed that it was producing incorrect output. The same code used in the application on a device provides correct output, but if you run it in a playground or in the simulator the results are wrong.</p><p>If you want to play with Vision and see it for yourself, you can check the latest version of my vision demo application <a href="https://github.com/ktustanowski/visiondemo">here</a>. You can find the code used in this article <a href="https://github.com/ktustanowski/visiondemo/blob/main/VisionDemoApp/VisionDemo/ImageProcessingViewController%2BFaceQuality.swift">here</a>.</p><p>If you have any feedback, or just want to say hi, you are more than welcome to write me an <a href="mailto:kamil.tustanowski@gmail.com">e-mail</a> or <a href="https://twitter.com/intent/tweet?screen_name=tustanowskik&amp;ref_src=twsrc%5Etfw">tweet to @tustanowskik</a></p><p>If you want to be up to date and always be the first to know what I’m working on, tap <a href="https://twitter.com/tustanowskik?ref_src=twsrc%5Etfw">follow @tustanowskik on Twitter</a></p><p>Thank you for reading!</p><p>If you want to help me stay on my feet during the night when I’m working on my blog — now you can:</p><p><a href="https://www.buymeacoffee.com/ktustanowski">Kamil Tustanowski is an AI enthusiast, iOS Dev, and content creator</a></p><p>[This was first published on my blog]</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=e94af66f31e3" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Speech synthesis using AVSpeechSynthesizer]]></title>
            <link>https://medium.com/@kamil.tustanowski/speech-synthesis-using-avspeechsynthesizer-97b6c5620bb?source=rss-2c0314596f19------2</link>
            <guid isPermaLink="false">https://medium.com/p/97b6c5620bb</guid>
            <dc:creator><![CDATA[Kamil Tustanowski]]></dc:creator>
            <pubDate>Fri, 21 Jul 2023 20:03:36 GMT</pubDate>
            <atom:updated>2023-07-21T20:03:36.557Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*RpE31OkBCDqZ-8DpjXvFeQ.png" /></figure><p>Do you know what is a dialogue?</p><blockquote>Dialogue is a written or spoken conversational exchange between two or more people, and a literary and theatrical form that depicts such an exchange.</blockquote><p><a href="https://en.wikipedia.org/wiki/Dialogue">Wikipedia</a></p><p>To be honest I have a problem with this definition.</p><blockquote>between two or more people</blockquote><p>Why people? Why can’t we converse with other beings? What is the difference?</p><p>This made me think about the Turing test:</p><blockquote>The Turing test, originally called the imitation game by Alan Turing in 1950, is a test of a machine’s ability to exhibit intelligent behaviour equivalent to, or indistinguishable from, that of a human. Turing proposed that a human evaluator would judge natural language conversations between a human and a machine designed to generate human-like responses. The evaluator would be aware that one of the two partners in conversation is a machine, and all participants would be separated from one another. The conversation would be limited to a text-only channel such as a computer keyboard and screen so the result would not depend on the machine’s ability to render words as speech. If the evaluator cannot reliably tell the machine from the human, the machine is said to have passed the test. The test results do not depend on the machine’s ability to give correct answers to questions, only how closely its answers resemble those a human would give.</blockquote><p><a href="https://en.wikipedia.org/wiki/Turing_test">Wikipedia</a></p><p>It’s a simpler, and actually real, predecessor of <a href="https://bladerunner.fandom.com/wiki/Voight-Kampff_test">the Voight-Kampff test</a>.</p><p>When the evaluator is conversing with human it’s dialogue. When the evaluator is conversing with a machine it’s not a dialogue any more. At least according to Wikipedia.</p><p>But what happens before the evaluator decides who is a person and who is not? Can we call this part of the test a dialogue? A dialogue with a machine?</p><p>I will leave you with this question open. Sorry but I’m not a philosopher.</p><p>My previous article was about <a href="https://medium.com/@kamil.tustanowski/speech-recognition-using-the-speech-framework-72d31f4f344a">Speech recognition using the Speech framework</a>. I presented a way we can talk to our applications and make them recognize our speech. It’s time to allow the applications to speak back. It’s time to give them a voice.</p><p>Did you notice the AV prefix in AVSpeechSynthesizer? No new &amp; fancy frameworks this time just good ol&#39; AVFoundation:</p><pre>import AVFoundation</pre><p>Let’s imagine we are working on a cooking application. We want to allow the user to use the application without touching or even watching the screen. Consider the scenario where we want the application to inform the user that the chicken should be placed in the oven:</p><blockquote>Bake the chicken in the oven for fifteen minutes</blockquote><p>First, we decide what the application will say by using AVSpeechUtterance:</p><pre>let englishUtterance = AVSpeechUtterance(string: &quot;Bake the chicken in the oven for fifteen minutes&quot;)</pre><p>I encourage you to immediately add:</p><pre>englishUtterance.prefersAssistiveTechnologySettings = true</pre><p>⚠️ There are a few ways we can tweak the way the application will speak our message. 
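For example, here is a minimal sketch of those knobs; the concrete values below are arbitrary picks of mine, not recommendations:</p><pre>import AVFoundation<br><br>let utterance = AVSpeechUtterance(string: &quot;Bake the chicken in the oven for fifteen minutes&quot;)<br>utterance.rate = 0.45              // a touch slower than the default rate<br>utterance.pitchMultiplier = 1.1    // slightly higher pitch<br>utterance.volume = 0.8             // a bit quieter than full volume<br>utterance.postUtteranceDelay = 0.5 // a short pause after the sentence is spoken</pre><p>Now, back to prefersAssistiveTechnologySettings. 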
But what about the users with disabilities who are using VoiceOver? There is a high chance that the voice won’t be identical to the one from the VoiceOver which is confusing and uncomfortable to the user. This line makes sure that when VoiceOver is on our application will use an identical voice.</p><p>Next, we create AVSpeechSynthesizer which we will use in a moment to speak our AVSpeechUtterance:</p><pre>let synthesizer = AVSpeechSynthesizer()</pre><p>If you prefer a simple approach you can add:</p><pre>synthesizer.usesApplicationAudioSession = false</pre><p>But note that:</p><blockquote>If the value of this property is false, the capture session makes use of a private AVAudioSession instance for audio recording, which may cause interruption if your app uses its own audio session for playback.</blockquote><p><a href="https://developer.apple.com/documentation/avfoundation/avcapturesession/1620490-usesapplicationaudiosession">usesApplicationAudioSession documentation</a></p><p>The last part is passing utterance to speech synthesizer:</p><pre>synthesizer.speak(englishUtterance)</pre><p>As soon as you do this you will hear the application talking to you.</p><p>The code:</p><pre>let englishUtterance = AVSpeechUtterance(string: &quot;Bake the chicken in the oven for fifteen minutes&quot;)<br>englishUtterance.prefersAssistiveTechnologySettings = true<br>let synthesizer = AVSpeechSynthesizer()<br>synthesizer.usesApplicationAudioSession = false<br>synthesizer.speak(englishUtterance)</pre><p>Yes. It’s that easy.</p><p>But that’s not all. You can specify concrete language and speech synthesis can speak many different languages. Including the Polish language which I use every day.</p><pre>let polishUtterance = AVSpeechUtterance(string: &quot;Piecz kurczaka w piekarniku przez piętnaście minut&quot;)<br>polishUtterance.prefersAssistiveTechnologySettings = true<br>let polishVoice = AVSpeechSynthesisVoice(language: &quot;pl-PL&quot;)<br>polishUtterance.voice = polishVoice<br>let synthesizer = AVSpeechSynthesizer()<br>synthesizer.usesApplicationAudioSession = false<br>synthesizer.speak(polishUtterance)</pre><p>As you can see we can create a voice matching the language of the text. When you have a voice you need to pass it to the utterance:</p><pre>let polishVoice = AVSpeechSynthesisVoice(language: &quot;pl-PL&quot;)<br>polishUtterance.voice = polishVoice</pre><p>You can paste the code samples into a playground to hear how they sound.</p><p>AVSpeechUtterance has a few configuration options:</p><ul><li>rate - Lower values correspond to slower speech, and higher values correspond to faster speech.</li><li>pitchMultiplier - The baseline pitch the speech synthesizer uses when speaking the utterance.</li><li>postUtteranceDelay and preUtteranceDelay - When multiple utterances are enqueued these values mark the delays between them. One from the start, the other after the end.</li><li>volume - The volume of the speech.</li><li>voice - The voice to be used to read the text. 
You can use a voice that doesn&#39;t match the country of the text but this won&#39;t end well.</li></ul><p>You can use:</p><pre>print(AVSpeechSynthesisVoice.speechVoices())</pre><p>To see available voices:</p><pre>Language: ar-SA, Name: Maged, Quality: Default [com.apple.ttsbundle.Maged-compact]<br>Language: cs-CZ, Name: Zuzana, Quality: Default [com.apple.ttsbundle.Zuzana-compact]<br>Language: da-DK, Name: Sara, Quality: Default [com.apple.ttsbundle.Sara-compact]<br>Language: de-DE, Name: Anna, Quality: Default [com.apple.ttsbundle.Anna-compact]<br>Language: el-GR, Name: Melina, Quality: Default [com.apple.ttsbundle.Melina-compact]<br>Language: en-AU, Name: Karen, Quality: Default [com.apple.ttsbundle.Karen-compact]<br>Language: en-GB, Name: Daniel, Quality: Default [com.apple.ttsbundle.Daniel-compact]<br>Language: en-IE, Name: Moira, Quality: Default [com.apple.ttsbundle.Moira-compact]<br>Language: en-IN, Name: Rishi, Quality: Default [com.apple.ttsbundle.Rishi-compact]<br>Language: en-US, Name: Samantha, Quality: Default [com.apple.ttsbundle.Samantha-compact]<br>Language: en-ZA, Name: Tessa, Quality: Default [com.apple.ttsbundle.Tessa-compact]<br>Language: es-ES, Name: Mónica, Quality: Default [com.apple.ttsbundle.Monica-compact]<br>Language: es-MX, Name: Paulina, Quality: Default [com.apple.ttsbundle.Paulina-compact]<br>Language: fi-FI, Name: Satu, Quality: Default [com.apple.ttsbundle.Satu-compact]<br>Language: fr-CA, Name: Amélie, Quality: Default [com.apple.ttsbundle.Amelie-compact]<br>Language: fr-FR, Name: Thomas, Quality: Default [com.apple.ttsbundle.Thomas-compact]<br>Language: he-IL, Name: Carmit, Quality: Default [com.apple.ttsbundle.Carmit-compact]<br>Language: hi-IN, Name: Lekha, Quality: Default [com.apple.ttsbundle.Lekha-compact]<br>Language: hu-HU, Name: Mariska, Quality: Default [com.apple.ttsbundle.Mariska-compact]<br>Language: id-ID, Name: Damayanti, Quality: Default [com.apple.ttsbundle.Damayanti-compact]<br>Language: it-IT, Name: Alice, Quality: Default [com.apple.ttsbundle.Alice-compact]<br>Language: ja-JP, Name: Kyoko, Quality: Default [com.apple.ttsbundle.Kyoko-compact]<br>Language: ko-KR, Name: Yuna, Quality: Default [com.apple.ttsbundle.Yuna-compact]<br>Language: nl-BE, Name: Ellen, Quality: Default [com.apple.ttsbundle.Ellen-compact]<br>Language: nl-NL, Name: Xander, Quality: Default [com.apple.ttsbundle.Xander-compact]<br>Language: no-NO, Name: Nora, Quality: Default [com.apple.ttsbundle.Nora-compact]<br>Language: pl-PL, Name: Zosia, Quality: Default [com.apple.ttsbundle.Zosia-compact]<br>Language: pt-BR, Name: Luciana, Quality: Default [com.apple.ttsbundle.Luciana-compact]<br>Language: pt-PT, Name: Joana, Quality: Default [com.apple.ttsbundle.Joana-compact]<br>Language: ro-RO, Name: Ioana, Quality: Default [com.apple.ttsbundle.Ioana-compact]<br>Language: ru-RU, Name: Milena, Quality: Default [com.apple.ttsbundle.Milena-compact]<br>Language: sk-SK, Name: Laura, Quality: Default [com.apple.ttsbundle.Laura-compact]<br>Language: sv-SE, Name: Alva, Quality: Default [com.apple.ttsbundle.Alva-compact]<br>Language: th-TH, Name: Kanya, Quality: Default [com.apple.ttsbundle.Kanya-compact]<br>Language: tr-TR, Name: Yelda, Quality: Default [com.apple.ttsbundle.Yelda-compact]<br>Language: zh-CN, Name: Ting-Ting, Quality: Default [com.apple.ttsbundle.Ting-Ting-compact]<br>Language: zh-HK, Name: Sin-Ji, Quality: Default [com.apple.ttsbundle.Sin-Ji-compact]<br>Language: zh-TW, Name: Mei-Jia, Quality: Default [com.apple.ttsbundle.Mei-Jia-compact]]</pre><p>⚠️ You need to set 
these properties before enqueuing the utterance because setting them afterward has no effect.</p><p>This will get you going but it will take a lot more to make your application pass the Turing test.</p><p>If you have any feedback, or just want to say hi, you are more than welcome to write me an <a href="mailto:kamil.tustanowski@gmail.com">e-mail</a> or <a href="https://twitter.com/intent/tweet?screen_name=tustanowskik&amp;ref_src=twsrc%5Etfw">tweet to @tustanowskik</a></p><p>If you want to be up to date and always be the first to know what I’m working on tap <a href="https://twitter.com/tustanowskik?ref_src=twsrc%5Etfw">follow @tustanowskik on Twitter</a></p><p>Thank you for reading!</p><p>This was first published on my blog.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=97b6c5620bb" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Speech recognition using the Speech framework]]></title>
            <link>https://medium.com/@kamil.tustanowski/speech-recognition-using-the-speech-framework-72d31f4f344a?source=rss-2c0314596f19------2</link>
            <guid isPermaLink="false">https://medium.com/p/72d31f4f344a</guid>
            <dc:creator><![CDATA[Kamil Tustanowski]]></dc:creator>
            <pubDate>Fri, 21 Jul 2023 18:44:21 GMT</pubDate>
            <atom:updated>2023-07-21T18:44:21.956Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*dloHL32NLqJpZ1ceJv5RqQ.png" /></figure><p>Interfaces. Interfaces everywhere. We use door handles when we want to go outside, we use keys if we want to secure something, we use the steering wheel if we need to drive somewhere, we buy a ticket if we travel by bus or train.</p><p>Oh and I “almost forgot”, we use applications on our smartphones. The screen is an interface itself that hosts sub-interfaces for each application. Each application’s look and feel is different. Even if they share similarities thanks to conforming to design guidelines they are different and you need to learn how to use them.</p><blockquote>Cognitive load refers to the amount of effort that is exerted or required while reasoning and thinking. Any mental process, from memory to perception to language, creates a cognitive load because it requires energy and effort. When cognitive load is high, thought processes are potentially interfered with. To the UX designer, a common goal when designing interfaces would be to keep users’ cognitive load to a minimum.</blockquote><p><a href="https://www.interaction-design.org/literature/topics/cognitive-load">Check here if you want to know more</a></p><p>Did you hear about “Don’t make me think” or “The Best Interface is No Interface”? These are great books about design. The best thing about them is that you don’t even need to read them to start learning! Read the titles and think about their meaning for a second.</p><p><strong>Mindblowing.</strong></p><p>Now imagine an application presenting financial data and a user working on a task, e.g. comparing reports and making assumptions based on the data. See the poor man touching the screen here and there, copying, binding, adding data to the comparison, preparing intermediate results, and so on.</p><p>Imagine someone who is cooking a meal and has the recipe opened on an iPad. See this person working through the steps of the recipe, having to constantly wash and dry their hands to scroll the recipe on the screen.</p><p>Imagine yourself every time you are confused and annoyed by the application’s interface.</p><p>Wouldn’t it be great if you could tell the application what you want it to do?</p><p>It’s not easy and the road is long and bumpy but this doesn’t mean we can’t start taking the first steps.</p><p>Please allow me to introduce the Speech framework. This will be the cornerstone of our no-interface approach:</p><pre>import Speech</pre><p>In this article, I will focus on the code needed to perform speech recognition. I won’t clutter it with the application code. The full code will be linked at the end of the article for you to try out.</p><p>I want the code to be easy to understand and use, therefore the output of this article will be a functioning SpeechAnalyzer class:</p><pre>final class SpeechAnalyzer: ObservableObject {<br>}</pre><p>It’s an ObservableObject because I&#39;m using SwiftUI in the demo application.</p><p>Since we are interested in “talking” to our applications we will analyze the live audio. We need AVAudioEngine to do that:</p><pre>private let audioEngine = AVAudioEngine()</pre><p>Our analyzer will have a simple, easy-to-use API:</p><pre>final class SpeechAnalyzer: ObservableObject {<br>    @Published var recognizedText: String?<br>    @Published var isProcessing: Bool = false<br>    <br>    func start() {}<br>    func stop() {}<br>}</pre><p>First, we will tackle the start function. 
We are working on live audio therefore we want to configure a few things in the audio session:</p><pre>private var inputNode: AVAudioInputNode?<br>private var audioSession: AVAudioSession?<br><br>func start() {<br>        audioSession = AVAudioSession.sharedInstance()<br>        do {<br>            try audioSession?.setCategory(.record, mode: .measurement, options: .duckOthers)<br>            try audioSession?.setActive(true, options: .notifyOthersOnDeactivation)<br>        } catch {<br>            print(&quot;Couldn&#39;t configure the audio session properly&quot;)<br>        }<br>        <br>        inputNode = audioEngine.inputNode<br>}</pre><p>I don’t want to get into too much detail on this code and will just say that .record will make sure the other audio is silenced and .measurement tells the session we want to:</p><blockquote>minimize the amount of system-supplied signal processing to input and output signals</blockquote><p>We will place taps on AVAudioInputNode in a few seconds.</p><p>Now we are finally getting to a place where it starts to get interesting. We make a few vars we will need to handle speech recognition:</p><pre>private var speechRecognizer: SFSpeechRecognizer?<br>private var recognitionRequest: SFSpeechAudioBufferRecognitionRequest?<br>private var recognitionTask: SFSpeechRecognitionTask?</pre><p>We start by instantiating SFSpeechRecognizer:</p><pre>self.speechRecognizer = SFSpeechRecognizer()</pre><p>This uses the current locale, or we can specify a concrete locale:</p><pre>self.speechRecognizer = SFSpeechRecognizer(locale: Locale(identifier: &quot;pl_PL&quot;))</pre><p>The initializers for SFSpeechRecognizer will fail and return nil if the locale is not supported for speech recognition:</p><pre>public convenience init?() // Returns speech recognizer with user&#39;s current locale, or nil if is not supported<br><br>public init?(locale: Locale) // returns nil if the locale is not supported</pre><p>SFSpeechRecognizer is the central object that facilitates the recognition but there are a few more objects needed. Next is:</p><pre>recognitionRequest = SFSpeechAudioBufferRecognitionRequest()</pre><p>This allows us to customize the way we want to process the audio. We may choose whether we want to use the full power of Apple servers or process the speech on the device:</p><pre>recognitionRequest?.requiresOnDeviceRecognition = true</pre><blockquote>Set this property to true to prevent an SFSpeechRecognitionRequest from sending audio over the network. However, on-device requests won’t be as accurate.</blockquote><p><a href="https://developer.apple.com/documentation/speech/sfspeechrecognitionrequest/3152603-requiresondevicerecognition">requiresOnDeviceRecognition documentation</a></p><p>Note: This will only take effect if:</p><pre>speechRecognizer.supportsOnDeviceRecognition</pre><p>returns true. In other words, this might be possible but is not guaranteed.</p><blockquote>On-device speech recognition is available for some languages, but the framework also relies on Apple’s servers for speech recognition. Always assume that performing speech recognition requires a network connection.</blockquote><p><a href="https://developer.apple.com/documentation/speech">Speech framework documentation</a></p>
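<p>Putting these two properties together, a minimal sketch (my own addition, not part of the article’s demo) would only ask for on-device recognition when the recognizer reports support for it:</p><pre>// Assumes the speechRecognizer and recognitionRequest created above.<br>// Prefer on-device processing when it is supported, otherwise keep the default behavior.<br>if speechRecognizer?.supportsOnDeviceRecognition == true {<br>    recognitionRequest?.requiresOnDeviceRecognition = true<br>}</pre>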
<p>The other option is to allow the request to return partially recognized texts. This makes the process smoother because the results come in right from the start and update in real time. If you are not interested you can wait for the final recognition:</p><pre>recognitionRequest?.shouldReportPartialResults = false</pre><blockquote>If you want only final results (and you don’t care about intermediate results), set this property to false to prevent the system from doing extra work.</blockquote><p><a href="https://developer.apple.com/documentation/speech/sfspeechrecognitionrequest/1649392-shouldreportpartialresults/">shouldReportPartialResults documentation</a></p><p>Next, we need to make sure we have everything we need and that speech recognition is available:</p><pre>guard let speechRecognizer = speechRecognizer,<br>      speechRecognizer.isAvailable,<br>      let recognitionRequest = recognitionRequest,<br>      let inputNode = inputNode<br>else {<br>    assertionFailure(&quot;Unable to start the speech recognition!&quot;)<br>    return<br>}</pre><p>Note: always check whether the speech recognizer can recognize speech for the locale:</p><pre>speechRecognizer.isAvailable</pre><p>While testing this on various devices I noticed it wasn’t available on iPhone 12 Mini but was working perfectly fine on iPhone XR, iPhone 12 Pro Max, or iPad Air.</p><p>The time has come to provide audio to our SFSpeechAudioBufferRecognitionRequest:</p><pre>let recordingFormat = inputNode.outputFormat(forBus: 0)<br>inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { (buffer: AVAudioPCMBuffer, when: AVAudioTime) in<br>    recognitionRequest.append(buffer)<br>}</pre><p>This code will allow our application to tap into the live audio and pass the audio buffer to the request for speech recognition.</p><p>Now we need to create a concrete recognition task:</p><pre>recognitionTask = speechRecognizer.recognitionTask(with: recognitionRequest) { [weak self] result, error in<br>    self?.recognizedText = result?.bestTranscription.formattedString<br>    <br>    guard error != nil || result?.isFinal == true else { return }<br>    self?.stop()<br>}</pre><p>The task is provided by the speechRecognizer and uses the recognitionRequest. It&#39;s the final piece of the puzzle that ties everything together. This is the place where we get our results. The result is of type SFSpeechRecognitionResult and we are interested in:</p><ul><li>bestTranscription - the SFTranscription that is considered the most accurate. The transcription provides formattedString that returns a string we can use. You can see the other results in transcriptions if you like.</li><li>isFinal - indicates whether the transcription is final and finished.</li></ul><p>We set the result string to our:</p><pre>@Published var recognizedText: String?</pre><p>This, in turn, provides the value to our application. We will provide the implementation for self?.stop() in a moment. For now, make a mental note that when there is an error or recognition is final it&#39;s a good time to stop the recognition process.</p><p>Everything is prepared now and wired together. But there is silence. It’s time to bring the sound:</p><pre>audioEngine.prepare()<br><br>do {<br>    try audioEngine.start()<br>    isProcessing = true<br>} catch {<br>    print(&quot;Couldn&#39;t start audio engine!&quot;)<br>    stop()<br>}</pre><p>First, we tell the AVAudioEngine to prepare, and later we start it and indicate that processing is in progress. If it couldn&#39;t start we call stop() to clear the resources.</p><p>We used the stop function a few times now. 
It&#39;s a good time to create it:</p><pre>func stop() {<br>    recognitionTask?.cancel()<br>    <br>    audioEngine.stop()<br>    <br>    inputNode?.removeTap(onBus: 0)<br>    try? audioSession?.setActive(false)<br>    audioSession = nil<br>    inputNode = nil<br>    <br>    isProcessing = false<br>    <br>    recognitionRequest = nil<br>    recognitionTask = nil<br>    speechRecognizer = nil<br>}</pre><p>The purpose of this function is to clear everything that is not needed anymore. It stops the currently running tasks, stops the audio engine, removes the tap on inputNode, informs that the analyzer is not processing, and clears the memory.</p><p>The availability of speech recognition can change and we need to monitor this state and respond accordingly. Luckily there is a delegate for that: SFSpeechRecognizerDelegate. We need to change our SpeechAnalyzer declaration to implement this protocol:</p><pre>final class SpeechAnalyzer: NSObject, ObservableObject, SFSpeechRecognizerDelegate</pre><p>We additionally need to add NSObject because this delegate requires NSObjects to operate.</p><blockquote>Hello Objective-C my old friend.</blockquote><p>Now we let our SFSpeechRecognizer know we want to be its delegate:</p><pre>speechRecognizer.delegate = self</pre><p>Thanks to this when we add:</p><pre>public func speechRecognizer(_ speechRecognizer: SFSpeechRecognizer, availabilityDidChange available: Bool) {<br>    if available {<br>        print(&quot;✅ Available&quot;)<br>    } else {<br>        print(&quot;🔴 Unavailable&quot;)<br>        recognizedText = &quot;Text recognition unavailable. Sorry!&quot;<br>        stop()<br>    }<br>}</pre><p>We will immediately start receiving availability information.</p><p>This was a lot to take in. I know. But it’s time to see it in action. I made a simple application with a record button and a label for the speech recognition result. This application is using our SpeechAnalyzer.</p><p>It’s time to test. We run the application and the first tap on the button results in a crash:</p><pre>This app has crashed because it attempted to access privacy-sensitive data without a usage description.  The app&#39;s Info.plist must contain an NSMicrophoneUsageDescription key with a string value explaining to the user how the app uses this data.</pre><p>We need to provide the NSMicrophoneUsageDescription key in the Info.plist with a description of why we need microphone access.</p><p>Second run, and second tap on the button. This time the alert asking the user to allow, or deny, microphone access is presented. We tap allow and… the application crashes again:</p><pre>Error Domain=kAFAssistantErrorDomain Code=1700 &quot;User denied access to speech recognition&quot; UserInfo={NSLocalizedDescription=User denied access to speech recognition}</pre><p>The user has to deliberately allow the application to not only use the microphone but also do the speech recognition. We must add NSSpeechRecognitionUsageDescription with a description to the Info.plist.</p><p>Note: If you can’t find the Info.plist file in the project navigator tap on the top project and look for the Info tab.</p><p>Now when we run the application and tap the record button the alert for speech recognition is shown. Allow the recognition and start talking in English or any other language you created the speech recognizer for.</p>
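<p>If you are not sure which languages are available on a given device, SFSpeechRecognizer can list them. A quick check you can drop anywhere (my own addition, not part of the demo code):</p><pre>// Prints the set of locales this device supports for speech recognition.<br>print(SFSpeechRecognizer.supportedLocales())</pre>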
<p>The text will appear above the button.</p><figure><img alt="" src="https://cdn-images-1.medium.com/proxy/1*-8iBu7UPIao3Nx0xuWhkjg.png" /></figure><p>Note: I did this the easy way to not complicate the example code but SFSpeechRecognizer offers methods to implement the authorization properly:</p><pre>open class func authorizationStatus() -&gt; SFSpeechRecognizerAuthorizationStatus<br><br>open class func requestAuthorization(_ handler: @escaping (SFSpeechRecognizerAuthorizationStatus) -&gt; Void)</pre><p>I encourage you to make use of these methods in your application.</p><p>This is the full code:</p><pre>final class SpeechAnalyzer: NSObject, ObservableObject, SFSpeechRecognizerDelegate {<br>    private let audioEngine = AVAudioEngine()<br>    private var inputNode: AVAudioInputNode?<br>    private var speechRecognizer: SFSpeechRecognizer?<br>    private var recognitionRequest: SFSpeechAudioBufferRecognitionRequest?<br>    private var recognitionTask: SFSpeechRecognitionTask?<br>    private var audioSession: AVAudioSession?<br>    <br>    @Published var recognizedText: String?<br>    @Published var isProcessing: Bool = false<br><br>    func start() {<br>        audioSession = AVAudioSession.sharedInstance()<br>        do {<br>            try audioSession?.setCategory(.record, mode: .measurement, options: .duckOthers)<br>            try audioSession?.setActive(true, options: .notifyOthersOnDeactivation)<br>        } catch {<br>            print(&quot;Couldn&#39;t configure the audio session properly&quot;)<br>        }<br>        <br>        inputNode = audioEngine.inputNode<br>        <br>        speechRecognizer = SFSpeechRecognizer()<br>        print(&quot;Supports on device recognition: \(speechRecognizer?.supportsOnDeviceRecognition == true ? &quot;✅&quot; : &quot;🔴&quot;)&quot;)<br><br>        // Force specified locale<br>        // self.speechRecognizer = SFSpeechRecognizer(locale: Locale(identifier: &quot;pl_PL&quot;))<br>        recognitionRequest = SFSpeechAudioBufferRecognitionRequest()<br>        <br>        // Disable partial results<br>        // recognitionRequest?.shouldReportPartialResults = false<br>        <br>        // Enable on-device recognition<br>        // recognitionRequest?.requiresOnDeviceRecognition = true<br><br>        guard let speechRecognizer = speechRecognizer,<br>              speechRecognizer.isAvailable,<br>              let recognitionRequest = recognitionRequest,<br>              let inputNode = inputNode<br>        else {<br>            assertionFailure(&quot;Unable to start the speech recognition!&quot;)<br>            return<br>        }<br>        <br>        speechRecognizer.delegate = self<br>        <br>        let recordingFormat = inputNode.outputFormat(forBus: 0)<br>        inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { (buffer: AVAudioPCMBuffer, when: AVAudioTime) in<br>            recognitionRequest.append(buffer)<br>        }<br><br>        recognitionTask = speechRecognizer.recognitionTask(with: recognitionRequest) { [weak self] result, error in<br>            self?.recognizedText = result?.bestTranscription.formattedString<br>            <br>            guard error != nil || result?.isFinal == true else { return }<br>            self?.stop()<br>        }<br><br>        audioEngine.prepare()<br>        <br>        do {<br>            try audioEngine.start()<br>            isProcessing = true<br>        } catch {<br>            print(&quot;Couldn&#39;t start audio engine!&quot;)<br>            stop()<br>        }<br>    }<br>    <br>    func stop() {<br>        recognitionTask?.cancel()<br>        <br>        audioEngine.stop()<br>        <br>        inputNode?.removeTap(onBus: 0)<br>        try? audioSession?.setActive(false)<br>        audioSession = nil<br>        inputNode = nil<br>        <br>        isProcessing = false<br>        <br>        recognitionRequest = nil<br>        recognitionTask = nil<br>        speechRecognizer = nil<br>    }<br>    <br>    public func speechRecognizer(_ speechRecognizer: SFSpeechRecognizer, availabilityDidChange available: Bool) {<br>        if available {<br>            print(&quot;✅ Available&quot;)<br>        } else {<br>            print(&quot;🔴 Unavailable&quot;)<br>            recognizedText = &quot;Text recognition unavailable. Sorry!&quot;<br>            stop()<br>        }<br>    }<br>}</pre><p>Additionally, this is the code for the application:</p><pre>struct SpeechRecognitionView: View {<br>    private enum Constants {<br>        static let recognizeButtonSide: CGFloat = 100<br>    }<br>    <br>    @ObservedObject private var speechAnalyzer = SpeechAnalyzer()<br>    var body: some View {<br>        VStack {<br>            Spacer()<br>            Text(speechAnalyzer.recognizedText ?? &quot;Tap to begin&quot;)<br>                .padding()<br>            <br>            Button {<br>                toggleSpeechRecognition()<br>            } label: {<br>                Image(systemName: speechAnalyzer.isProcessing ? &quot;waveform.circle.fill&quot; : &quot;waveform.circle&quot;)<br>                    .resizable()<br>                    .frame(width: Constants.recognizeButtonSide,<br>                           height: Constants.recognizeButtonSide,<br>                           alignment: .center)<br>                    .foregroundColor(speechAnalyzer.isProcessing ? .red : .gray)<br>                    .aspectRatio(contentMode: .fit)<br>            }<br>            .padding()<br>        }<br>    }<br>}<br><br>private extension SpeechRecognitionView {<br>    func toggleSpeechRecognition() {<br>        if speechAnalyzer.isProcessing {<br>            speechAnalyzer.stop()<br>        } else {<br>            speechAnalyzer.start()<br>        }<br>    }<br>}</pre><p>This is all you need to start communicating verbally with your application.</p><p>The quality of this service is at least good enough. I tested English with both on-device and server-based recognition and it worked fine. When I switched to Polish, my native language, I was surprised by how accurate the recognition was.</p><p>⚠️ Important Note: The on-device recognition is less accurate but it’s not limited. The speech recognition over the network is limited:</p><blockquote>The current rate limit for the number of SFSpeechRecognitionRequest calls a device can make is 1000 requests per hour. Please note this limit is on the number of requests that a device can make and is not tied to the application making it. This is regardless of the length of audio associated with the request. For a given SFSpeechRecognitionRequest, you are allowed up to one minute of audio per request.</blockquote><p><a href="https://developer.apple.com/library/archive/qa/qa1951/_index.html">The source</a></p><p>In short, each device can make up to 1k requests per hour. Each request can take up to 1 minute in total. 
This sounds reasonable but… you should be aware of these limitations.</p><p>You can download the demo application <a href="https://github.com/ktustanowski/sounddemo">here</a>.</p><p>Enjoy!</p><p>If you have any feedback, or just want to say hi, you are more than welcome to write me an <a href="mailto:kamil.tustanowski@gmail.com">e-mail</a> or <a href="https://twitter.com/intent/tweet?screen_name=tustanowskik&amp;ref_src=twsrc%5Etfw">tweet to @tustanowskik</a></p><p>If you want to be up to date and always be the first to know what I’m working on tap <a href="https://twitter.com/tustanowskik?ref_src=twsrc%5Etfw">follow @tustanowskik on Twitter</a></p><p>Thank you for reading!</p><p>This was first published on my blog.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=72d31f4f344a" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Sound classification using the Sound Analysis framework]]></title>
            <link>https://medium.com/@kamil.tustanowski/sound-classification-using-the-sound-analysis-framework-b1679177e606?source=rss-2c0314596f19------2</link>
            <guid isPermaLink="false">https://medium.com/p/b1679177e606</guid>
            <dc:creator><![CDATA[Kamil Tustanowski]]></dc:creator>
            <pubDate>Fri, 21 Jul 2023 18:26:57 GMT</pubDate>
            <atom:updated>2023-07-21T18:26:57.717Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Wv8MDmGKW1UbXOgynKHYlg.png" /></figure><p>Sound means as little to our applications as images do. It’s data without any meaning. Users can decode it, enjoy or hate it, understand it. Applications can analyze the sound patterns, reduce pauses, do other transformations and improvements but not understand what the sound file represents.</p><p>It’s time to change that.</p><p>I have written a few articles about the Vision framework in applications and I’m far from done but I love to experiment and look for new ways of user-device communication and interaction. This brings me to today’s topic, which is not vision-related but is like vision for sounds.</p><p>Please meet the Sound Analysis framework.</p><pre>import SoundAnalysis</pre><p>Today’s code samples will be provided in a playground. I will add a link to this playground at the end of the article.</p><p>Sound classification requests are more eager to throw errors than the vision requests, therefore we will start with do-catch:</p><pre>do {<br>    /* Our classification code will go here */<br>} catch {<br>    print(&quot;Something went terribly wrong!&quot;)<br>}</pre><p>And we create a SNClassifySoundRequest request:</p><pre>let soundClassifyRequest = try SNClassifySoundRequest(classifierIdentifier: .version1)</pre><p>We initialize SNClassifySoundRequest with the only currently available version, which is 1. But this request is much more flexible:</p><blockquote>Alternatively, you identify a custom set of sounds by providing the sound request with a custom Core ML model</blockquote><p><a href="https://developer.apple.com/documentation/soundanalysis">The Sound Analysis framework documentation</a></p><p>This means we can train our own machine learning models with custom sounds we want to classify. Let me know whether you would like to know more because this is material for a whole article.</p><p>Now let’s get back to our built-in version1 request and make sure it can classify the sounds we are interested in. This code will print all the supported identifiers:</p><pre>soundClassifyRequest.knownClassifications<br>    .enumerated()<br>    .forEach { index, identifier in print(&quot;\(index). \(identifier)&quot;) }</pre><pre>0. speech<br>1. shout<br>2. yell<br>3. battle_cry<br>4. children_shouting<br>5. screaming<br>6. whispering<br>7. laughter<br>8. baby_laughter<br>9. giggling<br>10. snicker<br>11. belly_laugh<br>12. chuckle_chortle<br>13. crying_sobbing<br>14. baby_crying<br>15. sigh<br>16. singing<br>17. choir_singing<br>18. yodeling<br>19. rapping<br>20. humming<br>21. whistling<br>22. breathing<br>23. snoring<br>24. gasp<br>25. cough<br>26. sneeze<br>27. nose_blowing<br>28. person_running<br>29. person_shuffling<br>30. person_walking<br>[...]</pre><p>The full list contains more than 300 identifiers. You can check it <a href="https://gist.github.com/ktustanowski/ccb66f95967b7f0541d4a7cf948db6bf">here</a>.</p>
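<p>Since the request exposes this list, we can also check programmatically whether a label we care about is supported. A minimal sketch (my own addition, using the thunder label that shows up in the results later):</p><pre>// knownClassifications is a plain [String], so a simple contains check is enough.<br>if soundClassifyRequest.knownClassifications.contains(&quot;thunder&quot;) {<br>    print(&quot;This classifier can detect thunder&quot;)<br>}</pre>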
<p>Next, let’s add a sound we will analyze to the playground Resources. Please check <a href="https://medium.com/@kamil.tustanowski/working-with-the-vision-framework-in-the-playgrounds-b036189e1dae">Working with the Vision framework in the playgrounds</a> for the instructions on how to add resources to the playground.</p><p>I used <a href="https://www.zapsplat.com/music/approaching-thunderstorm-with-light-rain/">Approaching thunderstorm with light rain</a> from <a href="https://www.zapsplat.com">zapsplat</a>.</p><p>With the file in place, we need to construct the URL which we can then pass to the SNAudioFileAnalyzer:</p><pre>let fileUrl = Bundle.main.url(forResource: &quot;storm&quot;, withExtension: &quot;mp3&quot;)</pre><p>The URL is optional and will be nil if the file is not located. We need this file for processing and can&#39;t do anything without it. We are in the do-catch block therefore we should make use of it. First, we create an error:</p><pre>enum FileError: Error {<br>    case notFound<br>}</pre><p>And then we throw it if the file is not there:</p><pre>guard let fileUrl = Bundle.main.url(forResource: &quot;storm&quot;, withExtension: &quot;mp3&quot;) else { throw FileError.notFound }</pre><p>Let’s move this above the line where we create a request. There is no need for a request if we don’t have a file.</p><p>The next step is to create the SNAudioFileAnalyzer. This is where we use the file URL:</p><pre>let audioFileAnalyzer = try SNAudioFileAnalyzer(url: fileUrl)</pre><p>The analyzer needs a file to analyze. To start the analysis we need to add a request we want the analyzer to use to process the file:</p><pre>open func add(_ request: SNRequest, withObserver observer: SNResultsObserving) throws</pre><p>The problem is that it requires an observer which must conform to the SNResultsObserving protocol. This protocol consists of three methods:</p><pre>public protocol SNResultsObserving : NSObjectProtocol {<br>    func request(_ request: SNRequest, didProduce result: SNResult)<br>    optional func request(_ request: SNRequest, didFailWithError error: Error)<br>    optional func requestDidComplete(_ request: SNRequest)<br>}</pre><p>You can find more info in <a href="https://developer.apple.com/documentation/soundanalysis/snresultsobserving">the documentation</a>.</p><p>This is how the analyzer can communicate the status of the analysis. We need to make a small class conforming to this protocol and NSObject due to the NSObjectProtocol requirement, which can trigger a wave of memories of Objective-C days in those who remember those times.</p><p>Let’s conform to the protocol. DidComplete and DidFail are self-explanatory:</p><pre>final class AudioAnalysisObserver: NSObject, SNResultsObserving {<br>    func requestDidComplete(_ request: SNRequest) {<br>        print(&quot;Processing completed!&quot;)<br>    }<br>    <br>    func request(_ request: SNRequest, didFailWithError error: Error) {<br>        print(&quot;Failed with \(error)&quot;)<br>    }<br>}</pre><p>But there is more to do in the function that provides the SNResults:</p><pre>func request(_ request: SNRequest, didProduce result: SNResult)</pre><p>The first problem we need to solve is casting the result to the concrete result type. SNResult is an empty protocol. The documentation is <a href="https://developer.apple.com/documentation/soundanalysis/snresult">here</a>.</p><p>The SNClassificationResult is what we need:</p><pre>guard let result = result as? 
SNClassificationResult else  { return }</pre><p>SNClassificationResult contains:</p><ul><li>classifications - An array of SNClassification which contains information about the identifiers and the confidence (as usual within [0, 1.0] range).</li><li>timeRange - CMTimeRange informing about the time range of the analysis.</li></ul><p>Classifications are sorted therefore if we get the first one we will have the best match:</p><pre>guard let result = result as? SNClassificationResult,<br>      let bestClassification = result.classifications.first else  { return }</pre><p>Next, we get time information from timeRange:</p><pre>let timeStart = result.timeRange.start.seconds</pre><p>The last piece is putting this together into a meaningful message and providing it to the user:</p><pre>print(&quot;Found \(bestClassification.identifier) at \(Int((bestClassification.confidence) * 100))% at \(timeStart)s&quot;)</pre><p>This is our observer:</p><pre>final class AudioAnalysisObserver: NSObject, SNResultsObserving {<br>    func requestDidComplete(_ request: SNRequest) {<br>        print(&quot;Processing completed!&quot;)<br>    }<br>    <br>    func request(_ request: SNRequest, didProduce result: SNResult) {<br>        guard let result = result as? SNClassificationResult,<br>              let bestClassification = result.classifications.first else  { return }<br>        let timeStart = result.timeRange.start.seconds<br>        <br>        print(&quot;Found \(bestClassification.identifier) at \(Int((bestClassification.confidence) * 100))% at \(timeStart)s&quot;)<br>    }<br>    <br>    func request(_ request: SNRequest, didFailWithError error: Error) {<br>        print(&quot;Failed with \(error)&quot;)<br>    }<br>}</pre><p>Finally, we can add our request to the analyzer:</p><pre>let resultsObserver = AudioAnalysisObserver()<br>try audioFileAnalyzer.add(soundClassifyRequest, withObserver: resultsObserver)</pre><p>The last part is to start the analysis:</p><pre>audioFileAnalyzer.analyze()</pre><p>Please remember that analyze() is executing synchronously therefore you shouldn&#39;t call it from the main thread or it will block the application. More info in <a href="https://developer.apple.com/documentation/soundanalysis/snaudiofileanalyzer/3182399-analyze">the documentation</a></p><p>I’m working in the playground so synchronous is fine. But in the application, it is a no-go.</p><p>Additionally:</p><blockquote>You can run the same sound analysis request on multiple file analyzers, and each analyzer can process multiple requests.</blockquote><p>More in <a href="https://developer.apple.com/documentation/soundanalysis/snaudiofileanalyzer">the documentation</a>.</p><p>But I’m talking about threads and other things and we are excited about the results. 
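</p><p>Before we get to them, one practical aside on that threading remark: a minimal sketch (my own addition, not part of the playground code) of keeping analyze() off the main thread in an application could look like this; SNAudioFileAnalyzer also offers an asynchronous analyze(completionHandler:) variant:</p><pre>// Run the blocking analysis on a background queue so the UI stays responsive.<br>// Assumes the audioFileAnalyzer created above.<br>DispatchQueue.global(qos: .userInitiated).async {<br>    audioFileAnalyzer.analyze()<br>}</pre><p>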
You can find the audio file <a href="https://www.zapsplat.com/music/approaching-thunderstorm-with-light-rain/">here</a> and here are the results we received:</p><pre>Found water at 61% at 0.0s<br>Found water at 55% at 1.5s<br>Found water at 59% at 3.0s<br>Found water at 50% at 4.5s<br>Found rain at 55% at 6.0s<br>Found water at 59% at 7.5s<br>Found water at 55% at 9.0s<br>Found water at 50% at 10.5s<br>Found water at 56% at 12.0s<br>Found water at 53% at 13.5s<br>Found water at 53% at 15.0s<br>Found water at 61% at 16.5s<br>Found water at 53% at 18.0s<br>Found water at 48% at 19.5s<br>Found rain at 71% at 21.0s<br>Found thunderstorm at 71% at 22.5s<br>Found water at 59% at 24.0s<br>Found rain at 61% at 25.5s<br>Found rain at 63% at 27.0s<br>Found water at 62% at 28.5s<br>Found water at 65% at 30.0s<br>Found water at 57% at 31.5s<br>Found water at 58% at 33.0s<br>Found water at 49% at 34.5s<br>Found water at 51% at 36.0s<br>Found water at 57% at 37.5s<br>Found rain at 54% at 39.0s<br>Found thunder at 83% at 40.5s<br>Found thunderstorm at 67% at 42.0s<br>Found thunder at 71% at 43.5s<br>Found thunder at 80% at 45.0s<br>Found thunder at 78% at 46.5s<br>Found thunder at 93% at 48.0s<br>Found thunderstorm at 81% at 49.5s<br>Found rain at 65% at 51.0s<br>Found rain at 61% at 52.5s<br>Found water at 57% at 54.0s<br>Found water at 49% at 55.5s<br>Found water at 58% at 57.0s<br>Found water at 62% at 58.5s<br>Found water at 56% at 60.0s<br>Found water at 59% at 61.5s<br>Found water at 58% at 63.0s<br>Found water at 53% at 64.5s<br>Found water at 59% at 66.0s<br>Found water at 54% at 67.5s<br>Processing completed!</pre><p>Moments ago we didn’t know anything about the file and now we can make an educated guess that this audio file contains the recording of a storm. This is a huge leap forward which opens a lot of new possibilities. Generating the tags for the files allowing users to filter them is just the beginning.</p><p>You can find the code below. Remember about adding a sound file for the analysis:</p><pre>import UIKit<br>import SoundAnalysis<br><br>/*<br> <br> Uncomment if you want to see list of identifiers<br> <br>(try? SNClassifySoundRequest(classifierIdentifier: SNClassifierIdentifier.version1))?.knownClassifications<br>    .enumerated()<br>    .forEach { index, identifier in print(&quot;\(index). \(identifier)&quot;) }<br>*/<br><br>final class AudioAnalysisObserver: NSObject, SNResultsObserving {<br>    func requestDidComplete(_ request: SNRequest) {<br>        print(&quot;Processing completed!&quot;)<br>    }<br>    <br>    func request(_ request: SNRequest, didProduce result: SNResult) {<br>        guard let result = result as? 
SNClassificationResult,<br>              let bestClassification = result.classifications.first else  { return }<br>        let timeStart = result.timeRange.start.seconds<br>        <br>        print(&quot;Found \(bestClassification.identifier) at \(Int((bestClassification.confidence) * 100))% at \(timeStart)s&quot;)<br>    }<br>    <br>    func request(_ request: SNRequest, didFailWithError error: Error) {<br>        print(&quot;Failed with \(error)&quot;)<br>    }<br>}<br><br>enum FileError: Error {<br>    case notFound<br>}<br><br>do {<br>    guard let fileUrl = Bundle.main.url(forResource: &quot;storm&quot;, withExtension: &quot;mp3&quot;) else { throw FileError.notFound }<br>    let soundClassifyRequest = try SNClassifySoundRequest(classifierIdentifier: SNClassifierIdentifier.version1)<br><br>    let audioFileAnalyzer = try SNAudioFileAnalyzer(url: fileUrl)<br>    let resultsObserver = AudioAnalysisObserver()<br>    try audioFileAnalyzer.add(soundClassifyRequest, withObserver: resultsObserver)<br>    <br>    audioFileAnalyzer.analyze()<br>} catch {<br>    print(&quot;Something went terribly wrong!&quot;)<br>}</pre><p>You can find the code <a href="https://github.com/ktustanowski/visiondemo/blob/main/VisionDemoApp/TheSoundAnalysis.playground/Pages/SoundAnalysis.xcplaygroundpage/Contents.swift">here</a>. I added it as a part of my <a href="https://github.com/ktustanowski/visiondemo">vision demo application</a> since it has the potential to make the analysis of real-time video and sound even better.</p><p>⚠️ If you are using <strong>M1</strong> and run <strong>Xcode</strong> using <strong>Rosetta</strong> playgrounds won’t work. Check <a href="https://twitter.com/kwcodes/status/1451275902772461571?s=20">Krzysztof’s tweet</a> to know more. Have fun!</p><p>If you have any feedback, or just want to say hi, you are more than welcome to write me an <a href="mailto:cornerbit@protonmail.com">e-mail</a> or <a href="https://twitter.com/intent/tweet?screen_name=tustanowskik&amp;ref_src=twsrc%5Etfw">tweet to @tustanowskik</a></p><p>If you want to be up to date and always be the first to know what I’m working on tap <a href="https://twitter.com/tustanowskik?ref_src=twsrc%5Etfw">follow @tustanowskik on Twitter</a></p><p>Thank you for reading!</p><p>This was first published on my blog.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b1679177e606" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Working with the API calls in the playgrounds]]></title>
            <link>https://medium.com/@kamil.tustanowski/working-with-the-api-calls-in-the-playgrounds-25323f5ea7c9?source=rss-2c0314596f19------2</link>
            <guid isPermaLink="false">https://medium.com/p/25323f5ea7c9</guid>
            <dc:creator><![CDATA[Kamil Tustanowski]]></dc:creator>
            <pubDate>Fri, 21 Jul 2023 18:02:16 GMT</pubDate>
            <atom:updated>2023-07-21T18:02:16.961Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*GMkgZAQkEz8aXPLX1E0ymw.png" /></figure><p>Making network requests is the core functionality of many applications. Logging in, getting the data, making changes, and many, many more.</p><p>Implementing these calls would be a pain if not for the contracts that determine how the network calls work, what they require, and what they provide. It’s great that they always work and they don’t have any bugs. The only thing we need to do is to implement the handling on the application side and we are done.</p><p>In an ideal world that is.</p><p>In the real world, it’s somewhere between this ideal case and:</p><blockquote>Oh man, we have no documentation and nothing is working</blockquote><p>Handling API calls tends to be time-consuming and annoying. In addition, the only way to test the code is to run the application. One of my personal workarounds was to write unit tests without mocks during development. When I ran them they were contacting the API and I was able to see what was going on without running the application. In the end, I was either mocking them or discarding/disabling them. It wasn’t ideal but much better than the regular approach.</p><p>Workarounds are not needed anymore thanks to the playgrounds.</p><p>Let’s imagine a situation where we need to implement an API call that will return a list of items. We will use <a href="https://api.publicapis.org/entries">https://api.publicapis.org/entries</a> as an example. To make it harder, and realistic at the same time, we have the URL and no documentation.</p><p>First, we need to create a playground. In Xcode click on File &gt; New &gt; Playground or use shift + option + command + N. Then select Blank playground from the list.</p><p>Now we need to add a small tweak to make it asynchronous-execution friendly:</p><pre>import PlaygroundSupport<br>PlaygroundPage.current.needsIndefiniteExecution = true</pre><blockquote>By default, all top-level code is executed, and then execution is terminated. When working with asynchronous code, enable indefinite execution to allow execution to continue after the end of the playground’s top-level code is reached. This, in turn, gives threads and callbacks time to execute.</blockquote><p><a href="https://developer.apple.com/documentation/playgroundsupport/playgroundpage/1964501-needsindefiniteexecution">The Documentation</a></p>
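<p>The counterpart is worth knowing too: once the asynchronous work has finished you can end the run explicitly instead of leaving the playground spinning. A minimal sketch (my own addition, not from the original article):</p><pre>// Call this from a completion handler once you are done;<br>// it ends a run that was kept alive by needsIndefiniteExecution.<br>PlaygroundPage.current.finishExecution()</pre>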
<p>Then we need to make the call to see what is there. We could use the browser but where is the fun in that?</p><pre>let apiListRequestUrl = URL(string: &quot;https://api.publicapis.org/entries&quot;)!<br>let apiListTask = URLSession.shared.dataTask(with: apiListRequestUrl) { data, response, error in<br>    guard let data = data else { return }<br>    let jsonString = String(data: data, encoding: .utf8)!<br>    print(jsonString)<br>}<br><br>apiListTask.resume()</pre><p>The code above makes a request to the specified URL and provides data or an error in return.</p><p>When we run the playground we see:</p><figure><img alt="" src="https://cdn-images-1.medium.com/proxy/1*irKy64Vn6bqx4bAQnHvGrg.png" /></figure><p>What is great is that all the features we used in <a href="https://medium.com/@kamil.tustanowski/working-with-the-vision-framework-in-the-playgrounds-b036189e1dae">Working with the Vision framework in the playgrounds</a> are there.</p><p>This means we can tap on the eye button:</p><figure><img alt="" src="https://cdn-images-1.medium.com/proxy/1*laTy7u19n6ufbhJFCgoP2A.png" /></figure><p>And we can do the same with the rectangle / embed button:</p><figure><img alt="" src="https://cdn-images-1.medium.com/proxy/1*mAYBvZIYrYXvzB6JO-opDA.png" /></figure><p>Regular prints work as well:</p><pre>{&quot;count&quot;:1185,&quot;entries&quot;:[{&quot;API&quot;:&quot;Axolotl&quot;,&quot;Description&quot;:&quot;Collection of axolotl pictures and facts&quot;,&quot;Auth&quot;:&quot;&quot;,&quot;HTTPS&quot;:true,&quot;Cors&quot;:&quot;yes&quot;,&quot;Link&quot;:&quot;https://theaxolotlapi.netlify.app/&quot;,&quot;Category&quot;:&quot;Animals&quot;},{&quot;API&quot;:&quot;Cat Facts&quot;,&quot;Description&quot;:&quot;Daily cat facts&quot;,&quot;Auth&quot;:&quot;&quot;,&quot;HTTPS&quot;:true,&quot;Cors&quot;:&quot;no&quot;,&quot;Link&quot;:&quot;https://alexwohlbruck.github.io/cat-facts/&quot;,&quot;Category&quot;:&quot;Animals&quot;},{&quot;API&quot;:&quot;Cataas&quot;,&quot;Description&quot;:&quot;Cat as a service (cats pictures and gifs)&quot;,&quot;Auth&quot;:&quot;&quot;,&quot;HTTPS&quot;:true,&quot;Cors&quot;:&quot;no&quot;,&quot;Link&quot;:&quot;https://cataas.com/&quot;,&quot;Category&quot;:&quot;Animals&quot;},{&quot;API&quot;:&quot;catAPI&quot;,&quot;Description&quot;:&quot;Random pictures of cats&quot;,&quot;Auth&quot;:&quot;&quot;,&quot;HTTPS&quot;:true,&quot;Cors&quot;:&quot;yes&quot;,&quot;Link&quot;:&quot;https://thatcopy.pw/catapi&quot;,&quot;Category&quot;:&quot;Animals&quot;},{&quot;API&quot;:&quot;Cats&quot;,&quot;Description&quot;:&quot;Pictures of cats from Tumblr&quot;,&quot;Auth&quot;:&quot;apiKey&quot;,&quot;HTTPS&quot;:true,&quot;Cors&quot;:&quot;no&quot;,&quot;Link&quot;:&quot;https://docs.thecatapi.com/&quot;,&quot;Category&quot;:&quot;Animals&quot;},{&quot;API&quot;:&quot;Dog Facts&quot;,&quot;Description&quot;:&quot;Random dog facts&quot;,&quot;Auth&quot;:&quot;&quot;,&quot;HTTPS&quot;:true,&quot;Cors&quot;:&quot;yes&quot;,&quot;Link&quot;:&quot;https://dukengn.github.io/Dog-facts-API/&quot;,&quot;Category&quot;:&quot;Animals&quot;},<br>[...]</pre><p>Based on the gathered info we can make a few assumptions. The results are in a container that has two main variables, count and entries:</p><pre>{&quot;count&quot;:1185,<br> &quot;entries&quot;:[/* Entry items */]</pre><p>We can start implementing the simple structure:</p><pre>struct ApiListEnvelope: Codable {<br>    let count: Int<br>}</pre><p>And add code for decoding instead of the print:</p><pre>let decoded = try! 
JSONDecoder().decode(ApiListEnvelope.self, from: data)</pre><p>What happens is:</p><figure><img alt="" src="https://cdn-images-1.medium.com/proxy/1*NvKTjl4dS6Fm3v_36BQB_Q.png" /></figure><p>This confirms that the count is properly decoded but this is not the end. We need an Entry:</p><pre>{&quot;API&quot;:&quot;Axolotl&quot;,&quot;Description&quot;:&quot;Collection of axolotl pictures and facts&quot;,&quot;Auth&quot;:&quot;&quot;,&quot;HTTPS&quot;:true,&quot;Cors&quot;:&quot;yes&quot;,&quot;Link&quot;:&quot;https://theaxolotlapi.netlify.app/&quot;,&quot;Category&quot;:&quot;Animals&quot;}</pre><pre>struct Entry: Codable {<br>    let api: String<br>    let description: String<br>    let link: String<br>    let category: String<br>}</pre><p>To make the envelope complete:</p><pre>struct ApiListEnvelope: Codable {<br>    let count: Int<br>    let entries: [Entry]<br>}</pre><p>But when we run the playground we see there is a problem:</p><pre>__lldb_expr_34/TheAPIPlayground.playground:30: Fatal error: &#39;try!&#39; expression unexpectedly raised an error: Swift.DecodingError.keyNotFound(CodingKeys(stringValue: &quot;api&quot;, intValue: nil), Swift.DecodingError.Context(codingPath: [CodingKeys(stringValue: &quot;entries&quot;, intValue: nil), _JSONKey(stringValue: &quot;Index 0&quot;, intValue: 0)], debugDescription: &quot;No value associated with key CodingKeys(stringValue: \&quot;api\&quot;, intValue: nil) (\&quot;api\&quot;).&quot;, underlyingError: nil))</pre><p>The names in the JSON don&#39;t match, which means we need to provide CodingKeys for Entry:</p><pre>enum CodingKeys: String, CodingKey {<br>    case api = &quot;API&quot;<br>    case description = &quot;Description&quot;<br>    case link = &quot;Link&quot;<br>    case category = &quot;Category&quot;<br>}</pre><p>Now our response is decoded properly. Note: we decode only the information we intend to use:</p><figure><img alt="" src="https://cdn-images-1.medium.com/proxy/1*TllBQalKt2aBKNv7ZBm-GQ.png" /></figure><p>This is it. We implemented the loading of the list of items we want to display in the application.</p><p>We investigated the structure of the call and implemented decoding step by step, fixing issues as soon as they arose. Lightning-fast feedback included.</p><p>We didn’t need to run the application to achieve this. We didn’t have to tackle all the issues at once.</p><p>This is the full code:</p><pre>import UIKit<br>import PlaygroundSupport<br><br>struct Entry: Codable {<br>    let api: String<br>    let description: String<br>    let link: String<br>    let category: String<br>    <br>    enum CodingKeys: String, CodingKey {<br>        case api = &quot;API&quot;<br>        case description = &quot;Description&quot;<br>        case link = &quot;Link&quot;<br>        case category = &quot;Category&quot;<br>    }<br>}<br><br>struct ApiListEnvelope: Codable {<br>    let count: Int<br>    let entries: [Entry]<br>}<br><br>let apiListRequestUrl = URL(string: &quot;https://api.publicapis.org/entries&quot;)!<br>let apiListTask = URLSession.shared.dataTask(with: apiListRequestUrl) { data, response, error in<br>    guard let data = data else { return }<br>    let decoded = try! JSONDecoder().decode(ApiListEnvelope.self, from: data)<br>}<br><br>apiListTask.resume()<br><br>PlaygroundPage.current.needsIndefiniteExecution = true</pre>
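<p>One more playground-friendly tweak, shown here as my own sketch rather than as part of the article’s code: swapping the force-try for do-catch prints a readable DecodingError instead of stopping the whole run when the structures drift out of sync with the API:</p><pre>// Assumes apiListRequestUrl and ApiListEnvelope defined above.<br>let safeDecodingTask = URLSession.shared.dataTask(with: apiListRequestUrl) { data, _, _ in<br>    guard let data = data else { return }<br>    do {<br>        // Same decoding as above, but failures are printed instead of crashing.<br>        let decoded = try JSONDecoder().decode(ApiListEnvelope.self, from: data)<br>        print(&quot;Decoded \(decoded.entries.count) of \(decoded.count) entries&quot;)<br>    } catch {<br>        print(&quot;Decoding failed: \(error)&quot;)<br>    }<br>}<br>safeDecodingTask.resume()</pre>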
<p>⚠️ If you are using M1 and run Xcode using Rosetta playgrounds won’t work. Check <a href="https://twitter.com/kwcodes/status/1451275902772461571?s=20">Krzysztof’s tweet</a> to know more.</p><p>If you have any feedback, or just want to say hi, you are more than welcome to write me an <a href="mailto:cornerbit@protonmail.com">e-mail</a> or <a href="https://twitter.com/intent/tweet?screen_name=tustanowskik&amp;ref_src=twsrc%5Etfw">tweet to @tustanowskik</a></p><p>If you want to be up to date and always be the first to know what I’m working on tap <a href="https://twitter.com/tustanowskik?ref_src=twsrc%5Etfw">follow @tustanowskik on Twitter</a></p><p>Thank you for reading!</p><p>This was first published on my blog.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=25323f5ea7c9" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Working with the Vision framework in the playgrounds]]></title>
            <link>https://medium.com/@kamil.tustanowski/working-with-the-vision-framework-in-the-playgrounds-b036189e1dae?source=rss-2c0314596f19------2</link>
            <guid isPermaLink="false">https://medium.com/p/b036189e1dae</guid>
            <dc:creator><![CDATA[Kamil Tustanowski]]></dc:creator>
            <pubDate>Fri, 21 Jul 2023 14:09:29 GMT</pubDate>
            <atom:updated>2023-07-21T14:09:29.649Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*7fFN9nPHGnge5jKGp56kPw.png" /></figure><p>When playgrounds were first introduced I didn’t grasp the idea at first. Time has passed and I finally started to treat playgrounds like… playgrounds. Place where I could quickly test my ideas, make prototypes, and more.</p><p>What is great about playgrounds is the constant feedback they provide. In a few seconds, you know whether the changes you made are working properly or not. This is a game-changer for anyone used to compiling and running the application to check how the new code behaves.</p><p>This time I won’t introduce any new requests. Instead, I will show you how you can work with the Vision framework in playgrounds. Where you can verify whether the framework suits your needs or experiment with the requests.</p><p>First, we need to create a playground. In Xcode click on File &gt; New &gt; Playground or use shift + option + command + N. Then select Blank playground from the list.</p><p>With playground in place we need to add imports first:</p><pre>import UIKit<br>import Vision</pre><p>Then it’s time for the request. I will reuse the code from <a href="https://medium.com/@kamil.tustanowski/image-classification-using-the-vision-framework-3cac0ab6f399">Image classification using the Vision framework</a> from last week:</p><pre>func process(_ image: UIImage) {<br>    guard let cgImage = image.cgImage else { return }<br>    let request = VNClassifyImageRequest()<br>    <br>    let requestHandler = VNImageRequestHandler(cgImage: cgImage,<br>                                               orientation: .init(image.imageOrientation),<br>                                               options: [:])<br>    <br>    DispatchQueue.global(qos: .userInitiated).async {<br>        do {<br>            try requestHandler.perform([request])<br>        } catch {<br>            print(&quot;Can&#39;t make the request due to \(error)&quot;)<br>        }<br>        <br>        guard let results = request.results as? [VNClassificationObservation] else { return }<br>        <br>        results<br>            .filter { $0.confidence &gt; 0.7 }<br>            .forEach { print(&quot;\($0.identifier) - \((Int($0.confidence * 100)))%&quot;) }<br>    }<br>}</pre><p>And an extension:</p><pre>extension CGImagePropertyOrientation {<br>    init(_ uiOrientation: UIImage.Orientation) {<br>        switch uiOrientation {<br>        case .up: self = .up<br>        case .upMirrored: self = .upMirrored<br>        case .down: self = .down<br>        case .downMirrored: self = .downMirrored<br>        case .left: self = .left<br>        case .leftMirrored: self = .leftMirrored<br>        case .right: self = .right<br>        case .rightMirrored: self = .rightMirrored<br>        @unknown default:<br>            self = .up<br>        }<br>    }<br>}</pre><p>Which is described in <a href="https://medium.com/@kamil.tustanowski/detecting-body-pose-using-vision-framework-caba5435796a">Detecting body pose using Vision framework</a>. If you want to know more about the code please refer to the mentioned articles.</p><p>We have the processing function ready. Now we need images. 
The playground has a special place called Resources for providing images and any other assets:</p><figure><img alt="" src="https://cdn-images-1.medium.com/proxy/1*KVpxgp3VfeJMyWKNKnlXVg.png" /></figure><p>We drag and drop the images:</p><figure><img alt="" src="https://cdn-images-1.medium.com/proxy/1*tom1IH3dvxfbEmTvgPkjFA.png" /></figure><p>Instantiate the cupcake image in the playground:</p><pre>let cupcake = UIImage(named: &quot;cupcake.jpg&quot;)!</pre><p>Add code for processing and start the playground by tapping the play icon:</p><pre>process(cupcake)</pre><p>The result is printed to the console:</p><pre>food - 86%<br>baked_goods - 78%<br>dessert - 71%<br>cake - 71%<br>cupcake - 71%</pre><p>And that’s it! In a couple of minutes, we created functional proof-of-concept code. No application is needed. We could focus on the main task alone.</p><p>We could end here but there are a few cool features playgrounds have to offer and it would be a shame not to use them.</p><p>What we can improve is how we present inputs and outputs. If we look closely at the line where we created the image:</p><figure><img alt="" src="https://cdn-images-1.medium.com/proxy/1*XGDXdFUXBaOJJAZ_GHNbVw.png" /></figure><p>We will see two icons on the right. If we tap the eye we get the large preview:</p><figure><img alt="" src="https://cdn-images-1.medium.com/proxy/1*RQpxd8tXJ3q9xwFix3tcxg.png" /></figure><p>The other icon displays the current value inside the playground:</p><figure><img alt="" src="https://cdn-images-1.medium.com/proxy/1*EsghXxRbZJoYGZG6xCmVig.png" /></figure><p>Now let’s present the processing results better. First, we need to make a small change to our process function. We drop the dispatch queue code and make it work on the main thread and additionally return the result as an array of strings:</p><pre>func process(_ image: UIImage) -&gt; [String] {<br>    guard let cgImage = image.cgImage else { return [] }<br>    let request = VNClassifyImageRequest()<br>    <br>    let requestHandler = VNImageRequestHandler(cgImage: cgImage,<br>                                               orientation: .init(image.imageOrientation),<br>                                               options: [:])<br>    <br>    do {<br>        try requestHandler.perform([request])<br>    } catch {<br>        print(&quot;Can&#39;t make the request due to \(error)&quot;)<br>    }<br>    <br>    guard let results = request.results as? 
[VNClassificationObservation] else { return [] }<br>    <br>    return results<br>        .filter { $0.confidence &gt; 0.7 }<br>        .map { &quot;\($0.identifier) - \((Int($0.confidence * 100)))%&quot; }<br>}</pre><p>It’s the easiest way to get the results right where we want them:</p><figure><img alt="" src="https://cdn-images-1.medium.com/proxy/1*OFgHBIP4qowRT2TxUyYChg.png" /></figure><p><strong>Note:</strong> <em>As you can see the </em><em>rectangle button can show more than the images.</em></p><p>With this approach, we have a clear overview of our inputs and outputs to check and we don’t have to constrain ourselves:</p><figure><img alt="" src="https://cdn-images-1.medium.com/proxy/1*qYVum26TxS9t4VJ9oEJ7Vw.png" /></figure><p>We can do the same for the saliency requests described in <a href="https://medium.com/@kamil.tustanowski/saliency-detection-using-the-vision-framework-d53a38e4ccaa">Saliency detection using the Vision framework</a>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/proxy/1*epDh_sQEUbay2qsfR8qHJw.png" /></figure><p>Animals requests from <a href="https://medium.com/@kamil.tustanowski/animals-detection-using-the-vision-framework-1a755fbc639f">Animals detection using the Vision framework</a>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/proxy/1*AHlwspsXJADTzNVTC8nmQg.png" /></figure><p>Or any other.</p><p>Playgrounds are a great tool to have.</p><p>This is the whole code needed:</p><pre>import UIKit<br>import Vision<br><br>func process(_ image: UIImage) -&gt; [String] {<br>    guard let cgImage = image.cgImage else { return [] }<br>    let request = VNClassifyImageRequest()<br>    <br>    let requestHandler = VNImageRequestHandler(cgImage: cgImage,<br>                                               orientation: .init(image.imageOrientation),<br>                                               options: [:])<br>    <br>    do {<br>        try requestHandler.perform([request])<br>    } catch {<br>        print(&quot;Can&#39;t make the request due to \(error)&quot;)<br>    }<br>    <br>    guard let results = request.results as? 
<p>⚠️ Remember to add your assets to Resources and provide proper image names in UIImage(named: &quot;cupcake.jpg&quot;)!.</p><p>If you want to play with Vision and see it for yourself, you can check the latest version of my vision demo application <a href="https://github.com/ktustanowski/visiondemo">here</a>, where you will find TheVision.playground with three pages for you to experiment with:</p><figure><img alt="" src="https://cdn-images-1.medium.com/proxy/1*Z9cHD4EsL03RTzjIjX6FXw.png" /></figure><p>⚠️ Playgrounds are dependent on the project, so you need to build the project first before using them.</p><p>⚠️ I was working with these playgrounds on Xcode 13.</p><p>⚠️ If you are using an M1 Mac and run Xcode using Rosetta, playgrounds won’t work. Check <a href="https://twitter.com/kwcodes/status/1451275902772461571?s=20">Krzysztof’s tweet</a> to learn more.</p><p>⚠️ There are requests that require a device to operate and won’t work properly in playgrounds and simulators.</p><p>Have fun!</p><p>If you have any feedback, or just want to say hi, you are more than welcome to write me an <a href="mailto:cornerbit@protonmail.com">e-mail</a> or <a href="https://twitter.com/intent/tweet?screen_name=tustanowskik&amp;ref_src=twsrc%5Etfw">tweet to @tustanowskik</a>.</p><p>If you want to be up to date and always be the first to know what I’m working on, tap <a href="https://twitter.com/tustanowskik?ref_src=twsrc%5Etfw">follow @tustanowskik on Twitter</a>.</p><p>Thank you for reading!</p><p>P.S. Another way of getting quick feedback is using unit tests and snapshot tests, but that is a topic for another article. Let me know if you’d like to know more!</p><p>The photos I used in this article were made by: <a href="https://unsplash.com/@ibrahimboran">Ibrahim Boran</a>, <a href="https://unsplash.com/@nbb_photos">Lacie Slezak</a>, <a href="https://unsplash.com/@u2b_photos">Ursa Bavcar</a>, and <a href="https://unsplash.com/@theluckyneko">The Lucky Neko</a>.</p><p>[This was first published on my blog]</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b036189e1dae" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Image classification using the Vision framework]]></title>
            <link>https://medium.com/@kamil.tustanowski/image-classification-using-the-vision-framework-3cac0ab6f399?source=rss-2c0314596f19------2</link>
            <guid isPermaLink="false">https://medium.com/p/3cac0ab6f399</guid>
            <dc:creator><![CDATA[Kamil Tustanowski]]></dc:creator>
            <pubDate>Thu, 20 Jul 2023 19:48:05 GMT</pubDate>
            <atom:updated>2023-07-20T19:48:05.757Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*6HSExf3EkFQ1toULkGx7Jw.png" /></figure><p>For the last couple of weeks, I have been presenting awesome features of the Vision framework and simple ways of presenting the results. They were, I hope, visually appealing and interesting.</p><p>In today’s article, I won’t be presenting anything in the image. No overlays, rectangles, lines, or text. I know this might be disappointing, but please don’t give up on reading.</p><p>The request I want to talk about today might seem modest compared to the others presented previously. It returns an array of strings and nothing more.</p><p>But the strings in the list describe the contents of an image.</p><p>It’s fast, it’s simple, and it can change the way you handle images in your application: how you classify them, and what you can allow users to do with them.</p><p>This is the same image I used in the <a href="https://medium.com/@kamil.tustanowski/saliency-detection-using-the-vision-framework-d53a38e4ccaa">saliency detection article</a>. Last week we were able to detect where users will focus their attention. This week I will show you how to try to understand what <strong>is</strong> in the image.</p><p>Keep reading if you find getting from here:</p><figure><img alt="" src="https://cdn-images-1.medium.com/proxy/1*kWAyNbaRIxYmjvlKvpyybg.jpeg" /></figure><p>To here interesting:</p><pre>outdoor - 99%<br>land - 98%<br>liquid - 98%<br>water - 98%<br>water_body - 98%<br>waterways - 98%<br>waterfall - 98%<br>sky - 82%<br>cloudy - 82%</pre><p>VNClassifyImageRequest is the request used to classify images. Unlike the other requests I presented, this one doesn&#39;t provide any points or rectangles. It generates an array of identifiers with associated levels of confidence.</p><p>Let’s create the request:</p><pre>let request = VNClassifyImageRequest()</pre><p>And check what can be identified:</p><pre>// iOS 15 and up<br>let supportedIdentifiers = try? request.supportedIdentifiers()</pre><p>Note: This is for iOS 15 and above.</p><pre>// Below iOS 15<br>let supportedIdentifiers = try? VNClassifyImageRequest.knownClassifications(forRevision: VNClassifyImageRequestRevision1)</pre><p>You can find a complete list of <strong>1303</strong> (VNClassifyImageRequestRevision1) supported identifiers <a href="https://gist.github.com/ktustanowski/56c0d7541813868fed4aceb60ab5d149">here</a>.</p><p>The next step is running the request on a selected image:</p><pre>let requestHandler = VNImageRequestHandler(cgImage: cgImage,<br>                                           orientation: .init(image.imageOrientation),<br>                                           options: [:])<br><br>DispatchQueue.global(qos: .userInitiated).async {<br>    do {<br>        try requestHandler.perform([request])<br>    } catch {<br>        print(&quot;Can&#39;t make the request due to \(error)&quot;)<br>    }<br>}</pre><p>Please check my <a href="https://medium.com/@kamil.tustanowski/detecting-body-pose-using-vision-framework-caba5435796a">Detecting body pose using Vision framework</a> article if you need more information on running the requests.</p><p>And getting the results:</p><pre>guard let results = request.results as? [VNClassificationObservation] else { return }</pre><p>Note: In iOS 15 and above, we don’t need to cast the results anymore. They come with the correct type instead of an array of Any.</p>
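<p>To make the note concrete, here is a minimal sketch (not from the original code) of what getting the results can look like on iOS 15 and above, where the results property is already typed:</p><pre>// iOS 15 and up: results is already [VNClassificationObservation]?,<br>// so the cast is no longer needed<br>guard let results = request.results else { return }</pre>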
<p>In VNClassificationObservation we are interested in:</p><ul><li>The identifier, which holds information about the contents of the image.</li><li>The confidence, which contains a value from 0.0 to 1.0 describing how certain Vision is about this observation.</li></ul><p>Let’s see what Vision has to say about our image using a print function:</p><pre>results<br>    .forEach { print(&quot;\($0.identifier) - \((Int($0.confidence * 100)))%&quot;) }</pre><p>The result:</p><pre>outdoor - 99%<br>land - 98%<br>liquid - 98%<br>water - 98%<br>water_body - 98%<br>waterways - 98%<br>waterfall - 98%<br>sky - 82%<br>cloudy - 82%<br>structure - 28%<br>rocks - 28%<br>hill - 16%<br>rainbow - 13%<br>mountain - 12%<br>river - 7%<br>cliff - 7%<br>grass - 7%<br>blue_sky - 4%<br>canyon - 4%<br>creek - 4%<br>sunset_sunrise - 2%<br>plant - 2%<br>moss - 2%<br>shrub - 1%<br>sun - 1%<br>foliage - 0%<br>painting - 0%<br>bridge - 0%<br>forest - 0%<br>/* A lot of other identifiers */<br>xylophone - 0%<br>yacht - 0%<br>yarn - 0%<br>yoga - 0%<br>yogurt - 0%<br>yolk - 0%<br>zebra - 0%<br>zoo - 0%<br>zucchini - 0%</pre><p>The total count of the results is <strong>1303</strong>, the same as the number of identifiers supported by the machine learning model used for this request.</p><p>This is where the confidence shines. We filter the results to accept identifiers with confidence greater than <strong>70%</strong>:</p><pre>results<br>    .filter { $0.confidence &gt; 0.7 }<br>    .forEach { print(&quot;\($0.identifier) - \((Int($0.confidence * 100)))%&quot;) }</pre><p>This gives us the list I presented at the beginning of the article:</p><pre>outdoor - 99%<br>land - 98%<br>liquid - 98%<br>water - 98%<br>water_body - 98%<br>waterways - 98%<br>waterfall - 98%<br>sky - 82%<br>cloudy - 82%</pre><p>This is the whole code needed to get these results:</p><pre>func process(_ image: UIImage) {<br>    guard let cgImage = image.cgImage else { return }<br>    let request = VNClassifyImageRequest()<br>    <br>    let requestHandler = VNImageRequestHandler(cgImage: cgImage,<br>                                               orientation: .init(image.imageOrientation),<br>                                               options: [:])<br>    <br>    DispatchQueue.global(qos: .userInitiated).async {<br>        do {<br>            try requestHandler.perform([request])<br>        } catch {<br>            print(&quot;Can&#39;t make the request due to \(error)&quot;)<br>        }<br>        <br>        guard let results = request.results as? [VNClassificationObservation] else { return }<br>        <br>        results<br>            .filter { $0.confidence &gt; 0.7 }<br>            .forEach { print(&quot;\($0.identifier) - \((Int($0.confidence * 100)))%&quot;) }<br>    }<br>}</pre><p>Imagine what happens when you classify all the images you have in the application and allow users to filter them by content. If someone wants to find the most beautiful sky, it’s done.</p><p>And this is just the beginning.</p><p>If you want to play with Vision and see it for yourself, you can check the latest version of my vision demo application <a href="https://github.com/ktustanowski/visiondemo">here</a>. The example code is located in this <a href="https://github.com/ktustanowski/visiondemo/blob/main/VisionDemoApp/VisionDemo/ImageProcessingViewController%2BImageClassification.swift">file</a>.</p>
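<p>To make the content-filtering idea from a few paragraphs above a bit more concrete, here is a hypothetical sketch. It assumes a classify(_:threshold:) variant of the process function that returns the matched identifiers synchronously, and an images array holding the pictures stored in your application (both names are placeholders, not part of the original code):</p><pre>// Hypothetical helper: returns identifiers with confidence above the threshold<br>func classify(_ image: UIImage, threshold: Float = 0.7) -&gt; [String] {<br>    guard let cgImage = image.cgImage else { return [] }<br>    let request = VNClassifyImageRequest()<br>    let requestHandler = VNImageRequestHandler(cgImage: cgImage,<br>                                               orientation: .init(image.imageOrientation),<br>                                               options: [:])<br>    try? requestHandler.perform([request])<br>    guard let results = request.results as? [VNClassificationObservation] else { return [] }<br>    return results<br>        .filter { $0.confidence &gt; threshold }<br>        .map { $0.identifier }<br>}<br><br>// images is an assumed [UIImage] collection, e.g. the photos kept in the app<br>let skyImages = images.filter { classify($0).contains(&quot;sky&quot;) }</pre>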
<p>If you have any feedback, or just want to say hi, you are more than welcome to write me an <a href="mailto:cornerbit@protonmail.com">e-mail</a> or <a href="https://twitter.com/intent/tweet?screen_name=tustanowskik&amp;ref_src=twsrc%5Etfw">tweet to @tustanowskik</a>.</p><p>If you want to be up to date and always be the first to know what I’m working on, tap <a href="https://twitter.com/tustanowskik?ref_src=twsrc%5Etfw">follow @tustanowskik on Twitter</a>.</p><p>Thank you for reading!</p><p>[This was first published on my blog]</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=3cac0ab6f399" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>