From Idea to Impact: Flo’s Search Evolution and Learning

Kseniya Yurevich
Flo Health UK
Feb 13, 2024

How we experimented with search in Flo. By Kseniya Yurevich, a Product Manager, and Aleksandr Popitich, a Software Engineer.

A good search is vital for a product: it allows users to quickly find answers to their questions and fully discover the product. But with all the benefits come challenges: How do we make sure the search feature lets users find answers smoothly? How do we make the search results relevant and intuitive? How do we make sure the results actually solve users’ problems?

At Flo, we’ve faced it all. Flo’s search never had a massive active audience compared to other product features. We’ve struggled a lot with identifying the uniqueness of search as a Flo feature. Why would people search for “irregular periods” in the app if they can search the web for it? Even though we saw the potential, search improvement initiatives constantly lost out to other priorities.

But then LLMs entered the market, and all eyes turned to artificial intelligence (AI). We saw how we could leverage this industry trend to revive search as a feature.

Revamping search in Flo

Imagine you want to understand why you feel a certain way, what’s going on, and if it’s normal. You open your health app, tap search, and type in your question. You see an endless list of articles, all of them more or less relevant. But finding an answer to your question — “Is it normal?” — still requires you to open those articles and spend time reading them. In today’s world, where we want results in the blink of an eye, only an incredibly engaged user would do that.

There’s also a funnel problem. As a product manager, you need to ensure that users find the answers to their questions in as few steps as possible. By offering our users more articles to read, we’re giving them more to sort through and potentially getting them further away from their answer, lengthening the funnel.

Flo offers thousands of articles created by health experts worldwide. We’ve recognized our unique advantage: by providing answers directly from our content in fewer steps, users can access well-researched, expert-backed information for their health questions. This also prepares our users for informed discussions with their healthcare providers.

After a few iterations with the design team, we ended up with a simple but elegant solution: The user types in their query, and the app shows an answer at the top of the search results screen. The app picks the most relevant piece from our own content and displays it there. The best part is that all answers are medically accurate, as they originate from our expert-created content.

We also wanted to tackle another problem: Sometimes users’ queries aren’t specific enough, so they don’t get an answer to their actual question. For example, based on our data, we know that a typical user will type in something like “ovulation” — a short keyword. Based on such a query, our search will provide a definition of ovulation. However, the user might have a more specific question in mind; for example, “How do I know when I’m ovulating?”, “What happens when I ovulate?”, or “Can I get pregnant if I’m ovulating?” To help users easily find the answers to all of their questions, we decided to show hints with potential questions they can ask, right under the answer.

When discussing the road map and solution, we also kicked off our collaboration with the medical, legal, and privacy departments. As a female health app, we needed to ensure that our users get only 100% credible and safe information in their search results. It was crucial to guarantee that these answers would not pose any medical risks by providing inaccurate or dangerous information.

Once everyone was on board, we started to act. At Flo, we strive to make data-driven decisions. When developing any feature, we usually start with something small and simple. Then, through a series of iterations, we consistently work to enhance the solution, ensuring it becomes more valuable and user friendly. In the following paragraphs, we have outlined our journey toward launching the experiment.

AI health search delivery

In the women’s well-being field, having accurate and trustworthy information is crucial. Keeping this in mind, we tested different prototypes using large language models (LLMs) and basic retrieval-augmented generation (RAG) and concluded that the straightforward LLM-based solutions didn’t meet our quality standards at this stage. While creating a prototype is quick, ensuring its accuracy and safety for production is challenging and time-consuming. So, as the very first step, we chose a simpler and less risky solution that could still provide value to users and help us validate the product hypothesis.

One vital requirement for reducing medical risks was to rely solely on Flo’s content when providing answers to user queries. We intentionally chose not to use an LLM and instead decided to leverage featured snippets extracted from existing Flo articles.

A featured snippet is a coherent, independent, and meaningful piece of text sufficient to be considered an answer to a limited set of questions. Snippets are extracted from all existing Flo articles and annotated by calculating embeddings. An embedding, in this context, refers to a dense vector (list) of floating-point numbers. The distance between two vectors serves as a measure of their relatedness. When a user performs a search, the top snippet with the highest semantic similarity (or the lowest semantic distance) is selected and displayed at position zero.
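
For illustration, a minimal sketch of this retrieval step might look like the following. The `embed()` helper is a placeholder for whatever sentence-embedding model is used (it is not a reference to Flo’s actual stack), and relatedness is measured here with cosine distance:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder for a sentence-embedding model call; it should return
    a dense vector of floating-point numbers for the given text."""
    raise NotImplementedError

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Semantic distance between two embeddings: lower means more related."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_snippet(query: str, snippets: list[str], snippet_vectors: list[np.ndarray]):
    """Return the snippet closest to the query, plus its semantic distance."""
    query_vector = embed(query)
    distances = [cosine_distance(query_vector, v) for v in snippet_vectors]
    best = int(np.argmin(distances))
    return snippets[best], distances[best]

# Snippet embeddings are computed once, offline, when articles are processed:
# snippet_vectors = [embed(s) for s in snippets]
```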

We tested multiple approaches to snippet extraction. These were the two most successful:

  • Semantic sentence analysis and grouping: This involves splitting the entire text into sentences, calculating embeddings for each sentence, constructing a similarity matrix, and identifying split points by detecting local minima (places in the text where the similarity between neighboring sentences changes significantly). A rough sketch of both approaches follows this list.
(Figure: the X-axis corresponds to ordinal sentence numbers, the Y-axis represents a weighted sum of similarities, and each vertical line signifies a splitting point.)
  • Splitting text using markdown paragraphs: The underlying assumption was that our content editors already separate meaningful sections of text into paragraphs when creating articles.
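
Here is a rough, simplified sketch of both extraction approaches. It reuses the placeholder `embed()` and `cosine_distance()` helpers from the previous sketch, and the plain local-minimum rule stands in for the weighted-sum scoring shown in the plot above:

```python
def split_by_paragraphs(markdown_text: str) -> list[str]:
    """Approach 2: treat each markdown paragraph as a candidate snippet."""
    return [p.strip() for p in markdown_text.split("\n\n") if p.strip()]

def split_points(sentences: list[str]) -> list[int]:
    """Approach 1: find boundaries where the similarity between neighboring
    sentences drops to a local minimum."""
    vectors = [embed(s) for s in sentences]  # embed() from the sketch above
    # Similarity between each sentence and the one that follows it
    sims = [1.0 - cosine_distance(vectors[i], vectors[i + 1])
            for i in range(len(vectors) - 1)]
    points = []
    for i in range(1, len(sims) - 1):
        # Local minimum: similarity lower than both of its neighbors
        if sims[i] < sims[i - 1] and sims[i] < sims[i + 1]:
            points.append(i + 1)  # split before sentence i + 1
    return points

def group_sentences(sentences: list[str]) -> list[str]:
    """Join sentences between consecutive split points into snippets."""
    bounds = [0] + split_points(sentences) + [len(sentences)]
    return [" ".join(sentences[a:b]) for a, b in zip(bounds, bounds[1:])]
```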

A straightforward split by paragraphs yielded better results. The semantic grouping approach was not consistent; in some cases, it led to snippets with missing or redundant (irrelevant) sentences at the beginning or end.

It’s important to understand that Flo articles might not have answers for every user question. To manage this, we use a similarity threshold. If the closest match to a user’s question isn’t close enough, we don’t give an answer (featured snippet) and just show regular search results (a list of relevant articles).

However, later on, we realized that a static threshold doesn’t work well for very short or very long queries. To tackle this issue, we introduced a threshold as a function of query length with the following assumptions:

  • If users enter a short query (consisting of one or two keywords), their primary motivation may not be a specific health problem but a general interest in a particular topic (e.g., sex, pregnancy, polycystic ovary syndrome, etc.). Therefore, it’s OK to relax the threshold and display more generic snippets.
  • Conversely, when users input longer questions, they are more than likely attempting to describe a specific problem and expect a highly relevant answer. In such cases, we need to strengthen the threshold to reduce instances of showing irrelevant or partially relevant snippets.
(Figure: the relevance threshold as a function of query length. The X-axis denotes the number of words (tokens) in the incoming query, and the Y-axis represents semantic distance, where a lower value indicates higher relevance.)
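
A hedged sketch of this idea is below. The constants and the linear interpolation are illustrative placeholders rather than our production values; the point is simply that the allowed semantic distance shrinks as the query grows longer:

```python
def distance_threshold(num_tokens: int,
                       relaxed: float = 0.45,
                       strict: float = 0.25,
                       long_query: int = 8) -> float:
    """Maximum semantic distance allowed for showing a featured snippet.
    Short queries get the relaxed threshold, long queries the strict one;
    the constants here are illustrative placeholders, not production values."""
    if num_tokens <= 2:
        return relaxed
    if num_tokens >= long_query:
        return strict
    # Linear interpolation between the relaxed and strict thresholds
    t = (num_tokens - 2) / (long_query - 2)
    return relaxed + t * (strict - relaxed)

def maybe_featured_snippet(query: str, snippets, snippet_vectors):
    """Show a snippet only if its distance clears the length-aware threshold;
    otherwise fall back to the regular list of search results."""
    # top_snippet() comes from the retrieval sketch earlier in the article
    best_snippet, best_distance = top_snippet(query, snippets, snippet_vectors)
    if best_distance <= distance_threshold(len(query.split())):
        return best_snippet
    return None
```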

Clearly, the responses provided by this solution cannot be compared to LLM-based responses. In some instances, the responses may appear more like statements of fact rather than direct answers to the questions. For instance, when a user asks, “Can green discharge be a sign of infection?” the response might be, “Green discharge is usually a sign of infection,” instead of providing a more direct answer like, “Yes, it’s likely a sign of infection. Please consult a doctor for further guidance.”

Furthermore, since we use only the single most relevant snippet, there may be instances where the answer doesn’t comprehensively address the question. Although we could consider using the top N results, it’s a challenging task to seamlessly assemble an answer from multiple independent pieces of text without resorting to text generation.

Moreover, pure featured snippets, lacking generative AI, fall short in delivering a high level of personalization. There are instances where we may need to address the same question differently based on the user’s characteristics. For instance, the response to the query, “Why do I have pain?” might vary for an 18-year-old compared to a 65-year-old.

Despite these limitations, featured snippets provided a solid starting point for testing our hypothesis on real users. They were also less risky with regard to users potentially perceiving the response as personalized advice.

Next steps

With the experiment launched, we clearly see the next steps and improvements we can make from both product and technical perspectives.

  • In the case of search, it is not enough to simply deliver better search features; product managers should also make sure users can easily find where to ask their questions and can easily interact with the feature itself. That means thinking about the funnel leading into the feature, not just the feature alone. As a next step, we want to improve our entry points for search, making them more visible, engaging, and intuitive for users. We believe this will let our users find answers to their questions more quickly and, for us as a business, increase the number of active users of this feature as well as its engagement metrics.
  • We are also considering launching a promo campaign in the app and during onboarding to introduce users to the new search feature and demonstrate its value to them.
  • As for technical improvements, a clear next step is to introduce text generation so we can provide direct and comprehensive answers. To ensure the answers are correct, one idea is to add an online assessor that reviews the language model’s output. This assessor would check how accurate the answers are and reject ones that don’t make sense or seem made up.
  • Manual medical assessments can be time-consuming and labor-intensive. Therefore, in addition to manual testing, it is crucial to establish an automated quality feedback loop. A potential way to proceed is by using human-verified, LLM-based evaluations, for example with tools like Ragas, Evals, TruLens, or others. This way, we can promptly estimate the extent of improvement or regression when making changes, whether that means experimenting with the LLM prompt, adjusting the context size, or tweaking a relevancy threshold. The sooner and the more comprehensive the feedback we receive, the faster we can iterate and achieve higher-quality results. A rough sketch of such a check follows this list.
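
To make the idea concrete, here is a hypothetical sketch of such a check. The prompt, the `call_llm()` helper, and the 1-to-5 scoring scheme are assumptions for illustration, not the tooling we actually ship:

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a call to whichever LLM is used as the judge."""
    raise NotImplementedError

JUDGE_PROMPT = """You are a strict reviewer of medical content.
Question: {question}
Source snippet: {snippet}
Candidate answer: {answer}
Rate from 1 to 5 how faithfully the answer is supported by the source snippet,
where 5 means fully supported and 1 means contradicted or made up.
Reply with the number only."""

def faithfulness_score(question: str, snippet: str, answer: str) -> int:
    """Ask the judge model how well the answer sticks to the source snippet."""
    reply = call_llm(JUDGE_PROMPT.format(
        question=question, snippet=snippet, answer=answer))
    return int(reply.strip())

def accept_answer(question: str, snippet: str, answer: str, cutoff: int = 4) -> bool:
    """Reject generated answers that are not clearly grounded in the source content."""
    return faithfulness_score(question, snippet, answer) >= cutoff
```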
