FoodScrap: Speech-Enabled Food Journaling

How speech input can help food journaling.

Yuhan Luo
Sparks of Innovation: Stories from the HCIL
7 min read · May 18, 2021


Photo by Dan Gold on Unsplash.

Yuhan Luo, Young-Ho Kim, Bongshin Lee, Naeemul Hassan, and Eun Kyoung Choe

Food journaling can be done in many ways, such as searching for the name of a food in existing databases, taking a photo of the meal, scanning barcodes, or wearing sensors that automatically detect eating episodes. While most of these approaches focus on capturing calorie and nutrient information, recent research has highlighted the importance of capturing broader eating contexts (such as social environment, mood, and feelings) to foster reflection on one’s food decisions.

Food decision-making is an individualized, multilevel, and context-dependent process involving time, location, food preparation, and social activities. As such, it is not realistic to capture “unified” factors that influence everyone’s food decisions through automated approaches. Manual approaches (e.g., text input), on the other hand, allow flexible descriptions of food practice, but they can impose a high data capture burden.

How About Using Speech Input?

In this study, we are interested in exploring how speech input can support capturing people’s everyday food practice. This idea is partly inspired by the growing popularity of speech input in our daily life, especially with the introduction of embedded voice assistants in mobile phones, such as Google Assistant and Siri. More importantly, we see the potential of speech input to facilitate food journaling in two aspects:

  1. People speak faster than they type. By capturing the same amount of information in less time, speech input can lower the data capture burden.
  2. People tend to be expressive when they talk, so speech input can help collect rich details that might otherwise be overlooked in traditional manual typing.

Figure 1. Statistics on the usage of mobile voice assistants in recent years.


To examine how speech input supports capturing everyday food practice, we created a food journaling app called FoodScrap using OmniTrack Research, a research platform for creating and deploying mobile self-tracking apps.

FoodScrap captures food components, preparation methods, and food decisions in free-form audio recordings. In particular, four guided prompts (Q5 ~ Q8) ask why people decide when to eat, what to eat, when they make the decision, and how much to eat, which are key questions in examining the multifaceted nature of food decision-making.

Figure 2. A screenshot of the data capture interface of FoodScrap.

One-week Data Collection Study

We conducted a one-week remote data collection study to examine the data richness and data capture burden with FoodScrap. Eleven participants from diverse food cultures (e.g., African, American, Asian, European) took part, with specific eating goals such as losing weight and developing mindful eating habits. At the end of the study, each participant filled out a user burden scale questionnaire and joined a debriefing interview to share their study experience.

Data Richness

Throughout the study, we found that participants provided rich details regarding food components and preparation methods, including dish names, ingredient types and items, spices and sauces, and preparation types and procedures.

A typical transcribed entry looks like this:

“My breakfast today is a homemade piece of focaccia bread that I baked on Saturday. It has green olives betta cheese, and is made with classic bread ingredients like flour, yeast, saltwater, and olive oil. Yeah, and then there is a small cup of ranch dressing probably a couple of tablespoons of ice water with a splash of grapefruit juice in the bottle. Ever since the COVID-19 lockdown, I’ve been trying to bake more foods. And it’s been rather enjoyable.” (P8, Day 1, Breakfast)

Figure 3. Breakdown of details in food components and preparation methods that participants provided in Q4.

In addition, participants elaborated on their responses to the food decision questions by describing their eating moments, explaining their eating strategies, and assessing themselves. For example, instead of simply saying “I chose the food because it’s healthy,” participants explained why they thought the food was healthy:

“They are portion controlled, only 90 calories, and sweet. And they’ve got nutrition like fiber, so it’s a good snack.” (P10, Day 5, Snack)

Occasionally, participants judged the healthiness of their food or compared their current meal with meals they had eaten earlier:

“I’ve been eating a lot of junk [food] yesterday so I thought I had to keep it a little [more] fresh for sustainability and health. So I bought these veggies instead of eating those foods in my pantry for a long time.” (P7, Day 3, Lunch)

Figure 4. The ways that participants elaborated on their food decisions in responding to Q5 ~ Q8.

Data Capture Burden

The results of the burden scale showed that participants’ perceived data capture burden was relatively low (less than 1 on a scale of 0: “No burden at all” or “Never (felt it burdensome)” to 4: “Extremely burdensome” or “All of the time (it was burdensome)”). During the interviews, they also acknowledged that speech input was easy and fast. However, participants raised several concerns regarding speech-enabled food journaling, including:

  1. Re-recording effort: When participants lost their train of thought, they needed to re-record the entire audio.
  2. Mental load in organizing the responses: Some of the questions in FoodScrap were difficult to answer, so articulating a response imposed extra mental load on participants.
  3. Social and environmental constraints: Participants sometimes felt embarrassed talking to their phone in public settings, and needed to stay in quiet spaces to make sure their speech input was captured clearly.
  4. Privacy concerns: Participants believed that “voice is more identifiable than text,” and did not want their food practice to be judged.

Reflection With Guided Prompts and Speech Input

Although we deployed FoodScrap mainly as a data collection tool, we were surprised to find that participants used it as a tool for self-reflection. This was partly because the guided prompts (Q5 ~ Q8) made participants think more about food decisions they had rarely considered before. Participants also highlighted that responding to the prompts with speech input facilitated such reflection:

“I’m extremely outgoing and I’m very verbal. Even though I was talking into an electronic [phone], I feel like interacting with people, so it made me want to talk more. I feel more accountable, you know, to explain my food [decisions], to really think about it, like why am I eating this now.” (P10)

In addition, participants had mixed reactions to listening to their audio recordings: some never listened to their recordings because they did not like their own voices, while others replayed their recordings to reflect on their past eating episodes.

Lessons Learned and Future Work

This study showed that in a food journaling context, speech input is promising for lowering the data capture burden while promoting situated reflection. However, we need to consider how to process and present the speech input data so that they can be useful for self-trackers, healthcare providers, and researchers. Here, we describe three opportunities for future research.

Effectively processing the speech data for healthcare use. In dietary assessment, nutritionists and dietitians often employ long questionnaires to collect details about patients’ food intake. With the advances in natural language processing (NLP), we can extract food-related information (e.g., food group, portion size, ingredients) from the transcribed text and support sorting and filtering the information based on providers’ needs.
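To make the extraction idea concrete, here is a minimal sketch in Python. It is not the approach used in the study: it assumes a tiny, hypothetical lexicon (`FOOD_GROUPS`) and simple keyword matching in place of a real NLP pipeline, and the function names and sample entry are made up for illustration.

```python
import re

# Hypothetical mini-lexicon mapping food terms to food groups (illustrative only).
FOOD_GROUPS = {
    "bread": "grains",
    "flour": "grains",
    "olives": "vegetables",
    "cheese": "dairy",
    "olive oil": "fats",
    "grapefruit juice": "fruits",
}

def extract_foods(transcript):
    """Return (term, food_group) pairs found in the transcript, in lexicon order."""
    text = transcript.lower()
    found = []
    for term, group in FOOD_GROUPS.items():
        # Whole-word match so that "olives" does not match inside "olive oil".
        if re.search(r"\b" + re.escape(term) + r"\b", text):
            found.append((term, group))
    return found

def filter_by_group(items, group):
    """Keep only the terms that belong to the requested food group."""
    return [term for term, g in items if g == group]

# Made-up transcript, loosely modeled on a journal entry.
entry = "My breakfast today is a homemade piece of focaccia bread made with flour, yeast, and olive oil."
items = extract_foods(entry)
print(items)
print(filter_by_group(items, "grains"))
```

A provider-facing tool would likely replace the keyword lookup with a trained food entity recognizer, but the filtering step could stay much the same.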

Enabling reflection-on-action through feedback. The focus of our study was on the data capture aspect, so we did not provide any feedback beyond the capability of replaying the audio. To fully support a reflective food journaling experience, we can present common factors influencing one’s food decisions as a text summary or word cloud, while enabling more efficient audio searching.
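As a rough illustration of how such a summary might be computed, the sketch below counts the most frequent non-trivial words across transcribed responses, which is the kind of input a word cloud needs. The stopword list, function name, and sample responses are all hypothetical and are not data from the study.

```python
from collections import Counter
import re

# Small hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = {"i", "the", "a", "it", "so", "and", "was", "because",
             "to", "is", "my", "this", "had", "of"}

def top_factors(responses, n=3):
    """Return the n most frequent non-stopword terms across all responses."""
    words = []
    for response in responses:
        tokens = re.findall(r"[a-z']+", response.lower())
        words.extend(t for t in tokens if t not in STOPWORDS)
    return Counter(words).most_common(n)

# Made-up responses to a food decision prompt.
responses = [
    "I was hungry so I had a quick snack",
    "It was quick and healthy",
    "I chose this because it is healthy and quick",
]
print(top_factors(responses, n=2))
```

The resulting term counts could feed a word cloud directly, or serve as an index for jumping to the recordings that mention a given factor.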

Supporting data capture in varying contexts leveraging multimodal input. To address the constraints that come with speech input, we can leverage multimodal input combining speech, text, and photo across multiple devices (e.g., smartphones, smart speakers, wearable devices, wireless earphones) so that people can choose when to use which input modality. For example, in a privacy-sensitive situation, people may choose text input on a smartphone; at home where the smartphone is not close by, people can use speech input on a smart speaker or wearable devices with hands-free interaction.


This research was funded in part by the National Science Foundation (#1753452) and the Research Improvement Grant from the College of Information Studies at the University of Maryland, College Park.

Full citation:

  • Yuhan Luo, Young-Ho Kim, Bongshin Lee, Naeemul Hassan, and Eun Kyoung Choe. 2021. FoodScrap: Promoting Rich Data Capture and Reflective Food Journaling Through Speech Input. In Proceedings of the 2021 Conference on Designing Interactive Systems (DIS ’21). ACM.