Gemma 3n models have made the full AI stack possible entirely on mobile!
Written by Georgios Soloupis, AI and Android GDE.
Running complex AI workflows, like converting speech to text (STT), invoking functions, generating visual insights with vision-language models (VLMs), and synthesizing speech (TTS) entirely on mobile devices is no longer a futuristic ideal; it’s becoming a practical reality. This shift signals more than just a technical achievement; it represents a new era of AI accessibility, privacy, and responsiveness. As user expectations evolve and data privacy concerns deepen, moving AI processing on-device avoids cloud dependency, reduces latency, and keeps sensitive data where it originates. It’s a powerful step toward democratizing AI, making it faster, safer, and more personal. For developers and decision-makers alike, this isn’t just about optimization; it’s about reshaping the way we interact with intelligent systems in everyday life.
In this blog post, we explore how to run the full AI stack entirely on a mobile device, covering Speech-to-Text (STT), function calling, vision-language model (VLM) inference, and Text-to-Speech (TTS) in a single Android application. With the release of Gemma 3n, which now supports image input, and the latest update to the MediaPipe library enabling local function calling, fully on-device AI workflows are no longer aspirational; they’re here. We’ll walk through each component step by step, showing how all these tasks can be executed in just a few seconds on mobile hardware, delivering performance that’s surprisingly competitive with traditional cloud-based pipelines.
Speech-to-Text (STT)
Interacting with mobile devices today is more intuitive than ever, often starting with simple voice commands. One of the most effective open-source models for handling speech input is Whisper by OpenAI. Available for over two years now, Whisper has proven to be both reliable and versatile, capable of handling tasks ranging from transcription to translation. It comes in various model sizes to suit different performance needs, and for this project we opted for the lightweight Tiny English version, which is perfectly suited for on-device execution without compromising accuracy for everyday use.
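Whisper expects 16 kHz, mono, 16-bit PCM audio as input. The recording step isn’t shown in this post, but a minimal capture sketch, assuming the standard Android AudioRecord API and using an illustrative helper name, could look like this:

import android.annotation.SuppressLint
import android.media.AudioFormat
import android.media.AudioRecord
import android.media.MediaRecorder

// Hypothetical helper: records raw 16 kHz mono PCM samples suitable for Whisper.
// The RECORD_AUDIO permission must already be granted when this is called.
@SuppressLint("MissingPermission")
fun recordPcm(durationSeconds: Int = 5): ShortArray {
    val sampleRate = 16_000 // Whisper models are trained on 16 kHz audio
    val minBuffer = AudioRecord.getMinBufferSize(
        sampleRate,
        AudioFormat.CHANNEL_IN_MONO,
        AudioFormat.ENCODING_PCM_16BIT
    )
    val recorder = AudioRecord(
        MediaRecorder.AudioSource.MIC,
        sampleRate,
        AudioFormat.CHANNEL_IN_MONO,
        AudioFormat.ENCODING_PCM_16BIT,
        minBuffer
    )
    val samples = ShortArray(sampleRate * durationSeconds)
    recorder.startRecording()
    var offset = 0
    while (offset < samples.size) {
        val read = recorder.read(samples, offset, samples.size - offset)
        if (read <= 0) break
        offset += read
    }
    recorder.stop()
    recorder.release()
    return samples
}

In the app itself, the recorded audio ends up as a WAV file whose path is handed to the native layer shown below.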
Within the mobile application, you’ll find a complete implementation of Whisper, from loading the .bin vocabulary file to initializing the TensorFlow Lite Interpreter for on-device inference. The model is fed a Mel spectrogram input, which is computed in C++ for faster and more efficient execution. This setup ensures both speed and accuracy, making real-time transcription feasible directly on mobile hardware:
JNIEXPORT jfloatArray JNICALL
Java_com_example_jetsonapp_whisperengine_WhisperEngine_transcribeFileWithMel(JNIEnv *env,
                                                                             jobject thiz,
                                                                             jlong nativePtr,
                                                                             jstring waveFile,
                                                                             jfloatArray filtersJava) {
    talkandexecute *engine = reinterpret_cast<talkandexecute *>(nativePtr);
    const char *cWaveFile = env->GetStringUTFChars(waveFile, NULL);

    // Step 1: Get the native array from jfloatArray
    jfloat *nativeFiltersArray = env->GetFloatArrayElements(filtersJava, NULL);
    jsize filtersSize = env->GetArrayLength(filtersJava);

    // Step 2: Convert the native array to std::vector<float>
    std::vector<float> filtersVector(nativeFiltersArray, nativeFiltersArray + filtersSize);

    // Release the native array
    env->ReleaseFloatArrayElements(filtersJava, nativeFiltersArray, JNI_ABORT);

    // Call the engine method to transcribe the file and get the result as a vector of floats
    std::vector<float> result = engine->transcribeFileWithMel(cWaveFile, filtersVector);
    env->ReleaseStringUTFChars(waveFile, cWaveFile);

    // Convert the result vector to a jfloatArray
    jfloatArray resultArray = env->NewFloatArray(result.size());
    env->SetFloatArrayRegion(resultArray, 0, result.size(), result.data());

    return resultArray;
}
You can follow the initialization process and see how the Whisper model operates directly within the Android application here.
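Initializing the interpreter itself follows the standard TensorFlow Lite pattern. As a rough sketch, assuming whisper.tflite sits in the assets folder (as in this project) and using an illustrative helper name, the model can be memory-mapped and loaded like this:

import android.content.Context
import org.tensorflow.lite.Interpreter
import java.io.FileInputStream
import java.nio.MappedByteBuffer
import java.nio.channels.FileChannel

// Illustrative loader, not the app's exact code: memory-maps whisper.tflite
// from the assets folder and creates the TensorFlow Lite Interpreter.
fun loadWhisperInterpreter(context: Context, assetName: String = "whisper.tflite"): Interpreter {
    val fd = context.assets.openFd(assetName)
    val modelBuffer: MappedByteBuffer = FileInputStream(fd.fileDescriptor).channel.map(
        FileChannel.MapMode.READ_ONLY,
        fd.startOffset,
        fd.declaredLength
    )
    val options = Interpreter.Options().apply { setNumThreads(4) } // CPU threads; tune per device
    return Interpreter(modelBuffer, options)
}

The mel spectrogram returned by the native method above is then fed to this interpreter, and the resulting tokens are mapped back to text with the vocabulary loaded from the .bin file.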
Function calling (FC)
MediaPipe recently released a guide for implementing function calling directly on mobile devices! Alongside the documentation, they’ve also provided a GitHub example that anyone can build and deploy on an Android phone. This quickstart demo leverages the LLM Inference API with Hammer 2.1 (1.5B), offering a streamlined way to get started. Here’s a look at the basic initialization and usage of the task:
private fun createGenerativeModel(): GenerativeModel {
    val getCameraImage = FunctionDeclaration.newBuilder()
        .setName("getCameraImage")
        .setDescription("Function to open the camera")
        .build()
    val openPhoneGallery = FunctionDeclaration.newBuilder()
        .setName("openPhoneGallery")
        .setDescription("Function to open the gallery")
        .build()
    val tool = Tool.newBuilder()
        .addFunctionDeclarations(getCameraImage)
        .addFunctionDeclarations(openPhoneGallery)
        .build()
    val formatter =
        HammerFormatter(ModelFormatterOptions.builder().setAddPromptTemplate(true).build())
    val llmInferenceOptions = LlmInferenceOptions.builder()
        // hammer2.1_1.5b_q8_ekv4096.task
        // gemma-3n-E2B-it-int4.task
        .setModelPath("/data/local/tmp/Hammer2.1-1.5b_seq128_q8_ekv1280.task")
        .setMaxTokens(512)
        .apply { setPreferredBackend(Backend.GPU) }
        .build()
    val llmInference =
        LlmInference.createFromOptions(context, llmInferenceOptions)
    val llmInferenceBackend =
        LlmInferenceBackend(llmInference, formatter)
    val systemInstruction = Content.newBuilder()
        .setRole("system")
        .addParts(
            Part.newBuilder()
                .setText("You are a helpful assistant that will open the camera or the phone gallery.")
        )
        .build()
    val model = GenerativeModel(
        llmInferenceBackend,
        systemInstruction,
        listOf(tool).toMutableList()
    )
    return model
}
....
val chat = generativeModel?.startChat()
val response = chat?.sendMessage(userPrompt.value)
Log.v("function", "Model response: $response")
if (response != null && response.candidatesCount > 0 && response.getCandidates(0).content.partsList.size > 0) {
    val message = response.getCandidates(0).content.getParts(0)
    // If the message contains a function call, execute the function.
    if (message.hasFunctionCall()) {
        val functionCall = message.functionCall
        // Call the appropriate function.
        when (functionCall.name) {
            "getCameraImage" -> {
                Log.v("function", "getCameraImage")
                _cameraFunctionTriggered.value = true
                updateJetsonIsWorking(false)
            }
            "openPhoneGallery" -> {
                Log.v("function", "openPhoneGallery")
                _phoneGalleryTriggered.value = true
                updateJetsonIsWorking(false)
            }
            else -> {
                Log.e("function", "no function to call")
                withContext(Dispatchers.Main) {
                    Toast.makeText(
                        context,
                        "No function to call, say something like \"open the camera\"",
                        Toast.LENGTH_LONG
                    ).show()
                }
                updateJetsonIsWorking(false)
            }
        }
....
Once the task successfully identifies a function, it triggers the corresponding action, for example, opening the camera. You can see more in the JetsonViewModel.kt file of the application. The user can then interact with the camera API to capture an image, which is seamlessly passed along in the workflow to the Vision-Language Model (VLM) for further processing and insight generation.
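On the UI side, one way to react to that trigger is to observe the ViewModel state from Compose and launch the system camera through the Activity Result API. The snippet below is a simplified sketch rather than the app’s exact code; cameraFunctionTriggered and onImageCaptured are assumed names:

import androidx.activity.compose.rememberLauncherForActivityResult
import androidx.activity.result.contract.ActivityResultContracts
import androidx.compose.runtime.Composable
import androidx.compose.runtime.LaunchedEffect
import androidx.compose.runtime.collectAsState
import androidx.compose.runtime.getValue

// Illustrative Compose glue: when the ViewModel flips its camera trigger,
// launch the camera and hand the captured Bitmap back for the VLM step.
@Composable
fun CameraTrigger(viewModel: JetsonViewModel) {
    val triggered by viewModel.cameraFunctionTriggered.collectAsState()

    val launcher = rememberLauncherForActivityResult(
        ActivityResultContracts.TakePicturePreview()
    ) { bitmap ->
        // Pass the captured image into the VLM pipeline (assumed ViewModel method).
        bitmap?.let { viewModel.onImageCaptured(it) }
    }

    LaunchedEffect(triggered) {
        if (triggered) launcher.launch(null)
    }
}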
Vision-Language Model (VLM)
The long-awaited Gemma 3n models have finally been released, available in two variants, E2B and E4B, both featuring vision support. While several companies and developers have previously introduced their own vision-language models, this article explores why Gemma stands out as a compelling choice for production use, focusing on performance, efficiency, and safety.
MediaPipe was the first library to support the Gemma 3n models, with execution available from day one. With built-in options to run models on either CPU or GPU, and its signature high-level APIs, integrating Gemma 3n into the Android project was smooth and straightforward, delivering reliable performance without any unexpected hurdles.
private fun createSession(context: Context): LlmInferenceSession {
    // Configure inference options and create the inference instance
    val options = LlmInferenceOptions.builder()
        .setModelPath("/data/local/tmp/gemma-3n-E2B-it-int4.task")
        .setMaxTokens(1024)
        .setPreferredBackend(Backend.GPU)
        .setMaxNumImages(1)
        .build()
    val llmInference = LlmInference.createFromOptions(context, options)

    // Configure session options and create the session
    val sessionOptions = LlmInferenceSession.LlmInferenceSessionOptions.builder()
        .setTopK(40) // Default
        .setTopP(0.9f)
        .setTemperature(1.0f)
        .setGraphOptions(GraphOptions.builder().setEnableVisionModality(true).build())
        .build()
    return LlmInferenceSession.createFromOptions(llmInference, sessionOptions)
}
....
val mpImage = BitmapImageBuilder(bitmap).build()
session?.addQueryChunk(userPrompt.value + " in 20 words") // Limit if you do not want a vast output.
session?.addImage(mpImage)
var stringBuilder = ""
session?.generateResponseAsync { chunk, done ->
    updateJetsonIsWorking(false)
    stringBuilder += chunk
    // Log.v("image_partial", "$stringBuilder $done")
    updateVlmResult(transcribedText.trim() + "\n\n" + stringBuilder)
....
You can follow along with the code in this file.
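If you prefer coroutines over callbacks, the streaming call can be wrapped in a Kotlin Flow so the UI simply collects partial results. This is an optional convenience sketch built on the same generateResponseAsync API, not something the app or MediaPipe provides out of the box:

import com.google.mediapipe.tasks.genai.llminference.LlmInferenceSession
import kotlinx.coroutines.channels.awaitClose
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.callbackFlow

// Streams partial responses as they arrive and completes when generation is done.
fun LlmInferenceSession.responseFlow(prompt: String): Flow<String> = callbackFlow {
    addQueryChunk(prompt)
    generateResponseAsync { chunk, done ->
        trySend(chunk)
        if (done) close()
    }
    // Keep the flow open until the generation finishes or the collector cancels.
    awaitClose { }
}

Collecting this flow from a viewModelScope coroutine then replaces the manual stringBuilder bookkeeping shown above.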
Text-To-Speech (TTS)
While the output of the Vision-Language Model (VLM) is text, this project takes it a step further by adding offline Text-to-Speech (TTS) capabilities, enabling full voice responses without requiring an internet connection. A key intermediary step involves detecting the language of the VLM’s output. This ensures the system can dynamically load the appropriate pronunciation, especially when the output is in a language other than English. With Gemma 3 and 3n offering built-in support for over 140 languages, this feature lays the groundwork for seamless multilingual voice interaction in future versions. The language detection is carried out with ML Kit’s language identification API.
private val languageIdentifier = LanguageIdentification.getClient()

private fun speakOut(text: String) {
    val defaultLocale = Locale("en")
    languageIdentifier.identifyLanguage(text)
        .addOnSuccessListener { languageCode ->
            val locale = if (languageCode == "und") defaultLocale else Locale(languageCode)
            textToSpeech.setLanguage(locale)
            // Log.v("available_languages", textToSpeech.availableLanguages.toString())
            textToSpeech.speak(text, TextToSpeech.QUEUE_ADD, null, "speech_utterance_id")
        }
        .addOnFailureListener {
            textToSpeech.setLanguage(defaultLocale)
            textToSpeech.speak(text, TextToSpeech.QUEUE_FLUSH, null, "speech_utterance_id")
        }
}
You can learn more about the TextToSpeech Android API here.
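One detail the snippet above leaves out is creating the engine itself: TextToSpeech must be constructed with an OnInitListener and is only usable after initialization succeeds. A minimal, illustrative wrapper (the class name is an assumption) could look like this:

import android.content.Context
import android.speech.tts.TextToSpeech
import java.util.Locale

// Illustrative wrapper: creates the engine and only speaks after successful init.
class Speaker(context: Context) {
    private lateinit var textToSpeech: TextToSpeech
    private var ready = false

    init {
        textToSpeech = TextToSpeech(context) { status ->
            if (status == TextToSpeech.SUCCESS) {
                textToSpeech.language = Locale("en")
                ready = true
            }
        }
    }

    fun speak(text: String) {
        if (ready) textToSpeech.speak(text, TextToSpeech.QUEUE_FLUSH, null, "speech_utterance_id")
    }

    fun shutdown() = textToSpeech.shutdown()
}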
Check out a short demo video recorded on a Samsung Galaxy S24 with 12 GB of RAM, which allows both the function-calling and VLM inference models to run smoothly on the GPU:
Summary of the models used in this project:
- Whisper Tiny English model for Speech-to-Text (Download vocab.bin and whisper.tflite and put them in the assets folder)
- Hammer 2.1 1.5B for function calling (Download)
- Gemma 3n E2B for VLM (Download)
- ML Kit language identification API
You can build the project directly from this branch.
Conclusion
This project demonstrates just how far on-device AI has come, bringing together Speech-to-Text, Function Calling, Vision-Language Modeling, and Text-to-Speech in a single mobile application. With open-source models like Whisper, Hammer, and Gemma 3n, and frameworks like MediaPipe and ML Kit, it’s now possible to deliver intelligent, multimodal experiences entirely offline. Beyond showcasing technical feasibility, this end-to-end mobile AI stack emphasizes real-world impact: reducing latency, enhancing privacy, and enabling richer, voice-driven interactions, all without relying on the cloud. As AI becomes more personal and embedded in our daily devices, building locally-executed solutions is no longer an edge case.
#AISprint Google Cloud credits are provided for this project.