What Is MediaPipe LLM Inference?
If you’re building an Android app that needs to run large language models (LLMs) on-device, MediaPipe LLM Inference API is one of the most accessible ways to get there. It handles model loading, quantised-model execution, and hardware acceleration so you can focus on the experience — no cloud dependency, no server bills, no data leaving the device.
You might wonder how this compares to Gemini Nano on-device Android AI. Great question. Gemini Nano runs through Android’s built-in AICore framework and is optimised at the OS level, but you’re locked into one model. MediaPipe gives you model flexibility — swap in Gemma or another supported architecture without touching the OS. Both are valuable, and many teams use them together. This guide focuses on MediaPipe.
One important heads-up before we dive in: Google has started recommending migration to LiteRT-LM, a newer on-device LLM inference framework. MediaPipe LLM Inference API still works and is actively maintained, but if you’re starting a brand new project in 2026 you may want to keep an eye on LiteRT-LM as the ecosystem matures. We’ll revisit this at the end of the post.
Supported Models and Their Use Cases
MediaPipe supports a growing list of on-device models. The ones you’ll reach for most often in 2026 are:
- Gemma 3 1B — Google’s smallest Gemma 3 model in 4-bit quantised format. Runs well on mid-range devices, low memory footprint, great for text generation, Q&A, and summarisation. This is the recommended starting point for most apps.
- Gemma-3n E2B / E4B — The Gemma-3n family (effective 2B and 4B parameters) brings multimodal support: these models accept text, images, and audio as input. If your use case involves understanding user photos or voice, Gemma-3n is what you want.
- Phi-2 — Microsoft’s compact model, useful for reasoning-heavy tasks. A solid option if you need strong logic in a small package.
Model files are distributed in .task format from the Google Hugging Face organisation. You download the model separately from your app and either bundle it in assets (for small models) or download it at first launch and cache it in internal storage.
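If you go the download-at-first-launch route, the caching logic can be as simple as checking for the file before fetching it. Here's a minimal sketch; the URL is a placeholder (Gemma downloads from Hugging Face normally require accepting the licence and an access token, so many apps host the file themselves), and it must run on a background dispatcher:
import android.content.Context
import java.io.File
import java.net.URL
// Returns the cached model file, downloading it on first launch.
// Call from a background dispatcher; URL.openStream() blocks.
// Production code should download to a temp file and rename it,
// so an interrupted download doesn't leave a half-written model behind.
fun ensureModelFile(context: Context, modelUrl: String, fileName: String): File {
    val modelFile = File(context.filesDir, fileName)
    if (!modelFile.exists()) {
        URL(modelUrl).openStream().use { input ->
            modelFile.outputStream().use { output -> input.copyTo(output) }
        }
    }
    return modelFile
}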
Setting Up Your Android Project
The correct dependency is tasks-genai — not the older mediapipe-tasks-llm package name you’ll see floating around in outdated tutorials. Add this to your build.gradle.kts:
dependencies {
implementation("com.google.mediapipe:tasks-genai:0.10.27")
implementation("androidx.lifecycle:lifecycle-runtime-ktx:2.8.7")
}
No special storage permission is needed if you save model files to your app’s internal storage directory (context.filesDir). If you download models at runtime you’ll need the INTERNET permission:
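<uses-permission android:name="android.permission.INTERNET" />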
One common mistake: adding READ_EXTERNAL_STORAGE. This permission is deprecated on Android 13+ and is not needed here. Store models in internal app storage and you need no storage permission at all.
Loading and Initialising the Model
The main class is LlmInference — watch out for old examples using the incorrect name LlmInferencer. Model-level configuration (path, max tokens, preferred backend) goes into LlmInference.LlmInferenceOptions:
import com.google.mediapipe.tasks.genai.llminference.LlmInference
// Point MediaPipe at the .task file you bundled or downloaded earlier.
val options = LlmInference.LlmInferenceOptions.builder()
    .setModelPath("${context.filesDir}/gemma-3-1b-it-int4.task")
    .setMaxTokens(1000) // total budget for input + output tokens
    .build()
// Expensive call: this loads the model into memory.
val llmInference = LlmInference.createFromOptions(context, options)
Model loading is expensive — do it once, in your Application class or a singleton managed by DI. Pass the instance into your ViewModel rather than recreating it on configuration changes.
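A minimal sketch of that app-scoped setup, assuming the model file already sits in filesDir under the name used above (the MyApplication class name is illustrative):
import android.app.Application
import com.google.mediapipe.tasks.genai.llminference.LlmInference
class MyApplication : Application() {
    // Created lazily so app startup isn't blocked by model loading;
    // touch this property from a background thread the first time.
    val llmInference: LlmInference by lazy {
        val options = LlmInference.LlmInferenceOptions.builder()
            .setModelPath("${filesDir}/gemma-3-1b-it-int4.task")
            .setMaxTokens(1000)
            .build()
        LlmInference.createFromOptions(this, options)
    }
}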
Creating a Session and Running Inference
This is where the current API differs most from older examples you might find online. Inference is now session-based. You create a LlmInferenceSession from your loaded model, and per-call parameters like temperature and top-K belong in LlmInferenceSession.LlmInferenceSessionOptions — not in the model options:
import com.google.mediapipe.tasks.genai.llminference.LlmInferenceSession
// Sampling parameters live on the session, not on LlmInferenceOptions.
val sessionOptions = LlmInferenceSession.LlmInferenceSessionOptions.builder()
    .setTemperature(0.8f)
    .setTopK(40)
    .setTopP(0.95f)
    .build()
val session = LlmInferenceSession.createFromOptions(llmInference, sessionOptions)
Then add your prompt as a query chunk and call generateResponseAsync with a progress listener:
session.addQueryChunk("Explain quantum computing in simple terms.")
session.generateResponseAsync { partialResult, done ->
appendToOutput(partialResult)
if (done) {
println("Generation complete")
session.close()
}
}
The callback fires for each token as it’s generated — users see text appearing in real time. Once done is true, close the session to release resources. You can call addQueryChunk multiple times before generating, which is how you build multi-turn conversations by interleaving user and model turns as separate chunks.
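For instance, a follow-up question can replay the earlier exchange as chunks on a fresh session before generating again. The plain "User:"/"Model:" labels below are purely illustrative; adapt them to whatever turn formatting works best for your model:
// Replay the conversation so far, then add the new user turn.
session.addQueryChunk("User: What is the capital of France?")
session.addQueryChunk("Model: The capital of France is Paris.")
session.addQueryChunk("User: Roughly how many people live there?")
session.generateResponseAsync { partialResult, done ->
    appendToOutput(partialResult)
    if (done) session.close()
}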
Multimodal Prompting with Gemma-3n
Load a Gemma-3n model and you unlock image and audio inputs alongside text. This opens up use cases like summarising what’s in a photo, describing a chart, or transcribing a voice note — all entirely on-device. Enable the vision modality in your session options:
import com.google.mediapipe.tasks.genai.llminference.GraphOptions
import com.google.mediapipe.tasks.genai.llminference.LlmInferenceSession
val sessionOptions = LlmInferenceSession.LlmInferenceSessionOptions.builder()
    .setTemperature(0.7f)
    .setTopK(40)
    // Vision support is switched on through the session's graph options.
    .setGraphOptions(GraphOptions.builder().setEnableVisionModality(true).build())
    .build()
val session = LlmInferenceSession.createFromOptions(llmInference, sessionOptions)
With vision modality enabled you can pass an image to the session using addImage before calling addQueryChunk. The model then responds in text. Keep in mind this only works with Gemma-3n models — calling setEnableVisionModality(true) with a text-only model like Gemma 3 1B will throw at runtime.
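Here's a short sketch of an image prompt, assuming you already hold a Bitmap (say, from the photo picker) and a Gemma-3n session created with vision enabled as above. Depending on the model build, you may also need to raise the engine's image limit via setMaxNumImages in LlmInferenceOptions:
import com.google.mediapipe.framework.image.BitmapImageBuilder
// Wrap the Android Bitmap in MediaPipe's MPImage container.
val mpImage = BitmapImageBuilder(bitmap).build()
// Attach the image, then the text prompt, and stream the response as usual.
session.addImage(mpImage)
session.addQueryChunk("Describe what's in this photo in one sentence.")
session.generateResponseAsync { partialResult, done ->
    appendToOutput(partialResult)
    if (done) session.close()
}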
ViewModel Integration Pattern
Here’s a production-ready pattern that wires everything together cleanly. Notice how LlmInference is injected (created once at app scope) while sessions are created per prompt — that’s the right mental model:
import androidx.lifecycle.ViewModel
import androidx.lifecycle.viewModelScope
import com.google.mediapipe.tasks.genai.llminference.LlmInference
import com.google.mediapipe.tasks.genai.llminference.LlmInferenceSession
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.flow.MutableStateFlow
import kotlinx.coroutines.flow.StateFlow
import kotlinx.coroutines.flow.asStateFlow
import kotlinx.coroutines.launch
class LlmChatViewModel(private val llmInference: LlmInference) : ViewModel() {
    private val _responseFlow = MutableStateFlow("")
    val responseFlow: StateFlow<String> = _responseFlow.asStateFlow()

    fun generateResponse(userPrompt: String) {
        _responseFlow.value = ""
        viewModelScope.launch(Dispatchers.Default) {
            val sessionOptions = LlmInferenceSession.LlmInferenceSessionOptions.builder()
                .setTemperature(0.8f)
                .setTopK(40)
                .build()
            val session = LlmInferenceSession.createFromOptions(llmInference, sessionOptions)
            try {
                session.addQueryChunk(userPrompt)
                session.generateResponseAsync { partialResult, done ->
                    _responseFlow.value += partialResult
                    // generateResponseAsync returns immediately; close the session
                    // from the callback once generation finishes. A finally block
                    // here would close it while tokens are still streaming.
                    if (done) session.close()
                }
            } catch (e: Exception) {
                _responseFlow.value = "Error: ${e.message}"
                session.close()
            }
        }
    }
}
Collect responseFlow in your Composable with collectAsStateWithLifecycle() for a fully reactive, lifecycle-safe streaming UI. This pairs nicely with the Kotlin Result type and runCatching if you’d rather model errors as values than catch exceptions.
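A minimal Compose counterpart, assuming the lifecycle-runtime-compose artifact is on the classpath (it provides collectAsStateWithLifecycle):
import androidx.compose.material3.Text
import androidx.compose.runtime.Composable
import androidx.compose.runtime.getValue
import androidx.lifecycle.compose.collectAsStateWithLifecycle
@Composable
fun ChatScreen(viewModel: LlmChatViewModel) {
    // Recomposes as each streamed token lands in the StateFlow.
    val response by viewModel.responseFlow.collectAsStateWithLifecycle()
    Text(text = response)
}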
Performance Considerations
On-device LLM performance varies significantly across devices. A few things worth knowing as you tune your integration:
- Model size vs. device RAM — Gemma 3 1B in INT4 needs roughly 1–2 GB of free memory, making it viable on most devices sold in the last three years. Gemma-3n E4B needs more headroom; always test on your minimum target device.
- Temperature and Top-K live in session options — This is the most common migration mistake. If you set them in LlmInferenceOptions by habit from old examples, they’ll be silently ignored.
- Preferred backend — Call setPreferredBackend() in LlmInferenceOptions to hint GPU acceleration on supported devices (illustrated just after this list). MediaPipe falls back to CPU automatically when GPU isn’t available.
- Session lifecycle — Always close sessions when done. Unclosed sessions leak GPU/CPU resources and will cause slowdowns or OOM errors in long-running sessions.
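As a quick illustration of the backend hint from the list above (the Backend.GPU constant reflects my reading of the current API; double-check the enum name against the version you ship):
// Hint GPU execution at model load time; MediaPipe falls back to CPU
// automatically if the GPU delegate can't be used on this device.
val options = LlmInference.LlmInferenceOptions.builder()
    .setModelPath("${context.filesDir}/gemma-3-1b-it-int4.task")
    .setMaxTokens(1000)
    .setPreferredBackend(LlmInference.Backend.GPU)
    .build()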
Keep inference off the main thread — the ViewModel pattern above uses Dispatchers.Default for this. For broader AI feature patterns in your app, the structured AI development workflow is a useful reference.
MediaPipe vs. AICore: When to Use Each
Should you use MediaPipe LLM Inference or Android’s built-in AICore (which powers Gemini Nano)?
Use MediaPipe LLM Inference if: you need model flexibility, want to swap models without app updates, or need capabilities like multimodal input that Gemini Nano doesn’t offer. You get more control over inference settings and the session lifecycle.
Use AICore/Gemini Nano if: you want Google’s OS-managed, system-optimised model with the smallest possible app overhead. Tightest integration with Android’s ML stack, but zero model flexibility.
In practice, many teams use both — MediaPipe for specialised tasks with custom models, and Gemini Nano as a fast fallback. The layered approach gives you the best of both worlds.
Looking Ahead: LiteRT-LM
Google has released LiteRT-LM as the long-term successor to MediaPipe’s LLM Inference API. It introduces a new .litertlm file format (an evolution of .task files with richer metadata and better compression), a cleaner API surface, and improved performance across Android devices.
For existing apps: there’s no urgent need to migrate. Google has committed to supporting .task format files through the transition, and MediaPipe continues to receive updates. For brand new projects starting today, it’s worth evaluating LiteRT-LM before locking in — the migration path is expected to be straightforward once LiteRT-LM reaches its first stable release.
Wrapping Up
MediaPipe LLM Inference gives you a solid, production-capable path to on-device LLMs in Android apps today. The key things to get right: use the correct tasks-genai dependency, understand the separation between LlmInference (model, created once) and LlmInferenceSession (inference call, created per request), put temperature and top-K in your LlmInferenceSessionOptions, and always close sessions when you’re done.
Start with Gemma 3 1B on a range of test devices, measure latency and memory, and only step up to larger or multimodal models when the use case genuinely needs it. The on-device AI story on Android is maturing fast — and you’re well-positioned to build something great with what’s available right now.
This post was written by a human with the help of Claude, an AI assistant by Anthropic.
