
Gemini Nano: The OS-Level LLM

Dec 30, 2025 • 18 min read

Traditionally, shipping AI in an Android app meant bundling the model: a 2-8GB ONNX or TFLite file that inflates your APK, gets downloaded along with your app, and sits on the user's device consuming storage — even for users who never use the AI feature. Install rates drop dramatically above 100MB. Google's AICore service for Android 14+ solves this: the OS manages Gemini Nano as a shared system resource, much like how Android manages fonts or icon packs. Your app binds to the AICore service and runs inference against the shared model, adding zero bytes to your APK.

1. AICore Architecture and Device Support

  • Model size: Gemini Nano is approximately 3B parameters, quantized to 4-bit (~2GB on disk)
  • Managed by OS: Google Play Services downloads and updates the model — your app version doesn't change when the model does
  • APK size: Zero increase — your app just contains API calls, not the model
  • Privacy: All inference runs locally — no data sent to Google servers

Supported devices as of early 2025:

  • Google Pixel: Pixel 8, 8a, 8 Pro, 9, 9 Pro, 9 Pro XL, 9 Pro Fold
  • Samsung Galaxy: S24, S24+, S24 Ultra, S24 FE, S25 series (via Galaxy AI SDK wrapper)
  • Others: any device with Snapdragon 8 Gen 3 or later, Android 14+, and declared AICore support
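At runtime you can gate your AI UI on the same system feature that the manifest entry in the next section declares. A minimal sketch (the feature string mirrors the `android.software.aicore` manifest declaration; verify the exact constant against the current docs):

```kotlin
import android.content.Context
import android.content.pm.PackageManager

// Cheap device-level gate. Note: a true result only means the device
// *declares* AICore support — the model itself may still be
// NOT_STARTED or DOWNLOADING, so also check getDownloadStatus().
fun deviceDeclaresAiCore(context: Context): Boolean =
    context.packageManager.hasSystemFeature("android.software.aicore")
```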

2. Implementation with Google AI Edge SDK

// build.gradle.kts (app module)
dependencies {
    implementation("com.google.ai.edge.aicore:aicore:1.0.0-alpha05")
}

<!-- AndroidManifest.xml: declare AICore requirement -->
<uses-feature
    android:name="android.software.aicore"
    android:required="false" />  <!-- false = graceful fallback if unavailable -->

// ViewModel.kt
import android.app.Application
import androidx.lifecycle.AndroidViewModel
import androidx.lifecycle.viewModelScope
import com.google.ai.edge.aicore.*
import kotlinx.coroutines.flow.MutableStateFlow
import kotlinx.coroutines.flow.StateFlow
import kotlinx.coroutines.flow.asStateFlow
import kotlinx.coroutines.flow.collectLatest
import kotlinx.coroutines.launch

sealed interface UiState {
    data object Idle : UiState
    data object Downloading : UiState
    data object Ready : UiState
    data object FallbackRequired : UiState
}

class AiViewModel(application: Application) : AndroidViewModel(application) {
    
    private val generativeModel: GenerativeModelClient = 
        GenerativeModelClient.create(application)
    
    // Backing state the functions below write to (exposed read-only to the UI)
    private val _uiState = MutableStateFlow<UiState>(UiState.Idle)
    val uiState: StateFlow<UiState> = _uiState.asStateFlow()
    
    private val _generatedText = MutableStateFlow("")
    val generatedText: StateFlow<String> = _generatedText.asStateFlow()
    // CRITICAL: Always check availability before calling generate()
    // The model might be: NOT_SUPPORTED, DOWNLOADING, or READY
    suspend fun checkAvailability(): DownloadStatus {
        return generativeModel.getDownloadStatus()
    }
    
    // Request model download (no-op if already downloaded / downloading)
    fun downloadModelIfNeeded() {
        viewModelScope.launch {
            when (generativeModel.getDownloadStatus()) {
                DownloadStatus.NOT_STARTED -> {
                    // Trigger download (happens in background via AICore)
                    generativeModel.requestDownload()
                    _uiState.value = UiState.Downloading
                }
                DownloadStatus.DOWNLOADING -> {
                    _uiState.value = UiState.Downloading
                }
                DownloadStatus.READY -> {
                    _uiState.value = UiState.Ready
                }
                DownloadStatus.NOT_SUPPORTED -> {
                    // Device doesn't support Gemini Nano — use fallback
                    _uiState.value = UiState.FallbackRequired
                }
            }
        }
    }
    
    // Generate with streaming (tokens appear as they are generated)
    fun generateStreaming(prompt: String) {
        viewModelScope.launch {
            _generatedText.value = ""
            
            generativeModel.generateContent(prompt)
                .text  // This is a Flow<String>
                .collectLatest { partialText ->
                    _generatedText.value += partialText  // Append each chunk
                    // Update UI in real-time as tokens arrive
                }
        }
    }
}
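On the UI side, a Compose screen can collect those flows directly. A sketch assuming the ViewModel exposes `uiState` and `generatedText` as public `StateFlow`s (imports omitted, as in the card snippet below; the prompt string is illustrative):

```kotlin
@Composable
fun AiScreen(viewModel: AiViewModel) {
    val uiState by viewModel.uiState.collectAsState()
    val generatedText by viewModel.generatedText.collectAsState()

    Column(modifier = Modifier.padding(16.dp)) {
        when (uiState) {
            UiState.Downloading -> Text("Downloading on-device AI model…")
            UiState.FallbackRequired -> Text("On-device AI unavailable on this device")
            UiState.Ready -> Button(
                onClick = { viewModel.generateStreaming("Summarize my notes") },
            ) {
                Text("Generate")
            }
            else -> Unit  // Idle: nothing to show yet
        }
        Text(text = generatedText)  // Re-renders as each streamed chunk arrives
    }
}
```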

3. Production Pattern: Graceful Fallback

// Robust production implementation with fallback to cloud API
class AiRepository(
    private val nanoClient: GenerativeModelClient,
    private val cloudClient: GeminiCloudClient,  // Your server-side Gemini API client
) {
    suspend fun generateText(prompt: String): Flow<String> {
        return when (nanoClient.getDownloadStatus()) {
            DownloadStatus.READY -> {
                // Use local model: fast, private, $0 cost
                nanoClient.generateContent(prompt).text
            }
            DownloadStatus.DOWNLOADING -> {
                // Model still downloading — show progress, use cloud meanwhile
                cloudClient.generateStream(prompt)
            }
            DownloadStatus.NOT_STARTED -> {
                // Supported but never downloaded — kick off the download
                // and serve this request from the cloud in the meantime
                nanoClient.requestDownload()
                cloudClient.generateStream(prompt)
            }
            DownloadStatus.NOT_SUPPORTED -> {
                // Device doesn't support AICore — always use cloud
                cloudClient.generateStream(prompt)
            }
        }
    }
}

// UI Pattern: Show source of inference to user
@Composable
fun AiResponseCard(text: String, usingLocalModel: Boolean) {
    Card {
        Column(modifier = Modifier.padding(16.dp)) {
            Row(verticalAlignment = Alignment.CenterVertically) {
                Icon(
                    imageVector = if (usingLocalModel) Icons.Default.PhoneAndroid else Icons.Default.Cloud,
                    contentDescription = null,
                    modifier = Modifier.size(16.dp),
                )
                Spacer(modifier = Modifier.width(4.dp))
                Text(
                    text = if (usingLocalModel) "On-device AI" else "Cloud AI",
                    style = MaterialTheme.typography.labelSmall,
                    color = MaterialTheme.colorScheme.outline,
                )
            }
            Spacer(modifier = Modifier.height(8.dp))
            Text(text = text)
        }
    }
}

4. MediaPipe LLM Inference API: Support for All Devices

// MediaPipe LLM: works on ALL Android devices (no AICore required)
// But: you bundle a small model with your app (or download separately)
// Use case: features where local AI must work on every device, not just flagships

dependencies {
    implementation("com.google.mediapipe:tasks-genai:0.10.20")
}

// Download and cache a small model at app first-launch
// Gemma 2B (Q4) = ~1.5GB; Phi-2 (Q4) = ~1.2GB
val modelPath = downloadAndCacheModel(
    url = "https://your-cdn.com/gemma-2b-q4.bin",
    filename = "gemma_2b.bin",
)

// Initialize MediaPipe LLM
val llmOptions = LlmInference.LlmInferenceOptions.builder()
    .setModelPath(modelPath)
    .setMaxTokens(1024)  // Combined prompt + response token budget
    .setResultListener { partialResult, done ->
        // Streaming callback, used by generateResponseAsync() below
        if (partialResult != null) updateUI(partialResult)
        if (done) { /* stream finished — re-enable input, etc. */ }
    }
    .build()

val llmInference = LlmInference.createFromOptions(context, llmOptions)

// Synchronous inference (blocking — run on a background thread)
val response = llmInference.generateResponse("Summarize: $articleText")

// Streaming: partial results arrive on the listener configured above
llmInference.generateResponseAsync("Summarize: $articleText")
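The `downloadAndCacheModel` helper above is left undefined. A minimal sketch (blocking, so call it from `Dispatchers.IO`; the signature here adds an explicit `cacheDir` parameter, and the `.part` temp-file rename is a cheap guard against half-downloaded models — a production version would add resume support and checksum verification):

```kotlin
import java.io.File
import java.net.URL

// Downloads the model on first launch and reuses the cached copy afterwards.
fun downloadAndCacheModel(cacheDir: File, url: String, filename: String): String {
    val target = File(cacheDir, filename)
    if (!target.exists()) {
        val tmp = File(cacheDir, "$filename.part")
        URL(url).openStream().use { input ->
            tmp.outputStream().use { output -> input.copyTo(output) }
        }
        tmp.renameTo(target)  // Only expose the file once fully written
    }
    return target.absolutePath
}
```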

Frequently Asked Questions

What use cases work well with Gemini Nano's capability level?

Good fit: Smart Reply suggestions (2-3 short reply options), text summarization (article or notification summaries), grammar/tone correction ("Make this email more professional"), keyword extraction, basic translation for common language pairs, form auto-fill from voice input, meeting note summarization. Poor fit: Code generation (too complex for 3B parameters), creative writing, complex reasoning, nuanced Q&A, multilingual tasks in less common languages. Think of Nano as a powerful text transformation tool rather than a general-purpose AI assistant.

How do I handle the "model is downloading" state gracefully?

The best UX pattern: trigger model download at app first-launch in the background (not on first AI feature use), poll download status periodically with WorkManager, show a subtle UI indicator when the model is ready ("AI features available"), and fall back to cloud API transparently during the download period. Avoid blocking the user on a download progress screen — the file is large and downloads can take 5-15 minutes on slower connections. Users shouldn't feel the transition between cloud and on-device inference.
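The polling step of that pattern can be sketched as a one-time WorkManager worker that retries with backoff until AICore reports READY (`ModelStatusWorker` and `markAiFeaturesAvailable` are illustrative names; the client API follows the `GenerativeModelClient` surface used earlier in this post):

```kotlin
import android.content.Context
import androidx.work.CoroutineWorker
import androidx.work.WorkerParameters
import com.google.ai.edge.aicore.DownloadStatus
import com.google.ai.edge.aicore.GenerativeModelClient

// Retries (per the request's backoff policy) until the model is READY,
// then flips a flag the UI observes to reveal AI features.
class ModelStatusWorker(
    context: Context,
    params: WorkerParameters,
) : CoroutineWorker(context, params) {

    override suspend fun doWork(): Result =
        when (GenerativeModelClient.create(applicationContext).getDownloadStatus()) {
            DownloadStatus.READY -> {
                markAiFeaturesAvailable(applicationContext)  // hypothetical: e.g. write a DataStore flag
                Result.success()
            }
            DownloadStatus.DOWNLOADING,
            DownloadStatus.NOT_STARTED -> Result.retry()  // Check again after backoff
            else -> Result.failure()  // NOT_SUPPORTED: stop polling, rely on cloud
        }
}
```

Enqueue it once at first launch via `OneTimeWorkRequestBuilder<ModelStatusWorker>()` with backoff criteria; `Result.retry()` then reschedules the check automatically.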

Conclusion

Android AICore represents a new model for mobile AI deployment: OS-managed shared models that add zero APK size, update automatically through system services, and provide privacy by keeping inference fully local. For supported Pixel and Galaxy devices, Gemini Nano via GenerativeModelClient is the lowest-friction path to on-device AI. For universal device support, MediaPipe LLM Inference API provides similar capability at the cost of bundling or side-loading the model. Always implement graceful fallback to cloud APIs for unsupported devices — designing with the fallback path in mind from the start prevents brittle AI features.

Written by

Vivek

AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.

GPT-4o · LangChain · Next.js · Vector DBs · RAG · Vercel AI SDK