opncrafter

TinyML: The 2KB Revolution

Dec 30, 2025 • 22 min read

Everyone talks about 80GB H100 clusters. But by unit volume, most AI deployment happens on 256KB microcontrollers (MCUs): $0.50 chips running vibration sensors in industrial machines, wake word detectors in smart earbuds, gesture classifiers in IoT devices that run for 5 years on a coin cell battery. TinyML is the engineering discipline of compressing neural networks to fit inside these constraints: memory measured not in gigabytes but in kilobytes, and floating-point arithmetic that effectively doesn't exist because the chip has no FPU.

1. The TinyML Constraint Stack

| Resource | Cloud GPU (H100) | Raspberry Pi 4 | ESP32 | Arduino Nano |
|---|---|---|---|---|
| Flash/Storage | 80GB VRAM | 32GB SD | 4MB | 256KB |
| RAM | 80GB HBM | 4GB DDR4 | 520KB | 2KB |
| Compute | 60 TFLOPS | 1.8GHz ARM | 240MHz | 16MHz |
| FPU (float32) | ✅ Yes | ✅ Yes | ✅ Yes (limited) | ❌ None |
| Power Draw | 700W | 3-5W | 0.5W | 0.015W |
| Model Size (max) | 50B+ params | 100M params | ~1M params | ~5K params |
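Those tiers translate directly into a weight budget you can check on the back of an envelope. A quick sketch in plain Python, using the illustrative 32-16-8-2 dense network built in the next section:

```python
def param_count(layer_sizes):
    """Weights + biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

params = param_count([32, 16, 8, 2])
print(params)      # 682 parameters
print(params * 4)  # 2728 bytes (~2.7KB) at float32
print(params * 1)  # 682 bytes at int8 (weights live in flash; RAM holds activations)
```

At int8 the weights fit comfortably even in an Arduino-class flash budget, which is why the rest of the pipeline below insists on full integer quantization.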

2. Model Training and Int8 Quantization

import tensorflow as tf
import numpy as np

# Step 1: Train a small model for your specific task
# Example: vibration anomaly detection using FFT features

def build_model():
    # Total parameters: (32*16 + 16) + (16*8 + 8) + (8*2 + 2) = 682 params
    # At float32: 682 × 4 bytes = 2728 bytes (~2.7KB) of weights
    # At int8: 682 × 1 byte = 682 bytes (fits on Arduino!)
    return tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation='relu', input_shape=(32,)),  # 32 FFT features
        tf.keras.layers.Dense(8, activation='relu'),
        tf.keras.layers.Dense(2, activation='softmax'),   # Normal / Anomaly
    ])

model = build_model()
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',  # y_train holds integer labels (0=normal, 1=anomaly)
              metrics=['accuracy'])
model.fit(X_train, y_train, epochs=50, batch_size=32)  # X_train: (N, 32) float FFT features

# Step 2: Representative dataset for calibration (required for int8 quantization)
# The converter needs to see typical input values to set quantization scales
def representative_data_gen():
    for sample in X_train[:200]:   # 200 samples is usually sufficient
        yield [sample.astype(np.float32).reshape(1, -1)]

# Step 3: Convert to TensorFlow Lite with full int8 quantization
converter = tf.lite.TFLiteConverter.from_keras_model(model)

# CRITICAL for microcontrollers:
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # Enable quantization
converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8   # MCU input must be int8
converter.inference_output_type = tf.int8  # MCU output must be int8

# Why int8?
# 1. Most MCUs have no hardware float support; software-emulated float is often 10-100x slower
# 2. int8 is 4x smaller than float32 (RAM and flash are precious)
# 3. Cortex-M4 boards (e.g. the Arduino Nano 33 BLE Sense) have DSP extensions
#    that kernels like CMSIS-NN use to accelerate int8 operations

tflite_model = converter.convert()

# Step 4: Convert to C byte array for inclusion in Arduino firmware
with open("model.tflite", "wb") as f:
    f.write(tflite_model)

# Convert to C array:
# xxd -i model.tflite > model_data.h
# This generates: const unsigned char model_data[] = {0x1c, 0x00, ...};
# Copy model_data.h to your Arduino sketch directory

print(f"Model size: {len(tflite_model)} bytes = {len(tflite_model)/1024:.1f}KB")
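If `xxd` isn't available (on Windows, for instance), the same header can be generated in a few lines of Python. This is a sketch of an `xxd -i` stand-in, not part of the official tooling; the variable name `model_data` matches what the Arduino sketch expects:

```python
def tflite_to_c_array(tflite_bytes, var_name="model_data"):
    """Emit a C byte-array declaration equivalent to `xxd -i`."""
    lines = [f"const unsigned char {var_name}[] = {{"]
    for i in range(0, len(tflite_bytes), 12):
        chunk = tflite_bytes[i:i + 12]
        lines.append("  " + ", ".join(f"0x{b:02x}" for b in chunk) + ",")
    lines.append("};")
    lines.append(f"const unsigned int {var_name}_len = {len(tflite_bytes)};")
    return "\n".join(lines)

# Usage with the model written in Step 4:
# with open("model.tflite", "rb") as f:
#     open("model_data.h", "w").write(tflite_to_c_array(f.read()))

print(tflite_to_c_array(b"\x1c\x00\x00\x00"))  # tiny demo input
```

Write the result to `model_data.h` in your sketch directory, exactly as with the `xxd` route.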

3. Arduino C++ Inference Code

#include <TensorFlowLite.h>
#include <tensorflow/lite/micro/micro_interpreter.h>
#include <tensorflow/lite/micro/micro_mutable_op_resolver.h>
#include <tensorflow/lite/schema/schema_generated.h>
#include "model_data.h"  // Generated by: xxd -i model.tflite > model_data.h

// === Memory Allocation (CRITICAL) ===
// The tensor arena is a static memory pool for ALL tensor operations.
// Too small → AllocateTensors() returns an error and the model cannot run.
// Too large → wastes precious RAM.
// Start generous, then shrink toward the value reported by arena_used_bytes().
constexpr int kTensorArenaSize = 4 * 1024;  // 4KB
uint8_t tensor_arena[kTensorArenaSize];

// Global interpreter pointer
tflite::MicroInterpreter* interpreter = nullptr;

void setup() {
    Serial.begin(115200);
    pinMode(LED_BUILTIN, OUTPUT);  // used to signal anomalies in loop()
    
    // 1. Load and validate model from C byte array
    const tflite::Model* model = tflite::GetModel(model_data);
    
    // Check model schema version matches TFLM version
    if (model->version() != TFLITE_SCHEMA_VERSION) {
        Serial.println("MODEL VERSION MISMATCH!");
        while(1);  // Halt
    }
    
    // 2. Register ONLY the operations your model uses.
    // MicroMutableOpResolver<N>: N is the maximum number of ops you can register
    // (adding fewer than N is fine; adding more than N fails).
    // Forgetting an op the model needs → error at AllocateTensors().
    static tflite::MicroMutableOpResolver<3> resolver;
    resolver.AddFullyConnected();  // Dense layers (ReLU is often fused into these)
    resolver.AddSoftmax();         // Output softmax
    resolver.AddRelu();            // Harmless to register even if fused away
    
    // Why not register all ops? AllOpsResolver includes every op (>200KB!)
    // MicroMutableOpResolver includes only what you declare (<1-2KB)
    
    // 3. Create interpreter using our static tensor arena
    static tflite::MicroInterpreter static_interpreter(
        model, resolver, tensor_arena, kTensorArenaSize
    );
    interpreter = &static_interpreter;
    
    // 4. Allocate tensor memory within arena
    TfLiteStatus status = interpreter->AllocateTensors();
    if (status != kTfLiteOk) {
        Serial.println("AllocateTensors() FAILED — increase kTensorArenaSize!");
        while(1);
    }
    
    Serial.print("Arena used: ");
    Serial.println(interpreter->arena_used_bytes());  // Monitor actual usage
}

void loop() {
    // 5. Read sensor and prepare input
    // Input tensor expects int8 (because we used inference_input_type=tf.int8)
    int8_t* input_buffer = interpreter->input(0)->data.int8;
    
    // Read 32-point FFT from your sensor
    float raw_features[32];
    read_accelerometer_fft(raw_features);  // Your sensor reading function
    
    // Quantize: float → int8 using the scale/zero_point stored in the model
    // These values come from the calibration step during conversion
    float input_scale = interpreter->input(0)->params.scale;
    int32_t input_zero_point = interpreter->input(0)->params.zero_point;
    
    for (int i = 0; i < 32; i++) {
        // roundf() avoids the downward bias of plain truncation
        float quantized = roundf(raw_features[i] / input_scale + input_zero_point);
        input_buffer[i] = static_cast<int8_t>(
            max(-128.0f, min(127.0f, quantized))  // Clamp to int8 range
        );
    }
    
    // 6. Run inference
    TfLiteStatus invoke_status = interpreter->Invoke();
    if (invoke_status != kTfLiteOk) {
        Serial.println("Invoke() FAILED");
        return;
    }
    
    // 7. Read output
    int8_t* output = interpreter->output(0)->data.int8;
    float output_scale = interpreter->output(0)->params.scale;
    int32_t output_zero_point = interpreter->output(0)->params.zero_point;
    
    // Dequantize output class probabilities
    float prob_normal = (output[0] - output_zero_point) * output_scale;
    float prob_anomaly = (output[1] - output_zero_point) * output_scale;
    
    if (prob_anomaly > 0.8f) {
        Serial.println("ANOMALY DETECTED!");
        digitalWrite(LED_BUILTIN, HIGH);
    }
    
    delay(100);  // 10 Hz inference rate
}
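The affine quantize/dequantize arithmetic in the loop above (q = round(x/scale) + zero_point, and its inverse) is worth sanity-checking on the desktop before debugging it over a serial port. A minimal sketch with NumPy; the scale and zero_point here are made-up illustrative values, not from a real model (on-device they come from `interpreter->input(0)->params`):

```python
import numpy as np

def quantize(x, scale, zero_point):
    """float → int8, same affine formula as the on-device quantize loop."""
    q = np.round(x / scale + zero_point)
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize(q, scale, zero_point):
    """int8 → float, the inverse mapping used on the output tensor."""
    return (q.astype(np.float32) - zero_point) * scale

scale, zero_point = 0.05, -4   # illustrative calibration values
x = np.array([0.0, 1.0, -2.5, 6.4], dtype=np.float32)
q = quantize(x, scale, zero_point)
x_hat = dequantize(q, scale, zero_point)
print(q)      # int8 codes
print(x_hat)  # reconstruction; error is at most scale/2 while in range
```

If the roundtrip error exceeds scale/2 for in-range inputs, the scale/zero_point being applied on-device doesn't match the model's calibration.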

Frequently Asked Questions

How do I know how large to set the tensor arena?

Start with a generous size (16KB) and call interpreter->arena_used_bytes() after AllocateTensors() to see actual usage. Set kTensorArenaSize to ~120% of the reported usage to leave a safety margin (some ops require temporary scratch space beyond their declared tensor sizes). For truly RAM-constrained MCUs (Arduino Nano's 2KB!), the "binary search" approach works: try 2048, then 1024, then 1536... until you find the minimum size where AllocateTensors() succeeds.
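The binary-search idea generalizes to any monotone "does this size work?" check. A sketch in Python, where `try_allocate` is a hypothetical stand-in (in practice each probe means rebuilding the sketch with a new kTensorArenaSize and checking whether AllocateTensors() succeeds):

```python
def min_arena_size(try_allocate, lo=256, hi=16 * 1024):
    """Smallest size (bytes) for which try_allocate(size) succeeds.
    Assumes monotonicity: if N bytes work, N+1 bytes also work."""
    if not try_allocate(hi):
        raise RuntimeError("upper bound too small; raise hi")
    while lo < hi:
        mid = (lo + hi) // 2
        if try_allocate(mid):
            hi = mid   # mid works, try smaller
        else:
            lo = mid + 1
    return lo

# Illustrative stand-in: pretend the model needs 1,400 bytes of arena.
print(min_arena_size(lambda size: size >= 1400))  # → 1400
```

In practice you rarely need the full search: one build with a generous arena plus `arena_used_bytes()` gets you within a few hundred bytes, and the search only pays off on the tightest parts.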

What's the difference between Edge Impulse and raw TFLM for production?

Edge Impulse (edgeimpulse.com) provides a complete cloud-based pipeline: record sensor data, design signal processing (FFT, MFCC, spectrogram), train the model, optimize for your target MCU, and generate ready-to-deploy Arduino/ESP32 libraries with all the boilerplate handled. Raw TensorFlow Lite for Microcontrollers requires you to handle signal processing, quantization, and C++ integration yourself. For prototyping and learning: Edge Impulse is dramatically faster. For production systems where you need maximum control over every byte: raw TFLM. Most production TinyML systems use Edge Impulse for development and validation, then extract the optimized model for direct TFLM deployment.

Conclusion

TinyML brings neural network inference to the billions of microcontrollers already in production — no cloud connectivity required, no data privacy exposure, no battery drain from WiFi radio. The key constraints are absolute: int8 quantization is non-negotiable (no FPU), the tensor arena must be statically allocated (no heap), and only required operations should be registered (code space is precious). With these constraints respected, models from 500 bytes to 100KB can run vibration analysis, voice activity detection, gesture recognition, and anomaly detection on chips that cost less than a cup of coffee and run for years on a coin cell battery.

Written by

Vivek

AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.
