Running AI Models on Edge Devices: A Practical Guide to On-Device Inference

Training an AI model is only half the story. Getting that model to run correctly, quickly, and efficiently on a microcontroller with 256 KB of RAM and 1 MB of flash is where most embedded engineers discover how different edge deployment is from cloud inference. On-device inference — running a neural network directly on the target hardware without cloud connectivity — is the core engineering challenge of AIoT, and it requires a fundamentally different set of skills than model training.

This guide walks through every stage of the on-device inference pipeline: choosing a framework, preparing your model, handling memory constraints, optimizing for speed and power, and validating results. Whether you are targeting an ESP32, a Nordic nRF5340, or an STM32 with an NPU, the principles apply across the board.

Why On-Device Inference and Not Cloud Inference?

Before diving into implementation, it is worth being precise about when on-device inference is the right choice. The answer is not always “yes.”

Choose on-device when:

Inference latency must be below 100 ms (e.g., motor fault detection, fall detection)
The device operates in environments with unreliable or no connectivity
Raw data is sensitive and must not leave the device (medical, security)
Battery life is critical and continuous data streaming is prohibitive
Per-query cloud costs become significant at scale (millions of devices × millions of queries per day)

Cloud inference is preferable when:

The model is large and accuracy is paramount (GPT-class models, high-res image segmentation)
Compute requirements change frequently and hardware cannot be updated
The dataset for retraining grows rapidly and requires large compute

For most sensor-driven IoT products, on-device is the right default for real-time classification and anomaly detection, with cloud reserved for retraining and fleet analytics.

Choosing the Right Inference Framework

Three tools dominate edge ML deployment in 2026. Two are inference runtimes; the third, CMSIS-NN, is a kernel library that runs beneath a runtime:

TensorFlow Lite Micro (TFLM)

TensorFlow Lite Micro is the most mature embedded ML runtime. It runs on bare-metal systems with no OS dependency, requires no dynamic memory allocation, and supports a well-defined subset of TensorFlow Lite operators. TFLM is the right choice when you need broad hardware support and a large community.

Key constraints: operator support is limited to those explicitly ported to the micro runtime. Complex architectures with custom ops may require fallback to a less-optimized path.

Edge Impulse SDK

Edge Impulse generates a complete C++ inference library from your trained model, including optimized DSP feature extraction. The generated SDK is self-contained and integrates cleanly into any embedded project. Edge Impulse is ideal when you want an end-to-end workflow and do not need to hand-tune the runtime.

CMSIS-NN

Arm’s CMSIS-NN is a library of optimized neural-network kernels for Cortex-M processors. It is not a standalone inference runtime; you use it as a backend beneath TFLM or a custom runtime. CMSIS-NN uses the SIMD instructions of the DSP extension, available on Cortex-M4, M7, and (where implemented) M33 cores, to achieve roughly 4–10× speedup over naive C implementations of convolution and fully-connected layers.

Model Preparation: The Compression Pipeline

A model trained in full float32 precision on a GPU is rarely deployable as-is on an MCU. The standard compression pipeline has three stages:

1. Quantization

Quantization converts float32 weights and activations to lower-precision formats, typically int8 or uint8. Going from 32 bits to 8 bits per value reduces model size by about 4× and speeds up inference substantially, especially on hardware without an FPU.

There are two approaches:

Post-training quantization (PTQ): Convert a trained float32 model using a calibration dataset. Fast and easy, but accuracy may drop 1–3%.
Quantization-aware training (QAT): Simulate quantization during training, so the model learns to be robust to precision reduction. Recovers most accuracy but requires retraining.

For most IoT applications, PTQ with a representative calibration dataset produces acceptable results. QAT is worth the effort when accuracy requirements are stringent (e.g., medical-grade devices).
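To make the float32-to-int8 mapping concrete, here is a minimal sketch of affine quantization with a scale and a zero point. The names QuantParams, ChooseParams, Quantize, and Dequantize are illustrative and not part of any library. A real converter derives the value range per tensor or per channel from the calibration dataset, which this sketch takes as given.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper type for this sketch; not a TFLM or Edge Impulse API.
struct QuantParams {
  float scale;
  int32_t zero_point;
};

// Derive int8 parameters from an observed value range [min_val, max_val].
// The range is widened to include 0 so that 0.0f maps to an exact integer.
QuantParams ChooseParams(float min_val, float max_val) {
  min_val = std::min(min_val, 0.0f);
  max_val = std::max(max_val, 0.0f);
  assert(max_val > min_val);
  const float scale = (max_val - min_val) / 255.0f;
  const int32_t zero_point =
      static_cast<int32_t>(std::lround(-128.0f - min_val / scale));
  return {scale, zero_point};
}

// Map a float to int8, saturating at the ends of the int8 range.
int8_t Quantize(float x, const QuantParams& p) {
  const long q = std::lround(x / p.scale) + p.zero_point;
  return static_cast<int8_t>(std::max(-128L, std::min(127L, q)));
}

// Map an int8 value back to an approximate float.
float Dequantize(int8_t q, const QuantParams& p) {
  return p.scale * static_cast<float>(q - p.zero_point);
}
```

Values outside the calibrated range saturate at -128 or 127, which is one reason a calibration set that misses real-world extremes can hurt accuracy.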

2. Pruning

Pruning removes weights that contribute little to the output, creating a sparse network. Structured pruning — removing entire filters or neurons — is more hardware-friendly than unstructured pruning because it reduces computation rather than just storage.
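As an illustration of structured pruning, the sketch below ranks convolution filters by L1 norm and keeps the strongest ones. Ranking by L1 norm is an assumption made for this example (it is one common criterion, not the only one), and the names Filter, L1Norm, and SelectFiltersToKeep are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// One flattened convolution filter: all weights for one output channel.
using Filter = std::vector<float>;

float L1Norm(const Filter& f) {
  float sum = 0.0f;
  for (float w : f) sum += std::fabs(w);
  return sum;
}

// Returns the indices of the `keep` filters with the largest L1 norm,
// in their original order. The other filters would be removed, together
// with the matching input channels of the following layer.
std::vector<std::size_t> SelectFiltersToKeep(const std::vector<Filter>& filters,
                                             std::size_t keep) {
  assert(keep <= filters.size());
  std::vector<std::size_t> order(filters.size());
  std::iota(order.begin(), order.end(), 0);
  std::stable_sort(order.begin(), order.end(),
                   [&](std::size_t a, std::size_t b) {
                     return L1Norm(filters[a]) > L1Norm(filters[b]);
                   });
  order.resize(keep);
  std::sort(order.begin(), order.end());
  return order;
}
```

Because whole filters disappear, the resulting layer is simply a smaller dense layer, so it runs faster on an MCU without any sparse-kernel support. In practice the pruned model is usually fine-tuned afterwards to recover accuracy.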

3. Knowledge Distillation

A large, accurate “teacher” model is used to train a smaller “student” model optimized for the target hardware. The student learns from the teacher’s soft outputs rather than hard labels, retaining more accuracy than training from scratch with a small architecture.
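The "soft outputs" idea can be sketched with a temperature-scaled softmax, which is how distillation is commonly formulated. The function names and the use of a single temperature parameter are assumptions for illustration. Real distillation runs inside a training framework and typically mixes this soft-target loss with the ordinary hard-label loss.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Softmax with temperature. A temperature of 1 is the ordinary softmax;
// larger values flatten the distribution and expose how the teacher ranks
// the non-winning classes.
std::vector<float> SoftmaxWithTemperature(const std::vector<float>& logits,
                                          float temperature) {
  assert(!logits.empty() && temperature > 0.0f);
  const float max_logit = *std::max_element(logits.begin(), logits.end());
  std::vector<float> probs(logits.size());
  float sum = 0.0f;
  for (std::size_t i = 0; i < logits.size(); ++i) {
    // Subtracting the maximum keeps exp() numerically stable.
    probs[i] = std::exp((logits[i] - max_logit) / temperature);
    sum += probs[i];
  }
  for (float& p : probs) p /= sum;
  return probs;
}

// Cross-entropy between the teacher's soft targets and the student's
// softened predictions. It is smallest when the two distributions match.
float SoftTargetLoss(const std::vector<float>& teacher_logits,
                     const std::vector<float>& student_logits,
                     float temperature) {
  assert(teacher_logits.size() == student_logits.size());
  const std::vector<float> t = SoftmaxWithTemperature(teacher_logits, temperature);
  const std::vector<float> s = SoftmaxWithTemperature(student_logits, temperature);
  float loss = 0.0f;
  for (std::size_t i = 0; i < t.size(); ++i) loss -= t[i] * std::log(s[i]);
  return loss;
}
```

With a higher temperature the teacher's near-miss classes receive visible probability mass, which is the extra signal a small student cannot get from hard labels alone.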

[Figure: Running AI on edge devices, inference pipeline]

Memory Constraints and Practical Budgeting

MCU memory has two components that matter for ML:

Flash (program memory): Stores the firmware and the model weights. A quantized, reduced-width MobileNetV1 for 10-class image classification fits in ~250 KB. A small 1D CNN for vibration analysis may be only 8–16 KB.
SRAM (working memory): Stores activations during inference. This is often the binding constraint. Activations can be 3–5× the size of the model weights for convolutional networks.

Rule of thumb: budget your SRAM so that peak activation memory (the largest set of tensors that must be live at the same time, not just the single largest tensor), plus the input buffer, plus the stack, fits within available SRAM with at least 20% headroom for other tasks running concurrently.
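The rule of thumb translates directly into a check you can run while sizing a design. This is a minimal sketch; the SramBudget struct and FitsWithHeadroom function are hypothetical names, and the byte counts have to come from your own profiling or tooling.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical description of one deployment's SRAM needs, in bytes.
struct SramBudget {
  std::size_t total_sram_bytes;
  std::size_t peak_activation_bytes;  // largest activation footprint during inference
  std::size_t input_buffer_bytes;
  std::size_t stack_bytes;
};

// Returns true if activations + input buffer + stack fit in SRAM while
// leaving at least `headroom_percent` of the total free for other tasks.
bool FitsWithHeadroom(const SramBudget& b, unsigned headroom_percent = 20) {
  assert(headroom_percent < 100);
  const std::size_t required =
      b.peak_activation_bytes + b.input_buffer_bytes + b.stack_bytes;
  // Compare in integer arithmetic: required / total <= (100 - headroom) / 100.
  return required * 100 <= b.total_sram_bytes * (100 - headroom_percent);
}
```

For example, on a part with 256 KB of SRAM, 150 KB of activations plus a 16 KB input buffer and an 8 KB stack passes the check, while 190 KB of activations with the same buffers does not.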

Tools like Edge Impulse’s EON compiler analyze activation memory requirements at each layer and optimize the computation graph to minimize peak SRAM usage.

Target hardware memory profiles for common MCUs:

MCU	Flash	SRAM	ML suitability
STM32F4 series	512 KB–2 MB	128–256 KB	Small 1D CNNs, shallow NNs
ESP32-S3	8 MB (external flash, module-dependent)	512 KB (plus optional external PSRAM)	MobileNet, LSTM, larger models
nRF5340	1 MB	512 KB	Keyword spotting, anomaly detection
STM32H7	2 MB	1 MB	Audio classification, gesture recognition

Optimizing Inference Speed and Power

Latency optimization:

Use hardware-accelerated backends (CMSIS-NN for Cortex-M, ESP-NN for Xtensa LX7)
Profile each operator to find bottlenecks — convolutions dominate for image models; dense layers dominate for small NNs
Use depthwise-separable convolutions (as in MobileNet) instead of standard convolutions: roughly 8–9× fewer multiply-accumulate operations for 3×3 kernels
Run inference at the highest safe clock frequency, then return to low-power mode
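The savings from depthwise-separable convolutions can be checked by counting multiply-accumulate operations (MACs). The sketch below uses the standard operation counts for the two layer types; the ConvShape struct and function names are hypothetical, and bias additions and activation functions are ignored.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layer description for this sketch.
struct ConvShape {
  uint64_t out_height;
  uint64_t out_width;
  uint64_t in_channels;
  uint64_t out_channels;
  uint64_t kernel;  // square kernel side, e.g. 3 for 3x3
};

// Standard convolution: every output value sums over kernel^2 * in_channels.
uint64_t StandardConvMacs(const ConvShape& s) {
  assert(s.kernel > 0);
  return s.out_height * s.out_width * s.kernel * s.kernel *
         s.in_channels * s.out_channels;
}

// Depthwise (kernel^2 per input channel) followed by a 1x1 pointwise
// convolution (in_channels per output channel).
uint64_t DepthwiseSeparableMacs(const ConvShape& s) {
  assert(s.kernel > 0);
  return s.out_height * s.out_width * s.in_channels *
         (s.kernel * s.kernel + s.out_channels);
}
```

For a 3×3 layer with 64 input channels, 128 output channels, and a 32×32 output, this gives 75,497,472 MACs for the standard convolution and 8,978,432 for the separable version, a ratio of about 8.4.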

Power optimization:

Duty-cycle the inference engine: wake on interrupt, run inference, sleep. A 10 ms inference window at 64 MHz followed by 990 ms of deep sleep yields an effective duty cycle of 1%.
Run feature extraction (FFT, filtering) at lower clock speeds — these operations are less compute-intensive than NN inference
Power-gate the sensor when inference is not needed
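The duty-cycle arithmetic above extends naturally to an average-current estimate, which is what determines battery life. In this sketch the timing values match the 10 ms / 990 ms example, while the current figures passed in are placeholders you would replace with measurements from your own board. DutyCycle and the two functions are hypothetical names.

```cpp
#include <cassert>

// Hypothetical description of one wake/sleep period.
struct DutyCycle {
  double active_ms;          // time spent awake running inference
  double sleep_ms;           // time spent in deep sleep
  double active_current_ma;  // measured current while active
  double sleep_current_ma;   // measured current while asleep
};

// Fraction of each period spent active (0.01 means 1%).
double DutyCycleFraction(const DutyCycle& d) {
  assert(d.active_ms + d.sleep_ms > 0.0);
  return d.active_ms / (d.active_ms + d.sleep_ms);
}

// Time-weighted average current over one period, in mA.
double AverageCurrentMa(const DutyCycle& d) {
  assert(d.active_ms + d.sleep_ms > 0.0);
  return (d.active_ms * d.active_current_ma + d.sleep_ms * d.sleep_current_ma) /
         (d.active_ms + d.sleep_ms);
}
```

With illustrative figures of 10 mA active and 5 µA asleep, the average works out to roughly 0.105 mA, and the active window still accounts for most of it. Shortening inference therefore tends to pay off more than shaving sleep current.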

Embedded.com’s MCU power benchmarking guide provides detailed methodology for measuring inference energy per sample.

Validating On-Device Inference Accuracy

A model that scores 95% on the validation set during training may perform differently in production. Causes of accuracy degradation include:

Quantization error: INT8 arithmetic introduces rounding errors not present during float32 training
Distribution shift: The calibration dataset used for quantization may not represent the full range of real-world inputs
Sensor variation: Unit-to-unit variation in MEMS sensors means a model trained on data from one sensor may perform differently on another

Best practice is to run a golden dataset — a fixed set of labeled inputs — through both the float32 reference model and the quantized on-device model, and compute accuracy, latency, and output divergence. Any accuracy drop above your threshold should trigger QAT or recalibration.
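A host-side comparison over the golden dataset might look like the sketch below. It assumes the device outputs have already been collected and dequantized to float; the GoldenSample and ValidationReport types and the CompareOnGoldenSet function are hypothetical. Latency is left out because it has to be measured on the target with a hardware timer.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <iterator>
#include <vector>

// One labeled input with the outputs both models produced for it.
struct GoldenSample {
  int label;
  std::vector<float> reference_output;  // float32 reference model
  std::vector<float> device_output;     // quantized on-device model, dequantized
};

struct ValidationReport {
  double reference_accuracy;
  double device_accuracy;
  double max_abs_divergence;  // largest per-class output difference seen
};

int ArgMax(const std::vector<float>& v) {
  assert(!v.empty());
  return static_cast<int>(
      std::distance(v.begin(), std::max_element(v.begin(), v.end())));
}

ValidationReport CompareOnGoldenSet(const std::vector<GoldenSample>& samples) {
  assert(!samples.empty());
  std::size_t ref_correct = 0;
  std::size_t dev_correct = 0;
  double max_div = 0.0;
  for (const GoldenSample& s : samples) {
    assert(s.reference_output.size() == s.device_output.size());
    if (ArgMax(s.reference_output) == s.label) ++ref_correct;
    if (ArgMax(s.device_output) == s.label) ++dev_correct;
    for (std::size_t i = 0; i < s.reference_output.size(); ++i) {
      const double d = std::fabs(static_cast<double>(s.reference_output[i]) -
                                 static_cast<double>(s.device_output[i]));
      max_div = std::max(max_div, d);
    }
  }
  const double n = static_cast<double>(samples.size());
  return {ref_correct / n, dev_correct / n, max_div};
}
```

Comparing the two accuracy figures gives the drop to check against your threshold, while the divergence value flags cases where the outputs drift even though the predicted class has not yet changed.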

Hackster.io has published reproducible benchmarking methodologies for TinyML models across common MCU targets.

A Minimal Deployment Example: TFLM on STM32

Here is a condensed C++ snippet showing the key steps for TFLM inference:

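#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <iterator>

#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "model.h"  // generated flatbuffer array (g_model_data)

// Static memory arena: no heap allocation. TFLM expects 16-byte alignment.
constexpr int kArenaSize = 48 * 1024;
alignas(16) static uint8_t tensor_arena[kArenaSize];
static tflite::MicroInterpreter* interpreter = nullptr;

// Call once at startup. Returns false if tensor allocation fails.
bool SetupModel() {
  // Register only the operators the model uses; this list is a placeholder.
  // AllOpsResolver, used in older TFLM examples, links every kernel and costs flash.
  static tflite::MicroMutableOpResolver<4> resolver;
  resolver.AddConv2D();
  resolver.AddFullyConnected();
  resolver.AddReshape();
  resolver.AddSoftmax();

  const tflite::Model* model = tflite::GetModel(g_model_data);
  static tflite::MicroInterpreter static_interpreter(model, resolver, tensor_arena, kArenaSize);
  interpreter = &static_interpreter;
  return interpreter->AllocateTensors() == kTfLiteOk;
}

// Returns the predicted class index, or -1 if inference fails.
// Assumes float32 input and output tensors; an int8 model uses data.int8 instead.
int Classify(const float* sensor_buffer, size_t input_size, int num_classes) {
  if (interpreter == nullptr) return -1;

  // Copy input
  float* input = interpreter->input(0)->data.f;
  memcpy(input, sensor_buffer, input_size * sizeof(float));

  // Run inference
  if (interpreter->Invoke() != kTfLiteOk) return -1;

  // Read output
  const float* output = interpreter->output(0)->data.f;
  return static_cast<int>(std::distance(output, std::max_element(output, output + num_classes)));
}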

This pattern — static arena, pre-allocated tensors, synchronous Invoke() — is the foundation of virtually every TFLM deployment.

Conclusion

Running AI models on edge devices demands careful attention to the full stack: model architecture selection, compression pipeline, framework choice, memory budgeting, and hardware-specific optimization. The gap between a model that works in a Colab notebook and one that runs reliably on a Cortex-M4 is significant, but it is bridgeable with the right methodology.

At UABit, our embedded ML engineers handle the full deployment pipeline — from model conversion and optimization to hardware-in-the-loop validation. If you are building an AIoT product and need to get from trained model to production firmware, we can accelerate your path to deployment.