Computer Vision on the Edge: Running Image Classification on ESP32 and Cortex-M Devices

Computer vision on microcontrollers was considered impractical just four years ago. Running a convolutional neural network on a device with 512 KB of RAM seemed incompatible with the memory requirements of any useful image model. The landscape has changed dramatically. ESP32-S3 with its vector extensions, STM32H7 with its 1 MB SRAM and hardware JPEG accelerator, and ultra-low-power image sensors like the Himax HM01B0 have made on-device image classification not just feasible but production-ready.

This article covers everything you need to build and deploy a computer vision model on a resource-constrained microcontroller: camera hardware selection, model architecture choices, the quantization pipeline, memory management strategies, and deployment via TensorFlow Lite Micro and Edge Impulse.

The Computer Vision MCU Landscape

Not all microcontrollers are suitable for image inference. The minimum requirements for viable on-device vision are roughly:

SRAM: 256 KB minimum, 512 KB recommended. A 96×96 grayscale input image alone occupies 9 KB; activation memory for even a small CNN can require 64–200 KB.
Flash: 1 MB minimum for a quantized MobileNetV1 at 0.25 alpha plus TFLM runtime
Clock speed: 80 MHz minimum, 240 MHz preferred for acceptable latency (< 500 ms per frame)
Camera interface: Dedicated parallel camera interface (DVP) or SPI for camera modules; raw GPIO bit-banging is too slow for real-time image capture

ESP32-S3

Espressif’s ESP32-S3 is the most popular platform for edge vision in 2026. Its Xtensa LX7 dual-core processor runs at 240 MHz and includes vector instructions that Espressif's ESP-NN kernel library uses to accelerate convolutional operations by 5–8× versus scalar code. With 512 KB internal SRAM and support for up to 8 MB of external PSRAM, the ESP32-S3 can run MobileNetV2 at 0.35 width multiplier on 96×96 images in under 100 ms.

The ESP32-S3 supports OV2640 and OV5640 camera modules via its LCD/camera interface peripheral, making complete camera-to-inference pipelines straightforward.

STM32H7 Series

STMicroelectronics’ STM32H7 combines a Cortex-M7 core at 480 MHz with 1 MB of SRAM, dual-bank flash, a hardware JPEG codec, DCMI (camera interface), DMA2D (graphics accelerator), and an L1 cache. For computer vision workloads requiring the lowest possible latency or the largest model, the H7 outperforms the ESP32-S3.

The DCMI peripheral accepts camera sensor output (OV7670, HM01B0) directly into SRAM buffers via DMA, enabling zero-CPU-overhead frame capture.

STM32N6 with NPU

For highest performance, ST’s STM32N6 pairs a Cortex-M55 core with ST's in-house Neural-ART Accelerator NPU, which ST rates at up to 600 GOPS, enabling MobileNetV2 at full resolution in under 30 ms. This chip targets industrial inspection and smart camera applications where accuracy and throughput justify the higher price point.

Choosing the Right Vision Model Architecture

The key trade-off in edge vision is model accuracy versus compute and memory requirements. The standard progression from least to most capable:

MobileNetV1 (0.25 or 0.10 alpha)

Size: 68–242 KB (quantized INT8)
SRAM needed: 50–150 KB for activations
Inference time: 50–200 ms on ESP32-S3 @240 MHz
Top-1 accuracy on ImageNet: 50–63% (at small alpha)
Best for: Binary classification, small N-class problems (N ≤ 10) where training data covers the domain well

MobileNetV2 (0.35 alpha)

Size: 384 KB quantized
SRAM needed: 200–300 KB peak activations
Inference time: 90–250 ms on ESP32-S3
Best for: More challenging classification tasks; requires ESP32-S3 or STM32H7 for acceptable latency

EfficientNet-Lite0

Size: 1.3 MB quantized
SRAM needed: 400+ KB (requires PSRAM on ESP32-S3)
Inference time: 200–400 ms
Best for: Cases where accuracy is more critical than latency; industrial inspection with 1–5 fps requirement

Custom Person/Object Detector (FOMO)

Edge Impulse’s FOMO (Faster Objects, More Objects) is an architecture specifically designed for MCU object detection. It detects object presence and approximate location without full bounding-box regression, running in under 50 ms on ESP32-S3 with a model smaller than 200 KB.

[Figure: Computer vision on edge devices: model and hardware selection]

Camera Hardware and Image Acquisition

Camera Sensor Selection

For edge vision applications, the most commonly used camera sensors are:

Sensor	Resolution	Interface	Notes
OV2640	2MP (1600×1200)	SCCB + parallel DVP	Most popular ESP32-S3 camera; AI-Thinker ESP32-CAM ships with this
OV7670	0.3MP (640×480)	SCCB + parallel DVP	Low cost, works on STM32 DCMI
HM01B0	0.08MP (320×240)	SPI or DVP	Ultra-low power (1 mW at 1 fps); used in wearables
OV5640	5MP (2592×1944)	Parallel DVP or MIPI CSI-2	For high-resolution inspection; the MIPI CSI-2 mode requires an application processor

Resolution and Color Space

For MCU-class computer vision, work in the lowest resolution that maintains acceptable accuracy. Most TinyML vision models use:

96×96 grayscale: minimum viable for binary classification
96×96 RGB888: for color-sensitive tasks; 27 KB per frame
160×120 grayscale: balance of resolution and memory
224×224 RGB: standard ImageNet input; only feasible on H7 or with PSRAM

Capture at the sensor output mode closest to (but not below) your model input size, then downsample to the exact input dimensions in firmware rather than relying on the sensor to produce the final size. Downsampling from a higher-resolution capture generally gives a cleaner image than using a sensor whose native resolution equals the target.
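The firmware downsampling step can be sketched in C as below. This is a minimal illustration, not a tuned implementation: it assumes the frame arrives as host-order RGB565 words (some drivers deliver the two bytes swapped), uses nearest-neighbour sampling, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert one RGB565 pixel to 8-bit grayscale with integer luma weights. */
static uint8_t rgb565_to_gray(uint16_t px)
{
    uint8_t r5 = (px >> 11) & 0x1F, g6 = (px >> 5) & 0x3F, b5 = px & 0x1F;
    /* Expand 5/6/5-bit channels to 8 bits by replicating the high bits. */
    uint8_t r = (uint8_t)((r5 << 3) | (r5 >> 2));
    uint8_t g = (uint8_t)((g6 << 2) | (g6 >> 4));
    uint8_t b = (uint8_t)((b5 << 3) | (b5 >> 2));
    /* 77 + 150 + 29 = 256, so the shift keeps the result in 0..255. */
    return (uint8_t)((77u * r + 150u * g + 29u * b) >> 8);
}

/* Nearest-neighbour resize of an RGB565 frame into a grayscale model input. */
void downsample_rgb565_to_gray(const uint16_t *src, size_t src_w, size_t src_h,
                               uint8_t *dst, size_t dst_w, size_t dst_h)
{
    assert(src && dst && dst_w > 0 && dst_h > 0);
    assert(src_w >= dst_w && src_h >= dst_h);
    for (size_t y = 0; y < dst_h; y++) {
        size_t sy = y * src_h / dst_h;        /* nearest source row */
        for (size_t x = 0; x < dst_w; x++) {
            size_t sx = x * src_w / dst_w;    /* nearest source column */
            dst[y * dst_w + x] = rgb565_to_gray(src[sy * src_w + sx]);
        }
    }
}
```

Note that scaling a 160×120 frame straight to 96×96 this way stretches the image; cropping to a square region first avoids the distortion at the cost of field of view.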

Memory Management for On-Device Vision

Memory management is the dominant engineering challenge in MCU vision. The three memory consumers are:

Frame buffer: Raw camera frame. For 160×120 RGB565: 38,400 bytes (37.5 KB). For 320×240 RGB888: 230,400 bytes (225 KB). Double-buffering for continuous inference doubles these figures.
Model tensor arena: Activation memory needed during inference. This is the binding constraint for most vision models.
TFLM/runtime overhead: ~20–30 KB for the TFLM interpreter and scratch buffer.
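The three consumers above add up to a simple budget that is worth checking before choosing a resolution or model. The helper below is a rough sketch; the type and function names are hypothetical, and the arena and runtime sizes must come from your own model and build.

```c
#include <assert.h>
#include <stddef.h>

/* Pixel formats; each enumerator's value is its size in bytes per pixel. */
typedef enum { PIX_GRAY8 = 1, PIX_RGB565 = 2, PIX_RGB888 = 3 } pix_format_t;

/* Size of one raw frame in bytes. */
size_t frame_bytes(size_t width, size_t height, pix_format_t fmt)
{
    assert(fmt >= PIX_GRAY8 && fmt <= PIX_RGB888);
    return width * height * (size_t)fmt;
}

/* Total RAM needed: frame buffer(s) + tensor arena + runtime overhead. */
size_t vision_ram_bytes(size_t width, size_t height, pix_format_t fmt,
                        int double_buffered,
                        size_t arena_bytes, size_t runtime_bytes)
{
    size_t frames = double_buffered ? 2u : 1u;
    return frames * frame_bytes(width, height, fmt) + arena_bytes + runtime_bytes;
}

/* Returns 1 if the budget fits in the available RAM, 0 otherwise. */
int fits_in_ram(size_t needed, size_t available)
{
    return needed <= available;
}
```

For example, a double-buffered 96×96 grayscale input with a 150 KB arena and 30 KB of runtime overhead comes to 202,752 bytes, which fits in 512 KB of internal SRAM, whereas a single 320×240 RGB888 frame already takes 230,400 bytes before any arena is allocated.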

Strategy for ESP32-S3 (8 MB PSRAM): Place the frame buffer and, if necessary, the tensor arena in PSRAM. Internal SRAM should hold the RTOS stacks, DMA buffers, and latency-sensitive data structures. PSRAM sits behind the cache on an external SPI bus, so accesses that miss the cache are several times slower than internal SRAM, and inference time rises when the tensor arena lives in PSRAM.

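Strategy for STM32H7 (1 MB SRAM, no PSRAM): Use DTCM (Data Tightly Coupled Memory, 128 KB) for the tensor arena, since it offers the fastest access, and AXI SRAM (512 KB on the STM32H743) for the frame buffer, because the DMA controller that serves the DCMI cannot reach DTCM. Model weights can stay in flash or, if space allows, be copied into the remaining AXI SRAM.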

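#include <stdint.h>

// Section names (.dtcm_data, .axisram) must match your linker script;
// ARENA_SIZE and FRAME_SIZE are project-specific constants.
// STM32H7: place the tensor arena in DTCM for fastest inference.
// TFLM expects the arena to be 16-byte aligned.
__attribute__((section(".dtcm_data"), aligned(16)))
static uint8_t tensor_arena[ARENA_SIZE];

// Place the frame buffer in AXI SRAM for DMA compatibility.
// 32-byte alignment matches the Cortex-M7 D-cache line size.
__attribute__((section(".axisram"), aligned(32)))
static uint8_t frame_buffer[FRAME_SIZE];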

Complete Inference Pipeline: ESP32-S3 + OV2640 + Edge Impulse

Here is the end-to-end flow for image classification on ESP32-S3:

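Camera initialization: Configure OV2640 via SCCB, set output format to JPEG (compressed on the sensor, so frames are smaller, but the ESP32-S3 has no hardware JPEG decoder and must decode in software) or RGB565 (larger frames, no decode step)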
Frame capture: Trigger frame capture via esp_camera_fb_get()
Preprocessing: Convert to model input format (resize to 96×96, normalize to float or INT8)
Inference: Run run_classifier() from Edge Impulse SDK or interpreter.Invoke() in TFLM
Post-processing: Apply a confidence threshold to the softmax scores, output classification result
Camera release: Return frame buffer with esp_camera_fb_return(fb)

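#include <stdlib.h>
#include "esp_camera.h"
#include "img_converters.h"          // fmt2rgb888()
#include <your_model_inferencing.h>  // Edge Impulse export for your project

static uint8_t *rgb888;  // decoded 96x96 frame, 3 bytes per pixel
// Edge Impulse image models read one float per pixel, packed as 0xRRGGBB.
// fmt2rgb888() is reported to store B, G, R; verify, and swap p[0]/p[2] if needed.
static int get_pixels(size_t offset, size_t length, float *out) {
    for (size_t i = 0; i < length; i++) {
        const uint8_t *p = &rgb888[(offset + i) * 3];
        out[i] = (float)((p[2] << 16) | (p[1] << 8) | p[0]);
    }
    return 0;
}

void classify_frame(void) {
    camera_fb_t *fb = esp_camera_fb_get();
    if (!fb) return;

    // Assumes the camera is configured for FRAMESIZE_96X96, so the decoded
    // image fills the buffer exactly; a larger frame would overflow it.
    rgb888 = (uint8_t *)malloc(96 * 96 * 3);
    bool ok = rgb888 && fmt2rgb888(fb->buf, fb->len, fb->format, rgb888);
    esp_camera_fb_return(fb);  // hand the frame back to the driver early
    if (ok) {
        signal_t signal;
        signal.total_length = 96 * 96;  // one feature per pixel
        signal.get_data = &get_pixels;
        ei_impulse_result_t result;
        if (run_classifier(&signal, &result, false) != EI_IMPULSE_OK) {
            // handle or log the inference error
        }
    }
    free(rgb888);
}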
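The preprocessing and post-processing steps in the flow above reduce to two small routines, sketched here in plain C. The function names are hypothetical. The scale and zero point are assumed to come from the model's input tensor, and the scores from the classifier output (for Edge Impulse, the `result.classification[i].value` fields).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Quantize a pixel already normalized to [0, 1] for an INT8 model:
 * q = round(x / scale) + zero_point, clamped to the int8 range. */
int8_t quantize_pixel(float x, float scale, int zero_point)
{
    assert(scale > 0.0f);
    float v = x / scale;
    int q = (int)(v >= 0.0f ? v + 0.5f : v - 0.5f) + zero_point;
    if (q < -128) q = -128;
    if (q > 127) q = 127;
    return (int8_t)q;
}

/* Return the index of the highest-scoring class, or -1 if that score
 * falls below the confidence threshold. */
int top_class(const float *scores, size_t n, float threshold)
{
    assert(scores && n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (scores[i] > scores[best]) best = i;
    }
    return scores[best] >= threshold ? (int)best : -1;
}
```

Returning -1 for low-confidence frames lets the application treat "unsure" as a distinct outcome instead of reporting whichever class happens to score highest.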

Production Considerations

Power management: The OV2640 draws 50–100 mW when active. For battery-powered applications, power-gate the camera between inference cycles. At 1 fps, the camera and the 240 MHz inference pass are active for only part of each second, so average power is set by the duty cycle rather than by the peak draw.
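A duty-cycle estimate makes the power budget concrete. The sketch below is plain arithmetic with hypothetical function names; the active and sleep power figures you feed it are assumptions that should be replaced with measurements from your own board.

```c
#include <assert.h>

/* Average power (mW) of a system that is active for active_ms out of
 * every period_ms and sleeps for the rest of the period. */
double average_power_mw(double active_mw, double sleep_mw,
                        double active_ms, double period_ms)
{
    assert(period_ms > 0.0 && active_ms >= 0.0 && active_ms <= period_ms);
    double duty = active_ms / period_ms;
    return duty * active_mw + (1.0 - duty) * sleep_mw;
}

/* Estimated runtime in hours for a battery of the given capacity (mWh). */
double battery_hours(double battery_mwh, double avg_mw)
{
    assert(avg_mw > 0.0);
    return battery_mwh / avg_mw;
}
```

As an illustration with assumed numbers: a system drawing 400 mW while active for 200 ms of each second and 1 mW while asleep averages about 80.8 mW, roughly a fifth of its peak draw.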

Inference triggering: For always-on applications, use a motion detector (PIR sensor or simple pixel-difference algorithm) to wake the CNN, avoiding full-resolution inference on empty frames.
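The pixel-difference trigger can be as simple as the sketch below, which compares two grayscale frames of equal size. The function name and both thresholds are hypothetical and need tuning against sensor noise and lighting flicker in the target environment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if at least min_changed pixels differ by more than
 * pixel_delta between the previous and current frame, else 0. */
int motion_detected(const uint8_t *prev, const uint8_t *curr, size_t n_pixels,
                    uint8_t pixel_delta, size_t min_changed)
{
    assert(prev && curr && min_changed > 0);
    size_t changed = 0;
    for (size_t i = 0; i < n_pixels; i++) {
        int diff = (int)curr[i] - (int)prev[i];
        if (diff < 0) diff = -diff;
        if (diff > pixel_delta && ++changed >= min_changed) {
            return 1;  /* early exit once enough pixels have changed */
        }
    }
    return 0;
}
```

Running this on a heavily downsampled frame keeps the check cheap, and requiring a minimum count of changed pixels filters out isolated noisy pixels.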

Model updates: Edge vision models need periodic retraining as environmental conditions (lighting, background) change. Use the same OTA update pipeline described in our Edge Impulse tutorial.

Hackster.io has extensive community projects showcasing ESP32 + OV2640 vision systems that provide real-world implementation reference.

Conclusion

On-device computer vision has moved from research to production. The ESP32-S3 and STM32H7 provide sufficient compute, memory, and camera interface hardware to run useful image classification and object detection models in real time. The software stack — TFLM, Edge Impulse, and hardware-optimized backends — handles the complexity of model quantization and efficient inference.

The engineering challenges are real — memory management, camera integration, and model optimization for the target hardware all require expertise — but they are solvable with the right approach. UABit’s firmware and embedded development team has deployed production edge vision systems across manufacturing inspection, agricultural monitoring, and smart home applications. If you are building a vision-enabled IoT product, we can help you design a system that meets your accuracy, latency, and power requirements.