Managing IoT Data at Scale: Ingestion, Storage, and Processing for Large Device Fleets

A single IoT device publishing one reading per minute is trivial to manage. A fleet of 50,000 devices doing the same generates 72 million data points per day—before you add events, alerts, and firmware logs. At that scale, the architectural choices you made for your MVP will almost certainly break: a relational database will thrash, a single MQTT broker will saturate, and your application will start dropping data under peak load. Managing IoT data at scale is a distinct engineering discipline, and the right patterns will determine whether your platform is a competitive advantage or a constant source of operational pain. This article walks through the full data pipeline—from device to dashboard—with concrete technology choices and trade-offs.
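The fleet arithmetic above is easy to reproduce for your own numbers. The following is a minimal sketch; the function names are illustrative, and the inputs simply mirror the 50,000-device example.

```python
# Back-of-the-envelope telemetry volume for a device fleet.
# The figures mirror the example in the text; substitute your own.

MINUTES_PER_DAY = 24 * 60
SECONDS_PER_DAY = 24 * 60 * 60


def daily_points(devices: int, readings_per_minute: float) -> int:
    """Total telemetry data points the fleet produces in one day."""
    return int(devices * readings_per_minute * MINUTES_PER_DAY)


def average_rate_per_second(devices: int, readings_per_minute: float) -> float:
    """Mean sustained message rate the ingestion layer must absorb."""
    return daily_points(devices, readings_per_minute) / SECONDS_PER_DAY


points = daily_points(50_000, 1)
rate = average_rate_per_second(50_000, 1)
print(f"{points:,} points/day, about {rate:.0f} messages/second on average")
```

Note that this is an average. Devices that publish on the same clock tick produce bursts well above it, so size ingestion for the peak rather than the mean.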

The IoT Data Pipeline Architecture

Before diving into components, it’s useful to have a shared model of the pipeline. IoT data flows through four logical stages:

Ingestion — receiving data from devices via MQTT, HTTP, CoAP, or WebSocket
Stream Processing — transforming, filtering, enriching, and routing data in real time
Storage — persisting time-series measurements, device state, and events durably
Serving — querying stored data for dashboards, APIs, ML training sets, and reports

Each stage can be scaled independently, and the choice of technology at each layer constrains what’s possible downstream. A common mistake is over-indexing on storage (everyone picks a database first) while neglecting ingestion capacity—which is typically where things break first.

Ingestion: MQTT Brokers and Message Queues

The dominant protocol for device-to-cloud ingestion is MQTT. An MQTT broker receives published messages from devices on topics like devices/{deviceId}/telemetry and forwards them to subscribers—typically your stream processing layer or a cloud IoT service.
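As a rough illustration of what a subscriber does with such a topic, the sketch below extracts the device ID from a devices/{deviceId}/telemetry topic and decodes the payload. It assumes JSON payloads, which the article does not specify, and the names TelemetryMessage and parse_message are hypothetical. No MQTT client library is involved; this is only the parsing step.

```python
import json
import re
from dataclasses import dataclass
from typing import Optional

# Topic layout taken from the example in the text. MQTT wildcards (+, #)
# and extra path levels are rejected.
TOPIC_PATTERN = re.compile(r"^devices/(?P<device_id>[^/+#]+)/telemetry$")


@dataclass(frozen=True)
class TelemetryMessage:
    device_id: str
    payload: dict


def parse_message(topic: str, raw_payload: bytes) -> Optional[TelemetryMessage]:
    """Return a parsed message, or None if the topic or payload is unusable."""
    match = TOPIC_PATTERN.match(topic)
    if match is None:
        return None
    try:
        # Assumption: payloads are UTF-8 encoded JSON objects.
        payload = json.loads(raw_payload)
    except ValueError:  # covers invalid JSON and invalid UTF-8
        return None
    if not isinstance(payload, dict):
        return None
    return TelemetryMessage(device_id=match["device_id"], payload=payload)
```

Returning None instead of raising keeps a single malformed message from stopping the subscriber loop. In practice you would probably also count or log the rejects.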

At small scale, a single Mosquitto or EMQX broker handles tens of thousands of concurrent connections. At large scale, you need clustering. EMQX supports horizontal clustering with millions of concurrent connections per cluster node. HiveMQ is a popular enterprise choice with a Kubernetes-native deployment model. Managed options include AWS IoT Core (which scales automatically but charges per message and connection) and Azure IoT Hub.

Beyond MQTT, many architectures insert a message queue between the broker and downstream processing. Apache Kafka is the dominant choice here. MQTT messages land in Kafka topics partitioned by device ID or geographic region, and multiple consumers (stream processors, storage writers, alerting engines) read from Kafka independently at their own pace. Kafka’s durable log retention means you can replay events if a downstream service is temporarily down—a critical property for reliability.
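Partitioning by device ID matters because all messages with the same key land in the same partition, which preserves per-device ordering. The sketch below shows the idea with a generic hash. It is not Kafka's actual partitioner (the Java client's default hashes the key bytes with murmur2); SHA-256 is used here only because it ships with the standard library, and partition_for is a hypothetical helper.

```python
import hashlib


def partition_for(device_id: str, num_partitions: int) -> int:
    """Map a device ID to a partition index in [0, num_partitions).

    Illustrative only: Kafka clients use their own hash function, so the
    indices produced here will not match a real cluster.
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


for device in ("pump-17", "pump-18", "valve-2"):
    print(device, "->", partition_for(device, 12))
```

Because the index depends on the partition count, changing that count remaps devices to different partitions. This is one reason to choose the count with headroom up front.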

For HTTP-based ingestion (common for less-constrained devices or batch uploads), an API gateway fronting a Kafka producer is a clean pattern. Keep HTTP endpoints stateless; push work downstream.

Stream Processing: Kafka Streams, Flink, and Rule Engines

Device telemetry is rarely useful in its raw form. You need to:

Validate — reject malformed or out-of-range values
Enrich — attach device metadata (location, owner, firmware version) to each message
Aggregate — compute rolling averages, min/max windows for anomaly detection
Route — send critical alerts to a notification queue while routing bulk telemetry to cold storage
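Three of these steps (validate, enrich, route) can be sketched as plain functions over one reading; aggregation is left out because it needs state across readings. Everything concrete here is an assumption made for illustration: the temp_c field, the valid range, the alert threshold, the destination names, and the in-memory DEVICE_METADATA registry.

```python
from typing import Optional, Tuple

# Hypothetical device registry; in practice a cache or lookup service.
DEVICE_METADATA = {
    "pump-17": {"location": "building-3", "firmware": "1.4.2"},
}

# Assumed plausible sensor range and alert threshold, in degrees Celsius.
VALID_RANGE_C = (-40.0, 125.0)
ALERT_THRESHOLD_C = 90.0


def validate(reading: dict) -> bool:
    """Reject readings with no device ID or a missing/out-of-range value."""
    value = reading.get("temp_c")
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    return "device_id" in reading and VALID_RANGE_C[0] <= value <= VALID_RANGE_C[1]


def enrich(reading: dict) -> dict:
    """Attach device metadata, leaving the original reading untouched."""
    metadata = DEVICE_METADATA.get(reading["device_id"], {})
    return {**reading, **metadata}


def route(reading: dict) -> str:
    """Pick a destination: alerts for critical values, bulk otherwise."""
    return "alerts" if reading["temp_c"] >= ALERT_THRESHOLD_C else "bulk-telemetry"


def process(reading: dict) -> Optional[Tuple[str, dict]]:
    """Run one reading through validate, enrich, and route."""
    if not validate(reading):
        return None
    enriched = enrich(reading)
    return route(enriched), enriched
```

Each function takes a reading and returns a value without touching shared state, so the same logic could sit inside a Kafka Streams or Flink operator, or a serverless function, with little change.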

Apache Kafka Streams handles this well for teams already invested in the Kafka ecosystem. It’s a lightweight library (not a separate cluster) that processes records one at a time rather than in micro-batches, and it can provide exactly-once semantics when that processing guarantee is enabled. For more complex stateful processing (windowed joins, complex event processing), Apache Flink provides a richer API and better operational tooling, at the cost of a heavier deployment footprint.

At smaller scales, AWS Lambda functions invoked by AWS IoT Core rules, or Azure Stream Analytics jobs reading from IoT Hub, can handle processing without managing infrastructure. For rule-based alerting (common in industrial IoT), Node-RED, InfluxDB’s Flux-based alerts, and purpose-built IoT platforms such as ThingsBoard with its rule engine offer low-code approaches that reduce engineering overhead.

Key consideration: keep your stream processing stateless where possible. Stateful processors are harder to scale and recover. Push state into your storage layer (Redis for hot state, your time-series DB for historical state) and read from there in your processing logic.
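One way to read this advice: the processor holds no state of its own and asks a store for whatever it needs. The sketch below detects a sudden jump between consecutive readings this way. StateStore, InMemoryStore, and detect_jump are hypothetical names, and the in-memory class is only a stand-in for an external store such as Redis.

```python
from typing import Dict, Optional, Protocol


class StateStore(Protocol):
    """Minimal interface the processor needs from an external store."""

    def get(self, key: str) -> Optional[float]: ...

    def set(self, key: str, value: float) -> None: ...


class InMemoryStore:
    """Stand-in for an external store such as Redis; for local testing only."""

    def __init__(self) -> None:
        self._data: Dict[str, float] = {}

    def get(self, key: str) -> Optional[float]:
        return self._data.get(key)

    def set(self, key: str, value: float) -> None:
        self._data[key] = value


def detect_jump(store: StateStore, device_id: str, value: float, max_delta: float) -> bool:
    """Return True if value differs from the previous reading by more than max_delta.

    The function keeps nothing between calls; the last reading lives in the store.
    """
    key = f"last_reading:{device_id}"
    previous = store.get(key)
    store.set(key, value)
    return previous is not None and abs(value - previous) > max_delta
```

Because any instance can handle any message, processors can be added or restarted freely. The separate get and set here are not atomic, so with several concurrent workers per device a real deployment would likely need an atomic read-and-write operation in the store.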

[Figure: IoT data pipeline architecture diagram showing ingestion to storage to serving]

Time-Series Storage: Choosing the Right Database

This is where most teams spend the most time. IoT telemetry is fundamentally time-series data: measurements indexed by timestamp and device ID, with high write throughput and time-range query patterns. Traditional relational databases (PostgreSQL, MySQL) can handle time-series data with proper indexing and partitioning, but they weren’t designed for it and will struggle above a few thousand writes/second without significant tuning.

Purpose-built time-series databases include:

InfluxDB is the most widely adopted dedicated TSDB for IoT. InfluxDB 3.0 (now built on Apache Arrow and Parquet) offers columnar storage, SQL support via Flight SQL, and a separation of compute and storage that makes it cloud-native. It handles millions of writes/second on appropriate hardware. The Flux query language (InfluxDB 2.x) is powerful but has a steep learning curve; the move to SQL in 3.0 lowers the barrier.

TimescaleDB extends PostgreSQL with hypertables, automatic partitioning, and time-series-specific functions. If your team knows PostgreSQL well, TimescaleDB lets you use familiar tooling (pgAdmin, Sequelize, standard JDBC drivers) while gaining TSDB performance. See timescale.com for benchmarks.

QuestDB is a newer entrant optimized for write throughput and columnar storage with a SQL interface. It excels at ingesting extremely high-frequency data (millions of rows/second) and running analytical queries.

For cold storage (data older than 30–90 days), columnar file formats like Apache Parquet on object storage (S3, GCS) cost dramatically less than keeping data in a hot TSDB. Tools like Apache Spark, DuckDB, or BigQuery can query Parquet directly. A tiered storage architecture—hot in InfluxDB/Timescale, cold in Parquet—is standard for mature IoT platforms.

Device State vs. Telemetry: Don’t Conflate Them

A subtle but important architectural point: telemetry (the stream of measurements—temperature readings, GPS coordinates, power consumption) and device state (the current configuration and status of a device—firmware version, connectivity status, desired vs. reported settings) are different kinds of data that need different storage patterns.

Telemetry belongs in your time-series database. Device state belongs in a document store (MongoDB, DynamoDB) or in a dedicated device registry. AWS IoT Core’s Device Shadow and Azure IoT Hub’s Device Twin are cloud-native solutions for the device state problem. For self-hosted architectures, a Redis hash per device for hot state and a document DB for the persistent record is a common pattern.
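To make the desired-versus-reported idea concrete, here is a minimal sketch of a device state record and the settings still waiting to be applied. It is loosely inspired by the desired/reported split and is not the Device Shadow or Device Twin API; DeviceState and pending_changes are hypothetical names.

```python
from dataclasses import dataclass, field

# Sentinel so that a setting the device never reported still counts as pending,
# even when its desired value is None.
_MISSING = object()


@dataclass
class DeviceState:
    """Current configuration record for one device (not a time series)."""

    device_id: str
    desired: dict = field(default_factory=dict)   # what the platform wants
    reported: dict = field(default_factory=dict)  # what the device last confirmed

    def pending_changes(self) -> dict:
        """Settings whose desired value differs from the last reported value."""
        return {
            key: value
            for key, value in self.desired.items()
            if self.reported.get(key, _MISSING) != value
        }


state = DeviceState(
    "pump-17",
    desired={"interval_s": 30, "firmware": "1.5.0"},
    reported={"interval_s": 60, "firmware": "1.5.0"},
)
print(state.pending_changes())
```

The record is overwritten in place as the device reports back, which is the opposite access pattern from append-only telemetry. That difference is why the two fit different stores.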

Conflating these two data types into a single time-series table creates headaches: your TSDB fills up with configuration values that change infrequently, and your JOIN-heavy queries for “what is the current state of all devices in Building 3” become painful.

Data Lifecycle Management

Storing every data point forever is expensive and usually unnecessary. Define your data retention policy early:

Hot tier (0–7 days): Full-resolution data in your TSDB, fast queries, high-cost storage
Warm tier (7–90 days): Downsampled data (e.g., 1-minute averages instead of per-second) in TSDB
Cold tier (90+ days): Parquet on object storage, query with Spark or BigQuery when needed
Regulatory hold: Some industries (medical, energy) require raw data retention for 7+ years
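The age boundaries above translate directly into a lookup that a query router or archival job could use. This is a minimal sketch using the 7-day and 90-day cutoffs from the list; storage_tier is a hypothetical helper, and the regulatory hold is left out because it depends on rules outside the data itself.

```python
from datetime import datetime, timedelta, timezone

# Cutoffs taken from the tier list in the text.
HOT_WINDOW = timedelta(days=7)
WARM_WINDOW = timedelta(days=90)


def storage_tier(point_time: datetime, now: datetime) -> str:
    """Return which tier a data point belongs to, based on its age."""
    age = now - point_time
    if age < HOT_WINDOW:
        return "hot"
    if age < WARM_WINDOW:
        return "warm"
    return "cold"


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
print(storage_tier(now - timedelta(days=30), now))
```

Passing now as a parameter rather than reading the clock inside the function keeps the logic deterministic and easy to test.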

InfluxDB’s continuous queries (tasks in 2.x) and TimescaleDB’s continuous aggregates automate downsampling on a schedule, and both databases offer retention policies that expire old data. On the queue side, Kafka’s time-based and size-based retention limits cap storage, while log compaction keeps only the latest record per key. Build these policies before you have a petabyte of data to migrate.
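The downsampling these features perform is, at its core, a group-by-time-bucket average. The sketch below shows that computation for per-second samples reduced to 1-minute means. It is plain Python for illustration, not how either database implements it, and downsample_to_minutes is a hypothetical name.

```python
from collections import defaultdict
from statistics import fmean
from typing import Dict, List, Tuple


def downsample_to_minutes(samples: List[Tuple[int, float]]) -> List[Tuple[int, float]]:
    """Average (unix_seconds, value) samples into 1-minute buckets.

    Returns (minute_start_unix_seconds, mean_value) pairs in time order.
    """
    buckets: Dict[int, List[float]] = defaultdict(list)
    for timestamp, value in samples:
        # Truncate the timestamp to the start of its minute.
        buckets[timestamp - timestamp % 60].append(value)
    return [(minute, fmean(values)) for minute, values in sorted(buckets.items())]


samples = [(0, 10.0), (30, 20.0), (60, 30.0), (119, 50.0)]
print(downsample_to_minutes(samples))
```

A mean alone hides spikes, so many retention schemes also keep the per-bucket min, max, and count alongside it.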

Scaling the Pipeline: Horizontal vs. Vertical

When your pipeline starts struggling, the question is where the bottleneck is:

MQTT broker saturated? Add broker nodes to a cluster (EMQX, HiveMQ support this). Check if devices can batch messages or reduce publish frequency.
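Kafka lagging? Add consumer group instances (up to one per partition) and add partitions if you need more parallelism, keeping in mind that changing the partition count remaps keys to partitions. Profile consumer processing time.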
TSDB write saturation? Tune batch insert sizes. Add InfluxDB ingester nodes. Consider switching from single-point writes to batched line protocol writes.
Query latency high? Add read replicas, implement query result caching (Redis), pre-compute common aggregations with continuous queries or continuous aggregates.
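For the line protocol batch writes mentioned above, the sketch below formats points as InfluxDB line protocol and groups them into newline-joined request bodies. It is deliberately simplified: it emits float fields only, escapes only commas, equals signs, and spaces, and does not send anything over the network. The names to_line and batches are hypothetical, and the batch size is a tuning parameter you would need to measure for your own setup.

```python
from typing import Dict, Iterable, Iterator, List


def _escape(text: str, chars: str) -> str:
    """Backslash-escape each of the given characters."""
    for char in chars:
        text = text.replace(char, "\\" + char)
    return text


def to_line(measurement: str, tags: Dict[str, str], fields: Dict[str, float], timestamp_ns: int) -> str:
    """Format one point as: measurement,tag=value field=value timestamp.

    Simplified sketch: float fields only; string and integer field types
    are not handled.
    """
    tag_part = "".join(
        f",{_escape(key, ',= ')}={_escape(value, ',= ')}"
        for key, value in sorted(tags.items())
    )
    field_part = ",".join(
        f"{_escape(key, ',= ')}={float(value)}" for key, value in fields.items()
    )
    return f"{_escape(measurement, ', ')}{tag_part} {field_part} {timestamp_ns}"


def batches(lines: Iterable[str], size: int) -> Iterator[str]:
    """Yield newline-joined request bodies of at most `size` lines each."""
    chunk: List[str] = []
    for line in lines:
        chunk.append(line)
        if len(chunk) == size:
            yield "\n".join(chunk)
            chunk = []
    if chunk:
        yield "\n".join(chunk)
```

Sending one body per batch instead of one request per point cuts per-request overhead, which is usually the first thing to try before adding write capacity.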

Horizontal scaling is almost always the right path for IoT workloads because device fleet growth is gradual and predictable. Architect each pipeline stage to be stateless and independently scalable from the beginning.

For a practical guide to the transport layer that feeds these pipelines, see our MQTT Deep Dive article. If you’re evaluating cloud platforms that bundle many of these components, our IoT Cloud Platforms Overview compares AWS IoT, Azure IoT, Google Cloud, and Particle.

Our team at UABit designs and implements full-stack IoT data pipelines. Visit our IoT Connectivity Integration services page to learn more about how we architect data platforms for growing device fleets.

Conclusion

Scaling IoT data from prototype to production is an exercise in choosing the right tool at each pipeline stage and designing each stage to scale independently. MQTT + Kafka for ingestion, InfluxDB or TimescaleDB for time-series storage, Kafka Streams or Flink for stream processing, and a tiered storage strategy for lifecycle management form a battle-tested stack. Equally important: keep device state separate from telemetry, define retention policies before your storage costs spiral, and benchmark each stage under realistic load early in development. The patterns exist—applying them consistently is what separates an IoT platform that scales gracefully from one that becomes a maintenance burden at 10,000 devices.