Why observability needs Apache Iceberg
Thursday, October 2, 2025, 11:00, by InfoWorld
Apache Iceberg is a mature open table format that’s been battle-tested in the broader analytics world for years. Now it’s time to apply the benefits of an open and scalable standard to an observability field that badly needs to break out of its siloed heritage.
It isn’t that observability has entirely resisted standards. OpenTelemetry is a well-adopted model for collecting metrics, logs, and traces. But once that data lands, most stacks still fragment it into silos. Joining observability with business data typically means exporting, duplicating, or downsampling. It’s a costly and error-prone process that turns simple questions, such as “Which customers were affected by an outage?” or “What was the revenue impact?”, into bespoke data projects.

Iceberg standardizes how large analytical data sets are stored and evolved on object storage, with ACID transactions, snapshot isolation, time travel, and schema evolution. It’s a neutral table layer that any compatible compute engine can use, including Spark, Flink, Trino/Presto, Dremio, and the major cloud data platforms. That turns telemetry into first-class data that lives alongside customer, finance, and product tables without endless copy pipelines.

Iceberg advantages

The breakthrough Iceberg achieves for observability is that it lets you keep logs, metrics, and traces as data sets in the same lakehouse that already holds business data. You can explore telemetry with a SQL engine, a notebook, or your existing BI tools without transferring terabytes of data. The glue code and format translation steps that create drift go away. And other Iceberg features are ideally matched to observability’s quirks:

- Iceberg’s seamless schema evolution fits observability’s frequent schema changes. Adding a new label here, renaming a field there, or absorbing a late-arriving dimension from an upstream service is no longer a big deal: you can add or rename columns and adjust partition specifications over time without rewriting historical data, and hidden partitioning keeps queries independent of how the data is laid out. It’s a much better fit for high-cardinality telemetry than rigid, pre-declared schemas. (A sketch of the table DDL and an in-place schema change follows after this list.)
- Iceberg’s manifest and snapshot model provides atomic commits, compaction, and data skipping, so engines can sustain write pressure while keeping read latencies predictable. That’s critical for high-volume telemetry pipelines, which need consistent, append-heavy writes with occasional deletes and backfills. You can correct bad batches, enforce retention, and backfill missing windows without tearing down the table.
- Iceberg’s “time travel” feature, which lets you query a table exactly as it looked at some point in the past, is ideally suited to questions like “What changed between 09:00 and 09:05?” Snapshot metadata lets you query the table as it existed at a prior point in time, compare revisions, and reconstruct state without maintaining parallel stores or ad hoc archives. (A time-travel sketch also follows after this list.)
- Iceberg’s format is open and widely supported, so you can grant access to the same telemetry tables across teams and tools, apply lake-level governance, and avoid the export/import loop that causes delays and spikes costs. Your observability data benefits from the same security and governance controls as the rest of your data platform.
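To make the schema story concrete, here is a minimal sketch in Python using Spark with the Iceberg runtime. The catalog name “lake”, the “observability.spans” table, the warehouse path, and every column name are assumptions for illustration only; equivalent DDL works from any Iceberg-aware SQL engine.

    from pyspark.sql import SparkSession

    # Assumes the Iceberg Spark runtime jar is on the classpath and the object
    # store behind the (hypothetical) warehouse path is reachable.
    spark = (
        SparkSession.builder
        .appName("telemetry-iceberg-sketch")
        .config("spark.sql.extensions",
                "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.lake.type", "hadoop")
        .config("spark.sql.catalog.lake.warehouse", "s3://example-bucket/warehouse")
        .getOrCreate()
    )

    # A span table with hidden partitioning: queries filter on ts and service,
    # and Iceberg maps those predicates onto partitions by itself.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS lake.observability.spans (
            ts          TIMESTAMP,
            trace_id    STRING,
            service     STRING,
            http_status INT,
            duration_ms DOUBLE,
            customer_id STRING
        )
        USING iceberg
        PARTITIONED BY (days(ts), bucket(32, service))
    """)

    # Schema and partition-spec changes are metadata operations: existing data
    # files are not rewritten and existing queries keep working.
    spark.sql("ALTER TABLE lake.observability.spans ADD COLUMN deployment_ring STRING")
    spark.sql("ALTER TABLE lake.observability.spans ADD PARTITION FIELD hours(ts)")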
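And a time-travel sketch, continuing from the session and hypothetical table above. With Spark 3.3 or later, TIMESTAMP AS OF reads the table as of a snapshot commit time, so comparing two timestamps answers “what changed between 09:00 and 09:05?” in terms of table state, and the snapshots metadata table lists the commits themselves.

    # Table state as of two points in commit time; the timestamps are illustrative.
    at_0900 = spark.sql("""
        SELECT service, count(*) AS error_count
        FROM lake.observability.spans TIMESTAMP AS OF '2025-10-02 09:00:00'
        WHERE http_status >= 500
        GROUP BY service
    """)
    at_0905 = spark.sql("""
        SELECT service, count(*) AS error_count
        FROM lake.observability.spans TIMESTAMP AS OF '2025-10-02 09:05:00'
        WHERE http_status >= 500
        GROUP BY service
    """)

    # Errors that appeared in the five-minute window, per service.
    delta = (
        at_0905.withColumnRenamed("error_count", "after")
        .join(at_0900.withColumnRenamed("error_count", "before"), "service", "left")
        .na.fill(0, ["before"])
        .selectExpr("service", "after - before AS new_errors")
    )
    delta.show(truncate=False)

    # Every commit Iceberg has retained for the table.
    spark.sql("SELECT committed_at, snapshot_id, operation "
              "FROM lake.observability.spans.snapshots").show(truncate=False)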
Observability has always been about looking back to understand technical failures after they occur. That’s still important, but the job is bigger now. Business leaders want to understand user impact, apply business metrics, and feed product decisions with real usage data. That means the ability to join telemetry with customer, billing, and feature data should be a standard feature, not a special project. Iceberg lets you ask higher-order questions directly:

- Join error logs or degraded traces to customer tables to see who was affected by a slowdown or outage, without exporting data into a separate warehouse (a join along these lines is sketched at the end of this piece).
- Correlate latency spikes with conversion rates or cart abandonment using the same compute engine that analysts already trust.
- Create forecasts by combining metrics with historical usage and seasonality models, using time travel for “what if” modeling.
- Apply retention and tiering policies as part of table maintenance rather than a set of disconnected, bespoke life-cycle rules.

Iceberg and OTel

Iceberg enhances the value of OpenTelemetry. Whereas OTel standardized generation and collection, Iceberg standardizes persistence and evolution. You can now ingest data reliably and keep it useful over time. Telemetry becomes durable, queryable, and shareable at enterprise scale. The combination of Iceberg and OTel doesn’t eliminate the need for good instrumentation and sampling strategies, but it creates a storage control plane that complements existing collectors, stream processors, and alerting systems without requiring you to replace them.

You don’t have to overhaul your infrastructure to see how Iceberg can reduce costs and increase agility:

- Start by picking a high-value data set that’s revenue-critical or full of high-cardinality metrics that are expensive to keep elsewhere, and move it to an Iceberg table managed by your current observability tools.
- Choose an Iceberg-compatible engine such as Trino, Spark, or Dremio and use it to validate basic queries, partition pruning, and time-range scans on your telemetry.
- Identify two or three routine joins your incident calls always ask for and make them first-class SQL instead of scripts.
- Use Iceberg’s table maintenance features to enforce data life cycles and compact the small files generated by streaming ingestion (also sketched at the end of this piece).
- Compare query latency, cost, and throughput to your current path.
- Expand to traces or derived metrics once the basics are stable.

A storage standard has been the missing piece between well-collected telemetry and broadly useful telemetry. Apache Iceberg lets observability teams plug into the same data platform their companies already rely on, so they can ask better questions without waiting on bespoke pipelines or copying data. Use it to rewire the systems you already have without recreating them from scratch.
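As a closing illustration of the “routine joins as first-class SQL” step, here is a hedged sketch that joins the span table from the earlier sketches against a hypothetical customer table in the same lakehouse; every table and column name is an assumption, not a schema the article prescribes.

    # Who was affected by the 09:00-09:30 incident, by account and tier?
    # lake.crm.customers is a hypothetical business table in the same catalog.
    impacted = spark.sql("""
        SELECT c.account_name,
               c.tier,
               count(*) AS failed_requests
        FROM lake.observability.spans s
        JOIN lake.crm.customers c
          ON s.customer_id = c.customer_id
        WHERE s.ts BETWEEN TIMESTAMP '2025-10-02 09:00:00'
                       AND TIMESTAMP '2025-10-02 09:30:00'
          AND s.http_status >= 500
        GROUP BY c.account_name, c.tier
        ORDER BY failed_requests DESC
    """)
    impacted.show(20, truncate=False)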
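And a sketch of the table-maintenance step, again assuming the session and table from the earlier sketches. expire_snapshots and rewrite_data_files are standard Iceberg Spark procedures; the retention timestamp and the binpack strategy are illustrative choices, not recommendations.

    # Drop snapshots older than the retention window, releasing their data files.
    spark.sql("""
        CALL lake.system.expire_snapshots(
            table => 'observability.spans',
            older_than => TIMESTAMP '2025-09-02 00:00:00'
        )
    """)

    # Compact the small files that streaming ingestion tends to produce.
    spark.sql("""
        CALL lake.system.rewrite_data_files(
            table => 'observability.spans',
            strategy => 'binpack'
        )
    """)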
https://www.infoworld.com/article/4066477/why-observability-needs-apache-iceberg.html