Google Cloud’s BigLake-driven lakehouse updates aim to optimize performance, costs

Thursday, May 29, 2025, 16:08, by InfoWorld
Google Cloud has introduced new updates to its BigLake-driven data lakehouse that are designed to optimize performance and reduce costs.

The updates were made to BigQuery, Cloud Storage, Dataplex Universal Catalog, and Apache Spark.

BigLake is a service that enables data analytics and data engineering on both structured and unstructured data; enterprises combine it with existing offerings such as BigQuery, Cloud Storage, Dataplex Universal Catalog, and Apache Spark on Google Cloud to build out a lakehouse.

Vendor-neutral way to interact with metadata services

The first update came in the form of a new REST API Catalog for BigLake's metastore, a managed, unified metadata service that stores information about the data in an enterprise's data lake and makes it available for analytics without redundancy.

Constellation Research's principal analyst, Michael Ni, said that the new REST API Catalog will give developers a vendor-neutral, programmable way to interact with metadata services, making it easier to integrate BigLake with custom workflows, CI/CD pipelines, and non-Google analytics engines.

According to SanjMo’s chief analyst, Sanjeev Mohan, REST API Catalog could help enterprises reduce costs.

“For example, typically if an enterprise wanted to use Snowflake data for analytics, then it had to use Snowflake’s compute engine. But now, a developer can bring open source DuckDB to analyze Snowflake data using the Iceberg REST API Catalog and reduce cost,” Mohan said.

The REST API Catalog is currently in preview.
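
To illustrate what that vendor-neutral access can look like, the following is a minimal sketch that connects an external engine to an Iceberg REST catalog using the PyIceberg library. The endpoint URL, token, and table identifier are placeholders, and the exact connection properties required by BigLake metastore may differ from what is shown here.

# Minimal sketch: query an Iceberg REST catalog from an external engine.
# The endpoint, token, and table names below are hypothetical placeholders.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "biglake",
    **{
        "type": "rest",
        "uri": "https://example-catalog.example.com/iceberg/v1",  # placeholder endpoint
        "token": "YOUR_OAUTH_TOKEN",  # authentication is deployment-specific
    },
)

# Browse namespaces and load a table through the vendor-neutral REST interface.
print(catalog.list_namespaces())
table = catalog.load_table("analytics.orders")  # placeholder namespace.table

# Scan a sample of rows into an Arrow table for local analysis
# (for example, with DuckDB or pandas).
arrow_table = table.scan(limit=100).to_arrow()
print(arrow_table.num_rows)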

BigQuery updates target SQL performance

The second set of updates, which focuses on improving SQL analytics performance, was added to BigQuery, a managed data warehouse offering from Google.

These updates, which are largely automated SQL engine enhancements, include the BigQuery advanced runtime, a low-latency query API, column metadata indexing, and an order-of-magnitude speedup for fine-grained updates and deletes, which is in preview.

The BigQuery advanced runtime, which is currently in preview, can automatically accelerate analytical workloads, using enhanced vectorization and short query optimized mode, without requiring any user action or code changes, Google wrote in a blog post.

To increase query efficiency further, Google has added the BigQuery column metadata index (CMETA), now generally available, which helps process queries on large tables through more efficient, system-managed data pruning.
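
To make the fine-grained DML point concrete, here is a minimal sketch that issues a targeted update against a BigQuery table with the google-cloud-bigquery Python client. The project, dataset, and table names are placeholders; the advanced runtime and CMETA optimizations described above are applied automatically on the server side, so no client-side change is involved.

# Minimal sketch: a fine-grained update against a BigQuery table.
# Project, dataset, and table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project ID

# Update a small slice of rows in a large table using a parameterized DML statement.
dml = """
UPDATE `my-project.sales.orders`
SET status = 'shipped'
WHERE order_id = @order_id
"""
job = client.query(
    dml,
    job_config=bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("order_id", "INT64", 12345)]
    ),
)
job.result()  # wait for the DML statement to finish
print(f"Rows affected: {job.num_dml_affected_rows}")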

However, Mohan pointed out that Microsoft and AWS already have similar offerings.

Microsoft Fabric uses proprietary technologies called VertiPaq and VertiParquet, which rely on an in-memory columnar version of OneLake data to speed up interactive Direct Lake queries from Power BI clients, while AWS has continued to introduce enhancements such as vectorized scans, dictionary encoding for string columns, and automatic table optimization in Redshift, Mohan said.

Microsoft also recently introduced metadata mirroring in OneLake via Fabric to speed up Databricks queries.

In addition, Google has added a new Lightning Engine, currently in preview, to improve the performance of its Apache Spark module.

“The Lightning Engine accelerates Apache Spark performance through highly optimized data connectors for Cloud Storage and BigQuery storage, efficient columnar shuffle operations, and intelligent in-built caching mechanisms,” Google wrote in a blog post.

Google expects the Lightning Engine to deliver 3.6x faster Apache Spark performance than its predecessor.
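
For context, a typical PySpark job on Google Cloud exercises exactly the paths the Lightning Engine targets: the Cloud Storage and BigQuery connectors. The sketch below is a generic example with placeholder table, bucket, and connector-version values; enabling the Lightning Engine itself is an environment-level setting on the Spark service, not something expressed in this code.

# Minimal sketch: a Spark job using the BigQuery and Cloud Storage connectors.
# Table, bucket, and connector version are placeholders; pin versions for your environment.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lakehouse-sketch")
    .config(
        "spark.jars.packages",
        "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.36.1",
    )
    .getOrCreate()
)

# Read a BigQuery table through the Spark BigQuery connector (placeholder table).
orders = (
    spark.read.format("bigquery")
    .option("table", "my-project.sales.orders")
    .load()
)

# Aggregate and write the result back to Cloud Storage (placeholder bucket).
daily = orders.groupBy("order_date").count()
daily.write.mode("overwrite").parquet("gs://my-bucket/daily_order_counts/")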

Amalgam Insights' chief analyst Hyoun Park pointed out that the performance enhancements made to the lakehouse components are Google's way of addressing a core issue with using Apache Iceberg as the table format or running Spark in a lakehouse, especially in the age of AI.

“AI fundamentally requires faster access to more distributed data sources and appropriate context. Although Iceberg is treated as a data lake standard, it often struggles with the small changes, metadata updates, and transactional volumes needed to be a highly performant solution,” Park said.

“Google’s steps are in the direction of what it thinks should matter to enterprises when it comes to improving performance,” Park added.

Embedding Gemini's capabilities in Dataplex Universal Catalog

Google is also embedding Gemini's capabilities into the Dataplex Universal Catalog.

The new AI-based capabilities of the catalog, according to Google, will help enterprises get their data ready without any manual labor by automatically discovering and organizing metadata.

Automating metadata discovery and management will help enterprises increase the accuracy of AI and agentic applications, Mohan said.

“Models need real-time data, and they need a catalog to look up semantics so they can reduce hallucinations. Dataplex Universal Catalog embedded in the lakehouse effectively serves that purpose,” Mohan added.
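
To make that lookup concrete, here is a rough sketch of an application searching the catalog with the google-cloud-dataplex Python client. The project, location, and query string are placeholders, and the exact client surface may vary by library version, so treat this as an assumption rather than a verified recipe; the Gemini-driven enrichment itself happens inside the catalog, not in client code like this.

# Rough sketch: search the Dataplex catalog for entries matching a business term.
# Scope and query are placeholders; the client API surface is assumed here.
from google.cloud import dataplex_v1

client = dataplex_v1.CatalogServiceClient()

request = dataplex_v1.SearchEntriesRequest(
    name="projects/my-project/locations/global",  # placeholder search scope
    query="customer orders",                      # placeholder search term
)
for result in client.search_entries(request=request):
    # Each result describes a matching catalog entry; print it for inspection.
    print(result)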
https://www.infoworld.com/article/3998047/google-clouds-biglake-driven-lakehouse-updates-aim-to-opti...
