The definitive guide to data pipelines
Monday, August 26, 2024, 11:00, by InfoWorld
For a simplified view of data processing architectures, we can look at the structure and functions of a house. The foundation of the house consists of data management platforms that provide storage, query, transactions, security, and other fundamental data functions. Throughout the house are various appliances, including microservices, APIs, applications, analytics, machine learning models, and genAI models. These are used for searching, analyzing, and publishing data to end users and other services.
Connecting all these systems are pipes, plumbing, and filters: the data processing tools that move data from one system to another. Data processes can be relatively simple in small organizations with few data sources and appliances. Larger businesses often require a wider range of applications to meet end-user needs and different data types.

This article is a deep dive into data pipelines. You’ll learn the basics of data pipelines and the wide array of architectures and platforms used to implement them. You’ll also learn about the different business objectives supported by data pipelines and some of the newer use cases emerging with generative AI. I’ll also discuss data transformations, data ops, and the future of data pipelines.

Data pipelines: One way to move data

Moving data involves several operational functions, including data replication, data migration, and data synchronization. But when we think about data movements for business needs, three functions are most prevalent:

- Data integration involves extracting data from multiple sources and combining them for downstream usage, often using a mix of automation and manual data processing.
- Data pipelines imply automation, where data from one system is made accessible to downstream consumers, but not necessarily in real time.
- Data streams imply real-time, highly scalable, and robust data pipelines that meet target service-level objectives around performance, latency, and error rates.

So, once again, data integration, pipelines, and streams are the plumbing that lets you move and share data across systems. If your data management architecture is distributed, like a collection of condominiums in multiple buildings across different locations, then you may also need tools like data meshes and data fabrics and master data management techniques to support more robust and scalable data sharing.

Data pipeline technologies

While they take many forms, data pipelines are fundamental for automating and sharing data. They can be as simple as webhooks, APIs, pub-sub patterns, or IFTTT services, and they can scale up to incorporate more sophisticated data pipeline design patterns. Data pipelines include:

- Batch processing architectures, where data movement is not real-time, and groups of records are moved from one system to another on a fixed schedule or triggered by an event (see the sketch below).
- Event-driven architectures that provide a scalable approach connecting data producers, consumers, and transformation services.
- Lambda and Kappa architectures that combine real-time and batch processing capabilities.
- Microservices-based data pipelines, which are relatively small, can be released independently, and are usually managed by a single development team.

“Data pipelines are fundamental to any enterprise data strategy as they move, transform, and manage data that eventually become valuable reports and analytics,” says Emily Washington, SVP of product management at Precisely. “Ensuring data integrity within these pipelines is crucial, requiring efficient data integration from source to target, cleansing data where it resides, and adding attributes to ensure the data is fit for its intended use and informs decision-making processes.”

Platforms for data pipelines, integration, and streaming

Data pipeline design patterns can be deployed to serverless architectures such as AWS Lambda, Azure Functions, or Google Cloud Functions. They can be a component of data warehouses and data lakes to move and transform data or be deployed as independent services.
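Whatever the deployment target, the batch pattern described earlier is a useful starting point. Below is a minimal sketch of a scheduled batch job in Python using only the standard library; the file name, table name, and column mapping are hypothetical placeholders rather than any particular product’s API.

```python
# A minimal batch pipeline sketch: extract rows from a CSV export,
# apply a simple transformation, and load them into a local database.
# File, table, and column names are hypothetical placeholders.
import csv
import sqlite3

def extract(path: str) -> list[dict]:
    """Read the nightly CSV export produced by an upstream system."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def transform(rows: list[dict]) -> list[tuple]:
    """Normalize fields and drop records missing a primary key."""
    cleaned = []
    for row in rows:
        if not row.get("order_id"):
            continue  # skip malformed records rather than halting the batch
        cleaned.append((
            row["order_id"].strip(),
            row.get("customer_email", "").strip().lower(),
            float(row.get("amount", 0) or 0),
        ))
    return cleaned

def load(records: list[tuple], db_path: str = "warehouse.db") -> None:
    """Upsert the batch into a reporting table."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS orders "
            "(order_id TEXT PRIMARY KEY, customer_email TEXT, amount REAL)"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", records
        )

if __name__ == "__main__":
    # Typically triggered by a scheduler, orchestrator, or serverless function.
    load(transform(extract("orders_export.csv")))
```

In practice, an orchestrator such as Apache Airflow or a serverless function would wrap steps like these and handle scheduling, retries, and alerting.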
Developers can code data pipelines in virtually any language, though data scientists and engineers typically use Python. There are several platforms and many products for developing pipelines, integrations, and streams:

- Data pipelines connecting SaaS platforms can be built in if-this-then-that and other data automation platforms such as IFTTT, Integrately, Make (formerly Integromat), Microsoft Power Automate, Quickbase, Workato, Tray.io, and Zoho Flow. The pipelines created in these tools generally connect one source to one destination and offer common data transformation capabilities.
- Top data integration platforms on the Gartner Magic Quadrant include Ab Initio, AWS, Denodo, Fivetran, Google, IBM, Informatica, K2view, Oracle, Matillion, Microsoft, Palantir, Precisely, Qlik, SAP, SnapLogic, Talend, and Tibco.
- Data pipeline platforms include Actian, Apache Airflow, Ascend.io, Astera, Astronomer, AWS Glue, CData, Databricks, Dremio, dbt Labs, Hevo, Integrate.io, Nexla, Peliqan, Prophecy, Rivery, Skyvia, Stitch, Stonebranch, and StreamSets.
- Data pipelines are also a function of integration platforms as a service (iPaaS), and Gartner’s 2024 Magic Quadrant includes platforms from Boomi, Celigo, Informatica, Jitterbit, Microsoft, Oracle, Salesforce, SAP, SnapLogic, Software AG, Tray.io, and Workato.
- Data streaming platforms include Apache Flink, Apache Kafka, Apache Pulsar, Apache Storm, AWS Kinesis, Ataccama, Azure Stream Analytics, Cloudera, Confluent, DataStax, Google Cloud Dataflow, Hazelcast, Pravega, Red Hat, Redpanda, Redis, Spark Structured Streaming, StreamNative, and Tibco.

Data integration and pipeline capabilities are also built into many databases, data warehouses, data lakes, and AI/ML workflow platforms.

“Building data pipelines is a critical aspect of modern data management, but this can be complex as there are many technologies and architectural and design patterns,” says Sunil Kalra, head of data engineering at LatentView Analytics. “As data volumes grow, efficient data pipelines become increasingly important.”

Pipelines support different business objectives

Basic data pipelines are needed whenever information is shared across multiple systems of record. For example, an employee onboarding workflow often requires setting up new employees in HR, financial, IT, and other systems. While some user information may be stored in a directory such as Microsoft Entra ID (formerly Azure Active Directory), each system of record requires some common user data to set up new employees.

Data pipelines are one way to trigger workflow and data sharing between these systems, and the most basic pipelines push one record of information from a system of record to others with minimal data transformations (a sketch of this pattern appears below). More sophisticated data integration platforms can join data from multiple sources, perform complex multi-record data transformations, and connect to multiple downstream systems in one data pipeline.

Beyond these basic pipelines are many other business use cases that orchestrate complex workflows, enable data science activities, and process IoT sensor data. Emerging technologies, including genAI, computer vision, and AR/VR, dramatically scale data pipeline complexities. IT and data teams must consider current and future business needs as part of their data management strategies, along with how they will develop and support a growing number of data pipelines.
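Returning to the employee onboarding example, the sketch below pushes one new-employee record from a system of record to downstream systems with a light field mapping. It assumes the third-party requests library, and the endpoints, field names, and record shape are hypothetical; a real integration would use each system’s documented API, authentication, and error handling.

```python
# Minimal sketch: push one employee record from a system of record to
# downstream systems with light field mapping. Endpoints and field names
# are hypothetical; real systems require authentication and retries.
import requests

# Hypothetical downstream endpoints and their source-to-target field mappings.
DOWNSTREAM_SYSTEMS = {
    "https://hr.example.com/api/employees": {
        "employee_id": "id", "full_name": "name", "start_date": "startDate",
    },
    "https://itsm.example.com/api/accounts": {
        "employee_id": "externalId", "work_email": "email",
    },
}

def push_employee(record: dict) -> None:
    """Map the source record to each target schema and POST it downstream."""
    for url, field_map in DOWNSTREAM_SYSTEMS.items():
        payload = {target: record[source] for source, target in field_map.items()}
        response = requests.post(url, json=payload, timeout=10)
        response.raise_for_status()  # surface failures instead of silently dropping data

if __name__ == "__main__":
    push_employee({
        "employee_id": "E1024",
        "full_name": "Ada Example",
        "work_email": "ada@example.com",
        "start_date": "2024-09-02",
    })
```

In practice, the trigger would usually be a webhook or change-data-capture event from the HR system rather than a hard-coded record.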
Data scientists, for example, are both data pipeline consumers and producers. “Data scientists spend weeks or months curating data to bring it to the form relevant for machine learning,” says Hema Raghavan, VP of engineering at Kumo. “An example could be manipulating application page view and click logs to extract the fields needed for the data scientist or resolving product names across such events collected by siloed engineering organizations.”

Similarly, devops teams create data pipelines to understand application health, diagnose performance issues, and troubleshoot errors. “A critical subset of data pipelines are telemetry pipelines that capture data types ranging from logs, metrics, traces, alerts, events, profiles, and others from IT ops, devops, and secops environments,” says Ranjan Parthasarathy, chief product and technology officer at Apica. “Telemetry pipelines allow data normalization, improve quality, reduce clutter, enable context, and provide on-demand availability of data where it’s needed the most, resulting in significant cost savings.”

Generative AI use cases for data pipelines

Beyond workflow, devops, and data science use cases are new genAI user experiences. Data pipelines are needed to connect vector databases, data lakes, and large language models (LLMs) to support retrieval-augmented generation (RAG). These pipelines essentially connect enterprise data with genAI capabilities.

“While everybody wants AI to streamline processes and boost productivity, those benefits can’t be realized without quality data pipelines connecting information, workflows, teams, and projects,” says Jon Kennedy, SVP of engineering at Quickbase. “This puts an even bigger premium on understanding the source of the data, verifying its integrity, and knowing how it changes as it is used throughout the organization.”

Organizations are adding data sources and analytics capabilities to support machine learning and AI. The implication is that the underlying data pipelines must enable the full development, testing, deployment, monitoring, and retraining of machine learning models (MLOps) while adhering to data and AI governance models.

“GenAI pipelines involve creating and orchestrating data engineering steps, but more importantly, they require embedding models, vector stores, prompt engineering steps, upstream predictive AI models, downstream LLMs, and integration with downstream systems,” says Kjell Carlsson, head of data science strategy and evangelism at Domino. “At a minimum, companies will need to integrate their data pipeline capabilities with new data stores, MLOps, and ML governance capabilities.”

The added scope and business demand for robust data pipelines will require large organizations to consider how to scale the process for developing and updating data pipelines. Taylor McGrath, VP of solutions engineering at Rivery, suggests: “To succeed with such volume and prevent bottlenecks, centralized data platform teams should find the right balance between enabling decentralized teams to build their own pipelines while maintaining the right governance over data access, cloud computing consumption usage, and the health of the executed data pipelines.”

Implementing transformations in data pipelines

The guts of data pipelines are the data transformations required to translate data from source systems to the requirements of downstream systems. Simple transformations map, combine, and cleanse single records for the pipeline’s consumers.
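As a small illustration of this kind of single-record transformation, the sketch below renames source fields to a consumer’s schema and cleanses a few values. The source and target field names and the date format are hypothetical.

```python
# Sketch of a single-record map-and-cleanse transformation.
# Source and target field names are hypothetical.
from datetime import datetime

FIELD_MAP = {"cust_nm": "customer_name", "eml": "email", "sgnup_dt": "signup_date"}

def transform_record(source: dict) -> dict:
    """Rename fields, trim and normalize values, and standardize dates."""
    target = {new: source.get(old, "") for old, new in FIELD_MAP.items()}
    target["customer_name"] = target["customer_name"].strip().title()
    target["email"] = target["email"].strip().lower()
    # Normalize a US-style date to ISO 8601 for downstream consumers.
    if target["signup_date"]:
        target["signup_date"] = (
            datetime.strptime(target["signup_date"], "%m/%d/%Y").date().isoformat()
        )
    return target

print(transform_record(
    {"cust_nm": " ada lovelace ", "eml": "ADA@Example.COM", "sgnup_dt": "08/26/2024"}
))
# {'customer_name': 'Ada Lovelace', 'email': 'ada@example.com', 'signup_date': '2024-08-26'}
```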
More complex transformations include aggregating, joining, summarizing, and enriching groups of records, documents, and other data types.

“Traditionally, data warehouses have been filled through extract-transform-load (ETL) processes: extracting raw data from sources, transforming it, and then storing it,” says Giovanni Lanzani, managing director at Xebia Data. “Once storage and processing costs dropped, data teams started storing raw data in the data warehouse before transforming it (ELT), increasing the flexibility to create new insights.”

Julian LaNeve, CTO of Astronomer, adds, “ETL is suitable for scenarios requiring pre-processed data for analysis, while ELT leverages the processing power of modern systems like data lakes or cloud-based data warehouses and can handle larger volumes of data more efficiently.”

ETL and ELT transformations are terms generally used for data pipelines that load data into data warehouses and data lakes. In-transit and streaming data transformations are terms used when data pipelines or streams transform data in their process flow without storing the resulting data. Use cases include real-time analytics, IoT data streams, credit card transaction processing, and fraud detection. Data transformations include filtering, aggregation, windowing, enrichment, and anomaly detection.

A key data pipeline capability is to track data lineage, including methodologies and tools that expose data’s life cycle and help answer questions about who, when, where, why, and how data changes. Data pipelines transform data, which is part of the data lineage’s scope, and tracking data changes is crucial in regulated industries or when human safety is a consideration.

Platforms that have data lineage capabilities include Alex Solutions, Alation, Atlan, Boomi, Collibra, Erwin, IBM, Informatica, Manta, Microsoft, Octopai, Oracle, Precisely, Secoda, Solidatus, SAP, SAS, and Talend. Other data catalog, data governance, and AI governance platforms may also have data lineage capabilities.

“Business and technical stakeholders must equally understand how data flows, transforms, and is used across sources with end-to-end lineage for deeper impact analysis, improved regulatory compliance, and more trusted analytics,” says Felix Van de Maele, CEO of Collibra.

The data ops behind data pipelines

When you deploy pipelines, how do you know whether they receive, transform, and send data accurately? Are data errors captured, and do single-record data issues halt the pipeline? Are the pipelines performing consistently, especially under heavy load? Are transformations idempotent, or are they streaming duplicate records when data sources have transmission errors? These are just some of the dataops issues that occur in data pipelines.

“Operations for AI workflows can be especially challenging because a series of data flows often feed data into the next one,” says Raghavan of Kumo. “Data corruption in even one flow can result in compound effects in downstream pipelines.”

Data pipelines used to support ML models and genAI LLMs have greater performance and quality concerns because of the data scales required and user expectations around model performance. “Resiliency, recoverability, and reproducibility of generative AI pipelines and the unstructured, structured, and semi-structured data sets used for their model training makes data governance more complex at AI scales,” adds Colleen Tartow, field CTO and head of strategy at VAST Data.
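To ground the idempotency and error-capture questions above, here is a minimal sketch of a consumer that skips redelivered records by key and sets malformed records aside instead of halting the pipeline. The record shape, key field, and in-memory state are hypothetical; a production consumer would keep this state in durable storage.

```python
# Sketch of an idempotent consumer: redelivered records are skipped by key,
# and malformed records are captured rather than halting the pipeline.
# The record shape and key field are hypothetical.
processed_keys: set[str] = set()   # in production this would be durable state
dead_letter: list[dict] = []       # records set aside for later inspection

def apply_to_target(record: dict) -> None:
    """Hypothetical downstream write."""
    print(f"writing {record['transaction_id']} for {record['amount']}")

def consume(record: dict) -> None:
    key = record.get("transaction_id")
    if not key or "amount" not in record:
        dead_letter.append(record)   # capture the error, keep the pipeline moving
        return
    if key in processed_keys:
        return                       # duplicate delivery: processing stays idempotent
    processed_keys.add(key)
    apply_to_target(record)

# Simulate a transmission error that redelivers the same record.
for r in [{"transaction_id": "t1", "amount": 42.0},
          {"transaction_id": "t1", "amount": 42.0},   # duplicate, skipped
          {"amount": 13.0}]:                          # missing key, dead-lettered
    consume(r)
```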
Key approaches to improve dataops include ensuring data pipeline observability, using monitoring tools to alert on performance issues, tracking data quality, and monitoring ML models for data drift in ModelOps. Data observability technologies include Acceldata, Apica, Cribl, DataKitchen, IBM Databand, Metaplane, Monte Carlo, Sifflet, Soda, Unravel, and Validio.

“The monitoring of data as it moves through pipelines is critical because of its impact on the quality of data used for analytics and AI initiatives,” says Washington of Precisely. “Data observability looks at real-time information and enables analysts to trust the data they use immediately. Implementing observability of data through pipelines helps prevent business disruption and costly downstream data and analytics issues because it can proactively alert users to data anomalies and outliers.”

One of the more challenging aspects of dataops is detecting and quickly fixing data pipeline issues resulting from changes in APIs and data source schemas. GenAI is emerging as a dataops and data engineering platform to simplify data pipeline development and support.

“By combining the scriptability of data pipelines with a language model’s ability to generate code, you get dynamic, self-updating ETL processes,” says Mike Finley, CTO and co-founder of AnswerRocket. “Using the language model’s ability to understand and correct errors, those ETLs can be self-healing when disruptions like a typical schema change or numeric overflow would have previously crippled the pipe.”

The future of data pipelines

As most companies increase investments in analytics and AI capabilities, there will be a growing need to integrate new data sets and create data pipelines connecting data across platforms. The scale and variety of data, new AI capabilities, and emerging end-user experiences virtually guarantee that IT and data engineering teams will need to evolve their data management and integration strategies.

“Data pipelines serve as the foundation of modern data management, ensuring smooth data flow from source to destination,” says Ashwin Rajeeva, CTO of Acceldata.

Returning to the analogy I opened with, businesses are more like villages and cities of homes, with pipelines serving as the backbone for delivering clean water and removing waste. Businesses will have an ongoing need to monitor and improve their existing data pipelines while developing new ones for areas of expansion.
https://www.infoworld.com/article/3487711/the-definitive-guide-to-data-pipelines.html