Datazone

Docs

Concepts

Pipeline

The Pipeline entity represents a series of data processing steps organized into a single workflow within the data platform. It typically consists of a sequence of transformations, data movements, and other processing tasks that together accomplish a specific data management goal. Pipelines orchestrate the flow of data from source to destination and ensure that each step runs in the correct order.

Properties

ID: A unique identifier for the pipeline.
Name: A descriptive name for the pipeline, indicating its purpose or the type of data processing it performs.
Schedule: (Optional) The scheduling details (e.g., frequency, time), present only if the pipeline is set to run automatically.
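
As an illustration only, the three properties above could be modeled like this. This is a minimal sketch in plain Python, not Datazone's actual data model: the class names `Pipeline` and `Schedule`, the field names `frequency` and `time`, the string formats, and the `is_scheduled` helper are all assumptions made for the example.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Schedule:
    """Hypothetical schedule details; the field formats are assumptions."""
    frequency: str  # e.g. "daily"
    time: str       # e.g. "02:00"


@dataclass
class Pipeline:
    """Hypothetical model of the properties listed above."""
    id: str                               # unique identifier
    name: str                             # descriptive name
    schedule: Optional[Schedule] = None   # optional: None means run manually

    @property
    def is_scheduled(self) -> bool:
        # A pipeline runs automatically only when a schedule is attached.
        return self.schedule is not None


# A pipeline without a schedule, and one that is scheduled.
manual = Pipeline(id=str(uuid.uuid4()), name="orders_cleanup")
nightly = Pipeline(
    id=str(uuid.uuid4()),
    name="orders_daily_load",
    schedule=Schedule(frequency="daily", time="02:00"),
)
```

Defaulting `schedule` to `None` mirrors the "(Optional)" marker on the Schedule property: a pipeline is valid without one.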

Usage

Data Processing Workflows: Pipelines automate and manage complex workflows involving multiple steps of data processing.
Error Handling and Recovery: Pipelines include mechanisms to handle failures in individual steps and provide options for recovery and reruns.
Monitoring and Optimization: Pipelines are monitored for performance and can be optimized for efficiency, speed, and resource utilization.
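
The error handling and rerun behavior described above can be sketched generically. The example below is not how Datazone implements recovery. It is a small plain-Python illustration in which `run_pipeline`, `StepFailed`, and the `start_at` parameter are hypothetical names. It runs steps in order, reports which step failed along with the last good intermediate result, and reruns from the failed step instead of from the beginning.

```python
from typing import Any, Callable, List, Tuple

# A step is a (name, function) pair; each function takes the previous output.
Step = Tuple[str, Callable[[Any], Any]]


class StepFailed(Exception):
    """Raised when a step fails; carries what is needed to resume."""

    def __init__(self, step_name: str, step_index: int, last_good: Any, cause: Exception):
        super().__init__(f"step '{step_name}' failed: {cause}")
        self.step_name = step_name
        self.step_index = step_index
        self.last_good = last_good  # output of the last step that succeeded
        self.cause = cause


def run_pipeline(steps: List[Step], data: Any, start_at: int = 0) -> Any:
    """Run steps in order, beginning at index `start_at`."""
    for index in range(start_at, len(steps)):
        name, func = steps[index]
        try:
            data = func(data)
        except Exception as exc:
            # `data` still holds the input to the failed step.
            raise StepFailed(name, index, data, exc) from exc
    return data


attempts = {"count": 0}


def flaky_double(rows: List[int]) -> List[int]:
    """Fails on the first call to simulate a transient error."""
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise RuntimeError("transient failure")
    return [r * 2 for r in rows]


steps: List[Step] = [
    ("drop_negatives", lambda rows: [r for r in rows if r >= 0]),
    ("double", flaky_double),
    ("total", sum),
]

try:
    result = run_pipeline(steps, [3, -1, 4])
except StepFailed as failure:
    # Recovery: rerun from the failed step using the last good output.
    result = run_pipeline(steps, failure.last_good, start_at=failure.step_index)

print(result)  # 14
```

Resuming from the failed step avoids repeating work that already succeeded, which matters when the earlier steps are expensive.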

Best Practices

Modular Design: Design pipeline steps to be modular and reusable, facilitating maintenance and scalability.
Documentation: Maintain clear documentation for each pipeline step, including its purpose, input, output, and any special considerations.
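
One way to apply both practices is to write each step as a small, documented function and assemble pipelines from those pieces. The sketch below uses plain Python lists of dictionaries rather than PySpark DataFrames so that it runs anywhere. The helpers `drop_missing`, `rename`, and `compose` are hypothetical and not part of any Datazone API.

```python
from functools import reduce
from typing import Callable, Dict, List

Row = Dict[str, object]
Step = Callable[[List[Row]], List[Row]]


def drop_missing(field: str) -> Step:
    """Build a step that removes rows with no value for `field`.

    Input: list of rows. Output: rows where `field` is present and not None.
    """
    def step(rows: List[Row]) -> List[Row]:
        return [row for row in rows if row.get(field) is not None]

    step.__doc__ = f"Remove rows where '{field}' is missing."
    return step


def rename(old: str, new: str) -> Step:
    """Build a step that renames the key `old` to `new` in every row.

    Input: list of rows. Output: same rows with the key renamed.
    """
    def step(rows: List[Row]) -> List[Row]:
        return [{(new if key == old else key): value for key, value in row.items()}
                for row in rows]

    step.__doc__ = f"Rename '{old}' to '{new}'."
    return step


def compose(*steps: Step) -> Step:
    """Chain steps so that each one receives the previous step's output."""
    def pipeline(rows: List[Row]) -> List[Row]:
        return reduce(lambda acc, step: step(acc), steps, rows)

    return pipeline


# The same building block (drop_missing) is reused in two pipelines.
order_steps = [drop_missing("amount"), rename("amt_cur", "currency")]
clean_orders = compose(*order_steps)
clean_users = compose(drop_missing("email"))

orders = [
    {"amount": 10, "amt_cur": "EUR"},
    {"amount": None, "amt_cur": "USD"},
]
cleaned = clean_orders(orders)

# Each step documents itself, so a summary can be generated from the code.
summary = [step.__doc__ for step in order_steps]
```

Because every step has the same input and output shape, steps can be reordered, tested in isolation, or shared between pipelines without changing the others.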

© Copyright 2024. All rights reserved.
