Engineering Blog
5 min read
November 6, 2024
At Datazone, we believe in making complex things simple, but not simplistic. Today, we want to share how our platform handles DynamoDB streams with enterprise-grade reliability while keeping the developer experience straightforward and pythonic.
The Simple Path to Streaming
Here's how simple it is to set up a DynamoDB stream processor:
from datazone import transform, Input, Output, Stream, Dataset
from datazone.sources import DynamoDBSource

@transform(
    input_mapping={
        "orders_stream": Input(
            Stream(
                source=DynamoDBSource(
                    table_name="orders",
                    region="us-west-2",
                    stream_enabled=True
                )
            )
        )
    },
    output_mapping={
        "processed_orders": Output(
            Dataset(
                id="processed_orders",
                tags=["orders", "real-time"]
            )
        )
    }
)
def process_orders_stream(orders_stream):
    # The stream data comes in with the standard DynamoDB stream format
    return orders_stream.select(
        "eventID",
        "eventName",
        "dynamodb.NewImage.orderId.S as order_id",
        "dynamodb.NewImage.customerEmail.S as customer_email",
        "dynamodb.NewImage.orderTotal.N as order_total",
        "dynamodb.NewImage.status.S as status"
    )
That's it. No need to manage checkpoints, handle retries, or write complex merge logic. But don't let the simplicity fool you - there's a lot happening under the hood.
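For reference, each record arrives in the standard DynamoDB Streams event shape, with every attribute wrapped in a type descriptor such as S (string) or N (number); that's why the select above reads paths like dynamodb.NewImage.orderTotal.N. A representative record for the orders table looks roughly like this (all values are illustrative):

# A representative DynamoDB Streams record (illustrative values).
# Attribute values carry type descriptors: S = string, N = number.
sample_record = {
    "eventID": "7b4a-example",
    "eventName": "INSERT",  # or MODIFY / REMOVE
    "dynamodb": {
        "ApproximateCreationDateTime": 1730851200,
        "SequenceNumber": "111111111111111111111",
        "Keys": {"orderId": {"S": "ORD-1001"}},
        "NewImage": {
            "orderId": {"S": "ORD-1001"},
            "customerEmail": {"S": "jane@example.com"},
            "orderTotal": {"N": "249.99"},
            "status": {"S": "CONFIRMED"},
        },
    },
}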
What's Really Happening Under the Hood
Smart Delta Lake Integration
We've built our streaming infrastructure on top of Delta Lake, bringing ACID transactions to your streaming workloads. When a DynamoDB change arrives, we don't simply append it to your target table. Instead, we:
Track the event timestamp and sequence for each record
Automatically detect schema changes from your DynamoDB table
Perform intelligent merge operations that preserve your data's timeline
Optimize the underlying Delta Lake tables for read and write performance
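To make the merge step concrete, here is a minimal sketch of the kind of idempotent upsert this involves, written against the open-source Delta Lake API on PySpark. It illustrates the idea rather than our exact internals; the table path and the event_timestamp column are assumptions based on the example above.

# Sketch only: upsert a micro-batch of stream changes into a Delta table,
# keyed by order_id. Not Datazone's exact implementation.
from delta.tables import DeltaTable

def merge_batch(spark, changes_df, target_path="/lake/processed_orders"):
    target = DeltaTable.forPath(spark, target_path)
    (
        target.alias("t")
        .merge(changes_df.alias("s"), "t.order_id = s.order_id")
        # Only apply a change if it is newer than what is already stored,
        # so late or replayed events cannot rewind a row.
        .whenMatchedUpdateAll(condition="s.event_timestamp > t.event_timestamp")
        .whenNotMatchedInsertAll()
        .execute()
    )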
State Management Done Right
Stream processing is stateful by nature, and state management is crucial for reliability. Our system:
Maintains distributed checkpoints with exactly-once processing guarantees
Automatically recovers from failures by resuming from the last successful checkpoint
Handles backpressure adaptively to prevent memory overflow
Scales processing across multiple nodes when needed
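Conceptually, the checkpointing behaves like a per-shard cursor: the processor records the last sequence number it has durably committed and, after a failure, resumes strictly after that point. The sketch below is a deliberately simplified illustration of that idea; the storage backend and helper names are placeholders, not our actual implementation, and in production the checkpoint commit happens atomically with the output write to get exactly-once results.

# Simplified sketch of checkpoint-and-resume per stream shard.
import json
from pathlib import Path

class CheckpointStore:
    def __init__(self, path="checkpoints.json"):
        self.path = Path(path)
        self.state = json.loads(self.path.read_text()) if self.path.exists() else {}

    def last_sequence(self, shard_id):
        return self.state.get(shard_id)  # None means start from the beginning

    def commit(self, shard_id, sequence_number):
        # In production this commit must be atomic with the output write;
        # here it is simply persisted to a local file for illustration.
        self.state[shard_id] = sequence_number
        self.path.write_text(json.dumps(self.state))

def process_shard(shard_id, read_records, handle_batch, store):
    # read_records and handle_batch are hypothetical callables standing in
    # for the stream reader and the user transform.
    for batch in read_records(shard_id, after=store.last_sequence(shard_id)):
        handle_batch(batch)
        store.commit(shard_id, batch[-1]["dynamodb"]["SequenceNumber"])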
Smart Batching and Merging
Rather than processing each record individually (expensive) or in fixed-size batches (inefficient), we:
Dynamically adjust batch sizes based on processing lag and system load
Group similar operations to optimize Delta Lake merge performance
Maintain order guarantees while maximizing throughput
Automatically handle late-arriving data and out-of-order events
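The batching heuristic is easy to picture: grow the batch while the pipeline is falling behind (bigger batches amortize merge overhead), shrink it when the pipeline is comfortably ahead (smaller batches reduce end-to-end latency). A deliberately simplified version of that feedback loop, with illustrative thresholds rather than our production values:

# Simplified adaptive batch sizing: a feedback loop on processing lag.
def next_batch_size(current_size, lag_seconds, target_lag_seconds=60,
                    min_size=100, max_size=10_000):
    if lag_seconds > 2 * target_lag_seconds:
        # Falling behind badly: take bigger bites to catch up.
        candidate = current_size * 2
    elif lag_seconds > target_lag_seconds:
        candidate = int(current_size * 1.2)
    else:
        # Keeping up: shrink slightly to favor lower latency.
        candidate = int(current_size * 0.9)
    return max(min_size, min(max_size, candidate))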
Production-Grade Reliability
In production environments, many things can go wrong. Our system is designed to handle:
Network partitions and DynamoDB throttling
Instance failures and spot terminations
Varying load patterns and data skew
Schema evolution and data type changes
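Throttling in particular is handled with retries and exponential backoff with jitter, so a hot partition or a burst of writes degrades gracefully instead of failing the job. A minimal sketch of that retry policy; the is_throttling_error check is a placeholder, since in practice the specific DynamoDB error codes are inspected:

# Minimal retry-with-backoff sketch for throttled stream reads.
import random
import time

def with_backoff(call, is_throttling_error, max_attempts=8, base_delay=0.2):
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            if not is_throttling_error(exc) or attempt == max_attempts - 1:
                raise
            # Exponential backoff with full jitter, capped at ~30 seconds.
            delay = min(30.0, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))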
Best Practices
Through our experience helping companies process large volumes of DynamoDB changes every day, we've learned a few things:
Start With Default Configurations: Our defaults are optimized for most use cases. Tune only when you have specific requirements.
Monitor Processing Lag: Keep an eye on the lag between DynamoDB changes and processing, and set up alerts for unusual lag patterns; a simple way to measure that lag from the records themselves is sketched after this list.
Plan for Schema Evolution: DynamoDB schemas often evolve. Our system handles this automatically, but plan your schema changes thoughtfully.
Consider Partition Strategy: Think about how you'll query the processed data. Use appropriate partition keys in your target dataset.
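As promised above, here is one simple way to measure processing lag: DynamoDB Streams stamps every change with ApproximateCreationDateTime, so the age of the oldest record in a batch is a direct lag signal. The alert hook below is a placeholder for whatever metrics or paging system you already use.

# Sketch: compute processing lag from the records themselves.
import time

def processing_lag_seconds(record):
    # ApproximateCreationDateTime is treated here as epoch seconds,
    # matching the sample record shown earlier.
    created_at = record["dynamodb"]["ApproximateCreationDateTime"]
    return time.time() - created_at

def check_batch_lag(batch, alert, threshold_seconds=300):
    # Alert on the oldest record in the batch; alert() is a placeholder.
    worst = max(processing_lag_seconds(r) for r in batch)
    if worst > threshold_seconds:
        alert(f"Stream processing lag is {worst:.0f}s (threshold {threshold_seconds}s)")
    return worst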
When to Optimize
While our default settings work well for most cases, here are situations where you might want to tune:
Processing lag consistently above 5 minutes
Individual record processing taking too long
High number of concurrent streams
Complex transformations on the stream
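When you do need to tune, the adjustments live in the stream configuration rather than in your transform code. The commented parameter names below are illustrative placeholders showing the kind of knobs involved, not the exact Datazone API; check the platform documentation for the real option names.

# Illustrative only: the commented parameters are hypothetical placeholders
# for the kind of tuning knobs you might reach for, not the actual API.
tuned_stream = Stream(
    source=DynamoDBSource(
        table_name="orders",
        region="us-west-2",
        stream_enabled=True,
    ),
    # Hypothetical tuning knobs for high-volume or lag-sensitive workloads:
    # max_batch_size=5_000,           # cap records per micro-batch
    # trigger_interval="30 seconds",  # how often to poll for new changes
    # max_concurrency=4,              # shards processed in parallel per worker
)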
Conclusion
At Datazone, we believe that streaming shouldn't be complex. By handling the intricate details of stream processing internally while exposing a clean, pythonic interface, we enable developers to focus on their data transformations rather than infrastructure.
We're constantly working on making our stream processing even more robust and efficient. If you're handling DynamoDB streams, we'd love to hear about your use case and how we can help.