Engineering Blog
5 min read
November 6, 2024
At Datazone, we believe in making complex things simple, but not simplistic. Today, we want to share how our platform handles DynamoDB streams with enterprise-grade reliability while keeping the developer experience straightforward and pythonic.
The Simple Path to Streaming
Here's how simple it is to set up a DynamoDB stream processor:
from datazone import transform, Input, Output, Stream, Dataset
from datazone.sources import DynamoDBSource

@transform(
    input_mapping={
        "orders_stream": Input(
            Stream(
                source=DynamoDBSource(
                    table_name="orders",
                    region="us-west-2",
                    stream_enabled=True
                )
            )
        )
    },
    output_mapping={
        "processed_orders": Output(
            Dataset(
                id="processed_orders",
                tags=["orders", "real-time"]
            )
        )
    }
)
def process_orders_stream(orders_stream):
    # The stream data comes in with the standard DynamoDB stream format
    return orders_stream.select(
        "eventID",
        "eventName",
        "dynamodb.NewImage.orderId.S as order_id",
        "dynamodb.NewImage.customerEmail.S as customer_email",
        "dynamodb.NewImage.orderTotal.N as order_total",
        "dynamodb.NewImage.status.S as status"
    )
That's it. No need to manage checkpoints, handle retries, or write complex merge logic. But don't let the simplicity fool you - there's a lot happening under the hood.
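For reference, each record arrives in the standard DynamoDB Streams event shape, with every attribute wrapped in a type descriptor such as S (string) or N (number); that's why the select above reads paths like dynamodb.NewImage.orderTotal.N. A representative record for the orders table looks roughly like this (all values are illustrative):

# A representative DynamoDB Streams record (illustrative values).
# Attribute values carry type descriptors: S = string, N = number.
sample_record = {
    "eventID": "7b4a-example",
    "eventName": "INSERT",  # or MODIFY / REMOVE
    "dynamodb": {
        "ApproximateCreationDateTime": 1730851200,
        "SequenceNumber": "111111111111111111111",
        "Keys": {"orderId": {"S": "ORD-1001"}},
        "NewImage": {
            "orderId": {"S": "ORD-1001"},
            "customerEmail": {"S": "jane@example.com"},
            "orderTotal": {"N": "249.99"},
            "status": {"S": "CONFIRMED"},
        },
    },
}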
What's Really Happening Under the Hood
Smart Delta Lake Integration
We've built our streaming infrastructure on top of Delta Lake, bringing ACID transactions to your streaming workloads. When a DynamoDB change arrives, we don't simply append it to your target table. Instead, we:
Track the event timestamp and sequence for each record
Automatically detect schema changes from your DynamoDB table
Perform intelligent merge operations that preserve your data's timeline
Optimize the underlying Delta Lake tables for read and write performance
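To make the merge step concrete, here is a minimal sketch of the kind of idempotent upsert this involves, written against the open-source Delta Lake API on PySpark. It illustrates the idea rather than our exact internals; the table path and the event_timestamp column are assumptions based on the example above.

# Sketch only: upsert a micro-batch of stream changes into a Delta table,
# keyed by order_id. Not Datazone's exact implementation.
from delta.tables import DeltaTable

def merge_batch(spark, changes_df, target_path="/lake/processed_orders"):
    target = DeltaTable.forPath(spark, target_path)
    (
        target.alias("t")
        .merge(changes_df.alias("s"), "t.order_id = s.order_id")
        # Only apply a change if it is newer than what is already stored,
        # so late or replayed events cannot rewind a row.
        .whenMatchedUpdateAll(condition="s.event_timestamp > t.event_timestamp")
        .whenNotMatchedInsertAll()
        .execute()
    )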
State Management Done Right
Stream processing is stateful by nature, and state management is crucial for reliability. Our system:
Maintains distributed checkpoints with exactly-once processing guarantees
Automatically recovers from failures by resuming from the last successful checkpoint
Handles backpressure adaptively to prevent memory overflow
Scales processing across multiple nodes when needed
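Conceptually, the checkpointing behaves like a per-shard cursor: the processor records the last sequence number it has durably committed and, after a failure, resumes strictly after that point. The sketch below is a deliberately simplified illustration of that idea; the storage backend and helper names are placeholders, not our actual implementation, and in production the checkpoint commit happens atomically with the output write to get exactly-once results.

# Simplified sketch of checkpoint-and-resume per stream shard.
import json
from pathlib import Path

class CheckpointStore:
    def __init__(self, path="checkpoints.json"):
        self.path = Path(path)
        self.state = json.loads(self.path.read_text()) if self.path.exists() else {}

    def last_sequence(self, shard_id):
        return self.state.get(shard_id)  # None means start from the beginning

    def commit(self, shard_id, sequence_number):
        # In production this commit must be atomic with the output write;
        # here it is simply persisted to a local file for illustration.
        self.state[shard_id] = sequence_number
        self.path.write_text(json.dumps(self.state))

def process_shard(shard_id, read_records, handle_batch, store):
    # read_records and handle_batch are hypothetical callables standing in
    # for the stream reader and the user transform.
    for batch in read_records(shard_id, after=store.last_sequence(shard_id)):
        handle_batch(batch)
        store.commit(shard_id, batch[-1]["dynamodb"]["SequenceNumber"])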
Smart Batching and Merging
Rather than processing each record individually (expensive) or in fixed-size batches (inefficient), we:
Dynamically adjust batch sizes based on processing lag and system load
Group similar operations to optimize Delta Lake merge performance
Maintain order guarantees while maximizing throughput
Automatically handle late-arriving data and out-of-order events
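The batching heuristic is easy to picture: grow the batch while the pipeline is falling behind (bigger batches amortize merge overhead), shrink it when the pipeline is comfortably ahead (smaller batches reduce end-to-end latency). A deliberately simplified version of that feedback loop, with illustrative thresholds rather than our production values:

# Simplified adaptive batch sizing: a feedback loop on processing lag.
def next_batch_size(current_size, lag_seconds, target_lag_seconds=60,
                    min_size=100, max_size=10_000):
    if lag_seconds > 2 * target_lag_seconds:
        # Falling behind badly: take bigger bites to catch up.
        candidate = current_size * 2
    elif lag_seconds > target_lag_seconds:
        candidate = int(current_size * 1.2)
    else:
        # Keeping up: shrink slightly to favor lower latency.
        candidate = int(current_size * 0.9)
    return max(min_size, min(max_size, candidate))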
Production-Grade Reliability
In production environments, many things can go wrong. Our system is designed to handle:
Network partitions and DynamoDB throttling
Instance failures and spot terminations
Varying load patterns and data skew
Schema evolution and data type changes
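Throttling in particular is handled with retries and exponential backoff with jitter, so a hot partition or a burst of writes degrades gracefully instead of failing the job. A minimal sketch of that retry policy; the is_throttling_error check is a placeholder, since in practice the specific DynamoDB error codes are inspected:

# Minimal retry-with-backoff sketch for throttled stream reads.
import random
import time

def with_backoff(call, is_throttling_error, max_attempts=8, base_delay=0.2):
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            if not is_throttling_error(exc) or attempt == max_attempts - 1:
                raise
            # Exponential backoff with full jitter, capped at ~30 seconds.
            delay = min(30.0, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))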
Best Practices
Through our experience helping companies process large volumes of DynamoDB changes every day, we've learned a few things:
Start With Default Configurations: Our defaults are optimized for most use cases. Tune only when you have specific requirements.
Monitor Processing Lag: Keep an eye on the lag between DynamoDB changes and processing, and set up alerts for unusual lag patterns; a simple way to measure that lag from the records themselves is sketched after this list.
Plan for Schema Evolution: DynamoDB schemas often evolve. Our system handles this automatically, but plan your schema changes thoughtfully.
Consider Partition Strategy: Think about how you'll query the processed data. Use appropriate partition keys in your target dataset.
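As promised above, here is one simple way to measure processing lag: DynamoDB Streams stamps every change with ApproximateCreationDateTime, so the age of the oldest record in a batch is a direct lag signal. The alert hook below is a placeholder for whatever metrics or paging system you already use.

# Sketch: compute processing lag from the records themselves.
import time

def processing_lag_seconds(record):
    # ApproximateCreationDateTime is treated here as epoch seconds,
    # matching the sample record shown earlier.
    created_at = record["dynamodb"]["ApproximateCreationDateTime"]
    return time.time() - created_at

def check_batch_lag(batch, alert, threshold_seconds=300):
    # Alert on the oldest record in the batch; alert() is a placeholder.
    worst = max(processing_lag_seconds(r) for r in batch)
    if worst > threshold_seconds:
        alert(f"Stream processing lag is {worst:.0f}s (threshold {threshold_seconds}s)")
    return worst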
When to Optimize
While our default settings work well for most cases, here are situations where you might want to tune:
Processing lag consistently above 5 minutes
Individual record processing taking too long
High number of concurrent streams
Complex transformations on the stream
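When you do need to tune, the adjustments live in the stream configuration rather than in your transform code. The commented parameter names below are illustrative placeholders showing the kind of knobs involved, not the exact Datazone API; check the platform documentation for the real option names.

# Illustrative only: the commented parameters are hypothetical placeholders
# for the kind of tuning knobs you might reach for, not the actual API.
tuned_stream = Stream(
    source=DynamoDBSource(
        table_name="orders",
        region="us-west-2",
        stream_enabled=True,
    ),
    # Hypothetical tuning knobs for high-volume or lag-sensitive workloads:
    # max_batch_size=5_000,           # cap records per micro-batch
    # trigger_interval="30 seconds",  # how often to poll for new changes
    # max_concurrency=4,              # shards processed in parallel per worker
)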
Conclusion
At Datazone, we believe that streaming shouldn't be complex. By handling the intricate details of stream processing internally while exposing a clean, pythonic interface, we enable developers to focus on their data transformations rather than infrastructure.
We're constantly working on making our stream processing even more robust and efficient. If you're handling DynamoDB streams, we'd love to hear about your use case and how we can help.