Amazon SageMaker Lakehouse and Amazon Redshift support zero-ETL integrations from applications | Amazon Web Services

Today we announced the general availability of Amazon SageMaker Lakehouse and Amazon Redshift support for zero-ETL integrations from applications. Amazon SageMaker Lakehouse unifies all your data across Amazon Simple Storage Service (Amazon S3) data lakes and Amazon Redshift data warehouses, helping you build powerful analytics and AI/ML applications on a single copy of data. SageMaker Lakehouse gives you the flexibility to access and query your data in place using all Apache Iceberg-compatible tools and engines. Zero-ETL is a set of fully managed integrations by AWS that minimizes the need to build ETL pipelines for common ingestion and replication use cases. With zero-ETL integrations from applications such as Salesforce, SAP, and Zendesk, you can reduce the time you spend building data pipelines and focus on running unified analytics on all your data in Amazon SageMaker Lakehouse and Amazon Redshift.

As organizations rely on an increasingly diverse array of digital systems, data fragmentation has become a significant challenge. Valuable information is often scattered across multiple repositories, including databases, applications, and other platforms. To realize the full potential of their data, businesses must be able to access and consolidate it from these diverse sources. In response to this challenge, users build data pipelines to extract and load (EL) data from various applications into centralized data lakes and data warehouses. With zero-ETL, you can efficiently replicate valuable data from your customer support, relationship management, and enterprise resource planning (ERP) applications into data lakes and data warehouses for analytics and AI/ML, saving you the weeks of engineering effort needed to design, build, and test data pipelines.

Prerequisites

  • An Amazon SageMaker Lakehouse catalog configured through AWS Glue Data Catalog and AWS Lake Formation.
  • An AWS Glue database that is configured for Amazon S3 where the data will be stored.
  • A secret in AWS Secrets Manager to use for connecting to the data source. The credentials must contain the username and password you use to log in to the application.
  • An AWS Identity and Access Management (IAM) role for an Amazon SageMaker Lakehouse or Amazon Redshift job. The role must grant access to all resources used by the job, including Amazon S3 and AWS Secrets Manager.
  • A valid AWS Glue connection to the application you want to integrate.
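
The IAM role prerequisite can be sketched as two JSON policy documents. This is a minimal illustration, not an official AWS policy: the bucket name, the secret ARN, the account ID, and the choice of glue.amazonaws.com as the trusted service are all assumptions, and a real role will likely need further permissions (for example, for the AWS Glue Data Catalog and AWS Lake Formation).

```python
import json

# Hypothetical resource identifiers; replace them with your own.
BUCKET = "my-zero-etl-target-bucket"
SECRET_ARN = "arn:aws:secretsmanager:us-east-1:111122223333:secret:my-salesforce-secret"


def build_trust_policy() -> dict:
    """Trust policy that lets a service assume the role.

    The service principal used here is an assumption for illustration.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "glue.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def build_access_policy(bucket: str, secret_arn: str) -> dict:
    """Permissions policy covering the S3 target and the connection secret."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "TargetBucketObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
            {
                "Sid": "TargetBucketList",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                "Sid": "ReadConnectionSecret",
                "Effect": "Allow",
                "Action": ["secretsmanager:GetSecretValue"],
                "Resource": secret_arn,
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_trust_policy(), indent=2))
    print(json.dumps(build_access_policy(BUCKET, SECRET_ARN), indent=2))
```

Scoping each statement to a single bucket and a single secret keeps the role close to least privilege; widen the resources only if the integration reports an access error.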

How it works – creating a Glue connection
I’ll start by creating a connection using the AWS Glue console. I choose the Salesforce integration as the data source.

Next, I’ll provide the location of the Salesforce instance to use for the connection, along with the rest of the required information. Make sure you use the .salesforce.com domain instead of .force.com. You can choose between two authentication methods: a JSON Web Token (JWT) obtained through Salesforce access tokens, or OAuth login through a browser.
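
The domain requirement is easy to check before you paste a URL into the console. The following is a small sketch using only the Python standard library; the function name and the example hostnames are hypothetical, and it assumes the URL includes a scheme such as https://.

```python
from urllib.parse import urlparse


def uses_salesforce_domain(instance_url: str) -> bool:
    """Return True when the URL's host is under salesforce.com.

    Hosts under force.com (or any other domain) return False.
    A URL without a scheme has no parsed hostname and also returns False.
    """
    host = (urlparse(instance_url).hostname or "").lower()
    return host == "salesforce.com" or host.endswith(".salesforce.com")


if __name__ == "__main__":
    for url in (
        "https://example.my.salesforce.com",    # hypothetical instance URL
        "https://example.lightning.force.com",  # hypothetical Lightning URL
    ):
        print(url, uses_salesforce_domain(url))
```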

I review all the information and then choose Create a connection.

After I log in to the Salesforce instance through the popup (not shown here), the connection is successfully established.

How it works – creating a zero-ETL integration
Now that I have a connection, I choose Zero-ETL integrations from the left navigation bar and then choose Create zero-ETL integration.

First, I choose the source type for my integration – in this case Salesforce, so I can use my recently created connection.

Next, I select the objects from the data source that I want to replicate to the target database in AWS Glue.

During the process of adding objects, I can quickly preview both data and metadata to make sure I’m selecting the right object.

By default, zero-ETL integration will sync data from source to target every 60 minutes. However, you can change this interval to reduce replication costs in cases that do not require frequent updates.
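
To get a feel for what changing the interval means, the sketch below counts how many scheduled syncs fit in a day for a given interval. The function name is hypothetical, and the sketch only counts runs: how that count translates into cost depends on AWS Glue pricing, which it does not model.

```python
MINUTES_PER_DAY = 24 * 60
DEFAULT_INTERVAL_MINUTES = 60  # the default interval described above


def syncs_per_day(interval_minutes: int = DEFAULT_INTERVAL_MINUTES) -> int:
    """Number of complete sync intervals that fit into one day."""
    if interval_minutes <= 0:
        raise ValueError("interval_minutes must be positive")
    return MINUTES_PER_DAY // interval_minutes


if __name__ == "__main__":
    print(syncs_per_day())     # 24 runs with the 60-minute default
    print(syncs_per_day(360))  # 4 runs with a 6-hour interval
```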

I review my settings and then choose Build and run the integration.

The data in the source (Salesforce instance) has now been replicated to the target database salesforcezeroETL in my AWS account. This integration has two phases. Phase 1: the initial load ingests all the data for the selected objects and can take anywhere from 15 minutes to several hours, depending on the size of the data in those objects. Phase 2: the incremental load detects any changes (such as new, updated, or deleted records) and applies them to the target.
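
The two phases can be pictured with a toy in-memory model. This is only a conceptual sketch, not how the managed integration is implemented: the record IDs, field names, and the (operation, id, record) change format are invented for illustration.

```python
from typing import Dict, Iterable, Tuple

Record = Dict[str, str]
# A change is (operation, record_id, record); operation is
# "insert", "update", or "delete".
Change = Tuple[str, str, Record]


def initial_load(source: Dict[str, Record]) -> Dict[str, Record]:
    """Phase 1: copy every record of the selected object to the target."""
    return {record_id: dict(record) for record_id, record in source.items()}


def apply_changes(target: Dict[str, Record], changes: Iterable[Change]) -> None:
    """Phase 2: apply new, updated, and deleted records to the target in order."""
    for operation, record_id, record in changes:
        if operation in ("insert", "update"):
            target[record_id] = dict(record)
        elif operation == "delete":
            target.pop(record_id, None)
        else:
            raise ValueError(f"unknown operation: {operation}")


# Hypothetical source data standing in for one replicated object.
source = {"001": {"Name": "Acme"}, "002": {"Name": "Globex"}}
target = initial_load(source)

apply_changes(
    target,
    [
        ("update", "001", {"Name": "Acme Corp"}),
        ("insert", "003", {"Name": "Initech"}),
        ("delete", "002", {}),
    ],
)
print(sorted(target))  # ['001', '003']
```

The point of the split is that only the first phase scales with the total size of the selected objects; each later sync only has to process what changed since the previous one.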

Each of the objects I selected earlier is stored in its own table in the database. From there, I can view the table data for each of the objects that were replicated from the data source.

Finally, here’s a look at the data in Salesforce. As new entities are created or existing entities are updated or changed in Salesforce, the data changes are automatically synced to the target in AWS Glue.

Now available
Amazon SageMaker Lakehouse and Amazon Redshift support for zero-ETL integrations from applications is now available in the US East (N. Virginia), US East (Ohio), US West (Oregon), Asia Pacific (Hong Kong), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Europe (Frankfurt), Europe (Ireland), and Europe (Stockholm) AWS Regions. See the AWS Glue Pricing page for pricing information.

To learn more, visit the AWS Glue User Guide. Send feedback to AWS re:Post for AWS Glue or through your usual AWS Support contacts. Get started by creating a new zero-ETL integration today.

– Veliswa
