Today, I’m very excited to announce the general availability of Amazon SageMaker Lakehouse, a capability that unifies data across Amazon Simple Storage Service (Amazon S3) data lakes and Amazon Redshift data warehouses, helping you build powerful analytics and artificial intelligence and machine learning (AI/ML) applications on a single copy of data. SageMaker Lakehouse is part of the next generation of Amazon SageMaker, a unified platform for data, analytics, and AI that brings together widely adopted AWS machine learning and analytics capabilities to provide an integrated analytics and AI environment.
Customers want to do more with their data. To move faster on their analytics journey, they choose the right repositories and databases to store it. Data ends up spread across data lakes, data warehouses, and various applications, creating data silos that make it difficult to access and use. This fragmentation leads to duplicate copies of data and complex data pipelines, which in turn increase costs for the organization. Furthermore, customers are constrained to specific query engines and tools, because how and where their data is stored limits their options and hinders their ability to work with the data the way they would prefer. Finally, inconsistent data access makes it difficult for customers to make informed business decisions.
SageMaker Lakehouse addresses these challenges by helping you unify data across Amazon S3 data lakes and Amazon Redshift data warehouses. It gives you the flexibility to access and query data in place with all Apache Iceberg-compatible engines and tools. With SageMaker Lakehouse, you can centrally define fine-grained permissions and enforce them across multiple AWS services, simplifying data sharing and collaboration. Getting data into your lakehouse is easy. In addition to seamlessly accessing data from your existing data lakes and data warehouses, you can use zero-ETL integrations from operational databases such as Amazon Aurora, Amazon RDS for MySQL, and Amazon DynamoDB, as well as from applications such as Salesforce and SAP. SageMaker Lakehouse fits into your existing environments.
Get started with SageMaker Lakehouse
For this demo, I’m using a pre-configured environment that has multiple AWS data sources. I go to the Amazon SageMaker Unified Studio console (preview), which provides an integrated development environment for all your data and AI. With Unified Studio, you can seamlessly access and query data from multiple sources through SageMaker Lakehouse while using familiar AWS analytics and AI/ML tools.
Here you can create and manage projects that serve as shared workspaces. These projects allow team members to collaborate, work with data, and develop artificial intelligence models together. Creating a project automatically sets up AWS Glue Data Catalog databases, creates a catalog for Redshift Managed Storage (RMS) data, and provides the necessary permissions. You can start by creating a new project or continue with an existing project.
To create a new project, select Create a project.
I have two project profile options for building and interacting with a lakehouse. The first is Data analysis and AI-ML model development, where you can analyze data and build ML and generative AI models using Amazon EMR, AWS Glue, Amazon Athena, Amazon SageMaker AI, and SageMaker Lakehouse. The second is SQL analyst, where you can analyze your data in SageMaker Lakehouse using SQL. For this demo, I continue with SQL analyst.
I enter the name of the project in the Project name field and select SQL analyst under Project profile. I choose Continue.
I enter values for all parameters under Tools. I enter values to create my Lakehouse databases, enter values to create my Amazon Redshift Serverless resources, and finally enter the name of my catalog under Lakehouse catalog.
In the next step, I review the resources and select Create a project.
After the project is created, I observe the project details.
I go to Data in the navigation pane and choose the + (plus) sign to Add data. I choose Create a catalog to create a new catalog and select Add data.
After creating the RMS catalog, I select Create in the navigation bar and then choose Query editor under Data analysis and integration to create a schema in the RMS catalog, create a table, and load the table with sample sales data.
After entering SQL queries in the designated cells, I choose Select a data source from the drop-down menu on the right and create a database connection to the Amazon Redshift data warehouse. This connection lets me run queries and retrieve the required data from the database.
Once the connection to the database is successfully established, I choose Run all to execute all queries and watch the execution progress until all results are displayed.
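If you prefer to script this step instead of using the query editor, the same schema creation and data load can be sketched with the Amazon Redshift Data API through boto3. The workgroup and database names below are placeholders, and the sales schema is my own illustration rather than the exact DDL from this demo:

```python
import time

# Illustrative DDL and sample data load; the actual schema in the demo may differ.
STATEMENTS = [
    "CREATE SCHEMA IF NOT EXISTS sales;",
    """CREATE TABLE IF NOT EXISTS sales.orders (
        order_id   INT,
        customer   VARCHAR(64),
        amount     DECIMAL(10, 2),
        order_date DATE
    );""",
    """INSERT INTO sales.orders VALUES
        (1, 'Alice', 120.50, '2024-11-01'),
        (2, 'Bob',    75.00, '2024-11-02');""",
]


def run_statements(workgroup: str, database: str) -> None:
    """Run each statement through the Redshift Data API and wait for completion."""
    import boto3  # imported here so the module loads without the AWS SDK installed

    client = boto3.client("redshift-data")
    for sql in STATEMENTS:
        resp = client.execute_statement(
            WorkgroupName=workgroup, Database=database, Sql=sql
        )
        # The Data API is asynchronous, so poll until the statement finishes.
        while True:
            desc = client.describe_statement(Id=resp["Id"])
            if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
                break
            time.sleep(1)
        if desc["Status"] != "FINISHED":
            raise RuntimeError(f"Statement failed: {desc.get('Error')}")


if __name__ == "__main__":
    run_statements(workgroup="my-workgroup", database="dev")  # placeholder names
```

Because the Data API is asynchronous, the sketch polls describe_statement until each statement completes before submitting the next one.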
I’m using two other pre-configured catalogs for this example. A catalog is a container that organizes your lakehouse object definitions, such as schemas and tables. The first is the Amazon S3 data lake catalog (test-s3-catalog), which holds customer records containing detailed transactional and demographic information. The second is the lakehouse catalog (churn_lakehouse), dedicated to storing and managing customer churn data. This integration creates a unified environment where I can analyze customer behavior alongside churn predictions.
I select Data in the navigation pane and find my catalogs under the Lakehouse section. SageMaker Lakehouse offers several analysis options, including Query with Athena, Query with Redshift, and Open Jupyter Lab notebook.
Note that you have to choose the Data analysis and AI-ML model development profile when creating the project if you want to use the Open Jupyter Lab notebook option. If you choose Open Jupyter Lab notebook, you can interact with SageMaker Lakehouse using Apache Spark via Amazon EMR 7.5.0 or AWS Glue 5.0 by configuring the Iceberg REST catalog, allowing you to process data across your data lakes and data warehouses in a unified manner.
This is what querying looks like with a Jupyter Lab notebook:
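The notebook connects Spark to SageMaker Lakehouse through an Iceberg REST catalog. A rough sketch of what that configuration can look like is below; the catalog name, account ID, Region, endpoint, and the queried table are all placeholders following the standard Iceberg REST catalog convention, so check the SageMaker Lakehouse documentation for the exact property values in your environment:

```python
def iceberg_rest_conf(catalog: str, region: str, account_id: str) -> dict:
    """Build Iceberg REST catalog properties for Spark (pure configuration, no AWS calls).

    The endpoint and warehouse values are assumptions based on the Iceberg
    REST catalog convention, not the definitive SageMaker Lakehouse settings.
    """
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": f"https://glue.{region}.amazonaws.com/iceberg",
        f"{prefix}.warehouse": account_id,
        "spark.sql.extensions": (
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
        ),
    }


def main() -> None:
    # Requires pyspark and the Iceberg runtime JARs on the classpath.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("lakehouse-demo")
    conf = iceberg_rest_conf("churn_lakehouse", "us-east-1", "123456789012")
    for key, value in conf.items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()
    # Hypothetical table name for illustration only.
    spark.sql("SELECT * FROM churn_lakehouse.public.churn LIMIT 10").show()


if __name__ == "__main__":
    main()
```

The REST catalog approach is what lets the same table definitions be reached from EMR, AWS Glue, and other Iceberg-compatible engines without copying data.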
I continue by choosing Query with Athena. With this option, I can use Amazon Athena’s serverless querying capability to analyze the sales data directly in SageMaker Lakehouse. When I choose Query with Athena, the Query editor launches automatically, providing a workspace where I can compose and run SQL queries against the lakehouse. This integrated query environment offers seamless data exploration and analysis, complete with syntax highlighting and auto-completion for increased productivity.
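The queries the editor runs can also be issued programmatically. Here is a minimal sketch using Athena’s boto3 client; the database name, S3 output location, and the aggregation query itself are placeholders for illustration:

```python
import time

# Hypothetical query against the demo's sales data; adjust names to your catalog.
QUERY = "SELECT customer, SUM(amount) AS total FROM sales.orders GROUP BY customer;"


def run_athena_query(database: str, output_s3: str) -> list:
    """Start an Athena query, wait for it to finish, and return the result rows."""
    import boto3  # imported here so the module loads without the AWS SDK installed

    athena = boto3.client("athena")
    qid = athena.start_query_execution(
        QueryString=QUERY,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )["QueryExecutionId"]
    # Athena execution is asynchronous; poll until a terminal state.
    while True:
        status = athena.get_query_execution(QueryExecutionId=qid)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Query ended in state {state}")
    return athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]


if __name__ == "__main__":
    rows = run_athena_query(database="sales", output_s3="s3://my-bucket/results/")
    print(rows)
```

Results also land as files in the S3 output location, so long result sets can be read from there instead of paging through get_query_results.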
I can also use the Query with Redshift option to run SQL queries against the lakehouse.
SageMaker Lakehouse offers a comprehensive solution for modern data management and analysis. By unifying data access from multiple sources, supporting a wide range of analytics and ML engines, and providing fine-grained access controls, SageMaker Lakehouse helps you get the most out of your data assets. Whether you’re working with data lakes in Amazon S3, data warehouses in Amazon Redshift, or operational databases and applications, SageMaker Lakehouse provides the flexibility and security you need to drive innovation and data-driven decision-making. You can use hundreds of connectors to integrate data from different sources. Additionally, you can access and query data in-place using federated query capabilities across third-party data sources.
Now available
You can access SageMaker Lakehouse through the AWS Management Console, the API, the AWS Command Line Interface (AWS CLI), or the AWS SDKs. You can also access it through the AWS Glue Data Catalog and AWS Lake Formation. SageMaker Lakehouse is available in the US East (N. Virginia), US East (Ohio), US West (Oregon), Canada (Central), Europe (Ireland), Europe (Frankfurt), Europe (Stockholm), Europe (London), Asia Pacific (Sydney), Asia Pacific (Hong Kong), Asia Pacific (Tokyo), Asia Pacific (Singapore), Asia Pacific (Seoul), and South America (São Paulo) AWS Regions.
See Amazon SageMaker Lakehouse pricing for pricing information.
For more information about Amazon SageMaker Lakehouse and how it can simplify your data analysis and AI/ML workflows, see the Amazon SageMaker Lakehouse documentation.
– Esra
12/6/2024: Updated list of regions