Import data from over 40 data sources for no-code machine learning with Amazon SageMaker Canvas

Data is at the heart of machine learning (ML). Including relevant data to comprehensively represent your business problem ensures that you effectively capture trends and relationships so that you can derive the insights needed to drive business decisions. With Amazon SageMaker Canvas, you can now import data from over 40 data sources to be used for no-code ML. Canvas expands access to ML by providing business analysts with a visual interface that allows them to generate accurate ML predictions on their own, without requiring any ML experience or having to write a single line of code. Now, you can import data in-app from popular data services such as Amazon Athena as well as from third-party software as a service (SaaS) platforms supported by Amazon AppFlow, such as Salesforce, SAP OData, and Google Analytics.

The process of gathering high-quality data for ML can be complex and time-consuming, because the proliferation of SaaS applications and data storage services has spread data across a multitude of systems. For example, you may need to conduct a customer churn analysis using customer data from Salesforce, financial data from SAP, and logistics data from Snowflake. To create a dataset across these sources, you need to log in to each application individually, select the desired data, and export it locally, where it can then be aggregated using a different tool. This dataset then needs to be imported into a separate application for ML.

With this launch, Canvas empowers you to capitalize on data stored in disparate sources by supporting in-app data import and aggregation from over 40 data sources. This feature is made possible through new native connectors to Athena and to Amazon AppFlow via the AWS Glue Data Catalog. Amazon AppFlow is a managed service that enables you to securely transfer data from third-party SaaS applications to Amazon Simple Storage Service (Amazon S3) and catalog the data with the Data Catalog with just a few clicks. After your data is transferred, you can simply access the data source within Canvas, where you can view table schemas, join tables within or across data sources, write Athena queries, and preview and import your data. After your data is imported, you can use existing Canvas functionalities such as building an ML model, viewing column impact data, or generating predictions. You can automate the data transfer process in Amazon AppFlow to activate on a schedule to ensure that you always have access to the latest data in Canvas.

Solution overview

The steps outlined in this post provide two examples of how to import data into Canvas for no-code ML. In the first example, we demonstrate how to import data through Athena. In the second example, we show how to import data from a third-party SaaS application via Amazon AppFlow.

Import data from Athena

In this section, we show an example of importing data in Canvas from Athena to conduct a customer segmentation analysis. We create an ML classification model to categorize our customer base into four different classes, with the end goal of using the model to predict which class a new customer will fall into. We follow three major steps: import the data, train a model, and generate predictions. Let’s get started.

Import the data

To import data from Athena, complete the following steps:

On the Canvas console, choose Datasets in the navigation pane, then choose Import.
Expand the Data Source menu and choose Athena.
Choose the correct database and table that you want to import from. You can optionally preview the table by choosing the preview icon.

The following screenshot shows an example of the preview table.

In our example, we segment customers based on the marketing channel through which they have engaged our services. This is specified by the column segmentation, where A is print media, B is mobile, C is in-store promotions, and D is television.

When you’re satisfied that you have the right table, drag the desired table into the Drag and drop datasets to join section.
You can now optionally select or deselect columns, join tables by dragging another table into the Drag and drop datasets to join section, or write SQL queries to specify your data slice. For this post, we use all the data in the table.
To import the data, choose Import data.

Your data is imported into Canvas as a dataset from the specific table in Athena.
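If you want to check the same data slice outside of Canvas before importing it, the following is a minimal sketch of an equivalent Athena query in Python. It is not part of the Canvas workflow. The function names are hypothetical, and the database, table, column, and S3 output location values are placeholders that you supply. The client passed to run_athena_query is assumed to be a boto3 Athena client, for example boto3.client("athena").

```python
import time


def _quote(identifier):
    """Double-quote an identifier, escaping any embedded double quotes."""
    return '"' + identifier.replace('"', '""') + '"'


def build_slice_query(table, columns=None, limit=None):
    """Return a SELECT statement for a slice of a table (hypothetical helper)."""
    column_list = ", ".join(_quote(c) for c in columns) if columns else "*"
    query = f"SELECT {column_list} FROM {_quote(table)}"
    if limit is not None:
        query += f" LIMIT {int(limit)}"
    return query


def run_athena_query(athena, query, database, output_location, poll_seconds=1.0):
    """Run a query with a boto3 Athena client and return the first page of rows."""
    execution_id = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    # Poll until Athena reports a terminal state.
    while True:
        status = athena.get_query_execution(QueryExecutionId=execution_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")

    results = athena.get_query_results(QueryExecutionId=execution_id)
    # Each row is a list of cells; the first row holds the column headers.
    return [
        [cell.get("VarCharValue") for cell in row["Data"]]
        for row in results["ResultSet"]["Rows"]
    ]
```

For example, build_slice_query("customers", ["age", "segmentation"], limit=10) produces a query that previews two columns of a hypothetical customers table. Only segmentation is a column name taken from this post.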

Train a model

After your data is imported, it shows up on the Datasets page. At this stage, you can build a model. To do so, complete the following steps:

Select your dataset and choose Create a model.
For Model name, enter your model name (for this post, my_first_model).
Canvas enables you to create models for predictive analysis, image analysis, and text analysis. Because we want to categorize customers, select Predictive analysis for Problem type.
To proceed, choose Create.

On the Build page, you can see statistics about your dataset, such as the percentage of missing values and mean of the data.

For Target column, choose a column (for this post, segmentation).

Canvas offers two types of models that can generate predictions. Quick build prioritizes speed over accuracy, providing a model in 2–15 minutes. Standard build prioritizes accuracy over speed, providing a model in 2–4 hours.

For this post, choose Quick build.
After the model is trained, you can analyze the model accuracy.

The following model categorizes customers correctly 94.67% of the time.
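To make the accuracy figure concrete, the following sketch computes classification accuracy as the share of rows whose predicted class matches the actual class. The row counts are hypothetical and chosen only to reproduce a value of 94.67%. They are not the counts Canvas used for this model, and Canvas selects its own evaluation data.

```python
def accuracy(actual, predicted):
    """Return the fraction of predictions that match the actual labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one label is required")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Hypothetical evaluation set: 150 customers, 142 of them classified correctly.
actual = ["A"] * 150
predicted = ["A"] * 142 + ["B"] * 8
print(f"{accuracy(actual, predicted):.2%}")  # prints 94.67%
```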

You can optionally also view how each column impacts the categorization. In this example, the age column has less influence on the categorization as customer age increases. To generate predictions with your new model, choose Predict.

Generate predictions

On the Predict tab, you can generate both batch predictions and single predictions. Complete the following steps:

For this post, choose Single prediction to understand what customer segmentation will result for a new customer.

For our prediction, we want to understand which segment a customer falls into if they are 32 years old and a lawyer by profession.

Replace the corresponding values with these inputs.
Choose Update.

The updated prediction is displayed in the prediction window. In this example, a 32-year-old lawyer is classified in segment D.

Import data from a third-party SaaS application to AWS

To import data from third-party SaaS applications into Canvas for no-code ML, you must first transfer data from the application to Amazon S3 via Amazon AppFlow. In this example, we transfer manufacturing data from SAP OData.

To transfer your data, complete the following steps:

On the Amazon AppFlow console, choose Create flow.
For Flow name, enter a name.
Choose Next.
For Source name, choose your desired third-party SaaS application (for this post, SAP OData).
Choose Create new connection.
In the Connect to SAP OData pop-up window, fill out the authentication details and choose Connect.
For SAP OData object, choose the object containing your data within SAP OData.
For Destination name, choose Amazon S3.
For Bucket details, specify your S3 bucket details.
Select Catalog your data in the AWS Glue Data Catalog.
For User role, choose the AWS Identity and Access Management (IAM) role that the Canvas user will use to access the data.
For Flow trigger, select Run on demand.

Alternatively, you can automate the flow transfer by selecting Run flow on schedule.

Choose Next.
Choose how to map the fields and complete the field mapping. For this post, because there is no corresponding destination database to map to, there is no need to specify the mapping.
Choose Next.
Optionally, add filters if necessary to restrict data transferred.
Choose Next.
Review your details and choose Create flow.

When the flow is created, a green ribbon appears at the top of the page indicating that the flow was created successfully.

Choose Run flow.

At this stage, you have successfully transferred your data from SAP OData to Amazon S3.
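The console steps above can also be approximated programmatically. The following is a rough sketch, assuming a boto3 AppFlow client (boto3.client("appflow")) and an SAP OData connection that already exists. The helper names and all argument values are hypothetical. The request field names follow our understanding of the AppFlow CreateFlow request shape, so verify them against the API reference before relying on them. Passing a schedule expression corresponds to selecting Run flow on schedule instead of Run on demand.

```python
def build_flow_request(flow_name, connection_name, object_path, bucket, prefix,
                       catalog_role_arn, catalog_database, table_prefix,
                       schedule_expression=None):
    """Build keyword arguments for AppFlow create_flow (hypothetical helper)."""
    if schedule_expression is None:
        trigger = {"triggerType": "OnDemand"}
    else:
        trigger = {
            "triggerType": "Scheduled",
            "triggerProperties": {
                "Scheduled": {"scheduleExpression": schedule_expression}
            },
        }
    return {
        "flowName": flow_name,
        "triggerConfig": trigger,
        # Source: an object exposed by an existing SAP OData connection.
        "sourceFlowConfig": {
            "connectorType": "SAPOData",
            "connectorProfileName": connection_name,
            "sourceConnectorProperties": {"SAPOData": {"objectPath": object_path}},
        },
        # Destination: the S3 bucket that receives the transferred data.
        "destinationFlowConfigList": [{
            "connectorType": "S3",
            "destinationConnectorProperties": {
                "S3": {"bucketName": bucket, "bucketPrefix": prefix}
            },
        }],
        # Map every source field as-is (assumed equivalent to skipping mapping).
        "tasks": [{
            "sourceFields": [],
            "connectorOperator": {"SAPOData": "NO_OP"},
            "taskType": "Map_all",
            "taskProperties": {},
        }],
        # Catalog the output in the AWS Glue Data Catalog.
        "metadataCatalogConfig": {
            "glueDataCatalog": {
                "roleArn": catalog_role_arn,
                "databaseName": catalog_database,
                "tablePrefix": table_prefix,
            }
        },
    }


def create_and_start_flow(appflow, request):
    """Create the flow, then start it (on demand) or activate it (scheduled)."""
    appflow.create_flow(**request)
    return appflow.start_flow(flowName=request["flowName"])
```

Keeping the request construction separate from the API calls makes it possible to inspect or log the request before anything is created in your account.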

Now you can import the data from within the Canvas app. To import your data into Canvas, follow the same steps described in the Import the data section earlier in this post. For this example, on the Data source drop-down menu on the Data import page, you can see SAP OData listed.
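If the transferred data does not appear as expected, one way to check what the flow cataloged is to list the tables in the AWS Glue Data Catalog database that you chose for the flow. The following is a minimal sketch that assumes a boto3 AWS Glue client (boto3.client("glue")). The function name is hypothetical, and the database name is a placeholder for your own.

```python
def list_catalog_tables(glue, database):
    """Return {table name: [(column name, column type), ...]} for a database."""
    tables = {}
    kwargs = {"DatabaseName": database}
    while True:
        page = glue.get_tables(**kwargs)
        for table in page.get("TableList", []):
            columns = table.get("StorageDescriptor", {}).get("Columns", [])
            tables[table["Name"]] = [(c["Name"], c.get("Type")) for c in columns]
        # Keep requesting pages until the service stops returning a token.
        token = page.get("NextToken")
        if not token:
            return tables
        kwargs["NextToken"] = token
```

The returned column names and types correspond to the table schemas that you can view in Canvas before importing.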

You are now able to use all existing Canvas functionalities, such as cleaning your data, building an ML model, viewing column impact data, and generating predictions.

Clean up

To clean up the resources provisioned, log out of the Canvas application by choosing Log out in the navigation pane.

Conclusion

With Canvas, you can now import data for no-code ML from 47 data sources through native connectors with Athena and Amazon AppFlow via the AWS Glue Data Catalog. This process enables you to directly access and aggregate data across data sources within Canvas after data is transferred via Amazon AppFlow. You can automate the data transfer to activate on a schedule, which means that you don’t have to go through the process again to refresh your data. With this process, you can create new datasets with your latest data without having to leave the Canvas app. This feature is now available in all AWS Regions where Canvas is available. To get started with importing your data, navigate to the Canvas console and follow the steps outlined in this post. To learn more, refer to Connect to data sources.

About the authors

Brandon Nair is a Senior Product Manager for Amazon SageMaker Canvas. His professional interest lies in creating scalable machine learning services and applications. Outside of work he can be found exploring national parks, perfecting his golf swing or planning an adventure trip.

Sanjana Kambalapally is a Software Development Manager for Amazon SageMaker Canvas, which aims to democratize machine learning by building no-code ML applications.

Xin Xu is a Software Development Engineer in the Canvas team, where he works on data preparation, among other aspects of no-code machine learning products. In his spare time, he enjoys jogging, reading and watching movies.

Volkan Unsal is a Sr. Frontend Engineer in the Canvas team, where he builds no-code products to make artificial intelligence accessible to humans. In his spare time, he enjoys running, reading, watching e-sports, and martial arts.