Automating AI Pipelines with Multi-Cloud Data

Parker Merritt
IBM DCPE Group
Apr 24, 2020 · 12 min read

IBM Cloud Pak for Data is a holistic, container-based enterprise analytics platform, purpose-built for collaboration. With this platform, business users can easily automate and accelerate the operationalization of their data science projects.

Data is the fuel that powers modern enterprise. With cutting-edge AI and analytics applications, it’s now easier than ever for business users to generate meaningful insights from their data, increasing productivity, efficiency, and innovation.

Still, a major challenge remains: your data is useless if you can’t trust it or access it. For many businesses, mission-critical applications span multiple clouds, and users may struggle to foster collaboration across disparate data sources.

With IBM Cloud Pak for Data, however, enterprises can easily connect to their data, govern it, and use it for analysis, no matter where their applications are located. With modern tools that facilitate analytics and remove barriers to collaboration, users can spend less time finding data and more time using it effectively.

In this tutorial, we’ll learn how to connect IBM Cloud Pak for Data to an Amazon Web Services S3 data source, clean and prepare the data for analysis, and generate an automated AI pipeline — all within 15 minutes, without writing a single line of code.

For the visual learners out there, this tutorial is also available as a video on YouTube.

1. First, sign into the Cloud Pak for Data web client.

Don’t currently have access to a Cloud Pak for Data instance? Request a free trial environment with Cloud Pak Experiences.

2. Once you’re signed in, take a moment to explore the main dashboard. Here, users can configure a custom interface with various metrics related to the number of data assets, governance rules, business terms, and more.

3. Once you’re finished exploring the main dashboard, click on the menu icon in the top-left corner. Then, click the Projects tab.

Projects are how Cloud Pak users organize their resources when working with data.

4. Next, click New Project, then Create an empty project, which will contain all of your data source connections and assets.

When you create a project, you have the Admin role and full control over the project. If you have the Editor role, you can add assets and collaborators to a project.

5. Give your new project a name and description. Note that you also have the option of integrating your project with Git for version control, allowing your team to track any changes made. When you’re finished, click Create.

Integrating with Git allows you to sync your project with a remote repository (for example, GitHub), which is useful in a number of ways. From supporting collaboration with the broader data science community to helping you back up and share your project’s assets, Git integration supports a wide range of developer workflows.

6. After your project is created, you’ll see the project Overview page, where you can find a high-level summary of the project’s assets and collaborators.

7. Click over to your project’s Assets tab. This page will display all of your project’s data source connections, notebooks, dashboards, and other relevant data assets.

8. Near the top of the page, click the Add to Project button. Then, select Connection.

9. Here you’ll see a catalog of the various data sources and platforms you can connect to with Cloud Pak for Data. Under third-party services, click the Amazon S3 option.

According to a survey by the IBM Institute for Business Value, 85 percent of organizations already operate in a multi-cloud environment. Enabling connectivity across these clouds is essential for enterprise analytics.

10. Next, you’ll be prompted to enter a name and description for your connection, along with the name of the S3 bucket containing your data asset. Be sure to enter the bucket name exactly as it appears in AWS.

11. In a separate browser window, switch over to your AWS Management Console. At the top of the page, click your username, then select My Security Credentials from the drop-down menu.

12. Under the Access Keys section, click Create New Access Key. Save both the Secret Access Key and the Access Key ID in a note.

13. Switch back over to Cloud Pak for Data and enter your AWS access credentials into the corresponding fields. Finally, Test your connection, then click Create.
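
If you’d like to double-check the new credentials outside of Cloud Pak for Data, a few lines of Python with the boto3 library can confirm that the key pair can read your bucket. This is an optional sanity check and a sketch only; the bucket name and credential values below are placeholders.

```python
# Optional sanity check: confirm the new access key can read the bucket.
# Placeholders: replace the credentials and bucket name with your own values.
import boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id="YOUR_ACCESS_KEY_ID",
    aws_secret_access_key="YOUR_SECRET_ACCESS_KEY",
)

# List the first few objects in the bucket you entered in step 10.
response = s3.list_objects_v2(Bucket="your-bucket-name", MaxKeys=5)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```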

14. Navigate back to your Assets tab. Notice that your Amazon S3 Connection now appears on the Data Assets list.

15. Now that we’ve created a connection to Amazon S3, we’ll create a Data Asset and import the connected data for our machine learning model. Near the top of the page, click the Add to Project button. Then, select Connected Data.

In this tutorial, we use a concrete compressive strength dataset from the UCI Machine Learning Repository. However, your S3 bucket can contain any CSV file you’d like to build a predictive model with.
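
If you want a quick look at the file before wiring it up, a short pandas sketch like the one below will do. The file name here is a placeholder for whatever CSV you’ve uploaded to your bucket.

```python
# Optional: inspect the CSV locally before adding it as a connected asset.
# "concrete_data.csv" is a placeholder for the file in your S3 bucket.
import pandas as pd

df = pd.read_csv("concrete_data.csv")
print(df.shape)    # number of rows and columns
print(df.dtypes)   # column types as pandas infers them
print(df.head())   # first few records
```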

16. Here you’ll be prompted to select the source of your data asset, in addition to providing it with a name and description. Click Select source.

17. After selecting Amazon S3 as your connection source, all of the files contained in your S3 bucket will be listed. Click the CSV you’ll use to train your AI model, then click Select at the bottom-right corner.

18. Once you’ve selected a data source and provided the asset with a name and description, click Create.

19. Now you’ve added the CSV data asset to your project, but the data still needs to be cleaned and prepared for use in your AI model. Hover over the Actions icon next to your data asset, then click Refine.

Many data scientists spend as much as 80 percent of their time finding, cleaning, and reorganizing data before it is ready for analysis. The Data Refinery tool in Cloud Pak for Data helps streamline and accelerate this process.

20. Let’s quickly explore some of the main features of the Data Refinery tool. By clicking on Steps at the top-right corner, we’re able to see the order of the data transformation operations performed on our dataset.

21. Clicking on the Profile tab at the top-left corner, we can see some basic statistics about our dataset. However, these metrics are limited by the fact that our data columns currently have the type String, rather than Decimal or Integer.

The Profile and Visualization tabs in Data Refinery allow users to quickly perform an exploratory analysis of their dataset, a critical first step in any data science project.

22. Let’s convert our columns to the correct data type. Click to open the Operations menu in the top-left corner. Select the Convert column type option.

23. Select the first column, then select the Decimal data type.

24. Click Select Column again and repeat this type conversion process for the remaining columns. Be sure to correctly apply the Decimal and Integer labels to the corresponding columns. Then, click Apply.
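
For reference, the conversion Data Refinery performs here is roughly equivalent to the pandas sketch below. You don’t need to run this for the tutorial; the column names are placeholders based on the UCI concrete dataset and may differ from yours.

```python
# Illustration only: the rough pandas equivalent of Data Refinery's
# "Convert column type" operation. File and column names are placeholders.
import pandas as pd

df = pd.read_csv("concrete_data.csv", dtype=str)  # every column starts as a string

# Convert the measurement columns to decimals and the Age column to integers.
decimal_cols = ["Cement", "Water", "Superplasticizer", "Concrete compressive strength"]
df[decimal_cols] = df[decimal_cols].apply(pd.to_numeric)
df["Age"] = df["Age"].astype(int)

print(df.dtypes)  # verify the new types, much as in step 25
```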

25. Double check each column to ensure your data types have been converted correctly. Note that your conversion operations now appear under the Steps tab.

26. The final step in the refinery process is to save and run your refinery flow. Click the clock icon, then Save and create a job.

27. Give your refinery flow job a name and description, then click Create and Run.

28. After running for a few moments, the job status will change to Completed.

In just a few steps — without writing a single line of code — you’ve prepared a clean dataset, ready for use in any enterprise analytics project.

29. Return to your project’s Assets page. Note that, in addition to the saved refinery flow, you’ll also see the refined CSV file under your Data assets list.

30. Now that our data is cleaned and prepared for analysis, we’re ready to create our automated AI pipeline. Once again, click the Add to project button, then add an AutoAI experiment.

The AutoAI graphical tool in Cloud Pak for Data automatically analyzes your data and generates candidate model pipelines customized for your predictive modeling problem. These pipelines are created iteratively as AutoAI analyzes your dataset and discovers the data transformations, algorithms, and parameter settings that work best for your problem.

31. Give your AutoAI experiment a name and a description, then click Create.

32. Next, add your refined data source to the AutoAI experiment by clicking Select from project, choosing the shaped CSV file, and clicking Select asset.

33. Now select the target variable to predict with your AI model. In this example, we select the ‘Concrete compressive strength’ column. You also have the option to reconfigure the prediction type and the optimized metric under Experiment settings. Then, click Run experiment.

34. Next, you’ll see the experiment automatically progress through various stages in a typical data science pipeline.

To help simplify AI lifecycle management, AutoAI automates:

· Data preparation

· Model development

· Feature engineering

· Hyper-parameter optimization

Over the course of a few minutes, AutoAI will evaluate a variety of model configurations, ranking each according to your chosen optimization metric.

Bypassing the complexity of manually coding an AI pipeline allows citizen data scientists to quickly get started, and helps expert data scientists speed experimentation time from weeks and months to minutes and hours.
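
For a sense of what “manually coding an AI pipeline” involves, here is a rough scikit-learn sketch of the kind of search AutoAI automates: build a pipeline, try a handful of hyperparameter settings, and rank the candidates by a chosen metric. This is an analogy only, not AutoAI’s actual implementation, and the file and column names are placeholders.

```python
# A hand-rolled analogy for what AutoAI automates: define a pipeline,
# search over hyperparameter settings, and rank the candidate models.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("concrete_data_shaped.csv")             # placeholder file name
X = df.drop(columns=["Concrete compressive strength"])   # placeholder target column
y = df["Concrete compressive strength"]

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", GradientBoostingRegressor()),
])

search = RandomizedSearchCV(
    pipeline,
    param_distributions={
        "model__n_estimators": [100, 200, 400],
        "model__max_depth": [2, 3, 4],
        "model__learning_rate": [0.01, 0.05, 0.1],
    },
    n_iter=10,
    scoring="neg_root_mean_squared_error",
    cv=3,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```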

35. Once your AutoAI experiment has finished running, scroll down to the Pipeline leaderboard to view a comparative analysis of various model configurations.

AutoAI uses a novel approach that enables testing and ranking candidate algorithms against small subsets of the data, gradually increasing the size of the subset for the most promising algorithms to arrive at the best match. This approach saves time without sacrificing performance.

36. Select the best-ranking pipeline on the leaderboard and click Save as model. Give your model a name and description, then click Save.

37. After clicking Save, you’ll see a success notification indicating that your machine learning model now appears on your project’s Assets page. Click View in project to see more details.

38. Here you’ll see details about how your model was generated, including the data types contained in the model’s training set. To make your model available to developers within your organization, click Promote to deployment space.

39. If you haven’t yet created a Deployment Space for your project, you won’t be able to promote the asset. Click Associate Deployment Space.

40. Give your Deployment Space a name and description, then click Associate.

41. Click again on Promote to Deployment Space. After the model is promoted to your project’s deployment space, a success notification will pop up near the top of your dashboard. Navigate to your deployment space via the link in the notification. Alternatively, there is also a link in your project’s Overview page.

42. Upon navigating to the deployment space, you’ll see your AI model asset, staged and ready for deployment. Click the asset’s Actions icon then Deploy.

43. Give your deployment a name and description, then click Create.

44. Wait for your deployment’s status to change from In-Progress to Deployed. Then, click on the deployment name to view more details.

By streamlining the deployment process, Cloud Pak for Data allows developers and business users within your organization to easily leverage data science projects, ensuring AI models quickly move from experiment to production.

45. Your deployment’s API reference tab provides a RESTful API endpoint for developers, as well as example code snippets for interacting with the model in a variety of popular programming languages.

46. Click over to the Test tab, where Cloud Pak for Data provides an interface for testing the API endpoint. Referring to your original CSV data source, pick a set of test values to check your model’s predictions.

In addition to entering test values into the form interface, you can also format your test request as JSON.
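
If you’d rather call the endpoint from code than from the form, a request along these lines should work. The scoring URL comes straight from the API reference tab; the token, field names, and values below are placeholders, and the exact payload format for your deployment is shown in the generated snippets.

```python
# A minimal scoring request, assuming a bearer token and the endpoint URL
# copied from the deployment's API reference tab. Field names and values
# are placeholders; use the generated snippets for your exact payload.
import requests

SCORING_URL = "https://<your-cloud-pak-host>/.../predictions"  # from the API reference tab
TOKEN = "<your-bearer-token>"                                  # platform authorization token

payload = {
    "input_data": [{
        "fields": ["Cement", "Water", "Superplasticizer", "Age"],  # placeholder fields
        "values": [[540.0, 162.0, 2.5, 28]],                       # one test record
    }]
}

response = requests.post(
    SCORING_URL,
    json=payload,
    headers={"Authorization": f"Bearer {TOKEN}"},
)
print(response.json())
```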

Congratulations!

In this lab, you’ve successfully built and deployed an automated AI prediction model using multi-cloud data assets.

As organizations accelerate their digital transformations to better predict and shape future outcomes, empower higher-value work, and automate experiences, the need to embrace AI is growing ever more critical. But to implement AI successfully, companies need to overcome three major challenges: data complexity, skills, and trust. The process starts with good, clean, secure data that is readily deployable to generate insights and set the foundation of an AI-driven business.

IBM Cloud Pak for Data delivers a prescriptive, cost-effective approach to climb the AI ladder.

Single, unified platform.

Speed time to value with a single platform that integrates data management, data governance and analysis for greater efficiency and improved use of resources. Help enable self-service collaboration across teams.

Extensible APIs and ecosystem.

Use best practices to your advantage to accelerate implementations and deliver significant business value. Benefit from built-in models and accelerators for various industries including finance, insurance, healthcare, energy, utilities and more.

Continuous intelligence.

Develop real-time streaming applications and deliver continuous intelligence across your business. With IBM Streams on IBM Cloud Pak for Data, you can enable continuous and rapid analysis of massive volumes of data in motion or at rest. This can help you gain business insights faster and make more-informed decisions.

Accelerate your journey to AI to transform how your business operates with an open, extensible data and AI platform that runs on any cloud.

This tutorial is also available in video format on YouTube.

1:07 — Creating a project

2:12 — Connecting to AWS

4:32 — Cleaning data in Data Refinery

6:31 — Automating the AI pipeline

8:58 — Deploying your model

Enjoy this tutorial? Learn more about IBM Cloud Pak for Data by checking out our collection of Demos, trying out the platform yourself at Cloud Pak Experiences, engaging with us on our Community, or emailing me directly at parker.merritt@ibm.com.

Feel free to connect with me on Twitter and LinkedIn, and be sure to check out a few other tech-focused pieces I’ve written.
