AI and ML using Workflows — Is this the future?

Yash Gupta
Data Science Simplified
6 min read · Nov 17, 2022

Have you ever imagined an automated data workflow that can be replicated across departments and that brings your data pre-processing, your analysis stages, and your final model together in one place?

It's tough to complete a data science lifecycle in a single tool, considering the different structures and sizes that data comes in today. In this article, we’ll go over how workflow software can streamline your entire data lifecycle, saving time and effort while you use your data optimally.

Can everyone learn to code? No.
Can all your data team members be skilled at everything? No.
Can you, as the data leader in your organization, do everything? Absolutely not.
Can you sit with your systems and ensure that everything is running as required at all times? No.
But can all the aforementioned problems be solved using workflow systems? Yes.

In an ever-evolving world, it’s important that new work builds on what has already been developed, while saving time and reducing manual effort and intervention wherever possible (at least in the data sphere). That’s where workflow systems enter the data world.

What are we covering in this article?

  1. What exactly are workflows?
  2. Examples of Data Workflows
  3. What’s in it for anyone learning Data Science?
  4. When does your organization need Dataiku or Alteryx?
  5. Pros and cons of the workflow systems

Tools in this article: Dataiku and Alteryx.

What exactly are workflows?

If I had to put it in a nutshell — Think assembly lines but for data lifecycles.

From pulling data from the source, to applying data manipulation techniques, filtering, cleaning, sorting, and aggregating it, and exporting it in different formats to suit your needs, it’s all possible with workflows.
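To make the assembly-line idea concrete, here is a minimal sketch in plain Python of the stages described above: load, clean, filter, aggregate, and export. It is not how Dataiku or Alteryx work internally; the sample data, the function names, and the `run_workflow` helper are all made up for illustration.

```python
import csv
import io
import json
from collections import defaultdict

# Hypothetical source data; the second row has a missing amount.
RAW_CSV = """region,amount
north,120
south,
north,80
east,45
south,200
"""

def load(text):
    """Source step: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def clean(rows):
    """Cleaning step: drop rows with a missing amount and cast to float."""
    return [
        {"region": r["region"], "amount": float(r["amount"])}
        for r in rows
        if r["amount"].strip()
    ]

def keep_large(rows, minimum=50.0):
    """Filter step: keep rows at or above a minimum amount."""
    return [r for r in rows if r["amount"] >= minimum]

def aggregate(rows):
    """Aggregation step: total amount per region, sorted by region name."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["region"]] += r["amount"]
    return dict(sorted(totals.items()))

def export(totals):
    """Export step: serialize the result as JSON."""
    return json.dumps(totals)

def run_workflow(source, steps):
    """Pass the data through each step in order, like an assembly line."""
    data = source
    for step in steps:
        data = step(data)
    return data

result = run_workflow(RAW_CSV, [load, clean, keep_large, aggregate, export])
print(result)
```

Each function plays the role of one box in a visual workflow, and the list passed to `run_workflow` plays the role of the arrows connecting them.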

That’s not where the story ends, though. Take the same idea and extend it to ML and AI once you’re equipped with analysis-ready data.

With workflow systems like Alteryx and Dataiku, you can build ML and AI pipelines within your data lifecycle and customize them with your own chunks of Python or R code!
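As a rough illustration of what a custom code chunk feeding a model step might look like, the sketch below chains a data-prep function into a tiny hand-written least-squares fit. It does not use the Dataiku or Alteryx APIs; the function names and the sample numbers are assumptions chosen only to show the shape of the idea.

```python
def prepare(rows):
    """Data-prep step: keep only complete (x, y) pairs."""
    return [(x, y) for x, y in rows if x is not None and y is not None]

def fit_line(pairs):
    """Custom code step: ordinary least squares for y = slope * x + intercept."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict(model, x):
    """Scoring step: apply the fitted line to a new input."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical raw data with two incomplete records.
raw = [(1, 3), (2, 5), (None, 4), (3, 7), (4, None)]
model = fit_line(prepare(raw))
print(predict(model, 5))
```

In a real workflow tool the model step would usually be a built-in component or a library call; the point here is only that prepared data flows straight into a piece of code you control.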

Examples:

Example 1: Dataiku Workflow

Example 2: Alteryx Workflow

If you look closely at the pictures above, you get a glimpse of what a workflow can look like. Though this article is not a tutorial on how to build them, a couple of things stand out right away: these workflows are made of visual elements, and anyone who does not know how to code can still use the built-in aggregations and processes, such as:

  • SQL Joins
  • Pivoting
  • A/B Testing

and many more.
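For readers curious about what a visual Join or Pivot step replaces, here is a small sketch of the equivalent hand-written version using Python's built-in `sqlite3` module. The table names, columns, and rows are invented for the example.

```python
import sqlite3

# In-memory database with two hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (customer_id INTEGER, quarter TEXT, amount REAL);
INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben');
INSERT INTO orders VALUES (1, 'Q1', 100), (1, 'Q2', 150), (2, 'Q1', 70);
""")

# Join step: attach each order to its customer's name.
joined = conn.execute("""
SELECT c.name, o.quarter, o.amount
FROM customers AS c
INNER JOIN orders AS o ON o.customer_id = c.id
ORDER BY c.name, o.quarter
""").fetchall()

# Pivot step: one entry per customer, one key per quarter.
pivot = {}
for name, quarter, amount in joined:
    pivot.setdefault(name, {})[quarter] = amount

print(pivot)
conn.close()
```

In a workflow tool, both steps would typically be configured by picking the key columns and the pivot column from drop-downs instead of typing the query.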

What’s in it for anyone learning Data Science?

From my personal experience of completing over 35 online courses in Data Science, I know there’s a lot to learn and a lot more to do, but Data Science tools don’t get better than this. These workflow systems are designed so that everything you need for your data is in one place.

I have personally used both Dataiku and Alteryx, and I think the capabilities that these tools offer are immense and can help your organization in unimaginable ways.

Imagine having to understand how a join works and then learning how to write it in SQL while keeping intact all the conditions that the join depends on. There’s plenty of scope to make a mistake when typing it out, and these tools make such mistakes much harder to make.

And what if you don’t know SQL as well as the job requires? Maybe you’re a pro data scientist who mostly uses Tableau and Python. How do you bridge this gap in your SQL knowledge?

This is where a tool like Dataiku (to take one example) saves you. The platform gives visual cues for joins so that you understand exactly how the join will be performed, and you can manipulate it like any other GUI that works on data, without needing to understand the back-end processing.
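The kind of information a visual join preview gives you can be approximated in a few lines of Python. The sketch below is not Dataiku's implementation; `join_summary` and the sample keys are hypothetical, and it only reports which keys would match and which would be left out on either side.

```python
def join_summary(left_keys, right_keys):
    """Report which join keys match and which appear on only one side."""
    left, right = set(left_keys), set(right_keys)
    return {
        "matched": sorted(left & right),
        "left_only": sorted(left - right),
        "right_only": sorted(right - left),
    }

# Hypothetical key columns from a customers table and an orders table.
customer_ids = [1, 2, 3]
order_customer_ids = [1, 1, 2, 4]

summary = join_summary(customer_ids, order_customer_ids)
print(summary)
```

Here customer 3 has no orders and order key 4 has no customer, which is exactly the sort of thing that is easy to miss when a join is typed by hand and easy to spot when it is shown visually.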

Let’s break down the pros and cons of the workflow systems:

Advantages:

  • Can use ML and AI in data workflows
  • Customize Python/R code and SQL queries in any step as required
  • Can do most of what other Data Science tools can (it really is all in one)
  • Built-in visualizations / Dashboards (in Dataiku, though not as amazing as Tableau)
  • Can work with Big Data (Spark, Hadoop, etc.)
  • Can automate processes to happen on a regular basis
  • Workflows can be replicated to work on different datasets with the same steps
  • High standards of governance
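The replication point in the list above is easy to picture in code: define the steps once, then run them on any dataset with the same shape. This is a minimal sketch with made-up step names and data, not a feature of any particular tool.

```python
from functools import reduce

def compose(*steps):
    """Build a reusable workflow that applies the steps left to right."""
    def workflow(data):
        return reduce(lambda acc, step: step(acc), steps, data)
    return workflow

def drop_missing(values):
    """Cleaning step: remove missing entries."""
    return [v for v in values if v is not None]

def total(values):
    """Aggregation step: sum what remains."""
    return sum(values)

# Define the workflow once...
sales_total = compose(drop_missing, total)

# ...and replicate it across hypothetical regional datasets.
datasets = {"emea": [10, None, 5], "apac": [7, 3, None, None]}
results = {name: sales_total(values) for name, values in datasets.items()}
print(results)
```

Scheduling the same call to run on a regular basis is then a matter for whatever scheduler the platform or your infrastructure provides.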

Cons:

  • Heavy on the cost
  • Needs a team good at all the aforementioned skills to use it at its full capability
  • Needs QC at multiple checkpoints
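The QC point above can be sketched as a check attached to each step, so that a problem is caught where it first appears instead of at the end. The `run_with_checks` helper, the steps, and the checks are all hypothetical.

```python
def run_with_checks(data, stages):
    """Run (step, check) pairs in order; stop at the first failed check."""
    for step, check in stages:
        data = step(data)
        if not check(data):
            raise ValueError(f"QC failed after step '{step.__name__}'")
    return data

def parse_amounts(raw):
    """Step 1: convert raw strings to floats."""
    return [float(x) for x in raw]

def remove_negatives(amounts):
    """Step 2: drop negative amounts."""
    return [a for a in amounts if a >= 0]

stages = [
    # Checkpoint 1: parsing must produce at least one row.
    (parse_amounts, lambda rows: len(rows) > 0),
    # Checkpoint 2: no negative values may survive the filter.
    (remove_negatives, lambda rows: all(a >= 0 for a in rows)),
]

clean = run_with_checks(["10", "-3", "4.5"], stages)
print(clean)
```

Passing an empty input to `run_with_checks` with these stages raises a `ValueError` at the first checkpoint, which is the behaviour you want from a QC gate.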

Get to work with either of the two tools, or any other workflow system, and you’ll see how they make your life easier when it comes to working with data. Real-time analytics and working on massive datasets that span hundreds of different processes and lifecycles at the same time, across different projects, were probably impossible 10 years ago.

To stay relevant and ahead of the curve in data science, and to make sure you don’t miss out on tools that can help you derive the most value from your data, keep up with developments like this and learn them as soon as possible!

Learn them here:

Try them out and let me know what you think about workflow systems in the comments!

P.S. Big thanks to the developers of the workflow systems (especially Dataiku and Alteryx) on behalf of all the data enthusiasts of the world. You guys make our work simpler and more awesome!

Let me know in the comments below if you have any other pointers or charts that everyone should look into. Leave a clap and follow to stay in touch with any new articles and to support the blog!

For more such articles, stay tuned with us as we chart out paths on understanding data and coding and demystify other concepts related to Data Science. Please leave a review down in the comments.

Check out my other articles at:

Do connect with me on LinkedIn at — Yash Gupta — if you want to discuss it further! Leave a clap and comment below to support the blog! Follow for more.


Lead Analyst at Lognormal Analytics and self-taught Data Scientist! Connect with me at - https://www.linkedin.com/in/yash-gupta-dss