200 Alteryx tools aren’t enough! I need more!

Using Alteryx, how do you extract all of the matches of a regular expression from a field of text? This was a question I came across on the Alteryx Community. For example. if we have the following table, how would we extract all values that start with ‘A’ and end with a number into different records?

+------------+
|ID |Text |
+---+--------+
|1 |A1 A2 A3|
+---+--------+

If we do it right, the output should look something like this:

+----------+
|ID |Match |
+---+------+
|1 |A1 |
|1 |A2 |
|1 |A3 |
+---+------+

I was very excited to answer this question. I have had to solve this problem in the past, before I started using Alteryx, and it was somewhat challenging. Alteryx does not have this capability built into a core tool (that I know of), so it was also an opportunity to peer into the code friendly side of the product. The code-free aspect of Alteryx is compelling, but I have a huge soft spot in my heart for code; writing it is so cathartic. Even better, Alteryx provides a Python SDK. I have been meaning, for a long time, to do something in Python. What better way to spend my weekend free time than build a custom regex tool for Alteryx using Python?

Of course, it wasn’t until after I finished that I realized I mis-read the question. The author wanted to know how to extract values into separate columns, something Alteryx can already do quite easily today. But no matter, even if poor reading comprehension was ultimately the motivator, doing this little project was a worthwhile pursuit. What follows is my experience building my first custom Alteryx tool using the Python SDK.

Setting up the development environment

What’s the difference between an OK language and a great language? The quality of the tools. I have tried out over a dozen different languages and have built functioning solutions in VB.Net, C#, and Dart. I now refuse to try a language unless it has a fully-functioning IDE. Text editors are not acceptable in the 21st century. So the first step of my project was finding a proper IDE. Fortunately, JetBrains provides a product called PyCharm. I love IntelliJ, another JetBrains product, and was very excited to see they created a dedicated solution for Python. If it is a JetBrains product, I know it will be good. Download, install, and get over the fact that I have to use a Windows machine rather than my beloved MacBook Pro (I’m looking at you, Alteryx; let me work on my Mac).

Once installed, I created a new project in PyCharm called RegexExtract. Alteryx is installed on my machine, so I should be able to point to its embedded Python interpreter:

Ok…time to create a new file? RegexExtract.py sounds lovely. I will start by importing the Alteryx SDK:

import AlteryxPythonSDK as Sdk

And look at that, I already have an error:

80% of software development is setting up your environment…

PyCharm cannot find the SDK. I guess that makes sense; even though I am using Alteryx’s embedded Python interpreter, it probably doesn’t know about the SDK. Maybe it’s something I can install with pip?…

Nope, it’s not a pip module.

Let’s go to the help documents. In the Python Installation section there is something about running these commands:

import os, sys
os.environ['PATH'] = r'C:\program files\Alteryx\bin;' + os.environ['PATH']
sys.path.insert(1, r'C:\program files\Alteryx\bin\plugins')
import AlteryxPythonSDK

As you know, reading comprehension is apparently not my strength, but I guess we are adding the SDK to the…a…some path? I think it would help if I understood Python’s package mechanism better. But I don’t, so…

Either way, I don’t want to run these commands every time I run PyCharm…there must be a way to ‘automate’ this so that the project will always look for the Alteryx SDK in the Alteryx installation directory. After a bit of looking around, I found this screen located at File->Settings.

I clicked the settings button (the gear) in the upper-right corner and selected ‘Show All…’. I selected the interpreter and clicked the ‘Interpreter Paths’ button:

I added the bin\plugins folder from my Alteryx installation directory and…

…no error! W00t! PyCharm now knows about the SDK and can provide auto-complete and help documentation for the SDK classes. This was a bit more work than expected, but the excitement is building! Also, future projects will be easy to get running now.

The code

What exactly does the core logic look like? I am not doing rocket science here; it shouldn’t be too complicated. Python even has regular expressions built into a core library (re), so I don’t need to worry about virtual environments or pip’ing packages. After about 3 minutes of Google-ing I came up with the following:

Well…that’s a bit underwhelming

The code above will search through text and iterate through all matches of ‘A’ followed by a number (represented by the pattern A\d ). Of course, we will want to parameterize the pattern and hook the text value into an Alteryx field, but that’s just wiring. The core logic here is very simple.

So, so much wiring

Ok, if the core logic is so simple, I am going to break my rule of always coding via Test-Driven Development (TDD). I am also not going to worry about creating a separate module to house the core logic and its tests; I will embed it into the SDK-specific code. If I was doing anything even a bit more complicated I would put a lot more thought into architecture, and still might if future requirements demand it. But this is stupid simple, so I can focus on understanding the nuances of the SDK.

Let’s get the GUI file out of the way, first. We want the GUI to look something like this:

We should be able to select a field containing text we want to parse, provide a regex pattern to describe what values we want, and name the output field whatever name we desire. The HTML which creates the layout looks like this:

I am not going to pretend to understand everything here, but I will give the highlights of what I think I know. I started with the SDK sample tools and cobbled this together. The important parts of the file are the <ayx> tags. They define the user input elements which ultimately send the user’s selections to the python script. I use two different kinds: a drop-down box and two text boxes (defined by data-ui-props:"{type: 'DropDown'}" and data-ui-props:"{type: 'TextBox'}" ). The data-ui-props property seems to define the visual representation and behaviors of the control. The next property, data-item-props, controls the flow of data into and out of the control. The dataName sub-property becomes the name of the XML element sent to our tool with the user’s input value. Our Python code will parse XML elements using these names to obtain the entered parameters. For the text boxes, this sub-property is all we need.

The drop down requires a bit more explanation. The selected value will be passed in an element called ‘DataField’, per the dataName sub-property. The FieldSelector dataType tells Alteryx to automatically populate the drop-down list with the field names from the incoming data stream defined by the anchorIndex and connectionIndex values. I only ever have 1 incoming data stream in this tool, so index values of 0 are all I need. The fieldType sub-property tells Alteryx to only display certain types of fields. In this tool, I only want text fields to be selected, which is accomplished by providing a value of ‘String’.

This is all I need for the GUI part of my tool. The HTML SDK provides for a lot more flexibility and power, but I neither want it for this tool nor do I have any desire to spend lots of time getting better at web development technologies. I am quite happy that I can keep the HTML simple and focus on the Python code.

Speaking of which, let’s get started! I first imported the necessary modules and declared the plugin class…all pretty standard Python stuff. In addition to the SDK and the re package, I will use an XML package to parse the XML that gets sent to the tool by the Alteryx engine.

import AlteryxPythonSDK as Sdk
import xml.etree.ElementTree as Et
import re

class AyxPlugin:

The first step to creating a class is to define the constructor. Python has strange syntax for doing this, but you get used to it…sort of. I also added a helper function that I stole from the sample tools which simplifies the process to pass error messages back to the Alteryx engine.

The constructor’s signature is dictated by the Python SDK. When the Alteryx engine creates this object, it will pass along the tool ID, the engine object, and an output anchor manager to the tool. The first part of the constructor simply stores this information for later use. The second part of the constructor initializes some custom properties, also for later use. You will notice that I provide explicit types for these items. My biggest complaint, and roadblock, with Python was its dynamic typing. I am very happy that the creators of Python have provided a way for people like me to type-ify our code. This lets PyCharm help me in a big way with code completion and helps me prevent silly errors like calling a method that does not exist. I also no longer have to remember what methods and properties my objects expose; the IDE can remember for me. Here is a quick description of the variables I am initializing:

  • self.DataFieldName: This will store the name of the field the user wants to parse.
  • self.Pattern: This will store the pattern the user wants to use to identify matches.
  • self.MatchFieldName: This will store the name of the output field containing the matched values.
  • self.input: This will store an object to handle the incoming connection. I have not defined this object yet; we will do that later.
  • self.MatchOutput: This will store the output anchor that will send matched values to downstream tools.
  • self.NoMatchOutput: This will store the output anchor that will send records with no matching values to downstream tools.

Next up is the Alteryx initializer function: pi_init is called when the user changes the tool’s configuration in Designer. The engine capture all of the current values from the GUI, packages them into an XML string, and calls this function with the updated XML.

All I am doing here is parsing the XML to extract the user’s selections. The XML package I imported earlier makes this very easy to do. Once parsed, I run a few validation checks and define the output anchors. This entire function is really about getting the tool ready to handle data based on the user’s input.

The rest of the functions in AyxPlugin are very short:

  • pi_add_incoming_connection: The engine passes some metadata about the incoming connection and I have to return an object that satisfies the Incoming Interface contract. I will create that soon. I do not need to do anything special with the metadata (i.e. I don’t have multiple incoming connections), so this function is very simple.
  • pi_add_outgoing_connection: My tool does not need to do anything special with the outgoing connections, so I just acknowledge they were created by returning True.
  • pi_push_all_records: This function gets called when the tool does not have an incoming connection. This tool must have incoming data, so I send out an error message and return false. This function would be useful for tools that grab or generate data; it is not useful for tools that process incoming data.
  • pi_close: I do not have any special error handling requirements, so I just use this function to make sure the output connections have properly closed.

You may notice that this class hasn’t even touched the regex logic. The entire purpose of this class is to handle communications with the Alteryx engine and get itself configured for use. The next class, which satisfies the Incoming Interface contract, will do the actual data processing. Let’s start with the constructor:

Again, we are storing the provided parent object and initializing some properties we will use later. How did I know which properties I would need? I developed the code and added them as needed.

The next function, ii_init, is a bit more interesting. Here is where I have to define the outgoing record definition so the engine can pass the record metadata to downstream tools:

I start by making sure the user actually provided a field to parse; if not, I stop and send an error message. Otherwise, I grab the field object from the incoming metadata and save its field type and size, which I will need right away. In the next two lines, 14 and 17, I create a copy of the incoming record definition and add a new field that will contain the matched values. I use the same field type and size as the field the user wants to parse. None of the matches will be longer than that field, so by using its length we can minimize size requirements while ensuring data does not get truncated.

Next, in lines 20 and 21, I define the record metadata for the output anchors. The anchor with matches will contain the record definition I just created while the output anchor containing records without matches will have the same definition as the incoming record. Visually, it looks something like this:

Lines 24 and 27 set up objects to copy incoming data records and create matching output records from them. The little for loop in lines 30 to 32 maps the incoming data records to the outgoing matches. Because all but the last field in the match output is a duplicate of the incoming data, the field indexes map one-to-one. If you wanted to re-order your fields, this is where you could define how they get reordered.

The remainder of the function finalizes the copier and returns true, letting the engine know that the object is ready.

The next function I need is ii_push_record:

Wait! What is that on lines 14 and 15? Is that my core logic? I believe it is; we have finally gotten to the point where we are processing the data! These plugins require a lot of wiring…

But let’s back up. I start the function by immediately resetting the record creator and inserting a copy of the incoming record, in_record. Line 8 pulls out the data from the field we want to parse. This line took me a long time to figure out. My first inclination was to try and extract the field using a method on in_record, something like:

in_record.get_field(self.parent.DataField)

But as I quickly found out, in_record does not contain any methods. This, of course, is all spelled out in the SDK help docs. I really should read them closer…

in_record is a RecordRef object, so it only contains the underlying data. If you want to extract data out of it, you need to do it via a Field object. Field objects know the start position, length, data type, etc of their data in RecordRef. Fortunately, in ii_init I saved a Field object that references the data field we need to parse. In line 8 I use that object to extract the value from in_record. Line 12 performs a null check on the data I just extracted. Only if the value is not None/Null do I try to parse the text.

Lines 14 through 19 iterate through all of the matches, update the MatchField value, and send it, along with the original data, through the Match output anchor. We keep track of how many matches we found via line 19. Later, in line 23, we send the original record down the No Matches output anchor if we did not find any match.

The remainder of Incoming Interface comprises 2 functions for reporting progress and closing the tool when it is finished. There is not much I do here, so I will not spend time explaining it:

And that’s it! The entire tool comprises of ~146 lines of Python, all but 2 or 3 of which exists to wire up our logic to the Alteryx engine. I am sure there are some opportunities here to abstract a lot of this code into a handful of classes which I can reuse in all my (future) custom tools, but that is a task for another day. The entirety of the Python code looks like this:

To finish out the custom tool, I need 2 more files. The first is a config file called RegexExtractConfig.xml, which looks like this:

The important section is at the top, which points to the other files used by the tool and defines the input and output anchors. The names used by the anchors must match the names used when the Python plugin class retrieves output anchors in pi_init like this:

self.MatchOutput = self.output_anchor_mgr.get_output_anchor('Matches')
self.NoMatchOutput = self.output_anchor_mgr.get_output_anchor('NoMatches')

The rest of the configuration file is straight-forward and I will not spend any more time on it.

The last file is the beautiful logo, where you can see the entirety of my graphic design skills proudly on display:

I am sure you can agree that it is the best thing you have ever seen.

And that’s it! I can package the tool as a .yxi file or install it directly and it will be available in Designer. Here it is in action:

The source code is available from my GitHub page. You can also download the installer and sample workflow from the releases page. Enjoy, and happy coding!