DATA STORIES | CHEMINFORMATICS | KNIME ANALYTICS PLATFORM

From KNIME Nodes to Musical Notes

My Data Guest — An Interview with Stephen Roughley

Rosaria Silipo
Low Code for Data Science
10 min read · Dec 18, 2023

In this episode of My Data Guest, we had the pleasure of interviewing Stephen Roughley, a distinguished professional in the field of medicinal chemistry and cheminformatics. As a Principal Scientist at Vernalis in Cambridge (UK), Stephen is not only a seasoned KNIME enthusiast but also the driving force behind the widely acclaimed Vernalis extension.

In our insightful conversation, we delved into Stephen’s multifaceted role as a KNIME user, tech support expert, trumpeter, and more. Throughout the interview, Stephen provides valuable insights into the world of medicinal chemistry and cheminformatics, shares his tips for developing KNIME extensions, and talks about the advantages of no-code, low-code software like KNIME.

Rosaria: Please briefly introduce yourself, your professional self. What does a data scientist for medicinal chemistry and cheminformatics do? And what are medicinal chemistry and cheminformatics?

Stephen: Let me start with the definitions.

Medicinal chemistry identifies new molecules suitable for developing pharmaceutical drugs. After a drug is developed, different tests are conducted on it, such as biological or physical screenings. The data coming from those tests is used to inform the drug development process.

Cheminformatics studies the computer representations of molecules. Our work focuses on how to get the chemical structures that, as chemists, we scribble on whiteboards, into a machine-readable format that allows us to store the molecules’ structures and calculate lots of different properties. For example, when we do a reaction in the lab, we take some chemicals and transform them into something else. Cheminformatics is able to mirror that exact process computationally, making the development process more time efficient.

A data scientist in medicinal chemistry and cheminformatics deals with varied types of data. Some of it is very simple data like numbers, coming from the testing on already developed compounds, and some of it is really quite complicated data like 3D structures of huge proteins.
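To make the idea of a machine-readable molecule concrete, here is a minimal Python sketch that turns a simple molecular formula into atom counts and then calculates one property, the molecular weight. It is only an illustration and not how KNIME or the Vernalis nodes work: the function names `parse_formula` and `molecular_weight` are hypothetical, the table covers just a handful of elements, and real cheminformatics toolkits work with full structures rather than bare formulas.

```python
import re

# Approximate standard atomic weights in g/mol for a few common elements.
# Assumption: this small table is enough for the illustration.
ATOMIC_WEIGHTS = {
    "H": 1.008,
    "C": 12.011,
    "N": 14.007,
    "O": 15.999,
    "S": 32.06,
    "Cl": 35.45,
}


def parse_formula(formula: str) -> dict[str, int]:
    """Count atoms in a simple formula such as 'C9H8O4' (no brackets or charges)."""
    if not re.fullmatch(r"(?:[A-Z][a-z]?\d*)+", formula):
        raise ValueError(f"Unsupported formula: {formula!r}")
    counts: dict[str, int] = {}
    for symbol, digits in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        # A missing number after the element symbol means one atom.
        counts[symbol] = counts.get(symbol, 0) + (int(digits) if digits else 1)
    return counts


def molecular_weight(formula: str) -> float:
    """Sum the atomic weights of all atoms in the formula."""
    total = 0.0
    for symbol, count in parse_formula(formula).items():
        if symbol not in ATOMIC_WEIGHTS:
            raise ValueError(f"No atomic weight stored for element {symbol!r}")
        total += ATOMIC_WEIGHTS[symbol] * count
    return total
```

Called with the formula of aspirin, `molecular_weight("C9H8O4")` comes out at roughly 180.16 g/mol. Once a molecule is stored in a form like this, the same calculation can be repeated for thousands of compounds without anyone reaching for a calculator.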

Rosaria: Now you are the Principal Scientist for Medicinal Chemistry & Cheminformatics at Vernalis. What does your role entail? And what services does Vernalis offer?

Stephen: As principal scientist, I provide information and support for different projects across Vernalis, a biotechnology company that primarily focuses on early-stage drug discovery. For example, we might investigate how to stop two proteins from interacting with each other in a person. We have also developed quite a few technologies, some of which are community contributions for KNIME. If you are a pharmaceutical company and you have got a tricky target, or you want to get a new target going, come and talk to us.

Rosaria: You are a KNIME user, a tech supporter, an extension developer, a trumpet player, and whatnot. What do you think is your main role among all of those? What would you like to be remembered for?

Stephen: I guess my main role is a KNIME user. I spend quite a lot of time building workflows to solve people’s problems in the company or even just to do some actual investigation of data myself. Furthermore, at Vernalis, we also use KNIME’s commercial product to automate workflow execution and processes.

But you’re right, I’m also a developer of KNIME nodes. I started writing nodes in Java as I needed to do things that I could not do with the existing nodes. But that’s not the end. Once you develop extensions, you have to support them, and help people who find bugs, so that is what I’m also dedicated to right now.

All in all, what would I like the most to be remembered for? Probably as a trumpet player.

Rosaria: Let’s talk about that part of your work everybody is interested in: the Vernalis extension for KNIME Analytics Platform. What is the Vernalis extension, what tasks does it implement, and how can I access it?

Stephen: The Vernalis extension was developed to provide nodes that handle cheminformatics file types or data locally in KNIME. Through the process, other nodes that provide general functionality within KNIME were also added to the extension, thanks to more people who joined the community contribution. For example, we have a set of nodes for manipulating collection columns and nodes for various loops. In terms of how you can get it, you can download it for free from the KNIME Community Hub.

Rosaria: When did the work for the Vernalis extension start? What was the first node? Who wrote the first line of code?

Stephen: The first node was our PDB Connector Query Builder node and it was written by Dave Morley, who used to work with me when we were at RiboTargets (one of the companies that went on to become Vernalis following a series of mergers) back in the early 2000s. By 2011, the extension was ready but we decided to use it a bit ourselves and make sure it worked and was robust before releasing it in 2012 on our website. In June 2013, we relaunched it on the KNIME Community Hub, and that coincided with KNIME User Day UK in London where I gave my first KNIME talk.

Rosaria: How big is the Vernalis extension now? How many nodes does it contain? We already know that it is the most downloaded KNIME extension.

Stephen: I built a KNIME workflow yesterday to figure this out and I was surprised to see that the number was 242, plus another 46 nodes that have now been deprecated.

Rosaria: Your preferred Vernalis extension node?

Stephen: There are many nodes I really like, but I will go with the Multiport Loop End node that marks the end of a loop and collects intermediate results by row-wise concatenation. We had a three-port loop end and a four-port loop end, and we had another one with optional input ports that was up to six ports. One day, we suddenly needed seven ports, and I thought it was silly to keep writing nodes with a set number of ports so we just went with the Multiport Loop End node. One of the other features that perhaps people haven’t noticed is that you can actually get a preview of the last 50 rows during the loop execution, so you can see what’s happening in the loop and you don’t have to wait until your loop ends.
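The behavior described here, row-wise concatenation across any number of ports plus a rolling preview of the most recent rows, can be sketched in a few lines of Python. This is a rough illustration under stated assumptions and not the node's actual Java implementation: the class name `MultiportCollector` and its methods are hypothetical, and rows are modeled as plain dictionaries.

```python
from collections import deque


class MultiportCollector:
    """Collects the output tables of each loop iteration, one list per port."""

    def __init__(self, n_ports: int, preview_size: int = 50):
        # One growing table per port; each table is a list of row dictionaries.
        self.tables = [[] for _ in range(n_ports)]
        # A bounded deque per port keeps only the most recent rows for previewing.
        self._previews = [deque(maxlen=preview_size) for _ in range(n_ports)]

    def add_iteration(self, outputs):
        """Append the rows of one iteration; `outputs` holds one table per port."""
        if len(outputs) != len(self.tables):
            raise ValueError(
                f"Expected {len(self.tables)} tables, got {len(outputs)}"
            )
        for port, rows in enumerate(outputs):
            self.tables[port].extend(rows)
            self._previews[port].extend(rows)

    def preview(self, port: int):
        """Return the latest rows collected so far for one port."""
        return list(self._previews[port])
```

Because the number of ports is a constructor argument rather than part of the class, going from six ports to seven is just `MultiportCollector(7)`, and `preview(port)` can be called at any point while the loop is still running.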

Rosaria: When and why did you start using KNIME?

Stephen: I first started using KNIME back in 2010 while I was writing a paper, which became more successful than I had anticipated. This paper challenged the idea that as chemists we were only doing two or three types of reactions. We had data coming from many different papers and that meant handling very long Excel spreadsheets, so James Davidson suggested trying KNIME out. I had tried a similar tool before but I did not enjoy it, while with KNIME I was able to build a simple workflow very rapidly, adding many tables together. We then decided to adopt it at Vernalis.

Rosaria: How did KNIME help you with your work? Did it make it faster, more accurate, more agile, or what else?

Stephen: I was involved in a technology project of analysis nearly 10 years ago, which was about speeding up the process of going from the compound to the biological screening results by chopping out all the purification steps in the middle.

We did some testing with a small set of compounds and then we decided to see if we could industrialize it. A lot of biological screening is done in little plates with an 8 by 12 array of tubes, so 96 tubes in total. This means starting with a stack of 96 A4 printouts and going through them manually, trying to work out what product you should have, what its molecular weight is, and whether the molecule showed up in the trace. Going through that manually makes it an error-prone process, so we decided to give KNIME a go.

I started by just doing the cheminformatics task, generating the list of products from the list of reactants from our database. The instrument produces a text file which has hundreds of lines for each sample. So I built a KNIME workflow, and the first time I did the work it had a couple of huge nested loops with about 60 nodes in each that took hours to run. I reduced it to a couple of Java snippets, and that was considerably faster. Eventually, I converted everything into a few nodes. I set that workflow running on the server and it would poll the instrument every five minutes for about two weeks until it got to the point where it found a sample for everything in my library.

Once it found all the samples, I checked manually if it had missed something, but I soon realized that if KNIME didn’t find it, it meant that it wasn’t there. The process finally came down to a workflow that takes only 10 minutes to run. It was also far more accurate than I ever was because if you start typing in 100 entries into a table, you’re going to make mistakes. So it came down to accuracy, speed, and reliability because it saved me a lot of time.
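The polling pattern Stephen describes, checking the instrument every five minutes until every library entry has a sample, might look something like the Python sketch below. It is an illustration only and not the actual workflow: `poll_until_complete`, `fetch_found_ids`, and `max_polls` are hypothetical names, and how the instrument's text files are read and matched to samples is deliberately left to the caller.

```python
import time
from typing import Callable, Iterable, Optional


def poll_until_complete(
    expected_ids: Iterable[str],
    fetch_found_ids: Callable[[], Iterable[str]],
    interval_s: float = 300.0,  # five minutes between polls
    sleep: Callable[[float], None] = time.sleep,
    max_polls: Optional[int] = None,
) -> set[str]:
    """Poll until every expected sample has been seen; return the IDs still missing."""
    missing = set(expected_ids)
    polls = 0
    while missing:
        # Ask the (hypothetical) instrument reader which samples it has produced.
        missing -= set(fetch_found_ids())
        polls += 1
        if not missing:
            break
        if max_polls is not None and polls >= max_polls:
            break
        sleep(interval_s)
    return missing
```

An empty return value means a sample was found for everything in the library. Passing `sleep` as a parameter is a design choice that lets the loop be tried out without actually waiting five minutes per poll.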

Rosaria: Will you name for us the KNIME feature or node you could not do without?

Stephen: It’s definitely more than one. I believe the most useful ones are the nodes that allow you to completely reshape the data you are working with. I’m talking about the GroupBy, Pivot and Column Aggregator nodes. There are also the “opposite nodes” to these that are called Ungroup, Split Collection Column and Unpivot.
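For readers who have not used these nodes, the following Python sketch shows the general idea of grouping rows into a collection and then ungrouping them again. It only loosely mirrors what GroupBy (with a list aggregation) and Ungroup do and is not KNIME's implementation: the functions `group_by` and `ungroup` and the example column names are hypothetical.

```python
from collections import defaultdict


def group_by(rows, key, value):
    """Collect the `value` entries of all rows that share the same `key` into a list."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    # One output row per group, holding a list (a "collection") of values.
    return [{key: group, value: items} for group, items in groups.items()]


def ungroup(rows, key, value):
    """Reverse of group_by: emit one row per item in each row's collection."""
    return [
        {key: row[key], value: item}
        for row in rows
        for item in row[value]
    ]


# Example data with hypothetical column names.
rows = [
    {"plate": "P1", "well": "A1"},
    {"plate": "P2", "well": "A1"},
    {"plate": "P1", "well": "A2"},
]
grouped = group_by(rows, "plate", "well")
flattened = ungroup(grouped, "plate", "well")
```

Here `grouped` holds one row per plate with a list of wells, and `flattened` contains the same rows as the input, although now ordered by group rather than in the original row order.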

Rosaria: You are very active on the KNIME Forum. What is your username on the KNIME Forum and which kind of questions do you usually answer? Do you have some kind of specialty?

Stephen: So, I actually have two usernames. The personal one is @s.roughley and the company one is @Vernalis. I use the Vernalis account to answer questions about the extensions, while I use the personal account for more general things.

I used to answer a lot of general questions but now there are quite a lot of users who are perhaps far better at answering than me. Now, I tend to stick to the cheminformatics questions.

Rosaria: What is the most difficult question you had to answer?

Stephen: It’s a question I’ve been asked three times, and it’s my own fault. One of our nodes can be used for Matched Molecular Pair Analysis, and there is an option in the node configuration, which is called “allow self-transforms.” I’ve been asked the same question on the forum, in person, and by email by different people: “Could you give a real-world example of what that option does?” Every time, I find that I still can’t give an answer to that question.

Rosaria: This is a bit of a special interview, since we talk more about developing an extension than actually using KNIME. So, on this line, let’s spend a few words of advice for the ones out there who would like to write an extension for KNIME. How is the Vernalis extension written?

Stephen: The Vernalis extension is written entirely in Java. I have very limited Python experience and I actually find it difficult to type Python because I find myself putting semicolons and curly brackets everywhere.

Rosaria: How easy is it to write an extension for KNIME?

Stephen: Writing a KNIME extension is not hard but it helps if you’re familiar with Java. I wasn’t when I first tried, so I had a lot of light bulb moments later, realizing, “Oh, that’s why that works.” Being familiar with the Eclipse platform also makes your life easier.

If you are developing the extension for the KNIME community, then Gabriel and Stefan are both really helpful, always on the end of an email to answer questions.

Rosaria: What is the most complicated part of writing and maintaining an extension?

Stephen: The most complicated thing is that both KNIME Analytics Platform and Java have evolved. We started our first extension with Java 6, and Java 21 was released a couple of weeks ago. So you always have to be on top of things as community developers, but the KNIME Team can always be contacted so that’s really helpful.

Rosaria: What do you think is the reason for the success of the Vernalis extension?

Stephen: The main reason was that the cheminformatics community needed a PDB Connector Query Builder node and we created one. This community, though, is not very big so part of the success is also tied to a broader audience, which appreciates the more general functionalities we added.

Rosaria: Any word of advice to the aspiring extension developers out there? Any gaps that need an extension in the KNIME ecosystem? Any word of caution about development and maintenance?

Stephen: I give some advice to new developers coming from my own experience in the book “Best of KNIME: The COTM Collection — Season 3”. It can be downloaded for free, so I recommend having a look.

I would say that it is very important to make sure that nobody else has already come up with the same solution: KNIME offers many ways to obtain the same result. Have a look at the documentation and have fun developing!

Rosaria: From chemist to developer, I’m now interested in knowing more about your trumpet player activity. Do you give concerts?

Stephen: So I used to do that a lot when I was younger, and then I had quite a long break and got back to it in January this year. I’ve been playing the cornet, a trumpet-like instrument, in a brass band, a traditional British setup. I have played six concerts so far this year, which was a lot of fun. Playing an instrument allows me to focus on the music and leave anything I’m worried about behind.

Rosaria: One more question before we say goodbye. What about AI? Do you envision using AI to create a new music piece for trumpet?

Stephen: No, but a friend sent me something a while back that goes back to cheminformatics. I’m not quite sure what the practical purpose of this is, but somebody took data on the structure of molecules and turned it into a musical representation of the same data, creating music that encoded a specific molecule. I see, however, space for AI in the music industry to tackle certain tasks like rearrangements.

Watch the original interview with Stephen Roughley on YouTube.

Rosaria has been mining data since her master’s degree, through her doctorate and the job positions that followed. She is now a data scientist and KNIME evangelist.