
1, 2, 3, … 100 Connectors

Blend more than 100 data sources smoothly and effectively in a codeless fashion

Julian Bunzel
Low Code for Data Science

--

Download the KNIME Connector cheat sheet for free.

Data integration, or data blending as it is often called nowadays, is a key process for enriching a dataset and augmenting its dimensionality. And since more data often means better models, it is easy to see why data blending has become such an important part of data science. Data blending is an integration process and, like all integration processes, its main challenge is the diversity of the players involved.

So, bring on the players! We surely need a database, probably more than one, including some traditional SQL and NoSQL databases as well as some big data repositories, some databases on premises and some in the cloud. Then we probably need to collect data from one or two web services, and when we are talking APIs, we need to parse XML- and JSON-formatted data structures. We might need to integrate a Python script into our workflow as well. Of course, we cannot leave out the omnipresent Excel file. Text files also belong to the everyday routine. Spice it up with some vintage data storage software and you have the picture of where most of your time goes. We could continue … just fill in the name of your most-used data source here.

One of the most useful features of KNIME Analytics Platform and its extensions is the ease with which it connects to a wide variety of data sources through a very large number of Connector nodes. Connector nodes do exactly what the name says: each one connects to its own data source. They are straightforward to use, since only a minimal subset of settings is exposed in their configuration dialogs. At the same time, they are very flexible, thanks to optional settings that allow you, for example, to add as many diverse input connections as needed.

Many of them are summarized in the Connector cheat sheet, which can be downloaded for free from the KNIME website. In this introduction, I would like to give you an overview of the most frequently used connector nodes. Since, for obvious space reasons, I cannot describe all of them one by one, let me instead give you a taste of the marvelous world of KNIME Connectors by walking through the cheat sheet.

Let’s start from the left: Reading Files. Reading files is one of the most common data access tasks. The File Reader node can read various text files, such as CSV and TXT, and automatically guesses their structure. For many special formats, such as Excel, there are dedicated reader nodes with configuration options specific to each format.
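To see what “guessing the structure” means in practice, here is a minimal Python sketch doing the same job with pandas: the delimiter and the column types are inferred from the file itself. The file name sales.csv is just a placeholder for this sketch.

    import pandas as pd

    # sep=None lets pandas sniff the delimiter; column types are inferred too,
    # much like the File Reader node guesses the structure of a text file
    df = pd.read_csv("sales.csv", sep=None, engine="python")
    print(df.dtypes)  # inspect what was guessed for each column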

File formats such as JSON and XML have dedicated reader nodes, and KNIME also provides nodes and features to extract their contents in an easy and convenient way. PDF files, Word documents, and many other formats are covered by the Tika Parser node. KNIME is also capable of reading images, audio files, and networks.
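For comparison, extracting content from JSON and XML by hand in Python looks like the following sketch; the file names and the item element are placeholders chosen only for illustration.

    import json
    import xml.etree.ElementTree as ET

    # JSON: the whole document becomes nested dicts and lists
    with open("response.json") as f:
        data = json.load(f)
    print(data)

    # XML: walk the tree and pick out the elements you need
    root = ET.parse("catalog.xml").getroot()
    for item in root.iter("item"):  # assumes the file contains <item> elements
        print(item.attrib)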

With the many technologies and services available for storing information, it is unlikely that all your data is available locally. KNIME offers an easy option to connect your reader nodes to remote file systems. To activate a file system connection, click the three dots in the bottom left corner of a reader node. KNIME provides connectors for cloud storage systems, such as Azure Data Lake Storage Gen2, Google Drive, Amazon S3, and Microsoft SharePoint.
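Conceptually, pairing an Amazon S3 Connector node with a reader node corresponds to something like this Python sketch based on boto3. The bucket and key names are placeholders, and the credentials are assumed to come from your AWS configuration.

    import io

    import boto3
    import pandas as pd

    # Authenticate (via the AWS config/environment), fetch the object, parse it
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket="my-bucket", Key="data/sales.csv")
    df = pd.read_csv(io.BytesIO(obj["Body"].read()))
    print(df.head())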

There are also connectors for distributed file systems, such as HDFS and the Databricks file system.

To connect to these systems, the corresponding authentication node needs to be used. For instance, the SharePoint Online Connector node requires a Microsoft Authentication node. You can find all authentication nodes in the lower left corner of the cheat sheet.

KNIME offers integrations with various other tools like Python, R, H2O, and SAS.

Python and R scripts can be read and executed via the corresponding source nodes. The parent extensions come with large sets of nodes, such as the Python Object Reader and the Python View node.
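For instance, a Python Script node used as a source, with no input and one output table, can be as short as the following sketch. The knime.scripting.io API shown here is the one used by recent versions of the KNIME Python integration; older nodes expose the output differently.

    import knime.scripting.io as knio
    import pandas as pd

    # Build a table from scratch and hand it over as the node's first output
    df = pd.DataFrame({"word": ["will", "they", "blend"], "count": [3, 3, 1]})
    knio.output_tables[0] = knio.Table.from_pandas(df)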

Databases are another common way of storing and accessing data. There are dedicated connector nodes for common SQL databases, such as Oracle, Snowflake, PostgreSQL, and MySQL. For NoSQL databases, there are connector nodes for sources such as MongoDB and Neo4j. For big data technologies, we have Hive and Impala. Connectors to cloud-based databases such as Google BigQuery, Amazon Redshift, and Amazon Athena are available as well.

KNIME also offers dedicated nodes to perform in-database processing. These nodes allow non-experts to build SQL statements in a visual way. The DB Reader node, placed at the end of a sequence of DB nodes, pulls the data from the database into KNIME based on the SQL statement created by the previous nodes. Alternatively, some nodes, like the DB SQL Executor, allow the SQL query to run directly on the database, without the data leaving it. This approach exploits the computational power of the database system to process the data.
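Outside KNIME, the same pattern looks like the following Python sketch: filtering and aggregation are expressed in SQL and executed inside the database, and only the final result is pulled out, which is exactly the role of the DB Reader node. The shop.db file and the orders table are hypothetical.

    import sqlite3

    conn = sqlite3.connect("shop.db")  # placeholder database

    # DB Table Selector -> DB Row Filter -> DB GroupBy, composed as one statement
    query = """
        SELECT customer, SUM(amount) AS total
        FROM orders
        WHERE amount > 0
        GROUP BY customer
    """

    # Only this step, the "DB Reader" moment, moves data out of the database
    rows = conn.execute(query).fetchall()
    conn.close()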

Web Services can be called by the nodes of KNIME’s REST Client extension. Semantic Web resources, such as SPARQL endpoints, in-memory endpoints, and Triple files, can be accessed via their corresponding nodes. Additionally, there are various extensions for external web services, such as Twitter or Salesforce.
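In plain Python, calling a REST endpoint the way the REST Client nodes do corresponds roughly to this sketch; the URL is a placeholder for whatever service you query.

    import requests

    # Call the endpoint and fail loudly on HTTP errors
    resp = requests.get("https://api.example.com/v1/items", timeout=30)
    resp.raise_for_status()
    items = resp.json()  # parsed response body, ready for downstream blending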

To retrieve the contents of a website or an RSS feed, you can use the Webpage Retriever or the RSS Feed Reader node, respectively.

The HTTP(S), SSH, and FTP Connector nodes allow you to connect to servers via the specified protocol.

Cloud providers such as Amazon, Microsoft, and Google offer more than cloud storage systems. They also have their own services and solutions for data analysis and data science. KNIME offers integrations with services such as Google Sheets, Google Analytics, and Amazon’s AI/ML services like Comprehend, Translate, and Personalize.

Additionally, there are reader nodes for various machine learning model types, such as KNIME models, PMML formatted models, and Keras & TensorFlow networks.

Last but not least, KNIME provides a set of nodes to create big data environments, or contexts. The Create Local Big Data Environment node creates a fully functional local environment including Apache Hive, Apache Spark, and HDFS. It is ideal for testing your application before deploying it to a real cluster.
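As a rough Python analogy, the node does for your workflow what spinning up a local Spark session does in a PySpark script. This sketch assumes the pyspark package is installed.

    from pyspark.sql import SparkSession

    # "local[*]" runs Spark on all local cores: no cluster needed for testing
    spark = (SparkSession.builder
             .master("local[*]")
             .appName("local-test")
             .getOrCreate())
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    df.show()
    spark.stop()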

With this last group of nodes we have come back to the connector nodes in the top left corner of the cheat sheet, after zooming in on simple files, cloud files, databases, big data repositories, web services, cloud services, cloud storage systems, and authentication.

“Will They Blend?” Yes, they will!

This was just a short overview of the data blending capabilities that KNIME Analytics Platform has to offer. Are you still wondering whether KNIME can blend the data you need for your diverse, complex, multi-source project? Spoiler: yes, it can!

You can read more about data blending and KNIME Connectors in the “Will They Blend?” book, which describes data blending techniques for more than 50 data sources and external tools. It has recently been updated and rebranded, with new content, a new article organization, a new internal format, and a brand-new cover. The only thing that has not changed is that it is still free to download.

Download it for free here!
