GETTING STARTED | DATA ACCESS | KNIME ANALYTICS PLATFORM

Efficiency Unleashed: Navigating the Data Landscape with KNIME’s .table Format

What is it and when to use it

Marcell Palfi
Low Code for Data Science


As first published at https://datraction.com. Co-author: Gabor Zombory, Data Engineer, Datraction

In the dynamic landscape of modern business, a profound shift is underway as companies increasingly embrace a data-driven paradigm. Traditional decision-making models are giving way to a more analytical approach, where insights derived from vast datasets steer strategic directions. Technologies play a pivotal role in this evolution, empowering organizations to harness the potential of data for informed decision-making. From innovative data storage solutions to advanced analytics tools, companies are redefining their operational frameworks.

This transformative journey underscores a commitment to agility and efficiency, enabling businesses to navigate complexities and make precise decisions in an era where data has become the cornerstone of evolution. In this era, we must focus on how we handle and store our data.

After comparing KNIME functions with Excel in one of our previous articles, in this article we explore various aspects of, and best practices for, using KNIME’s native data storage format.

KNIME Table in a nutshell

As a key player in the data analytics realm, KNIME introduces its proprietary file format, the KNIME Table (.table). The .table file format stands out for its performance and rapid data processing compared to traditional alternatives like Excel and flat files (files that store data in a plain, text-based format, such as .csv).

Technical aspects of the KNIME .table format:

  • Binary Format: The .table format is binary, storing data in a compact, machine-readable form. This binary representation enhances speed and reduces storage space compared to text-based formats.
  • Columnar Storage: KNIME .table files are often stored in a columnar format, allowing for efficient compression and rapid access to specific columns. This is beneficial for analytical workflows that frequently access specific data fields.
  • Schema Information: The format includes metadata about the table structure, preserving information about column names, data types, and other relevant schema details. This self-contained structure facilitates better interoperability and data integrity.
  • Optimized for Analytics: The .table format is optimized for analytics and data processing tasks commonly performed in KNIME workflows. This optimization contributes to improved read and write performance, especially in scenarios involving large datasets and complex analytical operations.
  • Integration with KNIME Analytics Platform: The .table format is closely integrated with the KNIME Analytics Platform, allowing seamless compatibility and interoperability with other components of the KNIME ecosystem.
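To make the first three points more tangible, here is a minimal Python sketch of what "binary", "columnar" and "schema information" mean in general. It is not KNIME's actual on-disk layout, which is not described here; the column names, the JSON header and the sample values are all assumptions made for illustration. The sketch writes the same two columns once as CSV text and once as packed binary columns with a small schema header, then reads back a single column without touching the other one.

```python
import csv
import io
import json
import random
import struct

random.seed(0)
N_ROWS = 10_000

# Hypothetical sample data: one integer column and one floating-point column.
ids = list(range(1_000_000, 1_000_000 + N_ROWS))
prices = [random.uniform(0, 1_000_000) for _ in range(N_ROWS)]

# 1) Text, row-oriented: every value is written out as characters.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "price"])
writer.writerows(zip(ids, prices))
text_bytes = buf.getvalue().encode("utf-8")

# 2) Binary, column-oriented, with a small schema header.
#    This layout is invented for the example; it is NOT the .table format.
schema = json.dumps({
    "rows": N_ROWS,
    "columns": [
        {"name": "id", "type": "int64"},
        {"name": "price", "type": "float64"},
    ],
}).encode("utf-8")
id_block = struct.pack(f"<{N_ROWS}q", *ids)        # 8 bytes per integer
price_block = struct.pack(f"<{N_ROWS}d", *prices)  # 8 bytes per float
binary_bytes = struct.pack("<I", len(schema)) + schema + id_block + price_block

# Read only the "price" column: use the schema to compute where it starts,
# then skip the header and the whole "id" block.
(schema_len,) = struct.unpack_from("<I", binary_bytes, 0)
meta = json.loads(binary_bytes[4:4 + schema_len])
price_offset = 4 + schema_len + 8 * meta["rows"]
prices_back = list(
    struct.unpack_from(f"<{meta['rows']}d", binary_bytes, price_offset)
)

print("text size (bytes):  ", len(text_bytes))
print("binary size (bytes):", len(binary_bytes))
```

In this toy case the binary version is smaller because each number takes a fixed 8 bytes instead of one byte per printed digit, and a single column can be located by offset instead of parsing every row. Real formats add compression and further metadata on top of the same idea.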

An alternative to the traditional data storing formats

While specific speed metrics vary with factors such as dataset size and system specifications, the structure of KNIME’s .table file format offers notable advantages in read and write speed and in storage space compared to Excel and traditional flat files. Because .table is binary, data can be processed faster, with little of the overhead associated with parsing complex structures. In contrast, Excel files, known for their extensive formatting capabilities, often exhibit slower read and write speeds, especially with larger datasets. Flat files, although simpler in structure, may run into performance bottlenecks because they are not optimized for analytical workflows. Note that this file format is only compatible with KNIME, so its advantages are available only when the data is read into this platform.

When to use KNIME Table?

As previously mentioned, the true strength of .table becomes apparent when handling larger datasets on local drives, whether you repeatedly use the same substantial raw dataset in each workflow run or need to store historical data without the option of building a database. The following section illustrates the advantages of .table based on a series of tests. In these tests we took the two main file formats (.xlsx and .csv), measured their reading speed and file size, and then wrote the same data out to .table so that the main differences could be compared.

*Results may differ on different hardware setups.

For the test, we generated a series of files, each with 27 columns, with the row count increasing linearly from 270,000 to 3,330,000 rows. The same data was saved in three file formats: .xlsx (only for the first three measurements, after which we reached Excel’s limit of 1,048,576 rows per worksheet), .csv and .table. In the first graph, the comparison reveals a remarkable efficiency gain with .table in the form of a substantially smaller file size. The data stored in KNIME’s proprietary file format occupies less than half of the disk space used by the widely used .csv format, and only a third of the space consumed by Excel.
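For readers who want to build a comparable set of input files, the following Python sketch generates CSV test data of the described shape. Several details are assumptions, since they are not stated above: the number of files (ten, evenly spaced), the column names, and the random integer content are hypothetical. Only the 27 columns and the first and last row counts come from the test description. The resulting .csv files could then be read into KNIME and written out again as .xlsx and .table.

```python
import csv
import os
import random

EXCEL_MAX_ROWS = 1_048_576  # rows per .xlsx worksheet, header row included
N_COLUMNS = 27              # from the test description


def row_counts(first=270_000, last=3_330_000, n_files=10):
    """Evenly spaced row counts; n_files=10 is an assumption."""
    step = (last - first) // (n_files - 1)
    return [first + i * step for i in range(n_files)]


def fits_in_excel(n_rows):
    """True if n_rows data rows plus one header row fit on one worksheet."""
    return n_rows + 1 <= EXCEL_MAX_ROWS


def write_test_csv(path, n_rows, n_columns=N_COLUMNS, seed=0):
    """Write a CSV of random integers and return its size in bytes.

    The column names and value range are hypothetical placeholders.
    """
    rng = random.Random(seed)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([f"col_{i}" for i in range(n_columns)])
        for _ in range(n_rows):
            writer.writerow(
                [rng.randint(0, 1_000_000) for _ in range(n_columns)]
            )
    return os.path.getsize(path)


counts = row_counts()
excel_ok = [n for n in counts if fits_in_excel(n)]

if __name__ == "__main__":
    print("row counts:", counts)
    print("sizes that fit on one Excel worksheet:", excel_ok)
```

With ten evenly spaced sizes, only the first three row counts stay under the worksheet limit, which is consistent with Excel dropping out after the third measurement. A different number of files would shift that cut-off.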

*Results may differ on different hardware setups.

The second graph highlights that, for smaller datasets, the .csv format remains a viable option. However, as dataset size increases, the .table format consistently proves superior, with shorter reading times across varying numbers of rows. With larger datasets, the time required to read data from the .table format is notably lower than that of the .csv format. This efficiency gain suggests that adopting the .table format can significantly improve data reading speed, making it a preferred choice for data processing tasks compared to the traditional .csv format. Excel, by contrast, has by far the longest reading time, with both other formats significantly outperforming it.
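The kind of measurement behind such a graph can be sketched outside KNIME as well. The Python example below is only an analogy: it times reading the same numbers from a CSV file and from a simple hand-made binary file, not from a real .table file, and the file names, the single-column layout and the default row count are assumptions. Absolute timings depend on the machine, so the sketch reports them without claiming any particular result.

```python
import csv
import os
import random
import struct
import tempfile
import time


def write_csv(path, values):
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["value"])
        writer.writerows([v] for v in values)


def read_csv(path):
    # Every value has to be parsed from text back into a float.
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        return [float(row[0]) for row in reader]


def write_binary(path, values):
    # Hypothetical layout: an 8-byte row count followed by 8-byte floats.
    with open(path, "wb") as f:
        f.write(struct.pack("<q", len(values)))
        f.write(struct.pack(f"<{len(values)}d", *values))


def read_binary(path):
    # Values are copied from disk in their in-memory form, no text parsing.
    with open(path, "rb") as f:
        (n,) = struct.unpack("<q", f.read(8))
        return list(struct.unpack(f"<{n}d", f.read(8 * n)))


def timed(fn, *args):
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def compare(n_rows=200_000, seed=0):
    rng = random.Random(seed)
    values = [rng.uniform(0, 1_000_000) for _ in range(n_rows)]
    with tempfile.TemporaryDirectory() as folder:
        csv_path = os.path.join(folder, "data.csv")
        bin_path = os.path.join(folder, "data.bin")
        write_csv(csv_path, values)
        write_binary(bin_path, values)
        from_csv, csv_seconds = timed(read_csv, csv_path)
        from_bin, bin_seconds = timed(read_binary, bin_path)
        return {
            "same_data": from_csv == values and from_bin == values,
            "csv_bytes": os.path.getsize(csv_path),
            "bin_bytes": os.path.getsize(bin_path),
            "csv_seconds": csv_seconds,
            "bin_seconds": bin_seconds,
        }


if __name__ == "__main__":
    for key, value in compare().items():
        print(key, value)
```

Running `compare` with increasing `n_rows` gives a rough feel for how the read time of a text format and a binary format grow with dataset size. For a fair comparison of .csv and .table themselves, the timing should be taken inside KNIME.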

Considering the remarkable gains in both efficiency and performance demonstrated by KNIME’s .table format, it becomes clear that adopting this native file format is a strategic choice for users seeking streamlined data processing workflows. Whether dealing with extensive datasets, historical data, or recurrent workflows, the .table format emerges as a robust solution.
