Large Scale Clustering Using KNIME To Segment Population Data

David Plummer
10 min read · Jan 4, 2023

How to overcome the limitations of clustering algorithms using machine learning and parameter optimisation to segment large datasets.

All images © David Plummer

KNIME is a mature low-code data analytics tool with extensive and extensible data science capabilities. One misconception is that low-code equates to low-capability. This is not the case. The following case study shows how KNIME can automatically cluster large data sets by combining dimensionality reduction, automatic parameter optimisation, customised test statistics and machine learning, which together allow the workflow to scale.
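To make "automatic parameter optimisation" concrete, here is a minimal plain-Python sketch of the general idea: run a clustering algorithm for several candidate values of a parameter (here the number of clusters, k) and keep the value that scores best on a quality statistic. It is not the KNIME workflow from this article. The synthetic data, the function names (`farthest_first`, `kmeans`, `silhouette`), the farthest-first initialisation and the use of the silhouette score as the statistic are all assumptions made for illustration.

```python
import math
import random


def farthest_first(points, k):
    """Pick k starting centroids: the first point, then repeatedly the point
    farthest from every centroid chosen so far (deterministic)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iter=100):
    """Basic k-means with farthest-first initialisation; returns one label per point."""
    centroids = farthest_first(points, k)
    labels = []
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Move the centroid to the cluster mean; keep it if the cluster is empty.
            updated.append(
                tuple(sum(axis) / len(members) for axis in zip(*members))
                if members else centroids[c]
            )
        if updated == centroids:
            break
        centroids = updated
    return labels


def silhouette(points, labels):
    """Mean silhouette score: near 1 for compact, well-separated clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return -1.0
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes a score of 0
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for key, other in clusters.items() if key != lab
        )
        total += (b - a) / (max(a, b) or 1.0)
    return total / len(points)


# Synthetic stand-in data (an assumption): three well-separated 2-D groups.
rng = random.Random(42)
centres = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]
points = [
    (cx + rng.gauss(0, 1), cy + rng.gauss(0, 1))
    for cx, cy in centres
    for _ in range(30)
]

# The "optimisation loop": score every candidate k and keep the best one.
scores = {k: silhouette(points, kmeans(points, k)) for k in range(2, 7)}
chosen_k = max(scores, key=scores.get)
print(chosen_k, round(scores[chosen_k], 3))
```

The silhouette calculation compares every pair of points, so its cost grows quadratically with the number of rows. That is one reason a naive loop like this does not scale to census-sized data without further steps such as sampling or dimensionality reduction.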

Case study: Census 2021, Travel to Work

Every ten years the United Kingdom's Office for National Statistics (ONS) conducts a population census. This generates a large amount of data, which is useful for developing government policy and planning services. The most recent census was conducted in 2021 and the initial results became available towards the end of 2022.

The data is collected at a household level but, for reasons of confidentiality, is aggregated into census areas. The smallest reported census area is the Output Area (OA), which covers between 40 and 250 households (between 100 and 625 persons). The boundaries are chosen such that…
