Grouping data by similarity with DocArray

Published in

Jina AI

1 min read · Aug 9, 2022

Often when you have a lot of data, you want to compare similar rows to each other in various ways. For example, the CIA Factbook dataset has a lot of great data about countries, but it may not be useful to compare infant mortality rates in industrialized countries to those of developing nations. It would be great to automatically generate a list of similar countries so we can get a rough idea of where a country stands in relation to its closest peers.

DocArray lets you do just that, quickly, easily, and with no other dependencies. DocArray is a library for nested, unstructured, multimodal data in transit, such as text, image, audio, video, and 3D mesh. It also has some excellent tools for tabular data.

In this Kaggle notebook, we’ll see how easily DocArray can group tabular data by similarity. In future notebooks, we’ll explore DocArray’s handling of multimodal and nested data.
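
To make "grouping by similarity" concrete before opening the notebook, here is a minimal plain-Python sketch of the underlying idea: treat each row as a vector of numbers, rescale the columns, and list each row's nearest neighbours. It deliberately does not use DocArray's API, and the country names, column meanings, and values are made-up placeholders rather than CIA Factbook figures. The helpers `min_max_scale` and `nearest_peers` are hypothetical names chosen for this illustration.

```python
import math

# Hypothetical rows: [infant mortality rate, life expectancy].
# All values are invented for illustration only.
rows = {
    "Country A": [2.0, 80.0],
    "Country B": [3.0, 78.0],
    "Country C": [40.0, 55.0],
    "Country D": [45.0, 52.0],
}


def min_max_scale(table):
    """Rescale every column to [0, 1] so no single column dominates the distance."""
    columns = list(zip(*table.values()))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return {
        name: [
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(values, lows, highs)
        ]
        for name, values in table.items()
    }


def nearest_peers(table, limit=2):
    """Return, for each row, the names of its `limit` closest rows (Euclidean distance)."""
    scaled = min_max_scale(table)
    peers = {}
    for name, vector in scaled.items():
        distances = sorted(
            (math.dist(vector, other), other_name)
            for other_name, other in scaled.items()
            if other_name != name
        )
        peers[name] = [other_name for _, other_name in distances[:limit]]
    return peers


peers = nearest_peers(rows, limit=1)
for name, closest in peers.items():
    print(name, "is most similar to", closest)
```

Scaling matters here because the two columns are on different ranges; without it, the column with the larger spread would decide most of the ranking. Euclidean distance is just one reasonable choice for this sketch, and other metrics such as cosine distance would slot into the same structure.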

Written by Nicholas Dunham