How to find all links between components within a dataset (with a small memory footprint?)

Thomas Meißner
7 min read · Aug 19, 2023


All you need is dictionaries

Image created by AI on https://www.imagine.art/ with the term “Social network links”

This is part one of a series of articles.

Introduction

In the era of modern data, finding links between entities is a powerful capability. For example, it can be helpful to understand how people are linked to each other in social networks for various analytical reasons.

Some approaches make use of graph theory to find all links, and there are frameworks that help achieve this and even visualize the results.

In this series we want to investigate various ways to create a DataFrame that shows all direct and indirect links of an entity. We approach the problem from scratch in an engineering manner, leaving graph theory out entirely.

The problem statement

We are data scientists for an internet platform that allows private individuals to sell goods via forums. One user seems to make far more trades than usual in a shorter amount of time while always receiving 5-star feedback. We suspect this trader of doing fake trades to artificially increase their own reputation. Fake trades, however, would require at least two accounts. We want to investigate how many users might be part of this user's network for further analysis, finding direct and indirect links alike.

The dataset

To validate our solution we will create a synthetic dataset by hand. This will allow us to evaluate the results fairly easily. We create the following two networks:

network_1 = [
    ("A", "12"),
    ("B", "12"),
    ("B", "13"),
    ("C", "13"),
    ("C", "1045"),
    ("D", "1045"),
    ("D", "2095"),
    ("E", "2095"),
    ("E", "883"),
    ("E", "6634"),
    ("F", "6634"),
    ("F", "7777"),
    ("G", "7777"),
]


network_2 = [
    ("M", "7768"),
    ("N", "7768"),
    ("N", "8998"),
    ("O", "7768"),
    ("O", "9000"),
    ("P", "9000"),
]

Network 1 is a chain of seven users. None of them has more than two direct links, but all of them are part of one fully connected chain of seven users.

Network 2 is a smaller network of four chained users that is disconnected from network 1. We create a Pandas DataFrame from both networks and save it as a CSV file:

import pandas as pd


def create_synthetic_data():
    synth = pd.DataFrame(
        {
            "user": ["A", "B", "B", "C", "C", "D", "D", "E", "E", "E", "F", "F", "G",
                     "M", "N", "N", "O", "O", "P"],
            "trade_id": ["12", "12", "13", "13", "1045", "1045", "2095", "2095", "883", "6634", "6634", "7777", "7777",
                         "7768", "7768", "8998", "7768", "9000", "9000"]
        }
    )
    synth.to_csv("account_linking_synth.csv", index=False)

Now we have a CSV containing two columns: one consists of user IDs and the other of trade IDs.
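As a quick sanity check (a small helper sketch added here for illustration, not part of the pipeline we build below), we can print the first rows of the generated file:

import csv

# Peek at the generated CSV: prints the header plus the first three data rows.
with open("account_linking_synth.csv", "r") as csvfile:
    for i, row in enumerate(csv.reader(csvfile)):
        print(row)
        if i >= 3:
            break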

Loading the CSV

Our goal is to find all links with a small memory footprint. For this reason we will load the CSV row by row and fill a bunch of dictionaries. Let's load the CSV:

import csv

with open(filename, 'r') as csvfile:
    datareader = csv.reader(csvfile)
    # expect the 1st column to be user & the 2nd column to be trade_id
    for row in datareader:
        ...
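As a side note, the full script further below skips the header row by counting parsed rows. An equivalent alternative (shown here only as a variation, not what the script does) is to consume the header once with next():

import csv

with open(filename, 'r') as csvfile:
    datareader = csv.reader(csvfile)
    next(datareader, None)  # skip the header row ("user", "trade_id")
    for row in datareader:
        ...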

Getting one-hop links

The first thing we do is parse the direct links, i.e. connections that are just one hop away. The three main dictionaries we need here are:

entity_to_identifier: Dict[str, List[Union[str, int]]] = {}
identifier_to_entity: Dict[str, List[Union[str, int]]] = {}
entity_to_entity: Dict[str, List[Union[str, int]]] = {}

The dictionary entity_to_identifier will have the user as a key and a list of connected trade_ids as the value:

# if the entity is already there, append the identifier; otherwise create a new entry
if row[entity_idx] not in entity_to_identifier:
    entity_to_identifier[row[entity_idx]] = [row[identitiy_idx]]
else:
    entity_to_identifier[row[entity_idx]].append(row[identitiy_idx])
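As a side note, the same bookkeeping can be expressed more compactly with collections.defaultdict. This is only an equivalent sketch (the name entity_to_identifier_dd is made up for illustration); the article sticks to plain dictionaries:

from collections import defaultdict
from typing import DefaultDict, List

# With defaultdict, missing keys start out as empty lists,
# so no explicit membership check is needed before appending.
entity_to_identifier_dd: DefaultDict[str, List[str]] = defaultdict(list)
entity_to_identifier_dd[row[entity_idx]].append(row[identitiy_idx])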

Afterwards, it looks like this:

{
'A': ['12'],
'B': ['12', '13'],
'C': ['13', '1045'],
'D': ['1045', '2095'],
'E': ['2095', '883', '6634'],
'F': ['6634', '7777'],
'G': ['7777'],
'M': ['7768'],
'N': ['7768', '8998'],
'O': ['7768', '9000'],
'P': ['9000']
}

The dictionary identifier_to_entity will have the trade_id as a key and a list of the connected users as value:

# likewise check if the identifier is listed already
if row[identitiy_idx] not in identifier_to_entity:
    identifier_to_entity[row[identitiy_idx]] = [row[entity_idx]]
else:
    identifier_to_entity[row[identitiy_idx]].append(row[entity_idx])

Afterwards, it looks like this:

{
'12': ['A', 'B'],
'13': ['B', 'C'],
'1045': ['C', 'D'],
'2095': ['D', 'E'],
'883': ['E'],
'6634': ['E', 'F'],
'7777': ['F', 'G'],
'7768': ['M', 'N', 'O'],
'8998': ['N'],
'9000': ['O', 'P']
}

The third dictionary entity_to_entity will give us the information we really care about: for each user we will get a list of linked users. Here we basically loop through the list of users connected to a trade_id and add each of them as a key into the third dictionary, with the list of all users involved in the trade as the value.

for entity in identifier_to_entity[row[identitiy_idx]]:
    if entity not in entity_to_entity:
        # copy the list so we do not alias the identifier_to_entity entry
        entity_to_entity[entity] = list(identifier_to_entity[row[identitiy_idx]])
    else:
        entity_to_entity[entity] = list(set(entity_to_entity[entity] + identifier_to_entity[row[identitiy_idx]]))

The result looks like this:

{
'A': ['A', 'B'],
'B': ['A', 'B', 'C'],
'C': ['C', 'B', 'D'],
'D': ['C', 'E', 'D'],
'E': ['F', 'E', 'D'],
'F': ['F', 'E', 'G'],
'G': ['F', 'G'],
'M': ['O', 'M', 'N'],
'N': ['O', 'M', 'N'],
'O': ['O', 'M', 'P', 'N'],
'P': ['O', 'P']
}

So far we have found all direct links between users. However, we wanted to find the whole chain, including indirect links.

From one hop to multi-hop

To achieve this, all we need is the entity_to_entity dictionary we created in the last step. We follow these steps:

  • loop through each user (the keys)
  • loop through each user that is a direct link via shared trades (the values)
  • for each of these linked users, take their linked users and merge them with the linked-user list of the current user (we basically merge the links of linked users)
  • repeat the process over multiple iterations until no list can be extended any further

To explain this with our synthetic data: user "B" made trades with users "A" and "C". Thus we take the linked users of "A" (just "B") and merge them with the linked users of "B" (nothing changes here so far). Then we merge with the linked users of "C" (here user "D" comes on top). In the next iteration, when we encounter user "B" again, we will also add all linked users of "D".
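A minimal sketch of one such merging pass could look like this (the function name multihop_sketch is only for illustration; the full version with early stopping follows further below):

def multihop_sketch(entity_to_entity):
    # One pass: for every user, merge the link lists of all directly linked users.
    # Repeated passes propagate links further and further along the chain.
    for entity, shared_entities in list(entity_to_entity.items()):
        for shared_entity in shared_entities:
            if shared_entity == entity:
                continue
            merged = list(set(entity_to_entity[entity] + entity_to_entity[shared_entity]))
            entity_to_entity[entity] = merged
            entity_to_entity[shared_entity] = merged
    return entity_to_entity

Calling such a pass repeatedly until nothing changes anymore yields the full chains.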

The result of this operation looks like this:

{
'A': ['F', 'A', 'G', 'B', 'C', 'E', 'D'],
'B': ['F', 'A', 'E', 'G', 'C', 'B', 'D'],
'C': ['F', 'A', 'G', 'B', 'C', 'E', 'D'],
'D': ['F', 'A', 'G', 'B', 'C', 'E', 'D'],
'E': ['F', 'A', 'E', 'G', 'C', 'B', 'D'],
'F': ['F', 'A', 'E', 'G', 'C', 'B', 'D'],
'G': ['F', 'A', 'E', 'G', 'C', 'B', 'D'],
'M': ['O', 'M', 'P', 'N'],
'N': ['O', 'M', 'P', 'N'],
'O': ['O', 'M', 'P', 'N'],
'P': ['O', 'M', 'P', 'N']
}

Fantastic! We got what we wanted and see all direct and indirect links.
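Because we built the dataset by hand, we can verify this outcome directly against the two networks defined at the beginning (a small check added here for illustration, not part of the original script):

# Users of network 1 and network 2 as defined above.
network_1_users = {"A", "B", "C", "D", "E", "F", "G"}
network_2_users = {"M", "N", "O", "P"}

# Every user should be linked to exactly the users of their own network
# and to nobody from the other, disconnected network.
for user in network_1_users:
    assert set(entity_to_entity[user]) == network_1_users
for user in network_2_users:
    assert set(entity_to_entity[user]) == network_2_users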

Saving computational time

We also add an early-stopping condition, as this operation is expensive and requires lots of computation when long chains are involved, and we do not want to compute more than needed. Our early-stopping condition is triggered when no chain gets any longer during an iteration.

The full code

import csv
import datetime
import pandas as pd
from typing import Dict, List, Union


def create_synthetic_data():
    synth = pd.DataFrame(
        {
            "user": ["A", "B", "B", "C", "C", "D", "D", "E", "E", "E", "F", "F", "G",
                     "M", "N", "N", "O", "O", "P"],
            "trade_id": ["12", "12", "13", "13", "1045", "1045", "2095", "2095", "883", "6634", "6634", "7777", "7777",
                         "7768", "7768", "8998", "7768", "9000", "9000"]
        }
    )
    synth.to_csv("/home/thomas/Desktop/account_linking_synth.csv", index=False)


filename = '/home/thomas/Desktop/account_linking_synth.csv'

entity_to_identifier: Dict[str, List[Union[str, int]]] = {}
identifier_to_entity: Dict[str, List[Union[str, int]]] = {}
entity_to_entity: Dict[str, List[Union[str, int]]] = {}
entity_to_entity_chain_len: Dict[str, int] = {}
entity_to_entity_chain_increased: Dict[str, bool] = {}

create_synthetic_data()

# define which idx the columns have in the CSV
entity_idx = 0
identitiy_idx = 1
header = True
rows_parsed = 0
chain_length_increased: bool = False

# one hop
with open(filename, 'r') as csvfile:
    datareader = csv.reader(csvfile)
    # expect the 1st column to be the entity & the 2nd column to be the identifier
    for row in datareader:
        rows_parsed += 1
        if header and rows_parsed <= 1:
            continue

        # if the entity is already there, append the identifier; otherwise create a new entry
        if row[entity_idx] not in entity_to_identifier:
            entity_to_identifier[row[entity_idx]] = [row[identitiy_idx]]
        else:
            entity_to_identifier[row[entity_idx]].append(row[identitiy_idx])

        # likewise check if the identifier is listed already
        if row[identitiy_idx] not in identifier_to_entity:
            identifier_to_entity[row[identitiy_idx]] = [row[entity_idx]]
        else:
            identifier_to_entity[row[identitiy_idx]].append(row[entity_idx])

        for entity in identifier_to_entity[row[identitiy_idx]]:
            if entity not in entity_to_entity:
                # copy the list so we do not alias the identifier_to_entity entry
                entity_to_entity[entity] = list(identifier_to_entity[row[identitiy_idx]])
            else:
                entity_to_entity[entity] = list(set(entity_to_entity[entity] + identifier_to_entity[row[identitiy_idx]]))


def multihop():
    global entity_to_entity
    global entity_to_entity_chain_len
    global chain_length_increased
    global entity_to_entity_chain_increased

    chain_length_increased = False

    for entity, shared_entities in entity_to_entity.items():
        entity_to_entity_chain_increased[entity] = False

        for shared_entity in shared_entities:
            if shared_entity == entity:
                continue

            entity_to_entity[entity] = list(set(entity_to_entity[entity] + entity_to_entity[shared_entity]))
            entity_to_entity[shared_entity] = list(set(entity_to_entity[entity] + entity_to_entity[shared_entity]))

            # we keep track if any chain got longer, otherwise we do early stopping
            chain_length = max(len(entity_to_entity[entity]), len(entity_to_entity[shared_entity]))
            if entity not in entity_to_entity_chain_len:
                entity_to_entity_chain_len[entity] = chain_length
                prev_length = chain_length
            else:
                prev_length = entity_to_entity_chain_len[entity]

            if chain_length > prev_length:
                entity_to_entity_chain_len[entity] = chain_length
                chain_length_increased = True
                entity_to_entity_chain_increased[entity] = True


for i in range(10):
    print(f"Start multihop iteration {i} at {datetime.datetime.now()}")
    multihop()
    # if no chain got longer we do early stopping
    if not chain_length_increased:
        break
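Since the stated goal of the series is a DataFrame that shows all direct and indirect links, one straightforward way to get there from entity_to_entity is sketched below (the column names are a choice made here for illustration, not part of the original code):

# One row per (user, linked_user) pair, excluding self-links.
links = pd.DataFrame(
    [
        (user, linked_user)
        for user, linked_users in entity_to_entity.items()
        for linked_user in sorted(linked_users)
        if linked_user != user
    ],
    columns=["user", "linked_user"],
)
print(links.head())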

Closing words

Testing this with a CSV of 128 MB (more than 5 million rows and some chains with a length of several hundred entities), the total memory consumption reached 2.2 GB with 3–7 minutes of runtime.
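One simple way to get a comparable peak-memory figure on Linux (shown here only as a sketch, not necessarily how the number above was measured) is to read the process' peak resident set size via the resource module:

import resource

# Peak resident set size of the current process; on Linux the value is in kilobytes.
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"Peak memory: {peak_kb / 1024:.1f} MB")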

After some experimentation it turned out that the reason for this high memory consumption is the garbage collector itself. This could be solved by calling

del row
_ = gc.collect()

after each iteration (with the gc module imported at the top of the script). However, so many calls to the GC slow down the script massively. Just calling del after each iteration and running gc.collect() only every X iterations does not clean up the memory.

While the runtime is acceptable, the memory consumption is quite high, so we did not achieve our goal here. In the next part of the series we will conceptually follow a similar route but engineer a totally different solution, hopefully reaching our memory goal this time.

We could also make use of Rust to benefit from not having a garbage collector, as this has been a real issue here.

If you liked this article, please leave a clap. And feel welcome to leave your thoughts in the comments.

Related articles

See here the follow-up article, where we try to keep memory consumption as low as possible.


Thomas Meißner

Data scientist at SumUp. Passionate about data, good food, coffee and wine. Father of two lovely children.