Network Analysis and Community Structure for Market Surveillance using Python/NetworkX

Aditya Gandhi
Feb 14, 2019


This article and work is a collaboration between two authors, and their details are given below:

Harsh Shekhar has been working in the fin-tech space for over 10 years and has been associated with application of data science in market surveillance in his current role. LinkedIn: https://www.linkedin.com/in/harsh-shekhar/

Aditya Gandhi is a data scientist with experience in the area of supply chain, risk management and customer analytics. He is currently working in the area of market surveillance. LinkedIn: https://www.linkedin.com/in/adityadgandhi/

Note: The relevant Python code for this article can be found here: https://github.com/adityagandhi7/community_structure

Introduction:

This article focuses on the insights that can be drawn by analyzing chat logs and decoding community structure from data on chat participants (bilateral and multilateral chatrooms). Since the underlying data must remain confidential, we have used synthetic data to generate the visuals.

Market Surveillance is the area within financial institutions that involves monitoring for market manipulation practices. Traditionally, much of the work in this area monitored either trading or e-communications (chats/voice calls) in silos. This produced a huge number of false alerts and wasted a large number of man-hours. Recently, compliance functions have been catching up and attempting to analyze multiple variables simultaneously: with the influx of data science tools and the increase in computing power, it is now easier to derive insights from big data. So instead of monitoring just trade data or just e-communication data in silos, the trend is slowly moving towards monitoring both together.

One of the roles of a data scientist is to look for use cases (moonshots) in different industries and try adapting the concept to finance. During one of our moonshot sessions, we came across an excellent Bloomberg article on surveillance expertise, conceptualized and implemented by Palantir Technologies for JP Morgan Chase. Palantir had developed capabilities to scan through emails, browsing histories, GPS locations from company-owned smartphones, transcripts of phone conversations and employee badge timings (https://www.bloomberg.com/features/2018-palantir-peter-thiel). Reading this article inspired us to attempt a moonshot of our own and implement a proof-of-concept visualization/model to carry out holistic surveillance and identify network structure/communities in the data.

Background:

Network Analysis and Graph Theory are already well-known concepts in areas such as social networking, communication and organizational change management, and more recently in market surveillance. Network analysis helps us visualize multiple data points and draw insights from a complex set of connections.

A quick background on the market surveillance space: Market Surveillance is a department within banks whose onus is to curb market manipulation practices by the firm’s traders/clients. Old-school surveillance techniques relied on variables such as a threshold and a horizon period: financial transactions were monitored within a fixed time horizon and only for transaction amounts above a certain threshold. This generated a large number of false alerts, and compliance departments have traditionally spent many man-hours tackling them. In addition, the false alert ratio used to be an input for raising or lowering the threshold of transactions to be monitored. With the advent of data science, there is an opportunity to make this space more efficient. Keeping this aim in mind, we have attempted not to analyze the trading and e-communication spaces separately, but to combine trading with chat data and perform the analysis on the combined sources.

Insights can be drawn either from quantitative measures such as centrality (degree, closeness or eigenvector) and network density, or from community formation via visual mapping.

Data Overview:

We created an example of chat data which contains information such as the Inviter (person sending the chat), the Invitee/s (person receiving the chat), and the Message Count (number of messages sent in the conversation). Our data had 130 participants, with 91 conversations. This is shown in the image below (along with the supporting Python code in the next block):

Code Block 1: Creating a node graph from a dataset in Python/NetworkX
Figure 1: Summary of Communications Data
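As a minimal sketch of what such a code block might look like, the rows below are made-up conversations in the article's Inviter/Invitee/Message Count shape; each conversation becomes a weighted edge in a NetworkX graph:

```python
import networkx as nx

# Hypothetical chat records: (inviter, invitee, message_count)
chat_data = [
    ("P1", "P2", 34),
    ("P1", "P3", 12),
    ("P2", "P4", 5),
    ("P3", "P4", 21),
]

# One node per participant, one weighted edge per conversation;
# the message count becomes the edge weight.
G = nx.Graph()
for inviter, invitee, count in chat_data:
    G.add_edge(inviter, invitee, weight=count)

print(G.number_of_nodes(), G.number_of_edges())  # 4 4
```

In the real dataset the rows would be read from a file (e.g. with pandas) rather than hard-coded, but the graph construction is the same.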

Quantitative Measures for Network Analysis:

Centrality: A measure used to identify which nodes/traders are the biggest influencers of the network. The different types of centrality in analyzing the network are given as follows (Reference: https://sctr7.com/2013/06/17/adopting-analytics-culture-6-what-information-is-gained-from-social-network-analysis-6-of-7/):

Degree: Measures the number of direct connections a node has (in a directed network, in-degree counts incoming connections)
Closeness: Measures how quickly (in how few steps) a trader can reach the others in the network
Eigenvector: Measures a trader’s connection to those who are highly connected. A person with a high score will be someone who is influencing multiple players (who in turn are highly connected) and is exercising control behind the scenes.
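The three measures above map directly onto NetworkX functions. A small sketch on a toy graph (participant names are made up; the article's actual code is on the GitHub repo linked above):

```python
import networkx as nx

# Toy chat network: P1 talks to everyone, the rest talk less.
G = nx.Graph([("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3")])

degree = nx.degree_centrality(G)        # fraction of other nodes each node touches
closeness = nx.closeness_centrality(G)  # inverse of average shortest-path distance
eigen = nx.eigenvector_centrality(G)    # score boosted by well-connected neighbours

# P1 is connected to every other participant, so it tops the rankings.
print(max(degree, key=degree.get))  # P1
```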

The following image shows the values for the three types of centrality mentioned above, and also the supporting Python code:

Code Block 2: Centrality Metric calculation and creating an image that represents it.
Figure 2: Centrality Metrics for the Data and Top 10 Most Influential Participants

Based on the graphs above, we observe that some of the most influential participants are P1, P12, P16, P29, P44 and P63.

Algorithms for Community Detection for the Data:

In this article we have concentrated on the visual representation of communities using different algorithms. While quantitative measures have their own importance, a visual representation is strongly recommended in this area because the work can be easily integrated into the charting tools already popular across banks. Visualization is very commonly used within the trading community to analyze trading patterns for a particular asset class against benchmarks. Combining the visualization of trade data with chat data makes the exercise far more rigorous.

Imagine a scenario where we score the number of chat messages exchanged between each pair of traders (nodes) and repeat this exercise for the complete network landscape. A score that compares how much denser the connections inside candidate communities are than we would expect by chance is referred to as modularity. If we form communities based on connectivity and modularity and run the exercise across the landscape, we can observe communities: groups of traders (nodes) who exchange far more messages among themselves than with the rest of the world. The purpose is to find tightly knit communities of nodes with only sparse ties between different communities.
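To make the modularity idea concrete, NetworkX can score any candidate partition directly. The toy graph below (two triangles joined by a single bridge, made up for illustration) shows that a partition respecting the tight groups scores much higher than one that splits them:

```python
import networkx as nx
from networkx.algorithms.community import modularity

# Two tightly knit triangles joined by a single bridge edge (2, 3).
G = nx.Graph([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])

# Partitioning along the bridge keeps each triangle intact...
good = modularity(G, [{0, 1, 2}, {3, 4, 5}])
# ...while a partition that splits the triangles scores poorly.
bad = modularity(G, [{0, 3}, {1, 4}, {2, 5}])

print(round(good, 3), round(bad, 3))
```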

Community detection algorithms can be of multiple types with varying levels of success. We have used three popular types of community detection algorithms to better understand the network:

1. Louvain Algorithm for Community Detection:

This algorithm works on the principle of partitioning a network into mutually exclusive communities such that the number of edges across different communities is significantly less than expected, whereas the number of edges within each community is significantly greater than expected. The Louvain algorithm is one of the most widely used for identifying communities due to its speed and high modularity. Modularity values can span from -1 to 1, and the higher the value, the better the community structure that is formed.

We performed the Louvain algorithm on this dataset, and the results are given in Figure 3. As we see, we have 46 communities and a modularity of 0.953, which is a pretty good solution. We also see a few communities with more than 3 members, and some of the most influential people are in those communities. For example, P1, P12, P16 and P44, who are among the most influential participants, are all in community 2. The following code block also shows the code used for this purpose:

Code Block 3: Running the Louvain Algorithm and creating a new graph to represent communities
Figure 3: Top 15 communities using the Louvain Algorithm
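A minimal sketch of the Louvain step on a toy graph. Note the version assumption: NetworkX ships `louvain_communities` only from version 2.8 onward; the original 2019 code on the GitHub repo likely used the separate python-louvain package instead.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities, modularity

# Toy network: two 4-cliques joined by a single edge (3, 4).
G = nx.Graph()
G.add_edges_from(nx.complete_graph(range(0, 4)).edges)
G.add_edges_from(nx.complete_graph(range(4, 8)).edges)
G.add_edge(3, 4)

# seed fixes the randomised node order so the run is repeatable.
parts = louvain_communities(G, seed=42)
print(len(parts), round(modularity(G, parts), 3))
```

On the article's real dataset the same call returns the 46 communities and 0.953 modularity quoted above.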

If we were to visualize all the non-overlapping communities in different colors, we would get the following image. As we can see in Example 1 and Example 2, there are members of different communities that converse with each other. In Example 1, we see six people across two communities, 9 and 38, with both inter-community and intra-community communication. The same conclusion holds true for communities 18 and 39.

Figure 4: Graphical representation of communities using the Louvain Algorithm

2. Girvan-Newman Algorithm:

This has four steps and can be given as follows:
a. The betweenness of all existing edges in the network is calculated first.
b. The edge with highest betweenness is removed.
c. The betweenness of all edges affected by the removal is recalculated.
d. Steps b. and c. are repeated until no edges remain.
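The four steps above are implemented by NetworkX's `girvan_newman` generator, which yields successively finer partitions as high-betweenness edges are removed. A sketch on a made-up bridge graph:

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman

# Two triangles joined by a bridge; the bridge edge (2, 3) has the
# highest betweenness, so it is the first edge removed (steps a-b).
G = nx.Graph([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])

# Each item yielded is the partition after the next community split;
# the first one is the two-community solution.
first_split = next(girvan_newman(G))
print(sorted(map(sorted, first_split)))  # [[0, 1, 2], [3, 4, 5]]
```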

The Girvan-Newman algorithm gives a very similar solution that is slightly inferior to the Louvain algorithm's, and it also runs slower. The modularity is a little lower, around 0.94 for this algorithm. The code block for the Girvan-Newman algorithm is quite similar to that for the Louvain algorithm and can be found at the GitHub link given at the beginning of this article. Figure 5 shows the community structure for the Girvan-Newman algorithm.

Figure 5: Top 15 Communities using the Girvan-Newman Algorithm

3. Maximal Clique Calculation:

Cliques are sub-graphs in which every node is connected to every other node. A node can be a member of more than one clique/community, so there is a sense of overlapping structure. Under the maximal cliques approach, we find cliques which are not sub-graphs of any other clique. The Bron-Kerbosch algorithm is well known in this context; we pick maximal cliques bigger than a minimum size (number of nodes).
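NetworkX's `find_cliques` implements a Bron-Kerbosch variant and yields only maximal cliques. A sketch on a made-up graph where one node sits in the overlap of two cliques:

```python
import networkx as nx

# A 4-clique {0, 1, 2, 3} plus a triangle {2, 3, 4} hanging off it;
# nodes 2 and 3 belong to both, illustrating the overlapping structure.
G = nx.Graph()
G.add_edges_from(nx.complete_graph(4).edges)
G.add_edges_from([(2, 4), (3, 4)])

# Only maximal cliques are returned: the triangle {0, 1, 2} is not,
# because it sits inside the 4-clique.
cliques = sorted(sorted(c) for c in nx.find_cliques(G))
print(cliques)  # [[0, 1, 2, 3], [2, 3, 4]]
```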

When run on this data, 79 cliques were formed, and the following figure shows the top 15 communities (overlapping) found using maximal cliques. We can see some communities have multiple influential people in them, such as cliques 40, 41 and 43. We can also see the interconnectedness between cliques, as we see 11 nodes all being a part of 8 overlapping cliques. This can be used to identify a sub-section of communities that are more closely connected than other sets of nodes.

Figure 6: Top 15 Communities using Maximal Clique calculation, and also the interconnectedness between cliques

Now, if we would like to view the interconnectedness between cliques for the complete network/dataset, we can see the image below, along with the supporting Python code:

Code Block 4: Performing the Maximal Clique calculation in NetworkX
Figure 7: Clique Representation for the complete dataset/network

Test Exercise: Real-World / Large-Scale Data:

In addition to the metrics and algorithms used above, we also looked at scenarios with large-scale simulated data that mimics real-world networks. This gives the user a better understanding of how these scenarios work, and how the complexity increases when the data is scaled up. The increase in the density of connections and the differences in the quality of solutions become evident. Figures 8, 9 and 10 show graphical representations of the community structure for this data.

Figure 8: Community Structure using Louvain Algorithm for Real-World Data

Figure 8 shows a case with approximately 1,100 nodes and 1,600 edges, representing the community structure found by the Louvain algorithm, which creates 164 communities with a modularity of 0.88.

Figure 9: Girvan-Newman Algorithm when applied to real-world data

Figure 9 shows the Girvan-Newman algorithm on the same dataset; we get a total of 140 communities with a modularity of 0.59, which is a worse result than the Louvain algorithm's (note that a larger modularity value indicates a better community structure).

Figure 10: Clique Calculation forming overlapping communities with real-world data.

For clique calculation on a highly dense and clustered graph, filtering for cliques larger than a certain threshold is helpful. This keeps only the most connected communities and can assist in filtering out nodes. Figure 10 shows only cliques that have 4 or more nodes, which gives us a set of dense and interconnected communities.
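The size filter is a one-line comprehension over the clique generator. A sketch on a made-up dense graph, using the same threshold of 4 as Figure 10:

```python
import networkx as nx

# Dense toy graph: a 5-clique, a 4-clique, a bridge and a stray edge.
G = nx.Graph()
G.add_edges_from(nx.complete_graph(range(0, 5)).edges)
G.add_edges_from(nx.complete_graph(range(5, 9)).edges)
G.add_edges_from([(4, 5), (9, 10)])

# Keep only maximal cliques with 4 or more nodes; the bridge {4, 5}
# and the stray pair {9, 10} are filtered out.
min_size = 4
big_cliques = [c for c in nx.find_cliques(G) if len(c) >= min_size]
print(len(big_cliques))  # 2
```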

Conclusion:

With the world increasingly networked, community detection and relationships across different nodes will be an interesting space to watch. Benchmarking the different community detection algorithms, namely the Louvain algorithm, the Girvan-Newman algorithm and clique-based algorithms, clearly shows that the first is far more efficient, especially with respect to finding like-minded nodes. However, usage and efficiency might differ from one domain to another depending on the use case. Market surveillance has been a space where false alerts lead to significant wastage of time, so innovative technological advances and research are very handy for reducing the false alert ratio. Our intent is to continue trying out new ideas to make market surveillance more robust and efficient.
