How to easily visualize your internal links with Python

Daniel Cupak
3 min read · Apr 23, 2019

Many tools can visualize a site's internal linking structure, but they are often either paid (e.g. Sitebulb) or free but not really easy to use (e.g. Gephi).

And that’s where Python comes in handy!

Step 1: Scrape your website

First, we start off by importing the necessary Python libraries and setting up the Chrome webdriver so we can also scrape dynamic web pages.

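from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
# Jupyter/IPython magic: render plots inline (remove this line in a plain .py script)
%matplotlib inline
# Selenium 4 takes the chromedriver path through a Service object
browser = webdriver.Chrome(service=Service(r"C:/Users/Daniel Čupak/Downloads/chromedriver_win32/chromedriver.exe"))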

Once the libraries are imported and the Chrome driver is set up, create two lists. The first, list_urls, holds all the pages you’d like to scrape (it could be loaded from a .csv file of URLs, but turning a .csv file into a list is beyond the scope of this article). The second, links_with_text, starts out empty; you’ll append the links from each page to it.

list_urls = ["https://www.creativedock.com/", "https://www.creativedock.com/about-us"]
links_with_text = []

Use the BeautifulSoup library to extract the hyperlinks that are relevant for Google, i.e. <a> tags with an href attribute. Because the loop collects every <a> tag, the result is not only a list of links: it also reveals links that are missing the href attribute and need fixing (see the picture further below, where “to” is None).

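for url in list_urls:
    browser.get(url)  # load the page in Chrome so JavaScript-rendered links are included
    soup = BeautifulSoup(browser.page_source, "html.parser")
    for line in soup.find_all('a'):
        href = line.get('href')  # None when the <a> tag has no href
        links_with_text.append([url, href])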
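
The scraped hrefs can be relative (e.g. /about-us), point to other domains, or be missing altogether. If only internal links should end up in the graph, they could be normalized roughly like this. It is a minimal sketch and not part of the original workflow: the helper name internal_links and the sample hrefs are made up for illustration.

```python
from urllib.parse import urljoin, urlparse


def internal_links(page_url, hrefs):
    """Resolve hrefs against page_url and keep only same-host http(s) links.

    Hypothetical helper for illustration; not part of the original article.
    """
    host = urlparse(page_url).netloc
    result = []
    for href in hrefs:
        if not href:  # skip None (missing href) and empty strings
            continue
        absolute = urljoin(page_url, href)  # turns "/about-us" into a full URL
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc == host:
            result.append(absolute)
    return result


# Made-up sample hrefs, as they might come out of the scraping loop
sample = [
    "/about-us",
    None,
    "https://www.creativedock.com/",
    "https://example.com/",
    "mailto:hi@example.com",
]
links = internal_links("https://www.creativedock.com/", sample)
print(links)
```

Note that this drops the None entries, so it is probably best applied only after the links with a missing href have been reported.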

Step 2: Turn the URLs into a DataFrame

Once you have the list of pages and their hyperlinks, turn it into a Pandas DataFrame with the columns “from” (the URL where the link resides) and “to” (the link’s destination URL). You can then save it, e.g. as an Excel file, and give it to your webmaster to fix the incomplete links.

df = pd.DataFrame(links_with_text, columns=["from", "to"])
Pandas DataFrame with columns “from” and “to”
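
To separate the incomplete links from the rest, the DataFrame could be split on whether “to” is missing. The sketch below uses a few made-up rows in place of the scraped list, and the variable names missing_href and complete are assumptions for illustration.

```python
import pandas as pd

# Made-up rows standing in for the scraped links_with_text list
links_with_text = [
    ["https://www.creativedock.com/", "https://www.creativedock.com/about-us"],
    ["https://www.creativedock.com/", None],
    ["https://www.creativedock.com/about-us", "https://www.creativedock.com/"],
]
df = pd.DataFrame(links_with_text, columns=["from", "to"])

# Links without an href: the rows to hand over for fixing
missing_href = df[df["to"].isna()]

# Complete links only: the rows that describe an actual from/to connection
complete = df.dropna(subset=["to"])

print(len(missing_href), len(complete))
```

Calling missing_href.to_excel("missing_links.xlsx", index=False) would then write the report to an Excel file (the file name is only an example, and pandas needs an Excel writer such as openpyxl installed for this). Using only the complete rows for the graph step also avoids having None show up as a node, which some NetworkX versions reject.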

Step 3: Draw a graph

Finally, use this DataFrame to visualize the internal link structure: feed it to the NetworkX function from_pandas_edgelist, then draw the result by calling nx.draw.

GA = nx.from_pandas_edgelist(df, source="from", target="to")
nx.draw(GA, with_labels=False)
Simple graph of nodes (URLs) and edges (links)

Because we want to see where the links point and how many links point to a given page, we pass create_using=nx.DiGraph() to from_pandas_edgelist and node_size=bigger_nodes to nx.draw, where bigger_nodes holds one size per node so that some pages are drawn larger than others.

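G = nx.from_pandas_edgelist(df, 'from', 'to', create_using=nx.DiGraph())
# bigger_nodes is not defined in the original snippet; as an assumption, size each node by its number of incoming links
bigger_nodes = [100 * (G.in_degree(n) + 1) for n in G.nodes()]
nx.draw(G, with_labels=False, node_size=bigger_nodes, alpha=0.4, arrows=True, pos=nx.spring_layout(GA))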
Directed graph with arrows and bigger and smaller nodes
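
The picture gives a quick impression, but the same directed graph can also answer the “how many links point to this page” question as plain numbers. The following is a minimal sketch with a made-up edge list (the /contact URL is invented for the example); it ranks pages by their number of incoming links.

```python
import networkx as nx
import pandas as pd

# Made-up edge list in the same "from"/"to" shape as the scraped DataFrame
df = pd.DataFrame(
    [
        ["https://www.creativedock.com/", "https://www.creativedock.com/about-us"],
        ["https://www.creativedock.com/", "https://www.creativedock.com/contact"],
        ["https://www.creativedock.com/about-us", "https://www.creativedock.com/contact"],
    ],
    columns=["from", "to"],
)
G = nx.from_pandas_edgelist(df, "from", "to", create_using=nx.DiGraph())

# in_degree() yields (node, number of incoming edges) pairs
ranking = sorted(G.in_degree(), key=lambda pair: pair[1], reverse=True)
for url, count in ranking:
    print(count, url)
```

Keep in mind that a DiGraph stores each from/to pair only once, so several links between the same two pages count as a single incoming link here.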
