How to easily visualize your internal links with Python?
Plenty of tools can visualize a site's internal linking structure, but they tend to be either paid (e.g. Sitebulb) or free but not really easy to use (e.g. Gephi).
And that’s where Python comes in handy!
Step 1: Scrape your website
First, we start off by importing the necessary Python libraries and setting up the Chrome webdriver so we can also scrape dynamic web pages.
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
%matplotlib inline
browser = webdriver.Chrome(r"C:/Users/Daniel Čupak/Downloads/chromedriver_win32/chromedriver.exe")
Once you have imported all the libraries and set up the Chrome driver, create two lists. The first list, named list_urls, holds all the pages you’d like to scrape (it could just as well be a list of URLs loaded from a .csv file, but turning a .csv file into a list is beyond the scope of this article). The second list, links_with_text, starts out empty; you’ll append the links from each page to it.
list_urls = ["https://www.creativedock.com/", "https://www.creativedock.com/about-us"]
links_with_text = []
Use the BeautifulSoup library to extract the hyperlinks that matter to Google, i.e. <a> tags with an href attribute. Because the loop collects every <a> tag on the page, the resulting list also reveals links that are missing the href attribute and need fixing (see picture later below where “to” is None).
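for url in list_urls:
    # Load the page in Chrome so JavaScript-rendered links are included
    browser.get(url)
    soup = BeautifulSoup(browser.page_source, "html.parser")
    for line in soup.find_all('a'):
        # href is None when the <a> tag has no href attribute
        href = line.get('href')
        links_with_text.append([url, href])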
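The loop above stores every href exactly as it appears in the page, so relative paths and links to other websites end up in the same list. If you only care about internal links, one possible extra step is to resolve each href against the page URL and keep only those on the same host. This is a minimal sketch using only the standard library; the helper name internal_links and the sample hrefs are made up for illustration and are not part of the original script.

```python
from urllib.parse import urljoin, urlparse


def internal_links(page_url, hrefs):
    """Return absolute URLs from hrefs that share the host of page_url.

    Hypothetical helper: missing hrefs (None) are skipped here, so keep
    the raw list as well if you want to report them.
    """
    host = urlparse(page_url).netloc
    result = []
    for href in hrefs:
        if href is None:
            continue
        # Turn relative paths such as "/about-us" into absolute URLs
        absolute = urljoin(page_url, href)
        if urlparse(absolute).netloc == host:
            result.append(absolute)
    return result


# Sample hrefs, invented for this example
sample_hrefs = [
    "/about-us",
    "https://www.creativedock.com/",
    "https://example.com/",
    None,
    "mailto:hello@example.com",
]
print(internal_links("https://www.creativedock.com/", sample_hrefs))
```

Note that this compares hosts literally, so "creativedock.com" and "www.creativedock.com" would count as different sites; adjust the comparison if your website uses both.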
Step 2: Turn the URLs into a DataFrame
Once you have the list of pages and their hyperlinks, turn it into a Pandas DataFrame with the columns “from” (the URL where the link resides) and “to” (the link's destination URL). You can then save it as, e.g., an Excel file and hand it to your webmaster to fix the incomplete links.
df = pd.DataFrame(links_with_text, columns=["from", "to"])
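To hand over only the incomplete links, you can split the DataFrame into rows where “to” is missing and rows that are safe to use as graph edges. The sketch below uses a few made-up rows instead of real scraped data, and the variable names missing_href and edges are assumptions for illustration. Dropping the missing rows before drawing is also worthwhile because NetworkX may reject None as a node.

```python
import pandas as pd

# Hypothetical scraped rows; the second one mimics an <a> tag without href
links_with_text = [
    ["https://www.creativedock.com/", "https://www.creativedock.com/about-us"],
    ["https://www.creativedock.com/", None],
    ["https://www.creativedock.com/about-us", "https://www.creativedock.com/"],
]
df = pd.DataFrame(links_with_text, columns=["from", "to"])

# Rows whose link has no destination: these are the ones to report
missing_href = df[df["to"].isna()]

# Rows with a destination, without exact duplicates: usable as graph edges
edges = df.dropna(subset=["to"]).drop_duplicates()

# Render the report as CSV text; pass a file path instead to write a file
report = missing_href.to_csv(index=False)
print(report)
```

For an Excel file, missing_href.to_excel("missing_href.xlsx", index=False) does the same job, provided an Excel writer such as openpyxl is installed.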
Step 3: Draw a graph
Finally, use the DataFrame to visualize the internal link structure: feed it to the NetworkX function from_pandas_edgelist to build a graph, then draw it by calling nx.draw.
GA = nx.from_pandas_edgelist(df, source="from", target="to")
nx.draw(GA, with_labels=False)
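Because we want to see which way each link points and how many links point to a given page, we pass create_using=nx.DiGraph() to from_pandas_edgelist and node_size=bigger_nodes to nx.draw. Here bigger_nodes has to be defined beforehand as a list with one size per node, so that pages with more inbound links are drawn larger. The layout is still computed from the undirected graph GA, which contains the same nodes.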
G = nx.from_pandas_edgelist(df, 'from', 'to', create_using=nx.DiGraph())
nx.draw(G, with_labels=False, node_size=bigger_nodes, alpha=0.4, arrows=True, pos=nx.spring_layout(GA))
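The snippets above use bigger_nodes without showing how it is built. One plausible way, and only an assumption about what was intended, is to scale each node by its in-degree, i.e. the number of links pointing to that page. The sketch below uses a tiny made-up edge list, and the constants base_size and per_link are arbitrary values chosen for illustration.

```python
import networkx as nx
import pandas as pd

# Made-up internal links standing in for the scraped DataFrame
df = pd.DataFrame(
    [
        ["/", "/about-us"],
        ["/", "/contact"],
        ["/about-us", "/contact"],
        ["/contact", "/"],
    ],
    columns=["from", "to"],
)
G = nx.from_pandas_edgelist(df, "from", "to", create_using=nx.DiGraph())

# Assumed sizing rule: a base size plus a bonus for every inbound link
base_size, per_link = 100, 300
bigger_nodes = [base_size + per_link * G.in_degree(node) for node in G.nodes()]

# Sizes follow the order of G.nodes(), which is what nx.draw expects
for node, size in zip(G.nodes(), bigger_nodes):
    print(node, size)
```

With the list in place, the nx.draw call from the article works as written. Pages that many other pages link to, typically the homepage, stand out as the largest circles.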