The joy of coding Python should be in seeing short, concise, readable classes that express a lot of action in a small amount of clear code — not in reams of trivial code that bores the reader to death.

Selenium project with Python: scraping LinkedIn connection information.

Dhiraj kadam
3 min read · Feb 17, 2018

--

A few months ago, I was assigned the task of finding a way to scrape LinkedIn connections data. As a fresh computer science graduate, this was an exciting task for me since it felt like a real project, not the boring practicals from my college days.

I started researching the techniques people use to scrape the web and came across various web crawling frameworks such as Scrapy, PySpider, MechanicalSoup, Cola, and Selenium.

After some study, I came to the conclusion that most web crawling frameworks are based on the following approaches:

  • Text pattern matching
  • DOM parsing
  • HTML parsing
  • HTTP programming
  • Vertical aggregation
  • Computer vision web-page analysis

I chose Selenium since it is an industry standard, and I found its documentation well explained and easy to start with.

Selenium is designed for automated software testing by simulating a user's actions on web pages. Since it supports DOM and HTML parsing, it is well suited for my project.
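To make the DOM/HTML-parsing idea concrete, here is a minimal offline sketch with BeautifulSoup (which this post also uses later). The markup below is made up for illustration and is not LinkedIn's actual HTML:

```python
from bs4 import BeautifulSoup

# Made-up HTML snippet standing in for a page a browser would render
html = """
<div class="profile">
  <h1 class="name">Jane Doe</h1>
  <span class="title">Data Analyst</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# Navigate the parsed DOM by class name
name = soup.find(class_="name").get_text()
title = soup.find(class_="title").get_text()
print(name, "-", title)
```

Selenium renders the page (including JavaScript-generated content) and hands you the live DOM, which you can then parse the same way.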

Here is the Python script in action.

LinkedIn crawler demo

Let me take you through my code:

  • Create a file called app.py and add the imports:
from selenium import webdriver
from bs4 import BeautifulSoup
import getpass
import requests
from selenium.webdriver.common.keys import Keys
import pprint
  • Initialize variables to store the username/email and password, and set up the Chrome driver:
userid = str(input("Enter email address or number with country code: "))
password = getpass.getpass('Enter your password:')
chrome_path = './chromedriver'
driver = webdriver.Chrome(chrome_path)
  • Now, let's fire up a browser instance and load the LinkedIn landing page:
driver.get("https://www.linkedin.com")
# This opens a new browser instance and navigates to www.linkedin.com
  • Once the page is loaded, the next task is the login process. We automate it with Selenium by using the XPath of the email and password input boxes. You can easily get the XPath of any element by inspecting it in the browser's developer tools.
driver.implicitly_wait(6)
driver.find_element_by_xpath("""//*[@id="login-email"]""").send_keys(userid)
driver.find_element_by_xpath("""//*[@id="login-password"]""").send_keys(password)
driver.find_element_by_xpath("""//*[@id="login-submit"]""").click()
driver.get("{Your connection profile link}") # Enter any of your connections' profile links
connectionName = driver.find_element_by_class_name('pv-top-card-section__name').get_attribute('innerHTML')
print(connectionName)
  • You can use the find_element_by_xpath() method to target any valid HTML element by its class name and perform the available actions on it. For more methods, see the Selenium documentation.
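XPath expressions like the ones above can be tried out offline before pointing Selenium at the live page. Here is a small sketch using Python's standard-library ElementTree, which supports a limited XPath subset; the form markup is made up to mimic a login page, not LinkedIn's real HTML:

```python
import xml.etree.ElementTree as ET

# Made-up markup mimicking a login form
html = """
<form>
  <input id="login-email" type="text" />
  <input id="login-password" type="password" />
  <button id="login-submit">Sign in</button>
</form>
"""

root = ET.fromstring(html)
# ElementTree understands simple predicates like [@id='...']
email_box = root.find(".//*[@id='login-email']")
submit = root.find(".//*[@id='login-submit']")
print(email_box.get("type"))
print(submit.text)
```

In Selenium the same `//*[@id="..."]` expression selects the live element, on which you can then call send_keys() or click().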
  • Now, let's write some code to crawl your connection's profile page for their email address and phone number:
driver.find_element_by_css_selector('button.contact-see-more-less').click()
content = driver.find_element_by_css_selector(".pv-profile-section.pv-contact-info.artdeco-container-card.ember-view")
data = BeautifulSoup(content.get_attribute('innerHTML'), "lxml")
driver.find_element_by_tag_name('body').send_keys(Keys.CONTROL + 't') # open a new browser tab
for section in data.find_all('section'):
    for header in section.find_all('header'):
        if header.contents[0] == 'Email':
            print("Email Address: " + section.a.contents[0])
        if header.contents[0] == 'Phone':
            print("Phone Number: " + section.a.contents[0])
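The parsing loop can be tested offline against a hand-written snippet that mimics the section/header structure of the contact-info card (the real LinkedIn markup differs and changes often):

```python
from bs4 import BeautifulSoup

# Hand-written HTML mimicking the contact-info card's layout
html = """
<div>
  <section><header>Email</header><a href="mailto:jane@example.com">jane@example.com</a></section>
  <section><header>Phone</header><a href="tel:+10000000000">+1 000 000 0000</a></section>
</div>
"""

data = BeautifulSoup(html, "html.parser")
contacts = {}
for section in data.find_all('section'):
    for header in section.find_all('header'):
        label = header.contents[0]          # 'Email' or 'Phone'
        contacts[label] = section.a.contents[0]

print(contacts)
```

Collecting the results into a dict, as sketched here, makes them easier to save later than printing them directly.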
  • The final app.py should look like this:
from selenium import webdriver
from bs4 import BeautifulSoup
import getpass
import requests
from selenium.webdriver.common.keys import Keys
import pprint
userid = str(input("Enter email address or number with country code: "))
password = getpass.getpass('Enter your password:')
chrome_path = './chromedriver'
driver = webdriver.Chrome(chrome_path)
driver.get("https://www.linkedin.com")
driver.implicitly_wait(6)
driver.find_element_by_xpath("""//*[@id="login-email"]""").send_keys(userid)
driver.find_element_by_xpath("""//*[@id="login-password"]""").send_keys(password)
driver.find_element_by_xpath("""//*[@id="login-submit"]""").click()
driver.get("{Your connection profile link}") # Enter any of your connections' profile links
connectionName = driver.find_element_by_class_name('pv-top-card-section__name').get_attribute('innerHTML')
print(connectionName)
driver.find_element_by_css_selector('button.contact-see-more-less').click()
content = driver.find_element_by_css_selector(".pv-profile-section.pv-contact-info.artdeco-container-card.ember-view")
data = BeautifulSoup(content.get_attribute('innerHTML'), "lxml")
driver.find_element_by_tag_name('body').send_keys(Keys.CONTROL + 't') # open a new browser tab
for section in data.find_all('section'):
    for header in section.find_all('header'):
        if header.contents[0] == 'Email':
            print("Email Address: " + section.a.contents[0])
        if header.contents[0] == 'Phone':
            print("Phone Number: " + section.a.contents[0])

Now we are all set to fire up the script from the command line and test it :)

You can extend this script to first collect all the profile links by crawling the My Network page, then run a for loop to visit each profile and scrape its data with the method discussed above. The results can then be saved to a CSV file.
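A minimal sketch of the CSV-saving step, using only the standard library; the connection data here is made up, and in practice each row would come from one scraped profile:

```python
import csv

# Made-up rows; in the real script each row would come from one scraped profile
connections = [
    {"name": "Jane Doe", "email": "jane@example.com", "phone": "+1 000 000 0000"},
    {"name": "John Roe", "email": "john@example.com", "phone": "+1 111 111 1111"},
]

# DictWriter maps each dict to one CSV row, with a header line first
with open("connections.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "email", "phone"])
    writer.writeheader()
    writer.writerows(connections)
```

Appending one row per scraped profile inside the crawl loop keeps partial results even if the script is interrupted mid-run.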

NOTE: LinkedIn provides an option in your profile settings to export your connections data in a single click. This script is meant purely as a web crawling demonstration.

Happy coding :)

Thank you.


Dhiraj kadam

Data Analytics and Full Stack Development enthusiast.