Scraping Data From LinkedIn: Pro Scraper’s Guide + Code
Let’s Clear It Up First: Is It Legal?
Scraping data from LinkedIn: is it within the law or not?
LinkedIn is a business-focused social networking platform that has grown into a vital resource for connecting with others and developing professional networks. Given the enormous amount of data available on LinkedIn, some people and businesses are interested in scraping information from the site for their own purposes.
So, naturally, LinkedIn as a business applies certain rules to safeguard its users’ data, both publicly visible and private. Its rules forbid using any automated method to gather data from the platform, including scraping, crawling, data mining, and so on.
LinkedIn employs a number of technical safeguards, including rate limits, CAPTCHAs, and IP blocking, to restrict unauthorized access to its platform and automated data scraping. These measures are meant to ensure that only authorized users have access to the platform and its data.
Public Domain
Here’s the deal, though: LinkedIn’s corporate policies do not make scraping LinkedIn unethical, illegal, or dangerous, as long as you’re scraping data that is in the public domain. EU and US law does not forbid scraping publicly available data, even if a company’s own policy is against it.
While U.S. courts have recently narrowed the laws against accessing online databases, they have not directly addressed scraping. That means it remains legal to go onto LinkedIn itself and look up every piece of publicly available information you can find about somebody. Even so, we recommend always working within the boundaries of the law.
So, remember: whatever you scrape should always be openly accessible to everyone.
Using Selenium for Scraping Data From LinkedIn
Web scraping is a powerful method for gathering data from websites, and there are many technologies that can be used to perform it. One popular automation tool for the job is Selenium. Its ability to interact with web pages, simulate user behavior, and automate repetitive operations makes it an effective web scraping tool.
Set Up Selenium On Your Computer
To use Selenium with Python, you’ll need to have Python installed on your computer. You can download Python from the official Python website. Once you have Python installed, install the Selenium package by running pip install selenium in a command prompt or terminal window.
Importing the Web Driver
Selenium requires a web driver to interact with web pages. You can download the web driver for your preferred web browser from the official Selenium website. Once you’ve downloaded the web driver, you’ll need to specify its location in your code by adding a few lines of code at the beginning of your script.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# point Selenium at the downloaded driver binary
# (in Selenium 4, the path is passed through a Service object)
driver = webdriver.Chrome(service=Service('/path/to/chromedriver'))
With your web driver set up, you’re ready to start writing your web scraping code. When you’re finished, be sure to close the web driver by adding a line of code at the end of your script to free up system resources.
driver.quit()
Example
Here’s a Python example of using Selenium to scrape data from a table on a web page:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://example.com/table')

# Find the table element
table = driver.find_element(By.CSS_SELECTOR, 'table')

# Get the column headers from the th cells
headers = [header.text for header in table.find_elements(By.CSS_SELECTOR, 'th')]

# Get the data rows: one tr per row, one td per cell
rows = []
for row in table.find_elements(By.CSS_SELECTOR, 'tbody tr'):
    rows.append([cell.text for cell in row.find_elements(By.CSS_SELECTOR, 'td')])

driver.quit()
print(headers)
print(rows)
In this example, we begin by opening the web page that contains the table we want to scrape. Then, we locate the table element and use Selenium to retrieve the column headers and data rows. The column headers come from the table’s “th” cells, whose text is extracted into a list.
The data rows are obtained by locating every “tr” element in the table body and reading the text of each of its “td” child elements. The web driver is then shut down, and the column headers and data rows are printed.
Scraping Data From LinkedIn with Python
To begin, we first import the necessary libraries, such as pandas and Selenium. We then use the Selenium library to scrape data from LinkedIn. The following pieces of the Selenium library are used in this code: webdriver, By, WebDriverWait, expected_conditions, Options, and Keys, along with sleep from the time module.
It is important to note that web scraping requires a web driver. In this example, we will use the Chrome browser, and the Chrome driver is downloaded from the internet.
import csv
import random
import time
from time import sleep

import pandas as pd
import requests
from bs4 import BeautifulSoup
from parsel import Selector
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

opts = Options()
# the chromedriver binary is expected to sit next to the script
driver = webdriver.Chrome(service=Service('chromedriver'), options=opts)
The code starts with a function called validate_field, which checks whether a field exists on the page. If the field is present, its value is kept; if it is missing, the function fills in a placeholder instead.
# function to ensure all key data fields have a value
def validate_field(field):
    # if the field is present, keep it as-is
    if field:
        pass
    # if the field is missing, substitute a placeholder
    else:
        field = 'No results'
    return field
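For example, once a value has been scraped, it can be passed through this helper before being stored. A minimal usage sketch; the values here are just for illustration:

job_title = validate_field('Python Developer')   # returns 'Python Developer'
company = validate_field('')                     # empty, so returns 'No results'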
We then use the driver.get method to navigate to the LinkedIn official site, followed by the driver.find_element method to locate the email and password fields.
To locate each login input, we use the ID of its container. The ID for the email field is “session_key”, and for the password field it is “session_password”. Passing these IDs to driver.find_element(By.ID, ...) correctly identifies both fields, so we can type the required credentials into each one.
# driver.get() navigates to the page given by the URL address
driver.get('https://www.linkedin.com')

# locate the email field by its ID
username = driver.find_element(By.ID, 'session_key')
# send_keys() simulates keystrokes
username.send_keys('*****@gmail.com')
# sleep for 0.5 seconds
sleep(0.5)

# locate the password field by its ID
password = driver.find_element(By.ID, 'session_password')
# send_keys() simulates keystrokes
password.send_keys('*******')
sleep(0.5)

# locate the submit button by XPath
sign_in_button = driver.find_element(By.XPATH, '//*[@type="submit"]')
# click() mimics a button click
sign_in_button.click()
sleep(10)
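Note that the script imports WebDriverWait and expected_conditions but relies on fixed sleep() calls. A more robust option is sketched below: wait explicitly until an element confirms the page has loaded. The 15-second timeout and the element chosen here (the global nav search bar) are our own assumptions, not requirements:

# optional: replace the fixed sleep(10) with an explicit wait
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.XPATH, '//*[@id="global-nav-typeahead"]/input'))
)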
After successfully logging in to LinkedIn, we need to find the profile links for users in a particular field. Suppose we want to find Python developers. We could search on Google and get some LinkedIn results, but since we only want the users’ profile links, we search on LinkedIn itself and extract them.
# Task 2: Search for the profiles we want to crawl
# Task 2.1: Locate the search bar element
search_field = driver.find_element(By.XPATH, '//*[@id="global-nav-typeahead"]/input')
# Task 2.2: Type the search query into the search bar
# search_query = input('What profile do you want to scrape? ')
search_field.send_keys('Python Developers')
# Task 2.3: Search
search_field.send_keys(Keys.RETURN)

# locate the "People" filter button by XPath
people = driver.find_element(By.XPATH, '//*[@id="search-reusables__filters-bar"]/ul/li[1]/button')
# click() mimics a button click
people.click()
sleep(2)
For this, we use a search URL that returns only the Python developers from a specific place. Once we navigate to it with driver.get, we move through each results page, pausing between requests, and retrieve the profile links from every page.
# Task 3: Scrape the URLs of the profiles
profiles = driver.find_elements(By.CLASS_NAME, 'app-aware-link')
all_profile_URL = []
for profile in profiles:
    # get_attribute('href') already returns an absolute URL
    profile_URL = profile.get_attribute('href')
    if profile_URL not in all_profile_URL:
        all_profile_URL.append(profile_URL)
print('- Finish Task 3: Scrape the URLs')
Output
Now let’s extract the required data from the scraped profile links.
To collect the links, we used the element.get_attribute("href") method. With each link in hand, we can navigate to the user’s page and scrape the necessary data.
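As a minimal sketch of that last step, the loop below visits each collected URL and reads the profile name and headline. The CSS selectors are illustrative assumptions; LinkedIn changes its markup often, so inspect the live page and adjust them before use:

# hypothetical sketch: visit each collected profile and pull basic fields
records = []
for profile_URL in all_profile_URL:
    driver.get(profile_URL)
    sleep(random.uniform(2, 5))  # human-like pause between profiles
    sel = Selector(text=driver.page_source)
    # 'h1' and 'div.text-body-medium' are assumed selectors; verify them first
    name = sel.css('h1::text').get()
    headline = sel.css('div.text-body-medium::text').get()
    records.append({
        'url': profile_URL,
        'name': validate_field(name),
        'headline': validate_field(headline),
    })
pd.DataFrame(records).to_csv('profiles.csv', index=False)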
In conclusion, web scraping can be an effective way to collect data from LinkedIn. Selenium and Python provide a powerful combination of tools for web scraping, enabling users to retrieve valuable data from the platform.
Using GoLogin for Safe Work
GoLogin is an online platform that gives web developers a way to evade detection by imitating real user activity. Because GoLogin can create and manage multiple browser profiles with distinct identities (user agents, fingerprints, and IP addresses), it is challenging for websites to recognize the access as automated.
The platform has an intuitive user interface, works with a large variety of browsers and devices, and includes advanced fingerprinting technology that can help developers avoid IP bans and CAPTCHA challenges.
With GoLogin’s fingerprinting technology, developers can simulate a real user’s digital fingerprint. This makes it practically impossible for websites to notice automated access, so routine automated tasks do not trigger any alarms.
The platform also offers proxy integration, session management, and API support, all of which can speed up and automate the development process.
How to set up and use GoLogin for web scraping?
Step 1: Create an account
Making an account on GoLogin’s website is the first step in using the service. Go to the GoLogin website and sign up with your email address. After creating an account, you can log in to the platform and begin configuring your browser profiles.
Step 2: Set up a browser profile
GoLogin uses a browser profile as a distinct identity to simulate real user behavior. Choose the browser you want to use, such as Google Chrome or Mozilla Firefox, and create a profile for it. The profile can then be customized by including user agents, fingerprints, and IP addresses. These features make the profile appear more authentic, lowering the chance of getting detected.
Step 3: Configure the proxy settings
You can modify the proxy settings for your browser profile to further lower the chance of detection. This gives every website you visit a distinct IP address, which makes it more challenging for them to track your online behavior.
Step 4: Start web scraping
You can begin web scraping after setting up your browser profile and proxy settings. You will need to write a web scraping script in a language like Python. The script should access the website through the GoLogin-created browser profile and extract the necessary data.
- Importing the required libraries:
The first modification was the import of the required libraries: platform from sys, the Selenium webdriver and its Options class, and gologin. This was done by adding the following lines of code to the top of the file:
from sys import platform
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from gologin import GoLogin
- Setting up GoLogin and Selenium WebDriver:
The second modification was the setup of GoLogin and the Selenium WebDriver, done by adding the following lines of code:
gl = GoLogin({
    'token': 'yU0token',
    'profile_id': 'yU0Pr0f1leiD',
})

# pick the chromedriver binary that matches the operating system
if platform == "linux" or platform == "linux2":
    chrome_driver_path = './chromedriver'
elif platform == "darwin":
    chrome_driver_path = './mac/chromedriver'
elif platform == "win32":
    chrome_driver_path = 'chromedriver.exe'

debugger_address = gl.start()
chrome_options = Options()
chrome_options.add_experimental_option("debuggerAddress", debugger_address)
driver = webdriver.Chrome(service=Service(chrome_driver_path), options=chrome_options)
This code configures GoLogin with the correct token and profile_id and selects the WebDriver path for the current platform. It then starts GoLogin and connects the WebDriver to the debugger address that GoLogin returns.
- Updating the code to use the WebDriver:
The final modification was updating the code to use the WebDriver for navigation and scraping: the driver object replaces the requests library for navigation, and driver.page_source replaces response.content for parsing.
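In practice, the substitution looks something like this minimal sketch (the URL is a placeholder); the GoLogin-managed browser fetches the page, and BeautifulSoup parses driver.page_source just as it would parse response.content:

from bs4 import BeautifulSoup

# before: response = requests.get(url)
#         soup = BeautifulSoup(response.content, 'html.parser')
# after: the GoLogin-managed browser fetches the page instead
driver.get('https://www.example.com/target-page')  # placeholder URL
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.title.string if soup.title else 'No title found')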
Step 5: Monitor your activity
To make sure that websites are not alerted to your web scraping activities, it is crucial to keep an eye on them.
GoLogin offers a variety of features to make it easier for you to keep tabs on your activities, including session management, API compatibility, and a dashboard that displays your IP address and current browser profile.
For LinkedIn
To scrape data from LinkedIn using GoLogin, you must first create a profile on the GoLogin platform. You can do this by creating a new browser profile, choosing particular browser settings, and adding any required plugins or extensions.
After creating the profile, the user can visit the LinkedIn website and log in with their credentials. They can then use GoLogin’s automation tools to mimic human actions like scrolling pages, clicking links, and entering data.
For example, a user might use GoLogin to search for users who have the job title “software engineer” and who are located in a specific city. They could then use automation tools to extract data such as each user’s name, job title, company, and location.
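Here is a hedged sketch of that workflow, reusing the driver opened through GoLogin above. The search URL parameters and the 'entity-result' class name are assumptions that should be verified against the live page before use:

import time
from urllib.parse import quote
from selenium.webdriver.common.by import By

# hypothetical sketch: search for software engineers and print result names
query = quote('software engineer')
driver.get('https://www.linkedin.com/search/results/people/?keywords=' + query)
time.sleep(5)  # give the results page time to render

for card in driver.find_elements(By.CLASS_NAME, 'entity-result'):
    lines = card.text.split('\n')  # raw visible text of the result card
    if lines:
        print(lines[0])            # the first line is usually the name

driver.quit()
gl.stop()  # stop the GoLogin profile session when finished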
Final Output Will Look Like:
Using multiple profiles to access LinkedIn can help users escape detection and lower their risk of being banned, which is one potential benefit of using GoLogin for LinkedIn scraping.
Keep in mind, however, that scraping data from LinkedIn without authorization may break the platform’s terms of service and lead to legal problems. It is therefore crucial to exercise caution and make sure that any scraping is carried out in an ethically and legally responsible way.
Tips for Safe Scraping Data From LinkedIn
Scraping LinkedIn data can provide valuable insights for businesses and researchers, but it’s important to do so ethically and without violating LinkedIn’s terms of service. To avoid getting banned, web developers should follow a few pro tips:
- Limit the frequency of requests to LinkedIn’s servers and set appropriate time intervals between each request; excessive traffic can trigger LinkedIn’s security systems and result in a ban (see the pacing sketch after this list).
- Remember that using bots or automated tools to scrape LinkedIn data violates the platform’s terms of service, so tread carefully.
- Mimic human behavior when scraping LinkedIn data. This can be achieved by using a web browser that is commonly used by humans and by making requests at a realistic pace.
- Avoid accessing private profiles or data that is not available to the public.
- Respect the privacy of LinkedIn users and do not use their data for unethical purposes. By following these best practices, web developers can scrape LinkedIn data ethically and without getting banned.
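As a minimal sketch of the first tip, requests can be paced with randomized delays so they do not arrive at a fixed, machine-like interval:

import random
import time

# pause for a random, human-like interval between consecutive page loads
def polite_pause(min_s=3.0, max_s=8.0):
    time.sleep(random.uniform(min_s, max_s))

# usage: call between navigations, e.g.
#   driver.get(first_url)
#   polite_pause()
#   driver.get(next_url)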
GoLogin provides a way to create and manage multiple browser profiles, which can be useful for avoiding detection by LinkedIn. By rotating through different browser profiles, it becomes more difficult for LinkedIn to detect patterns of scraping activity, which helps reduce risk.
Furthermore, GoLogin allows for IP rotation, which can further protect your work while scraping LinkedIn data. Overall, GoLogin can be a valuable tool for web developers looking to scrape LinkedIn data with far less risk.
Download GoLogin here and enjoy safe scraping with our free plan!