How This Easy Procedure Can Help You Build Your First LinkedIn Scraper in 2022


LinkedIn is a goldmine of information, ranging from huge job listings and opportunities to skilled candidates and high-profile leads. Since much of it is available to both users and non-users, anyone can access it from a laptop or mobile device. But what if we need broader access to this data? Today, we’d like to demonstrate how you can build and automate your own web scraping tool to extract information from LinkedIn posts using Selenium and Beautiful Soup.

Here are the prerequisites before we get down to building our very first web scraper:

  • Python – 3+ recommended
  • Beautiful Soup – a library that makes it simple to scrape data from web pages
  • Selenium – a package that automates web browser interaction from Python
  • Chrome Webdriver
  • Additional libraries such as pandas, re (regex) and time
  • A code editor of your choice – we will be using Jupyter Notebook

In the code editor, run ‘pip install selenium’ to install the Selenium library. The same goes for Beautiful Soup: ‘pip install beautifulsoup4’. You can install Chrome Webdriver by opening this link.
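If you’re following along in a Jupyter Notebook, you can run the installs directly from a cell; xlsxwriter is included here because the Excel export at the end of this article uses it as the engine:

# Run once in a notebook cell (drop the leading % to run the same command in a terminal)
%pip install selenium beautifulsoup4 pandas xlsxwriter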

Overview:

Here’s the complete list of the topics we’ll cover in this article.

  • How Selenium automates LinkedIn
  • How to use Beautiful Soup to extract data from LinkedIn profiles
  • Putting the data in a .csv or .xlsx file so it can be used later
  • Automation of the process to get posts from multiple authors in one go

We’ll start by importing everything we need:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup as bs
import re
import time
import pandas as pd

Other than Selenium and Beautiful Soup, we’ll be using the time library for functionality such as sleep, regex to help work with the authors’ posts, and pandas to handle large-scale data and write it into spreadsheets. A few things are required for Selenium to begin the automation process: a username and password to log in, and the location of the web driver on your computer. So let’s start by getting those and storing them in the corresponding variables PATH, USERNAME, and PASSWORD.

PATH = input("Enter the Webdriver path: ")
USERNAME = input("Enter the username: ")
PASSWORD = input("Enter the password: ")
print(PATH)
print(USERNAME)
print(PASSWORD)

Now, we’ll have to initialize our webdriver in a variable that Selenium will use to perform all the operations. Let’s name it driver and pass it the location of the webdriver.

driver = webdriver.Chrome(PATH)

Since we have given the location to our driver, now it’s time to give it the link it should fetch. In our case, that’s the LinkedIn login page.

driver.get("https://www.linkedin.com/login")

time.sleep(3)

You may have noticed the sleep function in the above code snippet. In general, sleep pauses the operation (in our case, the automation process) for the given number of seconds. You can use it anywhere you need to pause the process, for example when you have to get past a captcha verification.
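As a side note, since WebDriverWait and expected_conditions are already imported, you could also replace a fixed sleep with an explicit wait that pauses only until a specific element appears. A minimal sketch, assuming the login page has been opened:

# Wait up to 10 seconds for the username field to appear instead of sleeping blindly
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "username"))
)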

Now we log in with the credentials we have.

email = driver.find_element(By.ID, "username")
email.send_keys(USERNAME)
password = driver.find_element(By.ID, "password")
password.send_keys(PASSWORD)
time.sleep(3)
password.send_keys(Keys.RETURN)

Let’s now make a few lists to store data like the author of each post, the post’s content, and the profile links. They will be called post_names, post_texts, and post_links respectively, initialized as shown below. Once that’s done, we’ll start the actual web scraping process. Let’s declare Scrape_func as a scraping function that fetches data from multiple accounts in a loop.
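For reference, the three lists are just empty Python lists created before the function definition:

post_links = []   # profile links entered by the user
post_texts = []   # content of each scraped post
post_names = []   # author name for each scraped post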

def Scrape_func(a, b, c):
    # a = profile link, b = list for post texts, c = list for author names
    name = a[28:-1]          # strip "https://www.linkedin.com/in/" and the trailing slash
    page = a
    time.sleep(10)

    driver.get(page + 'detail/recent-activity/shares/')
    start = time.time()
    lastHeight = driver.execute_script("return document.body.scrollHeight")
    while True:
        # scroll to the bottom of the page to load more posts
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(5)
        newHeight = driver.execute_script("return document.body.scrollHeight")
        if newHeight == lastHeight:
            break
        lastHeight = newHeight
        end = time.time()
        if round(end - start) > 20:
            break

    company_page = driver.page_source

    linkedin_soup = bs(company_page.encode("utf-8"), "html.parser")
    linkedin_soup.prettify()
    containers = linkedin_soup.find_all("div", {"class": "occludable-update ember-view"})
    print("Fetching data from account: " + name)
    iterations = 0
    nos = int(input("Enter number of posts: "))
    for container in containers:
        try:
            text_box = container.find("div", {"class": "feed-shared-update-v2__description-wrapper ember-view"})
            text = text_box.find("span", {"dir": "ltr"})
            b.append(text.text.strip())
            c.append(name)
            iterations += 1
            print(iterations)

            if iterations == nos:
                break

        except:
            # skip posts that have no text body
            pass


n = int(input("Enter the number of entries: "))
for i in range(n):
    post_links.append(input("Enter the link: "))
for j in range(n):
    Scrape_func(post_links[j], post_texts, post_names)

driver.quit()

We know it’s a long piece of code, but you don’t have to worry, as we’ve got you covered. We’ll explain it in simple steps. Let’s first talk about the functionality of the function. It takes 3 arguments, the post link, the post texts list, and the post names list, which are a, b, and c respectively. We’ll now discuss how the function operates internally. It begins with the profile link and extracts the profile name by slicing off the first 28 characters (‘https://www.linkedin.com/in/’) and the trailing slash. Then we use the driver to fetch the posts section of the user’s profile and store the posts in containers using Beautiful Soup.
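To illustrate the slice with a hypothetical profile link:

link = "https://www.linkedin.com/in/some-user/"   # hypothetical example link
print(link[28:-1])                                 # -> some-user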

The timeout check near the end of the scrolling loop determines how long the driver is allowed to keep scrolling for posts. It is 20 seconds in our case, but you can adjust it to fit your data requirements.

if round(end - start) > 20:
    break

After repeating this process for each container, we extract the post data stored there and append it to our post_texts list, along with the author’s name in post_names. When the required number of posts is reached, the loop ends.

Well, that was our function! Now it’s time to put it to use.

To repeat data collection for all accounts, we obtain a list of the users’ profile links and pass each one to the function in a loop. The function fills two lists: post_names, which contains the author of each post, and post_texts, which contains the content of each post.

We’ve now reached the crucial point of our automation: Yup! It’s data saving.

data = {
    "Name": post_names,
    "Content": post_texts,
}
df = pd.DataFrame(data)
df.to_csv("gtesting2.csv", encoding='utf-8', index=False)
writer = pd.ExcelWriter("gtesting2.xlsx", engine='xlsxwriter')
df.to_excel(writer, index=False)
writer.save()

Using pandas, we’ve created a dictionary from the lists the function filled, turned it into a DataFrame, and saved it to the variable ‘df’.

You can save it either as an .xlsx file or a .csv file.
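If you’d like a quick sanity check on the export, you can read the CSV back with pandas, for example:

check = pd.read_csv("gtesting2.csv")
print(check.head())                # preview the first few scraped posts
print(len(check), "posts saved")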

Visit our blog section for more valuable content.
