The motivation for writing this blog came from a simple automation I built for a data collection task that had been done manually for the past year at my current workplace. Gathering data from the many resources and websites on the internet has always been a challenge. Simply put, a lot of reporting work involves situations where we have to collect data from a website. For example, in a use-case where we want to gather information about companies, we might visit a website, search for a company, and then collect data from the company’s information page.
Problem Description
In this blog-post, we will discuss one such use-case and describe how to build a bot that automates it using Selenium (web crawling) and Beautiful Soup (web scraping). Here is the problem statement along with the steps to be performed.
- Go to the website https://register.fca.org.uk/
- Search a company ID such as 310164, 307494, 305637, or 519675 in the register.
- Click on the “Search the Register” button. It may land on a ‘search results page’ OR a ‘firm details page’.
- If it lands on the search results page, check the “Status” column in the table to see whether the firm is “Authorised”. If it is, go to the “Name” column in the same row and click the company link, which carries the URL of the “firm details page”.
- If it lands on the “firm details page”, we get the URL for the firm details directly.
- Scrape the page at the URL obtained in either of the two previous steps. If the status of the firm on the “firm details page” is “Authorised”, extract the required information about the company.
- The details or attributes to be extracted are:
- Company name
- Address
- Phone
- Fax
- Website
- Authorization Status
Let us start developing a Python-based bot which will crawl to the “firm details page” and scrape the required information about the firm.
Disclaimer : The mention of any company names, trademarks, or data sets in this blog-post does not imply that we can or will scrape them. They are listed for illustration purposes and as general use cases only. Any code provided in this article is for learning purposes only; we are not responsible for how it is used.
1. Imports
import string
import pandas as pd
from lxml import html
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
2. Web crawling : Selenium
The selenium package is used to automate web browser interaction from Python. Selenium requires a driver to interface with the chosen browser. Chrome, for example, requires chromedriver, which needs to be installed before the examples below can be run. Make sure to provide its path in the Python code.
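As a side note, newer Selenium 4 releases pass the driver path through a Service object rather than as the first positional argument. Here is a minimal sketch, assuming chromedriver sits at a path of your choosing:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# path to the chromedriver binary -- replace with your own location
service = Service("/path/to/chromedriver")
driver = webdriver.Chrome(service=service)

driver.get("https://register.fca.org.uk/")
print(driver.title)   # quick sanity check that the page loaded
driver.quit()

The snippets in this post keep the older webdriver.Chrome("Path/chromedriver") form, so adapt them to whichever Selenium version you have installed.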
The code snippet below performs steps 1, 2, and 3: opening the website, typing a company ID into the search box, and clicking the “Search the Register” button.
fp = open("cid.txt", "r")
cids = fp.readlines()
cids = [ix.strip("\n") for ix in cids]

# global list of lists of extracted fields for all CIDs
info = []

# Looping through all the company IDs for information extraction
for cid in cids:
    # Setting up chrome driver for automated browsing
    driver = webdriver.Chrome("Path/chromedriver")
    # Site to browse
    driver.get("https://register.fca.org.uk/")
    # url : url retrieved after search button click
    url = ""
    # data : list for keeping all extracted attributes for particular CID
    data = []
    # placing the id on search box
    driver.find_element_by_id('j_id0:j_id1:j_id33:j_id34:registersearch:j_id36:searchBox').send_keys(cid)
    # Clicking the search button
    button = driver.find_element_by_id('j_id0:j_id1:j_id33:j_id34:registersearch:j_id36:j_id39')
    button.click()
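One caveat: the find_element_by_id helper used above was removed in Selenium 4, so on a recent installation the equivalent lookups go through the By locator that is already in our imports. A small sketch with the same element IDs:

# Selenium 4 style locators -- same element IDs as in the snippet above
search_box = driver.find_element(By.ID, 'j_id0:j_id1:j_id33:j_id34:registersearch:j_id36:searchBox')
search_box.send_keys(cid)

button = driver.find_element(By.ID, 'j_id0:j_id1:j_id33:j_id34:registersearch:j_id36:j_id39')
button.click()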
Now that we have searched for a company ID by clicking the button, we would like to check whether the browser lands on the “search results page” or the “firm details page”. Here, the new URL contains the substring “https://register.fca.org.uk/ShPo” when it is the “firm details page” and the substring “https://register.fca.org.uk/shpo_” when it is the “search results page”. We will also allow some waiting time, say 10 seconds, for the webpage to load. After that, we read the new URL.
try:
    # wait for max 10 secs to load the new URL, check if URL is "firm details page"
    if WebDriverWait(driver, 10).until(EC.url_contains("https://register.fca.org.uk/ShPo")):
        url = driver.current_url
        #print(url)
        driver.quit()
except TimeoutException:
    # wait for max 10 secs to load the new URL, check if URL is "search results page"
    if WebDriverWait(driver, 10).until(EC.url_contains("https://register.fca.org.uk/shpo_")):
        url = driver.current_url
        #print(url)
        driver.quit()
Now we have the URL of the page we landed on. We will use Beautiful Soup to scrape it.
We will now perform step 4: if the page is the search results page, check the “Status” column in the table to see whether the firm is “Authorised”. If it is, go to the “Name” column in the same row and pick up the company link, which carries the URL of the “firm details page”. We then pass that URL to the function parse_infopage(authorised_link, data), which parses the firm details page and extracts all the required fields about the company. If, on the other hand, the page is already the firm details page, its URL can be passed to parse_infopage(authorised_link, data) directly. The Python code snippet below does exactly that.
# scraping using Request package of urllib library
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()

# if url contains search result page
if "https://register.fca.org.uk/shpo_" in url:
    soup = BeautifulSoup(webpage.decode("utf-8"), "html.parser")
    flag = 0
    base_weblink = "https://register.fca.org.uk/"
    authorised_link = ""
    # find the table with id "SearchResults"
    for table in soup.findAll("table", id="SearchResults"):
        for row in table.findAll("tr"):
            for col in row.findAll("td"):
                # finding the current status of firm
                for sp in col.findAll("span", class_="CurrentStatus Authorised Authorised search_popover"):
                    if col.text == "Authorised":
                        flag = 1
            if flag == 1:
                # when Authorised, find the "Name" column
                for name in row.findAll("td", class_="ResultName"):
                    for a in name.findAll("a"):
                        # get the hyperlink of firm details page
                        authorised_link = base_weblink + a["href"]
                flag = 0
    data.append(cid)
    # extract information from firm details page
    data, cols = parse_infopage(authorised_link, data)
    info.append(data)
# if url contains firm details page
elif "https://register.fca.org.uk/ShPo" in url:
    data.append(cid)
    # extract information from firm details page
    data, cols = parse_infopage(url, data)
    info.append(data)

# Create dataframe using data lists and column names
df = pd.DataFrame(info, columns=cols)

# writing the extracted data of tabular format in excel
writer = pd.ExcelWriter('companies_info.xlsx')
df.to_excel(writer, sheet_name='CID_Info')
writer.save()
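A small caveat for readers on a recent pandas version: ExcelWriter.save() has since been deprecated and removed, so the writer.save() call above may fail there. A version-friendly sketch is to let a context manager handle saving the file:

# writing the extracted data to Excel; the context manager saves and closes the file
df = pd.DataFrame(info, columns=cols)
with pd.ExcelWriter('companies_info.xlsx') as writer:
    df.to_excel(writer, sheet_name='CID_Info')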
3. Scraping & Parsing : Beautiful Soup
An important thing to note is that the information is extracted from the “Principal place of business” section inside “Contact Details”.
The Python function below parses the firm details page and extracts the necessary fields using Beautiful Soup.
def parse_infopage(url_link, data):
    """
    input: URL of firm details page, a list to be returned
    returns: list of extracted data, column names
    """
    req = Request(url_link, headers={'User-Agent': 'Mozilla/5.0'})
    webpage_authorised = urlopen(req).read()
    # parsing firm details page with beautifulsoup
    soup_authorised = BeautifulSoup(webpage_authorised.decode("utf-8"), "html.parser")
    # columns list
    cols = ["CID"]
    cols.append("Company")
    # Extracting company name field from parsed html
    for name in soup_authorised.findAll("h1", class_="RecordName"):
        data.append(name.text)
    # Extracting information from "Principal place of business"
    for div in soup_authorised.findAll("div", class_="address section"):
        for h3 in div.findAll("h3", class_="addressheader"):
            if h3.text == "Principal place of business":
                # extract attribute/column names from span tags and class "addresslabel"
                for sp in div.findAll("span", class_="addresslabel"):
                    cols.append(sp.text.strip())
                # extract the data fields from div tags for respective attributes/columns
                for d in div.findAll("div", class_="addressvalue"):
                    data.append(' '.join(d.text.split()))
                # extract data fields from span tags
                for sp in div.findAll("span", class_="addressvalue"):
                    # decode the Cloudflare obfuscated email id
                    if "email" in sp.text.strip():
                        email = sp.find("a", class_="__cf_email__")
                        data.append(decodeEmail(email['data-cfemail']))
                    else:
                        data.append(sp.text.strip())
    # Extracting authorization status checking statusbox field
    cols.append("Status")
    for stat in soup_authorised.findAll("span", class_="statusbox"):
        if "No longer authorised" in stat.text:
            data.append("Not Authorised")
        elif "Authorised" in stat.text:
            data.append("Authorised")
    return data, cols
4. Extracting Protected Email : Cloudflare Obfuscation
Coming towards the end of this blog-post, we penned down the decoding function for protected emails. Cloudflare email address obfuscation helps in spam prevention by hiding email addresses appearing in web pages from email harvesters and other bots, while remaining visible to site visitors. An obfuscated email in an anchor tag looks like this.
data-cfemail="6a090b192a0b471a060b04440905441f01"
The Python function below decodes the hexadecimal encoding into the characters that form the email id. Every pair of hexadecimal digits yields one character, except for the initial pair, which encodes the key used to decode every subsequent character.
def decodeEmail(e):
    de = ""
    k = int(e[:2], 16)
    for i in range(2, len(e)-1, 2):
        de += chr(int(e[i:i+2], 16) ^ k)
    return de
Passing the hexadecimal encoding as a parameter to decodeEmail(e) returns the decoded email string.
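For instance, feeding it the data-cfemail value from the anchor tag shown earlier looks like this; the printed result is simply whatever plain-text address that particular string encodes:

# decode the sample data-cfemail value shown above
encoded = "6a090b192a0b471a060b04440905441f01"
print(decodeEmail(encoded))   # prints the decoded, plain-text email address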
Final Thoughts
Finally, the Python bot created in this blog-post extracts information about companies/firms and returns a data frame, which is written to an Excel sheet, ‘companies_info.xlsx’.
You can get the full Python implementation of the demonstrated bot from the GitHub link here.
I hope the tutorial was easy to follow, as I have tried to keep it short and simple. Readers who are interested in web crawling and web scraping can get hands-on experience with the use case demonstrated in this blog-post. It could be a good start in this field.
If you liked the post, follow this blog to get updates about upcoming articles. Also, share it so that it can reach the readers who can actually gain from it. Please feel free to discuss anything regarding the post; I would love to hear your feedback.
Happy Crawling & Scraping 🙂