The motivation for writing this blog came from a simple automation I built for a data collection task that had been done manually for the past year at my current workplace. Gathering data from the many resources and websites on the internet has always been a challenge. Simply put, a lot of reporting work involves situations where we have to collect data from a website. For example, in a use-case where we want to gather information about companies, we might visit a website, search for a company, and then collect data from the company’s information page.
Problem Description
In this blog-post, we will discuss one such use-case and describe how to build a bot that automates it using Selenium (web crawling) and Beautiful Soup (web scraping). Here is the problem statement along with the steps to be performed.
- Go to the website https://register.fca.org.uk/
- Search a company ID such as 310164, 307494, 305637, or 519675 in the register.
- Click on the “Search the Register” button. It may land on a ‘search results page’ OR a ‘firm details page’.
- If it lands on the search results page, check the “Status” column in the table to see whether the firm is “Authorised”. If it is, go to the “Name” column in the same row and click the company link, which carries the URL of the “firm details page”.
- If it lands on the “firm details page”, we get the URL for the firm details directly.
- Scrape the page at the URL obtained in either of the two previous steps. If the status of the firm on the “firm details page” is “Authorised”, extract the required information about the company.
- The details or attributes to be extracted are:
- Company name
- Address
- Phone
- Fax
- Website
- Authorization Status
Let us start developing a Python-based bot which will crawl to the “firm details page” and scrape the required information about the firm.
Disclaimer : The mention of any company names, trademarks, or data sets in this blog-post does not imply that we can or will scrape them. They are listed for illustration purposes and as general use cases only. Any code provided in this article is for learning purposes only; we are not responsible for how it is used.
1. Imports
import string
import pandas as pd
from lxml import html
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
2. Web crawling : Selenium
The selenium package is used to automate web browser interaction from Python. Selenium requires a driver to interface with the chosen browser. Chrome, for example, requires chromedriver, which needs to be installed before the examples below can be run. Make sure to provide its path in the Python code.
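As a side note, newer Selenium 4 releases pass the driver path through a Service object rather than as the first positional argument. Here is a minimal sketch, assuming chromedriver sits at a path of your choosing:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# path to the chromedriver binary -- replace with your own location
service = Service("/path/to/chromedriver")
driver = webdriver.Chrome(service=service)

driver.get("https://register.fca.org.uk/")
print(driver.title)   # quick sanity check that the page loaded
driver.quit()

The snippets in this post keep the older webdriver.Chrome("Path/chromedriver") form, so adapt them to whichever Selenium version you have installed.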
The code snippet below performs steps 1, 2, and 3: opening the website, typing a company ID into the search box, and clicking the “Search the Register” button.
fp = open("cid.txt", "r")
cids = fp.readlines()
cids = [ix.strip("\n") for ix in cids]

# global list of lists of extracted fields for all CIDs
info = []

# Looping through all the company IDs for information extraction
for cid in cids:
    # Setting up chrome driver for automated browsing
    driver = webdriver.Chrome("Path/chromedriver")
    # Site to browse
    driver.get("https://register.fca.org.uk/")
    # url : url retrieved after search button click
    url = ""
    # data : list for keeping all extracted attributes for particular CID
    data = []
    # placing the id on search box
    driver.find_element_by_id('j_id0:j_id1:j_id33:j_id34:registersearch:j_id36:searchBox').send_keys(cid)
    # Clicking the search button
    button = driver.find_element_by_id('j_id0:j_id1:j_id33:j_id34:registersearch:j_id36:j_id39')
    button.click()
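One caveat: the find_element_by_id helper used above was removed in Selenium 4, so on a recent installation the equivalent lookups go through the By locator that is already in our imports. A small sketch with the same element IDs:

# Selenium 4 style locators -- same element IDs as in the snippet above
search_box = driver.find_element(By.ID, 'j_id0:j_id1:j_id33:j_id34:registersearch:j_id36:searchBox')
search_box.send_keys(cid)

button = driver.find_element(By.ID, 'j_id0:j_id1:j_id33:j_id34:registersearch:j_id36:j_id39')
button.click()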
Now that we have searched for a company ID by clicking the button, we would like to check whether the browser lands on the “search results page” or the “firm details page”. Here, the new URL contains the substring “https://register.fca.org.uk/ShPo” when it is the “firm details page” and the substring “https://register.fca.org.uk/shpo_” when it is the “search results page”. We will also allow some waiting time, say 10 seconds, for the webpage to load. After that, we read the new URL.
try:
    # wait for max 10 secs to load the new URL, check if URL is "firm details page"
    if WebDriverWait(driver, 10).until(EC.url_contains("https://register.fca.org.uk/ShPo")):
        url = driver.current_url
        #print(url)
        driver.quit()
except TimeoutException:
    # wait for max 10 secs to load the new URL, check if URL is "search results page"
    if WebDriverWait(driver, 10).until(EC.url_contains("https://register.fca.org.uk/shpo_")):
        url = driver.current_url
        #print(url)
        driver.quit()
Now we have the URL of the page we landed on. We will use Beautiful Soup to scrape it.
We will now perform step 4: if the page is the search results page, check the “Status” column in the table to see whether the firm is “Authorised”. If it is, go to the “Name” column in the same row and pick up the company link, which carries the URL of the “firm details page”. We then pass that URL to the function parse_infopage(authorised_link, data), which parses the firm details page and extracts all the required fields about the company. If, on the other hand, the page is already the firm details page, its URL can be passed to parse_infopage(authorised_link, data) directly. The Python code snippet below does exactly that.
# scraping using Request package of urllib library
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()

# if url contains search result page
if "https://register.fca.org.uk/shpo_" in url:
    soup = BeautifulSoup(webpage.decode("utf-8"), "html.parser")
    flag = 0
    base_weblink = "https://register.fca.org.uk/"
    authorised_link = ""
    # find the table with id "SearchResults"
    for table in soup.findAll("table", id="SearchResults"):
        for row in table.findAll("tr"):
            for col in row.findAll("td"):
                # finding the current status of firm
                for sp in col.findAll("span", class_="CurrentStatus Authorised Authorised search_popover"):
                    if col.text == "Authorised":
                        flag = 1
            if flag == 1:
                # when Authorised, find the "Name" column
                for name in row.findAll("td", class_="ResultName"):
                    for a in name.findAll("a"):
                        # get the hyperlink of firm details page
                        authorised_link = base_weblink + a["href"]
                flag = 0
    data.append(cid)
    # extract information from firm details page
    data, cols = parse_infopage(authorised_link, data)
    info.append(data)
# if url contains firm details page
elif "https://register.fca.org.uk/ShPo" in url:
    data.append(cid)
    # extract information from firm details page
    data, cols = parse_infopage(url, data)
    info.append(data)

# Create dataframe using data lists and column names
df = pd.DataFrame(info, columns=cols)

# writing the extracted data of tabular format in excel
writer = pd.ExcelWriter('companies_info.xlsx')
df.to_excel(writer, sheet_name='CID_Info')
writer.save()
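A small caveat for readers on a recent pandas version: ExcelWriter.save() has since been deprecated and removed, so the writer.save() call above may fail there. A version-friendly sketch is to let a context manager handle saving the file:

# writing the extracted data to Excel; the context manager saves and closes the file
df = pd.DataFrame(info, columns=cols)
with pd.ExcelWriter('companies_info.xlsx') as writer:
    df.to_excel(writer, sheet_name='CID_Info')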
3. Scraping & Parsing : Beautiful Soup
An important thing to note is that the information is extracted from the “Principal place of business” section inside “Contact Details”.
The Python function below parses the firm details page and extracts the necessary fields using Beautiful Soup.
def parse_infopage(url_link, data):
    """
    input: URL of firm details page, a list to be returned
    returns: list of extracted data, column names
    """
    req = Request(url_link, headers={'User-Agent': 'Mozilla/5.0'})
    webpage_authorised = urlopen(req).read()
    # parsing firm details page with beautifulsoup
    soup_authorised = BeautifulSoup(webpage_authorised.decode("utf-8"), "html.parser")
    # columns list
    cols = ["CID"]
    cols.append("Company")
    # Extracting company name field from parsed html
    for name in soup_authorised.findAll("h1", class_="RecordName"):
        data.append(name.text)
    # Extracting information from "Principal place of business"
    for div in soup_authorised.findAll("div", class_="address section"):
        for h3 in div.findAll("h3", class_="addressheader"):
            if h3.text == "Principal place of business":
                # extract attribute/column names from span tags and class "addresslabel"
                for sp in div.findAll("span", class_="addresslabel"):
                    cols.append(sp.text.strip())
                # extract the data fields from div tags for respective attributes/columns
                for d in div.findAll("div", class_="addressvalue"):
                    data.append(' '.join(d.text.split()))
                # extract data fields from span tags
                for sp in div.findAll("span", class_="addressvalue"):
                    # decode the Cloudflare obfuscated email id
                    if "email" in sp.text.strip():
                        email = sp.find("a", class_="__cf_email__")
                        data.append(decodeEmail(email['data-cfemail']))
                    else:
                        data.append(sp.text.strip())
    # Extracting authorization status checking statusbox field
    cols.append("Status")
    for stat in soup_authorised.findAll("span", class_="statusbox"):
        if "No longer authorised" in stat.text:
            data.append("Not Authorised")
        elif "Authorised" in stat.text:
            data.append("Authorised")
    return data, cols
4. Extracting Protected Email : Cloudflare Obfuscation
Coming towards the end of this blog-post, we penned down the decoding function for protected emails. Cloudflare email address obfuscation helps in spam prevention by hiding email addresses appearing in web pages from email harvesters and other bots, while remaining visible to site visitors. An obfuscated email in an anchor tag looks like this.
data-cfemail="6a090b192a0b471a060b04440905441f01"
The Python function below decodes the hexadecimal encoding into the characters that form the email id. Every pair of hexadecimal digits yields one character, except for the initial pair, which encodes the key used to decode every subsequent character.
def decodeEmail(e):
    de = ""
    k = int(e[:2], 16)
    for i in range(2, len(e)-1, 2):
        de += chr(int(e[i:i+2], 16) ^ k)
    return de
Passing the hexadecimal encoding as a parameter to decodeEmail(e) returns the decoded email string.
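For instance, feeding it the data-cfemail value from the anchor tag shown earlier looks like this; the printed result is simply whatever plain-text address that particular string encodes:

# decode the sample data-cfemail value shown above
encoded = "6a090b192a0b471a060b04440905441f01"
print(decodeEmail(encoded))   # prints the decoded, plain-text email address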
Final Thoughts
Finally, the Python bot created in this blog-post extracts information about companies/firms and returns a data frame, which is written to an Excel sheet, ‘companies_info.xlsx’.
You can get the full Python implementation of the demonstrated bot from the GitHub link here.
I hope the tutorial was easy to follow, as I have tried to keep it short and simple. Readers who are interested in web crawling and web scraping can get hands-on experience with the use case demonstrated in this blog-post. It could be a good start in this field.
If you liked the post, follow this blog to get updates about upcoming articles. Also, share it so that it can reach the readers who can actually gain from it. Please feel free to discuss anything regarding the post; I would love to hear your feedback.
Happy Crawling & Scraping 🙂