Web crawling and scraping become crucial when we have to gather or create large datasets automatically. Collecting data from the many resources and websites on the internet has always been a challenge. To put it simply, a lot of reporting work involves situations where data has to be gathered from a website.
The motivation for writing this blog came from a simple automation I did for a data collection task at my current workplace, a task that had been done manually for the last year. Consider a use case where we want to gather information about companies: we might go to some website, search for a company and then collect data from the company’s information page. Moreover, if you are interested in scraping an XML file, read this blog-post.
Problem Description
In this section, we will discuss one such use-case and describe building a bot that automates it using Selenium (web crawling) and Beautiful Soup (web scraping). Let us define the problem statement and the sequential steps to achieve the objective; a high-level sketch of the overall flow follows the list.
- Go to the website https://register.fca.org.uk/
- Search a company ID such as 310164, 307494, 305637 or 519675 in the register.
- Click the “Search the Register” button. It may land on a ‘search results page’ OR a ‘firm details page’.
- If it lands on the search results page, check the “Status” column in the table to see whether the firm is “Authorised”. If it is, go to the “Name” column in the same row and click the company link, which carries the URL of the “firm details page”.
- If it lands on the “firm details page”, we get the URL of the firm details directly.
- Scrape the URL from step 4 or step 5. If the firm’s status on the “firm details page” is authorised, extract the required information about the company.
- The details or attributes to extract are:
- Company name
- Address
- Phone
- Fax
- Website
- Authorization Status
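Before diving into the implementation, here is a rough sketch of the control flow the bot follows. This is only an outline; the function names search_register, pick_authorised_link and scrape_details are placeholders that the Selenium and Beautiful Soup code in the following sections fills in.

# Rough control-flow sketch; the helper names below are placeholders
# fleshed out by the Selenium and Beautiful Soup code later in the post.
def run_bot(company_ids):
    info = []
    for cid in company_ids:
        landing_url = search_register(cid)            # steps 1-3: search the register
        if "shpo_" in landing_url:                    # step 4: search results page
            details_url = pick_authorised_link(landing_url)
        else:                                         # step 5: firm details page
            details_url = landing_url
        info.append(scrape_details(details_url))      # step 6: extract the attributes
    return info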
Let us start developing a Python-based bot which will crawl to the “firm details page” and scrape the required information about the firm.
Disclaimer: The mention of any company names, trademarks or data sets in this blog-post does not imply that we can or will scrape them; they are mentioned only to illustrate general use cases. Any code provided in this article is for learning purposes only, and we are not responsible for how it is used.
1. Imports
import string
import pandas as pd
from lxml import html
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
2. Web crawling : Selenium
Let us talk about the libraries required for the web crawling task. The Selenium package is used to automate web browser interaction from Python. Selenium requires a driver to interface with the chosen browser. Chrome, for example, requires chromedriver, which needs to be installed before the examples below can be run. Also, make sure to provide the driver path in the Python code.
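As a quick sanity check before running the bot, a minimal sketch like the one below (using the Selenium 3 style API that the rest of this post follows; the chromedriver path is a placeholder you must replace) launches Chrome, optionally headless, and opens the register homepage.

# Minimal sketch: verify chromedriver is reachable ("Path/chromedriver" is a placeholder)
from selenium import webdriver

options = webdriver.ChromeOptions()
# uncomment to run without opening a visible browser window
# options.add_argument("--headless")

driver = webdriver.Chrome("Path/chromedriver", options=options)
driver.get("https://register.fca.org.uk/")
print(driver.title)   # should print the page title if the driver is set up correctly
driver.quit()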
Next, we pen down the code snippet to perform steps 1, 2 and 3: opening the website, typing a company ID into the search box and clicking the “Search the Register” button.
fp = open("cid.txt", "r")
cids = fp.readlines()
cids = [ix.strip("\n") for ix in cids]

# global list of lists of extracted fields for all CIDs
info = []

# Looping through all the company IDs for information extraction
for cid in cids:
    # Setting up chrome driver for automated browsing
    driver = webdriver.Chrome("Path/chromedriver")
    # Site to browse
    driver.get("https://register.fca.org.uk/")
    # url : url retrieved after search button click
    url = ""
    # data : list for keeping all extracted attributes for particular CID
    data = []
    # placing the id on search box
    driver.find_element_by_id('j_id0:j_id1:j_id33:j_id34:registersearch:j_id36:searchBox').send_keys(cid)
    # Clicking the search button
    button = driver.find_element_by_id('j_id0:j_id1:j_id33:j_id34:registersearch:j_id36:j_id39')
    button.click()
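A side note on compatibility: the find_element_by_id helpers were removed in Selenium 4, so if you are on a newer Selenium release you would locate the same elements with find_element and the By import shown earlier. A sketch of the equivalent lookups (same element IDs, just a different call style):

# Selenium 4 style equivalents of the two lookups above
search_box = driver.find_element(By.ID, 'j_id0:j_id1:j_id33:j_id34:registersearch:j_id36:searchBox')
search_box.send_keys(cid)

button = driver.find_element(By.ID, 'j_id0:j_id1:j_id33:j_id34:registersearch:j_id36:j_id39')
button.click()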
Now that we have searched for a company ID by clicking the button, we would like to check whether it lands on the “search results page” or the “firm details page”. Here, the new URL contains the substring “https://register.fca.org.uk/ShPo” when it is the “firm details page” and the substring “https://register.fca.org.uk/shpo_” when it is the “search results page”. We also wait for up to 10 seconds for the new page to load, and then read the new URL. The code below does this and is a continuation of the previous snippets.
try:
    # wait for max 10 secs to load the new URL, check if URL is "firm details page"
    if WebDriverWait(driver, 10).until(EC.url_contains("https://register.fca.org.uk/ShPo")):
        url = driver.current_url
        # print(url)
        driver.quit()
except TimeoutException:
    # wait for max 10 secs to load the new URL, check if URL is "search results page"
    if WebDriverWait(driver, 10).until(EC.url_contains("https://register.fca.org.uk/shpo_")):
        url = driver.current_url
        # print(url)
        driver.quit()
Now we have the URL of the page where the search landed. We will use Beautiful Soup to scrape this new URL.
As part of web scraping, we now perform step 4: if the URL is the search results page, check the “Status” column in the table to see whether the firm is “Authorised”. If it is, go to the “Name” column in the same row and pick up the company link, which carries the URL of the “firm details page”. We then pass the URL of the firm details page to the function parse_infopage(authorised_link, data), which parses the firm details page and extracts all the required fields about the company. If the URL is already the firm details page, it can be passed to parse_infopage(authorised_link, data) directly. The Python snippet below does this and is a continuation of the previous code.
# scraping using Request package of urllib library
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()

# if url contains search result page
if "https://register.fca.org.uk/shpo_" in url:
    soup = BeautifulSoup(webpage.decode("utf-8"), "html.parser")
    flag = 0
    base_weblink = "https://register.fca.org.uk/"
    authorised_link = ""
    # find the table with id "SearchResults"
    for table in soup.findAll("table", id="SearchResults"):
        for row in table.findAll("tr"):
            for col in row.findAll("td"):
                # finding the current status of firm
                for sp in col.findAll("span", class_="CurrentStatus Authorised Authorised search_popover"):
                    if col.text == "Authorised":
                        flag = 1
            if flag == 1:
                # when Authorised, find the "Name" column
                for name in row.findAll("td", class_="ResultName"):
                    for a in name.findAll("a"):
                        # get the hyperlink of firm details page
                        authorised_link = base_weblink + a["href"]
                flag = 0
    data.append(cid)
    # extract information from firm details page
    data, cols = parse_infopage(authorised_link, data)
    info.append(data)

# if url contains firm details page
elif "https://register.fca.org.uk/ShPo" in url:
    data.append(cid)
    # extract information from firm details page
    data, cols = parse_infopage(url, data)
    info.append(data)

# Create dataframe using data lists and column names
df = pd.DataFrame(info, columns=cols)

# writing the extracted data of tabular format in excel
writer = pd.ExcelWriter('companies_info.xlsx')
df.to_excel(writer, sheet_name='CID_Info')
writer.save()
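One small caveat: in recent pandas releases ExcelWriter.save() has been deprecated in favour of close(), and using the writer as a context manager sidesteps the issue entirely. A sketch of the equivalent Excel export under that assumption:

# Equivalent Excel export using a context manager (the file is saved and closed automatically)
with pd.ExcelWriter('companies_info.xlsx') as writer:
    df.to_excel(writer, sheet_name='CID_Info')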
3. Scraping and Parsing : Beautiful Soup
Most importantly, the task here is to extract the information from the “Principal place of business” section inside “Contact Details”.
The Python function written below scrapes and parses the firm details page and extracts the necessary fields using Beautiful Soup.
def parse_infopage(url_link, data):
    """
    input: URL of firm details page, a list to be returned
    returns: list of extracted data, column names
    """
    req = Request(url_link, headers={'User-Agent': 'Mozilla/5.0'})
    webpage_authorised = urlopen(req).read()
    # parsing firm details page with beautifulsoup
    soup_authorised = BeautifulSoup(webpage_authorised.decode("utf-8"), "html.parser")

    # columns list
    cols = ["CID"]
    cols.append("Company")
    # Extracting company name field from parsed html
    for name in soup_authorised.findAll(class_="RecordName"):
        data.append(name.text)

    # Extracting information from "Principal place of business"
    for div in soup_authorised.findAll("div", class_="address section"):
        for h3 in div.findAll("h3", class_="addressheader"):
            if h3.text == "Principal place of business":
                # extract column names from span tags and class "addresslabel"
                for sp in div.findAll("span", class_="addresslabel"):
                    cols.append(sp.text.strip())
                # extract the data fields from div tags for respective attributes/columns
                for d in div.findAll("div", class_="addressvalue"):
                    data.append(' '.join(d.text.split()))
                # extract data fields from span tags
                for sp in div.findAll("span", class_="addressvalue"):
                    # decode the cloudflare obfuscated email id
                    if "email" in sp.text.strip():
                        email = sp.find("a", class_="__cf_email__")
                        data.append(decodeEmail(email['data-cfemail']))
                    else:
                        data.append(sp.text.strip())

    # Extracting authorization status by checking the statusbox field
    cols.append("Status")
    for stat in soup_authorised.findAll("span", class_="statusbox"):
        if "No longer authorised" in stat.text:
            data.append("Not Authorised")
        elif "Authorised" in stat.text:
            data.append("Authorised")

    return data, cols
4. Extracting Protected Email : Cloudflare Obfuscation
Coming towards the end of this blog-post, we pen down the decoding function for protected emails. Cloudflare email address obfuscation helps with spam prevention by hiding email addresses that appear in web pages from email harvesters and other bots, while keeping them visible to site visitors. An obfuscated email inside an anchor tag looks like this.
data-cfemail="6a090b192a0b471a060b04440905441f01"
The Python function below decodes this hexadecimal encoding back into the characters that form the email string. Every pair of hexadecimal digits encodes one character of the email, except the first pair, which serves as the key: each subsequent byte is XORed with it to recover the original character.
def decodeEmail(e):
    de = ""
    k = int(e[:2], 16)
    for i in range(2, len(e)-1, 2):
        de += chr(int(e[i:i+2], 16) ^ k)
    return de
Passing the hexadecimal encoding as a parameter to decodeEmail(e) returns the decoded email string.
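As a quick usage example, feeding the sample encoding shown above into the function (here the first pair, 6a, is the XOR key) yields:

# Usage example with the obfuscated string shown earlier
encoded = "6a090b192a0b471a060b04440905441f01"
print(decodeEmail(encoded))   # -> cas@a-plan.co.uk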
Final Thoughts
Finally, after completing the web crawling and scraping task, the extracted information about the companies/firms ends up in a pandas data frame. At the end, the bot writes this data frame to an Excel sheet, ‘companies_info.xlsx’.
In addition, you can get the full Python implementation of the demonstrated bot from the GitHub link here.
I hope the tutorial was easy to follow, as I have tried to keep it short and simple. Interested readers can get hands-on with the web crawling and scraping use case demonstrated in this blog-post. It could be a good start in this field.
If you liked the post, follow this blog to get updates about upcoming articles. Also, share it so that it can reach readers who can actually gain from it. Please feel free to discuss anything regarding the post. I would love to hear your feedback.
Happy Crawling and Scraping 🙂