Web Content Scraping for Business Intelligence

Introduction

The data that feeds into an organization's information systems can come from a wide variety of sources, and it is not uncommon for an organization to want or need data from an external source to help it make informed decisions. Because large amounts of information are publicly available on the Internet, the ability to perform web content scraping can give an organization access to the information it needs, potentially providing a competitive advantage. Web content scraping can supply a wide variety of information, making it an extremely versatile skill to have. To illustrate these benefits, an example from a recent project is presented below. In addition to web scraping, geo-mapping location-based datasets is also briefly discussed.

  • At the end of this essay is a link to a sample of Python code used to complete this project.

GA COAM Project Overview

A professional acquaintance from a Georgia Coin Operated Amusement Machine (GA COAM) company reached out for consulting on a project assigned by her boss. She had been given a set of data and asked to cross-reference it with known businesses holding GA COAM licenses to generate a list of potential leads. The businesses being targeted were gas stations and convenience stores; however, the dataset covered buried fuel tanks and mostly consisted of old construction sites and military bases. There was also no common field on which to join the two datasets, making the task tedious.

It was apparent that the dataset given to her by her boss was not usable for the business task at hand. After some brainstorming and Google research, we determined that GA liquor license records would provide all the information needed, as the vast majority of convenience stores and gas stations sell beer and wine. This presented a perfect opportunity to use web content scraping to collect the needed information quickly and in a usable form.

To complete this task, Python and Jupyter Notebooks were used, along with the Pandas and Selenium libraries. A brief explanation of Selenium is included below. The process started with initial preparation, followed by scraping the data from the website. Next, the scraped data was processed and joined to the GA COAM dataset already provided, using the business address fields from both datasets, as these fields matched across the two. Lastly, in addition to delivering the finalized dataset of potential leads, an interactive map was generated using Google Maps.

Scraping Data with Python and Selenium

While Python is a powerful programming language for data science on its own, it really flourishes when paired with its many excellent open-source libraries. One used often in data science is Pandas, which provides robust tools for creating and manipulating large and complex data structures. This project also used the Selenium library, which offers a variety of tools for interacting with web browsers through the Selenium WebDriver.

The Selenium WebDriver allows for the creation of a web browser instance that can programmatically access and interact with both static and dynamic websites. The ability to interact with the elements of dynamic websites is what makes the WebDriver so powerful, letting the analyst reach the data they are after. Take the snippet of code below (Figure 1).

After creating the Selenium WebDriver instance (in this case using a Google Chrome browser driver), the link to the dataset was passed to the WebDriver instance and opened. Once the website was opened, the power of Selenium could be seen. The first web element that had to be interacted with was a link to the Licenses lookup system (Figure 2). Selenium provides a variety of ways to find and interact with page elements; here, the needed element, a link, was located by searching for the link's text. Once located, the WebDriver instance was instructed to click the link and load the targeted system.
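
A minimal sketch of the kind of code shown in Figure 1 follows; the URL and link text are placeholders rather than the project's actual values:

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    # Create a WebDriver instance backed by a Chrome browser driver
    driver = webdriver.Chrome()

    # Open the page hosting the license database (placeholder URL)
    driver.get("https://example.ga.gov/records")

    # Find the "Licenses" link by its visible text, then click it
    licenses_link = driver.find_element(By.LINK_TEXT, "Licenses")
    licenses_link.click()

Searching by link text (By.LINK_TEXT) is only one of several locator strategies Selenium offers; elements can also be found by ID, CSS selector, or XPath.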

The robust functionality Selenium provides for interacting with dynamic content allows the data analyst to access and work with elements on a wide range of sites. This opens up access to useful data, but there are other benefits as well: Selenium also supports tasks such as website testing and process automation, making it a useful library to know how to use.

Figure 1 - Sample Python code using Selenium WebDriver to access the Licenses link, seen in Figure 2.

Figure 2 - A set of links on the GA database website, including the link to the Licenses portion of the database.

Finishing the Project and Mapping the Data

After the data from the various cities and counties was scraped and compiled into a single dataset, additional cleaning took place before joining it with the pre-existing GA COAM dataset. For example, inactive businesses needed to be purged, strange data values needed to be corrected, and missing data needed to be dealt with. Finally, the full liquor license dataset included restaurants, grocery stores, and other types of businesses that were not of interest; only gas station and convenience store type businesses were relevant, so all other records were removed.
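
As a rough sketch, this cleaning might look like the following in Pandas; the file, column, and category names below are assumptions, not the project's actual schema:

    import pandas as pd

    # Load the compiled scrape results (file name is illustrative)
    licenses = pd.read_csv("ga_liquor_licenses.csv")

    # Purge inactive businesses (column and value names are assumed)
    licenses = licenses[licenses["status"] == "Active"]

    # Normalize strange values and drop rows missing key fields
    licenses["address"] = licenses["address"].str.strip().str.upper()
    licenses = licenses.dropna(subset=["business_name", "address"])

    # Keep only gas station and convenience store type businesses
    keep_types = ["Gas Station", "Convenience Store"]
    licenses = licenses[licenses["business_type"].isin(keep_types)]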

At this point, the cleaned-up dataset of GA liquor licenses could be joined to the GA COAM licenses dataset, giving the COAM company a large list of potential business leads. But a large spreadsheet isn't the best way to view this kind of dataset, so an additional deliverable was completed: after geocoding the business addresses, the dataset was plotted on a simple Google Map. This allowed the business to visualize its potential leads in a way that was previously impossible, without having to make any additional investment.
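
The join and geocoding might look roughly like the following; the googlemaps client is used here for illustration, as the essay does not name the geocoding tool, and the column names continue the assumptions from the previous sketch:

    import pandas as pd
    import googlemaps

    coam = pd.read_csv("ga_coam_licenses.csv")
    coam["address"] = coam["address"].str.strip().str.upper()

    # Left join on the normalized address field: every liquor-license
    # business is kept, flagged if it also holds a COAM license
    leads = licenses.merge(
        coam[["address", "coam_license"]], on="address", how="left"
    )
    leads["has_coam"] = leads["coam_license"].notna()

    # Geocode each address to latitude/longitude (API key is a placeholder)
    gmaps = googlemaps.Client(key="YOUR_API_KEY")

    def geocode(addr):
        results = gmaps.geocode(addr)
        if results:
            loc = results[0]["geometry"]["location"]
            return loc["lat"], loc["lng"]
        return None, None

    leads[["lat", "lng"]] = leads["address"].apply(
        lambda a: pd.Series(geocode(a))
    )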

The sample map linked on this page shows some sample data from the city of Atlanta, GA. The map has been customized so that locations with green tags represent businesses with an active GA COAM license, while red tags mark businesses without one. By seeing exactly where these businesses are, sales staff can decide which businesses would be the best to contact in an attempt to make a sale. For example, if the sales staff want to focus on businesses near residential neighborhoods, they can avoid contacting businesses in non-residential areas.
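
The essay does not specify how the map itself was generated; one plausible approach, using the gmplot library to write a color-coded Google Maps HTML page from the geocoded leads, is sketched below:

    import gmplot

    # Center the map on Atlanta, GA (coordinates approximate)
    gmap = gmplot.GoogleMapPlotter(33.749, -84.388, zoom=11,
                                   apikey="YOUR_API_KEY")

    for _, row in leads.iterrows():
        # Green for an active GA COAM license, red otherwise
        color = "green" if row["has_coam"] else "red"
        gmap.marker(row["lat"], row["lng"], color=color,
                    title=row["business_name"])

    # Write an interactive HTML map that sales staff can open in a browser
    gmap.draw("potential_leads_map.html")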

Conclusion

Web content scraping provides access to many nontraditional resources that were previously not accessible at a usable scale. By automating the process of data collection over the Internet, organizations can obtain all sorts of data to solve all sorts of problems. As shown above, web content scraping, along with a few other technologies, can help create powerful tools that organizations can use to build a competitive advantage for themselves.

Below is a link to a sample of the code used in this project: