Using NLP to Explore Data Analysis and Data Cleansing

Introduction

Exploratory data analysis can tell us a great deal about a dataset and the information that can be gathered from it. Proper data analytics can give deep insight into a dataset by grouping, aggregating, and drilling down into the information using a variety of business intelligence tools (for example, I discuss using Tableau for Data Exploration in another project sample). However, exploratory data analysis can provide other information as well. In a data mining project, the data being acquired is often in a “messy” state. Data exploration can help identify the ways in which that data is messy, which in turn helps determine how best to cleanse the data and improve its quality.

One of the most challenging types of data to cleanse and process is natural language. Natural Language Processing (NLP) is an area of data mining and machine learning (ML) that deals with gaining information from analyzing the real-life words and sentences humans use. Because language is dynamic, extracting information from it, such as sentiment, can present a major hurdle, and much work must be done to process and prepare natural language text for data mining tasks. To gain a better understanding of exploratory data analysis and data cleansing, a simple supervised ML task will be explored as an introduction to NLP.

  • At the end of this essay is a link to the Jupyter Notebook used to complete this task and provide a submission to the Kaggle competition this project is associated with.

Kaggle NLP with Disaster Tweets Overview

Kaggle.com is a prominent website within the ML research community that hosts a variety of ML competitions. Organizations can provide large cash prizes for ML researchers to compete to provide the best solution to a given ML problem. One popular ongoing competition (that offers no prize) is the Natural Language Processing with Disaster Tweets competition. In this competition, competitors are provided with a labeled training set of 7,613 tweets from Twitter. Each record contains, among other things, the text of the tweet and a label indicating whether that tweet references a disaster/emergency (class 1) or is a "normal" non-disaster tweet (class 0). The goal of the competition is to train an NLP ML model to classify whether a newly presented tweet is or is not referring to a disaster or emergency.

To better learn about data exploration, data cleansing, and NLP, the task set forth by this competition will be examined. To accomplish this goal, Python and a Jupyter Notebook will be utilized. NumPy and Pandas will provide data manipulation, while Seaborn and Matplotlib will provide data visualization. The nltk library will be used for NLP tasks, along with the SciKit Learn library, which provides additional functionality. Finally, the SciKit Learn Multinomial Naïve Bayes classifier will be utilized as the ML model for this task as a proof of concept. For brevity, since there are many other classifiers, some of which could be better suited for this task, only the Naïve Bayes classifier will be used in this project sample.

Data Exploration

With the ML task clearly stated by the Kaggle competition, the initial focus can turn to exploratory data analysis. This is a crucial step for NLP tasks, as data exploration can highlight areas in the text that need to be cleansed. Data analysis can also provide valuable insights without the need to train an ML model at all, so it is a powerful skill to have.

Figure 1 - Basic data frame information about the training tweets.

Diving into the data, one of the first things to look for is missing values. As shown in Figure 1, of the several attributes in the training data, both the “keyword” and “location” attributes have missing values. While the “location” column is missing less than 1% of its values, the “keyword” column is missing approximately 33% of its values. The missing “location” values could be filled in so that the column could be utilized in the ML model, but it may be better to simply drop the “keyword” column from the possible training features due to its large number of missing values. For the simplicity of this project, neither the “location” nor the “keyword” attribute was used in the final model.
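As a minimal sketch of this check, Pandas can report missing values per column; the file name here assumes the competition's default train.csv layout.

    import pandas as pd

    # Load the competition training data (file name follows Kaggle's default layout)
    train_df = pd.read_csv("train.csv")

    # Count of missing values per column and the corresponding percentage
    missing = train_df.isnull().sum()
    print(missing)
    print((missing / len(train_df) * 100).round(2))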

Figure 2 - A histogram for tweet length for both the non-disaster and disaster tweets.

With the text being analyzed consisting of tweets, the length of each tweet could also be of interest. Analyzing Figure 2, it is clear that most tweets are near 150 characters. This makes logical sense, as Twitter imposed a character limit, so most tweets tend to use nearly all of the available characters. It is interesting to note that non-disaster tweets are slightly more likely than disaster tweets to use fewer than the maximum number of characters. This could be because tweets relating to disasters likely aim to share as much information as possible. While tweet length could be useful in an ML model, for the simplicity of this project it was not used in the final model.
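A histogram like Figure 2 can be produced with a few lines of Pandas and Seaborn; this sketch assumes the train_df frame loaded above and a recent Seaborn version.

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Character length of each tweet, plotted per target class
    train_df["length"] = train_df["text"].str.len()
    sns.histplot(data=train_df, x="length", hue="target", bins=40)
    plt.xlabel("Tweet length (characters)")
    plt.show()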

A vital piece of information when planning a supervised ML project is the distribution of the classes of interest. The class distribution should affect the selection of ML models, as different models are better suited to different situations. For example, some ML tasks have a target class distribution of only a few percent, with the class of interest being heavily outweighed by records of the other class or classes. An example of this would be a dataset of credit card transactions being used to train an ML model to detect fraudulent charges. Because there may only be one or two fraudulent charges out of thousands, or tens of thousands, of transactions, the ML model used must be robust enough to detect the rare fraudulent transactions. Luckily for this project, Figure 3 shows that the target class of interest (class 1, disaster tweets) makes up approximately 43% of the training dataset, showing that this dataset is not heavily skewed.
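The class distribution in Figure 3 can be checked directly with Pandas, again assuming the train_df frame loaded earlier.

    # Count and proportion of each target class (0 = non-disaster, 1 = disaster)
    print(train_df["target"].value_counts())
    print(train_df["target"].value_counts(normalize=True).round(3))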

Figure 3 - The class distribution within the training dataset.

Figure 4 - Distribution of stop words for the non-disaster tweets (class 0).

Figure 5 - Distribution of stop words for the disaster tweets (class 1).

Data exploration also provides the analyst with insight into the data cleansing that needs to be done on the dataset. Stop words are words that do not add much meaning to a phrase or sentence, such as “the”, “a”, or “of”. If the text to be analyzed contains many stop words, then removing them should be part of the data cleansing and preparation process. Looking into stop words within the training dataset, Figures 4 and 5 show large numbers of stop words, indicating this cleaning step is needed. Another valuable step is to look at the top bigrams (two-word phrases) within the texts, to determine if there is anything else the analyst should take note of. Looking into commonly paired words in Figure 6 shows, among a large number of stop words, that tokens representing "http" are plentiful. This indicates there are URLs within the text and that the data cleansing should likely account for removing URLs.
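A sketch of how counts like these could be generated, using nltk's stop word list and a bigram CountVectorizer over the raw text; it assumes the train_df frame loaded earlier and that the nltk stopwords corpus has been downloaded.

    from collections import Counter
    import numpy as np
    from nltk.corpus import stopwords
    from sklearn.feature_extraction.text import CountVectorizer

    stop_words = set(stopwords.words("english"))  # nltk.download("stopwords") on first use

    # Most common stop words in each class (Figures 4 and 5)
    for label in (0, 1):
        tokens = " ".join(train_df.loc[train_df["target"] == label, "text"]).lower().split()
        print(label, Counter(w for w in tokens if w in stop_words).most_common(10))

    # Top ten bigrams across the raw, uncleaned training text (Figure 6)
    vec = CountVectorizer(ngram_range=(2, 2))
    counts = np.asarray(vec.fit_transform(train_df["text"]).sum(axis=0)).ravel()
    terms = vec.get_feature_names_out()
    for i in counts.argsort()[::-1][:10]:
        print(terms[i], counts[i])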

Figure 6 - The top ten bigrams, or word pairs, for the training dataset.

Data Preparation

With exploratory data analysis providing some insights, the next step is to prepare the text for the ML task. In addition to exploratory data analysis, a data analyst can rely on subject matter experts, along with their own knowledge and experience, to help guide the data cleansing and preparation. In addition to removing stop words and URLs, it would also be helpful to remove emoji, HTML tags, numbers, and punctuation from the text, so that the ML model only analyzes the natural language of the tweets. Furthermore, due to the nature of Twitter, Twitter handle mentions are likely present in the dataset and should be removed, as usernames may not provide much useful information. Finally, all the text can be converted to lowercase to further reduce the feature space by collapsing two spellings (such as “run” and “Run”) into one (“run”).
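A minimal sketch of such a cleansing function, using regular expressions and nltk's stop word list; the exact patterns and the function name clean_text are illustrative assumptions, not the project's exact implementation.

    import re
    import string
    from nltk.corpus import stopwords

    STOP_WORDS = set(stopwords.words("english"))

    def clean_text(text):
        """Cleanse a raw tweet and return the remaining tokens."""
        text = text.lower()                                   # normalize case
        text = re.sub(r"https?://\S+|www\.\S+", " ", text)    # remove URLs
        text = re.sub(r"<.*?>", " ", text)                    # remove HTML tags
        text = re.sub(r"@\w+", " ", text)                     # remove Twitter handles
        text = re.sub(r"[^\x00-\x7f]", " ", text)             # remove emoji / non-ASCII characters
        text = re.sub(r"\d+", " ", text)                      # remove numbers
        text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
        return [w for w in text.split() if w not in STOP_WORDS]           # remove stop words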

With the text data cleansed, the next step is to convert the raw text into a Bag-of-Words (BoW) model that can then be fed into the ML model for training. A BoW model simplifies a corpus by breaking it down into its vocabulary and representing each text record as a vector recording the words that make it up (without preserving word order or other information). This allows raw text to be turned into numeric vectors, which can be fed to an ML model for training. The SciKit Learn CountVectorizer, using the text-processing function developed to cleanse the text, processes the training dataset and produces a BoW model that can then be sent to an ML model.
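As a sketch, CountVectorizer can take the cleansing function above as its analyzer, so cleaning and tokenization happen inside the vectorizer; train_df is the frame loaded earlier.

    from sklearn.feature_extraction.text import CountVectorizer

    # clean_text is the cleansing sketch above; the vectorizer builds the vocabulary
    # and produces one sparse count vector per tweet
    bow_vectorizer = CountVectorizer(analyzer=clean_text)
    X_bow = bow_vectorizer.fit_transform(train_df["text"])
    print(X_bow.shape)  # (number of tweets, vocabulary size)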

Using Pipelines in Machine Learning

To complete this ML task, the text must first be processed and converted into a BoW model before it is fed into an ML classifier. While the workflow for this project is relatively simple, a more complex project may have a larger workflow for preparing the data and sending it to the ML model. To streamline and automate this workflow, a pipeline can be created. SciKit Learn provides a pipeline package for easily creating a workflow pipeline that can then be treated as a SciKit Learn ML model. This allows text processing and ML to be automated within a single object instance.
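A sketch of such a pipeline, chaining the CountVectorizer and the Multinomial Naïve Bayes classifier; the step names "bow" and "classifier" are arbitrary labels chosen here.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

    # Vectorization and classification chained into a single estimator
    nlp_pipeline = Pipeline([
        ("bow", CountVectorizer(analyzer=clean_text)),  # clean_text from the earlier sketch
        ("classifier", MultinomialNB()),
    ])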

Once the training dataset has been split into a training set and a validation set, the pipeline object can be fit to the training set just like a SciKit Learn ML model. This automatically processes the text into a BoW model and then trains the Naïve Bayes ML model on the data passed to it. Once trained, the pipeline can be used to make predictions on new data, such as the validation data that was set aside.
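A sketch of that split, fit, and predict sequence; the 80/20 split and the random seed are illustrative choices.

    from sklearn.model_selection import train_test_split

    # Hold out a validation set from the labeled training data
    X_train, X_val, y_train, y_val = train_test_split(
        train_df["text"], train_df["target"], test_size=0.2, random_state=42)

    nlp_pipeline.fit(X_train, y_train)             # vectorize and train in one call
    val_predictions = nlp_pipeline.predict(X_val)  # predictions on the held-out tweets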

Measuring Performance and Submitting Results

Figure 7 - A brief report on classifier performance from SciKit Learn.

Once the pipeline is trained and a set of predictions has been made on the validation data, the performance of the model can be examined by comparing those predictions to the actual target labels of the validation data. SciKit Learn provides a useful classification report summary that covers some of the most common performance metrics, such as precision, recall, and f1 score. The Disaster Tweets competition judges submissions on f1 score, so the ideal model is the one that achieves the best f1 score. As shown in Figure 7, the chosen model for this project consistently produced a weighted average f1 score of 0.79.
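A report like Figure 7 can be generated directly from the validation predictions, continuing with the names used in the sketches above.

    from sklearn.metrics import classification_report, f1_score

    # Compare validation predictions to the true labels
    print(classification_report(y_val, val_predictions))
    print("Weighted f1:", f1_score(y_val, val_predictions, average="weighted"))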

Another useful way that performance can be analyzed for a classification task is the confusion matrix, which provides a summary look at how records were classified. A confusion matrix will provide the information needed to calculate a large number of performance metrics, including the metrics shown in Figure 7. Seaborn provides an elegant way to visualize a confusion matrix using a heatmap. This heatmap, which can be seen in Figure 8, can become extremely useful when the number of target classes increases above two classes, as the colors of the map can bring attention to potential insights into classifier performance. The confusion matrix along with the classification report can provide significant insight into the performance of a ML model.
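A heatmap like Figure 8 can be drawn with Seaborn from the SciKit Learn confusion matrix; the tick labels assume class 0 = non-disaster and class 1 = disaster.

    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.metrics import confusion_matrix

    # Confusion matrix on the validation predictions, visualized as a heatmap
    cm = confusion_matrix(y_val, val_predictions)
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
                xticklabels=["non-disaster", "disaster"],
                yticklabels=["non-disaster", "disaster"])
    plt.xlabel("Predicted label")
    plt.ylabel("True label")
    plt.show()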

When the data analyst is happy with the performance of the ML model, the next step is to create predictions from the test data and format them as a submission to be scored for the Kaggle competition. The pipeline object can be used once again to automatically process the test data and produce predictions with the trained pipeline and model. Once the predictions are complete, they can be formatted into the submission format required by the competition. The chosen model for this project produced a consistent score of 0.79 for its submissions (the model score can be verified on Kaggle.com).
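A sketch of producing the submission file, assuming the competition's default test.csv and its id/target submission columns.

    import pandas as pd

    # Predict on the unlabeled test set with the trained pipeline
    test_df = pd.read_csv("test.csv")
    test_predictions = nlp_pipeline.predict(test_df["text"])

    # Write the two-column submission file expected by the competition
    submission = pd.DataFrame({"id": test_df["id"], "target": test_predictions})
    submission.to_csv("submission.csv", index=False)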

Figure 8 - A confusion matrix heatmap showing the trained pipeline performance.

Further Research Suggestions

Due to the vast complexity of ML and NLP, there are many potential routes for further research that could improve the performance of this project's model. ML is an ever-evolving field, so staying up-to-date through research and learning is vital for success. The following suggestions are just a handful of areas that could be explored to improve this and other NLP projects.

Three potentially useful steps in NLP are spelling correction, word stemming, and Term Frequency – Inverse Document Frequency (TF-IDF) analysis. Spelling correction is self-explanatory and would improve the overall quality of the dataset by replacing commonly misspelled words with their correct spellings. Word stemming is the process of converting words to their base form, so the words “Running”, “Runs”, and “Ran” could all potentially be reduced to “Run”. This would further shrink the feature space and could improve model performance by reducing the variety of words in the corpus vocabulary. TF-IDF analysis is a statistical method for evaluating the importance of a specific word within a corpus. Term frequency refers to how often a specific word occurs within a document. Inverse document frequency helps measure the importance of a word by giving more weight to words that appear in fewer documents, taking the inverse of the fraction of documents the term occurs in. Because unique words that occur extremely infrequently can be important, TF-IDF provides a way to weight those unique words more heavily. These three processes are all potential steps that could be used to improve an NLP ML model.
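As a brief sketch of two of these ideas, nltk provides a Porter stemmer and SciKit Learn provides a TfidfVectorizer that could replace the raw count vectorizer in the pipeline. Note that a plain stemmer maps “running” and “runs” to “run”, but an irregular form like “ran” generally requires lemmatization instead.

    from nltk.stem import PorterStemmer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

    # Word stemming with nltk's Porter stemmer
    stemmer = PorterStemmer()
    print([stemmer.stem(w) for w in ["running", "runs", "ran"]])  # ['run', 'run', 'ran']

    # TF-IDF weighted features could replace raw counts in the earlier pipeline
    tfidf_pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(analyzer=clean_text)),  # clean_text from the earlier sketch
        ("classifier", MultinomialNB()),
    ])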

One of the challenges of ML is tuning the model once a model has been chosen. Hyperparameter tuning is a useful process in ML that helps identify the best hyperparameter values for the particular model. For example, the pipeline created for this project uses the CountVectorizer function, which has many parameters that can be adjusted, along with the Multinomial Naïve Bayes model, which (among other parameters) has an alpha parameter used for smoothing during training. Instead of manually testing each combination of hyperparameters, functions such as SciKit Learn's GridSearchCV can be used to test sets of hyperparameters. The GridSearchCV function was used during research on this project for hyperparameter tuning, leading to no feature maximum being selected for the CountVectorizer and an alpha coefficient of 1.1 for the Multinomial Naïve Bayes classifier. For brevity, hyperparameter tuning with GridSearchCV is not covered in detail in this project. However, it is important to note that automating hyperparameter tuning provides a robust way to tune ML models without requiring much manual work.
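A sketch of such a grid search over the pipeline defined earlier; the candidate values shown are illustrative, and the "bow__" and "classifier__" prefixes refer to the step names used in the pipeline sketch.

    from sklearn.model_selection import GridSearchCV

    # Candidate hyperparameters for the vectorizer and the classifier
    param_grid = {
        "bow__max_features": [None, 5000, 10000],
        "classifier__alpha": [0.5, 1.0, 1.1, 1.5],
    }

    # Cross-validated search scored on f1, matching the competition metric
    search = GridSearchCV(nlp_pipeline, param_grid, scoring="f1", cv=5)
    search.fit(train_df["text"], train_df["target"])
    print(search.best_params_, search.best_score_)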

Finally, while the Multinomial Naïve Bayes ML model was utilized in this project, there are many more ML models that could be used for this task, including some that would potentially be better suited to it. As noted at the beginning of this section, continuous research is vital to the success and performance of ML projects. Additional research could be completed to implement a better classifier, or better text processing, to improve performance on the NLP task at hand. Learning, continuing education, and research should be among the main focuses of a data analyst.

Conclusion

Data mining, NLP, and ML are complex topics with both great breadth and great depth. Building an effective NLP ML model requires considerable analysis and understanding of both the dataset and the ML process. Effective exploratory data analysis can highlight valuable insights into the dataset being mined. In the case of NLP, this exploratory data analysis provides a look at the data cleansing operations that may need to be completed before sending the corpus to the ML model. Another vital part of the ML process is continued research and education. Data analysts and data scientists must constantly be learning to stay at the forefront of knowledge to be most effective. To that end, this project represented a good case study for beginning research in NLP.

Below are links to the Python Jupyter notebooks and datasets used for this project: