Introduction to NLP Series: Preprocessing Phase

Eric Gustavo Romano
5 min read · Sep 12, 2021

In this blog post we will explore the technical side of my previous blog post, where I gave a brief introduction to the importance of text analytics and some preliminary analysis. I will cover the preprocessing steps I used to develop a text classification model.

Tokenization

To gain insights from your text data, there are methods used to understand the properties of each document found within it. One common task that is routinely performed in Natural Language Processing is tokenization. Tokenization separates a piece of text into smaller segments, called tokens, which can be words, combinations of words, characters, or sentences. In essence, tokenization breaks text down into whatever format you wish to analyze it in. There is much more to learn about tokenization, but I will take a deeper dive into the topic in another blog.
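
If you want to see tokenization in action, here is a minimal sketch using NLTK’s word_tokenize (the sample sentence is made up, and the punkt tokenizer models need to be downloaded first):

from nltk.tokenize import word_tokenize  # pip install nltk

text = "A massive wildfire broke out near the coast last night."
tokens = word_tokenize(text)
print(tokens)
# ['A', 'massive', 'wildfire', 'broke', 'out', 'near', 'the', 'coast', 'last', 'night', '.']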

I utilized tokenization to understand the distribution of unique words found in each target corpus. However, before I move on to exploring the distribution, I need to clean my text data and remove any portions that will not provide any additional insights.

Lowercasing, Removal of Punctuation and Stopwords

The next step I took after tokenization was cleaning my text data by removing certain elements that can introduce noise.

Casing can introduce noise in your text data by potentially allowing the machine to count the same word as two different words because of different casing. For example, let’s say a certain text document contains the word “remove” a few times, but formatted like so:

“Remove”, “rEmove”, “REMOVE”, “remove”

All of these forms might carry some meaning in why they are cased differently, but in most cases the casing does not provide additional useful information. This type of noise can be removed by lowercasing all of your text data.
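
As a quick sketch in Python, lowercasing collapses all of these variants into a single token:

words = ["Remove", "rEmove", "REMOVE", "remove"]
normalized = [w.lower() for w in words]
print(set(normalized))  # {'remove'}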

Another element that can produce noise in your text data is punctuation. You need to be careful with this one, because you first need to figure out whether all punctuation is noise or whether some characters are vital in helping you classify a target class. For example, the punctuation “!” might help capture an emotional response like anger. After my initial exploratory analysis, I found that most of the punctuation was not adding insights that would help when modeling. For this reason, treating it as noise and removing it is appropriate.
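
A minimal sketch of stripping punctuation with Python’s built-in string module (if a character like “!” does carry signal for your classes, you would leave it out of the removal set):

import string

text = "Help!!! The fire is spreading, fast..."
cleaned = text.translate(str.maketrans("", "", string.punctuation))
print(cleaned)  # 'Help The fire is spreading fast'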

Finally, the core of most of the noise in text data comes straight from the words we use most often. Common words that do not add any additional context to the meaning behind the text are known as stopwords. For example, let’s say I give you the following list of words:

[ “the”, “was”, “are”, “a”, “have”]

From just this list you have no clue what the meaning behind it might be. Now let me add just one word, and see how quickly you can start to visualize the meaning behind the list.

[ “the”, “was”, “are”, “a”, “have”, “fire”]

You guessed it: if you were thinking the list of words is trying to describe an event that has something to do with fire, then you’re right. Notice how I only needed to provide one word for you to gain insight into the meaning of the list. Removing these stopwords allows you to reduce the presence of noise.
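
A minimal sketch of stopword removal, assuming NLTK’s built-in English stopword list:

from nltk.corpus import stopwords  # requires nltk.download('stopwords')

stop_words = set(stopwords.words("english"))
tokens = ["the", "was", "are", "a", "have", "fire"]
content = [t for t in tokens if t not in stop_words]
print(content)  # ['fire']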

Stemming and Lemmatization

After tokenization and cleaning, the following two techniques can be applied to the tokenized text data: stemming and lemmatization. If you are new to this space, these two techniques might seem to be the same. Both are meant to chop a word down to its simplest form, known as its stem or root. Stemming is a technique that reduces words to morphological variants of a root word. For example, a stemming algorithm might reduce the words “Coding”, “Codes”, and “Coded” to “Cod”.
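
Here is a minimal sketch comparing two common NLTK stemmers; the exact stem you get depends on the algorithm, with the Lancaster stemmer generally cutting more aggressively than Porter:

from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()
for word in ["coding", "codes", "coded"]:
    # the two algorithms can disagree on how much of the word to chop off
    print(word, porter.stem(word), lancaster.stem(word))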

This technique can be useful in reducing the dimensionality of your text data, but there are two types of errors that can occur in stemming. They should not be taken lightly, as they can point you in the wrong direction during analysis.

Over-stemming: reducing two different words to the same root word when they should belong to two separate root words.

Under-stemming: reducing two words to two separate root words when they should share a single root word.

Lemmatization is a process that uses context to cut a word down to its root form. This technique allows you to avoid the over-stemming and under-stemming problems. For example, lemmatization will correctly reduce “coding” to its base form “code”, whereas stemming could return “cod”. That seems a little fishy to me, so I will stick with a lemmatization approach.
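
A minimal sketch using NLTK’s WordNet lemmatizer; note that it needs a part-of-speech hint as context (here pos="v" marks the word as a verb):

from nltk.stem import WordNetLemmatizer  # requires nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("coding", pos="v"))
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'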

Frequency Distribution

In the world of analysis, we love to explore the distribution within our data, and you can find insights by examining the distribution within text data as well. One way of doing this is by exploring the frequency of the unique words found in your corpus. By understanding the term frequency distributions, you can start to build a mental map of your corpus.

The first thing we need to do is create a corpus for each class.
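
As a toy sketch of what building one flat token list per class can look like (the DataFrame and column names below are assumptions for illustration, not the project’s actual ones):

import pandas as pd

# stand-in for the labeled dataset: cleaned tokens plus a binary target
# (1 = disaster, 0 = not a disaster)
df = pd.DataFrame({
    "tokens": [["forest", "fire", "spreading"], ["new", "coffee", "shop", "open"]],
    "target": [1, 0],
})
corpus_disaster = [t for row in df.loc[df["target"] == 1, "tokens"] for t in row]
corpus_other = [t for row in df.loc[df["target"] == 0, "tokens"] for t in row]
print(corpus_disaster)  # ['forest', 'fire', 'spreading']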

Now we can examine the frequency distribution of the words found in our disaster-class corpus.
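
NLTK’s FreqDist makes this straightforward; continuing from the toy corpus above:

from nltk import FreqDist

fdist1 = FreqDist(corpus_disaster)  # word -> count over the disaster corpus
print(fdist1.most_common(10))       # most frequent (word, count) pairs
fdist1.plot(30)                     # plot of the top 30 terms (needs matplotlib)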

Word Cloud

When delivering visuals, it is important to always consider how your audience will experience the message you are trying to deliver. Most of the time this means the best visual is the simplest one that delivers your message without introducing any fluff. Although a frequency distribution plot is a great way to display the distribution of unique words found in your corpus, it’s not the best.

That’s where a word cloud comes into action. It’s visually appealing, customizable, and simply gets to the point.

From these two images, try to select the one that represents the words found in my Disaster class corpus.

generatewordcloud(corpus=fdist0, types='fit_words', cmap='viridis')
generatewordcloud(corpus=fdist1, types='fit_words', cmap='viridis')

It’s quite easy to pick which word cloud represents my disaster class.
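
The full generatewordcloud helper I wrote lives with the rest of the project code on my GitHub; as a minimal sketch, a function along these lines can be built on the wordcloud and matplotlib libraries (the argument handling below is assumed for illustration, not the project’s exact code):

import matplotlib.pyplot as plt
from wordcloud import WordCloud

def generatewordcloud(corpus, types="fit_words", cmap="viridis"):
    # `corpus` is a frequency distribution (e.g. an nltk FreqDist),
    # which behaves like a {word: count} dictionary
    wc = WordCloud(width=800, height=400, background_color="white", colormap=cmap)
    if types == "fit_words":
        wc.fit_words(dict(corpus))      # build the cloud from word frequencies
    else:
        wc.generate(" ".join(corpus))   # fall back to raw text input
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.show()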

If you want to see the entire code for this project, check out my GitHub, where you can find this project and many more.


Eric Gustavo Romano

Hi! My name is Eric Gustavo Romano. I am a data science enthusiast and practitioner located in Jersey.