Essential Techniques for Text Pre-Processing
Tokenization – The process of dividing a given text into words and sentences, called tokens. Before any text analytics procedure (whether classification or generation), the text needs to be divided into smaller linguistic units such as words, numbers, punctuation marks, and alphanumeric strings. It is a true pre-processing step: without separate units extracted from the text, no analysis or generation task is possible.
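As a minimal sketch of tokenization, the snippet below splits text into word, number, and punctuation tokens with a regular expression, and splits sentences naively at end punctuation. The pattern and the function names are illustrative assumptions; production tokenizers handle many more cases (abbreviations, URLs, other scripts).

```python
import re

# Decimal numbers first, then word-like tokens (letters/digits, optional
# apostrophe part such as "don't"), then any single punctuation character.
TOKEN_PATTERN = re.compile(r"\d+\.\d+|\w+(?:'\w+)?|[^\w\s]")


def tokenize_words(text):
    """Return the list of word, number, and punctuation tokens in text."""
    return TOKEN_PATTERN.findall(text)


def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


if __name__ == "__main__":
    print(tokenize_words("Don't pay $3.50, B2B!"))
    print(split_sentences("Hello world. How are you? Fine!"))
```

The sentence splitter will break on abbreviations such as "e.g.", which is one reason library tokenizers are usually preferred in practice.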
Removing unnecessary punctuation and tags – A simple but much-needed pre-processing technique. As the title suggests, this step strips punctuation marks and markup tags from the text. Libraries that perform this task in a few lines of code are available for almost every programming language, and the step is essential for turning the given text into a cleaner form.
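A rough sketch of this cleaning step is shown below. It assumes simple HTML-like tags and ASCII punctuation; the function name `clean_text` is illustrative, and a real HTML parser is a safer choice for messy markup.

```python
import re
import string

# Anything that looks like <tag> or </tag>.
TAG_PATTERN = re.compile(r"<[^>]+>")
# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(raw):
    """Strip tags, drop ASCII punctuation, and collapse whitespace."""
    no_tags = TAG_PATTERN.sub(" ", raw)
    no_punct = no_tags.translate(PUNCT_TABLE)
    return " ".join(no_punct.split())


if __name__ == "__main__":
    print(clean_text("<p>Hello, <b>world</b>!</p>"))  # Hello world
```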
Removing stop words – Stop words are words that have little or no significance for the analysis of a sentence as a whole. How and where they are removed depends on the task –
- Supervised Machine Learning – Removing stop words from feature space.
- Clustering – Removing stop words prior to generating clusters.
- Information Retrieval – Preventing stop words from being indexed.
- Text Summarization – Excluding stop words from contributing to summarization scores & removing stop words when computing ROUGE scores.
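Whatever the task, the removal itself usually comes down to filtering tokens against a list. The sketch below uses a tiny hand-written stop word set purely for illustration; real lists, such as those shipped with NLP libraries, are much longer.

```python
# Illustrative stop word set (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "and", "to", "it"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [tok for tok in tokens if tok.lower() not in stop_words]


if __name__ == "__main__":
    print(remove_stop_words(["The", "cat", "is", "on", "the", "mat"]))  # ['cat', 'mat']
```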
Stemming – Usually refers to a crude heuristic process that chops off the ends of words in the hope of reducing related word forms to a common base form correctly most of the time, and often includes the removal of derivational affixes. The most common algorithm for stemming English, and one that has repeatedly been shown to be empirically very effective, is Porter’s algorithm, on which the very popular Porter stemmer is based. Other stemmers, such as the Lovins stemmer and the Paice stemmer, can be used for the same purpose. Libraries that directly support stemming are available in many programming languages.
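To illustrate what "chopping off the ends of words" means, here is a deliberately crude suffix stripper. It is not Porter’s algorithm, which applies several ordered rule phases with conditions on the remaining stem; the suffix list and the `min_stem_len` parameter are assumptions made for this sketch.

```python
# Suffixes tried in order; longer ones come before the bare "s".
SUFFIXES = ("ing", "ed", "es", "ly", "s")


def crude_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for w in ["jumping", "jumped", "jumps", "running", "is"]:
        print(w, "->", crude_stem(w))
```

Note that "running" becomes "runn" here, which shows why stemming is called a heuristic: the result need not be a dictionary word.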
Lemmatization – Usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. A lemmatizer is a Natural Language Processing tool that does full morphological analysis to accurately identify the lemma of each word. For retrieval, full morphological analysis produces at most very modest benefits. It is hard to say more, because either form of normalization (stemming or lemmatization) tends not to improve English information retrieval performance in aggregate, at least not by very much: while it helps a lot for some queries, it hurts performance equally for others. Stemming, for example, increases recall while harming precision. Many libraries already provide a pre-built function for lemmatization.
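The contrast with stemming can be sketched with a lookup keyed on the word and its part of speech. The table below is a tiny hand-written, hypothetical example; a real lemmatizer relies on a full vocabulary and morphological rules rather than a handful of entries.

```python
# Hypothetical (word, part-of-speech) -> lemma entries for illustration only.
LEMMA_TABLE = {
    ("better", "ADJ"): "good",
    ("ran", "VERB"): "run",
    ("mice", "NOUN"): "mouse",
    ("was", "VERB"): "be",
}


def lemmatize(word, pos):
    """Return the dictionary form if known, else the lowercased word."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())


if __name__ == "__main__":
    print(lemmatize("Mice", "NOUN"))   # mouse
    print(lemmatize("better", "ADJ"))  # good
```

Unlike a suffix stripper, the lookup returns real dictionary forms ("mice" to "mouse"), which no amount of chopping could produce.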
Essential Techniques for Text Classification
Word Embeddings – Word embeddings are used to capture the context of a word in a text. With their help, syntactic and semantic similarity and the relations between words can be established. But what exactly are word embeddings? They are vectors that represent particular words. The mechanism is based on the idea of distributed representations: as stated above, relationships are established by making the representation of one word depend on other words. Word2vec, a neural network approach for learning such embeddings, has become very popular. It is based on the Skip-gram and Continuous Bag of Words models, which are discussed below. Before going into those details, note that two types of word embedding approaches are mainly used, although others exist.
These approaches are –
- Frequency-based embedding – Count Vector, TF-IDF vectorization
- Prediction based embedding – Skip Gram model, Continuous Bag of words
Count Vector – This approach works in two phases: first it learns a vocabulary from all the provided text, and then it represents each document by counting the number of times each word appears. Stop words should normally be removed before the count vector is applied.
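The two phases can be sketched in a few lines. The function names are illustrative, and tokens are obtained by a plain whitespace split, which is an assumption made to keep the example short.

```python
def fit_vocabulary(documents):
    """Phase 1: map every distinct lowercase token to a column index."""
    vocab = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: idx for idx, tok in enumerate(vocab)}


def count_vector(document, vocabulary):
    """Phase 2: count how often each vocabulary word occurs in the document."""
    vector = [0] * len(vocabulary)
    for tok in document.lower().split():
        idx = vocabulary.get(tok)
        if idx is not None:  # words outside the vocabulary are ignored
            vector[idx] += 1
    return vector


if __name__ == "__main__":
    docs = ["the cat sat", "the cat saw the dog"]
    vocab = fit_vocabulary(docs)
    print(vocab)
    print([count_vector(d, vocab) for d in docs])
```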
TF-IDF vectorization – TF-IDF re-weights the count features into floating-point values, so that terms which are rarer yet more interesting stand out, and these values can then be used by the classifier. The weight of a word is calculated by considering its occurrence not only in a single document but in the entire corpus. TF-IDF means term frequency (TF) times inverse document frequency (IDF), which can be represented mathematically using the following formulas –
- TF = (Number of times term t appears in a document) / (Number of terms in the document).
- IDF = Log(N/n), where N is the total number of documents and n is the number of documents a term t has appeared in.
- TF-IDF(t,document) = TF (t,document) * IDF (t)
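The three formulas above translate almost directly into code. The sketch below assumes documents are already tokenized into lists and uses the natural logarithm; libraries often apply smoothed variants of IDF, so their numbers can differ.

```python
import math


def tf(term, doc_tokens):
    """Occurrences of term divided by the number of terms in the document."""
    return doc_tokens.count(term) / len(doc_tokens)


def idf(term, corpus):
    """log(N / n), with n the number of documents that contain term."""
    n = sum(1 for doc in corpus if term in doc)
    if n == 0:  # term never seen: treated as weight 0 in this sketch
        return 0.0
    return math.log(len(corpus) / n)


def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)


if __name__ == "__main__":
    corpus = [
        ["the", "cat", "sat"],
        ["the", "dog", "barked"],
        ["the", "cat", "ran", "away"],
    ]
    print(tf_idf("the", corpus[0], corpus))  # 0.0, "the" is in every document
    print(tf_idf("dog", corpus[1], corpus))
```

A word that appears in every document gets an IDF of log(1) = 0, which is how TF-IDF suppresses very common terms.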
Continuous Bag of Words (CBOW) – A model that learns to predict a word from its context. The context for a given target word may be a single word or multiple words. It is a neural network-based approach consisting of three layers: an input layer, a hidden layer, and an output layer. There are two sets of weights: those between the input layer and the hidden layer, and those between the hidden layer and the output layer. In simple terms, the one-hot representation of the context at the input layer produces the hidden-layer output, from which the output layer computes scores that are converted to probabilities using the softmax function. Cross-entropy is used to calculate the loss.
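A forward pass of this network can be sketched in plain Python. The matrix names `w_in` and `w_out`, and the choice to average the context rows, are assumptions of this illustration; training (backpropagation and weight updates) is left out.

```python
import math


def softmax(scores):
    """Convert raw scores to probabilities (shifted by max for stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cbow_forward(context_ids, target_id, w_in, w_out):
    """w_in is V x H (input to hidden), w_out is H x V (hidden to output)."""
    hidden_size = len(w_in[0])
    vocab_size = len(w_out[0])
    # A one-hot input times w_in selects one row; average over context words.
    hidden = [
        sum(w_in[c][j] for c in context_ids) / len(context_ids)
        for j in range(hidden_size)
    ]
    scores = [
        sum(hidden[j] * w_out[j][v] for j in range(hidden_size))
        for v in range(vocab_size)
    ]
    probs = softmax(scores)
    loss = -math.log(probs[target_id])  # cross-entropy against the target word
    return probs, loss


if __name__ == "__main__":
    w_in = [[0.1, 0.2], [0.0, -0.1], [0.3, 0.1], [-0.2, 0.4]]  # V=4, H=2
    w_out = [[0.2, -0.1, 0.0, 0.3], [0.1, 0.1, -0.3, 0.0]]     # H=2, V=4
    print(cbow_forward([0, 2], 1, w_in, w_out))
```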
Skip-gram model – This model is trained to generate the surrounding context words of a given word. It reverses the roles of target and context words: Skip-gram takes a word and predicts the context words from it. Its structure is the same as that of the CBOW model, only with the direction of prediction reversed.
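The training examples for Skip-gram are (target, context) pairs drawn from a sliding window. The sketch below generates them; the `window` parameter and the function name are illustrative assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every word within the window."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


if __name__ == "__main__":
    print(skipgram_pairs(["a", "b", "c"], window=1))
    # [('a', 'b'), ('b', 'a'), ('b', 'c'), ('c', 'b')]
```

For CBOW the same windows are used, but all context words of a position are grouped together to predict the single target.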
GloVe – GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. So what is the difference between the word2vec model and the GloVe model? The first and major difference is that word2vec is a predictive model while GloVe is a count-based model. GloVe thus combines the property of the skip-gram model when it comes to word analogy tasks with the advantages of matrix factorization methods that can exploit global statistical information.
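The "global word-word co-occurrence statistics" can be illustrated with a small counting routine. This sketch only builds the counts that a GloVe-style model is trained on, not the embeddings themselves; the symmetric window and the 1/distance weighting are choices made for this illustration.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window=2):
    """Accumulate symmetric, distance-weighted co-occurrence counts."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), i):
                weight = 1.0 / (i - j)  # closer pairs count more
                counts[(word, tokens[j])] += weight
                counts[(tokens[j], word)] += weight
    return dict(counts)


if __name__ == "__main__":
    stats = cooccurrence_counts([["ice", "is", "cold"]], window=2)
    for pair, value in sorted(stats.items()):
        print(pair, value)
```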
Machine Learning for Text Analysis
Now comes the modeling part, which is the core of any procedure, whether it is sentiment analysis, text analytics, spam detection, or simple text classification. Better software requires a better algorithm, and excellent software requires an excellent choice of algorithm. So a great deal depends on the choice of algorithm and text analytics technique.
The choice of algorithm depends on several things and, to be specific, directly on the use case. In general, though, there are some parameters to keep in mind while selecting an algorithm for a specific task. Accuracy is of course one of the prime parameters, but in this age of enormous data it is not the only one: space complexity and time complexity should also be considered. So there are three parameters on the basis of which the selection of each algorithm, and the what, why, and where of that algorithm, is discussed. These three parameters are –
- Accuracy (quality of the predictions)
- Space (in terms of memory)
- Time (time taken to make a model workable)
Multinomial Naive Bayes – As the name suggests, this algorithm is based on the well-known probability theory of Naive Bayes. It belongs to the family of Naive Bayes algorithms, which also includes Gaussian Naive Bayes and Bernoulli Naive Bayes as its siblings. The major difference between them is that Gaussian is used for data in which continuous values are associated with the features, Bernoulli is used where the associated values are boolean, and Multinomial is used where feature vectors represent the frequencies with which certain events have been generated by a multinomial distribution.
This is the event model typically used for document classification, so this Naive Bayes algorithm is well suited to text analytics. In terms of accuracy, Multinomial Naive Bayes can fall slightly behind SVM, regression, and other machine learning techniques, but accuracy also depends on how well the data is pre-processed and on proper feature engineering. Because of its simple structure, it can beat other machine learning algorithms on speed irrespective of the size of the data, which means it can provide moderate accuracy with good speed on large as well as small datasets.
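As a rough illustration of the multinomial event model, the sketch below implements a small classifier with add-one (Laplace) smoothing. The class name `SimpleMultinomialNB`, the `alpha` parameter, and the toy spam/ham data are assumptions for this example; library implementations are more efficient and more careful.

```python
import math
from collections import Counter


class SimpleMultinomialNB:
    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing constant (1.0 = Laplace smoothing)

    def fit(self, docs, labels):
        """docs is a list of token lists, labels the class of each document."""
        self.classes = sorted(set(labels))
        self.vocab = {tok for doc in docs for tok in doc}
        self.class_counts = Counter(labels)
        self.n_docs = len(docs)
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc)
        self.total_words = {c: sum(self.word_counts[c].values()) for c in self.classes}
        return self

    def log_posterior(self, doc, c):
        """Unnormalized log P(c | doc) = log prior + sum of log likelihoods."""
        score = math.log(self.class_counts[c] / self.n_docs)
        denom = self.total_words[c] + self.alpha * len(self.vocab)
        for tok in doc:
            if tok in self.vocab:  # unseen words are skipped
                score += math.log((self.word_counts[c][tok] + self.alpha) / denom)
        return score

    def predict(self, doc):
        return max(self.classes, key=lambda c: self.log_posterior(doc, c))


if __name__ == "__main__":
    docs = [["free", "prize", "now"], ["win", "free", "cash"],
            ["meeting", "at", "noon"], ["lunch", "meeting", "today"]]
    labels = ["spam", "spam", "ham", "ham"]
    model = SimpleMultinomialNB().fit(docs, labels)
    print(model.predict(["free", "cash", "prize"]))  # spam
```

Training is a single counting pass over the data, which is where the speed advantage described above comes from.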
Linear Support Vector Machine – The beauty of the support vector machine is that it can handle continuous as well as categorical targets. For continuous values the regression version of the support vector machine can be used, and for categorical values the support vector machine can be used as a classifier. Since this document is about text analytics, only the categorical case is considered. Training an SVM is a convex optimization problem, so the solution is guaranteed to be a global rather than a local minimum, which is a great advantage. Tuning is also relatively easy because an SVM has few hyperparameters.
In terms of accuracy, it can give good results, but it requires data that is noiseless or contains very little noise. One catch is that a linear SVM needs a single fixed-length vector per document, so word2vec embeddings cannot be fed in directly; they must first be aggregated (for example by averaging), which tends to discard information. It is therefore usually better to use bag-of-words approaches with an SVM to gain good accuracy. Compared with Naive Bayes, the structure of an SVM is complex; as a result, it takes longer to train and the model takes more space to store.
In one line, SVM is a good choice when better accuracy is needed on a smaller dataset, in exchange for proper feature engineering and pre-processing. With a large amount of data, however, it may again fall behind in accuracy.
Logistic Regression – Yes, with some tricks logistic regression can also be used for text data, which is generally categorical in nature. In industry it is a trend to rely on logistic regression because its predictive power is considered greater than that of purely classification-based solutions. But it has its cons too: it does not perform well with a large number of categorical features or variables, and loses accuracy. So, in general, it cannot handle data where the feature space is too large. To cope with such conditions, feature selection and feature reduction become necessary for logistic regression.
Comparing on the three parameters (accuracy, space complexity, and time complexity): it gives fine accuracy on data with a small feature space or with good feature engineering, and in both cases the extra work costs time. It also takes space, and it cannot handle a large amount of data very well.
Deep Learning for Text Analytics
So far, pre-processing techniques, feature engineering techniques, and machine learning models for text analytics have been covered. When developing a model at the enterprise level, however, these techniques work only some of the time. Most of the time something more is needed to scale up and to handle enormous amounts of data, and this is where deep learning techniques come in. A large number of techniques are available for text analytics, and which one should be used depends on the use case. Below is an overview of some of them, again written around the what, why, and where, and again considering the three prime parameters: accuracy, time complexity, and space complexity.
fastText – Let’s start with the simple one. fastText is based on the paper Bag of Tricks for Efficient Text Classification. It takes word embeddings as pre-processing: after embedding each word in the sentence, the word representations are averaged into a text representation, which is in turn fed to a linear classifier. The softmax function computes the probability distribution over the predefined classes, and cross-entropy is used to compute the loss. A bag-of-words representation does not consider word order, so n-gram features are used to capture some partial information about the local word order. When the number of classes is large, computing the linear classifier is computationally expensive, so hierarchical softmax is used to speed up training.
That is a brief description of what fastText is. Now let’s understand where it can be used: mainly for two purposes, word representations and text classification. But when word vectors (word2vec) already exist, why should fastText be used? The difference is that word2vec treats every single word as the smallest unit whose vector representation is to be found, whereas fastText assumes a word to be formed by character n-grams. For example, sunny is composed of [sun, sunn, sunny], [sunny, unny, nny], etc., where n could range from 1 to the length of the word. This representation gives fastText benefits over word2vec or GloVe, notably for rare and out-of-vocabulary words, whose vectors can be composed from their character n-grams.
Another advantage of fastText is that it is also available as a Python library, which makes it simple to use. Just install it and it is ready to use with a list of pre-defined functions. Word embeddings can be created with either the Skip-gram or the CBOW approach, for which the library also contains pre-defined functions.
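To make the character n-gram idea concrete, independent of the fastText library itself, here is a small helper that lists the n-grams of a word. The function name and the `min_n`/`max_n` parameters are illustrative assumptions; the actual library additionally wraps each word in boundary markers before extracting n-grams.

```python
def char_ngrams(word, min_n=3, max_n=5):
    """Return all character n-grams of word for min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, min(max_n, len(word)) + 1):
        for start in range(len(word) - n + 1):
            grams.append(word[start:start + n])
    return grams


if __name__ == "__main__":
    print(char_ngrams("sunny"))
    # ['sun', 'unn', 'nny', 'sunn', 'unny', 'sunny']
```

Because an unseen word such as "sunnier" shares n-grams like "sun" and "unn" with "sunny", a vector can still be assembled for it from the n-gram vectors.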
Now it is time to understand the why of fastText. First, it is very fast; with a limited amount of data it can beat several very popular models such as TextCNN. Second, sentence vector representations can be computed easily with it. fastText thus works better on small datasets than gensim-style word embedding approaches such as word2vec, but problems arise with large datasets. In terms of accuracy, it performs well on small datasets but loses accuracy on large ones. In terms of time complexity, it can beat almost any model. In terms of space complexity, it needs little memory, but it falls short on large datasets.
TextCNN – CNNs have already proved their worth and capabilities in the field of computer vision, but they can also be used for text analytics tasks, especially text classification. The implementation of TextCNN is based on the paper Convolutional Neural Networks for Sentence Classification. The layer structure of TextCNN is described below –
- Embedding layer
- Conv layer
- max-pooling layer
- Fully connected layer
- Softmax function layer
Sentence lengths differ from one sentence to another, so padding is used to obtain a fixed length, n. For each token in the sentence, a word embedding gives a vector of fixed dimension, d. The input is therefore a 2-dimensional matrix of shape (n, d), similar to an image for a CNN.
The first step is a convolution over the input: an element-wise multiplication between a filter and a part of the input, followed by a sum. With k filters, each a 2-dimensional matrix of shape (f, d), the output is k lists, each of length n-f+1, and each element is a scalar. Notice that the second dimension of a filter is always the dimension of the word embedding. Filters of different sizes are used to get rich features from the text input, which is similar to using n-gram features.
The second step is max pooling over the output of the convolution. For k lists, this gives k scalars.
The third step is to concatenate these scalars to form the final features. The result is a fixed-size vector, independent of the sizes of the filters used.
The final step is a linear layer that projects these features onto the predefined labels.
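The convolution, max-pooling, and concatenation steps can be sketched without any deep learning framework. The code below assumes the input has already been padded and embedded into an n x d matrix with n at least as large as every filter height; activations, bias terms, and the final linear layer are omitted, and the function names are illustrative.

```python
def convolve(inputs, filt):
    """Slide an f x d filter over an n x d input; returns n - f + 1 scalars."""
    n, f, d = len(inputs), len(filt), len(filt[0])
    outputs = []
    for start in range(n - f + 1):
        total = 0.0
        for row in range(f):
            for col in range(d):
                # Element-wise multiply, then sum over the whole window.
                total += inputs[start + row][col] * filt[row][col]
        outputs.append(total)
    return outputs


def textcnn_features(inputs, filters):
    """Max-pool each filter's output and concatenate: one scalar per filter."""
    return [max(convolve(inputs, filt)) for filt in filters]


if __name__ == "__main__":
    sentence = [[1, 0], [0, 1], [1, 1], [2, 0]]  # n=4 tokens, d=2 dimensions
    filters = [
        [[1, 0], [0, 1]],          # f=2, looks at 2 tokens at a time
        [[1, 1], [1, 1], [1, 1]],  # f=3, looks at 3 tokens at a time
    ]
    print(textcnn_features(sentence, filters))  # [2.0, 5.0]
```

The feature vector has one entry per filter regardless of the filter heights, which matches the fixed-size property described in the third step.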
That covers what TextCNN is. Now let’s look at where it can be used. TextCNN is well suited to classification tasks because it has a hierarchical structure. It can outperform other techniques (including RNNs) in sentence matching. However, in tasks such as sequence ordering, part-of-speech tagging, and sequence modelling, it lags behind RNNs.
For a CNN, tuning two key parameters plays an important role: hidden size and batch size. Changes in the learning rate, by contrast, have been observed to affect performance relatively smoothly. It is well known that a CNN takes time to train, and storing a CNN model also requires some space. The size of the dataset does not affect the accuracy of a CNN much, though it can increase the training time.
BERT – Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT’s architecture is based on the Transformer encoder and is bidirectional in nature. Using Transformers improves training efficiency and the ability to capture long-distance dependencies compared with a basic recurrent neural network. BERT is a large model: its large variant has a hidden size of 1024, 24 Transformer blocks, and nearly 340M parameters. That is why a pre-trained model is used and then trained further (fine-tuned) to the requirements of the task.
During pre-training, a Masked Language Model (MLM) objective masks a percentage of the input tokens in order to train a deep bidirectional representation. WordPiece embeddings are used for the input. BERT is also pre-trained on Next Sentence Prediction, which makes it suitable for tasks such as question answering and inference, where the relationship between sentences must be understood.
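The masking step can be illustrated with a simplified sketch. The default `mask_prob` of 0.15 and the "[MASK]" placeholder follow the original BERT paper, which is an assumption beyond this text; the full procedure also sometimes keeps a selected token or swaps it for a random one, which is omitted here, and tokens are plain words rather than WordPiece units.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with [MASK].

    Returns (masked_tokens, labels), where labels holds the original token
    at masked positions and None elsewhere.
    """
    rng = random.Random(seed)  # seeded so that runs are repeatable
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(tok)   # the model must predict this token
        else:
            masked.append(tok)
            labels.append(None)  # position not used in the MLM loss
    return masked, labels


if __name__ == "__main__":
    sentence = "the quick brown fox jumps over the lazy dog".split()
    print(mask_tokens(sentence, mask_prob=0.3, seed=1))
```

Because the model sees the unmasked words on both sides of each "[MASK]", predicting the hidden token forces it to use left and right context together.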
To use BERT, a pre-trained model is downloaded and then fine-tuned on use-case-specific data. So far this section has covered the what, why, and how of BERT. As for the where: it can be used for multi-label classification and online predictions and, as stated above, it performs well in tasks that require predicting the relationship between sentences.
TextRNN – When an RNN is considered for a task, the LSTM version is usually chosen automatically over the vanilla RNN. The reason for this selection is simple: LSTM provides memory as an advantage.
The structure of TextRNN is as follows –
Embedding -> bidirectional LSTM -> concat output -> average -> softmax
As for the embedding, word embeddings are considered a good option to use with TextRNN. This is a brief introduction to what TextRNN is. Now let us discuss where it should be used.
It can be used for text analytics tasks that require text classification, and also for tasks that involve text generation, such as chatbots that must generate text to answer questions, or tasks related to prescriptive analytics. Text generation can generally be implemented in two ways: word by word or character by character. Compared with the character-by-character model, the word-by-word model displays lower computational cost and higher accuracy, because it is difficult for a character-level model to capture long-term dependencies and it requires a much larger hidden layer to do so.
The reason why it suits these tasks is simple: it has the capability to memorize, and a model that can memorize the data can generate new data from its memory.
RCNN – So far the CNN and RNN models have been illustrated separately. Now let’s discuss combining them for text analysis. With the advancement of deep learning technologies and computer hardware, it is now possible to combine different techniques to gain the advantages of both. The recurrent convolutional neural network (RCNN) is one of these combinations.
The structure of RCNN is as follows –
Recurrent structure -> max pooling -> fully connected layer + softmax
In these models, the representation of a word in a sentence or paragraph is learned from its left-side context through to its right-side context.
It has the advantage that the network can compose the semantic representation of the text finely. However, CNN outperforms it in some cases, because the max-pooling layer of a CNN is considered more discriminative in capturing contextual information. On the other hand, RCNN takes much less time to train than CNN and RNN.
That is why RCNN can be a good option when training time needs to be kept short.
A Holistic Strategy
Text analytics is the method of transforming unstructured text data into meaningful data for analysis, in order to estimate customer views, product evaluations, and feedback. To adopt this approach, we recommend taking the following steps –