Essential Techniques for Text Pre-Processing
Tokenization – The process of dividing the given text into tokens: words, sentences, or other units. Before any text-analytics procedure (whether classification or generation), the text must be split into smaller units on the basis of linguistic units such as words, numbers, punctuation marks, and alphanumeric strings. This procedure is known as tokenization. It truly belongs to the pre-processing stage, because without separating a given text into individual units, no analysis or generation task is possible.
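As a minimal sketch using Python's standard `re` module (the token pattern here is an illustrative assumption, not a standard), word-level tokenization can look like this:

```python
import re

def tokenize(text):
    """Split text into word, number, and punctuation tokens."""
    # \w+ matches runs of letters/digits/underscores;
    # [^\w\s] matches a single punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Tokenize this: words, numbers (42), punctuation!"))
```

Real projects typically rely on library tokenizers (for example NLTK's or spaCy's), which handle many more edge cases than this pattern.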
Removing unnecessary punctuation and tags – A simple but much-needed pre-processing technique; the title makes clear what this step does. Libraries are available for almost every programming language that can perform this task in a few lines of code, and it is an essential step for converting the given text into a cleaner form.
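A minimal sketch of this cleanup using only the Python standard library (the tag-matching regex is a simplification; an HTML parser is more robust for real markup):

```python
import re
import string

def clean_text(text):
    """Strip HTML-style tags, then remove punctuation characters."""
    text = re.sub(r"<[^>]+>", " ", text)                         # drop tags like <p>, <br>
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())                                # collapse extra whitespace

print(clean_text("<p>Hello, world!</p> It's <b>great</b>..."))
```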
Removing stop words – Stop words are words considered useless for the text-analysis procedure: words that have little or no significance for the analysis of a whole sentence. Removing them is applied in several settings, which are –
- Supervised Machine Learning – Removing stop words from feature space.
- Clustering – Removing stop words prior to generating clusters.
- Information Retrieval – Preventing stop words from being indexed.
- Text Summarization – Excluding stop words from contributing to summarization scores & removing stop words when computing ROUGE scores.
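A minimal sketch of stop-word filtering (the tiny stop-word set here is an illustrative assumption; real systems use larger lists such as those shipped with NLTK or spaCy):

```python
# A tiny illustrative stop-word set, not a standard list.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "to", "and"}

def remove_stop_words(tokens):
    """Keep only tokens that are not stop words (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["The", "cat", "is", "in", "the", "hat"]))
```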
Stemming – Usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. The most common algorithm for stemming English, and one that has repeatedly been shown to be empirically very effective, is Porter’s algorithm, on which the popular Porter stemmer is built. Other stemmers, such as the Lovins stemmer and the Paice stemmer, can be used for the same purpose. Many libraries in various languages also provide direct support for this process.
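To illustrate the "chop off the ends" idea, here is a deliberately crude suffix stripper (this is a toy, not the Porter algorithm; in practice one would use a library implementation such as NLTK's `PorterStemmer`):

```python
def crude_stem(word):
    """Chop a common suffix off a word (longest suffix first) - a toy stemmer."""
    for suffix in ("ational", "ization", "edly", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ("running", "jumped", "caresses", "cats"):
    print(w, "->", crude_stem(w))
```

Note how `running` becomes `runn`, not `run`: a crude heuristic achieves its goal only "most of the time", which is exactly why the rule sets in real stemmers are more elaborate.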
Lemmatization – Usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. A lemmatizer is a Natural Language Processing tool that performs full morphological analysis to accurately identify the lemma of each word. Doing full morphological analysis produces at most very modest benefits for retrieval; it is hard to say more, because neither form of normalization tends to improve English information-retrieval performance in aggregate, at least not by much. While it helps a lot for some queries, it hurts performance equally for others. Stemming increases recall while harming precision. Many libraries have pre-built functions to perform lemmatization with ease.
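The vocabulary-lookup idea can be sketched with a toy lemma dictionary (purely illustrative; a real lemmatizer such as NLTK's `WordNetLemmatizer` or spaCy's uses a full vocabulary and morphological analysis instead of a hand-written lookup):

```python
# Toy lemma dictionary - an assumption for illustration only.
LEMMAS = {"ran": "run", "better": "good", "mice": "mouse", "was": "be"}

def lemmatize(word):
    """Return the dictionary form (lemma) if known, else the word itself."""
    return LEMMAS.get(word.lower(), word.lower())

print([lemmatize(w) for w in ["Mice", "ran", "better"]])
```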
Essential Techniques for Text Classification
Word Embeddings – Word embeddings are used to capture the context of a word in a text. With their help, syntactic and semantic similarity and the relations between words can be established. So it is clear why word embeddings are used; but what are they exactly? They are vectors that represent a particular word. The mechanism of word embeddings is based on the idea of generating distributed representations: as stated above, a relationship is established by introducing a dependency of one word on other words. Word2vec, a neural-network approach used to develop such embeddings, is gaining much popularity nowadays. It is based on the Skip-Gram and Continuous Bag of Words models, which are discussed below. Before going into such details, one important point to mention is that two types of word-embedding approaches are mainly used, though other approaches also exist.
These approaches are –
- Frequency based embedding – Count Vector, TF-IDF vectorization
- Prediction based embedding – Skip Gram model, Continuous Bag of words
Count Vector – This approach works in two phases: first it learns a vocabulary from all the provided text, and in the second phase it represents each document by counting the number of times each word appears. One prerequisite of the count-vector procedure is that stop words should be removed before applying it.
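The two phases can be sketched in plain Python on a toy corpus (illustrative only; scikit-learn's `CountVectorizer` does this in practice):

```python
from collections import Counter

docs = ["the cat sat", "the cat ran fast"]  # toy corpus

# Phase 1: learn the vocabulary from all provided text.
vocab = sorted({word for doc in docs for word in doc.split()})

# Phase 2: represent each document by its word counts over the vocabulary.
def count_vector(doc):
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

for doc in docs:
    print(doc, "->", count_vector(doc))
```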
TF-IDF vectorization – TF-IDF is used to re-weight the count features into floating-point values (which gives an edge to terms that are rarer yet more interesting) so that these values can be used by a classifier. With its help, the importance of a word is computed by considering its occurrence not only in the document but in the entire corpus. TF-IDF means term frequency (TF) times inverse document frequency (IDF), which can be represented mathematically using the following formulas –
- TF = (Number of times term t appears in a document) / (Number of terms in the document).
- IDF = Log(N/n), where N is the total number of documents and n is the number of documents a term t has appeared in.
- TF-IDF(t, document) = TF(t, document) * IDF(t)
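The formulas above can be sketched directly in Python (a minimal illustration on a toy corpus; production code would use a library such as scikit-learn's `TfidfVectorizer`):

```python
import math

docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "dog", "sat"]]

def tf(term, doc):
    # (times term appears in document) / (number of terms in document)
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log(N / n), n = number of documents containing the term
    n = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("cat", docs[0], docs))   # rare term: positive weight
print(tf_idf("the", docs[0], docs))   # appears in every document: weight 0
```

Note how "the", which occurs in every document, gets IDF = log(3/3) = 0 and therefore zero weight, while the rarer "cat" keeps a positive weight.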
Continuous Bag of Words (CBOW) – In this model, the network learns to predict a word from its context. The context may be a single word or multiple words for a given target word. It is a neural-network approach consisting of three layers (input, hidden, and output) and two sets of weights: between the input and hidden layers, and between the hidden and output layers. In simple terms, the context words are fed to the input layer as one-hot vectors, the hidden layer produces scores for the output layer, the scores are converted to probabilities using a softmax function, and cross-entropy is used to compute the loss.
Skip-gram model – Here the model is trained so that it can generate the surrounding words of a target word. The skip-gram model reverses the use of target and context words: it takes a word and predicts the context words from it. Otherwise the functionality of the skip-gram model is the same as that of the CBOW model; it is just the reverse of CBOW.
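The CBOW forward pass described above can be sketched in NumPy (toy dimensions and random weights, purely for illustration; a real implementation such as gensim's `Word2Vec` also trains these weights):

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 6, 4                      # toy vocabulary size and hidden size
W_in = rng.normal(size=(V, H))   # input->hidden weights (the word vectors)
W_out = rng.normal(size=(H, V))  # hidden->output weights

def cbow_forward(context_ids):
    """One CBOW forward pass: average context vectors, score, softmax."""
    h = W_in[context_ids].mean(axis=0)    # hidden layer
    scores = h @ W_out                    # output-layer scores
    exp = np.exp(scores - scores.max())   # numerically stable softmax
    return exp / exp.sum()                # probability of each vocabulary word

probs = cbow_forward([1, 3])              # predict a target from 2 context words
loss = -np.log(probs[2])                  # cross-entropy if the target word is id 2
print(probs.round(3), loss)
```

The skip-gram model would reverse this: feed one word in and score each context position over the vocabulary.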
GloVe – GloVe is an unsupervised learning algorithm for obtaining vector representations of words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word-vector space. So what is the difference between the word2vec and GloVe models? The first and major difference is that word2vec is a predictive model while GloVe is a count-based model. GloVe therefore has the properties of the skip-gram model on word-analogy tasks, combined with the advantages of matrix-factorization methods that can exploit global statistical information.
Machine Learning for Text Analysis
Now we come to the modeling part, which can be called the core of any procedure, whether it is sentiment analysis, text analytics, spam detection, or simple text classification. To build better software you have to choose a better algorithm, and to build excellent software your selection of algorithm should be excellent. So much depends on the choice of algorithm and text-analytics technique.
The choice of algorithm depends on several things, most directly on the use case. In a generalized way, though, some parameters should be kept in mind while selecting an algorithm for a specific task. Of course, accuracy is one of the prime parameters, but in this age of enormous data it is not the only one: space complexity and time complexity should also be considered among the prime parameters. So there are three parameters on the basis of which the selection of algorithms, and the what, why, and where of these algorithms, is discussed. These three parameters are –
- Accuracy
- Space (in terms of memory)
- Time (time taken to make a model workable)
Multinomial Naive Bayes – As the name suggests, this algorithm is based on the well-known Naive Bayes theorem of probability. It belongs to the Naive Bayes family, which also includes Gaussian Naive Bayes and Bernoulli Naive Bayes as its siblings. The major difference among these algorithms: Gaussian is used for data where continuous values are associated with the features, Bernoulli where the associated values are boolean in nature, and Multinomial where feature vectors represent the frequencies with which events have been generated by a multinomial distribution.
This is the event model typically used for document classification, so this Naive Bayes variant is well suited to text analytics. In terms of accuracy, Multinomial Naive Bayes can lag a bit (though not by much) compared with SVM, regression, and other machine-learning techniques, but accuracy also depends on how well the data is pre-processed and on proper feature engineering. Due to its simple structure, it can beat other machine-learning algorithms in speed irrespective of the size of the data, which means it provides moderate accuracy at good speed on large as well as small datasets.
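A minimal sketch of Multinomial Naive Bayes for text classification (assuming scikit-learn is available; the labelled corpus is a toy illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy labelled corpus - illustrative only.
texts = ["win cash prize now", "cheap pills offer",
         "meeting at noon", "lunch with team"]
labels = ["spam", "spam", "ham", "ham"]

vec = CountVectorizer()
X = vec.fit_transform(texts)          # bag-of-words count features
clf = MultinomialNB().fit(X, labels)  # fit the multinomial event model

print(clf.predict(vec.transform(["cash offer now"])))   # -> ['spam']
```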
Linear Support Vector Machine – The beauty of the support vector machine is that it can handle continuous as well as categorical data: for continuous data the regression version of the support vector machine can be used, and for categorical data the support vector machine as a classifier. Since this document is about text analytics, only categorical data is considered. SVM guarantees optimality: due to the nature of convex optimization, the solution is guaranteed to be a global rather than a local minimum, which is itself a great advantage, and optimization is also easy in the case of SVM because there are few parameters and hyperparameters.
In terms of accuracy, it can perform well, but it demands noiseless data, or data with very little noise. One catch with SVM is that it does not pair well with word2vec embeddings during feature extraction, so it is better to use a bag-of-words approach when implementing SVM to gain good accuracy. Compared with Naive Bayes, the structure of SVM is complex; as a result it takes more time to train, and the model also takes more space to store.
In one line: SVM is a good choice when better accuracy is needed on less data, in exchange for proper feature engineering and pre-processing. With a large amount of data it will again lose accuracy.
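A minimal sketch of a linear SVM on bag-of-words features, as suggested above (assuming scikit-learn is available; the corpus is a toy illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Toy labelled corpus - illustrative only.
texts = ["great product loved it", "awful waste of money",
         "really loved the quality", "terrible awful experience"]
labels = ["pos", "neg", "pos", "neg"]

vec = CountVectorizer()
X = vec.fit_transform(texts)      # bag-of-words features for the SVM
clf = LinearSVC().fit(X, labels)

print(clf.predict(vec.transform(["loved the product"])))   # -> ['pos']
```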
Logistic Regression – Yes, with some tricks logistic regression can also be used for text data, which is generally categorical in nature. In industry it is common to rely on logistic regression because its predictive power is considered greater than that of purely classification-based solutions. But it has its cons too: it does not perform well with a large number of categorical features/variables and loses accuracy, so in general it cannot handle data where the feature space is too large. To cope with such conditions, feature selection, and likewise feature reduction, becomes necessary in the case of logistic regression.
Comparing on the three parameters (accuracy, space complexity, and time complexity): it gives fine accuracy on data with a small feature space or with good feature-engineering techniques, but in both cases extra time is required, so its mechanism is not fast. It also takes space and cannot handle a large amount of data very well.
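A minimal sketch of logistic regression on TF-IDF features (assuming scikit-learn is available; the corpus is a toy illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy labelled corpus - illustrative only.
texts = ["refund my order now", "broken item refund please",
         "thanks for the fast delivery", "great service thanks"]
labels = ["complaint", "complaint", "praise", "praise"]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)                 # weighted features keep dimensionality manageable
clf = LogisticRegression().fit(X, labels)

print(clf.predict(vec.transform(["please refund this order"])))   # -> ['complaint']
```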
Deep Learning for Text Analytics
So far the concepts of pre-processing techniques, feature-engineering techniques, and machine-learning models for text analytics have been covered. But when developing models at the enterprise level, these techniques only work some of the time; usually something more is needed to scale things up and to handle enormous amounts of data. That something is covered by deep-learning techniques. A large number of techniques are available for the text-analytics branch of data science, and which one to use also depends on the use case. Below is an overview of some techniques, again written around the what, why, and where, considering the three prime parameters: accuracy, time complexity, and space complexity.
fastText – Let’s start with the simple one. The implementation of fastText is based on Bag of Tricks for Efficient Text Classification. As a brief introduction: it uses word embeddings as a pre-processing step; after embedding each word in the sentence, the word representations are averaged into a text representation, which is in turn fed to a linear classifier. It uses a softmax function to compute the probability distribution over the predefined classes, and cross-entropy to compute the loss. A bag-of-words representation does not consider word order, so n-gram features are used to capture partial information about the local word order. When the number of classes is large, computing the linear classifier is computationally expensive, so hierarchical softmax is used to speed up training.
That describes briefly what fastText is. Now let’s understand where it can be used: mainly for two purposes, word representations and text classification/analytics. But when an approach such as word vectors (word2vec) already exists, why should fastText be used? The difference is this: word2vec treats every single word as the smallest unit whose vector representation is to be found, whereas fastText assumes a word to be formed of character n-grams; for example, sunny is composed of [sun, sunn, sunny], [sunny, unny, nny], etc., where n can range from 1 to the length of the word. This new representation of a word gives fastText the following benefits over word2vec or GloVe.
Another advantage of fastText is that it is also available as a Python library, which makes it simple to use: just install it and it is ready with a list of pre-defined functions. To create word embeddings, the Skip-gram and CBOW approaches can be used, and the fastText library contains pre-defined functions for both.
Now it is time to understand the why of fastText. First, it is very fast: with a limited amount of data it can beat several very popular models such as TextCNN. Second, sentence-vector representations can be computed easily with it. So fastText works better on small datasets than gensim-style approaches such as word2vec, but problems arise with large datasets. In terms of accuracy, it does well on small datasets but lags on large ones. In terms of time complexity it can beat any model on a good day. In terms of space complexity it takes little memory, but again struggles on large datasets.
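The average-embeddings-plus-linear-classifier idea described above can be sketched in NumPy (a toy forward pass with assumed dimensions and random weights, not the actual fastText library, which also trains these weights and adds n-gram features):

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = {"good": 0, "bad": 1, "movie": 2, "plot": 3}
E = rng.normal(size=(len(vocab), 5))      # word-embedding table (toy)
W = rng.normal(size=(5, 2))               # linear classifier over 2 classes

def fasttext_forward(tokens):
    """Average the word embeddings, apply a linear layer, then softmax."""
    text_vec = E[[vocab[t] for t in tokens]].mean(axis=0)
    scores = text_vec @ W
    exp = np.exp(scores - scores.max())   # numerically stable softmax
    return exp / exp.sum()

probs = fasttext_forward(["good", "movie"])
print(probs, probs.sum())
```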
TextCNN – CNNs have already proved their worth and capabilities in the field of computer vision, but they can also be used for text-analytics tasks, especially text classification. The implementation of TextCNN is based on Convolutional Neural Networks for Sentence Classification. The layer structure of TextCNN is described below –
- Embedding layer
- Convolutional layer
- Max-pooling layer
- Fully connected layer
- Softmax layer
Sentence lengths differ from one sentence to another, so padding is used to reach a fixed length n. For each token in the sentence, a word embedding gives a fixed-dimension vector of size d. The input is therefore a 2-dimensional matrix (n, d), similar to an image for a CNN.
The first step is the convolutional operation on the input: an element-wise multiplication between a filter and part of the input. Using k filters, each a 2-dimensional matrix of size (f, d), the output is k lists, each of length n-f+1, where each element is a scalar. Notice that the second dimension of a filter is always the dimension of the word embedding. Filters of different sizes are used to extract rich features from the text input, something similar to n-gram features.
The second step is max pooling over the output of the convolutional operation: from k lists we get k scalars.
The third step is to concatenate the scalars to form the final feature vector. It is a fixed-size vector, independent of the sizes of the filters used.
The final step is to use a linear layer to project these features onto the pre-defined labels.
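The convolution and max-pooling steps above can be sketched in NumPy (a toy illustration with assumed dimensions and random weights, not a trainable implementation):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 7, 5                     # padded sentence length, embedding dimension
X = rng.normal(size=(n, d))     # embedded (and padded) sentence: an (n, d) matrix

def conv_maxpool(X, filters):
    """Slide each (f, d) filter over the sentence, then max-pool each output list."""
    pooled = []
    for F in filters:
        f = F.shape[0]
        # Element-wise multiply and sum gives n - f + 1 scalars per filter.
        outs = [np.sum(F * X[i:i + f]) for i in range(n - f + 1)]
        pooled.append(max(outs))            # max pooling -> one scalar per filter
    return np.array(pooled)                 # concatenated fixed-size feature vector

filters = [rng.normal(size=(f, d)) for f in (2, 3, 4)]  # n-gram-like filter sizes
features = conv_maxpool(X, filters)
print(features.shape)                       # one scalar per filter: (3,)
```

A linear layer followed by softmax would then project this fixed-size vector onto the labels, exactly as in the final step above.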
That covers what TextCNN is. Now let’s look at where TextCNN can be used. TextCNN is better for classification-related tasks because it has a hierarchical structure, and it can outperform other techniques (in fact, RNNs too) in the task of sentence matching. However, in tasks such as sequence ordering, part-of-speech tagging, and sequence modelling, it lags behind RNNs.
In the case of a CNN, the optimization of two key parameters plays an important role: hidden size and batch size. It has also been observed that performance remains smooth under fluctuations in the learning rate. It is a well-known fact that a CNN takes time to train and requires a marginal amount of space to store the model; however, the size of the dataset does not affect the accuracy of a CNN much, though it can increase the training time.
BERT – Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT’s architecture is based on a bidirectional transformer encoder. The inclusion of transformers increases training efficiency and the performance in capturing long-distance dependencies compared with a basic recurrent neural network. As for the architecture of BERT, it is considered a big model: the large variant has a hidden size of 1024, 24 transformer blocks, and nearly 340M parameters. That is why a pre-trained model is used, which is then trained further to the requirements of the task.
During the pre-training process, a Masked Language Model (MLM) objective masks a percentage of input tokens in order to train a deep bidirectional representation; WordPiece embeddings are used for the input. BERT also has a Next Sentence Prediction capability, thanks to which it can be used for tasks like question answering and inference, where there is a need to understand the relationship between sentences.
To use a BERT model, a pre-trained model is downloaded and then fine-tuned with use-case-specific data. So far this section has covered the what, why, and how of BERT. As for the where: it can be used wherever there is a need for multi-label classification or online prediction, and of course, as stated above, it performs perfectly well in tasks that need sentence-relationship prediction.
TextRNN – When an RNN is considered for any task, the vanilla version is usually neglected and the LSTM variant is automatically chosen. The reason for this selection is simple: LSTM provides memory as an advantage.
The structure of Text RNN is as follows –
Embedding —-> bidirectional LSTM —-> concat output —-> average —-> softmax
As for the embedding, word embeddings are considered a good option to use with TextRNN. That is a brief introduction to what TextRNN is; now let us discuss where it should be used.
It can be used for various text-analytics tasks: not only those requiring text classification, but also those involving text generation, such as chatbots that must generate text to answer questions, or tasks related to prescriptive analytics. Text generation is generally implemented in two ways: word-by-word generation and character-by-character generation. Compared with a character-level model, a word-level model shows lower computational cost and higher accuracy, because it is difficult for a character-level model to capture long-term dependencies and it requires a much larger hidden layer to do so. The reason, and the "why", is stated below.
The reason is simple: the model has the capability to memorize, and if a model can memorize data, it can generate new data using its memory.
RCNN – So far CNN and RNN models have been illustrated separately. Now let’s discuss a combination of these models for the text-analysis process. With advances in deep-learning technology and computer hardware, it is now possible to combine different techniques to gain the advantages of both. The recurrent convolutional neural network (RCNN) is one of these combinations.
The structure of RCNN is as follows –
Recurrent structure —-> max pooling —-> fully connected layer + softmax
These models learn the representation of each word in a sentence or paragraph from the left-side context to the right-side context.
Its advantage is that the network can compose the semantic representations of a text finely. However, CNN can outperform it in some cases, because the max-pooling layer of a CNN is considered more discriminative in capturing contextual information; on the other hand, RCNN takes much less time to train than CNN and RNN.
That is why the use case for RCNN is when less training time is a necessity; then RCNN can be a good option.
A Holistic Strategy
Text analytics is the method of transforming unstructured text data into significant data for analysis, to estimate customer views, product evaluations, and feedback. To adopt this approach, we recommend taking the following steps –