Sentiment analysis with various models of supervised and unsupervised learning
Sentiment analysis is perhaps one of the most popular applications of natural language processing and text analytics with a vast number of websites, books and tutorials on this subject. Typically sentiment analysis seems to work best on subjective text, where people express opinions, feelings, and their mood. From a real-world industry standpoint, sentiment analysis is widely used to analyze corporate surveys, feedback surveys, social media data, and reviews for movies, places, commodities, and many more. The idea is to analyze and understand the reactions of people toward a specific entity and take insightful actions based on their sentiment.
Basic terminologies
•A text corpus consists of multiple text documents, and each document can be anything from a single sentence to a complete document with multiple paragraphs. Textual data, in spite of being highly unstructured, can be classified into two major types of documents. Factual documents typically depict some form of statements or facts with no specific feelings or emotion attached to them; these are also known as objective documents. Subjective documents, on the other hand, have text that expresses feelings, moods, emotions, and opinions.
•Sentiment analysis is also popularly known as opinion analysis or opinion mining. The key idea is to use techniques from text analytics, NLP, Machine Learning, and linguistics to extract important information or data points from unstructured text. This in turn can help us derive qualitative outputs like the overall sentiment being on a positive, neutral, or negative scale and quantitative outputs like the sentiment polarity, subjectivity, and objectivity proportions.
•Sentiment polarity is typically a numeric score that’s assigned to both the positive and negative aspects of a text document based on subjective parameters like specific words and phrases expressing feelings and emotion. Neutral sentiment typically has 0 polarity since it does not express any specific sentiment, positive sentiment will have polarity > 0, and negative sentiment will have polarity < 0. Of course, you can always change these thresholds based on the type of text you are dealing with; there are no hard constraints on this.
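To make the polarity thresholds described above concrete, here is a minimal sketch that maps a numeric polarity score to a label. The function name label_from_polarity and the neutral_band parameter are our own illustrative choices, not part of any library.

```python
def label_from_polarity(score, neutral_band=0.0):
    """Map a numeric polarity score to a coarse sentiment label.

    neutral_band is a hypothetical tolerance: scores whose absolute value
    is at most neutral_band are treated as neutral. With the default of
    0.0, only a score of exactly 0 counts as neutral.
    """
    if score > neutral_band:
        return "positive"
    if score < -neutral_band:
        return "negative"
    return "neutral"


for s in (2.5, 0.0, -1.0):
    print(s, label_from_polarity(s))

# Widening the band treats weak scores as neutral as well.
print(0.05, label_from_polarity(0.05, neutral_band=0.1))
```

Changing neutral_band is one simple way to adjust the thresholds for a particular kind of text.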
In this article, we will focus on analysing IMDb movie review data and try to predict whether a review is positive or negative. Familiarity with some machine learning concepts will help in understanding the code and algorithms used. We will use the popular scikit-learn machine learning framework.
Supervised learning approach:
Data-set preparation:
We will use the dataset from here: http://ai.stanford.edu/~amaas/data/sentiment/
After downloading the dataset, unnecessary files and folders were removed so that the folder structure looks as follows:

Load data into program:
We will load and peek into the train and test data to understand the nature of the data. In this case, both train and test data are in a similar format.
import numpy as np
from sklearn.datasets import load_files
reviews_train = load_files("aclImdb/train/")
text_train, y_train = reviews_train.data, reviews_train.target
print("Number of documents in train data: {}".format(len(text_train)))
print("Samples per class (train): {}".format(np.bincount(y_train)))
reviews_test = load_files("aclImdb/test/")
text_test, y_test = reviews_test.data, reviews_test.target
print("Number of documents in test data: {}".format(len(text_test)))
print("Samples per class (test): {}".format(np.bincount(y_test)))
scikit-learn provides load_files to read this kind of text data. After loading the data, we printed the number of documents (train/test) and the samples per class (pos/neg), which are as follows:
Number of documents in train data: 25000
Samples per class (train): [12500 12500]
Number of documents in test data: 25000
Samples per class (test): [12500 12500]
We can see a total of 25000 samples in each of the training and test sets, with 12500 per class (pos and neg).
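For readers curious about what this kind of loader does, the following is a rough standard-library sketch of reading one sub-folder per class. The function name load_labeled_folder is hypothetical, and this is not scikit-learn's implementation: as far as we know, load_files returns raw bytes by default and shuffles the documents, whereas this sketch decodes files as UTF-8 and keeps them in sorted order.

```python
import os


def load_labeled_folder(container_path):
    """Read every file under container_path/<class_name>/ into memory.

    Returns (texts, targets, target_names). Class folders are sorted
    alphabetically, so with 'neg' and 'pos' folders, neg -> 0 and pos -> 1.
    """
    target_names = sorted(
        name for name in os.listdir(container_path)
        if os.path.isdir(os.path.join(container_path, name))
    )
    texts, targets = [], []
    for label, name in enumerate(target_names):
        folder = os.path.join(container_path, name)
        for filename in sorted(os.listdir(folder)):
            path = os.path.join(folder, filename)
            if not os.path.isfile(path):
                continue
            with open(path, encoding="utf8") as handle:
                texts.append(handle.read())
            targets.append(label)
    return texts, targets, target_names


# Only runs if the dataset folder from this article is present locally.
if os.path.isdir("aclImdb/train"):
    texts, targets, names = load_labeled_folder("aclImdb/train")
    print(len(texts), names)
```

The alphabetical ordering of folders is why index 0 in the per-class counts corresponds to neg and index 1 to pos.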
Setting up dependencies:
We will be using several Python libraries and frameworks specific to text analytics, NLP, and Machine Learning. Before starting the internship project, make sure you have pandas, numpy, scipy, and scikit-learn installed. The NLP libraries used include spacy, nltk, and gensim. Check that your installed nltk version is at least 3.2.4; otherwise, the ToktokTokenizer class may not be present.
For nltk, run the following from a Python or ipython shell after installing nltk using either pip or conda:
import nltk
nltk.download('all', halt_on_error=False)
For spacy, run the following in a Unix shell or Windows command prompt to install the library (use pip install spacy if you don’t want to use conda) and to get the English model dependency. On spaCy 3 and later the en shortcut is no longer available, so download en_core_web_sm instead.
$ conda config --add channels conda-forge
$ conda install spacy
$ python -m spacy download en
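The nltk version requirement above can be checked programmatically. This is a small sketch using only the standard library (Python 3.8 or later for importlib.metadata); the helper names version_tuple and check_min_version are our own.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Turn '3.2.4' into (3, 2, 4), stopping at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def check_min_version(package, minimum):
    """Return True or False, or None when the package is not installed."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
    return version_tuple(installed) >= version_tuple(minimum)


print("nltk >= 3.2.4:", check_min_version("nltk", "3.2.4"))
```

A result of None means nltk is not installed in the current environment at all.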
The text_normalizer module (imported as tn) provides normalize_corpus(...), which takes a document corpus as input and returns the same corpus with cleaned and normalized text documents.
import text_normalizer as tn
norm_train_reviews = tn.normalize_corpus(text_train)
norm_test_reviews = tn.normalize_corpus(text_test)
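The text_normalizer module itself is not shown in this article, so the following is only a guess at the kind of cleaning such a function might perform, written with the standard library. The helper names mirror strip_html_tags and remove_accented_chars, which the article calls on tn later, but the bodies here are our own simplified assumptions and normalize_corpus_sketch is not the real normalize_corpus.

```python
import html
import re
import unicodedata


def strip_html_tags(text):
    """Replace anything that looks like an HTML tag (e.g. <br />) with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def remove_accented_chars(text):
    """Decompose accented characters and keep only their ASCII base letters."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def normalize_document(text):
    text = html.unescape(strip_html_tags(text))
    text = remove_accented_chars(text).lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # drop punctuation and symbols
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace


def normalize_corpus_sketch(corpus):
    """Apply the same cleaning steps to every document in the corpus."""
    return [normalize_document(doc) for doc in corpus]


print(normalize_corpus_sketch(["A <b>caf\u00e9</b> scene.<br />Loved it!!"]))
```

A fuller normalizer would typically also expand contractions, lemmatize, and remove stopwords; those steps are left out here to keep the sketch short.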
Representing text data as Bag of Words:
We want to count the word occurrences as a Bag of Words, which involves the steps shown in the diagram below:

To represent the input dataset as a Bag of Words, we will use CountVectorizer and TfidfVectorizer. Both are transformers that convert the input documents into a sparse matrix of features: we fit each one on the normalized training reviews and then call its transform method on the normalized test reviews.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# Bag of Words counts over unigrams and bigrams
vect = CountVectorizer(binary=False, min_df=0.0, max_df=1.0, ngram_range=(1, 2))
# fit the vocabulary on the same normalized text that gets transformed
X_train1 = vect.fit_transform(norm_train_reviews)
X_test1 = vect.transform(norm_test_reviews)
# TF-IDF weights over the same n-gram range, with sublinear term frequency
tv = TfidfVectorizer(use_idf=True, min_df=0.0, max_df=1.0, ngram_range=(1, 2), sublinear_tf=True)
X_train2 = tv.fit_transform(norm_train_reviews)
X_test2 = tv.transform(norm_test_reviews)
Each column in the resultant matrix is considered a feature. The output from the above code snippet is as follows:
for CountVectorizer
Vocabulary size: 1513832
X_train:
<25000x1513832 sparse matrix of type '<class 'numpy.float64'>'
with 3833599 stored elements in Compressed Sparse Row format>
for TfidfVectorizer
X_test:
<25000x1513832 sparse matrix of type '<class 'numpy.float64'>'
with 3321545 stored elements in Compressed Sparse Row format>
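To show what these two representations contain, here is a small pure-Python sketch of unigram and bigram counting followed by TF-IDF weighting. It is not scikit-learn's code: tokenization is a plain whitespace split, and the weighting assumes sublinear term frequency (1 + ln tf), a smoothed idf of ln((1 + n) / (1 + df)) + 1, and L2 normalization, which is our understanding of what the TfidfVectorizer settings above amount to. All function names are our own.

```python
import math
from collections import Counter


def ngrams(tokens, ngram_range=(1, 2)):
    """Yield space-joined n-grams for every n in the inclusive range."""
    low, high = ngram_range
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def bag_of_words(docs, ngram_range=(1, 2)):
    """Return (sorted vocabulary, one Counter of n-gram counts per document)."""
    counts = [Counter(ngrams(doc.split(), ngram_range)) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    return vocabulary, counts


def tfidf(counts):
    """Weight counts with sublinear tf and smoothed idf, then L2-normalize."""
    n_docs = len(counts)
    # iterating a Counter yields each term once, so this is document frequency
    doc_freq = Counter(term for doc in counts for term in doc)
    weighted = []
    for doc in counts:
        row = {}
        for term, tf in doc.items():
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
            row[term] = (1 + math.log(tf)) * idf
        norm = math.sqrt(sum(w * w for w in row.values())) or 1.0
        weighted.append({term: w / norm for term, w in row.items()})
    return weighted


docs = ["good movie", "bad movie", "good good plot"]
vocab, counts = bag_of_words(docs)
print(vocab)
print(counts[2])
print(tfidf(counts)[0])
```

In the first toy document, the bigram "good movie" ends up with a higher weight than "movie" because it appears in fewer documents. Including bigrams is also why the vocabulary in the real run grows past a million terms.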
Model development:
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
# Logistic Regression model on BOW features
model1 = LogisticRegression(class_weight="balanced", penalty='l2', max_iter=100, C=1)
model1.fit(X_train1, y_train)
y_pred1 = model1.predict(X_test1)
fl_score1 = f1_score(y_test, y_pred1)
print(confusion_matrix(y_test, y_pred1))
print(classification_report(y_test, y_pred1))
output metrics:
# Logistic Regression model on TF-IDF features
model2 = LogisticRegression(class_weight="balanced", penalty='l2', max_iter=100, C=1)
model2.fit(X_train2, y_train)
y_pred2 = model2.predict(X_test2)
fl_score2 = f1_score(y_test, y_pred2)
print(confusion_matrix(y_test, y_pred2))
print(classification_report(y_test, y_pred2))
output metrics:
# SVM model on BOW features (linear SVM trained with SGD via hinge loss)
model3 = SGDClassifier(loss='hinge', max_iter=100)
model3.fit(X_train1, y_train)
y_pred3 = model3.predict(X_test1)
fl_score3 = f1_score(y_test, y_pred3)
results = confusion_matrix(y_test, y_pred3)
print(classification_report(y_test, y_pred3))
output metrics:
# SVM model on TF-IDF features
model4 = SGDClassifier(loss='hinge', max_iter=100)
# train on the TF-IDF matrix so it matches the features used at prediction time
model4.fit(X_train2, y_train)
y_pred4 = model4.predict(X_test2)
fl_score4 = f1_score(y_test, y_pred4)
results = confusion_matrix(y_test, y_pred4)
print(classification_report(y_test, y_pred4))
output metrics:
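The reports produced by the code above are built from a handful of counts. As a minimal sketch, assuming binary labels where 1 is the positive class, the following shows how a confusion matrix, precision, recall, and F1 relate to each other. The function names are our own, and the toy labels are made up for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the positive class."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy labels, not taken from the IMDb experiments.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print([[tn, fp], [fn, tp]])  # same row/column layout scikit-learn uses
print(precision_recall_f1(y_true, y_pred))
```

Because the IMDb classes are perfectly balanced, accuracy and F1 tend to tell a similar story here; on skewed data they can diverge sharply.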
Unsupervised Learning approach:
Load data into program:
For the unsupervised approach, we will load the unlabeled reviews from the train/unsup folder into a DataFrame.
import glob
import pandas as pd

folder_name = 'unsup'
test_set_unsup = []
for review in glob.glob("C:/Users/HP/Downloads/aclImdb_v1/aclImdb/train/unsup/*.txt"):
    with open(review, mode='r', encoding="utf8") as file:
        movie = {}
        # the first line of the file is stored under 'title'; the rest is read and discarded
        movie['title'] = file.readline()
        file.read()
        test_set_unsup.append(movie)
df_test_unsup = pd.DataFrame(test_set_unsup)
As before, normalize_corpus(...) takes a document corpus as input and returns the same corpus with cleaned and normalized text documents.
norm_test_reviews = tn.normalize_corpus(test_reviews["title"])
Lexicon models used for sentiment analysis:
• AFINN Lexicon
from afinn import Afinn
afn = Afinn(emoticons=True)
actual_sentiment_polarity = [afn.score(review) for review in test_reviews["title"]]
actual_sentiments = ['positive' if score >= 1.0 else "negative" for score in actual_sentiment_polarity]
predicted_sentiment_polarity = [afn.score(review) for review in df["title"]]
predicted_sentiments = ['positive' if score >= 1.0 else "negative" for score in predicted_sentiment_polarity]
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
results = confusion_matrix(y_train4, predicted_sentiments)
print(classification_report(y_train4, predicted_sentiments, labels=['positive', 'negative']))
output metrics:
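The idea behind this kind of lexicon scoring is simple enough to sketch without the afinn package: sum a valence for every known word and compare the total to a threshold. The tiny lexicon below is a made-up stand-in with illustrative integer scores, not the real AFINN word list, and the function names are our own.

```python
import re

# Hypothetical toy lexicon; the real AFINN list is far larger.
TOY_LEXICON = {
    "good": 3, "great": 3, "love": 3,
    "bad": -3, "boring": -2, "awful": -3,
}


def lexicon_score(text, lexicon=TOY_LEXICON):
    """Sum the valence of every known word; unknown words contribute 0."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return float(sum(lexicon.get(token, 0) for token in tokens))


def lexicon_label(text, threshold=1.0):
    """Use the same rule as the article: score >= threshold is positive."""
    return "positive" if lexicon_score(text) >= threshold else "negative"


for review in (
    "I love this good movie",
    "A great cast but a boring, awful plot",
    "It was a movie",
):
    print(lexicon_score(review), lexicon_label(review))
```

Note that with a two-way rule like this, a review containing no sentiment words scores 0 and is labeled negative, which is one reason the threshold deserves tuning.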
• SentiWordNet Lexicon
import nltk
from nltk.corpus import sentiwordnet as swn
nltk.download('sentiwordnet')
awesome = list(swn.senti_synsets('awesome', 'a'))[0]
print('Positive Polarity Score:', awesome.pos_score())
print('Negative Polarity Score:', awesome.neg_score())
print('Objective Score:', awesome.obj_score())
import pandas as pd

def analyze_sentiment_sentiwordnet_lexicon(review, verbose=False):
    # tokenize and POS tag text tokens
    tagged_text = [(token.text, token.tag_) for token in tn.nlp(review)]
    pos_score = neg_score = token_count = obj_score = 0
    # get wordnet synsets based on POS tags
    # get sentiment scores if synsets are found
    for word, tag in tagged_text:
        ss_set = None
        if 'NN' in tag and list(swn.senti_synsets(word, 'n')):
            ss_set = list(swn.senti_synsets(word, 'n'))[0]
        elif 'VB' in tag and list(swn.senti_synsets(word, 'v')):
            ss_set = list(swn.senti_synsets(word, 'v'))[0]
        elif 'JJ' in tag and list(swn.senti_synsets(word, 'a')):
            ss_set = list(swn.senti_synsets(word, 'a'))[0]
        elif 'RB' in tag and list(swn.senti_synsets(word, 'r')):
            ss_set = list(swn.senti_synsets(word, 'r'))[0]
        # if senti-synset is found
        if ss_set:
            # add scores for all found synsets
            pos_score += ss_set.pos_score()
            neg_score += ss_set.neg_score()
            obj_score += ss_set.obj_score()
            token_count += 1
    # avoid a division by zero when no token has a senti-synset
    token_count = max(token_count, 1)
    # aggregate final scores
    final_score = pos_score - neg_score
    norm_final_score = round(float(final_score) / token_count, 2)
    final_sentiment = 'positive' if norm_final_score >= 0 else 'negative'
    if verbose:
        norm_obj_score = round(float(obj_score) / token_count, 2)
        norm_pos_score = round(float(pos_score) / token_count, 2)
        norm_neg_score = round(float(neg_score) / token_count, 2)
        # to display results in a nice table
        # (codes= replaces the older labels= argument on pandas 0.24 and later)
        sentiment_frame = pd.DataFrame([[final_sentiment, norm_obj_score, norm_pos_score,
                                         norm_neg_score, norm_final_score]],
                                       columns=pd.MultiIndex(levels=[['SENTIMENT STATS:'],
                                                                     ['Predicted Sentiment', 'Objectivity',
                                                                      'Positive', 'Negative', 'Overall']],
                                                             codes=[[0, 0, 0, 0, 0], [0, 1, 2, 3, 4]]))
        print(sentiment_frame)
    return final_sentiment

actual_predicted_sentiments = [analyze_sentiment_sentiwordnet_lexicon(review, verbose=False) for review in test_reviews["title"]]
pre_predicted_sentiments = [analyze_sentiment_sentiwordnet_lexicon(review, verbose=False) for review in df["title"]]
results1 = confusion_matrix(actual_predicted_sentiments, pre_predicted_sentiments)
print(classification_report(actual_predicted_sentiments, pre_predicted_sentiments, labels=['positive', 'negative']))
output metrics:
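Stripped of the POS tagging and synset lookup, the aggregation in the function above is just an average of positive minus negative scores over the matched tokens. Here is a standalone sketch of that step; the (positive, negative, objective) triples in TOY_SCORES are illustrative values we picked, not looked up from SentiWordNet, and the function name is our own.

```python
# Hypothetical per-word (positive, negative, objective) scores; each sums to 1.
TOY_SCORES = {
    "awesome": (0.875, 0.125, 0.0),
    "terrible": (0.0, 0.625, 0.375),
    "film": (0.0, 0.0, 1.0),
}


def aggregate_sentiment(tokens, scores=TOY_SCORES):
    """Average (positive - negative) over tokens that have a score entry."""
    pos = neg = 0.0
    matched = 0
    for token in tokens:
        if token in scores:
            p, n, _ = scores[token]
            pos += p
            neg += n
            matched += 1
    # With no matched tokens the score stays at 0.0 instead of dividing by zero.
    final = (pos - neg) / matched if matched else 0.0
    label = "positive" if final >= 0 else "negative"
    return final, label


print(aggregate_sentiment(["awesome", "film"]))
print(aggregate_sentiment(["terrible", "film"]))
print(aggregate_sentiment([]))
```

Objective words such as "film" still count toward the denominator, so long reviews with few opinion words drift toward a score of 0 and, under the >= 0 rule, toward a positive label.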
• VADER Lexicon
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer

def analyze_sentiment_vader_lexicon(review, threshold=0.1, verbose=False):
    # pre-process text
    review = tn.strip_html_tags(review)
    review = tn.remove_accented_chars(review)
    review = tn.expand_contractions(review)
    # analyze the sentiment for review
    analyzer = SentimentIntensityAnalyzer()
    scores = analyzer.polarity_scores(review)
    # get aggregate scores and final sentiment
    agg_score = scores['compound']
    final_sentiment = 'positive' if agg_score >= threshold else 'negative'
    if verbose:
        # display detailed sentiment statistics
        positive = str(round(scores['pos'], 2)*100)+'%'
        final = round(agg_score, 2)
        negative = str(round(scores['neg'], 2)*100)+'%'
        neutral = str(round(scores['neu'], 2)*100)+'%'
        # (codes= replaces the older labels= argument on pandas 0.24 and later)
        sentiment_frame = pd.DataFrame([[final_sentiment, final, positive,
                                         negative, neutral]],
                                       columns=pd.MultiIndex(levels=[['SENTIMENT STATS:'],
                                                                     ['Predicted Sentiment', 'Polarity Score',
                                                                      'Positive', 'Negative', 'Neutral']],
                                                             codes=[[0, 0, 0, 0, 0], [0, 1, 2, 3, 4]]))
        print(sentiment_frame)
    return final_sentiment

actual_predicted_sentiments1 = [analyze_sentiment_vader_lexicon(review, threshold=0.4, verbose=False) for review in test_reviews["title"]]
pre_predicted_sentiments1 = [analyze_sentiment_vader_lexicon(review, threshold=0.4, verbose=False) for review in df["title"]]
results1 = confusion_matrix(actual_predicted_sentiments1, pre_predicted_sentiments1)
print(classification_report(actual_predicted_sentiments1, pre_predicted_sentiments1, labels=['positive', 'negative']))
output metrics:
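The compound score that the threshold is applied to is a normalized value between -1 and 1. As a rough sketch, and assuming our recollection of VADER's reference implementation is right that it squashes the summed word valences with x / sqrt(x^2 + alpha) using alpha = 15, the effect of choosing a threshold of 0.1 versus 0.4 can be illustrated like this. The function names are our own.

```python
import math


def normalize_compound(raw_sum, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)


def label_from_compound(compound, threshold=0.1):
    """Same rule as the article: compound >= threshold is positive."""
    return "positive" if compound >= threshold else "negative"


for raw in (0.0, 1.0, 2.0, -3.0):
    compound = normalize_compound(raw)
    print(
        raw,
        round(compound, 4),
        label_from_compound(compound, threshold=0.1),
        label_from_compound(compound, threshold=0.4),
    )
```

A mildly positive review (raw sum of 1.0, compound 0.25) is labeled positive at a threshold of 0.1 but negative at 0.4, so raising the threshold trades recall on the positive class for precision.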
Conclusion:
From the visualization, it is clear that among the unsupervised models the one using AFINN gives the best result, and among the supervised models logistic regression performs best on our test movie reviews. Does this mean these models will always perform the best? Absolutely not. It depends on the data you are analyzing.

