As with many other fields, advances in deep learning have pushed sentiment analysis to the cutting edge. Today we combine natural language processing, statistics, and text analysis to extract sentiment from text and classify it as positive, negative, or neutral.
Before we move forward, let's download the dataset that we will use in this project.
You can download the dataset from here: http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Electronics_5.json.gz The download size of the dataset is 1.2GB.
The dataset is gzipped, so you need to extract it on your system. Once extracted, the dataset is around 2.5GB.
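If you prefer to do the extraction in Python rather than with a command-line tool such as gunzip, a minimal sketch (assuming the archive sits in the current working directory) is:

import gzip
import shutil

# Decompress Electronics_5.json.gz into Electronics_5.json
with gzip.open("Electronics_5.json.gz", "rb") as f_in:
    with open("Electronics_5.json", "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)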
This dataset is likely too large to open in Microsoft Excel. If you still want to open it, you can use the Delimit software; download link: http://delimitware.com/download.html
Delimit: Handle large delimited data files with ease.
Let’s analyze the dataset
The dataset contains these columns
reviewerID — ID of the reviewer, e.g. A2SUAM1J3GNN3B
asin — ID of the product, e.g. 0000013714
reviewerName — name of the reviewer
vote — helpful votes of the review
style — a dictionary of the product metadata, e.g., "Format" is "Hardcover"
reviewText — text of the review
overall — rating of the product
summary — summary of the review
unixReviewTime — time of the review (unix time)
reviewTime — time of the review (raw)
image — images that users post after they have received the product
The dataset has a lot of features, but for sentiment analysis we only need the review text and the rating.
import json

# read the json file line by line into a list
values = []
with open("Electronics_5.json", "r") as f:
    for i in f:
        values.append(json.loads(i))

print(values[:5])
Now we create a dataset that has the id, review text, and rating of each product, since that is all we need for sentiment analysis.
We save this filtered dataset to the Electronic_review.csv file; a rough sketch of that step is shown below.
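The filtering code itself is not shown in the excerpt. A minimal sketch, assuming we keep asin as the product id, reviewText as the review text, and overall as the rating (these column names come from the dataset description above):

import pandas as pd

# Build a dataframe from the parsed JSON records and keep only the columns we need
reviews_df = pd.DataFrame(values)
filtered = reviews_df[["asin", "reviewText", "overall"]]

# Save without a header row so it matches the colnames used when reading it back below
filtered.to_csv("Electronic_review.csv", header=False, index=False)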
Now we read the Electronic_review data into a dataframe.
# read the dataset into a dataframe
colnames = ["id", "text", "overall"]
df = pd.read_csv("Electronic_review.csv", names=colnames, header=None)
The division into sentiment classes, on the basis of the overall rating value, is as follows.
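The mapping itself is not reproduced in the excerpt. A minimal sketch, assuming the conventional split (ratings of 4 and 5 are positive, 3 is neutral, 1 and 2 are negative) and writing the labels into the Sentiment column that is used later:

# Map the star rating to a sentiment label (assumed thresholds)
def rating_to_sentiment(overall):
    if overall >= 4:
        return "positive"
    elif overall == 3:
        return "neutral"
    else:
        return "negative"

df["Sentiment"] = df["overall"].apply(rating_to_sentiment)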
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from nltk.corpus import wordnet as wn
import nltk
import re

nltk.download("stopwords")
nltk.download("punkt")
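The cell that produces processedData.csv is not shown in the excerpt; presumably it simply persists the labeled dataframe from the previous step so the heavier text preprocessing can be resumed later, something along the lines of:

# Persist the labeled dataframe (an assumption; the original intermediate step is not shown)
df.to_csv("processedData.csv", index=False)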
Now read the processedData.csv back into a dataframe.
df = pd.read_csv("processedData.csv")
Stemming algorithms work by cutting off the end or the beginning of the word, taking into account a list of common prefixes and suffixes that can be found in an inflected word. This indiscriminate cutting can be successful on some occasions, but not always, which is why this approach has some limitations; a short example follows the next paragraph.
Developing a stemmer is far simpler than building a lemmatizer. For the latter, deep linguistic knowledge is required to create the dictionaries that allow the algorithm to look up the proper form of the word. Once this is done, the noise is reduced and the results of the information retrieval process are more accurate.
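A minimal sketch of that difference, using NLTK's PorterStemmer and WordNetLemmatizer (the example words are only illustrative, and the wordnet corpus has to be downloaded once):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet")

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# e.g. "studies" is stemmed to "studi" but lemmatized (as a verb) to "study"
for word in ["studies", "studying", "cries", "cried"]:
    print(word, "| stem:", stemmer.stem(word), "| lemma:", lemmatizer.lemmatize(word, pos="v"))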
# count, batch and fin come from the surrounding batch loop (not shown):
# fin holds the cleaned, lemmatized reviews for the current batch
df.loc[count:count+batch-1, 'reviewText_final'] = fin

# keep the first 100,000 rows so the rest of the experiments stay manageable
lat_df = df[:100000]
lat_df.to_csv("CurrentUsedFile.csv")
We saved the first 100,000 rows of data as CurrentUsedFile.csv so that the data is easier to process.
Split the dataset into train and test sets
# importing the new dataset
lat_df = pd.read_csv("CurrentUsedFile.csv")
print(lat_df.head(5))
from sklearn import model_selection

# create x and y => x: text review, y: sentiment
Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(
    lat_df['reviewText_final'], lat_df['Sentiment'], test_size=0.2, random_state=42)

print(Train_X.shape, Train_Y.shape)
print(Test_X.shape, Test_Y.shape)
# Vectorize the words using the TF-IDF vectorizer - this measures how important a word
# in a document is in comparison to the rest of the corpus
from sklearn.feature_extraction.text import TfidfVectorizer

Tfidf_vect = TfidfVectorizer(max_features=500000)  # tweak max_features based on the dataset
Tfidf_vect.fit(lat_df['reviewText_final'])

Train_X_Tfidf = Tfidf_vect.transform(Train_X)
Test_X_Tfidf = Tfidf_vect.transform(Test_X)
Before going ahead, let's create a model evaluation function.
from sklearn.metrics import accuracy_score

def modelEvaluation(predictions, y_test_set):
    # Print model evaluation for the predicted results
    print("\nAccuracy on validation set: {:.4f}".format(accuracy_score(y_test_set, predictions)))
# Classifier - Algorithm - Naive Bayes
# fit the training dataset on the classifier
import time
from sklearn import naive_bayes

second = time.time()
Naive = naive_bayes.MultinomialNB()
historyNB = Naive.fit(Train_X_Tfidf, Train_Y)

# predict the labels on the validation dataset
predictions_NB = Naive.predict(Test_X_Tfidf)

modelEvaluation(predictions_NB, Test_Y)
from sklearn.metrics import precision_recall_fscore_support

a, b, c, d = precision_recall_fscore_support(Test_Y, predictions_NB, average='macro')

# Use the accuracy_score function to get the accuracy
print("Naive Bayes Accuracy Score -> ", accuracy_score(predictions_NB, Test_Y)*100)
print("Precision is: ", a)
print("Recall is: ", b)
print("F-1 Score is: ", c)
Now let’s plot the ROC curve for Naive Bayes
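The plotting code itself is not included in the excerpt. A minimal sketch, assuming three sentiment classes and using the fitted Naive Bayes model's class probabilities in a one-vs-rest fashion (matplotlib is an added dependency here):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize

# One-vs-rest ROC curves from the Naive Bayes class probabilities
probs_NB = Naive.predict_proba(Test_X_Tfidf)
classes = Naive.classes_
y_test_bin = label_binarize(Test_Y, classes=classes)

plt.figure()
for i, cls in enumerate(classes):
    fpr, tpr, _ = roc_curve(y_test_bin[:, i], probs_NB[:, i])
    plt.plot(fpr, tpr, label="{} (AUC = {:.2f})".format(cls, auc(fpr, tpr)))
plt.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve - Naive Bayes")
plt.legend(loc="lower right")
plt.show()

The next snippet reports metrics for an SVM, but the cell that trains it is also missing from the excerpt. A minimal sketch, assuming a linear SVM from scikit-learn on the same TF-IDF features (the original may have used a different kernel or sklearn.svm.SVC):

from sklearn import svm

# Train a linear SVM on the TF-IDF features and predict on the validation set
SVM = svm.LinearSVC()
historySVM = SVM.fit(Train_X_Tfidf, Train_Y)
predictions_SVM = SVM.predict(Test_X_Tfidf)

modelEvaluation(predictions_SVM, Test_Y)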
asvm, bsvm, csvm, dsvm = precision_recall_fscore_support(Test_Y, predictions_SVM, average='macro')

# Use the accuracy_score function to get the accuracy
print("SVM Accuracy Score -> ", accuracy_score(predictions_SVM, Test_Y)*100)
print("Precision is: ", asvm)
print("Recall is: ", bsvm)