
Introduction
This article aims at selecting and deploying the optimal machine learning model to perform sentiment analysis on a dataset of book reviews from the Amazon Kindle Store.
In a previous article, we optimized a Support Vector Machines algorithm on an IMDB movie review database. Although SVM is a great algorithm for classification problems, is it also the best choice? With this new project, the goal is to add a model selection step.
By pipelining multiple models we can compare them on different aspects, including accuracy and efficiency.
The data
The dataset, which you can find at the following link, is composed of roughly 1,000,000 book reviews [1][5]. The goal is to produce a high-performing sentiment analyzer by training it on one portion of the dataset. If you want to review what sentiment analysis is, I can suggest a quick read of this article, which covers all the basics.
The structure of the csv file includes 10 columns:
- asin – ID of the product, e.g. B000FA64PK
- helpful – helpfulness rating of the review, e.g. 2/3
- overall – rating of the product
- reviewText – text of the review (heading)
- reviewTime – time of the review (raw)
- reviewerID – ID of the reviewer, e.g. A3SPTOKDG7WBLN
- reviewerName – the name of the reviewer
- summary – summary of the review (description)
- unixReviewTime – UNIX timestamp
For the algorithm, we’d only need two: the review corpus and its rating.
The following is a sample review:
I enjoy vintage books and movies so I enjoyed reading this book. The plot was unusual. Don't think killing someone in self-defense but leaving the scene and the body without notifying the police or hitting someone in the jaw to knock them out would wash today. Still it was a good read for me.
As you can see, the text presents very few "special" characters, which makes it easier to clean. On top of that, we should eliminate all prepositions and pronouns that add no value, only complexity in terms of volume.
Text Processing
Even though text pre-processing is not the focus of this article, I’m going to explain it briefly. If you’d like more details, please follow the link to one of my previous articles.
Within the field of Natural Language Processing, the text contains a wide variety of symbols and words that are not useful for an ML algorithm. There is therefore a phase, called text processing, that refers to the analysis, manipulation, and generation of text [2].
This makes it easier for an ML model to train as it reduces the "noise", the amount of non-useful information for prediction.
Models
For this particular classification problem, I considered six popular and high-performing machine learning algorithms. Each has its peculiarities and will most likely return different results from the others:
- Linear Discriminant Analysis (LDA) focuses on the minimization of within-class variance along with the maximization of between-class variance. In other words, the algorithm traces a hyperplane (a decision boundary that helps classify data points) that guarantees maximum class separability.
- The k-Nearest-Neighbours (kNN) classifies a record t by looking at its k nearest neighbors in the data. A majority vote among those neighbors then decides which class t belongs to. In its basic form, this method does not weight neighbors by distance.
- Gaussian Naive Bayes is a probabilistic classifier. Simplicity and efficiency are its most defining characteristics. The Bayes theorem stands at the very foundation of the classifier. It calculates the probability that a record is part of a category or another [3].
- Given a dataset of independent variables, Logistic Regression estimates the probability of an event happening. It applies the logistic (sigmoid) function to a linear combination of the inputs, returning a number between 0 and 1, where values close to 0 mean the event is unlikely to occur. A threshold (typically 0.5) then maps that probability to a final class of either 0 or 1.
- A Decision Tree Classifier (CART) is a model that separates data by "asking" a series of questions, thereby deciding which bucket a new record falls into. This creates a tree-shaped chart where each node represents a question and each leaf an output value used to make a prediction.
- The Support Vector Machines (SVM) is a non-probabilistic classifier. In a similar way to LDA, SVM employs the concept of the hyperplane: assuming two classes are linearly separable, they can be divided by a hyperplane. On top of that, the algorithm maximizes the margin, the space between the hyperplane and the closest points of each class; those closest points are called "support vectors".
Hyperparameters
Before defining a hyperparameter, it is important to define a "standard" parameter. When a model converges, we can say it has found the best combination of parameters to describe the general behavior of the data it was trained on.
Every machine learning model has, metaphorically speaking, an architecture. A hyperparameter gives the user the chance to take an experimental approach to finding the best configuration for the model’s "structure". Suppose there are 5 hyperparameters, each with 10 possible settings: "tuning" happens when the practitioner tries various combinations of settings until performance reaches its peak. For example, hyperparameter 1 might end up equal to 5, hyperparameter 2 equal to 0.2, and so on.
I’d like to highlight the fundamental difference between a parameter and a hyperparameter. The former, also called a coefficient or weight, is a consequence of the learning process; the latter is manually set by the user before training.
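To make the distinction concrete, here is a minimal scikit-learn sketch on a hypothetical toy dataset: C is a hyperparameter chosen by the user before training, while coef_ and intercept_ are parameters learned from the data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy dataset: one feature, two classes
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

# C is a hyperparameter: set before training
model = LogisticRegression(C=10)
model.fit(X, y)

# coef_ and intercept_ are parameters: learned during training
print(model.coef_, model.intercept_)
```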
Code Deployment
Data Visualization
The first portion of code deployment focuses on data exploration and visualization. The distribution of scores associated with text reviews could tell us valuable information.
#Importing libraries
import matplotlib.pyplot as plt
import pandas as pd
#Reading Dataset
df = pd.read_csv('kindle_reviews.csv')
#Counting values within each review bucket
category_dist = df['overall'].value_counts().sort_index()
#Defining chart
plt.figure(figsize=(8,6))
category_dist.plot(kind='bar')
plt.grid()
plt.xlabel("Scores", fontsize = 12)
plt.ylabel("Number of Reviews", fontsize = 12)
plt.title("Distribution of Reviews by Rating", fontsize = 15)
#Generating chart
plt.show()

The vast majority of reviews are concentrated in the 4- and 5-star buckets, which is usually not good: the dataset is not balanced, and the algorithm could therefore be biased when we train it.
On top of that, research shows that binary classification is overall more accurate than multi-class classification [4]. Since this gives us higher accuracy, we transform the project into a binary classification problem.
One possible way to perform the change is to consider all reviews above the score of 3 as "Positive" and all the ones at 3 or below as "Negative". The following step is to perform the model selection on only a portion of the dataset.
Identifying the best model on almost 1,000,000 rows could skyrocket the processing time without adding any valuable performance information.
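A minimal sketch of the binarization step, on a hypothetical handful of rows (on the real dataset, df.sample(n=50000) would then draw the 50,000-row subset):

```python
import pandas as pd

# Hypothetical stand-in for a few rows of kindle_reviews.csv
df = pd.DataFrame({
    'reviewText': ['Loved it', 'It was fine', 'Terrible plot', 'Great read', 'Not for me'],
    'overall': [5, 3, 1, 4, 2],
})

# Above 3 stars is "Positive", 3 or below is "Negative"
df['overall'] = df['overall'].apply(lambda s: 'Positive' if s > 3 else 'Negative')
print(df['overall'].tolist())  # → ['Positive', 'Negative', 'Negative', 'Positive', 'Negative']
```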
After selecting 50,000 rows from the original dataset and turning the review scores into only two categories, the final result should look like this:

Text Cleaning
The second portion of code deployment focuses on text cleaning. As previously mentioned, more words within a sentence add more complexity while contributing marginally less to accuracy.
The solution lies in eliminating stop words. In English, "the", "is" and "and" among others qualify as stop words. In text mining, stop words are defined as unimportant words that carry very little useful information.
On top of that, the text needs to be made uniform by lowercasing every word, and special characters need to be eliminated. If we apply the code to the example sentence seen previously, the result is similar to the following:
enjoy vintage books movies enjoyed reading book plot unusual think killing someone self defense leaving scene body without notifying police hitting someone jaw knock would wash today still good read
You would agree with me that now negative and positive words are more easily recognizable, for example, "enjoy", "enjoyed", and "good" signal a potential positive review. After checking, the reviewer assigned 5 stars to the book. At the end of the project, our model should classify this as "Positive".
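A cleaning function in this spirit, as a sketch: it lowercases the text, strips non-letter characters, and drops stop words. The stop-word list here is a small illustrative subset; in practice a full list (e.g. NLTK's) would be used.

```python
import re

# Illustrative subset of English stop words (a real list is much longer)
STOP_WORDS = {'i', 'a', 'an', 'the', 'and', 'so', 'this', 'that', 'was', 'it', 'to', 'of', 'in'}

def clean_text(text):
    # Lowercase, then replace anything that is not a letter with a space
    text = re.sub(r'[^a-z\s]', ' ', text.lower())
    # Keep only non-stop words
    return ' '.join(w for w in text.split() if w not in STOP_WORDS)

print(clean_text("I enjoy vintage books and movies so I enjoyed reading this book."))
# → enjoy vintage books movies enjoyed reading book
```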
A final step consists of upsampling the negative reviews until they reach the same number of positive ones. "Upsampling" artificially generates data points of a minority class and adds them to the dataset. The process aims at having the same count for both labels and prevents the model from becoming biased towards a majority class.
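One common way to do this is scikit-learn's resample utility; a sketch on a hypothetical imbalanced frame:

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced data: four Positive rows, two Negative rows
df = pd.DataFrame({
    'reviewText': ['good', 'great', 'fine', 'nice', 'bad', 'awful'],
    'overall': ['Positive'] * 4 + ['Negative'] * 2,
})

majority = df[df['overall'] == 'Positive']
minority = df[df['overall'] == 'Negative']

# Sample the minority class with replacement until it matches the majority count
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])
```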
Model Selection
The third portion of code deployment focuses on selecting the best-performing model for our data.
- Scikit-learn and matplotlib are the only two libraries needed. Sklearn includes all the functions needed to perform machine learning on the data. Each model is imported from its own scikit-learn module: LogisticRegression, KNeighborsClassifier, DecisionTreeClassifier, SVC, GaussianNB, and LinearDiscriminantAnalysis.
- In the previous section, we used a technique called "upsampling" to match the number of negative book reviews to the positive ones. At this stage, we can define the input and the target variable from that data. The input variable x is "reviewText", containing the corpus of the review; the output variable y is "overall", containing the labels "positive" and "negative".
- Some of the models under analysis, Gaussian Naive Bayes for example, require a dense matrix, one that stores every value explicitly rather than the sparse format the count vectorizer outputs. The DenseTransformer class converts all matrices to dense to avoid any error.
- A list called "models" is then created and the object for each model is appended to it. The list "results", on the other hand, will contain all the model scores associated with their names.
- The kfold parameter indicates how many folds we want. This introduces the concept of cross-validation, which aims at better estimating the accuracy of a model. In k-fold cross-validation, the data is divided into k subgroups; the model is trained on k-1 of them and tested on the remaining one, and the process is repeated k times. The accuracy is the average of the k scores.
- The pipeline applies the count vectorizer, the dense transformer, and the model of choice on the k-fold of choice and performs cross-validation. It then prints the results on the console.
- matplotlib finally generates a boxplot chart so we can better interpret the results.
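Putting the bullet points above together, a sketch of the selection loop (the tiny corpus here is hypothetical, standing in for the 50,000 cleaned reviews of the real run):

```python
import matplotlib.pyplot as plt
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Converts the vectorizer's sparse output to the dense array some models need
class DenseTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X.toarray()

# Hypothetical mini-corpus standing in for the cleaned reviews
x = ['good great book', 'loved this book', 'excellent fun read', 'really enjoyed it',
     'wonderful story', 'great characters', 'bad boring book', 'hated this book',
     'terrible dull read', 'really disliked it', 'awful story', 'weak characters']
y = ['Positive'] * 6 + ['Negative'] * 6

models = [('LR', LogisticRegression(max_iter=1000)),
          ('LDA', LinearDiscriminantAnalysis()),
          ('KNN', KNeighborsClassifier()),
          ('CART', DecisionTreeClassifier()),
          ('NB', GaussianNB()),
          ('SVM', SVC())]

results, names = [], []
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for name, model in models:
    pipe = Pipeline([('vect', CountVectorizer()),
                     ('dense', DenseTransformer()),
                     (name, model)])
    scores = cross_val_score(pipe, x, y, cv=kfold, scoring='accuracy')
    results.append(scores)
    names.append(name)
    print('%s: %f (%f)' % (name, scores.mean(), scores.std()))

# Boxplot of the cross-validation scores per model
plt.boxplot(results)
plt.xticks(range(1, len(names) + 1), names)
plt.title('Algorithm Comparison')
plt.show()
```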

Algorithms’ Accuracy Comparison after Model Selection Process – Chart by Author
A boxplot gives you five critical numbers about each model's accuracy: the minimum, the first quartile (25th percentile), the median (second quartile), the third quartile (75th percentile), and the maximum.
On the x-axis are all the models taken into analysis for cross-validation; on the y-axis, the accuracy score. For example, the logistic regression model is the best performer, with a minimum accuracy of 0.82 and a maximum of 0.86, but the Support Vector Machines guarantee the most consistent performance, with minimum and maximum values close to each other. KNN is the worst performer and shows very little consistency. LDA, CART, and Naive Bayes also don’t perform well.
For the reasons stated above, hyperparameters optimization will now focus on the two highest-performing models: Logistic Regression and Support Vector Machines.
Support Vector Machines Hyperparameters Tuning
The fourth portion of code deployment focuses on hyperparameters tuning for the Support Vector Machines model:
- Libraries were already imported in the previous code cell but I decided to re-import them to make this code snippet independent
- The pipeline this time only includes the Count Vectorizer and the Support Vector Machines model, no dense transformer is required
- Moving on to the parameter grid: it contains all the combinations of hyperparameters grid_search is going to try for each component of the pipeline. For example, the Count Vectorizer's max_df parameter is responsible for the generalization of the model. Max_df removes words that appear too frequently, and a max_df of 0.7 ignores terms that appear in more than 70% of the documents. In one scenario, the grid will combine a max_df of 0.7 with an ngram_range of (1,1), a poly kernel, and a C parameter of 10. The total number of "fits" is 405 because each of the 81 candidate combinations is cross-validated 5 times.
- After launching grid_search.fit, the last portion of code prints the results on the console as the calculations are performed.
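A condensed sketch of this grid search (the grid below is a smaller hypothetical one — the article's full run had 81 candidate combinations, hence 405 fits — and the toy corpus replaces the real reviews):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical mini-corpus standing in for the cleaned reviews
x = ['good great book', 'loved this book', 'excellent fun read', 'really enjoyed it',
     'wonderful story', 'great characters', 'bad boring book', 'hated this book',
     'terrible dull read', 'really disliked it', 'awful story', 'weak characters']
y = ['Positive'] * 6 + ['Negative'] * 6

pipeline = Pipeline([('vect', CountVectorizer()), ('SVM', SVC())])

# Each key targets a pipeline step: "vect__" for the vectorizer, "SVM__" for the model
parameters = {
    'vect__max_df': (0.7, 1.0),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'SVM__kernel': ('rbf', 'poly'),
    'SVM__C': (1, 10),
}

# 16 combinations x 5 folds = 80 fits on this toy grid
grid_search = GridSearchCV(pipeline, parameters, cv=5, scoring='accuracy')
grid_search.fit(x, y)
print('Best: %f using %s' % (grid_search.best_score_, grid_search.best_params_))
```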
The results are the following:
Best: 0.840734 using {'SVM__C': 10, 'SVM__kernel': 'rbf', 'vect__max_df': 0.7, 'vect__ngram_range': (1, 2)}
0.730129 (0.038366) with: {'SVM__C': 10, 'SVM__kernel': 'poly', 'vect__max_df': 0.7, 'vect__ngram_range': (1, 1)}
0.692791 (0.049500) with: {'SVM__C': 10, 'SVM__kernel': 'poly', 'vect__max_df': 0.7, 'vect__ngram_range': (1, 2)}
The top-performing hyperparameters setting is:
- C = 10
- type of kernel = rbf
- max_df = 0.7
- ngram_range = (1, 2)
If we apply the pipeline with the optimized hyperparameters found above to the 90,000 upsampled rows and calculate the performance in terms of accuracy, the result is quite astonishing. The Support Vector Machines model optimized for the Kindle reviews is 99% effective at categorizing new data. The results appeared after roughly 1 hour of processing time.
precision recall f1-score support
Negative 0.98 1.00 0.99 8924
Positive 1.00 0.98 0.99 9109
accuracy 0.99 18033
macro avg 0.99 0.99 0.99 18033
weighted avg 0.99 0.99 0.99 18033
Logistic Regression Hyperparameters Tuning
The fifth portion of code deployment focuses on hyperparameter optimization for the Logistic Regression algorithm. The code is extremely similar to the previous section, with a few important differences.
The logistic regression model needs a parameter for the maximum number of iterations it can run; without it, the model might not converge. On top of that, the grid swaps the SVM kernels for solvers. The solvers considered for the analysis are "newton-cg", "lbfgs", and "liblinear".
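The corresponding sketch for the logistic regression grid (again with a hypothetical toy corpus; max_iter is raised so every solver converges):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical mini-corpus standing in for the cleaned reviews
x = ['good great book', 'loved this book', 'excellent fun read', 'really enjoyed it',
     'wonderful story', 'great characters', 'bad boring book', 'hated this book',
     'terrible dull read', 'really disliked it', 'awful story', 'weak characters']
y = ['Positive'] * 6 + ['Negative'] * 6

# max_iter gives the solver enough iterations to converge
pipeline = Pipeline([('vect', CountVectorizer()),
                     ('LR', LogisticRegression(max_iter=1000))])

parameters = {
    'vect__max_df': (0.7, 1.0),
    'vect__ngram_range': ((1, 1), (1, 2), (1, 3)),
    'LR__solver': ('newton-cg', 'lbfgs', 'liblinear'),
    'LR__C': (1.0, 10, 100),
}

grid_search = GridSearchCV(pipeline, parameters, cv=5, scoring='accuracy')
grid_search.fit(x, y)
print('Best: %f using %s' % (grid_search.best_score_, grid_search.best_params_))
```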
The results are the following:
Best: 0.857362 using {'LR__C': 1.0, 'LR__solver': 'newton-cg', 'vect__max_df': 0.7, 'vect__ngram_range': (1, 3)}
0.825643 (0.003937) with: {'LR__C': 100, 'LR__solver': 'newton-cg', 'vect__max_df': 0.7, 'vect__ngram_range': (1, 1)}
0.852704 (0.004520) with: {'LR__C': 100, 'LR__solver': 'newton-cg', 'vect__max_df': 0.7, 'vect__ngram_range': (1, 2)}
The best-performing hyperparameters are:
- C = 1
- solver = newton-cg
- max_df = 0.7
- ngram_range = (1, 3)
Again, if we apply the logistic regression model with optimized hyperparameters to the 90,000 upsampled rows, the result is extremely close to the SVM model's. A 98% accuracy is remarkably difficult to achieve in any machine learning scenario. The processing time this time is roughly 3 minutes.
precision recall f1-score support
Negative 0.97 1.00 0.98 8924
Positive 1.00 0.96 0.98 9109
accuracy 0.98 18033
macro avg 0.98 0.98 0.98 18033
weighted avg 0.98 0.98 0.98 18033
Testing the algorithm
Both SVM and Logistic Regression show great performance in terms of accuracy. If we based the choice of the algorithm solely on this parameter, the Support Vector Machines model would be the best choice. Another important aspect needs to be considered though: the processing time. The SVM model took one hour to train on roughly 70,000 training rows, while the logistic regression took only 3 minutes. Computational efficiency, on top of accuracy, is fundamental when it comes to model deployment.
At this stage, a final check is needed. "test" contains a positive sentence whereas "test_1" has a negative one.
#Testing Algorithm on single sentences
test = ['The book was really good, I could have not imagined a better ending']
test_1 = ['The book was generally bad, the plot was boring and the characters were not original']
test = count_vect.transform(test).toarray()
test_1 = count_vect.transform(test_1).toarray()
#Printing prediction
print(LR.predict(test))
print(LR.predict(test_1))
The following output shows that the trained classifier correctly predicts whether a book review is positive or negative.
Output:
['Positive']
['Negative']
Conclusion
The article shows how model selection and hyperparameter tuning can dramatically improve accuracy while also providing a full picture of other aspects, such as processing time. We started with six models; four were filtered out, and of the two remaining high performers, only one should make it to production. Thanks to the very well-structured data from the Amazon Kindle Store, many future projects could be built on top of sentiment analysis alone: an AI that reviews and rates a book before it is even published, or an early indicator of which books will do great right after they are added to the store. Especially when it comes to machine learning, the possibilities are endless.
As a final note, if you liked the content please consider dropping a follow to be notified when new articles are published. If you have any considerations to make about the article, write them in the comments! I’d love to read them 🙂 Thank you for reading!
PS: If you like my writing, it would mean the world to me if you could subscribe to a medium membership through this link. It’s an indirect way of supporting me and you get the amazing value that medium articles provide!
References
[1] McAuley, J. (2018). Amazon review data. Retrieved July 31, 2022, from Ucsd.edu website: http://jmcauley.ucsd.edu/data/amazon/
[2] Appel, O., Chiclana, F., Carter, J., & Fujita, H. (2016, May 19). A Hybrid Approach to the Sentiment Analysis Problem at the Sentence Level. Retrieved July 30, 2022, from ResearchGate website: https://www.researchgate.net/publication/303402645_A_Hybrid_Approach_to_the_Sentiment_Analysis_Problem_at_the_Sentence_Level
[3] Raschka, S., & Mirjalili, V. (2014). Naive Bayes and Text Classification I Introduction and Theory. Retrieved from https://arxiv.org/pdf/1410.5329.pdf
[4] Jha, A., Dave, M., & Madan, S. (2019). Comparison of Binary Class and Multi-Class Classifier Using Different Data Mining Classification Techniques. SSRN Electronic Journal. https://doi.org/10.2139/ssrn.3464211
[5] He, R., & McAuley, J. (2016). Ups and Downs: Modeling the Visual Evolution of Fashion Trends with One-Class Collaborative Filtering. https://doi.org/10.1145/2872427.2883037
