Fine-tuned LLMs for Sentiment Prediction – How to analyze and evaluate

Evaluation of models on Hugging Face for sentiment prediction

Aug 9, 2023

Sentiment analysis has been transformed in the era of large language models (LLMs). Because LLMs can take the context of a text into account, they are proving to be a very powerful way to analyze sentiment. The number of LLMs available for sentiment analysis on Hugging Face is impressive: the last time I checked, while writing this story, there were 3017 models for the sentiment task. Gone are the days when sentiment analysis was done with a handful of techniques such as traditional machine learning with TF-IDF features, counting positive and negative words, or libraries such as VADER.

Though the huge number of models available is exciting, it can also be overwhelming. This article will help you navigate the LLM jungle for sentiment analysis. I will take some of the top models and show you how to analyze and evaluate them, so that you can better understand which model suits your sentiment analysis needs.

Why do you need to analyze and evaluate the models with your data

Sentiment analysis feeds very important business KPIs. Many enterprises make important decisions, such as promoting or discontinuing a product, based on sentiment analysis of customer reviews.

Most of the fine-tuned models on Hugging Face already provide analysis and evaluation. So you may ask why you need to analyze and make your own evaluation. There are multiple reasons:

The evaluation provided by model developers is based on their data, which may not reflect your business.
Not every model is suitable for your business use case, even if all of them are called sentiment analysis models.
The strategic importance of sentiment analysis demands analyzing and evaluating based on your specific business data.

Approach

The approach which I will take in this story is shown here. I will first select a few candidate models followed by establishing an evaluation criterion. All models will be used to predict sentiment against a common dataset. The output will be analyzed and compared with the evaluation criterion.

Please note that the evaluation here is purely from a sentiment analysis point of view, and not a technical performance point of view.

Candidate models for analysis

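I will analyze the three models listed below. I chose them because they were among the most downloaded models as of the date of writing this story, and because each one uses a different base model. The type of sentiment predicted is also different. Analyzing these models will help us get a complete picture of how sentiment is predicted using fine-tuned LLMs.

nlptown/bert-base-multilingual-uncased-sentiment: base model BERT, predicts a rating of 1 to 5 stars
cardiffnlp/twitter-roberta-base-sentiment-latest: base model RoBERTa, predicts negative, neutral, or positive
bhadresh-savani/distilbert-base-uncased-emotion: base model DistilBERT, predicts an emotion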

You may observe that the models are a mix of general text analysis as well as tweet analysis.

You can apply the approach described here to various other models of your interest.

Data for analyzing the model output

To analyze the models, let us take a dataset of Amazon food reviews. A sample review is shown below.

Example of a customer review (image by author)

I have chosen a customer review dataset because it contains authentic customer reviews. It also has long and complex reviews, compared to alternative datasets such as tweets: the number of characters in a tweet is restricted, while customer reviews can be very long. In addition, analyzing customer reviews is more important for enterprises than analyzing tweets.

Evaluation approach – comparing with the ground truth

In addition to analyzing the models, it is also useful to evaluate them by comparing their results with the ground truth. The actual review data has a rating given by customers on a scale of 1 to 5. One can represent the ground truth using the following visual, which is a histogram based on the actual rating given by the customer.

The actual data has a rating on a scale of 1 to 5 but does not have a negative, neutral, or positive sentiment label. However, we can make the assumption that a rating of 1 or 2 is negative sentiment, a rating of 3 is neutral sentiment, and a rating of 4 or 5 is positive sentiment.

So our ground truth table would look like this.
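The rating-to-sentiment assumption above can be sketched in a few lines of pandas. This is a minimal illustration, not the code used for the article: the column name `rating` and the sample values are hypothetical.

```python
import pandas as pd


def rating_to_sentiment(rating: int) -> str:
    """Map a 1-5 star rating to a sentiment label, following the assumption in the text."""
    if rating <= 2:
        return "negative"
    if rating == 3:
        return "neutral"
    return "positive"


# Hypothetical sample data; the column name "rating" is an assumption.
df = pd.DataFrame({"rating": [5, 1, 3, 4, 2, 5]})
df["true_sentiment"] = df["rating"].map(rating_to_sentiment)

# Share of each sentiment class in percent, i.e. a small ground-truth table.
ground_truth = df["true_sentiment"].value_counts(normalize=True).mul(100).round(2)
print(ground_truth)
```

Keeping the mapping in one small function makes it easy to change the thresholds if, for example, your business treats a 3-star rating as negative.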

Now let us see how various models perform against the actual data. Here is the sentiment analysis using various models.

Fine-tuned LLM for sentiment predicted as a RATING

The nlptown/bert-base-multilingual-uncased-sentiment is one of the most downloaded models. It is a model based on BERT base (Bidirectional Encoder Representations from Transformers). It is fine-tuned to predict the sentiment of the review as a number of stars (between 1 and 5). It works on uncased text and expects a maximum token length of 512. It can be used in six languages – English, Dutch, German, French, Italian, and Spanish.

The sentiment analysis is shown here.

bert-base-multilingual-uncased-sentiment analysis (image by author)

As the data consists of customer reviews, predicting sentiment as a star rating is a natural fit. We can compare it with the actual customer rating as shown below. You can observe that the predicted rating is in line with the actual rating. The model is doing a good job of predicting the sentiment.
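To compare predicted and actual ratings numerically, the text labels returned by the model first need to be turned into numbers. The sketch below assumes the labels look like "1 star" to "5 stars"; that format is an assumption worth verifying against the model's configuration before relying on it. The `predictions` list is hypothetical sample output in the label/score shape that transformers pipelines return.

```python
import re


def stars_from_label(label: str) -> int:
    """Extract the star count from a label such as '4 stars' or '1 star'."""
    match = re.match(r"\s*([1-5])\s+stars?\s*$", label)
    if match is None:
        raise ValueError(f"Unexpected label format: {label!r}")
    return int(match.group(1))


# Hypothetical pipeline output; the label format is an assumption.
predictions = [
    {"label": "5 stars", "score": 0.71},
    {"label": "1 star", "score": 0.64},
    {"label": "3 stars", "score": 0.48},
]

predicted_ratings = [stars_from_label(p["label"]) for p in predictions]
print(predicted_ratings)
```

Raising an error on an unexpected label is a deliberate choice here: silently mapping an unknown label to a default rating would distort the comparison with the ground truth.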

bert-base-multilingual-uncased-sentiment predicted vs actual rating (image by author)

Comparing with the ground truth gives us a mean absolute error percentage of 8.43%.
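The story does not spell out the exact formula behind the mean absolute error percentage. One plausible reading, and it is only an assumption, is the mean absolute difference between the share of each class in the ground truth and its share in the predictions. The function name `distribution_mae_pct` and the sample ratings below are hypothetical.

```python
import pandas as pd


def distribution_mae_pct(actual, predicted, classes) -> float:
    """Mean absolute difference, in percentage points, between class shares.

    Assumed definition: compare the share of each class in the ground truth
    with its share in the predictions, then average the absolute differences.
    """
    actual_share = pd.Series(actual).value_counts(normalize=True).reindex(classes, fill_value=0.0)
    predicted_share = pd.Series(predicted).value_counts(normalize=True).reindex(classes, fill_value=0.0)
    return float((actual_share - predicted_share).abs().mean() * 100)


# Hypothetical actual and predicted star ratings for ten reviews.
actual = [5, 5, 4, 1, 3, 5, 2, 5, 4, 1]
predicted = [5, 4, 4, 1, 3, 5, 1, 5, 5, 1]

error = distribution_mae_pct(actual, predicted, classes=[1, 2, 3, 4, 5])
print(f"Mean absolute error: {error:.2f}%")
```

Note that a distribution-level metric like this can be low even when individual reviews are misclassified, because errors in opposite directions cancel out. If you need per-review quality, compare labels row by row instead.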

nlptown/bert-base-multilingual-uncased-sentiment evaluation (image by author)

Now let us move to the next fine-tuned large language model.

Fine-tuned LLM for sentiment predicted as NEGATIVE, NEUTRAL, POSITIVE

Here we will analyze the cardiffnlp/twitter-roberta-base-sentiment-latest model. The base model, RoBERTa (A Robustly Optimized BERT Pretraining Approach), is a variant of BERT and was introduced by Facebook. The cardiffnlp/twitter-roberta-base-sentiment-latest model is fine-tuned on tweets to predict sentiment as negative, neutral, or positive. The sentiment prediction results are shown here.

cardiffnlp/twitter-roberta-base-sentiment analysis (image by author)

Comparing with the ground truth gives us a mean absolute error percentage of 1.86%.

cardiffnlp/twitter-roberta-base-sentiment evaluation (image by author)

The result is really great, especially given the fact that the model is fine-tuned on tweets and not on actual customer reviews.
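An aggregate error figure does not show which classes get confused with each other. As a complementary check, which is not part of the original analysis, you could tabulate true against predicted sentiment per review. The column names and sample values below are hypothetical.

```python
import pandas as pd

# Hypothetical per-review labels; column names are assumptions.
df = pd.DataFrame({
    "true_sentiment": ["positive", "positive", "negative", "neutral", "positive", "negative"],
    "predicted_sentiment": ["positive", "neutral", "negative", "positive", "positive", "negative"],
})

labels = ["negative", "neutral", "positive"]

# Rows are the ground truth, columns are the model predictions.
confusion = (
    pd.crosstab(df["true_sentiment"], df["predicted_sentiment"])
    .reindex(index=labels, columns=labels, fill_value=0)
)

# Fraction of reviews where the predicted label equals the ground truth.
accuracy = (df["true_sentiment"] == df["predicted_sentiment"]).mean()

print(confusion)
print(f"Per-review accuracy: {accuracy:.2%}")
```

With reviews mapped from star ratings, the neutral row is usually the one to watch, since a 3-star review can read as mildly positive or mildly negative.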

Now let us move to the next model.

Fine-tuned LLM for sentiment predicted as an EMOTION

Sentiment can be expressed in various forms, and one of them is emotion. The bhadresh-savani/distilbert-base-uncased-emotion model predicts six types of emotions: sadness, joy, love, anger, fear, and surprise. It is fine-tuned from DistilBERT, a smaller and faster variant of BERT that was designed to have a smaller memory footprint. The sentiment analysis results are shown below.

bhadresh-savani/distilbert-base-uncased-emotion analysis (image by author)

The analysis is interesting, as emotions are a nice way to analyze customer reviews. However, linking emotion to sentiment is not straightforward. Here is a visual which will help you understand the link between emotion and sentiment.

You will observe that in the 5-star actual rating, there are all types of emotions: joy, anger, fear, love, sadness, and surprise. Here is an example that has emotion as fear, but has a 5-star rating.

fear emotion, but has a 5-star rating (image by author)

The review shows that the customer was nervous and fearful about changing the dog's food. However, the product that was bought was good: it helped overcome the fear and also solved the problem.
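One way to explore the link between emotion and rating in your own data is a cross-tabulation of the predicted emotion against the actual rating. The sketch below uses hypothetical column names and sample values; it is not the code behind the visual in this story.

```python
import pandas as pd

# Hypothetical reviews with the actual rating and the predicted emotion.
df = pd.DataFrame({
    "rating": [5, 5, 5, 5, 1, 1, 3, 4],
    "emotion": ["joy", "joy", "fear", "love", "anger", "sadness", "joy", "joy"],
})

# For each rating (row), the percentage of reviews predicted as each emotion (column).
emotion_by_rating = (
    pd.crosstab(df["rating"], df["emotion"], normalize="index")
    .mul(100)
    .round(1)
)
print(emotion_by_rating)

# Reviews where a high rating coincides with a negative-sounding emotion.
surprising = df[(df["rating"] == 5) & (df["emotion"].isin(["fear", "anger", "sadness"]))]
print(surprising)
```

Reading the rows that look contradictory, such as a 5-star review tagged with fear, is often where the interesting customer insight is.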

Analyzing emotions can also help understand the reasons why customers purchase. However, it is difficult to compare with the ground truth.

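So, the final evaluation looks as follows:

nlptown/bert-base-multilingual-uncased-sentiment (rating): mean absolute error percentage of 8.43%
cardiffnlp/twitter-roberta-base-sentiment-latest (negative, neutral, positive): mean absolute error percentage of 1.86%
bhadresh-savani/distilbert-base-uncased-emotion (emotion): difficult to compare with the ground truth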

You can take the approach described here for various other fine-tuned LLMs for sentiment analysis.

Conclusion

Sentiment analysis is very important for many enterprises, as it helps them better understand customers and product performance. Fine-tuned LLMs provide a cutting-edge approach to analyzing sentiment. As the number of fine-tuned models is ever-increasing, it is useful to understand the various types of models and how to evaluate them.

In this story, you saw three types of sentiment prediction – rating prediction, negative/neutral/positive sentiment prediction, and emotion prediction. They can all be used for various business use cases to better understand customer sentiment. You saw how to analyze them as well as evaluate the results.

As there are lots of fine-tuned large-language models available, you might not want to waste effort in trying out models randomly. So here are some guidelines on how to select models you can experiment with:

Business use-case requirement: Understanding your requirement for sentiment analysis is the main starting point that can put you in the right direction. Sentiment analysis for business use cases can differ based on various industries. For example, a use-case in banking may need sentiment analysis as positive, neutral, or negative. However, an e-commerce retailer would need sentiment based on a rating scale of 1 to 5. Understanding the requirement can help you correctly determine the scope of models which you need to experiment with.

Type of your data: Many types of data can require sentiment analysis. Customer reviews and tweets are the obvious ones, but there are many others, such as emails, quality inspection notes, user-submitted content, survey responses, employee feedback, healthcare feedback, etc. It is important to check what data the large language model was fine-tuned on. Generally, it is better to select a model that was fine-tuned on the same type of data as yours. Analyzing sentiment on a data type different from the one used in fine-tuning can work, but treat it as an exception. For example, you can do sentiment analysis on customer reviews using a model that was fine-tuned on tweets. However, sentiment analysis on employee feedback using a model fine-tuned on tweets might not give good results. So try to select a model that was fine-tuned on data related to your own.

Base model and training parameters: You should carefully check the base model as well as the training parameters used. They determine the restrictions you may need to apply, such as the maximum number of tokens or the case of your text. For example, if the base model's maximum length is 512 tokens and your text is longer, the text has to be truncated or split before prediction, and the result may not reflect the full text. So try to understand the base model and training parameters to determine all the restrictions you might need to apply. Analyzing the restrictions can help you decide whether the model is suitable for your needs.

Multi-lingualism: As generative AI gains strategic importance for enterprises, most projects are global. A sentiment analysis project is therefore likely to be global in nature and should take different languages into account. This means that the model you select should be multilingual, or you should select multiple models, one for each language in which sentiment analysis is required.

Evaluation criterion: Even though fine-tuned models are available and ready to use, you should still set up some evaluation criteria. The evaluation can rely on ground-truth information you might have, or it can be a human evaluation. The evaluation criterion needs to be objective, for example, the mean absolute error between model output and ground truth, as I have used in this story. You saw that a model which analyzes emotion cannot be evaluated this way if the ground truth is a rating. Setting up the evaluation criterion can also help you decide which models to experiment with.
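The token-length restriction mentioned under "Base model and training parameters" can be checked before running a model. The sketch below is a minimal illustration: `find_overlong_texts` is a hypothetical helper, and a whitespace counter stands in for a real tokenizer so that the example runs without downloading anything. Real tokenizers usually produce more tokens than a whitespace split, so treat the stand-in as a rough lower bound.

```python
from typing import Callable, List, Sequence


def find_overlong_texts(
    texts: Sequence[str],
    count_tokens: Callable[[str], int],
    max_tokens: int = 512,
) -> List[int]:
    """Return the positions of texts whose token count exceeds max_tokens."""
    return [i for i, text in enumerate(texts) if count_tokens(text) > max_tokens]


def whitespace_count(text: str) -> int:
    """Stand-in token counter (assumption): counts whitespace-separated words."""
    return len(text.split())


# With a Hugging Face tokenizer you could pass something like this instead:
#   tokenizer = AutoTokenizer.from_pretrained(model_name)
#   count_tokens = lambda text: len(tokenizer(text)["input_ids"])

# Hypothetical reviews; the second one is artificially long.
reviews = ["Great product", "word " * 600, "Not bad at all"]
overlong = find_overlong_texts(reviews, whitespace_count)
print(overlong)
```

If a large share of your texts is flagged, that is a signal to plan for truncation or splitting, or to look for a model with a longer maximum length.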

Technical Implementation

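Here is a Python code snippet that uses the fine-tuned model to predict sentiment:

import pandas as pd
from transformers import pipeline

# Read the data; replace 'path_to_file' with the path to your CSV file
file_name = 'path_to_file'
df = pd.read_csv(file_name)
col_txt = "reviews"  # name of the column that holds the review text

# Load the fine-tuned model from Hugging Face
task = "sentiment-analysis"
model = "nlptown/bert-base-multilingual-uncased-sentiment"
sentiment_model = pipeline(task, model=model)

# Predict sentiment; truncation=True cuts reviews longer than the model's maximum length
lst_txt = df[col_txt].astype(str).tolist()
lst_sentiment = sentiment_model(lst_txt, truncation=True)

# Store the predicted label and its confidence score next to each review
df['sentiment'] = [s['label'] for s in lst_sentiment]
df['sentiment_score'] = [s['score'] for s in lst_sentiment]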
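To apply the same approach to several candidate models, the prediction step can be wrapped in a small helper. This is a sketch under assumptions: `add_predictions` is a hypothetical function, and a fake classifier stands in for a real pipeline so that the example runs without downloading a model. Any callable that returns a list of dictionaries with `label` and `score` keys, as the pipeline in the snippet above does, can be plugged in.

```python
from typing import Callable, Dict, List

import pandas as pd

# A classifier takes a list of texts and returns one {"label", "score"} dict per text.
Classifier = Callable[[List[str]], List[dict]]


def add_predictions(df: pd.DataFrame, text_col: str, classifiers: Dict[str, Classifier]) -> pd.DataFrame:
    """Return a copy of df with one label column and one score column per classifier."""
    out = df.copy()
    texts = out[text_col].astype(str).tolist()
    for name, classify in classifiers.items():
        results = classify(texts)
        out[f"{name}_label"] = [r["label"] for r in results]
        out[f"{name}_score"] = [r["score"] for r in results]
    return out


def fake_classifier(texts: List[str]) -> List[dict]:
    """Stand-in for a real model (assumption): 'good' means positive, else negative."""
    return [
        {"label": "positive" if "good" in t.lower() else "negative", "score": 1.0}
        for t in texts
    ]


# Hypothetical reviews; with real models you would pass pipeline objects instead,
# for example {"nlptown": pipeline("sentiment-analysis", model="nlptown/bert-base-multilingual-uncased-sentiment")}.
df = pd.DataFrame({"reviews": ["Good coffee", "Stale and bland"]})
result = add_predictions(df, "reviews", {"demo": fake_classifier})
print(result)
```

Keeping every model's output in its own pair of columns makes it straightforward to evaluate each one against the same ground truth afterwards.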

Dataset citation

The dataset is available here with license CC0 Public domain. Both commercial and non-commercial use of it is permitted.

Written By

Pranay Dave

Artificial Intelligence, Hugging Face, Large Language Models, Llm, Sentiment Analysis
