Positive, negative, neutral, or mixed sentiment. If you've ever used a social monitoring tool, these terms should be very familiar.
But what exactly is sentiment analysis and how accurate can it be?
Sentiment analysis is defined as: the process of algorithmically identifying and categorizing opinions expressed in text to determine the writer's attitude toward the subject of the document (or post).
To accomplish this, providers use a variety of approaches, all of which have widely varying degrees of performance and accuracy.
While measuring performance in terms of speed is usually a straightforward process, measuring performance in terms of accuracy is anything but.
And when it comes to using social data to understand consumer opinions, sentiment accuracy is incredibly important.
Let's take a look at how sentiment analysis works, how to determine accuracy, and how to spot bad analysis.
Why sentiment analysis is actually really difficult
Human language is elaborate, with nearly infinite grammatical variations, misspellings, slang and other challenges that make accurate automated analysis of natural language quite difficult.
Traditional approaches to sentiment analysis are surprisingly simple in design. They struggle with complicated language structures and fail when contextual information is required to correctly interpret a phrase.
Consider, for example, the following two sentences:
“I like that Corvette.”
“That looks like a Corvette.”
When analyzing sentiment, the first example would optimally be scored as positive, with the second marked neutral. However, the vast majority of systems will not mark these examples correctly, as the word expressing positivity in the first sentence, “like”, is not expressing tone in the second.
The best way for a system to correctly interpret this complexity is to understand the context around the word’s usage.
Consider a second example:
“I’m craving McDonald’s so bad.”
Again, most systems will misinterpret this statement, seeing the word “bad”, or even the phrase “so bad”, and score the sentence as negative. Contextual understanding is critical for a system to be able to reach human-level accuracy.
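To make the failure concrete, here is a minimal sketch of a context-free keyword scorer. The word lists and the `naive_sentiment` function are hypothetical and written only for illustration; they are not any vendor's actual method. The sketch simply counts lexicon hits, which is enough to reproduce both mistakes described above.

```python
import re

# Hypothetical toy lexicon; real systems use far larger word lists.
POSITIVE_WORDS = {"like", "love", "great"}
NEGATIVE_WORDS = {"bad", "hate", "awful"}


def naive_sentiment(text: str) -> str:
    """Score text by counting lexicon hits, ignoring all context."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(word in POSITIVE_WORDS for word in words)
    negatives = sum(word in NEGATIVE_WORDS for word in words)
    if positives and negatives:
        return "mixed"
    if positives:
        return "positive"
    if negatives:
        return "negative"
    return "neutral"


examples = [
    "I like that Corvette.",           # genuinely positive
    "That looks like a Corvette.",     # should be neutral
    "I'm craving McDonald's so bad.",  # should not be negative
]
for sentence in examples:
    print(f"{naive_sentiment(sentence)}: {sentence}")
```

The scorer labels the second sentence positive and the third negative, because it only sees the words “like” and “bad” and knows nothing about how they are used.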
The structure of a solid sentiment test
When validating a sentiment analysis system, the testing methodology is crucial. The data source, cleanliness of language, how it is scored, subject matter and volume of data tested are all significant variables that can dramatically affect results.
For an optimal test, the data source should closely match the intended uses. For example, if your intended application is analysis of online dialog, the data used to test system accuracy should also be sourced from online dialog.
Volume of data tested is also important, and a general rule of thumb here is “the more data the better the test”.
If you’re looking to score millions of documents at a time, wouldn’t you want to know how well a system does this?
One vendor of a social monitoring platform claims the highest accuracy, but the test was based on 200 posts. What kind of sample size is that?
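One rough way to see why 200 posts is thin is to estimate the margin of error around a measured accuracy. The sketch below uses the standard normal approximation for a proportion and assumes the test posts were sampled at random. The 90% accuracy figure is a hypothetical placeholder, not a number taken from any vendor, and `margin_of_error` is a helper written for this illustration.

```python
import math


def margin_of_error(observed_accuracy: float, sample_size: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a proportion (normal approximation)."""
    return z * math.sqrt(observed_accuracy * (1 - observed_accuracy) / sample_size)


# 0.90 is a hypothetical claimed accuracy, used only to show the trend.
for n in (200, 2_000, 20_000):
    print(f"n={n:>6}: +/- {margin_of_error(0.90, n):.1%}")
```

Under these assumptions, a score measured on 200 posts carries an uncertainty of roughly four percentage points either way, and the uncertainty shrinks as the sample grows.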
What to look for in an accuracy score
There are actually three very important numbers that go into determining how well a sentiment analysis system works.
Unfortunately, many in the industry are focused on one single metric: precision, often referred to as accuracy. While certainly important, this measure alone does not tell us anywhere close to the whole story.
Another metric, known as recall, is equally important to the understanding of how these systems perform. Finally, there is F-Score or F-Measure, which is a more holistic account of overall performance.
Let’s look at how these are defined.
Precision (accuracy): A measure of how often a sentiment rating was correct. Of the documents that the system rated as having tonality, precision tracks how many were rated correctly.
Recall: A measure of how many documents with sentiment were rated as sentimental. This could be seen as how accurately the system determines neutrality. Generally, high recall scores are very difficult to achieve in tests of broad subject matter, as the system is required to understand ever-larger sets of words and language.
F1 Score: Also called F-Score or F-Measure, this is a combination of precision and recall. The score ranges from 0.0 to 1.0, where 1.0 would be perfect. The F1 Score is very helpful, as it gives us a single metric that rates a system by both precision and recall.
As such, it is commonly used amongst experts and researchers in the linguistics and natural language processing fields to simply describe the performance of such systems.
The formula for calculating F1 Score is: F1 = 2 × (precision × recall) / (precision + recall)
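As a small sketch, the F1 formula can be written as a Python function. The name `f1_score` and the choice to return 0.0 when both inputs are zero are assumptions made for this illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Combine precision and recall into a single score between 0.0 and 1.0."""
    if precision + recall == 0:
        return 0.0  # Convention assumed here to avoid dividing by zero.
    return 2 * (precision * recall) / (precision + recall)


# A balanced system versus one with perfect precision but weak recall.
print(round(f1_score(0.90, 0.90), 3))
print(round(f1_score(1.00, 0.44), 3))
```

Because F1 is a harmonic mean, a weak recall drags the score down sharply even when precision is perfect.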
Why recall is critical for understanding system performance
Suppose we have 100 documents discussing a bank. Of these documents, 10 are neutral, making statements such as, “I just went to the bank.” Another 40 are positive comments about the bank, and the last 50 are all negative comments specifically mentioning fraud.
Now imagine we were to analyze this dataset with a system which does not understand fraud as being negative. It may correctly score all 40 positive comments, and mark the 50 fraud comments and 10 neutral comments as neutral.
In this case, of the 40 comments the system rated, it got all 40 correct, so it would have a theoretical accuracy of 100%. However, it didn’t rate any of the 50 comments on fraud. So of the 90 sentimental comments, only the 40 positive comments were rated, giving a recall score of 44% (40/90).
In this example, the system may have a very high accuracy rating, but without knowing its recall, we cannot comfortably trust the results.
Now consider the impact on the reported sentiment breakdown: the system would say the data is 40% positive, 0% negative, and 60% neutral. This would be very misleading, as the true breakdown is 40% positive, 50% negative and 10% neutral. Quite a difference!
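The bank example can be reproduced in a few lines of Python. This is a sketch that follows the definitions used in this article: precision is computed over the documents the system rated as having sentiment, and recall over the documents that truly carry sentiment. The label lists are the hypothetical 100 documents from the example, not real data.

```python
from collections import Counter

# Gold labels for the 100 hypothetical bank documents.
gold = ["neutral"] * 10 + ["positive"] * 40 + ["negative"] * 50
# A system that does not understand fraud marks those 50 documents as neutral.
predicted = ["neutral"] * 10 + ["positive"] * 40 + ["neutral"] * 50

pairs = list(zip(gold, predicted))

# Precision: of the documents rated as having sentiment, how many were correct?
rated = [(g, p) for g, p in pairs if p != "neutral"]
precision = sum(g == p for g, p in rated) / len(rated)

# Recall (as defined in this article): of the documents that truly carry
# sentiment, how many did the system rate as having sentiment?
sentimental = [(g, p) for g, p in pairs if g != "neutral"]
recall = sum(p != "neutral" for g, p in sentimental) / len(sentimental)

f1 = 2 * (precision * recall) / (precision + recall)

print(f"precision={precision:.0%} recall={recall:.0%} f1={f1:.2f}")
print("reported:", dict(Counter(predicted)))
print("actual:  ", dict(Counter(gold)))
```

The perfect precision hides the fact that more than half of the sentimental documents were never rated, which is exactly what the recall and F1 figures expose.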
Interestingly, recall and accuracy are often at odds with each other, as attempts to boost recall often negatively impact accuracy and vice versa.
How to spot ‘bad’ sentiment analysis
A quick glance through individual posts may give you a rough idea of the effectiveness of a sentiment engine. However, with the likelihood that you’ll be using this system to score millions of posts, this method is less than ideal and often doesn’t go far enough.
Another easy way to spot ineffective sentiment analysis is to look at the distribution of positive, negative, mixed and neutral scores.
If you filter specifically to social networks, remove news sources, and then run a search for a subject that common sense says is discussed very negatively, for instance “gm AND (recall OR recalls)”, neutral content should not account for more than 90% of the total.
Although general statements about a subject that carry no sentimental context are far more common than not, such a high share of neutrally scored content for an emotionally charged subject is often a sign of poor system recall.
The system is categorizing as neutral any post it cannot classify as positive or negative, which significantly limits your sample and decreases the validity of your results.
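This distribution check is easy to automate once you have exported the scores for a search. The sketch below assumes the labels are available as a plain list of strings; the function names and the sample data are hypothetical, and the 90% threshold is the rule of thumb mentioned above rather than a hard limit.

```python
from collections import Counter


def neutral_share(labels: list[str]) -> float:
    """Return the fraction of scored posts that were labeled neutral."""
    if not labels:
        raise ValueError("no labels to summarize")
    return Counter(labels)["neutral"] / len(labels)


def looks_like_low_recall(labels: list[str], threshold: float = 0.90) -> bool:
    """Flag a result set whose neutral share exceeds the threshold."""
    return neutral_share(labels) > threshold


# Hypothetical scores returned for an emotionally charged search.
scores = ["neutral"] * 93 + ["negative"] * 5 + ["positive"] * 2
print(f"neutral share: {neutral_share(scores):.0%}")
print("possible low recall:", looks_like_low_recall(scores))
```

A flag from a check like this is only a prompt to read a sample of the neutral posts by hand; it does not prove that the system is wrong.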
Sentiment analysis is just one part of a social media monitoring platform or natural language processing system, and while it shouldn’t be the only thing you consider, accuracy and recall are critical elements to the results you will get.
A system with low accuracy won't provide results that are valuable or that you can trust, and a system with low recall misses a great deal of the data you want to analyze, which also leaves you with results that are not usable.
- Look for the two critical measures of precision (accuracy) and recall, even better if there is an F1 score.
- Check the sample size of the test that was run. Is it large enough to feel confident about the findings? Remember, the larger the sample set, the better.
- Ask for information about the types of documents that were scored and what criteria were used to determine scoring. Does the data analyzed for the test match the data commonly processed by the system?
- Look for the subject matter used to test the system. Was the test run on a single subject or multiple subjects?
- Test the system for yourself. What does the distribution of neutral content look like? Is the system scoring neutral content correctly?