Nuno Guimarães is a PhD student at the Faculty of Sciences, University of Porto. His research currently focuses on using data mining to help detect fake news on social media. His advisors are Prof. Álvaro Figueira (DCC-FCUP/UPorto) and Prof. Luís Torgo.

MSc Thesis

Title: Lexicon Expansion System for Domain and Time Oriented Sentiment Analysis

Supervisor: Luís Torgo; Co-supervisor: Álvaro Figueira

MSc in Computer Science, FCUP-U.Porto

Finished: Nov/2016

Abstract

In sentiment analysis, the polarity of a text is often assessed using sentiment lexicons, which usually consist of verbs and adjectives with an associated positive or negative value. Research has focused on these particular parts of speech. However, in short informal texts like tweets or web comments, the absence of such words does not necessarily indicate that the text lacks opinion. Tweets like “First Paris, now Brussels… What can we do?” imply an opinion without using words included in sentiment lexicons; the opinion comes instead from the general sentiment or public opinion associated with terms in a specific time and domain.
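As an illustration of this limitation, here is a minimal sketch of lexicon-based polarity scoring. The toy lexicon and the function name `lexicon_polarity` are assumptions made for this example and are not taken from the thesis; real sentiment lexicons are much larger and usually weighted.

```python
import re

# Hypothetical toy lexicon (illustrative only): word -> polarity value.
TOY_LEXICON = {
    "good": 1.0, "great": 1.0, "love": 1.0,
    "bad": -1.0, "terrible": -1.0, "hate": -1.0,
}

def lexicon_polarity(text, lexicon=TOY_LEXICON):
    """Average the polarity of the lexicon words found in the text.

    Returns 0.0 (neutral) when no lexicon word occurs.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    scores = [lexicon[t] for t in tokens if t in lexicon]
    if not scores:
        return 0.0
    return sum(scores) / len(scores)
```

With this sketch, the tweet “First Paris, now Brussels… What can we do?” contains no lexicon entry and is scored as neutral, even though a human reader perceives a clear opinion.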

To complement general sentiment dictionaries, we propose a novel system for lexicon expansion that automatically extracts the most relevant and up-to-date terms in several different domains and then assesses their sentiment through Twitter.
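One way to picture the idea of assessing a term's sentiment through Twitter is sketched below. This is an assumption-laden illustration, not the procedure used in the thesis: the base lexicon, the helper names `tweet_score` and `expand_lexicon`, and the choice of averaging tweet scores are all hypothetical.

```python
import re

# Hypothetical general-purpose lexicon (illustrative only).
BASE_LEXICON = {
    "good": 1.0, "hope": 1.0, "safe": 1.0,
    "sad": -1.0, "attack": -1.0, "horrible": -1.0,
}

def tweet_score(tweet, lexicon):
    """Mean polarity of the lexicon words in a tweet (0.0 if none occur)."""
    tokens = re.findall(r"[a-z']+", tweet.lower())
    hits = [lexicon[t] for t in tokens if t in lexicon]
    return sum(hits) / len(hits) if hits else 0.0

def expand_lexicon(terms, tweets, lexicon=BASE_LEXICON):
    """Assign each candidate term the average score of the tweets mentioning it.

    Terms that appear in no tweet are left out of the result.
    """
    expanded = {}
    for term in terms:
        mentioning = [t for t in tweets if term.lower() in t.lower()]
        if mentioning:
            total = sum(tweet_score(t, lexicon) for t in mentioning)
            expanded[term] = total / len(mentioning)
    return expanded
```

In this sketch a term such as "Brussels" would inherit a negative value if the tweets mentioning it are predominantly negative at that point in time, which is the kind of domain- and time-dependent polarity the system aims to capture.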

Experimental results on our system show 90% accuracy in extracting domain- and time-specific terms and 80% accuracy in polarity assessment. In addition, we analysed the sentiment dynamics and “trending” factor of a sample of terms that were frequently mentioned in the news. It was not possible to establish an association between a term's polarity change and its trend. However, the variation in the terms' sentiment appears to be as expected over the time interval analysed. The achieved results provide evidence that our lexicon expansion system can extract and determine the sentiment of terms for domain- and time-specific corpora fully automatically.

However, some flaws were detected during the evaluation. Namely, the large number of terms evaluated with a neutral score provided evidence that terms appearing in the news may not always have a positive or negative value. Therefore, a three-class ensemble sentiment method was included in the workflow of our system. Evaluation using tweet datasets from different domains showed that our ensemble system ENS17 outperformed 19 other sentiment analysis methods. In addition, a tweet domain disambiguation process was included for the specific cases where the same term appears in more than one domain.
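The general idea of a three-class ensemble can be sketched with a simple majority vote over the labels produced by several classifiers. This is only an assumed combination rule for illustration: the abstract does not state how ENS17 combines its members, and the function name `majority_vote` and the tie-breaking choice are hypothetical.

```python
from collections import Counter

def majority_vote(labels, tie_label="neutral"):
    """Combine labels ("positive", "negative", "neutral") by majority vote.

    Falls back to tie_label when there are no votes or the top two
    labels are tied (an assumption made for this sketch).
    """
    if not labels:
        return tie_label
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_label
    return counts[0][0]
```

For example, `majority_vote(["positive", "positive", "neutral"])` yields "positive", while an even split between "positive" and "negative" falls back to "neutral".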

Preliminary evaluations were made by adding the resulting lexicons to state-of-the-art sentiment systems and testing them on a dataset containing tweets and Facebook posts and comments. The results show that our expanded lexicons cannot be used directly on texts retrieved from the web (namely factual or news texts). However, they improve the sentiment classification of all three methods on opinion texts, providing evidence of their usefulness for sentiment classification.

Interests

  • Text mining
  • Fake news
  • Social media analysis