Nuno Moniz is a PostDoc Researcher at the Laboratory of Artificial Intelligence and Decision Support (LIAAD - INESC TEC), and an Invited Professor at the Faculty of Sciences of the University of Porto (FCUP). He successfully defended his Ph.D. at FCUP in 2017, under the supervision of Prof. Luís Torgo. His work was fully funded by a scholarship awarded by FCT (Portuguese Foundation for Science and Technology), and his final dissertation received an award in the Fraunhofer Portugal Challenge 2017.

More information about Nuno on his Home Page

PhD Thesis

Title: Prediction and Ranking of Highly Popular Web Content

Supervisor: Luís Torgo

PhD in Computer Science

Finished: Jul/2017

Abstract

This thesis addresses prediction and ranking tasks using web content data. The main objective is to improve the ability to accurately predict and rank recent and highly popular content, thus enabling faster and more precise recommendation of such items. The main motivation is the profusion of online content and the growing demand from users for fast and easy access to relevant content.

To fulfill these tasks, an extensive review of previous work is carried out to define the state of the art and to identify important research opportunities. As a result, three problems are identified and addressed in this thesis: (1) the lack of an interpretable and robust evaluation framework to correctly assess web content popularity prediction models focusing on highly popular web content; (2) issues concerning proposals of popularity prediction models and their ability to predict the rare cases of highly popular content; and (3) the need for recommendation frameworks concerning such items using multi-source data. For each of these problems, novel solutions are proposed and extensively evaluated in comparison to existing work.

The first problem (1) concerns the evaluation methods commonly used in web content popularity prediction tasks. According to previous work, the popularity of web content is best described by a heavy-tail distribution. As such, at any given moment, most of the content under analysis has a low level of popularity, and a small set of cases has high levels of popularity. Standard evaluation metrics focus on the average behaviour of the data, assuming that each case is equally relevant. Given the predictive focus on highly popular content, it is argued that such an assumption may lead to an over-estimation of the models’ predictive accuracy. Therefore, an evaluation framework is proposed, allowing for a robust interpretation of the prediction models’ ability to accurately forecast highly popular web content.
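The over-estimation argument can be illustrated with a small sketch. This is not the evaluation framework proposed in the thesis; it only assumes synthetic Pareto-distributed popularity values, a naive model that predicts the median for every item, and a hypothetical "top 1%" definition of highly popular content. The function names are illustrative.

```python
import random


def mean_absolute_error(actual, predicted):
    """Average absolute error over all cases, each weighted equally."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def tail_mean_absolute_error(actual, predicted, threshold):
    """Average absolute error restricted to cases with actual >= threshold."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a >= threshold]
    return sum(abs(a - p) for a, p in pairs) / len(pairs)


# Assumption: synthetic heavy-tailed popularity values, for illustration only.
rng = random.Random(0)
actual = [rng.paretovariate(1.5) for _ in range(10_000)]

# Assumption: a naive model that predicts the median popularity everywhere.
ordered = sorted(actual)
median = ordered[len(ordered) // 2]
predicted = [median] * len(actual)

# Assumption: "highly popular" means the top 1% of items.
threshold = ordered[int(0.99 * len(ordered))]

overall = mean_absolute_error(actual, predicted)
tail = tail_mean_absolute_error(actual, predicted, threshold)
print(f"overall MAE: {overall:.2f}, MAE on top 1%: {tail:.2f}")
```

Under these assumptions the error averaged over all items is much smaller than the error on the rare, highly popular items, which is the kind of gap an average-based metric hides.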

The second problem (2) is related to the fact that proposals concerning web content popularity prediction models are based on standard learning approaches. These are commonly biased towards capturing the dynamics of the majority of cases. Given the skewness of web content popularity data, this may lead to poor accuracy towards under-represented cases of highly popular items. An evaluation with a diverse set of such proposals is carried out, confirming their issues when learning to predict such items. It is also confirmed that standard evaluation metrics often over-estimate the ability to accurately predict the most popular items. Novel approaches are proposed for the prediction of web content popularity focusing on accuracy towards highly popular items.

The third and final problem (3) concerns the task of ranking, but also evaluating, web content by its predicted popularity. Although the task of ranking may be trivial in most cases, when considering scenarios with multiple sources of data such a task is considerably difficult. Notwithstanding, ranking tasks and their evaluation in single-source scenarios are not exempt from issues concerning the ability to account for highly popular content. The ability to rank web content based on models’ predictions is discussed, given an extensive evaluation in both single-source and multi-source scenarios.
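As a rough illustration of what evaluating a ranking by predicted popularity can look like, the sketch below computes precision at k: the share of the k items ranked highest by predicted popularity that are also among the k truly most popular. The abstract does not name the ranking metrics used in the thesis, so this metric, the function names, and the toy values are all assumptions.

```python
def top_k_ids(scores, k):
    """Return the ids of the k highest-scoring items in a {id: score} dict."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


def precision_at_k(actual, predicted, k):
    """Fraction of the predicted top-k that is also in the actual top-k."""
    return len(top_k_ids(actual, k) & top_k_ids(predicted, k)) / k


# Assumption: toy popularity counts for five hypothetical news items.
actual = {"a": 900, "b": 15, "c": 7, "d": 300, "e": 2}
predicted = {"a": 120, "b": 40, "c": 5, "d": 30, "e": 3}

print(precision_at_k(actual, predicted, k=2))
```

Here the model ranks item "a" correctly but places "b" ahead of the far more popular "d", so only one of its top two items is truly in the top two.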

Each of these problems is evaluated using real-world data concerning online news feeds from both official and social media sources. This type of web content provides a difficult setting for the early and accurate prediction of highly popular items, given their short lifespan. Experimental evaluations show that the approaches proposed in this thesis for the prediction and ranking of highly popular content obtained encouraging results, demonstrating a significant advantage in comparison to state-of-the-art work.

Interests

  • Imbalanced Learning
  • Secure Machine Learning
  • Green Machine Learning
  • AutoML

Latest