Rita Ribeiro is an Assistant Professor in the Computer Science Department of the Faculty of Sciences of the University of Porto and a Researcher at LIAAD, the Artificial Intelligence and Decision Support Lab at INESC TEC.

Currently, she is the Course Director of the MSc in Data Science at the Faculty of Sciences of the University of Porto.

More information about Rita on her Home Page

PhD Thesis

Title: Utility-based Regression

Supervisor: Luis Torgo

PhD in Computer Science

Finished: Sep/2011

Abstract

In many real-world data mining prediction tasks, such as forecasting extremely profitable stock trading actions, detecting fraud in credit card transactions, or anticipating ecological or meteorological catastrophes, the phenomenon to predict is described by a specific range of values.

Given the nature of these applications, the forecasted values may trigger alarms, preventive measures, and other actions that inevitably involve some costs. Once the particularities of these applications are understood, it becomes clear that the target variable does not have uniform relevance throughout its domain.

Utility-based mining is a relatively recent approach to learning problems that involve costs and/or benefits. It has emerged from the already known cost-sensitive learning as the closest approach to the representation of real-world problems. Most of the work conducted in this area is related to classification problems. However, we claim that in some of these applications the phenomenon to predict is, in its essence, numeric. This means that the target variable is continuous and, therefore, the same type of problems can appear in regression.

Many real-world applications meet the conditions mentioned above; financial prediction in the stock market is one such example. There, the most relevant values correspond to the most extreme and rare price changes. These subsets of values are the ones that can potentially give us the most helpful information.

In this thesis, we study the general problem of regression based on utility. More conventional regression techniques assume that the relevance of values throughout the domain of the target variable is uniform and that the magnitude of the prediction error is the only factor related to cost. Throughout this study we show that an evaluation based on utilities is more suitable when dealing with regression tasks where the target variable does not have uniform relevance. Even though the importance of a phenomenon is often related to its rarity, that might not always be the case. For this reason, our utility approach is based on any continuous relevance function defined over the target variable. We propose a new evaluation methodology that assesses the utility (cost / benefit) of a prediction for a given value of the target variable, based on the prediction error and on the relevance of both predicted and true values.
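As a rough illustration of why relevance matters in evaluation, the Python sketch below weights each absolute error by a relevance value. Everything in it is an assumption made for illustration and not the formulation proposed in the thesis: the sigmoid-shaped `relevance` function, its `threshold` and `steepness` parameters, and the `relevance_weighted_error` score, which uses the larger of the relevances of the true and predicted values as the weight.

```python
import math

def relevance(y, threshold=10.0, steepness=1.0):
    """Hypothetical continuous relevance function mapping a target value to (0, 1).

    Values well above `threshold` are treated as highly relevant.
    """
    return 1.0 / (1.0 + math.exp(-steepness * (y - threshold)))

def relevance_weighted_error(y_true, y_pred, rel=relevance):
    """Illustrative score: absolute error weighted by relevance.

    The weight of each case is the larger of the relevances of the true
    and the predicted value, so both missed and false extremes count.
    """
    total = 0.0
    weight_sum = 0.0
    for t, p in zip(y_true, y_pred):
        w = max(rel(t), rel(p))
        total += w * abs(t - p)
        weight_sum += w
    return total / weight_sum if weight_sum > 0 else 0.0

y_true = [0.0, 20.0]
# Same absolute error (5.0) in both cases, so mean absolute error is identical.
error_on_common_value = relevance_weighted_error(y_true, [5.0, 20.0])
error_on_extreme_value = relevance_weighted_error(y_true, [0.0, 15.0])
print(error_on_common_value, error_on_extreme_value)
```

Under these assumptions, the two sets of predictions have the same mean absolute error, yet the one that misses the extreme value receives a much larger score.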

When the goal is to predict a rare value, there are typically crucial decisions associated with those predictions (e.g. trading actions, triggering different types of alarms). In such a context, it is important to evaluate the models on the values that really matter, i.e. those that describe the target rare event. Moreover, it may be important to evaluate the models from a ranking perspective towards the target event. For this reason, we also derive utility-based evaluation metrics better designed to cope with rarity (e.g. precision, recall, Precision-Recall curves). We illustrate the advantage of using such metrics in the context of two real-world applications.
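To make the notion of precision and recall in a regression setting more concrete, here is a deliberately simplified Python sketch. It is not the utility-based definition derived in the thesis: it assumes a hypothetical piecewise-linear relevance function `rel` and a hypothetical cut-off `min_relevance`, and simply treats any value whose relevance reaches the cut-off as an "event".

```python
def rel(y):
    """Hypothetical relevance: 0 below 5, rising linearly to 1 at 15."""
    return min(1.0, max(0.0, (y - 5.0) / 10.0))

def event_precision_recall(y_true, y_pred, relevance_fn=rel, min_relevance=0.5):
    """Crisp precision and recall over values flagged as relevant events."""
    true_events = [relevance_fn(t) >= min_relevance for t in y_true]
    pred_events = [relevance_fn(p) >= min_relevance for p in y_pred]
    hits = sum(1 for t, p in zip(true_events, pred_events) if t and p)
    n_pred = sum(pred_events)  # cases the model signalled as events
    n_true = sum(true_events)  # cases that really were events
    precision = hits / n_pred if n_pred else 0.0
    recall = hits / n_true if n_true else 0.0
    return precision, recall

y_true = [1.0, 12.0, 18.0, 3.0]
y_pred = [2.0, 11.0, 7.0, 14.0]
precision, recall = event_precision_recall(y_true, y_pred)
print(precision, recall)  # 0.5 0.5
```

In this toy example the model signals two events, one of them correctly, and misses one of the two true events. A crisp cut-off like this discards how close a prediction was to the true value, which is the kind of information a utility-based formulation is meant to retain.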

Finally, we propose ubaRules, a regression rules ensemble system that incorporates a measure based on utility (derived from the proposed methodology) as a preference criterion in the creation of models that satisfy the applications’ requirements: we intend to derive precise and interpretable regression models.

Interests

  • Utility-based Learning
  • Rare events
  • Imbalanced distributions

Latest