oRanki

Project ID Card

  • Principal Investigator: Luis Torgo
  • Hosting Institution: LIAAD, INESC, Portugal
  • Funding Agency: FCT (Portuguese Science Foundation)
  • Project Duration: 2008 – 2011
  • Budget: 48.5 kEur

Project Summary

This project aims to develop methods and tools that help detect rare cases in application domains constrained by limited human resources. The application that drives the project is a current problem of the Portuguese Institute of Statistics (INE), and the project will be carried out in close collaboration with INE experts as a follow-up to previous collaborative work. All Portuguese companies must report their transactions with foreign countries to this institute every month. Some of the transactions contain errors (e.g., an incorrect product id or transaction amount) that may strongly affect the statistical indicators produced by INE, so finding and correcting them has a strong political and economic impact. As such, transactions have to be manually inspected to find as many errors as possible. Because the human resources available are limited (and vary over time), not all transactions can be inspected. In this context, the ability to find most of the errors by inspecting a small subset of the transactions is of major relevance for INE. The main objective of this project is to develop methods and tools that support and guide this task.

The task we are addressing can be cast as the general problem of outlier detection. In effect, errors in transactions can only be expected to be detectable if they somehow deviate from what is common for the products being traded. The particularity of this outlier detection task lies in the fact that the output of the tools should be flexible enough to cope with the variable amount of human resources that can be allocated to the subsequent inspection task. In other words, the main success criterion is how many errors are detected given the currently available resources for inspection. Moreover, any method that does not achieve a high error detection rate is unacceptable to INE, given the impact and costs associated with erroneous transactions.
In this context, standard outlier detection tools that provide binary answers (a case is or is not an outlier) are inadequate for this application. In effect, if the available human resources do not allow the inspection of all transactions flagged as outliers, the choice of which ones to inspect is arbitrary and thus the results may be sub-optimal. Instead, we envisage tools that provide a ranking of the “outlierness” of the transactions that are candidates for inspection, thus allowing a more informed use of the available human resources. Though not the mainstream of outlier detection research, algorithms already exist that are able to produce outlierness degree values. However, most of them are computationally expensive and were developed specifically for this purpose.

The work to be carried out in this project adds to the current state of the art in outlier ranking in three main directions: i) instead of special-purpose methods, we will adapt and use standard hierarchical clustering methods for these tasks; ii) we will incorporate information on previously inspected transactions, namely information on the errors that were found; iii) we will propose a flexible framework that allows the use of different ranking criteria for producing the inspection priority list.

Regarding the first direction, the tools to be developed will produce rankings of outlierness using information generated during the clustering process of hierarchical clustering algorithms, thus avoiding extra computational costs. Initial experiments already carried out in this research direction have shown that these approaches have good potential. The project consortium plans to explore this direction in depth, as well as other alternatives involving existing clustering methods.

The second research direction is motivated by the existence of information on the results of previous inspections. This extra information is only available for a small subset of cases. However, it is relevant and should therefore be incorporated, for instance through a semi-supervised clustering approach. Alternatively, this information can be used within a semi-supervised classification setup, where the unlabeled data are used as a complement to improve this type of model.

Finally, the third direction stems from the fact that INE is interested in ranking the transactions not only by their probability of being an error, but also by the impact of the errors on foreign trade statistics. In effect, not all errors are equally important for INE.
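
As an illustration of the first direction, the sketch below derives an outlierness ranking from the merges of a naive single-linkage agglomerative clustering. The scoring rule (a case absorbed as part of a small group into a much larger one receives a high score) is an assumption made for this example and should not be read as the project's actual method; the function name `rank_by_outlierness` and the unit prices are hypothetical.

```python
import math
from itertools import combinations


def rank_by_outlierness(points):
    """Return (ranking, scores) from a naive single-linkage agglomerative clustering.

    Illustrative assumption: when a small group is merged with a larger one,
    each member of the small group gets the score
    (|large| - |small|) / (|large| + |small|); a case keeps its maximum score.
    """
    clusters = {i: [i] for i in range(len(points))}
    scores = [0.0] * len(points)

    def linkage(a, b):
        # Single linkage: distance between the closest members of two clusters.
        return min(math.dist(points[i], points[j])
                   for i in clusters[a] for j in clusters[b])

    while len(clusters) > 1:
        # Merge the two closest clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        small, large = sorted((clusters[a], clusters[b]), key=len)
        merge_score = (len(large) - len(small)) / (len(large) + len(small))
        for i in small:
            scores[i] = max(scores[i], merge_score)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    # Highest score first: these cases would be inspected first.
    ranking = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return ranking, scores


# Hypothetical unit prices reported for one product; the last one is far off.
unit_prices = [(0.0,), (1.0,), (3.0,), (6.0,), (50.0,)]
ranking, scores = rank_by_outlierness(unit_prices)
```

The scores come directly from the merge sequence, so no separate outlier detection pass is needed. The clustering loop itself is written for clarity rather than speed and would not scale to real transaction volumes.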

The experimental validation of the methods to be developed is of key importance before they can be used at INE, given the nature and importance of the application. In this context, all methods will be evaluated using real transaction data provided by INE, and the results will be analyzed and critiqued in collaboration with INE experts.
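
One way to quantify the success criterion described in the summary is the fraction of known errors recovered when only the top-ranked transactions are inspected. The sketch below assumes a ranking of transaction ids and a set of ids confirmed as errors; the function name `recall_at_budget` and the ids are hypothetical, and the evaluation protocol actually used with INE data is not specified here.

```python
def recall_at_budget(ranking, error_ids, budget):
    """Fraction of known errors found by inspecting the first `budget` ranked cases."""
    errors = set(error_ids)
    if not errors:
        raise ValueError("at least one known error is required")
    if budget < 0:
        raise ValueError("budget must be non-negative")
    inspected = set(ranking[:budget])
    return len(inspected & errors) / len(errors)


# Hypothetical ranking of transaction ids (most suspicious first) and known errors.
example_ranking = [7, 2, 5, 1, 3]
known_errors = {2, 3}
half_found = recall_at_budget(example_ranking, known_errors, budget=2)
```

Evaluating the same ranking at several budgets shows how the detection rate changes as the inspection effort available varies.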

Project Goals

The main objective of the project is to develop methods and tools that can be used to optimize the results obtained with a limited amount of human resources in the task of detecting errors in foreign trade transaction forms. This is a current problem at the Portuguese Institute of Statistics (INE), and optimizing these results is of key importance given the political and economic impact of the statistical indicators produced from the information contained in these forms.

The project aims at:

  • developing methods and tools that provide rankings of outlierness for a set of transactions, which can be used to decide upon the transactions to be inspected,

  • developing methods which rank transactions according to a combination of outlierness and criteria that estimate the effect of the errors on foreign trade statistics,

  • comparing the methods to existing alternatives for this type of tasks, including the strategy currently used at INE,

  • testing the methods on other related problems.
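
The second goal leaves open how outlierness and impact are combined. As a minimal sketch, assuming outlierness can be read as an approximate probability of error and impact as a non-negative estimate of the effect on the statistics, the product of the two gives an expected-impact priority. The function name `inspection_priority` and the numbers are hypothetical; other combination rules would fit the same framework.

```python
def inspection_priority(outlierness, impact):
    """Rank transaction indices by outlierness * impact, highest first.

    Assumption for illustration: outlierness is in [0, 1] and impact is a
    non-negative estimate of how much an error would distort the statistics.
    """
    if len(outlierness) != len(impact):
        raise ValueError("outlierness and impact must have the same length")
    priority = [o * v for o, v in zip(outlierness, impact)]
    return sorted(range(len(priority)), key=lambda i: priority[i], reverse=True)


# Hypothetical values: the most anomalous transaction is not the most costly one.
order = inspection_priority([0.9, 0.5, 0.1], [100.0, 1000.0, 200.0])
```

With these values the second transaction is inspected first, because a moderately suspicious but high-value transaction outweighs a very suspicious low-value one.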

Luis Torgo
Canada Research Chair and Professor
