Carlos Leite

Data Scientist

Facebook

Carlos Leite is a Data Scientist with a background in Recommender Systems and Search. He has experience in the research, development, deployment, and monitoring of multiple Artificial Intelligence applications.

Carlos also has experience analyzing A/B test data and supporting experiment decision-making through statistical testing.

MSc Thesis

Title: Domain Oriented Biclustering Validation

Supervisor: Luis Torgo; Co-supervisor: Catarina Magalhaes

MSc in Computer Science, FCUP-U. Porto

Finished: Nov/2016

Abstract

Clustering is a traditional data mining task that consists of partitioning a set of data objects into subsets in such a way that the objects within the same subset (cluster) are similar to one another and dissimilar to objects in other subsets. In this formulation, the objects in the same cluster are similar with respect to all the attributes (or features) that describe them. However, there are other formulations of the clustering problem. For instance, one may be interested in finding groups of objects that share a similar pattern in only some of the attributes. Biclustering techniques cluster rows and columns simultaneously in order to find those groups.

The motivation for the work presented in this dissertation comes from the application of biclustering techniques to the metagenomic dataset generated from the worldwide 2014 Ocean Sampling Day event. Since microbial activity is a fundamental component of the ocean’s biogeochemical cycles, we tried to find geographic niches of certain microbial functions through the application of biclustering techniques. The problem here is how to determine the relevance of a bicluster from a biological and geographical point of view.

We propose a general methodology that evaluates a bicluster by considering both the relevance of its rows and the relevance of its columns. Each relevance is computed from a set of indexes defined according to the application domain.

In our case study, the relevance of the rows corresponds to the biological relevance, since the rows of the metagenomic dataset represent the microbial functions. On the other hand, the relevance of the columns corresponds to the geographical relevance, since the columns of the metagenomic dataset represent sampling sites.
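As a rough illustration of the idea described above, and not the implementation used in the thesis, the sketch below scores a bicluster by applying one set of indexes to its rows and another to its columns. Everything in it is an assumption made for the example: the function names, the two toy indexes (`fraction_in_set`, `dominant_group_share`), the use of a plain mean to combine indexes, and the requirement that the row and column lists are non-empty.

```python
from collections import Counter
from statistics import mean
from typing import Callable, Dict, List, Sequence, Set

# An index maps the labels of one bicluster dimension to a score in [0, 1].
Index = Callable[[Sequence[str]], float]


def fraction_in_set(interesting: Set[str]) -> Index:
    """Hypothetical row index: share of labels that belong to a set of interest."""
    def index(labels: Sequence[str]) -> float:
        return sum(label in interesting for label in labels) / len(labels)
    return index


def dominant_group_share(group_of: Dict[str, str]) -> Index:
    """Hypothetical column index: share of labels falling in the most common group."""
    def index(labels: Sequence[str]) -> float:
        counts = Counter(group_of[label] for label in labels)
        return counts.most_common(1)[0][1] / len(labels)
    return index


def relevance(labels: Sequence[str], indexes: List[Index]) -> float:
    """Combine several indexes for one dimension (assumed here: a plain mean)."""
    return mean(index(labels) for index in indexes)


def evaluate_bicluster(
    rows: Sequence[str],
    cols: Sequence[str],
    row_indexes: List[Index],
    col_indexes: List[Index],
) -> Dict[str, float]:
    """Report row relevance and column relevance of a bicluster separately."""
    return {
        "row_relevance": relevance(rows, row_indexes),
        "column_relevance": relevance(cols, col_indexes),
    }
```

In the case study's terms, the row indexes would encode biological criteria over microbial functions and the column indexes would encode geographical criteria over sampling sites; the toy indexes here only stand in for such domain-specific definitions.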

We applied our proposed methodology to the case study using ORCA, a web application that we developed. Our methodology allowed us to find biclusters that are meaningful from a biological and geographical point of view. Furthermore, it allowed us to find interesting, previously unknown relationships between key microbial functions (nitrogen biogeochemistry) within different marine ecosystems. Many of the functional interconnectivities identified with our methodology are relevant from a biological point of view.

Interests

  • Bioinformatics
  • Clustering