The problem of efficiently storing and retrieving information is an essential topic in computer science. During the past decades, researchers have been developing techniques to handle large data sets. Back in the late 80s, a text corpus of several hundred megabytes, such as the digital version of the 20-volume Oxford English Dictionary, was considered large data, and the research programs, e.g. the New OED Project, on managing data of this scale attracted many world-class researchers. These programs resulted in many well-known techniques including indexing data structures that facilitate text search, and companies such as OpenText (the largest Canadian software company) were started as spin-offs of these projects.
Recently, as a result of the rapid growth of large data sets, such as text databases and spatial databases, on the World Wide Web (think of Google), and in bioinformatics applications, modern applications often deal with data measured in terabytes or even petabytes. This has led to new issues and challenges in the way data are organized. Many algorithms and data structures designed twenty years ago are no longer feasible for these data sets. In many cases, it is because these data structures occupy too much storage space.
My main research contributions are in the design of efficient algorithms and data structures for large data sets. This involves studies of many algorithmic techniques, including succinct data structures, I/O-efficient algorithms and data structures, implicit data structures and adaptive algorithms.
The following are some of my current research projects.
Succinct data structures were proposed to represent data structures in a space-efficient manner, so that the information in large systems can be retrieved quickly, but the space requirement is little more than that of the raw data. Our work in this project improved previous results on succinct representations of many fundamental data structures. These include strings which are key structures used in the design of many succinct data structures, space-efficient text indexes that facilitate text search, and representations of unlabeled and labeled trees and graphs. These problems are fundamental, and they also have applications in text retrieval systems, XML databases and GIS applications such as Google Maps.
We also started several new research directions in this area, such as designing succinct indexes to achieve data abstraction in succinct data structure design, combining succinct data structures with other algorithmic approaches such as I/O-efficient algorithms and data structures, as well as using succinct data structures to improve the query efficiency of classic data structures.
We designed a collection of space-efficient text indexes that support various types of text search, including substring search and position-restricted substring search. These indexes typically occupy space close to a compressed version of the textual data, and they can answer queries without requiring the raw data. This saves storage space drastically, as standing text indexes such as suffix trees typically occupy space that is 10-20 times the size of the raw data.
Geometric data structures are used extensively in geographic information systems, computer graphics and databases. Thus it is important to improve their space efficiency for large applications. We designed space-efficient data structures that support queries such as range search, majority and point location. We also designed approaches to represent triangular terrains compactly to support queries such as terrain profiles and trickle paths.
Previous work on implementing succinct trees and text indexes has shown that succinct data structures are very useful in practice. Despite such work, there are many theoretical results that have not been evaluated in practice. Thus this project engineers the implementation of succinct data structures.