CSCI 6104: Algorithms and Data Structures for Massive Data Sets
|CSCI 6104 Assignments HomeTeaching Publications Vitae||
This class discusses algorithms and data structures that can be used to efficiently solve fundamental problems on data stes of terabyte and even petabyte size. It starts with a discussion of why standard data structures and algorithms are not suitable for solving problems on data sets of such massive size. This discussion will demonstrate the need to develop algorithms that fully exploit the memory hierarchies of state-of-the-art computers. The main part of this class discusses paradigms, algorithms, and data structures for a two-level hierarchy. These algorithms are efficient for solving problems on massive data sets. We will also discuss the concept of cache obliviousness and the most important paradigms for designing cache-oblivious algorithms. In a nutshell, cache-oblivious algorithms are standard internal-memory algorithms that lay out their data structures and access them in a fashion that helps the standard paging algorithm to perform few page transfers (between virtual memory and main memory, or between main memory and L2 cache, or ...). This leads to a clean model to design algorithms that automatically adapt to all levels of a multi-level memory hierarchy. Specific topics covered in this course include:
The following list of references covers all topics covered in class. References made in class will refer to this list. I may add more references to this list as we cover things in class. The first three references are survey papers, which cover (more than) the topics covered on online data structures, graph algorithms, and cache-oblivious algorithms.Arg02
L. Arge. External-memory geometric data structures. Lecture notes of EEF Summer School on Massive Data Sets, Aarhus, Denmark, 2002.Dem02
E. Demaine. Cache-oblivious algorithms and data structures. Lecture notes of EEF Summer School on Massive Data Sets, Aarhus, Denmark, 2002.Zeh02
N. Zeh. I/O-efficient graph algorihms. Lecture notes of EEF Summer School on Massive Data Sets, Aarhus, Denmark, 2002.
The remaining references are research papers that present the algorithms discussed in class. These papers often provide more detail than the lecture notes but are often more difficult to read.AV88
A. Aggarwal and J.S. Vitter. The input/output complexity of sorting and related problems. Communications of the ACM 31(9):1116–1127, 1988.Arg03
L. Arge. The buffer tree: A technique for designing batched external data structures. Algorithmica 37(1):1–24, 2003.
L. Arge, G.S. Brodal, and L. Toma. On external memory MST, SSSP and multi-way planar graph separation. Journal of Algorithms 53(2):186–206, 2004. (Killam Library online resource)APR+98
L. Arge, O. Procopiuc, S. Ramaswamy, T. Suel, and J.S. Vitter. Theory and practice of I/O-efficient algorithms for multidimensional batched searching problems. In Proceedings of the 9th ACM-SIAM Symposium on Discrete Algorithms, pages 685–694, 1998.ASV99
L. Arge, V. Samoladas, and J.S. Vitter. On two-dimensional indexability and optimal range search indexing. In Proceedings of teh 18th ACM Symposium on Principles of Database Systems, pages 346–357, 1999.
R. Bayer and E. McCreight. Organization of large ordered indexes. Acta Informatica 1:173–189, 1972. (Killam Library Z 699 A1 A3)BF02
G.S. Brodal and R. Fagerberg. Cache-oblivious distribution sweeping. In Proceedings of the 29th International Colloquium on Automata, Languages and Programming, volume 2380 of Lecture Notes in Computer Science, pages 426–438. Springer-Verlag, 2002.BF03
G.S. Brodal and R. Fagerberg. On the limits of cache-obliviousness. In Proceedings of the 35th ACM Symposium on Theory of Computing, pages 307–315, 2003.BFJ02
G.S. Brodal, R. Fagerberg, and R. Jacob. Cache-oblivious search trees via binary trees of small height. In Proceedings of the 13th ACM-SIAM Symposium on Discrete Algorithms, pages 39–48, 2002.CGG+95
Y.-J. Chiang, M.T. Goodrich, E.F. Grove, R. Tamassia, D.E. Vengroff, and J.S. Vitter. External-memory graph algorithms. In Proceedings of the 6th ACM-SIAM Symposium on Discrete Algorithms, pages 139–149, 1995.FLPR99
M. Frigo, C.E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In Proceedings of the 40th IEEE Symposium on Foundations of Computer Science, pages 285–297, 1999.GTVV93
M.T. Goodrich, J.-J. Tsay, D.E. Vengroff, and J.S. Vitter. External-memory computational geometry. In Proceedings of the 34th IEEE Symposium on Foundations of Computer Science, pages 714–723, 1993.
S. Huddleston and K. Mehlhorn. A new data structure for representing sorted lists. Acta informatica 17:157–184, 1982. (Killam Library Z 699 A1 A3)
V. Kumar and E.J. Schwabe. Improved algorithms and data structures for solving graph problems in external memory. In Proceedings of the 8th IEEE Symposium on Parallel and Distributed Processing, pages 169–176, 1996.
A. Maheshwari and N. Zeh. I/O-efficient planar separators. SIAM Journal on Computing 38(3):767–801, 2008. (Killam Library online resource)MM02
K. Mehlhorn and U. Meyer. External-memory breadth-first serach with sublinear I/O. In Proceedings of the 10th European Symposium on Algorithms, volume 2461 of Lecture Notes in Computer Science, pages 723–735. Springer-Verlag, 2002.
K. Munagala and A. Ranade. I/O-complexity of graph algorithms. In Proceedings of the 10th ACM-SIAM Symposium on Discrete Algorithms, pages 687–694, 1999. (ACM Digital Library)Pag03
R. Pagh. Basic external memory data structures. In U. Meyer, P. Sanders, and J. Sibeyn (eds.), Algorithms for Memory Hierarchies, volume 2625 of Lecture Notes in Computer Science, pages 14–35. Springer-Verlag, 2003.
Final grade = 50% * A + 50% * P + ε * C
The assignments will consist of algorithm design questions; that is, they will be of a theoretical nature. The course project can be of three types: (1) A thorough review of the literature on solving a certain problem or class of problems I/O-efficiently/cache-obliviously, including a critical discussion of the advantages and shortcomings of the proposed approaches. (2) An attempt to make progress on a theoretical open problem in the field, including a brief review of the relevant literature. (3) Implementation and experimental evaluation of an I/O-efficient or cache-oblivious algorithm and a write-up that includes a brief discussion of the relevant literature, a discussion of the design choices made in the implementation process, and a discussion of the experimental results.
The ε for class participation should be interpreted as follows: If you are between two grades (e.g., between B+ and A−), I will hold active class participation in your favour to tip the grade towards the better mark. I teach my course in a relatively interactive style, pointing you in the right direction and asking you to develop the right idea to approach a problem. Active class participation means that you make a conscious effort, that you voice your ideas about how to approach a problem. The answers do not have to be correct: massaging your brain is important. If you do nto follow, asking for clarification also counts as class participation.
Collaboration: I encourage you to form study groups to discuss the course material and work on exercises. However, I do not allow any form of collaboration on assignments. Collaboration on assignments will be treated as plagiarism. Sufficiently similar assignments will be handed over to the Faculty's Academic Integrity Officer.
Plagiarism: I will not tolerate any form of plagiarism (copying from your classmates, copying solutions from the web, etc.). According to university regulations, I have to report you to the Faculty's Integrity Officer if I suspect you of academic dishonesty, and I will. The Integrity Officer may refer your case to the Senate Committee on Plagiarism. The penalties for plagiarism can range from failing the course to expulsion from the university. So please save me and yourself the aggravation. For more information, make sure you read the university policy on intellectual honesty and the Dalhousie Academic Integrity Page.
Late assignments: I do not accept late assignments without a doctor's note or any other official document that certifies a good reason why you could not hand your assignment in on time. In case of important events in your personal life, I am willing to make an exception to this rule if you discuss these circumstances with me and apply for a reasonable extension of the deadline in advance.
September 30, 2008
Deadline to choose project topic.
September 8, 2008
The fun begins.