CSCI 6905: Compact Data Structures in Computational Genomics

Hello and welcome to CSCI 6905!

News:

I don't think there's anything urgent to talk about on Thursday but I'll be online in a Webex meeting you can join here. (You should be able to join with just a browser.) Please feel free to drop by, say hi and ask me about the course or anything else you feel like --- it's completely optional.

I hope this course will be interesting to people with a background in algorithms, data structures, bioinformatics OR genetics. If we get people from a mix of backgrounds, we'll have a better chance of solving interesting problems. If you're planning on taking it, please try to attend (virtually) a CGEM seminar next week (Wednesday Jan 13th, 12--1) on "Solutions and Challenges in achieving equity in genetic and genomic health care for Indigenous People of Canada: The Silent Genomes Project"; register (for free) here. It's not mandatory but it should motivate you for the rest of the course.

There's also a workshop on Data Structures in Bioinformatics (DSB '21) coming up in February that you might like to attend. The registration is free but apparently it closes on January 27th.

So, Webex meeting Thursday(optional), CGEM seminar (optional), DSB registration (optional), and please read what's below and start watching Ben's video courses on YouTube. Once you've watched Ben's videos, please look at the first two assignments (and let me know if you have questions).

If you want to watch a rather rambling, hour-long introductory video on YouTube, there's one here, or you can download it here (100 MB). The YouTube and Dropbox links for the classes are (finally) posted on the Slack. If you want to see lectures from a class I taught with Nicola Prezza in Spain a few years ago, they're here.

The New Yorker article about the case of rare-disease diagnosis is here. ("The Illustrated Guide to a PhD" is here. :) Matt Might got involved in bioinformatics (see here) and is now director of the Hugh Kaul Precision Medicine Institute.

Ben's lab has a new paper (here) about choosing a few representative genomes and indexing all of them together with standard tools. (They're not using founder sequences.) Apparently it works better than you might guess from what I was saying in the first class, but I still think including all populations will make the dataset too big for a standard computer (at least for the next few years), and it (probably?) won't deal with rare diseases. It's still worth reading, though (after all, it got into Genome Biology :).

Outline:

I started studying bioinformatics during my PhD when I went to Italy to work with Giovanni Manzini, who'd co-invented a data structure called the FM-index. (A student in a course I taught in 2012 wrote most of that Wikipedia page!) The FM-index is the basis for all popular short-read DNA aligners: if you send a saliva or blood sample away to have your genome sequenced, probably

the sample will be put into a machine that outputs lots of overlapping short reads, which are substrings of your genome of a couple of hundred characters each;
the reads will be aligned against a standard reference genome using approximate pattern matching, with software based on the FM-index such as Bowtie or BWA;
your genome will be stored in Variant-Call Format (VCF), which lists where and how your genome differs from the reference (like a big diff);
you'll get an explanation of what some of those variations mean and either the VCF (from which you can recover your genome by applying those changes to the reference) or a 3 GB text file with your whole genome.

I'm hoping the CGEM seminar will explain why using a single standard reference genome can be a bad idea, as it may undermine personalized medicine for people from other ethnic backgrounds. (Some countries have assembled their own reference genomes, such as China and Denmark, but that's probably not feasible in a country as ethnically diverse as Canada.) You can also read the pop-science articles here (which also mentions rare-disease diagnosis) and here (or a more formal version here).

Before we can discuss approaches to overcoming the bias resulting from using the standard reference genome, we need to understand how read alignment works. There are already courses out there that cover that well so we'll start by watching videos from two of them: Ben Langmead's Algorithms for DNA Sequencing and Indexing; there are lecture notes and code here. (The last two "Indexing" lectures cover Wheeler graphs and start to get into pan-genomics, which we'll get to in a minute.)

If you've already taken Meng He's course "Advanced Data Structures" (4117 or 6057) or Rob Beiko's course "Bioinformatics Algorithms" (4181 or 6802), you should find the first month's material pretty easy. If you want a textbook, I'd suggest Mäkinen et al.'s "Genome-Scale Algorithm Design" or Navarro's "Compact Data Structures: A Practical Approach". Check you've understood things by completing Assignment 1 and Assignment 2 from my undergrad directed-studies course (4192) in the fall. Assignment 2 should be done using Simon Gog's SDSL. I'll post more instructions about using SDSL once the course is under way.

Those two assignments are each worth 5% and due midnight AST, January 24th and 31st, respectively.

Once we understand how to align against a the standard reference, we'll look at ways to alleviate the bias that comes with that and discuss their advantages and disadvantages:

make the standard reference "more standard";
use different reference genomes for different ethnic groups;
use a collection of a few dozen genomes as a reference, chosen to reflect genetic diversity;
use a collection of a few dozen genomes as a reference, engineered to reflect genetic diversity;
use a collection of thousands of genomes as a reference;
index a variation graph or other data structure that represents a pan-genome.

Since this is a course in computer science, the data structures involved will be be central, but you don't need to love the details as much as I do as long as you make up for it by applying them.

We'll focus on approaches 4 to 6, since approaches 1 to 3 probably aren't scalable in the long term: using a better single reference won't help us diagnose rare diseases (which are rare individually but are pretty common collectively, since there are so many of them; see here); a lot of people aren't sure of their ethnic backgrounds or have several; and while a sample of a few dozen genomes will reflect genetic diversity better than only one, it may not capture enough of it.

In particular, we'll look implementations of approaches 4 to 6: e.g.,

founders sequences;
PanVC (here and here), r-index (here and here), MONI/PHONI;
minigraph, vg (here and here), Giraffe, PLAST.

These are called pan-genomic indexes; you can read more about pan-genomics here, here and here (or just search on Google). You can also watch another talk by Ben here and some of my talks here, here and here.

(Most of what I've written here focuses on human pan-genomics, but a lot of people are working on other species, particularly pathogens. If you're more interested in microbial pan-genomics, for example, I can put you in touch with some friends of mine who specialize in that.)

Logistics:

The syllabus is here. My email is sivart.eigag@liamg.moc with all the words reversed. At least to start with, email is probably the best way to reach me.

We have two 90-minute classes each week, Monday 10:35--11:55 and Thursday 16:05--17:25, but we shouldn't need them both of them very often: most weeks you'll have a few hours of videos to watch (asynchronously) or a paper to read and we'll just meet on Webex for an hour to discuss what you've learned and/or your assignments. I'd stick to Teams but I've had better luck recording hi-res video with Webex. I may set up a Slack if people want. I'm not using Brightspace because I want some grad students and post-docs at other schools to be able to follow along.

The weird time slots are because I'm hoping to convince some colleagues to record guest lectures and then show up to the next meeting. Monday mornings should be ok for guests in Europe while Thursday afternoons should be ok for guests in California. Earlier on Monday mornings might have been better for guests in Australia, China or Japan, but these slots are what were available (and is anyone a big fan of early Monday mornings?).

I'd like the marks to be made up of

5% for the first warm-up assignment,
5% for the second warm-up assignment,
10% for participation in the discussions,
20% for a 20-minute presentation comparing two of the implementations (video due February 21st),
30% for an experiment comparing two of the implementations (one-on-one meeting by March 21st),
20% for a report on the experiment (due April 4th),
10% for a short report putting your findings into the context of all the reports (due 1 week before grades due).

There's some room for negotiation.

If you come up with an awesome idea about using compact data structures for computational (pan-)genomics and want to do that instead of a comparison, I'll probably agree. (If it doesn't work, that's ok, research is like that; you can still get an A.)

I think it would be really cool if we could work with Dr Arbour to address some of the issues from her "Silent Genomes" talk (although working with individual human genomes brings some privacy concerns, etc). Giraffe sounds like it might be able to help right out of the box, although it has to deal with chimeric alignments. In a couple of months, you should be able to argue with me about that!