Mini-Project: Longest Common Subsequence

The examples in the previous two mini-projects require no laziness. They would work just as well if Haskell used applicative order evaluation, as most other programming languages do. The final mini-project in this chapter does need lazy evaluation to work.

What sets this example apart from the previous ones is that in the previous examples, we constructed the arrays we needed and then used their contents. Those were two completely separate steps. Now, the list that we pass to our array function to construct the array depends on the array contents. This type of circular dependency is not possible using applicative order evaluation because we either build the array first and then construct the list from it or vice versa. Lazy evaluation makes such circular dependencies possible, as long as no array value depends on itself.

The problem we want to solve is finding the longest common subsequence of two strings:

GHCi

>>> :l LCS.hs
[1 of 1] Compiling Main             ( LCS.hs, interpreted )
Ok, one module loaded.
>>> lcs "Dalhousie" "Dalhusy"
"Dalhus"

A common subsequence of two strings is a string that can be obtained from both strings by deleting an appropriate subset of their characters. Here, "Dalhus" can be obtained from "Dalhousie" by deleting the 'o', the 'i', and the 'e'. It can also be obtained from "Dalhusy" by deleting the 'y'. It is a longest common subsequence (LCS) if there is no longer string that is a common subsequence of the two strings.

I hope that you have learned or will learn in CSCI 3110 how to solve this problem in $O(mn)$ time using dynamic programming, where $m$ and $n$ are the lengths of the two input strings. I won't prove the correctness of the algorithm here, but here is the intuition of how it works:

If one of the two input strings is empty, then clearly the LCS is the empty string.
If both input strings are non-empty, that is, they are of the form x:xs and y:ys, then there are two cases to consider:
- If x == y, then the LCS of x:xs and y:ys starts with x, and this is followed by an LCS of xs and ys.
- If x /= y, then the LCS of x:xs and y:ys is either an LCS of x:xs and ys or of xs and y:ys, whichever is longer.

I hope it is intuitively obvious that this is a correct description of the solution. We can immediately implement a recursive implementation of the lcs function based on this description:

LCS.hs

lcs :: String -> String -> String
lcs xs ys = snd $ lcs' xs ys
  where
    lcs' xs@(x:xs') ys@(y:ys')
        | x == y      = (len1 + 1, x : lcs1)
        | len2 > len3 = (len2,         lcs2)
        | otherwise   = (len3,         lcs3)
      where
        (len1, lcs1) = lcs' xs' ys'
        (len2, lcs2) = lcs' xs  ys'
        (len3, lcs3) = lcs' xs' ys
    lcs' _ _ = (0, [])

In order to make choosing the longer of the two LCSs easier when x /= y, lcs' computes not only an LCS of its arguments but a pair consisting of the length of the LCS and the LCS itself. Thus, lcs simply returns the second component of the pair returned by lcs'.

This implementation seems to work as expected:

GHCi

>>> :l LCS.hs
[1 of 1] Compiling Main             ( LCS.hs, interpreted )
Ok, one module loaded.
>>> lcs "Dalhousie" "Dalhusy"
"Dalhus"

Now try this:

GHCi

>>> lcs ['A'..'Z'] ['a'..'z']
"^CInterrupted.

This didn't work all that well. I had to press Ctrl+C to abort the computation. A little bit of analysis shows that the running time of the algorithm is proportional to $2^{52}$ in this case because both input strings have length $26$ and have not a single character in common. The algorithm we have implemented takes exponential time for strings that are very different. We may have waited for an answer for a looooong time.

Dynamic programming gets us out of this pickle. There are really only $(m + 1) \times (n + 1)$ pairs of substrings of xs and ys we need to consider: all pairs of suffixes of xs and ys. Indeed. To find an LCS of x:xs and y:ys, we are interested in the LCSs of xs and ys, x:xs and ys, and xs and y:ys, but all four sequences x:xs, y:ys, xs, and ys are suffixes of x:xs and y:ys. Similarly, when considering two subsequences xs' and ys' that are suffixes of x:xs and y:ys, then to compute an LCS of xs' and ys', we need to find LCSs of pairs of suffixes of xs' and ys', but since xs' and ys' are suffixes of x:xs and y:ys, suffixes of xs' and ys' are also suffixes of x:xs and y:ys. Thus, we can construct an LCS of x:xs and y:ys if we have a table that stores the LCSs of all pairs of suffixes of x:xs and y:ys. As we will see, we can construct each entry in this table in constant time. Since there are only $(m + 1) \times (n + 1)$ table entries, we obtain a solution that runs in $O(mn)$ time. This faster solution determines that lcs ['A'..'Z'] ['a'..'z'] = "" instantaneously.

First let's figure out how we implement this solution in Python, because this allows you to focus on the algorithm before worrying about how to implement it functionally in Haskell.

def lcs(xs, ys):
    # Compute the lengths of xs and ys
    m = len(xs)
    n = len(ys)

    # Build a 2-d table storing the lengths of the LCSs of all pairs of suffixes of xs and ys.
    # Initially, this table is uninitialized.
    lcslen = [[None] * (n + 1)] * (m + 1)

    # When one of the two suffixes considered is empty, the LCS has length 0.
    for j in range(0, n+1):
        lcslen[m][j] = 0
    for i in range(0, m+1):
        lcslen[i][n] = 0

    # Now consider all pairs of non-empty suffixes by increasing length.
    for i in range(m-1, -1, -1):
        for j in range(n-1, -1, -1):
            if xs[i] == ys[j]:
                lcslen[i][j] = lcslen[i+1][j+1] + 1
            else
                lcslen[i][j] = max(lcslen[i][j+1], lcslen[i+1][j])

    # Build the LCS
    lcs = []
    i = 0
    j = 0
    while i < m and j < n:
        if xs[i] == ys[j]:
            lcs.append(xs[i])
            i += 1
            j += 1
        elif lcslen[i][j+1] > lcslen[i+1][j]:
            j += 1
        else:
            i += 1
    return ''.join(lcs)

This algorithm builds a table lcslen that records only the lengths of the LCSs. Initially, each entry in this table is None. The entry at position $(i,j)$ will in the end record the length of an LCS of (xs[i:], ys[j:]). For i = m or j = n, xs[i:] or xs[j:] is empty, so the LCS length is 0. That's what the two for-loops after the line creating lcslen record. The main body of the function consists of two nested loops that iterate over all non-empty suffixes of xs and ys, by decreasing values of i and j. If xs[i] == ys[j], then remember that an LCS of (xs[i], ys[j]) starts with xs[i] followed by an LCS of (xs[i+1:], ys[j+1:]). Thus, lcslen[i][j] = lcslen[i+1][j+1] + 1 in this case. If xs[i] /= ys[j], then the LCS of of (xs[i:], ys[j:]) is an LCS of (xs[i+1:], ys[j:]) or (xs[i:], ys[j+1:]), whichever is longer. Thus, lcslen[i][j] = max(lcslen[i][j+1], lcslen[i+1][j]).

Finally, the while-loop at the end of the function traverses the two strings and collects the LCS. First, we collect the characters of the LCS in an array lcs. At the end, we convert this array into a string using ''.join(lcs). The while-loop starts with i = j = 0 to construct an LCS of (xs, ys) = (xs[0:], ys[0:]). Given the current values of i and j, the subproblem the loop has to solve is to construct an LCS of (xs[i:], ys[j:]). If i == m or j == n, this LCS is empty, so the loop exits. Otherwise, we check whether xs[i] == ys[j]. If so, we add xs[i] to the solution, because it is the first character in the LCS of (xs[i:], ys[j:]). We also increase i and j by one to get the loop to construct the remainder of the LCS, which is an LCS of (xs[i+1:], ys[j+1:]). If xs[i] /= ys[j], then we increment i or j depending on whether lcslen[i][j+1] > lcslen[i+1][j]. This reflects that an LCS of `(xs[i:], ys[j:]) is an LCS of (xs[i+1:], ys[j:]) or (xs[i:], ys[j+1:]), whichever is longer.

Now let's do all of this in Haskell. We'll build up our function one piece at a time. Let's put the skeleton in place:

LCS.hs

import Data.Array

lcs :: String -> String -> String
lcs xs ys = snd $ table ! (0, 0)
  where
    m = length xs
    n = length ys

    table   = array ((0, 0), (m, n)) entries
    entries = undefined

We start by determining the lengths of the strings xs and ys and create a table equivalent to lcslen, only I didn't come up with a creative name for this table and simply called it table, because it stores more than just the lengths of the LCSs. To be more precise, the entry at index (i,j) is a pair consisting of the length of an LCS of the two substrings of xs and ys starting at positions i and j, respectively, and the LCS itself. Thus, once we are done constructing this table, the LCS we are looking for is the second component of the pair stored at position (0, 0) in the table. In particular, lcs xs ys = snd $ table ! (0, 0).

To populate table, we need to construct a list of entries to be passed to the array function, which is currently undefined. Let's fill this in.

The entries at indices (i, n) and (m, j) are all (0, "") because the substring of xs starting at position m is empty, and the substring of ys starting at position n is empty. Thus,

LCS.hs (Edit)

import Data.Array

lcs :: String -> String -> String
lcs xs ys = snd $ table ! (0, 0)
  where
    m = length xs
    n = length ys

    table   = array ((0, 0), (m, n)) entries
    entries = [((i, n), (0, "")) | i <- [0..m]]
           ++ [((m, j), (0, "")) | j <- [0..n]]
           ++ undefined

The undefined placeholder represents the remaining table entries we still need to provide.

For i < m and j < n, we compute the entry at position (i, j) using a function extend that implements the same logic as in our Python code. This function needs as input the starting positions i and j of the current two substrings of xs and ys, as well as their starting characters, x and y. Since the output of extend i j x y is to be stored at position (i,j) in our dynamic programming table table, we associate this value with the array index (i,j):

LCS.hs (Edit)

import Data.Array

lcs :: String -> String -> String
lcs xs ys = snd $ table ! (0, 0)
  where
    m = length xs
    n = length ys

    table   = array ((0, 0), (m, n)) entries
    entries = [((i, n), (0, "")) | i <- [0..m]]
           ++ [((m, j), (0, "")) | j <- [0..n]]
           ++ [((i, j), extend i j x y) | (i, x) <- zip [0..] xs, (j, y) <- zip [0..] ys]

    extend i j x y = undefined

Now let's fill in the implementation of extend. This is where all the magic of laziness happens. Let's review what extend i j x y should compute:

If x == y, then the LCS of the two substrings starting at positions i and j starts with x, followed by an LCS of the two substrings starting at positions i+1 and j+1. The latter LCS, together with its length is stored at position (i+1,j+1) in table.
If x /= y, then the LCS of the two substrings starting at positions i and j is an LCS of the two substrings starting at positions i and j+1 or of the two substrings starting at positions i+1 and j, whichever is longer. These two LCSs along with their lengths are stored at positions (i,j+1) and (i+1,j) in table.

Thus, to implement extend, we can look up the requisite values in table:

LCS.hs (Edit)

import Data.Array

lcs :: String -> String -> String
lcs xs ys = snd $ table ! (0, 0)
  where
    m = length xs
    n = length ys

    table   = array ((0, 0), (m, n)) entries
    entries = [((i, n), (0, "")) | i <- [0..m]]
           ++ [((m, j), (0, "")) | j <- [0..n]]
           ++ [((i, j), extend i j x y) | (i, x) <- zip [0..] xs, (j, y) <- zip [0..] ys]

    extend i j x y
        | x == y      = (len1 + 1, x : lcs1)
        | len2 > len3 = soln2
        | otherwise   = soln3
      where
        (len1, lcs1)    = table ! (i + 1, j + 1)
        soln2@(len2, _) = table ! (i    , j + 1)
        soln3@(len3, _) = table ! (i + 1, j    )

This implementation populates table using the entries provided in the list entries, and these list entries are computed by extend, which looks up other table entries in table. The reason why this works is laziness. Just as any other value in Haskell, the entries in table and in the entries list are thunks. We can easily define expressions that tell us how to compute these entries, and it is perfectly fine that table entries are determined by entries of the list entries, and list entries are determined by table entries. None of these expressions are ever evaluated, until we ask for the table entry at position (0, 0). The entry at this position is whatever the entry in entries was that was associated with the index pair (0, 0). This index pair was computed by looking at position (1, 1) or at positions (0, 1) and (1, 0) in our table, so we also need to evaluate those. They in turn depend on entries at positions (0, 2), (1, 1), (1, 2), (2, 0), (2, 1), and (2, 2), so we need to evaluate those. We follow the strings of dependencies between table entries until we have unravelled the whole mesh down to entries in row m or column n, which were defined without any dependencies on other table entries. All we have really done in our program is to specify the dependencies between the different table entries, and lazy evaluation takes care of evaluating all the table entries we need. This works as long as there are no cycles in the dependencies between table entries, but this is the case here because for all (i, j), the table entry at position (i, j) depends only on table entries in row i + 1 or column j + 1.

If we load this code into GHCi, it can deal efficiently with the input that made our naive implementation run forever:

GHCi

>>> :l LCS.hs
[1 of 1] Compiling Main             ( LCS.hs, interpreted )
Ok, one module loaded.
>>> lcs ['A'..'Z'] ['a'..'z']
""
>>> lcs "Dalhousie" "Dalhusy"
"Dalhus"

Exercise

Here is another classical problem typically solved using dynamic programming: finding the longest monotonically increasing subsequence of a sequence. Given a sequence xs of numbers (or elements of any type with an ordering), a subsequence of xs is any sequence ys we can obtain by deleting some elements of xs. In particular, the elements of ys do not need to appear consecutive in xs. For example, [1,3,4] is a subsequence of [1,2,3,4,5]. Such a subsequence ys is monotonically increasing if each element in ys is greater than the element before it in ys. The sequence [1,3,4] is monotonically increasing. The sequence [1,4,3] is not. The goal is to find the longest monotonically increasing subsequence of xs.

Here's (a variation of) the algorithm you may learn for this problem in CSCI 3110: Let the input be a sequence $x = \langle x_1, x_2, \ldots, x_n \rangle$. We build a table $L$ of size $n$. The $i$th entry is the length of the longest monotonically increasing subsequence of $x$ that starts with $x_i$. Then clearly, the length of a longest monotonically increasing subsequence of $x$ is $\max_{1 \le i \le n} L[i]$. In fact, we can augment this table to store pairs $(\ell, y)$ such that $L[i] = (\ell, y)$ if $\ell$ is the length of a longest monotonically increasing subsequence of $x$ that starts with $x_i$ and $y$ is such a subsequence. In this case, if $L[i] = (\ell, y)$ is the entry in $L$ with the greatest length, then $y$ is a longest monotonically increasing subsequence of $x$.

To build the table $L[i]$, we use the following rules:

If all elements $x_{i+1}, \ldots, x_n$ are no greater than $x_i$, then $L[i] = (1, \langle x_i \rangle)$: Clearly, any monotonically increasing subsequence of $x$ that starts with $x_i$ can contain only $x_i$.
If there is at least one element among $x_{i+1}, \ldots, x_n$, then let $L[j] = (\ell, y)$ be the entry among $L[i+1], \ldots, L[n]$ with the greatest length. Then set $L[i] = (\ell + 1, x_i : y)$.

Implement this algorithm in Haskell:

lmis :: Ord a => [a] -> [a]

Solution

This one actually does not require an array to be implemented efficiently, because each table entry is computed from all the table entries after it, so there is no need for efficient random access to individual table entries. Here's an efficient implementation using only lists:

LMIS.hs

-- We need a few functions for manipulating functions an lists, and for
-- converting a list to a Maybe value.  It wouldn't be hard to implement
-- these functions yourself, but I chose to use them here to introduce
-- you to a few more useful standard library functions.
import Data.Function (on)
import Data.List (maximumBy)
import Data.Maybe (listToMaybe)

lmis :: Ord a => [a] -> [a]

-- The LMIS of the empty list is clearly empty.
lmis [] = []

-- For a non-empty input sequence xs, lmiss xs is the list of the LMISs
-- of all suffixes of xs, where each LMIS of a suffix ys of xs is
-- required to start with the first element of ys. More precisely, lmiss
-- contains a pair composed of the length of the LMIS and the LMIS
-- itself for each suffix of xs. Thus, picking the entry in lmiss xs
-- with the greatest length and then extracting the LMIS from this pair
-- using snd gives us the LMIS of xs.
lmis xs = snd
        $ maximumBy (compare `on` fst)
        $ lmiss xs

  where

    -- For the empty sequence, the only suffix sis the empty sequence
    -- itself. Its LMIS is [], of length 0.
    lmiss [] = [(0, [])]

    -- For a non-empty sequence y:ys, lmiss (y:ys) has the LMISs of ys
    -- as its tail. The head of lmiss (y:ys) is the sequence y : ext and
    -- its length where ext is the longest sequence in lmiss ys that is
    -- empty or whose first element is greater than y.
    lmiss (y:ys) = (extlen + 1, y : ext) : lmiss'
      where
        lmiss'        = lmiss ys
        (extlen, ext) = maximumBy (compare `on` fst)
                      $ filter (maybe True (y <) . listToMaybe . snd) lmiss'

A few explanations may be in order:

First, compare `on` fst. compare is a function supported by every type that supports comparisons. Its type is

compare :: Ord a => a -> a -> Ordering

Ordering is defined as

data Ordering = LT | EQ | GT

representing the possible outcomes of a comparison. Given two elements x and y, compare x y returns whether x is less than y (LT), equal to y (EQ) or greater than y (GT). Here, we want to compare the pairs in lmiss by their first components. That's what compare `on` fst does. on is defined as follows:

on :: (a -> a -> c) -> (b -> a) -> (b -> b -> c)
(f `on` g) x y = f (g x) (g y)

With f = compare and g = fst, we obtain (compare `on` fst) x y = compare (fst x) (fst y), that is, we compare the pairs x and y by their first components.

maximumBy f xs takes the maximum of the elements in xs, but the function used to compare elements in xs is the function f. Thus, `maximumBy (compareonfst) xs takes the maximum entry from a list of pairs where the pairs are compared by their first components.

This should clear up the definition of lmis xs. Now let's look at the definition of lmiss (y:ys) (the definition of lmiss [] hopefully being obvious). The tail of this list is lmiss' = lmiss ys. This part should also be obvious. The head is (extlen + 1, y : ext), where ext is the longest LMIS of any suffix of ys such that y : ext is monotontically increasing, and extlen is its length. That's what an LMIS of y:ys that is required to start with y looks like.

This leaves the computation of (extlen:ext). We start with lmiss', the list of LMISs of all suffixes of ys. Our goal is to extract from this list all the LMISs that are either empty or whose first elements are greater than y. That's what the filter expression produces. We'll discuss it in detail shortly. The maximumBy expression then once again extracts from this list the pair with the largest first component, that is, the pair (extlen, ext).

Now, my first impulse was to extract the list of all suffixse that can follow y using the much simpler expression filter ((y <) . head . snd) lmiss': For every pair in lmiss', extract its second component, which is the actual sequence represented by this pair. That's what snd does. Then we take the head of this sequence using head, and check whether it is greater than y using (y <). We keep any sequence that meets this condition. This works fine for all but one entry in lmiss'. Note that lmiss' also contains the entry (0, []) corresponding to the empty suffix of ys. For this entry, head . snd fails because it tries to take the head of an empty list. Thus, we need to alter the condition ((y <) . head . snd) so it doesn't crash our program when applied to (0, []). It should in fact accept this entry because the list [] surely meets the condition of being empty. Here's how we do this:

Once again, we extract the list from the pair using snd. Then we convert this pair to Maybe its head. That's what listToMaybe does. If the input list is empty, the result is Nothing. Otherwise, the result is Just the head of the list. And finally we pass this value to maybe True (y <). Given Nothing as an argument, this produces True, and given Just z as an argument, this produces the result of the comparison y < z. Thus, for an empty list, maybe True (y <) . listToMaybe returns True, and for a non-empty list it returns True if and only if y is less than the head of the list. That's what we want.

I couldn't resist giving this solution because the seasoned Haskell programmer avoids arrays unless they need them. However, this is an exercise in the section on arrays. So, for completeness, here is an implementation that uses arrays:

LMISArray.hs

import Data.Array (Array, array, elems, (!))
import Data.Function (on)
import Data.List (maximumBy)

lmis :: Ord a => [a] -> [a]
lmis [] = []
lmis xs = snd
        $ maximumBy (compare `on` fst)
        $ elems lmiss
  where
    n     = length xs
    lmiss = array (0, n) els
    els = (n, (0, [])) : zipWith el [0..] xs

    el i x = (i, (extlen + 1, x : ext))
      where
        (extlen, ext)
            = maximumBy (compare `on` fst)
              [ (l, seq)
              | (l, seq) <- [lmiss!j | j <- [i+1..n]]
              , null seq || head seq > x
              ]

I think we can agree that the list-based implementation is more elegant.