Mini-Project: Longest Common Subsequence
The examples in the previous two mini-projects require no laziness. They would work just as well if Haskell used applicative order evaluation, as most other programming languages do. The final mini-project in this chapter does need lazy evaluation to work.
What sets this example apart from the previous ones is that in the previous
examples, we constructed the arrays we needed and then used their contents.
Those were two completely separate steps. Now, the list that we pass to our
array
function to construct the array depends on the array contents. This type
of circular dependency is not possible using applicative order evaluation
because we either build the array first and then construct the list from it or
vice versa. Lazy evaluation makes such circular dependencies possible, as long
as no array value depends on itself.
The problem we want to solve is finding the longest common subsequence of two strings:
>>> :l LCS.hs
[1 of 1] Compiling Main ( LCS.hs, interpreted )
Ok, one module loaded.
>>> lcs "Dalhousie" "Dalhusy"
"Dalhus"
A common subsequence of two strings is a string that can be obtained from both strings by deleting an appropriate subset of their characters. Here, "Dalhus" can be obtained from "Dalhousie" by deleting the 'o', the 'i', and the 'e'. It can also be obtained from "Dalhusy" by deleting the 'y'. It is a longest common subsequence (LCS) if there is no longer string that is a common subsequence of the two strings.
I hope that you have learned or will learn in CSCI 3110 how to solve this problem in \(O(mn)\) time using dynamic programming, where \(m\) and \(n\) are the lengths of the two input strings. I won't prove the correctness of the algorithm here, but here is the intuition of how it works:
-
If one of the two input strings is empty, then clearly the LCS is the empty string.
-
If both input strings are non-empty, that is, they are of the form
x:xs
andy:ys
, then there are two cases to consider:-
If
x == y
, then the LCS ofx:xs
andy:ys
starts withx
, and this is followed by an LCS ofxs
andys
. -
If
x /= y
, then the LCS ofx:xs
andy:ys
is either an LCS ofx:xs
andys
or ofxs
andy:ys
, whichever is longer.
-
I hope it is intuitively obvious that this is a correct description of the
solution. We can immediately implement a recursive implementation of the lcs
function based on this description:
lcs :: String -> String -> String
lcs xs ys = snd $ lcs' xs ys
where
lcs' xs@(x:xs') ys@(y:ys')
| x == y = (len1 + 1, x : lcs1)
| len2 > len3 = (len2, lcs2)
| otherwise = (len3, lcs3)
where
(len1, lcs1) = lcs' xs' ys'
(len2, lcs2) = lcs' xs ys'
(len3, lcs3) = lcs' xs' ys
lcs' _ _ = (0, [])
In order to make choosing the longer of the two LCSs easier when x /= y
,
lcs'
computes not only an LCS of its arguments but a pair consisting of the
length of the LCS and the LCS itself. Thus, lcs
simply returns the second
component of the pair returned by lcs'
.
This implementation seems to work as expected:
>>> :l LCS.hs
[1 of 1] Compiling Main ( LCS.hs, interpreted )
Ok, one module loaded.
>>> lcs "Dalhousie" "Dalhusy"
"Dalhus"
Now try this:
>>> lcs ['A'..'Z'] ['a'..'z']
"^CInterrupted.
This didn't work all that well. I had to press Ctrl+C to abort the computation. A little bit of analysis shows that the running time of the algorithm is proportional to \(2^{52}\) in this case because both input strings have length \(26\) and have not a single character in common. The algorithm we have implemented takes exponential time for strings that are very different. We may have waited for an answer for a looooong time.
Dynamic programming gets us out of this pickle. There are really only \((m + 1)
\times (n + 1)\) pairs of substrings of xs
and ys
we need to consider: all
pairs of suffixes of xs
and ys
. Indeed. To find an LCS of x:xs
and y:ys
,
we are interested in the LCSs of xs
and ys
, x:xs
and ys
, and xs
and
y:ys
, but all four sequences x:xs
, y:ys
, xs
, and ys
are suffixes of
x:xs
and y:ys
. Similarly, when considering two subsequences xs'
and ys'
that are suffixes of x:xs
and y:ys
, then to compute an LCS of xs'
and
ys'
, we need to find LCSs of pairs of suffixes of xs'
and ys'
, but since
xs'
and ys'
are suffixes of x:xs
and y:ys
, suffixes of xs'
and ys'
are also suffixes of x:xs
and y:ys
. Thus, we can construct an LCS of x:xs
and y:ys
if we have a table that stores the LCSs of all pairs of suffixes of
x:xs
and y:ys
. As we will see, we can construct each entry in this table in
constant time. Since there are only \((m + 1) \times (n + 1)\) table entries, we
obtain a solution that runs in \(O(mn)\) time. This faster solution determines
that lcs ['A'..'Z'] ['a'..'z'] = ""
instantaneously.
First let's figure out how we implement this solution in Python, because this allows you to focus on the algorithm before worrying about how to implement it functionally in Haskell.
def lcs(xs, ys):
# Compute the lengths of xs and ys
m = len(xs)
n = len(ys)
# Build a 2-d table storing the lengths of the LCSs of all pairs of suffixes of xs and ys.
# Initially, this table is uninitialized.
lcslen = [[None] * (n + 1)] * (m + 1)
# When one of the two suffixes considered is empty, the LCS has length 0.
for j in range(0, n+1):
lcslen[m][j] = 0
for i in range(0, m+1):
lcslen[i][n] = 0
# Now consider all pairs of non-empty suffixes by increasing length.
for i in range(m-1, -1, -1):
for j in range(n-1, -1, -1):
if xs[i] == ys[j]:
lcslen[i][j] = lcslen[i+1][j+1] + 1
else
lcslen[i][j] = max(lcslen[i][j+1], lcslen[i+1][j])
# Build the LCS
lcs = []
i = 0
j = 0
while i < m and j < n:
if xs[i] == ys[j]:
lcs.append(xs[i])
i += 1
j += 1
elif lcslen[i][j+1] > lcslen[i+1][j]:
j += 1
else:
i += 1
return ''.join(lcs)
This algorithm builds a table lcslen
that records only the lengths of the
LCSs. Initially, each entry in this table is None
. The entry at position
\((i,j)\) will in the end record the length of an LCS of (xs[i:], ys[j:])
. For
i = m
or j = n
, xs[i:]
or xs[j:]
is empty, so the LCS length is 0.
That's what the two for-loops after the line creating lcslen
record. The main
body of the function consists of two nested loops that iterate over all
non-empty suffixes of xs
and ys
, by decreasing values of i
and j
. If
xs[i] == ys[j]
, then remember that an LCS of (xs[i], ys[j])
starts with
xs[i]
followed by an LCS of (xs[i+1:], ys[j+1:])
. Thus,
lcslen[i][j] = lcslen[i+1][j+1] + 1
in this case. If xs[i] /= ys[j]
, then
the LCS of of (xs[i:], ys[j:])
is an LCS of (xs[i+1:], ys[j:])
or
(xs[i:], ys[j+1:])
, whichever is longer. Thus,
lcslen[i][j] = max(lcslen[i][j+1], lcslen[i+1][j])
.
Finally, the while-loop at the end of the function traverses the two strings and
collects the LCS. First, we collect the characters of the LCS in an array lcs
.
At the end, we convert this array into a string using ''.join(lcs)
. The
while-loop starts with i = j = 0
to construct an LCS of
(xs, ys) = (xs[0:], ys[0:])
. Given the current values of i
and j
, the
subproblem the loop has to solve is to construct an LCS of (xs[i:], ys[j:])
.
If i == m
or j == n
, this LCS is empty, so the loop exits. Otherwise, we
check whether xs[i] == ys[j]
. If so, we add xs[i]
to the solution, because
it is the first character in the LCS of (xs[i:], ys[j:])
. We also increase i
and j
by one to get the loop to construct the remainder of the LCS, which is
an LCS of (xs[i+1:], ys[j+1:])
. If xs[i] /= ys[j]
, then we increment i
or
j
depending on whether lcslen[i][j+1] > lcslen[i+1][j]
. This reflects that
an LCS of `(xs[i:], ys[j:])
is an LCS of (xs[i+1:], ys[j:])
or (xs[i:],
ys[j+1:])
, whichever is longer.
Now let's do all of this in Haskell. We'll build up our function one piece at a time. Let's put the skeleton in place:
import Data.Array
lcs :: String -> String -> String
lcs xs ys = snd $ table ! (0, 0)
where
m = length xs
n = length ys
table = array ((0, 0), (m, n)) entries
entries = undefined
We start by determining the lengths of the strings xs
and ys
and create a
table equivalent to lcslen
, only I didn't come up with a creative name for
this table and simply called it table
, because it stores more than just the
lengths of the LCSs. To be more precise, the entry at index (i,j)
is a pair
consisting of the length of an LCS of the two substrings of xs
and ys
starting at positions i
and j
, respectively, and the LCS itself. Thus, once
we are done constructing this table, the LCS we are looking for is the second
component of the pair stored at position (0, 0)
in the table. In particular,
lcs xs ys = snd $ table ! (0, 0)
.
To populate table
, we need to construct a list of entries
to be passed to
the array
function, which is currently undefined
. Let's fill this in.
The entries at indices (i, n)
and (m, j)
are all (0, "")
because the
substring of xs
starting at position m
is empty, and the substring of ys
starting at position n
is empty. Thus,
import Data.Array
lcs :: String -> String -> String
lcs xs ys = snd $ table ! (0, 0)
where
m = length xs
n = length ys
table = array ((0, 0), (m, n)) entries
entries = [((i, n), (0, "")) | i <- [0..m]]
++ [((m, j), (0, "")) | j <- [0..n]]
++ undefined
The undefined
placeholder represents the remaining table entries we still need
to provide.
For i < m
and j < n
, we compute the entry at position (i, j)
using a
function extend
that implements the same logic as in our Python code. This
function needs as input the starting positions i
and j
of the current two
substrings of xs
and ys
, as well as their starting characters, x
and y
.
Since the output of extend i j x y
is to be stored at position (i,j)
in our
dynamic programming table table
, we associate this value with the array index
(i,j)
:
import Data.Array
lcs :: String -> String -> String
lcs xs ys = snd $ table ! (0, 0)
where
m = length xs
n = length ys
table = array ((0, 0), (m, n)) entries
entries = [((i, n), (0, "")) | i <- [0..m]]
++ [((m, j), (0, "")) | j <- [0..n]]
++ [((i, j), extend i j x y) | (i, x) <- zip [0..] xs, (j, y) <- zip [0..] ys]
extend i j x y = undefined
Now let's fill in the implementation of extend
. This is where all the magic of
laziness happens. Let's review what extend i j x y
should compute:
-
If
x == y
, then the LCS of the two substrings starting at positionsi
andj
starts withx
, followed by an LCS of the two substrings starting at positionsi+1
andj+1
. The latter LCS, together with its length is stored at position(i+1,j+1)
intable
. -
If
x /= y
, then the LCS of the two substrings starting at positionsi
andj
is an LCS of the two substrings starting at positionsi
andj+1
or of the two substrings starting at positionsi+1
andj
, whichever is longer. These two LCSs along with their lengths are stored at positions(i,j+1)
and(i+1,j)
intable
.
Thus, to implement extend
, we can look up the requisite values in table
:
import Data.Array
lcs :: String -> String -> String
lcs xs ys = snd $ table ! (0, 0)
where
m = length xs
n = length ys
table = array ((0, 0), (m, n)) entries
entries = [((i, n), (0, "")) | i <- [0..m]]
++ [((m, j), (0, "")) | j <- [0..n]]
++ [((i, j), extend i j x y) | (i, x) <- zip [0..] xs, (j, y) <- zip [0..] ys]
extend i j x y
| x == y = (len1 + 1, x : lcs1)
| len2 > len3 = soln2
| otherwise = soln3
where
(len1, lcs1) = table ! (i + 1, j + 1)
soln2@(len2, _) = table ! (i , j + 1)
soln3@(len3, _) = table ! (i + 1, j )
This implementation populates table
using the entries provided in the list
entries
, and these list entries are computed by extend
, which looks up other
table entries in table
. The reason why this works is laziness. Just as any
other value in Haskell, the entries in table
and in the entries
list are
thunks. We can easily define expressions that tell us how to compute these
entries, and it is perfectly fine that table
entries are determined by entries
of the list entries
, and list entries are determined by table
entries. None
of these expressions are ever evaluated, until we ask for the table
entry at
position (0, 0)
. The entry at this position is whatever the entry in entries
was that was associated with the index pair (0, 0)
. This index pair was
computed by looking at position (1, 1)
or at positions (0, 1)
and (1, 0)
in our table
, so we also need to evaluate those. They in turn depend on
entries at positions (0, 2)
, (1, 1)
, (1, 2)
, (2, 0)
, (2, 1)
, and
(2, 2)
, so we need to evaluate those. We follow the strings of dependencies
between table entries until we have unravelled the whole mesh down to entries in
row m
or column n
, which were defined without any dependencies on other
table entries. All we have really done in our program is to specify the
dependencies between the different table entries, and lazy evaluation takes care
of evaluating all the table entries we need. This works as long as there are no
cycles in the dependencies between table entries, but this is the case here
because for all (i, j)
, the table entry at position (i, j)
depends only on
table entries in row i + 1
or column j + 1
.
If we load this code into GHCi, it can deal efficiently with the input that made our naive implementation run forever:
>>> :l LCS.hs
[1 of 1] Compiling Main ( LCS.hs, interpreted )
Ok, one module loaded.
>>> lcs ['A'..'Z'] ['a'..'z']
""
>>> lcs "Dalhousie" "Dalhusy"
"Dalhus"
Exercise
Here is another classical problem typically solved using dynamic
programming: finding the longest monotonically increasing subsequence of a
sequence. Given a sequence xs
of numbers (or elements of any type with an
ordering), a subsequence of xs
is any sequence ys
we can obtain by
deleting some elements of xs
. In particular, the elements of ys
do not
need to appear consecutive in xs
. For example, [1,3,4]
is a subsequence
of [1,2,3,4,5]
. Such a subsequence ys
is monotonically increasing if
each element in ys
is greater than the element before it in ys
. The
sequence [1,3,4]
is monotonically increasing. The sequence [1,4,3]
is
not. The goal is to find the longest monotonically increasing subsequence of
xs
.
Here's (a variation of) the algorithm you may learn for this problem in CSCI 3110: Let the input be a sequence \(x = \langle x_1, x_2, \ldots, x_n \rangle\). We build a table \(L\) of size \(n\). The \(i\)th entry is the length of the longest monotonically increasing subsequence of \(x\) that starts with \(x_i\). Then clearly, the length of a longest monotonically increasing subsequence of \(x\) is \(\max_{1 \le i \le n} L[i]\). In fact, we can augment this table to store pairs \((\ell, y)\) such that \(L[i] = (\ell, y)\) if \(\ell\) is the length of a longest monotonically increasing subsequence of \(x\) that starts with \(x_i\) and \(y\) is such a subsequence. In this case, if \(L[i] = (\ell, y)\) is the entry in \(L\) with the greatest length, then \(y\) is a longest monotonically increasing subsequence of \(x\).
To build the table \(L[i]\), we use the following rules:
-
If all elements \(x_{i+1}, \ldots, x_n\) are no greater than \(x_i\), then \(L[i] = (1, \langle x_i \rangle)\): Clearly, any monotonically increasing subsequence of \(x\) that starts with \(x_i\) can contain only \(x_i\).
-
If there is at least one element among \(x_{i+1}, \ldots, x_n\), then let \(L[j] = (\ell, y)\) be the entry among \(L[i+1], \ldots, L[n]\) with the greatest length. Then set \(L[i] = (\ell + 1, x_i : y)\).
Implement this algorithm in Haskell:
lmis :: Ord a => [a] -> [a]
Solution
This one actually does not require an array to be implemented efficiently, because each table entry is computed from all the table entries after it, so there is no need for efficient random access to individual table entries. Here's an efficient implementation using only lists:
-- We need a few functions for manipulating functions an lists, and for
-- converting a list to a Maybe value. It wouldn't be hard to implement
-- these functions yourself, but I chose to use them here to introduce
-- you to a few more useful standard library functions.
import Data.Function (on)
import Data.List (maximumBy)
import Data.Maybe (listToMaybe)
lmis :: Ord a => [a] -> [a]
-- The LMIS of the empty list is clearly empty.
lmis [] = []
-- For a non-empty input sequence xs, lmiss xs is the list of the LMISs
-- of all suffixes of xs, where each LMIS of a suffix ys of xs is
-- required to start with the first element of ys. More precisely, lmiss
-- contains a pair composed of the length of the LMIS and the LMIS
-- itself for each suffix of xs. Thus, picking the entry in lmiss xs
-- with the greatest length and then extracting the LMIS from this pair
-- using snd gives us the LMIS of xs.
lmis xs = snd
$ maximumBy (compare `on` fst)
$ lmiss xs
where
-- For the empty sequence, the only suffix sis the empty sequence
-- itself. Its LMIS is [], of length 0.
lmiss [] = [(0, [])]
-- For a non-empty sequence y:ys, lmiss (y:ys) has the LMISs of ys
-- as its tail. The head of lmiss (y:ys) is the sequence y : ext and
-- its length where ext is the longest sequence in lmiss ys that is
-- empty or whose first element is greater than y.
lmiss (y:ys) = (extlen + 1, y : ext) : lmiss'
where
lmiss' = lmiss ys
(extlen, ext) = maximumBy (compare `on` fst)
$ filter (maybe True (y <) . listToMaybe . snd) lmiss'
A few explanations may be in order:
First, compare `on` fst
. compare
is a function supported by every
type that supports comparisons. Its type is
compare :: Ord a => a -> a -> Ordering
Ordering
is defined as
data Ordering = LT | EQ | GT
representing the possible outcomes of a comparison. Given two elements
x
and y
, compare x y
returns whether x
is less than y
(LT
),
equal to y
(EQ
) or greater than y
(GT
). Here, we want to compare
the pairs in lmiss
by their first components. That's what
compare `on` fst
does. on
is defined as follows:
on :: (a -> a -> c) -> (b -> a) -> (b -> b -> c)
(f `on` g) x y = f (g x) (g y)
With f = compare
and g = fst
, we obtain
(compare `on` fst) x y = compare (fst x) (fst y)
, that is, we
compare the pairs x
and y
by their first components.
maximumBy f xs
takes the maximum of the elements in xs
, but the
function used to compare elements in xs
is the function f
. Thus,
`maximumBy (compare
onfst) xs
takes the maximum entry from a list
of pairs where the pairs are compared by their first components.
This should clear up the definition of lmis xs
. Now let's look at the
definition of lmiss (y:ys)
(the definition of lmiss []
hopefully
being obvious). The tail of this list is lmiss' = lmiss ys
. This part
should also be obvious. The head is (extlen + 1, y : ext)
, where ext
is the longest LMIS of any suffix of ys
such that y : ext
is
monotontically increasing, and extlen
is its length. That's what an
LMIS of y:ys
that is required to start with y
looks like.
This leaves the computation of (extlen:ext)
. We start with lmiss'
,
the list of LMISs of all suffixes of ys
. Our goal is to extract from
this list all the LMISs that are either empty or whose first elements
are greater than y
. That's what the filter
expression produces.
We'll discuss it in detail shortly. The maximumBy
expression then once
again extracts from this list the pair with the largest first component,
that is, the pair (extlen, ext)
.
Now, my first impulse was to extract the list of all suffixse that can
follow y
using the much simpler expression
filter ((y <) . head . snd) lmiss'
: For every pair in lmiss'
,
extract its second component, which is the actual sequence represented
by this pair. That's what snd
does. Then we take the head of this
sequence using head
, and check whether it is greater than y
using
(y <)
. We keep any sequence that meets this condition. This works fine
for all but one entry in lmiss'
. Note that lmiss'
also contains the
entry (0, [])
corresponding to the empty suffix of ys
. For this
entry, head . snd
fails because it tries to take the head of an empty
list. Thus, we need to alter the condition ((y <) . head . snd)
so it
doesn't crash our program when applied to (0, [])
. It should in fact
accept this entry because the list []
surely meets the condition of
being empty. Here's how we do this:
Once again, we extract the list from the pair using snd
. Then we
convert this pair to Maybe
its head. That's what listToMaybe
does.
If the input list is empty, the result is Nothing
. Otherwise, the
result is Just
the head of the list. And finally we pass this value to
maybe True (y <)
. Given Nothing
as an argument, this produces
True
, and given Just z
as an argument, this produces the result of
the comparison y < z
. Thus, for an empty list,
maybe True (y <) . listToMaybe
returns True
, and for a non-empty
list it returns True
if and only if y
is less than the head of the
list. That's what we want.
I couldn't resist giving this solution because the seasoned Haskell programmer avoids arrays unless they need them. However, this is an exercise in the section on arrays. So, for completeness, here is an implementation that uses arrays:
import Data.Array (Array, array, elems, (!))
import Data.Function (on)
import Data.List (maximumBy)
lmis :: Ord a => [a] -> [a]
lmis [] = []
lmis xs = snd
$ maximumBy (compare `on` fst)
$ elems lmiss
where
n = length xs
lmiss = array (0, n) els
els = (n, (0, [])) : zipWith el [0..] xs
el i x = (i, (extlen + 1, x : ext))
where
(extlen, ext)
= maximumBy (compare `on` fst)
[ (l, seq)
| (l, seq) <- [lmiss!j | j <- [i+1..n]]
, null seq || head seq > x
]
I think we can agree that the list-based implementation is more elegant.