Infinite Data Structures
Lazy evaluation in Haskell allows us to define infinite data structures. We will discuss list comprehensions and ranges in detail in the next chapter. For the example in this section, it suffices to know that [1..] is the list of all positive integers. If you try to display this list in GHCi, you had better get ready to press Ctrl+C quickly because GHCi dutifully tries to print all integers in this list, a process that never ends.
>>> [1..]
[Output omitted, probably a few screens full]
On the other hand, if we assign this list to a variable, this is remarkably fast:
>>> ints = [1..]
That's because all we did here was assign the thunk representing the infinite list to the variable ints. Nothing has forced us to evaluate any part of this list yet. However, trying to inspect what ints looks like produces the same screens full of integers as before.
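What we can do safely is demand a finite prefix of the infinite list, for example with take. A quick sketch (firstTen is a name of our own choosing):

```haskell
-- take n xs evaluates only the first n nodes of xs, so an infinite
-- list is harmless as long as we demand a finite prefix of it.
firstTen :: [Int]
firstTen = take 10 [1 ..]

main :: IO ()
main = print firstTen  -- [1,2,3,4,5,6,7,8,9,10]
```

Only the first ten list nodes are ever evaluated; the rest of the list remains an unevaluated thunk.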
Why are infinite data structures useful? Let's look at an example.
Poor Man's Grep
You may be familiar with grep, a UNIX tool that allows us to look for lines in an input file that match a given regular expression. We won't implement regular expression search here, but we can implement a "poor man's grep", which looks for the occurrence of a given search string in each line. To start—this part does not need infinite lists yet—we need to break our input into lines, and we need a function that tests whether our search string occurs in a given line. The functions we're looking for are lines and isSubsequenceOf.
>>> lines "Hello, world!\nThis is a silly text file.\nHaskell is fun.\n"
["Hello, world!","This is a silly text file.","Haskell is fun."]
So lines text returns a list of strings, each a line of text with the trailing newline character stripped off. Note that isSubsequenceOf tests whether its first argument occurs as a subsequence of its second, that is, whether its elements appear in order but not necessarily contiguously. This is a slightly weaker test than substring search (Data.List also provides isInfixOf for contiguous substrings), but for the search strings in our examples the two tests agree. And it is not limited to strings: it works for arbitrary lists, as long as the elements can be tested for equality:
>>> import Data.List
>>> :t isSubsequenceOf
isSubsequenceOf :: Eq a => [a] -> [a] -> Bool
>>> [3,4] `isSubsequenceOf` [1,5,3,4,2]
True
>>> [3,4] `isSubsequenceOf` [1,5,3,2,2]
False
>>> "ell" `isSubsequenceOf` "Hello"
True
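To see where the subsequence test and a true substring test differ, here is a small sketch comparing isSubsequenceOf with Data.List's isInfixOf:

```haskell
import Data.List (isInfixOf, isSubsequenceOf)

main :: IO ()
main = do
  -- "Hlo" occurs in "Hello" as a subsequence (H, l, o appear in order)...
  print ("Hlo" `isSubsequenceOf` "Hello")  -- True
  -- ...but not as a contiguous substring:
  print ("Hlo" `isInfixOf` "Hello")        -- False
  -- For the search strings in this section, both tests agree:
  print ("ell" `isSubsequenceOf` "Hello")  -- True
  print ("ell" `isInfixOf` "Hello")        -- True
```

For a faithful grep we would use isInfixOf; this section sticks with isSubsequenceOf, which behaves identically on the examples shown here.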
Our poor man's grep function returns the list of lines in the input that contain the given search string:
findMatchingLines :: String -> String -> [String]
findMatchingLines pat text = filter (isSubsequenceOf pat) (lines text)
>>> findMatchingLines pat text = filter (isSubsequenceOf pat) (lines text)
>>> findMatchingLines "ell" "Hello, world!\nThis is a silly text file.\nHaskell is fun.\n"
["Hello, world!","Haskell is fun."]
>>> findMatchingLines "ello" "Hello, world!\nThis is a silly text file.\nHaskell is fun.\n"
["Hello, world!"]
>>> findMatchingLines "ll" "Hello, world!\nThis is a silly text file.\nHaskell is fun.\n"
["Hello, world!","This is a silly text file.","Haskell is fun."]
grep also provides an option -n, which shows not only the matching line but also its line number. Let's add this functionality to findMatchingLines. We want the return type to be [(Int, String)], containing for each matching line the number of the line and the line itself. Here's a first attempt:
findMatchingLines :: String -> String -> [(Int, String)]
findMatchingLines pat text = filter (isSubsequenceOf pat . snd) (zip [1..len] ls)
  where
    ls = lines text
    len = length ls
This uses a range: [x..y] is the list of all integers from x to y. Ranges are actually more powerful: they also work for characters and other enumerable types, but here, we're working with integers.
What the function does is break text into its lines. The list of lines is stored in ls. We calculate the length of ls, the number of lines, and store the result in len. Next we zip the list of integers between 1 and len together with ls. Here's the zip function:
zip :: [a] -> [b] -> [(a,b)]
zip (x:xs) (y:ys) = (x,y) : zip xs ys
zip _      _      = []
It takes two lists as arguments and produces a list of pairs where the first element of the first list is paired with the first element of the second list, the second element of the first list is paired with the second element of the second list, and so on until one of the two lists runs out of elements:
>>> zip [1,2,3] ["Hello","world","some","more","text"]
[(1,"Hello"),(2,"world"),(3,"some")]
Note that "more" and "text" were discarded because there were only three elements in the list [1,2,3].
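This truncation has a pleasant consequence that we rely on below: the longer of the two lists may even be infinite. A quick sketch:

```haskell
main :: IO ()
main =
  -- zip stops at the end of the shorter list, so pairing a finite
  -- list with the infinite list [1..] terminates:
  print (zip [1 ..] ["Hello", "world"])  -- [(1,"Hello"),(2,"world")]
```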
So, what zip [1..len] ls does is pair every line of text, every element of ls, with its line number. We then filter this list: the output should contain every pair of line number and line where the line contains pat. To implement this, we use a helper function for pairs:
snd :: (a, b) -> b
snd (_, y) = y
It simply returns the second element of the pair. Thus, isSubsequenceOf pat . snd applies the test isSubsequenceOf pat to the second component of the pair, that is, it tests whether the line contains pat.
We can try this out. It behaves as expected:
>>> :{
| findMatchingLines :: String -> String -> [(Int, String)]
| findMatchingLines pat text = filter (isSubsequenceOf pat . snd) (zip [1..len] ls)
|   where
|     ls = lines text
|     len = length ls
| :}
>>> findMatchingLines "ell" "Hello, world!\nThis is a silly text file.\nHaskell is fun.\n"
[(1,"Hello, world!"),(3,"Haskell is fun.")]
>>> findMatchingLines "ello" "Hello, world!\nThis is a silly text file.\nHaskell is fun.\n"
[(1,"Hello, world!")]
>>> findMatchingLines "ll" "Hello, world!\nThis is a silly text file.\nHaskell is fun.\n"
[(1,"Hello, world!"),(2,"This is a silly text file."),(3,"Haskell is fun.")]
Our implementation is not very good though. It's inefficient, and both efficiency and elegance can be improved using infinite lists and lazy evaluation. The inefficiency stems from the following behaviour. To generate the list [1..len], we need to know the value of len. Calculating len requires that length traverse the entire list ls to count how many entries there are. Next, we walk down the list ls a second time, in order to zip it together with [1..len]. So the list ls is traversed twice.
Here's a much better implementation:
findMatchingLines :: String -> String -> [(Int, String)]
findMatchingLines pat text = filter (isSubsequenceOf pat . snd) (zip [1..] $ lines text)
Clearly, the double traversal of ls is gone. ls is used only once, as an argument to zip. Now we're zipping ls together with the infinite list [1..], so there is no need to count the number of entries in ls anymore. Since zip discards all elements of the longer list beyond the length of the shorter list, zip needs only a finite portion of the infinite list [1..] as long as ls is finite. With lazy evaluation, this means that if ls has length \(n\), then only the first \(n + 1\) nodes of the list [1..] are evaluated to WHNF, as we will see shortly. The remainder of [1..] remains unevaluated. We can observe this behaviour by using :sprint again:
>>> ints = [1..] :: [Int]
>>> findMatchingLines pat text = filter (isSubsequenceOf pat . snd) (zip ints $ lines text)
>>> findMatchingLines "ell" "Hello, world!\nThis is a silly text file.\nHaskell is fun.\n"
[(1,"Hello, world!"),(3,"Haskell is fun.")]
>>> :sprint ints
ints = 1 : 2 : 3 : 4 : _
:sprint clearly shows that only the first four list nodes of ints were evaluated. You may wonder why four, not three, nodes were evaluated. That's a result of the implementation of zip. Let's see how it evaluates the expression zip ints $ lines text in this example. We have ints = [1..] and lines text = ["Hello, world!", "This is a silly text file.", "Haskell is fun."]. So we call
zip [1..] ["Hello, world!", "This is a silly text file.", "Haskell is fun."]
This constructs the pair (1, "Hello, world!") and recursively calls
zip [2..] ["This is a silly text file.", "Haskell is fun."]
to zip the remaining list elements together. This constructs the pair (2, "This is a silly text file.") and recursively calls
zip [3..] ["Haskell is fun."]
This constructs the pair (3, "Haskell is fun.") and recursively calls
zip [4..] []
Now comes the explanation of why the list [4..] is not left unevaluated but is evaluated to WHNF, 4 : _. To check whether to produce an empty or non-empty list, zip needs to check whether both input lists are non-empty, and it does so for both its arguments in order. Thus, it evaluates its first argument [4..] to WHNF. Since this list is non-empty, it needs to inspect the second argument. The second input list is empty, so the result is []—we are done zipping the two lists together. By the time zip comes to this conclusion, however, it has already evaluated [4..] to WHNF.
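We can make this evaluation order visible with Debug.Trace. The following sketch (zipLoud is a name of our own, for illustration only) announces each time a node of the first list is forced to WHNF:

```haskell
import Debug.Trace (trace)

-- A zip variant that reports each time pattern matching forces a
-- node of its first argument to WHNF.
zipLoud :: [a] -> [b] -> [(a, b)]
zipLoud (x:xs) ys = trace "forced a node of the first list" $
  case ys of
    []       -> []
    (y:ys') -> (x, y) : zipLoud xs ys'
zipLoud []     _  = []

main :: IO ()
main = print (zipLoud [1 ..] "ab")
-- The result is [(1,'a'),(2,'b')], but the trace message appears
-- three times: the third forcing reveals 3 : _ just before zipLoud
-- discovers that the second list is empty.
```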
Let me be a bit more precise here. Calling findMatchingLines pat text does not itself lead to any splitting of text into lines to produce ls or to the evaluation of any list nodes of [1..]. Again, this is just one big expression, which may never be evaluated. The splitting of text into lines and the evaluation of list nodes of [1..] happen only when we need to know the entries of the list produced by filter, and even then we may not evaluate all of ls or [1..]. For example, if we are only interested in the first line that matches the search pattern, then lines will produce all lines up to the first line that contains pat and not produce any of the remaining entries in ls, and zip will also evaluate only as many list nodes of [1..] as necessary to pair with the lines produced by lines up to this point. All of this happens thanks to lazy evaluation. Again, :sprint confirms that this is exactly what happens:
>>> ints = [1..] :: [Int]
>>> findMatchingLines pat text = filter (isSubsequenceOf pat . snd) (zip ints $ lines text)
>>> head $ findMatchingLines "ill" "Hello, world!\nThis is a silly text file.\nHaskell is fun.\n"
(2,"This is a silly text file.")
>>> :sprint ints
ints = 1 : 2 : _
Here, we are interested only in the head of findMatchingLines "ill" "Hello, world!\nThis is a silly text file.\nHaskell is fun.\n", so this expression needs to be evaluated just enough to find the first element of the list. To do so, filter pulls elements from zip ints $ lines text until it sees the first element that matches the predicate isSubsequenceOf pat . snd. The first element in zip ints $ lines text is (1, "Hello, world!"), which does not match the predicate because pat = "ill". Thus, we need the next element, (2, "This is a silly text file."). This matches the predicate, so we are done. Up to this point, we have evaluated the first two elements in the list produced by zip ints $ lines text. Correspondingly, we have evaluated the first two elements of ints. The remainder of ints is unevaluated. That's exactly what :sprint tells us.
Other Examples
There are many other examples where infinite data structures are useful. Here are just two, without providing details of their implementation:
-
You are hopefully familiar with breadth-first search (BFS) and depth-first search (DFS) as strategies to explore a graph. In imperative languages, both are easy to implement in linear time. I am not aware of any purely functional linear-time implementation. Using a binary search tree to maintain the set of vertices already visited, we can achieve a running time of \(O(n \lg n)\). To achieve linear time, we need to resort to using the ST monad. As we will discuss in the chapter on monads, monads are a functional way of looking at side effects. Most monads we work with are merely abstractions that make it more elegant to express certain computations, such as ones that maintain some state, ones that involve exception handling, or ones that model non-deterministic computations, but these computations could be expressed without using monads. The two exceptions are IO and ST, which capture true side effects that cannot be expressed using only pure functions. If you embrace functional programming, then one reason is usually that purely functional code is much easier to test, and even to prove correct, than imperative code. Thus, it is desirable to keep the amount of code that uses the IO monad or ST monad to a minimum. Well-designed Haskell programs have a fairly small amount of "driver code" that uses the IO monad, and the rest is purely functional code.
For BFS and DFS, we could express the whole procedure using the ST monad, but there's a better way. The only part of BFS and DFS that needs the ST monad is maintaining the set of vertices that have been visited already, so we don't visit them again. Let's assume for now that we don't care whether we have visited a vertex before; we just visit it again. In that case, exploring our graph from a vertex leads us to all its neighbours, then to all their neighbours, and so on. In general, unless we have a directed acyclic graph, the resulting tree structure will be infinite because the children of every node in the tree are all its neighbours, whether we have seen them before or not. This infinite tree is easy to define purely functionally, and lazy evaluation means that it never gets built fully if we never inspect more than a finite portion of it. To obtain a BFS or DFS tree, we need to prune this tree. For DFS, we want to keep the leftmost occurrence of each vertex in the tree. All other occurrences are discarded, including the subtrees below them. For BFS, we want to keep the occurrence of each vertex that is closest to the root. Again, all other occurrences are discarded. These pruning procedures are fairly easy to implement using the ST monad. The important part is that by starting with an infinite tree that can be built purely functionally, we ensure that we use the ST monad only in the pruning procedure. Our code becomes easier to test because the actual graph exploration part is purely functional, and we can easily test whether the children of each node in the constructed tree are all its neighbours. Only testing that we prune the tree correctly requires us to deal with the complications of testing imperative code.
-
Infinite sequences are often the right objects to think about. Mathematically, it is much more meaningful to construct the sequence of all Fibonacci numbers or the sequence of all prime numbers instead of constructing only the first 100 or 1,000 elements in such a sequence. Having this sequence in hand, we can decide afterwards how many elements in this sequence we want to inspect. Our mini project in the next section shows how to generate the sequence of all Fibonacci numbers efficiently. We will see how to generate the sequence of all prime numbers once we have discussed the unfoldr function in the next chapter.
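To make the infinite-tree idea from the first example concrete, here is a minimal sketch. The names (walkTree, neighbours) and the three-vertex cyclic graph are our own assumptions for illustration; the pruning step that turns this into a real BFS or DFS tree is omitted, as described above:

```haskell
-- A rose tree, shaped like Data.Tree from the containers package.
data Tree a = Node a [Tree a] deriving Show

-- A small example graph as an adjacency function (an assumed
-- example; any neighbour function would do). It has cycles, so the
-- tree of walks below is infinite.
neighbours :: Int -> [Int]
neighbours 1 = [2, 3]
neighbours 2 = [1, 3]
neighbours 3 = [1, 2]
neighbours _ = []

-- The infinite tree of all walks starting at a vertex. Lazy
-- evaluation builds only the portion we actually inspect.
walkTree :: Int -> Tree Int
walkTree v = Node v (map walkTree (neighbours v))

-- Vertices grouped level by level, i.e. by walk length from the root.
levels :: Tree a -> [[a]]
levels t = go [t]
  where
    go []  = []
    go ts  = map root ts : go (concatMap children ts)
    root (Node x _)      = x
    children (Node _ cs) = cs

main :: IO ()
main = print (take 3 (levels (walkTree 1)))  -- [[1],[2,3],[1,3,1,2]]
```

Even though walkTree 1 describes an infinite tree, take 3 forces only its first three levels; repeated vertices such as the two occurrences of 1 on the third level are exactly what the pruning pass would remove.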