Infinite Data Structures
Lazy evaluation in Haskell allows us to define infinite data structures. We will
discuss ranges and list comprehensions in detail in the next chapter. For the example in
this section, it suffices to know that [1..] is the list of all positive
integers. If you try to display this list in GHCi, you'd better get ready to press
Ctrl+C quickly because GHCi dutifully tries to print all integers in this
list, a process that never ends.
>>> [1..]
[Output omitted, probably a few screens full]
On the other hand, if we assign this list to a variable, this is remarkably fast:
>>> ints = [1..]
That's because all we did here was to assign the thunk representing the infinite
list to the variable ints. Nothing has forced us yet to evaluate any part of
this list. However, trying to inspect what ints looks like produces the same
screens full of integers as before.
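We can, however, safely inspect a finite prefix of ints. For example, the Prelude
function take forces only as many list nodes as we ask for:
>>> take 10 ints
[1,2,3,4,5,6,7,8,9,10]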
Why are infinite data structures useful? Let's look at an example.
Poor Man's Grep
You may be familiar with grep, a UNIX tool that allows us to look for lines in
an input file that match a given regular expression. We won't implement regular
expression search here, but we can implement a "poor man's grep", which
looks for the occurrence of a given string as a substring of each line. To
start—this part does not need infinite lists yet—we need to break our input into
lines, and we need a function that tests whether our search string is a
substring of a given line. The functions we're looking for are lines and
isInfixOf.
>>> lines "Hello, world!\nThis is a silly text file.\nHaskell is fun.\n"
["Hello, world!","This is a silly text file.","Haskell is fun."]
So lines text returns a list of strings, each a line of text with the
trailing newline character stripped off. isInfixOf is not called
isSubstringOf because it does in fact work for arbitrary lists, as long as the
elements can be tested for equality:
>>> import Data.List
>>> :t isInfixOf
isInfixOf :: Eq a => [a] -> [a] -> Bool
>>> [3,4] `isInfixOf` [1,5,3,4,2]
True
>>> [3,4] `isInfixOf` [1,5,3,2,2]
False
>>> "ell" `isInfixOf` "Hello"
True
Our poor man's grep function returns the list of lines in the input that
contain the given search string:
findMatchingLines :: String -> String -> [String]
findMatchingLines pat text = filter (isInfixOf pat) (lines text)
>>> findMatchingLines pat text = filter (isInfixOf pat) (lines text)
>>> findMatchingLines "ell" "Hello, world!\nThis is a silly text file.\nHaskell is fun.\n"
["Hello, world!","Haskell is fun."]
>>> findMatchingLines "ello" "Hello, world!\nThis is a silly text file.\nHaskell is fun.\n"
["Hello, world!"]
>>> findMatchingLines "ll" "Hello, world!\nThis is a silly text file.\nHaskell is fun.\n"
["Hello, world!","This is a silly text file.","Haskell is fun."]
grep also provides an option -n, which not only shows the matching line
but also its line number. Let's add this functionality to findMatchingLines.
We want the return type to be [(Int, String)], containing, for each matching
line, its line number and the line itself. Here's a first attempt:
findMatchingLines :: String -> String -> [(Int, String)]
findMatchingLines pat text = filter (isInfixOf pat . snd) (zip [1..len] ls)
  where
    ls = lines text
    len = length ls
This uses another range: [x..y] is the list of all integers
between x and y. Actually, ranges are more general: they also work
for characters and other enumerable types, but here, we're working with
integers.
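For example, ranges work for integers and characters alike:
>>> [1..5]
[1,2,3,4,5]
>>> ['a'..'e']
"abcde"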
What the function does is to break text into its lines. The list of lines is
stored in ls. We calculate the length of ls, the number of lines, and store
the result in len. Next we zip the list of integers between 1 and len
together with ls. Here's the zip function:
zip :: [a] -> [b] -> [(a,b)]
zip (x:xs) (y:ys) = (x,y) : zip xs ys
zip [] _ = []
zip _ [] = []
It takes two lists as arguments and produces a list of pairs where the first element of the first list is paired with the first element of the second list, the second element of the first list is paired with the second element of the second list, and so on until one of the two lists runs out of elements:
>>> zip [1,2,3] ["Hello","world","some","more","text"]
[(1,"Hello"),(2,"world"),(3,"some")]
Note that "more" and "text" were discarded because there were only three
elements in the list [1,2,3].
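The same thing happens if the first list is the longer one:
>>> zip [1,2,3] "ab"
[(1,'a'),(2,'b')]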
So, what zip [1..len] ls does is to pair every line of text, every element
of ls, with its line number. We then filter this list. The output should
contain every line number-line pair where the line contains pat as a
substring. To implement this, we use a helper function for pairs:
snd :: (a, b) -> b
snd (_, y) = y
It simply returns the second element of the pair. Thus,
isInfixOf pat . snd applies the test isInfixOf pat to the second
component of the pair, that is, it tests whether the line contains pat as a
substring.
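For instance, applied to a single line number-line pair (with Data.List imported
as before):
>>> (isInfixOf "ell" . snd) (1, "Hello, world!")
True
>>> (isInfixOf "ell" . snd) (2, "This is a silly text file.")
False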
We can try this out. It behaves as expected:
>>> :{
| findMatchingLines :: String -> String -> [(Int, String)]
| findMatchingLines pat text = filter (isInfixOf pat . snd) (zip [1..len] ls)
|   where
|     ls = lines text
|     len = length ls
| :}
>>> findMatchingLines "ell" "Hello, world!\nThis is a silly text file.\nHaskell is fun.\n"
[(1,"Hello, world!"),(3,"Haskell is fun.")]
>>> findMatchingLines "ello" "Hello, world!\nThis is a silly text file.\nHaskell is fun.\n"
[(1,"Hello, world!")]
>>> findMatchingLines "ll" "Hello, world!\nThis is a silly text file.\nHaskell is fun.\n"
[(1,"Hello, world!"),(2,"This is a silly text file."),(3,"Haskell is fun.")]
Our implementation is not very good though. It's inefficient, and both
efficiency and elegance can be improved using infinite lists and lazy
evaluation. The inefficiency stems from the following behaviour. To generate the
list [1..len], we need to know the value of len. Calculating len requires
that length traverses the entire list ls to count how many entries there are.
Next, we walk down the list ls a second time, in order to zip it together with
[1..len]. So the list ls is traversed twice.
Here's a much better implementation:
findMatchingLines :: String -> String -> [(Int, String)]
findMatchingLines pat text = filter (isInfixOf pat . snd) (zip [1..] $ lines text)
Clearly, the double traversal of ls is gone. ls is used only once, as an
argument to zip. Now we're zipping ls together with the infinite list
[1..], so there is no need to count the number of entries in ls anymore.
Since zip discards all elements of the longer list beyond the length of the
shorter list, zip needs only a finite portion of the infinite list [1..] as
long as ls is finite. With lazy evaluation, this means that if ls has length
\(n\), then only about the first \(n\) nodes of the list [1..] are evaluated to WHNF.
The remainder of [1..] remains unevaluated. We can observe this behaviour by
using :sprint again:
>>> ints = [1..] :: [Int]
>>> findMatchingLines pat text = filter (isInfixOf pat . snd) (zip ints $ lines text)
>>> findMatchingLines "ell" "Hello, world!\nThis is a silly text file.\nHaskell is fun.\n"
[(1,"Hello, world!"),(3,"Haskell is fun.")]
>>> :sprint ints
ints = 1 : 2 : 3 : 4 : _
:sprint clearly shows that only the first four list nodes of ints were
evaluated. You may wonder why four, not three, nodes were evaluated. That's a
result of the implementation of zip. Let's see how it evaluates the expression
zip ints $ lines text in this example. We have ints = [1..] and lines text
= ["Hello, world!", "This is a silly text file.", "Haskell is fun."]. So we call
zip [1..] ["Hello, world!", "This is a silly text file.", "Haskell is fun."]
This constructs the pair (1, "Hello, world!") and recursively calls
zip [2..] ["This is a silly text file.", "Haskell is fun."]
to zip the remaining list elements together. This constructs the pair
(2, "This is a silly text file.") and recursively calls
zip [3..] ["Haskell is fun."]
This constructs the pair (3, "Haskell is fun.") and recursively calls
zip [4..] []
Now comes the explanation of why the list [4..] is not left unevaluated but is
evaluated to WHNF, 4 : _. To check whether to produce an empty or non-empty
list, zip needs to check whether both input lists are non-empty, and it inspects
its arguments in order. Thus, it evaluates its first argument
[4..] to WHNF. Since this list is non-empty, it needs to inspect the second
argument. The second input list is empty, so the result is []—we are done
zipping the two lists together. By the time zip comes to this conclusion,
however, it has already evaluated [4..] to WHNF.
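If you want to see that this extra node really comes from the order in which
zip inspects its arguments, a small side experiment (not needed for
findMatchingLines) is to zip the two lists in the opposite order. Now zip
discovers that the first list is exhausted before it ever looks at the fourth
node of the number list:
>>> ints = [1..] :: [Int]
>>> filter (isInfixOf "ell" . fst) (zip (lines "Hello, world!\nThis is a silly text file.\nHaskell is fun.\n") ints)
[("Hello, world!",1),("Haskell is fun.",3)]
>>> :sprint ints
ints = 1 : 2 : 3 : _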
Let me be a bit more precise here. Calling findMatchingLines pat text does not
itself lead to any splitting of text into lines to produce ls or to the
evaluation of any list nodes of [1..]. Again, this is just one big expression,
which may never be evaluated. The splitting of text into lines and the
evaluation of list nodes of [1..] happens only when we need to know the
entries of the list produced by filter, and even then we may not evaluate all
of ls or [1..]. For example, if we are only interested in the first line
that matches the search pattern, then lines will produce all lines up to the
first line that contains pat and not produce any of the remaining entries in
ls, and zip will also evaluate only as many list nodes of [1..] as
necessary to pair with the lines produced by lines up to this point. All of
this happens thanks to lazy evaluation. Again, :sprint confirms that this is
exactly what happens:
>>> ints = [1..] :: [Int]
>>> findMatchingLines pat text = filter (isInfixOf pat . snd) (zip ints $ lines text)
>>> head $ findMatchingLines "ill" "Hello, world!\nThis is a silly text file.\nHaskell is fun.\n"
(2,"This is a silly text file.")
>>> :sprint ints
ints = 1 : 2 : _
Here, we are interested only in the head of
findMatchingLines "ill" "Hello, world!\nThis is a silly text file.\nHaskell is fun.\n",
so this expression needs to be evaluated just enough to find the first element
of the list. To do so, filter pulls elements from zip ints $ lines text
until it sees the first element that matches the predicate
isInfixOf pat . snd. The first element in zip ints $ lines text is
(1, "Hello, world!"), which does not match the predicate because
pat = "ill". Thus, we need the next element,
(2, "This is a silly text file."). This matches the pattern, so we are done.
Up to this point, we have evaluated the first two elements in the list produced
by zip ints $ lines text. Correspondingly, we have evaluated the first two
elements of ints. The remainder of ints is unevaluated. That's exactly what
:sprint tells us.
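Lazy evaluation even lets findMatchingLines consume an input text that never
ends. As a final, admittedly artificial illustration, here we search an infinite
text built with cycle, using the definition from the session above, and head
still returns as soon as the first match is found:
>>> head $ findMatchingLines "fun" (cycle "Hello, world!\nHaskell is fun.\n")
(2,"Haskell is fun.")
With eager evaluation, lines would try to split the entire infinite string
before anything else could happen.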
Other Examples
There are many other examples where infinite data structures are useful. Here are just two, without providing details of their implementation:
- You are hopefully familiar with breadth-first search (BFS) and depth-first search (DFS)
  as strategies to explore a graph. In imperative languages, both are easy to implement in
  linear time. I am not aware of any purely functional linear-time implementation. Using a
  binary search tree to maintain the set of vertices already visited, we can achieve a
  running time of \(O(n \lg n)\). To achieve linear time, we need to resort to using the
  ST monad. As we will discuss in the chapter on monads, monads are a functional way of
  looking at side effects. Most monads we work with are merely abstractions that make it
  more elegant to express certain computations, such as ones that maintain some state,
  ones that involve exception handling, or ones that model non-deterministic computations,
  but these computations could be expressed without using monads. The two exceptions are
  IO and ST, which capture true side effects that cannot be expressed using only pure
  functions. If you embrace functional programming, then one reason is usually that purely
  functional code is much easier to test, and even to prove correct, than imperative code.
  Thus, it is desirable to keep the amount of code that uses the IO monad or ST monad to a
  minimum. Well-designed Haskell programs have a fairly small amount of "driver code" that
  uses the IO monad, and the rest is purely functional code. For BFS and DFS, we could
  express the whole procedure using the ST monad, but there's a better way. The only part
  of BFS and DFS that needs the ST monad is maintaining the set of vertices that have been
  visited already, so we don't visit them again. Let's assume for now that we don't care
  whether we have visited a vertex before; we just visit it again. In that case, exploring
  our graph from a vertex leads us to all its neighbours, then to all their neighbours,
  and so on. In general, unless we have a directed acyclic graph, the resulting tree
  structure will be infinite because the children of every node in the tree are all its
  neighbours, whether we have seen them before or not. This infinite tree is easy to
  define purely functionally (see the sketch after this list), and lazy evaluation means
  that it never gets built fully if we never inspect more than a finite portion of it. To
  obtain a BFS or DFS tree, we need to prune this tree. For DFS, we want to keep the
  leftmost occurrence of each vertex in the tree. All other occurrences are discarded,
  including the subtrees below them. For BFS, we want to keep the occurrence of each
  vertex that is closest to the root. Again, all other occurrences are discarded. These
  pruning procedures are fairly easy to implement using the ST monad. The important part
  is that by starting with an infinite tree that can be built purely functionally, we
  ensure that we use the ST monad only in the pruning procedure. Our code becomes easier
  to test because the actual graph exploration part is purely functional, and we can
  easily test whether the children of each node in the constructed tree are all its
  neighbours. Only testing that we prune the tree correctly requires us to deal with the
  complications of testing imperative code.

- Infinite sequences are often the right objects to think about. Mathematically, it is
  much more meaningful to construct the sequence of all Fibonacci numbers or the sequence
  of all prime numbers instead of constructing only the first 100 or 1,000 elements in
  such a sequence. Having this sequence in hand, we can decide afterwards how many
  elements in this sequence we want to inspect. Our mini project in the next section shows
  how to generate the sequence of all Fibonacci numbers efficiently. We will see how to
  generate the sequence of all prime numbers once we have discussed the unfoldr function
  in the next chapter.
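To make the graph example a little more concrete, here is a minimal sketch of its purely
functional part: building the infinite exploration tree. The representation is an
assumption made for illustration only (a graph as an adjacency map, with Tree, Graph,
and explore being names chosen for this sketch); the pruning step that needs the ST
monad is omitted.

import Data.Map (Map)
import qualified Data.Map as Map

-- A rose tree: a node's label and the subtrees rooted at its children.
data Tree a = Node a [Tree a]

-- Assumed representation: every vertex maps to the list of its neighbours.
type Graph = Map Int [Int]

-- The (usually infinite) exploration tree rooted at v: the children of every
-- node are all of its neighbours, with no check for earlier visits. Lazy
-- evaluation ensures that only the portion we actually inspect is ever built.
explore :: Graph -> Int -> Tree Int
explore g v = Node v (map (explore g) (Map.findWithDefault [] v g))

Pruning this tree into an actual DFS or BFS tree is where the ST monad enters the
picture, exactly as described in the first example above.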