Infinite Data Structures
Lazy evaluation in Haskell allows us to define infinite data structures. We will discuss list comprehensions and ranges in detail in the next chapter. For the example in this section, it suffices to know that [1..] is the list of all positive integers. If you try to display this list in GHCi, you had better get ready to press Ctrl+C quickly because GHCi dutifully tries to print all integers in this list, a process that never ends.
>>> [1..]
[Output omitted, probably a few screens full]
On the other hand, if we assign this list to a variable, this is remarkably fast:
>>> ints = [1..]
That's because all we did here was assign the thunk representing the infinite list to the variable ints. Nothing has forced us to evaluate any part of this list yet. However, trying to inspect what ints looks like produces the same screens full of integers as before.
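What we can do safely is demand a finite prefix of the infinite list, for example with take. A quick sketch (firstTen is a name of our own choosing):

```haskell
-- take n xs evaluates only the first n nodes of xs, so an infinite
-- list is harmless as long as we demand a finite prefix of it.
firstTen :: [Int]
firstTen = take 10 [1 ..]

main :: IO ()
main = print firstTen  -- [1,2,3,4,5,6,7,8,9,10]
```

Only the first ten list nodes are ever evaluated; the rest of the list remains an unevaluated thunk.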
Why are infinite data structures useful? Let's look at an example.
Poor Man's Grep
You may be familiar with grep, a UNIX tool that allows us to look for lines in an input file that match a given regular expression. We won't implement regular expression search here, but we can implement a "poor man's grep", which looks for the occurrence of a given search string in each line. To start—this part does not need infinite lists yet—we need to break our input into lines, and we need a function that tests whether our search string occurs in a given line. The functions we're looking for are lines and isSubsequenceOf.
>>> lines "Hello, world!\nThis is a silly text file.\nHaskell is fun.\n"
["Hello, world!","This is a silly text file.","Haskell is fun."]
So lines text returns a list of strings, each a line of text with the trailing newline character stripped off. Note that isSubsequenceOf tests whether its first argument occurs as a subsequence of its second, that is, whether its elements appear in order but not necessarily contiguously. This is a slightly weaker test than substring search (Data.List also provides isInfixOf for contiguous substrings), but for the search strings in our examples the two tests agree. And it is not limited to strings: it works for arbitrary lists, as long as the elements can be tested for equality:
>>> import Data.List
>>> :t isSubsequenceOf
isSubsequenceOf :: Eq a => [a] -> [a] -> Bool
>>> [3,4] `isSubsequenceOf` [1,5,3,4,2]
True
>>> [3,4] `isSubsequenceOf` [1,5,3,2,2]
False
>>> "ell" `isSubsequenceOf` "Hello"
True
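To see where the subsequence test and a true substring test differ, here is a small sketch comparing isSubsequenceOf with Data.List's isInfixOf:

```haskell
import Data.List (isInfixOf, isSubsequenceOf)

main :: IO ()
main = do
  -- "Hlo" occurs in "Hello" as a subsequence (H, l, o appear in order)...
  print ("Hlo" `isSubsequenceOf` "Hello")  -- True
  -- ...but not as a contiguous substring:
  print ("Hlo" `isInfixOf` "Hello")        -- False
  -- For the search strings in this section, both tests agree:
  print ("ell" `isSubsequenceOf` "Hello")  -- True
  print ("ell" `isInfixOf` "Hello")        -- True
```

For a faithful grep we would use isInfixOf; this section sticks with isSubsequenceOf, which behaves identically on the examples shown here.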
Our poor man's grep function returns the list of lines in the input that contain the given search string:
findMatchingLines :: String -> String -> [String]
findMatchingLines pat text = filter (isSubsequenceOf pat) (lines text)
>>> findMatchingLines pat text = filter (isSubsequenceOf pat) (lines text)
>>> findMatchingLines "ell" "Hello, world!\nThis is a silly text file.\nHaskell is fun.\n"
["Hello, world!","Haskell is fun."]
>>> findMatchingLines "ello" "Hello, world!\nThis is a silly text file.\nHaskell is fun.\n"
["Hello, world!"]
>>> findMatchingLines "ll" "Hello, world!\nThis is a silly text file.\nHaskell is fun.\n"
["Hello, world!","This is a silly text file.","Haskell is fun."]
grep also provides an option -n, which shows not only the matching line but also its line number. Let's add this functionality to findMatchingLines. We want the return type to be [(Int, String)], containing for each matching line the number of the line and the line itself. Here's a first attempt:
findMatchingLines :: String -> String -> [(Int, String)]
findMatchingLines pat text = filter (isSubsequenceOf pat . snd) (zip [1..len] ls)
  where
    ls = lines text
    len = length ls
This uses a range: [x..y] is the list of all integers from x to y. Ranges are actually more powerful: they also work for characters and other enumerable types, but here, we're working with integers.
What the function does is break text into its lines. The list of lines is stored in ls. We calculate the length of ls, the number of lines, and store the result in len. Next we zip the list of integers between 1 and len together with ls. Here's the zip function:
zip :: [a] -> [b] -> [(a,b)]
zip (x:xs) (y:ys) = (x,y) : zip xs ys
zip _      _      = []
It takes two lists as arguments and produces a list of pairs where the first element of the first list is paired with the first element of the second list, the second element of the first list is paired with the second element of the second list, and so on until one of the two lists runs out of elements:
>>> zip [1,2,3] ["Hello","world","some","more","text"]
[(1,"Hello"),(2,"world"),(3,"some")]
Note that "more" and "text" were discarded because there were only three elements in the list [1,2,3].
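This truncation has a pleasant consequence that we rely on below: the longer of the two lists may even be infinite. A quick sketch:

```haskell
main :: IO ()
main =
  -- zip stops at the end of the shorter list, so pairing a finite
  -- list with the infinite list [1..] terminates:
  print (zip [1 ..] ["Hello", "world"])  -- [(1,"Hello"),(2,"world")]
```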
So, what zip [1..len] ls does is pair every line of text, every element of ls, with its line number. We then filter this list: the output should contain every pair of line number and line where the line contains pat. To implement this, we use a helper function for pairs:
snd :: (a, b) -> b
snd (_, y) = y
It simply returns the second element of the pair. Thus, isSubsequenceOf pat . snd applies the test isSubsequenceOf pat to the second component of the pair, that is, it tests whether the line contains pat.
We can try this out. It behaves as expected:
>>> :{
| findMatchingLines :: String -> String -> [(Int, String)]
| findMatchingLines pat text = filter (isSubsequenceOf pat . snd) (zip [1..len] ls)
|   where
|     ls = lines text
|     len = length ls
| :}
>>> findMatchingLines "ell" "Hello, world!\nThis is a silly text file.\nHaskell is fun.\n"
[(1,"Hello, world!"),(3,"Haskell is fun.")]
>>> findMatchingLines "ello" "Hello, world!\nThis is a silly text file.\nHaskell is fun.\n"
[(1,"Hello, world!")]
>>> findMatchingLines "ll" "Hello, world!\nThis is a silly text file.\nHaskell is fun.\n"
[(1,"Hello, world!"),(2,"This is a silly text file."),(3,"Haskell is fun.")]
Our implementation is not very good though. It's inefficient, and both efficiency and elegance can be improved using infinite lists and lazy evaluation. The inefficiency stems from the following behaviour. To generate the list [1..len], we need to know the value of len. Calculating len requires that length traverse the entire list ls to count how many entries there are. Next, we walk down the list ls a second time, in order to zip it together with [1..len]. So the list ls is traversed twice.
Here's a much better implementation:
findMatchingLines :: String -> String -> [(Int, String)]
findMatchingLines pat text = filter (isSubsequenceOf pat . snd) (zip [1..] $ lines text)
Clearly, the double traversal of ls is gone. ls is used only once, as an argument to zip. Now we're zipping ls together with the infinite list [1..], so there is no need to count the number of entries in ls anymore. Since zip discards all elements of the longer list beyond the length of the shorter list, zip needs only a finite portion of the infinite list [1..] as long as ls is finite. With lazy evaluation, this means that if ls has length \(n\), then only the first \(n + 1\) nodes of the list [1..] are evaluated to WHNF, as we will see shortly. The remainder of [1..] remains unevaluated. We can observe this behaviour by using :sprint again:
>>> ints = [1..] :: [Int]
>>> findMatchingLines pat text = filter (isSubsequenceOf pat . snd) (zip ints $ lines text)
>>> findMatchingLines "ell" "Hello, world!\nThis is a silly text file.\nHaskell is fun.\n"
[(1,"Hello, world!"),(3,"Haskell is fun.")]
>>> :sprint ints
ints = 1 : 2 : 3 : 4 : _
:sprint clearly shows that only the first four list nodes of ints were evaluated. You may wonder why four, not three, nodes were evaluated. That's a result of the implementation of zip. Let's see how it evaluates the expression zip ints $ lines text in this example. We have ints = [1..] and lines text = ["Hello, world!", "This is a silly text file.", "Haskell is fun."]. So we call
zip [1..] ["Hello, world!", "This is a silly text file.", "Haskell is fun."]
This constructs the pair (1, "Hello, world!") and recursively calls
zip [2..] ["This is a silly text file.", "Haskell is fun."]
to zip the remaining list elements together. This constructs the pair (2, "This is a silly text file.") and recursively calls
zip [3..] ["Haskell is fun."]
This constructs the pair (3, "Haskell is fun.") and recursively calls
zip [4..] []
Now comes the explanation of why the list [4..] is not left unevaluated but is evaluated to WHNF, 4 : _. To check whether to produce an empty or non-empty list, zip needs to check whether both input lists are non-empty, and it does so for both its arguments in order. Thus, it evaluates its first argument [4..] to WHNF. Since this list is non-empty, it needs to inspect the second argument. The second input list is empty, so the result is []—we are done zipping the two lists together. By the time zip comes to this conclusion, however, it has already evaluated [4..] to WHNF.
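We can make this evaluation order visible with Debug.Trace. The following sketch (zipLoud is a name of our own, for illustration only) announces each time a node of the first list is forced to WHNF:

```haskell
import Debug.Trace (trace)

-- A zip variant that reports each time pattern matching forces a
-- node of its first argument to WHNF.
zipLoud :: [a] -> [b] -> [(a, b)]
zipLoud (x:xs) ys = trace "forced a node of the first list" $
  case ys of
    []       -> []
    (y:ys') -> (x, y) : zipLoud xs ys'
zipLoud []     _  = []

main :: IO ()
main = print (zipLoud [1 ..] "ab")
-- The result is [(1,'a'),(2,'b')], but the trace message appears
-- three times: the third forcing reveals 3 : _ just before zipLoud
-- discovers that the second list is empty.
```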
Let me be a bit more precise here. Calling findMatchingLines pat text does not itself lead to any splitting of text into lines to produce ls or to the evaluation of any list nodes of [1..]. Again, this is just one big expression, which may never be evaluated. The splitting of text into lines and the evaluation of list nodes of [1..] happen only when we need to know the entries of the list produced by filter, and even then we may not evaluate all of ls or [1..]. For example, if we are only interested in the first line that matches the search pattern, then lines will produce all lines up to the first line that contains pat and not produce any of the remaining entries in ls, and zip will also evaluate only as many list nodes of [1..] as necessary to pair with the lines produced by lines up to this point. All of this happens thanks to lazy evaluation. Again, :sprint confirms that this is exactly what happens:
>>> ints = [1..] :: [Int]
>>> findMatchingLines pat text = filter (isSubsequenceOf pat . snd) (zip ints $ lines text)
>>> head $ findMatchingLines "ill" "Hello, world!\nThis is a silly text file.\nHaskell is fun.\n"
(2,"This is a silly text file.")
>>> :sprint ints
ints = 1 : 2 : _
Here, we are interested only in the head of findMatchingLines "ill" "Hello, world!\nThis is a silly text file.\nHaskell is fun.\n", so this expression needs to be evaluated just enough to find the first element of the list. To do so, filter pulls elements from zip ints $ lines text until it sees the first element that matches the predicate isSubsequenceOf pat . snd. The first element in zip ints $ lines text is (1, "Hello, world!"), which does not match the predicate because pat = "ill". Thus, we need the next element, (2, "This is a silly text file."). This matches the predicate, so we are done. Up to this point, we have evaluated the first two elements in the list produced by zip ints $ lines text. Correspondingly, we have evaluated the first two elements of ints. The remainder of ints is unevaluated. That's exactly what :sprint tells us.
Other Examples
There are many other examples where infinite data structures are useful. Here are just two, without providing details of their implementation:
-
You are hopefully familiar with breadth-first search (BFS) and depth-first search (DFS) as strategies to explore a graph. In imperative languages, both are easy to implement in linear time. I am not aware of any purely functional linear-time implementation. Using a binary search tree to maintain the set of vertices already visited, we can achieve a running time of \(O(n \lg n)\). To achieve linear time, we need to resort to using the ST monad. As we will discuss in the chapter on monads, monads are a functional way of looking at side effects. Most monads we work with are merely abstractions that make it more elegant to express certain computations, such as ones that maintain some state, ones that involve exception handling, or ones that model non-deterministic computations, but these computations could be expressed without using monads. The two exceptions are IO and ST, which capture true side effects that cannot be expressed using only pure functions. If you embrace functional programming, then one reason is usually that purely functional code is much easier to test, and even to prove correct, than imperative code. Thus, it is desirable to keep the amount of code that uses the IO monad or ST monad to a minimum. Well-designed Haskell programs have a fairly small amount of "driver code" that uses the IO monad, and the rest is purely functional code.
For BFS and DFS, we could express the whole procedure using the ST monad, but there's a better way. The only part of BFS and DFS that needs the ST monad is maintaining the set of vertices that have been visited already, so we don't visit them again. Let's assume for now that we don't care whether we have visited a vertex before; we just visit it again. In that case, exploring our graph from a vertex leads us to all its neighbours, then to all their neighbours, and so on. In general, unless we have a directed acyclic graph, the resulting tree structure will be infinite because the children of every node in the tree are all its neighbours, whether we have seen them before or not. This infinite tree is easy to define purely functionally, and lazy evaluation means that it never gets built fully if we never inspect more than a finite portion of it. To obtain a BFS or DFS tree, we need to prune this tree. For DFS, we want to keep the leftmost occurrence of each vertex in the tree. All other occurrences are discarded, including the subtrees below them. For BFS, we want to keep the occurrence of each vertex that is closest to the root. Again, all other occurrences are discarded. These pruning procedures are fairly easy to implement using the ST monad. The important part is that by starting with an infinite tree that can be built purely functionally, we ensure that we use the ST monad only in the pruning procedure. Our code becomes easier to test because the actual graph exploration part is purely functional, and we can easily test whether the children of each node in the constructed tree are all its neighbours. Only testing that we prune the tree correctly requires us to deal with the complications of testing imperative code.
-
Infinite sequences are often the right objects to think about. Mathematically, it is much more meaningful to construct the sequence of all Fibonacci numbers or the sequence of all prime numbers instead of constructing only the first 100 or 1,000 elements in such a sequence. Having this sequence in hand, we can decide afterwards how many elements in this sequence we want to inspect. Our mini project in the next section shows how to generate the sequence of all Fibonacci numbers efficiently. We will see how to generate the sequence of all prime numbers once we have discussed the unfoldr function in the next chapter.
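To make the infinite-tree idea from the first example concrete, here is a minimal sketch. The names (walkTree, neighbours) and the three-vertex cyclic graph are our own assumptions for illustration; the pruning step that turns this into a real BFS or DFS tree is omitted, as described above:

```haskell
-- A rose tree, shaped like Data.Tree from the containers package.
data Tree a = Node a [Tree a] deriving Show

-- A small example graph as an adjacency function (an assumed
-- example; any neighbour function would do). It has cycles, so the
-- tree of walks below is infinite.
neighbours :: Int -> [Int]
neighbours 1 = [2, 3]
neighbours 2 = [1, 3]
neighbours 3 = [1, 2]
neighbours _ = []

-- The infinite tree of all walks starting at a vertex. Lazy
-- evaluation builds only the portion we actually inspect.
walkTree :: Int -> Tree Int
walkTree v = Node v (map walkTree (neighbours v))

-- Vertices grouped level by level, i.e. by walk length from the root.
levels :: Tree a -> [[a]]
levels t = go [t]
  where
    go []  = []
    go ts  = map root ts : go (concatMap children ts)
    root (Node x _)      = x
    children (Node _ cs) = cs

main :: IO ()
main = print (take 3 (levels (walkTree 1)))  -- [[1],[2,3],[1,3,1,2]]
```

Even though walkTree 1 describes an infinite tree, take 3 forces only its first three levels; repeated vertices such as the two occurrences of 1 on the third level are exactly what the pruning pass would remove.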