Lazy I/O

We have already had a brief look at lazy I/O, but it deserves a separate section. Let's recap what we have learned so far:

  • I/O functions in Haskell are lazy, just like all other functions. They produce their results only when we inspect these results.

  • This lets us read file contents incrementally without manually reading the file chunk by chunk, which is what other languages require to avoid loading the whole file into memory.

  • Lazy I/O interacts poorly with hClose because we cannot interact with a closed file handle anymore, not even to lazily read file contents that were logically retrieved before we closed the file.

  • hGetContents file puts file into a "semi-closed" state. That's a state where the file isn't actually closed yet: lazily retrieving more data from the file works. As far as all other file operations are concerned, the file is considered closed.

Lazy I/O is probably the wartiest part of Haskell, one that has attracted lots of discussion. The current consensus seems to be that lazy I/O, just like laziness in general, enables a lot of useful idioms that require much more careful orchestration in languages without lazy evaluation or lazy I/O. Thus, there is little appetite for changing the lazy behaviour of Haskell's I/O functions, and newcomers to the language simply have to learn how to manage files correctly so that they don't get bitten by the quirks of lazy I/O.

Most of the time, everything will work just fine, as long as you train yourself to never close files yourself and instead trust the garbage collector to close files for you.

If you only want to learn Haskell enough to make it through this course, then you can stop reading here and move on to the next section. If you want to learn more about how to tame lazy I/O, then keep reading. The remainder of this section is not so much about lazy I/O itself but how to make sure I/O isn't lazy when that's what you really need.

Let's start with the three sledgehammers. getContents, hGetContents, and readFile have three close relatives, called getContents', hGetContents', and readFile'. It is common practice in the standard library to add an apostrophe to the end of a function name if the function does the same as the version without the apostrophe but evaluates its result eagerly. We saw this before when discussing the difference between foldl and foldl'. Here, just as hGetContents file reads the contents of file, so does hGetContents' file, but the latter does so eagerly: it forces the entire file to be read into memory before hGetContents' returns. In particular, it is perfectly safe to close file immediately afterwards and then work with the file contents returned by hGetContents', something that didn't work when using hGetContents:

GHCi
>>> :{
  | do
  |     h <- openFile "mantra.txt" ReadMode
  |     txt <- hGetContents' h
  |     hClose h
  |     putStr txt
  | :}
I will learn to program in Haskell!
I will learn to program in Haskell!
I will learn to program in Haskell!

The downside is that we now use as much memory as is needed to store the entire file contents, exactly the problem that lazy I/O helped us avoid.

getContents' does the same but for stdin as the input file: it eagerly reads all of the data available on stdin, whereas getContents would read this data lazily. Similarly, readFile' reads an entire file, just as readFile would do, but it loads the content of the whole file into memory before returning.
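
As a small illustration of why the eager variants make it safe to reuse a file right away, here is a sketch that reads a file with readFile' and immediately overwrites the same file. The helper shout and the file name eager-demo.txt are made up for this example, and the code assumes a version of base (4.15 or later, that is, GHC 9.0 or later) in which System.IO exports the primed functions.

```haskell
import Data.Char (toUpper)
import System.IO (readFile')

-- | Hypothetical helper: replace the contents of a file with an
-- upper-case copy of itself. readFile' has read everything and closed
-- the file before it returns, so writing to the same path right
-- afterwards is safe.
shout :: FilePath -> IO ()
shout path = do
    txt <- readFile' path
    writeFile path (map toUpper txt)

main :: IO ()
main = do
    -- "eager-demo.txt" is a file name invented for this sketch.
    writeFile "eager-demo.txt" "I will learn to program in Haskell!\n"
    shout "eager-demo.txt"
    readFile' "eager-demo.txt" >>= putStr
```

If shout used the lazy readFile instead, the call to writeFile would be expected to fail, because the file would still be open for reading at that point.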

Often, it isn't necessary to read the entire text into memory before closing the file. Let's say we want to read a file full of numbers, add them together, and then move on to the next phase in our program, which we assume requires the file to be closed. Let's build up the solution incrementally. We'll start with putting some numbers to be summed into a file:

GHCi
>>> import Control.Monad
>>> import System.Random
>>> xs <- replicateM 10000 (randomRIO (1,1000)) :: IO [Int]
>>> writeFile "numbers.txt" (unlines $ map show xs)

This randomly generates 10,000 integers between 1 and 1,000 and writes them to a file numbers.txt, one number per line. Since these numbers are generated randomly, the sum values you will see when you try this are likely different from the ones I obtained in my runs.

Here's our first attempt at summing the numbers in the file numbers.txt:

SumFile.hs
import System.IO

main :: IO ()
main = do
    xs <- map read . lines <$> readFile "numbers.txt"
    let total = sum xs
    -- Below this line is where we expect "numbers.txt" to be closed.
    print total
GHCi
>>> :l SumFile.hs
[1 of 1] Compiling Main             ( SumFile.hs, interpreted )
Ok, one module loaded.
>>> main
4991785

This does what we expect, but it does not meet the requirement that the file numbers.txt is closed once we reach the print total statement. The problem is that readFile is lazy, as is sum. Thus, sum requests the elements of xs to produce the value of total only once we ask for the value of total by printing it. The file gets closed only once we have read all elements of xs from the file.

Let's test that numbers.txt really isn't closed by the time we reach print total. To do so, we use a little trick: We can't open a file for writing if the file is already open (but the same file can be opened for reading multiple times). Thus, we'll add a line that tries to open numbers.txt for writing and see what happens:

SumFile.hs (Edit)
import System.IO

main :: IO ()
main = do
    xs <- map read . lines <$> readFile "numbers.txt"
    let total = sum xs
    -- Below this line is where we expect "numbers.txt" to be closed.
    h' <- openFile "numbers.txt" WriteMode
    print total
GHCi
>>> :r
[1 of 1] Compiling Main             ( SumFile.hs, interpreted )
Ok, one module loaded.
>>> main
*** Exception: numbers.txt: openFile: resource busy (file is locked)

As expected, we get an exception when trying to open numbers.txt for writing, which proves that it is still open for reading.

You may wonder why we couldn't have used some function from System.IO to ask whether the file is still open. There does exist such a function, called hIsOpen. As the prefix h suggests, it requires a file handle as argument. readFile does not provide us with a file handle, but we can expand its definition to gain access to the file handle. readFile is defined as

readFile name = openFile name ReadMode >>= hGetContents

So instead of calling readFile, we can use openFile to open the file manually, read its contents manually using hGetContents, and finally use hIsOpen to ask whether the file is open:

SumFile.hs (Edit)
import System.IO

main :: IO ()
main = do
    h <- openFile "numbers.txt" ReadMode
    xs <- map read . lines <$> hGetContents h
    let total = sum xs
    -- Below this line is where we expect "numbers.txt" to be closed.
    op <- hIsOpen h
    print op
    h' <- openFile "numbers.txt" WriteMode
    print total
GHCi
>>> :r
[1 of 1] Compiling Main             ( SumFile.hs, interpreted )
Ok, one module loaded.
>>> main
False
*** Exception: numbers.txt: openFile: resource busy (file is locked)

Unfortunately, hIsOpen h tells us that h is already closed. That's this magical "semi-closed" state that hGetContents leaves behind. From our program's perspective, the file is already closed, but from the operating system's perspective it is clearly still open because, as we request data items from xs, these items are read on the fly from the file. The Haskell runtime system provides the lazy layer between our program and the operating system and results in the OS and our program having two different views of the world.
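
The semi-closed state can be made visible by asking both hIsOpen and hIsClosed about the same handle. The following is only a sketch: handleState is a helper invented here, and semi-demo.txt is a throwaway file name. It relies on the behaviour described in the Haskell report, where a semi-closed handle counts as neither open nor closed.

```haskell
import System.IO

-- | Hypothetical helper: what do hIsOpen and hIsClosed say about a handle?
handleState :: Handle -> IO (Bool, Bool)
handleState h = (,) <$> hIsOpen h <*> hIsClosed h

main :: IO ()
main = do
    -- "semi-demo.txt" is a file name invented for this sketch.
    writeFile "semi-demo.txt" "1\n2\n3\n"
    h <- openFile "semi-demo.txt" ReadMode
    txt <- hGetContents h
    -- Semi-closed: neither open nor closed, so this prints (False,False).
    handleState h >>= print
    -- Force the whole string, which reads the file to its end.
    length txt `seq` return ()
    -- Reaching the end of the file closes the handle: (False,True).
    handleState h >>= print
```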

There doesn't seem to be a standard library function in Haskell that lets us query whether a file is open from the operating system's perspective. It would be nice to have such a function because the operating system's view clearly has an impact on what our program can do: when we try to open a file for writing, we really need to know whether this file is currently closed. The only way to find out is to try to open the file and catch any exception that arises if this fails.
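
The try-and-catch approach can be wrapped in a small helper. This is a sketch under a few assumptions: isLockedForWriting is a name invented here, not a library function, and it opens the file in AppendMode so that a successful probe does not truncate the file (although it does create the file if it doesn't exist yet). It treats only the "resource busy" error as "still open" and rethrows everything else.

```haskell
import Control.Exception (throwIO, try)
import System.IO
import System.IO.Error (isAlreadyInUseError)

-- | Hypothetical helper: does opening the file for writing currently fail
-- because the file is still in use?
isLockedForWriting :: FilePath -> IO Bool
isLockedForWriting path = do
    result <- try (openFile path AppendMode)
    case result of
        Right h -> hClose h >> return False
        Left e
            | isAlreadyInUseError e -> return True
            | otherwise             -> throwIO e  -- e.g. a permission error

main :: IO ()
main = do
    -- "lock-demo.txt" is a file name invented for this sketch.
    writeFile "lock-demo.txt" "1\n2\n3\n"
    h <- openFile "lock-demo.txt" ReadMode
    txt <- hGetContents h
    -- The file has not been read yet, so it should still be locked.
    isLockedForWriting "lock-demo.txt" >>= print
    -- Read the file to its end, which closes it.
    length txt `seq` return ()
    isLockedForWriting "lock-demo.txt" >>= print
```

Note that such a probe only tells us about the moment of the check. Another part of the program could open the file again right afterwards, so in real code it is usually more robust to attempt the actual operation and handle the exception.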

Back to our example. We somehow need to ensure that by the time we try to open numbers.txt for writing, it is closed. We can make sure that the file is closed by explicitly closing it:

SumFile.hs (Edit)
import System.IO

main :: IO ()
main = do
    h <- openFile "numbers.txt" ReadMode
    xs <- map read . lines <$> hGetContents h
    let total = sum xs
    hClose h
    -- Below this line is where we expect "numbers.txt" to be closed.
    h' <- openFile "numbers.txt" WriteMode
    print total
GHCi
>>> :r
[1 of 1] Compiling Main             ( SumFile.hs, interpreted )
Ok, one module loaded.
>>> main
*** Exception: numbers.txt: hGetContents: illegal operation (delayed read on closed handle)

We're still getting an error, but it's a different one this time. The line

h' <- openFile "numbers.txt" WriteMode

succeeded without complaints. You can check this by inspecting the contents of numbers.txt now. It's empty, which is the result of opening the file for writing without actually writing anything into it. Things go wrong when print tries to print total. To do so, it needs to know the value of total. To compute total, we need to retrieve the elements of xs, but this doesn't work because we have explicitly closed the file using hClose h. This is exactly the same problem we ran into in the previous subsection when we tried to read the file using hGetContents, then closed it using hClose, and finally tried to print the contents we thought we had read from the file.

Let's populate numbers.txt with some numbers again, ready for our next attempt:

GHCi
>>> xs <- replicateM 10000 (randomRIO (1,1000)) :: IO [Int]
>>> writeFile "numbers.txt" (unlines $ map show xs)

In our next attempt, we read the file completely before closing it. To do so, we use hGetContents', the eager version of hGetContents:

SumFile.hs (Edit)
import System.IO

main :: IO ()
main = do
    h <- openFile "numbers.txt" ReadMode
    xs <- map read . lines <$> hGetContents' h
    let total = sum xs
    hClose h
    -- Below this line is where we expect "numbers.txt" to be closed.
    h' <- openFile "numbers.txt" WriteMode
    print total
GHCi
>>> :r
[1 of 1] Compiling Main             ( SumFile.hs, interpreted )
Ok, one module loaded.
>>> main
4976033

Haha, success. We get the sum and the file is closed correctly, which is confirmed by the successful execution of the line

h' <- openFile "numbers.txt" WriteMode

Once again, you can check that the file numbers.txt is now empty, as a result of this line.

In fact, the hClose h wasn't even necessary now because hGetContents' h eagerly reads the whole contents of the file and immediately closes the file once it has done so. You can check that the following version of SumFile.hs works:

SumFile.hs (Edit)
import System.IO

main :: IO ()
main = do
    h <- openFile "numbers.txt" ReadMode
    xs <- map read . lines <$> hGetContents' h
    let total = sum xs
    -- Deleted: hClose h
    -- Below this line is where we expect "numbers.txt" to be closed.
    h' <- openFile "numbers.txt" WriteMode
    print total
GHCi
>>> xs <- replicateM 10000 (randomRIO (1,1000)) :: IO [Int]
>>> writeFile "numbers.txt" (unlines $ map show xs)
>>> :r
[1 of 1] Compiling Main             ( SumFile.hs, interpreted )
Ok, one module loaded.
>>> main
4971670

But now we have the problem that eagerly reading a file creates: We load the whole file into memory before processing it. The problem we're trying to solve does not require this. We are perfectly happy with reading the file lazily as long as we finish reading it before trying to open it for writing. The trick is to force total to be evaluated before we try to open numbers.txt for writing. Currently, total is evaluated only after we do this, because print total is the last line in our program.

If we swap the last two lines,

SumFile.hs (Edit)
import Control.Monad
import System.IO

main :: IO ()
main = do
    h <- openFile "numbers.txt" ReadMode
    xs <- map read . lines <$> hGetContents h
    let total = sum xs
    -- Below this line is where we expect "numbers.txt" to be closed.
    print total
    void $ openFile "numbers.txt" WriteMode

then everything works as expected:

GHCi
>>> xs <- replicateM 10000 (randomRIO (1,1000)) :: IO [Int]
>>> writeFile "numbers.txt" (unlines $ map show xs)
>>> :r
[1 of 1] Compiling Main             ( SumFile.hs, interpreted )
Ok, one module loaded.
>>> main
4999213

But that's kind of cheating. In general, we should expect that we don't just want to print total on the screen but that we want to work with this value later in our program. So we need to find a different way to force the evaluation of total.

The tool to do this is one that really should have been discussed in the chapter on lazy evaluation, but I didn't want to get into these finer details at that point yet. The standard library includes a function

seq :: a -> b -> b

It's a strange little function: As far as its return value goes, it completely ignores its first argument and simply returns whatever the second argument is. So, x `seq` y = y. But there is an important little effect happening under the hood. Before returning y, x is evaluated to WHNF. For numbers, WHNF is the number itself, that is, evaluating a numerical expression to WHNF means to fully evaluate it. That's exactly what we want here:

SumFile.hs (Edit)
import Control.Monad
import System.IO

main :: IO ()
main = do
    h <- openFile "numbers.txt" ReadMode
    xs <- map read . lines <$> hGetContents h
    let total = sum xs
    total `seq` return ()
    -- Below this line is where we expect "numbers.txt" to be closed.
    void $ openFile "numbers.txt" WriteMode
    print total

The expression total `seq` return () returns (), and we completely ignore this return value. What we care about is that before evaluating return (), total is evaluated to WHNF. The result is that the file is read completely, and then closed, in time for us to open it again for writing:

GHCi
>>> xs <- replicateM 10000 (randomRIO (1,1000)) :: IO [Int]
>>> writeFile "numbers.txt" (unlines $ map show xs)
>>> :r
Ok, one module loaded.
>>> main
5025883
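
To get a feeling for what "evaluated to WHNF" does and does not force, here is a small sketch that is independent of file I/O. The helper forced is invented for this illustration. For an Int, WHNF is the fully evaluated number, whereas for a pair, WHNF only means that the outermost pair constructor is known.

```haskell
-- | Hypothetical helper: evaluate the first argument to WHNF,
-- then return the message.
forced :: a -> String -> String
forced x msg = x `seq` msg

main :: IO ()
main = do
    -- For a number, WHNF means the sum is actually computed here.
    let n = sum [1 .. 100] :: Int
    putStrLn (forced n "n has been fully evaluated")

    -- A pair is in WHNF as soon as its constructor is known, so seq
    -- never touches the undefined components and nothing crashes.
    let pair = (undefined, undefined) :: (Int, Int)
    putStrLn (forced pair "pair is in WHNF; its components were not evaluated")

    -- Forcing a component would evaluate undefined and crash:
    -- putStrLn (forced (fst pair) "never printed")
```

This is why seq is enough in SumFile.hs: total is a number, so forcing it to WHNF forces the whole sum and therefore the whole list of numbers read from the file.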

We have achieved what we set out to do: During the computation of total, the file is read lazily, because we retrieve its contents using hGetContents, and sum xs retrieves the elements of xs one element at a time to add each element to the current sum. We ensure that this process is finished before opening numbers.txt for writing by forcing the evaluation of total using seq.
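
If we want to reuse this pattern, we can package it as a function that returns the sum only after it has been forced. This is a sketch rather than a definitive implementation: sumFile and the file name sum-demo.txt are invented here, and the result type is fixed to Int for simplicity.

```haskell
import System.IO

-- | Hypothetical helper: sum the numbers in a file, one number per line.
sumFile :: FilePath -> IO Int
sumFile path = do
    h <- openFile path ReadMode
    xs <- map read . lines <$> hGetContents h
    let total = sum xs
    -- Force the sum before returning it. Computing it reads the file
    -- lazily to its end, at which point the file is closed.
    total `seq` return total

main :: IO ()
main = do
    -- "sum-demo.txt" is a file name invented for this sketch.
    writeFile "sum-demo.txt" (unlines (map show [1 .. 100 :: Int]))
    total <- sumFile "sum-demo.txt"
    -- The file is closed again, so opening it for writing succeeds.
    writeFile "sum-demo.txt" ""
    print total  -- prints 5050
```

The last line of sumFile can also be written as return $! total, which is the same thing spelled with the strict application operator.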