Lazy I/O
We have already had a brief look at lazy I/O, but it deserves a separate section. Let's recap what we have learned so far:
- I/O functions in Haskell are lazy, just like all other functions. They produce their results only when we inspect these results.
- This helps with reading file contents incrementally. It avoids manually reading a file chunk by chunk, which is necessary in other languages to avoid loading the whole file into memory.
- Lazy I/O interacts poorly with hClose because we cannot interact with a closed file handle anymore, not even to lazily read file contents that were logically retrieved before we closed the file.
- hGetContents file puts file into a "semi-closed" state. That's a state where the file isn't actually closed yet: lazily retrieving more data from the file works. As far as all other file operations are concerned, the file is considered closed.
Lazy I/O is probably the wartiest part of Haskell, one that has attracted lots of discussion. The current consensus seems to be that lazy I/O, just like laziness in general, enables a lot of useful idioms that require much more careful orchestration in languages without lazy evaluation or lazy I/O. Thus, there is little appetite for changing the lazy behaviour of Haskell's I/O functions, and newcomers to the language simply have to learn how to manage files correctly to avoid being bitten by the quirks of lazy I/O.
Most of the time, everything will work just fine, as long as you train yourself to never close files yourself and instead trust the garbage collector to close files for you.
If you only want to learn Haskell enough to make it through this course, then you can stop reading here and move on to the next section. If you want to learn more about how to tame lazy I/O, then keep reading. The remainder of this section is not so much about lazy I/O itself but how to make sure I/O isn't lazy when that's what you really need.
Let's start with the three sledgehammers. getContents, hGetContents, and readFile have three close relatives, called getContents', hGetContents', and readFile'. It is common practice in the standard library to add an apostrophe to the end of a function name if this function does the same as the version without apostrophe, but it evaluates its result eagerly. We saw this before when discussing the difference between foldl and foldl'. Here, just as hGetContents file reads the contents of file, so does hGetContents' file, but the latter does this eagerly, that is, it forces the entire file to be read into memory before hGetContents' returns. In particular, it is perfectly safe to close file immediately afterwards and then work with the file contents returned by hGetContents', something that didn't work when using hGetContents:
>>> :{
| do
|   h <- openFile "mantra.txt" ReadMode
|   txt <- hGetContents' h
|   hClose h
|   putStr txt
| :}
I will learn to program in Haskell!
I will learn to program in Haskell!
I will learn to program in Haskell!
The downside is that we do use as much memory as necessary to store the entire file contents now, exactly the problem that lazy I/O helped us avoid.
getContents' does the same but for stdin as the input file: it eagerly reads all of the data available on stdin, whereas getContents would read this data lazily. Similarly, readFile' reads an entire file, just as readFile would do, but it loads the content of the whole file into memory before returning.
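As a small illustration of when the eager variants matter, here is a sketch that reads a number from a file and immediately writes an updated number back to the same file. The helper bumpCounter and the file name counter.txt are made up for this example, and the sketch assumes a version of base (4.15 or later) in which System.IO exports readFile'.

```haskell
import System.IO (readFile')

-- Hypothetical helper: read the number stored in a file, add one to it,
-- and write the result back to the same file.
bumpCounter :: FilePath -> IO Int
bumpCounter path = do
  -- readFile' reads the whole file and closes it before returning, so the
  -- writeFile below is free to open the same file for writing.
  n <- read <$> readFile' path
  let n' = n + 1 :: Int
  writeFile path (show n' ++ "\n")
  return n'

main :: IO ()
main = do
  writeFile "counter.txt" "41\n"   -- "counter.txt" is a made-up file name
  bumpCounter "counter.txt" >>= print
  readFile' "counter.txt" >>= putStr
```

If readFile' were replaced with readFile here, the file would most likely still be held open by the pending lazy read when writeFile tries to open it, and the program would fail instead of updating the counter.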
Often, it isn't necessary to read the entire text into memory before closing the file. Let's say we want to read a file full of numbers, add them together, and then move on to the next phase in our program, which we assume requires the file to be closed. Let's build up the solution incrementally. We'll start with putting some numbers to be summed into a file:
>>> import Control.Monad
>>> import System.Random
>>> xs <- replicateM 10000 (randomRIO (1,1000)) :: IO [Int]
>>> writeFile "numbers.txt" (unlines $ map show xs)
This randomly generates 10,000 integers between 1 and 1,000 and writes them to a file numbers.txt, one number per line. Since these numbers are generated randomly, the sums you will see when you try this are likely different from the ones I obtained in my runs.
Here's our first attempt at summing the numbers in the file numbers.txt:
import System.IO

main :: IO ()
main = do
  xs <- map read . lines <$> readFile "numbers.txt"
  let total = sum xs
  -- Below this line is where we expect "numbers.txt" to be closed.
  print total
>>> :l SumFile.hs
[1 of 1] Compiling Main ( SumFile.hs, interpreted )
Ok, one module loaded.
>>> main
4991785
This does what we expect, but it does not meet the requirement that the file numbers.txt is closed once we reach the print total statement. The problem is that readFile is lazy, as is sum. Thus, sum requests the elements of xs to produce the value of total only once we ask for the value of total by printing it. The file gets closed only once we have read all elements of xs from the file.
Let's test that numbers.txt really isn't closed by the time we reach print total. To do so, we use a little trick: We can't open a file for writing if the file is already open (but the same file can be opened for reading multiple times). Thus, we'll add a line that tries to open numbers.txt for writing and see what happens:
import System.IO

main :: IO ()
main = do
  xs <- map read . lines <$> readFile "numbers.txt"
  let total = sum xs
  -- Below this line is where we expect "numbers.txt" to be closed.
  h' <- openFile "numbers.txt" WriteMode
  print total
>>> :r
[1 of 1] Compiling Main ( SumFile.hs, interpreted )
Ok, one module loaded.
>>> main
*** Exception: numbers.txt: openFile: resource busy (file is locked)
As expected, we get an exception when trying to open numbers.txt for writing, which proves that it is still open for reading.
You may wonder why we couldn't have used some function from System.IO to ask whether the file is still open. There does exist such a function, called hIsOpen. As the prefix h suggests, it requires a file handle as argument. readFile does not provide us with a file handle, but we can expand its definition to gain access to the file handle. readFile is defined as
readFile name = openFile name ReadMode >>= hGetContents
So instead of calling readFile, we can use openFile to open the file manually, read its contents manually using hGetContents, and finally use hIsOpen to ask whether the file is open:
import System.IO

main :: IO ()
main = do
  h <- openFile "numbers.txt" ReadMode
  xs <- map read . lines <$> hGetContents h
  let total = sum xs
  -- Below this line is where we expect "numbers.txt" to be closed.
  op <- hIsOpen h
  print op
  h' <- openFile "numbers.txt" WriteMode
  print total
>>> :r
[1 of 1] Compiling Main ( SumFile.hs, interpreted )
Ok, one module loaded.
>>> main
False
*** Exception: numbers.txt: openFile: resource busy (file is locked)
Unfortunately, hIsOpen h tells us that h is already closed. That's this magical "semi-closed" state that hGetContents leaves behind. From our program's perspective, the file is already closed, but from the operating system's perspective it is clearly still open because, as we request data items from xs, these items are read on the fly from the file. The Haskell runtime system provides the lazy layer between our program and the operating system and results in the OS and our program having two different views of the world.
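To make the "semi-closed" state a little more tangible, the following sketch asks both hIsOpen and hIsClosed about a handle at three points in its life. The helper probe and the file demo.txt are made up for this illustration. My expectation, based on GHC's implementation, is that a semi-closed handle reports False for both questions and that hIsClosed turns True only once the contents have been read to the end, but treat this as an experiment to run rather than as a guarantee.

```haskell
import System.IO

-- Hypothetical helper: report what hIsOpen and hIsClosed say about a handle.
probe :: Handle -> IO (Bool, Bool)
probe h = (,) <$> hIsOpen h <*> hIsClosed h

main :: IO ()
main = do
  writeFile "demo.txt" "one\ntwo\n"   -- "demo.txt" is a made-up file name
  h <- openFile "demo.txt" ReadMode
  probe h >>= print                   -- freshly opened handle
  txt <- hGetContents h
  probe h >>= print                   -- after hGetContents: semi-closed
  putStr txt                          -- consumes the entire contents
  probe h >>= print                   -- after reading to the end of the file
```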
There doesn't seem to be a standard library function in Haskell that lets us query whether a file is open from the operating system's perspective. It would be nice to have such a function because the operating system's view clearly has an impact on what our program can do: when we try to open a file for writing, we really need to know whether this file is currently closed. The only way to find out is to try to open the file and catch any exception that is raised when this fails.
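The "try to open it and catch the exception" approach can be sketched as follows. The helper canOpenForWriting and the file probe.txt are hypothetical names introduced for this example. The helper uses AppendMode rather than WriteMode so that probing does not empty the file, at the price of creating the file if it does not exist yet.

```haskell
import Control.Exception (IOException, try)
import System.IO

-- Hypothetical helper: can the file currently be opened for writing?
-- AppendMode is used so that probing does not truncate the file; note that
-- it creates the file if it does not exist yet.
canOpenForWriting :: FilePath -> IO Bool
canOpenForWriting path = do
  result <- try (openFile path AppendMode) :: IO (Either IOException Handle)
  case result of
    Left _  -> return False              -- opening failed, e.g. file is locked
    Right h -> hClose h >> return True   -- opening worked; release the handle

main :: IO ()
main = do
  writeFile "probe.txt" "hello\n"           -- "probe.txt" is a made-up file name
  h <- openFile "probe.txt" ReadMode
  canOpenForWriting "probe.txt" >>= print   -- asked while h is still open
  hClose h
  canOpenForWriting "probe.txt" >>= print   -- asked after closing h
```

Keep in mind that a False answer lumps together all I/O errors (a locked file, missing permissions, and so on), and that the answer may be out of date by the time the program acts on it.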
Back to our example. We somehow need to ensure that by the time we try to open numbers.txt for writing, it is closed. We can make sure that the file is closed by explicitly closing it:
import System.IO

main :: IO ()
main = do
  h <- openFile "numbers.txt" ReadMode
  xs <- map read . lines <$> hGetContents h
  let total = sum xs
  hClose h
  -- Below this line is where we expect "numbers.txt" to be closed.
  h' <- openFile "numbers.txt" WriteMode
  print total
>>> :r
[1 of 1] Compiling Main ( SumFile.hs, interpreted )
Ok, one module loaded.
>>> main
*** Exception: numbers.txt: hGetContents: illegal operation (delayed read on closed handle)
We're still getting an error, but it's a different one this time. The line
h' <- openFile "numbers.txt" WriteMode
succeeded without complaints. You can check this by inspecting the contents of numbers.txt now. It's empty, which is the result of opening the file for writing without actually writing anything into it. Where things go wrong is when print tries to print total. To do so, it needs to know the value of total. To compute total, we need to retrieve the elements of xs, but this doesn't work because we have explicitly closed the file using hClose h. This is exactly the same problem we ran into in the previous subsection when we tried to read the file using hGetContents, then closed it using hClose, and finally tried to print the contents we thought we had read from the file.
Let's populate numbers.txt with some numbers again, ready for our next attempt:
>>> xs <- replicateM 10000 (randomRIO (1,1000)) :: IO [Int]
>>> writeFile "numbers.txt" (unlines $ map show xs)
In our next attempt, we read the file completely before closing it. To do so, we use hGetContents', the eager version of hGetContents:
import System.IO

main :: IO ()
main = do
  h <- openFile "numbers.txt" ReadMode
  xs <- map read . lines <$> hGetContents' h
  let total = sum xs
  hClose h
  -- Below this line is where we expect "numbers.txt" to be closed.
  h' <- openFile "numbers.txt" WriteMode
  print total
>>> :r
[1 of 1] Compiling Main ( SumFile.hs, interpreted )
Ok, one module loaded.
>>> main
4976033
Haha, success. We get the sum and the file is closed correctly, which is confirmed by the successful execution of the line
h' <- openFile "numbers.txt" WriteMode
Once again, you can check that the file numbers.txt is now empty, as a result of this line.
In fact, the hClose h wasn't even necessary now because hGetContents' h eagerly reads the whole contents of the file and immediately closes the file once it has done so. You can check that the following version of SumFile.hs works:
import System.IO

main :: IO ()
main = do
  h <- openFile "numbers.txt" ReadMode
  xs <- map read . lines <$> hGetContents' h
  let total = sum xs
  -- Deleted: hClose h
  -- Below this line is where we expect "numbers.txt" to be closed.
  h' <- openFile "numbers.txt" WriteMode
  print total
>>> xs <- replicateM 10000 (randomRIO (1,1000)) :: IO [Int]
>>> writeFile "numbers.txt" (unlines $ map show xs)
>>> :r
[1 of 1] Compiling Main ( SumFile.hs, interpreted )
Ok, one module loaded.
>>> main
4971670
But now we have the problem that eagerly reading a file creates: We load the whole file into memory before processing it. The problem we're trying to solve does not require this. We are perfectly happy with reading the file lazily as long as we finish reading it before trying to open it for writing. The trick is to force total to be evaluated before we try to open numbers.txt for writing. Currently, total is evaluated only after we do this, because print total is the last line in our program.
If we swap the last two lines,
import Control.Monad
import System.IO

main :: IO ()
main = do
  h <- openFile "numbers.txt" ReadMode
  xs <- map read . lines <$> hGetContents h
  let total = sum xs
  -- Below this line is where we expect "numbers.txt" to be closed.
  print total
  void $ openFile "numbers.txt" WriteMode
then everything works as expected:
>>> xs <- replicateM 10000 (randomRIO (1,1000)) :: IO [Int]
>>> writeFile "numbers.txt" (unlines $ map show xs)
>>> :r
[1 of 1] Compiling Main ( SumFile.hs, interpreted )
Ok, one module loaded.
>>> main
4999213
But that's kind of cheating. In general, we should expect that we don't just want to print total on the screen but that we want to work with this value later in our program. So we need to find a different way to force the evaluation of total.
The tool to do this is one that really should have been discussed in the chapter on lazy evaluation, but I didn't want to get into these finer details at that point yet. The standard library includes a function
seq :: a -> b -> b
It's a strange little function: As far as its return value goes, it completely ignores its first argument and simply returns whatever the second argument is. So, x `seq` y = y. But there is an important little effect happening under the hood. Before returning y, x is evaluated to WHNF. For numbers, WHNF is the number itself, that is, evaluating a numerical expression to WHNF means to fully evaluate it. That's exactly what we want here:
import Control.Monad
import System.IO

main :: IO ()
main = do
  h <- openFile "numbers.txt" ReadMode
  xs <- map read . lines <$> hGetContents h
  let total = sum xs
  total `seq` return ()
  -- Below this line is where we expect "numbers.txt" to be closed.
  void $ openFile "numbers.txt" WriteMode
  print total
The expression total `seq` return () returns (), and we completely ignore this return value. What we care about is that before evaluating return (), total is evaluated to WHNF. The result is that the file is read completely, and then closed, in time for us to open it again for writing:
>>> xs <- replicateM 10000 (randomRIO (1,1000)) :: IO [Int]
>>> writeFile "numbers.txt" (unlines $ map show xs)
>>> :r
Ok, one module loaded.
>>> main
5025883
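The fact that seq evaluates its first argument only to WHNF is worth a quick experiment, because it explains why we force total and not xs. The sketch below is my own illustration, not part of SumFile.hs: for a list, WHNF means evaluating just far enough to see the first constructor, so forcing xs with seq would read only the beginning of the file, whereas forcing a number computes it completely.

```haskell
main :: IO ()
main = do
  -- A list whose tail is undefined. seq only evaluates it far enough to see
  -- the first (:) constructor, so the undefined tail is never touched.
  let ys = 1 : undefined :: [Int]
  putStrLn (ys `seq` "list evaluated to WHNF without touching its tail")

  -- For a number, WHNF is the fully computed value, so the whole sum is
  -- evaluated before the message is printed.
  let n = sum [1 .. 100] :: Int
  n `seq` putStrLn "n has been computed"
  print n
```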
We have achieved what we set out to do: During the computation of total, the file is read lazily, because we retrieve its contents using hGetContents, and sum xs retrieves the elements of xs one element at a time to add each element to the current sum. We ensure that this process is finished before opening numbers.txt for writing by forcing the evaluation of total using seq.
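If this pattern is needed in more than one place, it can be wrapped in a small function. The following is a minimal sketch under my own assumptions: the name sumFile and the file small.txt are made up, and the result type is fixed to Int for simplicity.

```haskell
import System.IO

-- Hypothetical helper: sum the numbers in a file, one number per line.
-- The total is forced with seq before the action returns, so the file has
-- been read to the end by the time the caller continues.
sumFile :: FilePath -> IO Int
sumFile path = do
  h <- openFile path ReadMode
  total <- sum . map read . lines <$> hGetContents h
  total `seq` return total

main :: IO ()
main = do
  writeFile "small.txt" (unlines (map show [1 .. 10 :: Int]))
  total <- sumFile "small.txt"
  -- The file can be opened for writing because sumFile has finished reading.
  h' <- openFile "small.txt" WriteMode
  hClose h'
  print total
```

The caller of sumFile never sees the handle or the lazily read contents, only a fully evaluated number, so there is nothing left that could trigger a delayed read later on.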