Better Text Support
Strings are represented as lists of characters in Haskell. This works well when the strings to be manipulated are small or when they are produced lazily. For manipulating large amounts of text, however, using one list node per character is extremely wasteful, and the lack of locality of the list nodes in memory makes traversing a string slow.
As a result, data types have been implemented that are better suited for
processing non-trivial amounts of text. In this section, we briefly discuss the
Text
type from the Data.Text
module. This module is not part of the standard
library but is provided by the text
package. This data type stores strings as
arrays of characters or, if you use the lazy text type from Data.Text.Lazy
, as
a linked list of chunks of text. This makes storing and processing text much
faster.
Of course, we lose the ability to use standard list functions to work with text
if we use the Text
type. However, the Data.Text
and Data.Text.Lazy
modules
provide lots of functions to manipulate text. There are map
, filter
,
foldl
, etc. These functions do exactly the same as the list functions with the
same names. We can also convert between the Text
and String
types using the
functions pack
and unpack
. This can be useful when working with functions
that do expect a String
as input.
Here's a common rule that experienced Haskell programmers use to work with text:1
Store text as
Text
. Process text asString
s.
Let me explain. Text
is compact and efficient as a storage format for text,
because it's really just an array of characters. At the same time, being an
array of characters makes it expensive as the type of an intermediate
result2 in a pipeline of transformations because each intermediate
Text
is constructed in its entirety before passing it as input to the next
stage in the pipeline. Lists are a terrible storage format because of the space
overhead of storing one pointer per list element. The fact that lists are lazy,
however, effectively turns them into really convenient iterators. Logically, the
pipeline
zs = map fst $ filter pred $ map fun $ zip xs ys
creates and reads lots of lists. zip xs ys
produces a list of pairs. This list
gets transformed into a new list using map fun
. filter pred
creates some
sublist of this list. And finally, map fst
converts this list into the final
list zs
. However, we never store more than one node of each of the
intermediate lists. Laziness ensures that we build zs
one element at a time.
To generate each element of zs
, map fst
needs to ask for the next element of
filter pred
. This requests a bunch of elements from map fun
and discards all
of them until it finds one that matches the predicate. map fun
generates each
of the requested elements by asking zip xs ys
for elements one element at a
time, and zip xs ys
creates these elements by zipping together xs
and ys
.
The compiler goes even further. Realizing that the intermediate lists aren't used anywhere else, it completely optimizes the intemediate lists away. Under the hood, the intermediate results become true iterators that yield one element at a time, elements, not list nodes.
All of this remains true even if we assign the intermediate lists to variables in order to give them names so our code becomes more readable. As long as we use these variables only as input to the next stage in our pipeline and don't hold on to them as long-term storage, the lists are generated lazily and the compiler may completely optimize them away, turning them into iterators under the hood.
What all this means is that we get the best of two worlds: Iterators are the tool to process sequences of data in most modern languages. They are efficient and fairly convenient to work with. But lists are even more natural to think about and work with than iterators. Haskell's lazy evaluation, combined with optimizations performed by the compiler, gives us the convenience of lists and the efficiency of iterators.
This was all rather dry and had nothing to do with I/O. I had no appetite for
listing lots of examples of working with Text
because it would all be highly
repetitive. Once you look at the Data.Text
module, you will see that you have
Text
versions of all your standard list functions, and you surely have enough
imagination to figure out how to translate what you would do with String
s
using the standard list functions into doing the same with Text
using the
Text
versions of these functions.
The reason why I mentioned Text
here is that the Data.Text
and
Data.Text.Lazy
modules are accompanied by Data.Text.IO
and
Data.Text.Lazy.IO
modules that allow us to perform I/O using Text
as our
text type, instead of String
. This allows us to use Text
as the text type
throughout our whole program without a need to store String
s anywhere. When
our program does produce String
s using unpack
, this should only be as
iterators. In other words, we should think about String
not as String
but as
the type that it really is, [Char]
, a list (iterator) of characters.
Let me finish this section with at least one short example. It's a silly one,3 but it illustrates the points I made in this subsection. Our goal is to read some text from a file. We want to keep only those characters that are greater than the characters immediately before them, and we want to convert the remaining characters to uppercase. Finally, we print the resulting text on screen. Since we have a whole pipeline of transformations and don't want to store large chunks of intermediate texts, this calls for the following solution:
- We read the input as
Text
. - We build a pipeline that
unpack
s theText
and applies standard list transformations to the list of characters. - We build a new
Text
from the final list of characters usingpack
. - We print the
Text
tostdout
.
One problem we have is that all the functions in Data.Text
have the same names
as standard list functions. Thus, if we were to just import Data.Text
, we'd
have name clashes between these functions: the compiler wouldn't know which
map
function we want to use when using map
somewhere in our code, for
example. To avoid this problem, we use a tool we will discuss in greater detail
in the next chapter. By importing Data.Text
and
Data.Text.IO
using
import qualified Data.Text as T
import qualified Data.Text.IO as T
we can continue using the standard list version of map
using map
, and we can
use the Text
version of map using T.map
. Here then is our complete solution:
import qualified Data.Char as C
import qualified Data.Text as T
import qualified Data.Text.IO as T
main :: IO ()
main = do
text <- T.readFile "mantra.txt"
T.putStrLn $ transformText text
transformText :: T.Text -> T.Text
transformText text = text'
where
text' = T.pack ups
ups = map C.toUpper kept
kept = map fst $ filter (uncurry (>)) $ zip tcs cs
cs = T.unpack text
tcs = T.unpack $ T.tail text
>>> :l Silly
[1 of 1] Compiling Main ( Silly.hs, interpreted )
Ok, one module loaded.
>>> main
WLLRTPRRMINHASLIWLLRTPRRMINHASLIWLLRTPRRMINHASL
The key point I want you to focus on in this example is that we're storing the
input file as Text
, and we're using the Text
I/O functions to read the input
file and print the result on screen. transformText
transforms a Text
into a
Text
, but the intermediate results are lists of characters. We achieve this by
unpacking text
(twice, because we also need its tail), transforming the lists
of characters in the way we are used to by now, and finally, we build the
transformed Text
from the final list using pack
.
-
In fact, we should generally use this rule for any kind of data: Store data in
Array
s orVector
s. Process the data using lists. ↩ -
The lazy
Text
type provided byData.Text.Lazy
mitigates a lot of these issues. It combines the benefits of laziness with the space efficiency of theText
type by storing text as a list of small chunks of text. Thus, it does not waste one list node per character, but neither does it use space to store the whole text when it is used as an intermediate result in a pipeline. Thus, if all intermediate results are themselves sequences of characters, using lazy text is often the better choice than going all the way to lists. If the intermediate results include lists of pairs of characters or lists of other types of data, then we have no choice but to use lists becauseText
can only store characters. ↩ -
You should have got used to this by now. ↩