Better Text Support

Strings are represented as lists of characters in Haskell. This works well when the strings to be manipulated are small or when they are produced lazily. For manipulating large amounts of text, however, using one list node per character is extremely wasteful, and the lack of locality of the list nodes in memory makes traversing a string slow.

As a result, data types have been implemented that are better suited for processing non-trivial amounts of text. In this section, we briefly discuss the Text type from the Data.Text module. This module is not part of the standard library but is provided by the text package. This data type stores strings as arrays of characters or, if you use the lazy text type from Data.Text.Lazy, as a linked list of chunks of text. This makes storing and processing text much faster.

Of course, we lose the ability to use standard list functions to work with text if we use the Text type. However, the Data.Text and Data.Text.Lazy modules provide lots of functions to manipulate text. There are map, filter, foldl, etc. These functions do exactly the same as the list functions with the same names. We can also convert between the Text and String types using the functions pack and unpack. This can be useful when working with functions that do expect a String as input.

Here's a common rule that experienced Haskell programmers use to work with text:¹

Store text as Text. Process text as Strings.

Let me explain. Text is compact and efficient as a storage format for text, because it's really just an array of characters. At the same time, being an array of characters makes it expensive as the type of an intermediate result² in a pipeline of transformations because each intermediate Text is constructed in its entirety before passing it as input to the next stage in the pipeline. Lists are a terrible storage format because of the space overhead of storing one pointer per list element. The fact that lists are lazy, however, effectively turns them into really convenient iterators. Logically, the pipeline

zs = map fst $ filter pred $ map fun $ zip xs ys

creates and reads lots of lists. zip xs ys produces a list of pairs. This list gets transformed into a new list using map fun. filter pred creates some sublist of this list. And finally, map fst converts this list into the final list zs. However, we never store more than one node of each of the intermediate lists. Laziness ensures that we build zs one element at a time. To generate each element of zs, map fst needs to ask for the next element of filter pred. This requests a bunch of elements from map fun and discards all of them until it finds one that matches the predicate. map fun generates each of the requested elements by asking zip xs ys for elements one element at a time, and zip xs ys creates these elements by zipping together xs and ys.

The compiler goes even further. Realizing that the intermediate lists aren't used anywhere else, it completely optimizes the intemediate lists away. Under the hood, the intermediate results become true iterators that yield one element at a time, elements, not list nodes.

All of this remains true even if we assign the intermediate lists to variables in order to give them names so our code becomes more readable. As long as we use these variables only as input to the next stage in our pipeline and don't hold on to them as long-term storage, the lists are generated lazily and the compiler may completely optimize them away, turning them into iterators under the hood.

What all this means is that we get the best of two worlds: Iterators are the tool to process sequences of data in most modern languages. They are efficient and fairly convenient to work with. But lists are even more natural to think about and work with than iterators. Haskell's lazy evaluation, combined with optimizations performed by the compiler, gives us the convenience of lists and the efficiency of iterators.

This was all rather dry and had nothing to do with I/O. I had no appetite for listing lots of examples of working with Text because it would all be highly repetitive. Once you look at the Data.Text module, you will see that you have Text versions of all your standard list functions, and you surely have enough imagination to figure out how to translate what you would do with Strings using the standard list functions into doing the same with Text using the Text versions of these functions.

The reason why I mentioned Text here is that the Data.Text and Data.Text.Lazy modules are accompanied by Data.Text.IO and Data.Text.Lazy.IO modules that allow us to perform I/O using Text as our text type, instead of String. This allows us to use Text as the text type throughout our whole program without a need to store Strings anywhere. When our program does produce Strings using unpack, this should only be as iterators. In other words, we should think about String not as String but as the type that it really is, [Char], a list (iterator) of characters.

Let me finish this section with at least one short example. It's a silly one,³ but it illustrates the points I made in this subsection. Our goal is to read some text from a file. We want to keep only those characters that are greater than the characters immediately before them, and we want to convert the remaining characters to uppercase. Finally, we print the resulting text on screen. Since we have a whole pipeline of transformations and don't want to store large chunks of intermediate texts, this calls for the following solution:

We read the input as Text.
We build a pipeline that unpacks the Text and applies standard list transformations to the list of characters.
We build a new Text from the final list of characters using pack.
We print the Text to stdout.

One problem we have is that all the functions in Data.Text have the same names as standard list functions. Thus, if we were to just import Data.Text, we'd have name clashes between these functions: the compiler wouldn't know which map function we want to use when using map somewhere in our code, for example. To avoid this problem, we use a tool we will discuss in greater detail in the next chapter. By importing Data.Text and Data.Text.IO using

import qualified Data.Text    as T
import qualified Data.Text.IO as T

we can continue using the standard list version of map using map, and we can use the Text version of map using T.map. Here then is our complete solution:

Silly.hs

import qualified Data.Char    as C
import qualified Data.Text    as T
import qualified Data.Text.IO as T

main :: IO ()
main = do
    text <- T.readFile "mantra.txt"
    T.putStrLn $ transformText text

transformText :: T.Text -> T.Text
transformText text = text'
  where
    text' = T.pack ups
    ups   = map C.toUpper kept
    kept  = map fst $ filter (uncurry (>)) $ zip tcs cs
    cs    = T.unpack text
    tcs   = T.unpack $ T.tail text

GHCi

>>> :l Silly
[1 of 1] Compiling Main             ( Silly.hs, interpreted )
Ok, one module loaded.
>>> main
WLLRTPRRMINHASLIWLLRTPRRMINHASLIWLLRTPRRMINHASL

The key point I want you to focus on in this example is that we're storing the input file as Text, and we're using the Text I/O functions to read the input file and print the result on screen. transformText transforms a Text into a Text, but the intermediate results are lists of characters. We achieve this by unpacking text (twice, because we also need its tail), transforming the lists of characters in the way we are used to by now, and finally, we build the transformed Text from the final list using pack.

In fact, we should generally use this rule for any kind of data: Store data in Arrays or Vectors. Process the data using lists. ↩
The lazy Text type provided by Data.Text.Lazy mitigates a lot of these issues. It combines the benefits of laziness with the space efficiency of the Text type by storing text as a list of small chunks of text. Thus, it does not waste one list node per character, but neither does it use space to store the whole text when it is used as an intermediate result in a pipeline. Thus, if all intermediate results are themselves sequences of characters, using lazy text is often the better choice than going all the way to lists. If the intermediate results include lists of pairs of characters or lists of other types of data, then we have no choice but to use lists because Text can only store characters. ↩
You should have got used to this by now. ↩