Arrays as Accumulators
As discussed before, array
takes the last list element with a given index as
the entry stored at that index if there are multiple list entries with the same
index. Often, what we really want is to combine all list entries with the same
index. For this, we have
accumArray :: Ix i => (e -> a -> e) -> e -> (i, i) -> [(i, a)] -> Array i e
That's a whole lot of arguments, so let's take them apart. The last two
arguments, of types (i, i)
and [(i, a)]
are very similar to the arguments of
array
. They're the range of array indices and a list of index-element pairs
with which to populate the array. However, we do not blindly store these
elements in the array. They don't even have the correct type. Here, the array
elements have type e
, and the values associated with the array indices in the
list have type a
. That's where the first two arguments come in. The second
argument, of type e
, is the value to which we initialize every array slot. The
first argument, of type e -> a -> e
is an accumulator function. For every
entry (i, a)
in the index-element list, accumArray
accesses the slot at
index i
and reads the value of type e
stored there. It applies the
accumulator function to this value and to a
to obtain an updated value of type
e
. This values is the new value of type e
stored at index i
. accumArray
iterates over the entire list of index-element pairs in this fashion and returns
the array obtained after applying the corresponding updates to all array slots.
Let's illustrate this using an example. We are given some string and want to count how often every character occurs in this string. For simplicity, we assume that the only characters in the string are lowercase letters. The output should not list any characters that do not occur at all. We only want the number of occurrences for all characters that occur at least once:
>>> countOccurrences "mississippi"
[('i',4),('m',1),('p',2),('s',4)]
The word "mississippi" contains four distinct characters, 'i', 'm', 'p', and 's'. 'i' occurs four times, 'm' occurs once, 'p' occurs twice, and 's' occurs four times.
How can we implement such a countOccurrences
function?
We start by pairing every character in the input with the number 1, reflecting that this is one occurrence of this character. We can use a list comprehension to do this:
>>> [(x, 1) | x <- "mississippi"]
[('m',1),('i',1),('s',1),('s',1),('i',1),('s',1),('s',1),('i',1),('p',1),('p',1),('i',1)]
To count the number of occurrences of each character, we want to construct an
array indexed by all possible characters, that is, with index range
('a', 'z')
. The count associated with each character is initially 0. For each
occurrence of a character, we want to increase its count by 1. We can achieve
this by adding the 1 associated with each character in
[(x, 1) | x <- "mississippi"]
to the current count in our array. Here is how we
do this using accumArray
:
>>> accumArray (+) 0 ('a', 'z') [(x, 1) | x <- "mississippi"]
array ('a','z') [('a',0),('b',0),('c',0),('d',0),('e',0),('f',0),('g',0),('h',0),('i',4),
('j',0),('k',0),('l',0),('m',1),('n',0),('o',0),('p',2),('q',0),('r',0),('s',4),('t',0),
('u',0),('v',0),('w',0),('x',0),('y',0),('z',0)]
This lists the array entries for all indices. Most of these entries are 0 because the character does not occur in our input text "mississippi". The entry for 'i', however, is 4, because there are 4 occurrences of 'i' in "mississippi". The entry for 'm' is 1.
Now, we don't want the array but the list of character-count pairs. We extract
this list using the assocs
function:
>>> assocs $ accumArray (+) 0 ('a', 'z') [(x, 1) | x <- "mississippi"]
[('a',0),('b',0),('c',0),('d',0),('e',0),('f',0),('g',0),('h',0),('i',4),('j',0),('k',0),
('l',0),('m',1),('n',0),('o',0),('p',2),('q',0),('r',0),('s',4),('t',0),('u',0),('v',0),
('w',0),('x',0),('y',0),('z',0)]
And finally, we want to discard all entries from this list where the count is 0, that is, we want to keep only those pairs in this list whose second components are not 0:
>>> filter ((0 /=) . snd) $ assocs $ accumArray (+) 0 [(x, 1) | x <- "mississippi"]
[('i',4),('m',1),('p',2),('s',4)]
Let's define this as a function so we can apply it to many different input strings:
>>> :{
| countOccurrences xs = filter ((0 /=) . snd)
| $ assocs
| $ accumArray (+) 0 ('a', 'z') [(x, 1) | x <- xs]
| :}
>>> countOccurrences "mississippi"
[('i',4),('m',1),('p',2),('s',4)]
>>> countOccurrences "hello"
[('e',1),('h',1),('l',2),('o',1)]
>>> countOccurrences "abracadabra"
[('a',5),('b',2),('c',1),('d',1),('r',2)]
>>> countOccurrences "Mississippi"
*** Exception: Ix{Char}.index: Index ('M') out of range (('a','z'))
Alas, the last example didn't work because the input contained a character we
didn't expect, an uppercase 'M'
. But that was to be expected because we only
have array indices between 'a'
and 'z'
, so accumArray
didn't know what to
do with the entry ('M',1)
in its input list. Here's how we can make this
function safer. It simply ignores characters that it doesn't have an array slot
for:
>>> :{
| countOccurrences xs = filter ((0 /=) . snd)
| $ assocs
| $ accumArray (+) 0 ('a', 'z') [(x, 1) | x <- xs, inRange ('a', 'z') x]
| :}
>>> countOccurrences "Mississippi"
[('i',4),('p',2),('s',4)]
The trick is the condition inRange ('a', 'z') x
in the list comprehension,
which keeps only those characters in xs
that are in the range between 'a'
and 'z'
. inRange :: (i, i) -> i -> Bool
is a function provided by the Ix
type class. It checks whether its second argument is in the range given as the
first argument.