Arrays as Accumulators

As discussed before, array takes the last list element with a given index as the entry stored at that index if there are multiple list entries with the same index. Often, what we really want is to combine all list entries with the same index. For this, we have

accumArray :: Ix i => (e -> a -> e) -> e -> (i, i) -> [(i, a)] -> Array i e

That's a whole lot of arguments, so let's take them apart. The last two arguments, of types (i, i) and [(i, a)] are very similar to the arguments of array. They're the range of array indices and a list of index-element pairs with which to populate the array. However, we do not blindly store these elements in the array. They don't even have the correct type. Here, the array elements have type e, and the values associated with the array indices in the list have type a. That's where the first two arguments come in. The second argument, of type e, is the value to which we initialize every array slot. The first argument, of type e -> a -> e is an accumulator function. For every entry (i, a) in the index-element list, accumArray accesses the slot at index i and reads the value of type e stored there. It applies the accumulator function to this value and to a to obtain an updated value of type e. This values is the new value of type e stored at index i. accumArray iterates over the entire list of index-element pairs in this fashion and returns the array obtained after applying the corresponding updates to all array slots.

Let's illustrate this using an example. We are given some string and want to count how often every character occurs in this string. For simplicity, we assume that the only characters in the string are lowercase letters. The output should not list any characters that do not occur at all. We only want the number of occurrences for all characters that occur at least once:

GHCi

>>> countOccurrences "mississippi"
[('i',4),('m',1),('p',2),('s',4)]

The word "mississippi" contains four distinct characters, 'i', 'm', 'p', and 's'. 'i' occurs four times, 'm' occurs once, 'p' occurs twice, and 's' occurs four times.

How can we implement such a countOccurrences function?

We start by pairing every character in the input with the number 1, reflecting that this is one occurrence of this character. We can use a list comprehension to do this:

GHCi

>>> [(x, 1) | x <- "mississippi"]
[('m',1),('i',1),('s',1),('s',1),('i',1),('s',1),('s',1),('i',1),('p',1),('p',1),('i',1)]

To count the number of occurrences of each character, we want to construct an array indexed by all possible characters, that is, with index range ('a', 'z'). The count associated with each character is initially 0. For each occurrence of a character, we want to increase its count by 1. We can achieve this by adding the 1 associated with each character in [(x, 1) | x <- "mississippi"] to the current count in our array. Here is how we do this using accumArray:

GHCi

>>> accumArray (+) 0 ('a', 'z') [(x, 1) | x <- "mississippi"]
array ('a','z') [('a',0),('b',0),('c',0),('d',0),('e',0),('f',0),('g',0),('h',0),('i',4),
('j',0),('k',0),('l',0),('m',1),('n',0),('o',0),('p',2),('q',0),('r',0),('s',4),('t',0),
('u',0),('v',0),('w',0),('x',0),('y',0),('z',0)]

This lists the array entries for all indices. Most of these entries are 0 because the character does not occur in our input text "mississippi". The entry for 'i', however, is 4, because there are 4 occurrences of 'i' in "mississippi". The entry for 'm' is 1.

Now, we don't want the array but the list of character-count pairs. We extract this list using the assocs function:

GHCi

>>> assocs $ accumArray (+) 0 ('a', 'z') [(x, 1) | x <- "mississippi"]
[('a',0),('b',0),('c',0),('d',0),('e',0),('f',0),('g',0),('h',0),('i',4),('j',0),('k',0),
('l',0),('m',1),('n',0),('o',0),('p',2),('q',0),('r',0),('s',4),('t',0),('u',0),('v',0),
('w',0),('x',0),('y',0),('z',0)]

And finally, we want to discard all entries from this list where the count is 0, that is, we want to keep only those pairs in this list whose second components are not 0:

GHCi

>>> filter ((0 /=) . snd) $ assocs $ accumArray (+) 0 [(x, 1) | x <- "mississippi"]
[('i',4),('m',1),('p',2),('s',4)]

Let's define this as a function so we can apply it to many different input strings:

GHCi

>>> :{
  | countOccurrences xs = filter ((0 /=) . snd)
  |                     $ assocs
  |                     $ accumArray (+) 0 ('a', 'z') [(x, 1) | x <- xs]
  | :}
>>> countOccurrences "mississippi"
[('i',4),('m',1),('p',2),('s',4)]
>>> countOccurrences "hello"
[('e',1),('h',1),('l',2),('o',1)]
>>> countOccurrences "abracadabra"
[('a',5),('b',2),('c',1),('d',1),('r',2)]
>>> countOccurrences "Mississippi"
*** Exception: Ix{Char}.index: Index ('M') out of range (('a','z'))

Alas, the last example didn't work because the input contained a character we didn't expect, an uppercase 'M'. But that was to be expected because we only have array indices between 'a' and 'z', so accumArray didn't know what to do with the entry ('M',1) in its input list. Here's how we can make this function safer. It simply ignores characters that it doesn't have an array slot for:

GHCi

>>> :{
  | countOccurrences xs = filter ((0 /=) . snd)
  |                     $ assocs
  |                     $ accumArray (+) 0 ('a', 'z') [(x, 1) | x <- xs, inRange ('a', 'z') x]
  | :}
>>> countOccurrences "Mississippi"
[('i',4),('p',2),('s',4)]

The trick is the condition inRange ('a', 'z') x in the list comprehension, which keeps only those characters in xs that are in the range between 'a' and 'z'. inRange :: (i, i) -> i -> Bool is a function provided by the Ix type class. It checks whether its second argument is in the range given as the first argument.