The previous blog post in this series introduced the Huffman algorithm and the data structures used to implement it. Be sure to check out that post if you missed it!
In this post we will explore how to leverage these data structures and some interesting features of Elixir to complete the implementation of the Huffman algorithm. The features we will be using are: binary pattern matching, iolists, streams, and recursion.
You can find the full source of the program, with comments, typespecs, and unit tests, on GitHub.
Let’s get started!
Counting Characters
As you may recall, the Huffman algorithm compresses text by assigning shorter encodings to characters that occur more frequently in the input text. Let’s create a file lib/huffman/counter.ex and implement a character-counting module.
We will be implementing the function count/1, which takes a list of binaries and returns a map of character => count entries. Let’s do some initial exploration of how we can accomplish this in an iex shell:
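The original session isn’t shown here, but something along these lines illustrates the idea using String.split/3 (note the empty strings at the ends, which we will treat as “garbage” and filter out):

```elixir
iex> String.split("go go gophers", "")
["", "g", "o", " ", "g", "o", " ", "g", "o", "p", "h", "e", "r", "s", ""]
iex> String.split("go go gophers", "", trim: true)
["g", "o", " ", "g", "o", " ", "g", "o", "p", "h", "e", "r", "s"]
```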
Looking good! Now we’re ready to count all of these characters. Let’s start our implementation of the Huffman.Counter module:
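A sketch of what that module can look like (the clause bodies are paraphrased from the description below; def is used instead of defp here so the helper is easy to poke at in iex):

```elixir
defmodule Huffman.Counter do
  # Base case: the input list is exhausted, so return the accumulated counts.
  def count_helper([], acc), do: acc

  # Recursive case: bump the count for the head character (defaulting to 1),
  # then recurse on the tail of the list.
  def count_helper([head | tail], acc) do
    count_helper(tail, Map.update(acc, head, 1, &(&1 + 1)))
  end
end
```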
The function count_helper/2 takes a list and an accumulator (in our case, a map) as parameters. Notice that the function is recursive. In count_helper([], acc) we have finished consuming characters from the input list and can return the accumulator. In the other clause of count_helper/2 we match on the head and tail of the list (the first element and the remaining list, respectively), update the count of the head character (or use 1 as a default) using Map.update/4, then continue the recursion with the remaining list. Let’s implement a function which will call our helper:
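A first pass at count/1 might look like this (a sketch; count_helper/2 is repeated so the snippet stands alone):

```elixir
defmodule Huffman.Counter do
  # Lazily split each binary into characters, drop the empty "garbage"
  # strings that String.split/2 produces, then count each binary's characters.
  def count(bin_list) do
    bin_list
    |> Stream.map(&String.split(&1, ""))
    |> Stream.map(fn chars -> Enum.filter(chars, &(&1 != "")) end)
    |> Stream.map(&count_helper(&1, %{}))
  end

  def count_helper([], acc), do: acc

  def count_helper([head | tail], acc) do
    count_helper(tail, Map.update(acc, head, 1, &(&1 + 1)))
  end
end
```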
There are a few new things going on here. We use the Stream module to do lazy enumeration of our input. Using Stream lets us create a “recipe” of computation that will be evaluated only when required. I won’t be covering the intricacies of lazy and eager enumeration here; there are plenty of resources online if you are curious. What is important to know is that the Stream module allows us to enumerate large, potentially infinite, input data without blowing up the program’s memory usage. Let’s try to count some characters:
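Something like this happens (the exact inspect output of the Stream struct will differ):

```elixir
iex> Huffman.Counter.count(["go go gophers"])
#Stream<[enum: ["go go gophers"], funs: [#Function<...>, #Function<...>, #Function<...>]]>
```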
Whoops! Remember, streams are not evaluated until required. Let’s enumerate the Stream into a list:
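With two binaries in the input list you can see the per-binary count maps (a sketch of the session):

```elixir
iex> ["go go ", "gophers"] |> Huffman.Counter.count() |> Enum.to_list()
[
  %{" " => 2, "g" => 2, "o" => 2},
  %{"e" => 1, "g" => 1, "h" => 1, "o" => 1, "p" => 1, "r" => 1, "s" => 1}
]
```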
There’s still a problem! We counted the characters for each binary in the input list; now we need to merge the maps together:
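One way to write the merge (a sketch; Map.merge/3 resolves colliding keys by summing their counts):

```elixir
defmodule Huffman.Counter do
  def count(bin_list) do
    bin_list
    |> Stream.map(&String.split(&1, ""))
    |> Stream.map(fn chars -> Enum.filter(chars, &(&1 != "")) end)
    |> Stream.map(&count_helper(&1, %{}))
    # Reducing with merge_maps/2 also forces the stream to be evaluated.
    |> Enum.reduce(&merge_maps/2)
  end

  # Sum the counts for any character that appears in both maps.
  defp merge_maps(map1, map2), do: Map.merge(map1, map2, fn _char, c1, c2 -> c1 + c2 end)

  defp count_helper([], acc), do: acc

  defp count_helper([head | tail], acc) do
    count_helper(tail, Map.update(acc, head, 1, &(&1 + 1)))
  end
end
```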
Notice that we added a call to merge_maps/2 inside of Enum.reduce/2 at the end of count/1. This will evaluate the stream for us! Let’s see what the output of count/1 looks like now:
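For our running example, a session along these lines shows the single merged map:

```elixir
iex> Huffman.Counter.count(["go go gophers"])
%{" " => 2, "e" => 1, "g" => 3, "h" => 1, "o" => 3, "p" => 1, "r" => 1, "s" => 1}
```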
Compression
We have most of the pieces we need to compress some data! First, let’s take a look at what we need to accomplish:
- Find the occurrence count for each character in the input
- Generate the Huffman encodings based off the character count
- Replace each character in the input with the corresponding Huffman encoding
- Buffer the Huffman encoded bitstrings into binaries
Let’s open lib/huffman.ex and start implementing compress/1:
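A sketch of the entry points. Note that get_encodings/1 is the tree transformation from the previous post and compressed_output/2 is implemented later in this post, so both are stubbed here (along with a compact counter) just so the snippet compiles:

```elixir
defmodule Huffman.Counter do
  # Compact equivalent of the counter from the previous section.
  def count(bin_list) do
    bin_list
    |> Enum.flat_map(&String.split(&1, "", trim: true))
    |> Enum.frequencies()
  end
end

defmodule Huffman do
  # Helper clause: wrap a lone binary in a list so callers don't have to.
  def compress(bin) when is_binary(bin), do: compress([bin])

  def compress(bin_list) when is_list(bin_list) do
    encodings =
      bin_list
      |> Huffman.Counter.count()
      |> get_encodings()

    compressed_output(bin_list, encodings)
  end

  # Stub: the real get_encodings/1 builds the Huffman tree (previous post).
  defp get_encodings(char_counts), do: char_counts

  # Stub: implemented later in this post.
  defp compressed_output(bin_list, _encodings), do: bin_list
end
```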
The compress(bin) function is a helper so we can call Huffman.compress("go go gophers") without having to worry about passing only lists of binaries to compress/1.
Notice that get_encodings/1 is the exact transformation we saw at the end of the previous blog post. Now, let’s try to replace each character in the input with its corresponding Huffman encoding:
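A sketch of compressed_output/2. The character replacement is inlined here so the snippet stands alone; in the actual module it is the call to IOHelper.encode_characters/2 described next:

```elixir
defmodule Huffman do
  def compressed_output(bin_list, encodings) do
    bin_list
    # Same split-and-filter pipeline as the counter.
    |> Stream.map(&String.split(&1, ""))
    |> Stream.map(fn chars -> Enum.filter(chars, &(&1 != "")) end)
    # Swap each character for its encoding (factored into
    # IOHelper.encode_characters/2 in the real module).
    |> Stream.map(fn chars -> Enum.map(chars, &Map.get(encodings, &1)) end)
    # Evaluate the stream and flatten into a single list of bitstrings.
    |> Enum.to_list()
    |> List.flatten()
  end
end
```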
We start with a transformation similar to the one in Huffman.Counter.count/1 to split the input and clean up “garbage” strings. Then, we call a helper function, IOHelper.encode_characters/2, to replace the input characters with their Huffman encodings. Finally, we flatten the stream into a single list. Let’s see the implementation of the IOHelper.encode_characters/2 function:
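A sketch of the helper:

```elixir
defmodule Huffman.IOHelper do
  # Swap each character for its Huffman encoding by looking it up in the
  # encodings map. Every input character has an entry, since the encodings
  # were generated from this same input.
  def encode_characters(chars, encodings) do
    Enum.map(chars, &Map.get(encodings, &1))
  end
end
```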
This function is pretty simple: we use Enum.map/2 to replace each character using the encodings map we pass as a parameter. However, we have a problem. The result of compressed_output/2 (when enumerated with Enum.to_list/1) will look something like [<<7::size(3)>>, <<13::size(4)>>, <<2::size(2)>>, ...]. We have a list of bitstrings!
In Elixir, strings, binaries, and bitstrings are all related. All are representations of binary (the “bit” kind of binary) data, each with certain guarantees. Strings are the most specific: they are Elixir binaries that are UTF-8 encoded. Binaries, on the other hand, must have a bit length that is a multiple of eight. Finally, bitstrings are the least specific: any grouping of bits is a bitstring.
This means that all strings and binaries are also considered bitstrings. For more information, this is a pretty comprehensive article with nice graphics to help you understand the difference.
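You can see the relationship in iex:

```elixir
iex> is_bitstring(<<7::size(3)>>)   # 3 bits: a bitstring...
true
iex> is_binary(<<7::size(3)>>)      # ...but not a binary (not a multiple of 8 bits)
false
iex> is_binary("abc")               # strings are UTF-8 encoded binaries
true
iex> is_bitstring("abc")            # and every binary is also a bitstring
true
```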
Our problem is that we can only output binaries, but we have bitstrings! It doesn’t matter whether we want to output to stdout or to a file: either way, we need to buffer the bitstrings into binaries.
Buffering Output
Let’s start by looking at the changes we need to make to Huffman.compress/1:
We pass the list of bitstrings into a new function, IOHelper.buffer_output/3, which is responsible for buffering the bitstrings into binaries. Let’s see the implementation:
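A sketch of both functions. One byte is peeled off per iteration, which works out because each encoding in our example is well under a byte long:

```elixir
defmodule Huffman.IOHelper do
  # All input consumed: return the iolist of completed bytes plus whatever
  # (possibly incomplete) bits remain in the buffer.
  def buffer_output([], buffer, iolist), do: {iolist, buffer}

  def buffer_output([head | tail], buffer, iolist) do
    # Append the next encoding onto the end of the buffer.
    buffer = <<buffer::bitstring, head::bitstring>>

    case completed_byte(buffer) do
      {nil, nil} ->
        # No full byte yet -- keep accumulating.
        buffer_output(tail, buffer, iolist)

      {byte, rest} ->
        # A byte is complete: append it to the iolist and carry the rest over
        # as the new buffer.
        buffer_output(tail, rest, [iolist, byte])
    end
  end

  # Binary pattern matching: peel one full byte off the front of the buffer,
  # catching the remainder in rest.
  def completed_byte(<<byte::size(8), rest::bitstring>>), do: {byte, rest}
  def completed_byte(_), do: {nil, nil}
end
```

In compress/1 the call looks like `|> IOHelper.buffer_output(<<>>, [])`, starting with an empty buffer and an empty iolist.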
Let’s look at what each parameter of buffer_output/3 is used for. The first parameter is the input list of bitstrings. The second parameter, buffer, accumulates bitstrings until they form complete bytes. Whenever a byte is completed, we append it to the third parameter, iolist.
The function starts by matching on the head (the bitstring at the front of the list) and tail (the remainder of the bitstring list). The head is appended to the buffer. Then, we check whether the buffer contains a completed byte using completed_byte/1.
The completed_byte/1 function makes use of binary pattern matching. We try to match a full byte and catch the remainder in rest, returning the completed byte and the leftover bitstring. If there is no completed byte at the front of the buffer, we simply return {nil, nil}.
If we have completed a byte, we append it onto iolist, and the leftover bitstring becomes the buffer for the next recursive iteration. If no byte was completed, we carry the current buffer and iolist forward into the next iteration.
Notice that we are using an iolist to join the completed binaries. This way we avoid the usual penalty of appending to the end of a list, which is why you should reach for an iolist whenever you are building up output. Keep in mind that iolists are improper lists, so most of the standard list operations will not work as expected on them. That isn’t a problem for us, though, since the functions we will use to output the compressed data can handle improper lists.
We end the recursion when we have exhausted the list of bitstrings. However, there might be something (most likely an incomplete bitstring) left in the buffer. We return {iolist, buffer} and handle the leftover buffer in Huffman.compress/1. Let’s take a look at that:
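A sketch of the padding helper:

```elixir
defmodule Huffman.IOHelper do
  # The leftover buffer is already a whole number of bytes: nothing to do.
  def pad_bitstring(bits) when is_binary(bits), do: bits

  # Otherwise, pad the least significant bits with zeros out to a byte boundary.
  def pad_bitstring(bits) do
    pad_len = 8 - rem(bit_size(bits), 8)
    <<bits::bitstring, 0::size(pad_len)>>
  end
end
```

In compress/1 the tuple from buffer_output/3 is destructured and reassembled, roughly as `{iolist, buffer} = ...` followed by `[iolist, IOHelper.pad_bitstring(buffer)]`.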
There are two cases for IOHelper.pad_bitstring/1
: the leftover buffer is either a binary or a bitstring. If it is a binary we can simply return it. Otherwise, we pad the bitstring into a binary with 0’s in the least significant bits.
Okay! We can finally compress input data! Let’s see an example:
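A session along these lines (the exact bytes depend on the generated encodings, so they aren’t reproduced here; what matters is that the iodata flattens into a proper binary):

```elixir
iex> compressed = Huffman.compress("go go gophers")
iex> is_binary(IO.iodata_to_binary(compressed))
true
```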
Let’s check if we actually compressed the data:
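Because :erlang.iolist_size/1 handles improper iolists, we can compare byte counts directly. For "go go gophers" the Huffman-coded output needs 37 bits, which pads out to 5 bytes:

```elixir
iex> byte_size("go go gophers")
13
iex> :erlang.iolist_size(Huffman.compress("go go gophers"))
5
```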
Looks good! We saved 8 bytes with our implementation!
Conclusion
At this point we have completed the implementation of the Huffman algorithm. We managed to take some input data, compute Huffman encodings, and replace the input data with the encodings. However, this binary data isn’t very useful currently. We have no way to get the original data back!
In the next post we will see how we can store some metadata about the input data that was compressed. This metadata will be used to reconstruct the Huffman tree, which can then be used to decompress the data returned by Huffman.compress/1. Keep your eyes peeled for that post (or check out the GitHub repo)!