A Note on W-plus String Encoding
Since a string can in general contain arbitrary bytes, there are
various reasons why we may want to encode it into a more manageable
format, such as some form of text. Encodings of this kind are
sometimes called binary-to-text encoding schemes. We also usually want
to be able to decode an encoded string back into its original form. A
particularly useful encoding would be one that encodes a binary string
into a string of letters, digits, and the underscore character (`_'),
since these are rarely treated as special characters, or so-called
meta-characters.
Strings encoded in this way can be used, for example, as parts of
identifiers in programming languages or as parts of URLs. They are
matched by the regular expression /\w+/, and for this reason let us
call encodings of this kind w-plus encodings. It is easy to come up
with such encodings; for example, we could simply encode each byte as
a pair of hexadecimal digits, which we could call the hexadecimal
encoding. This kind of encoding would not be very intelligible, in the
sense that an obvious textual string such as "hello world" would be
encoded in an unrecognizable way: "68656C6C6F20776F726C64". We will
consider a w-plus encoding intelligible if it preserves most of the
original letters, digits, and underscore characters. Additionally, we
would prefer an encoding that is more economical in length than the
hexadecimal encoding, which always doubles the length of the string.
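For comparison, the hexadecimal encoding mentioned above can be sketched
in Perl as follows. The function names encode_hex and decode_hex are
hypothetical and are used here only for illustration; the sketch assumes
the input is a byte string.

```perl
use strict;
use warnings;

# Hexadecimal encoding: every byte becomes two uppercase hex digits.
sub encode_hex {
    my ($s) = @_;
    return uc(unpack("H*", $s));
}

# Inverse: every pair of hex digits becomes one byte again.
sub decode_hex {
    my ($h) = @_;
    return pack("H*", $h);
}

my $text = "hello world";
my $hex  = encode_hex($text);

print "$hex\n";                             # 68656C6C6F20776F726C64
print length($hex) / length($text), "\n";   # 2
print decode_hex($hex), "\n";               # hello world
```

The result consists only of word characters, but none of the original
letters survive, and the length is exactly doubled.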
W-plus Encoding
One simple encoding that is more intelligible and more space-preserving
than the hexadecimal encoding is one in which we preserve all word
characters (letters, digits, and underscore) except one, which we use
as the `escape' character. Every other byte, including the escape
character itself, is written as the escape character followed by the
two hexadecimal digits of its code. One obvious choice for the escape
character would be the underscore; however, the lowercase letter `x'
is also a good choice, since it appears relatively infrequently in
typical English text and it conveniently reminds us of the hexadecimal
code used in the next two characters. It is also used in Perl, C, and
some other languages to indicate hexadecimal numbers, as in `0x1f'.
So, the W-plus encodings of the strings "hello world" and
"hexadecimal numbers" would be "hellox20world" and
"hex78adecimalx20numbers", respectively.
W-plus Encoding in Perl
Another advantage of the W-plus encoding is that encoding and decoding are
very easy to implement with minimal code in Perl (and also in C, and
likely in other languages). W-plus encoding of a string held in $_ can
be performed in Perl with the following substitution:
s/[\Wx]/'x'.uc unpack("H2",$&)/ge;
or with the following function encode_w:
sub encode_w {
    local $_ = shift;
    # replace each non-word character and each 'x' by 'x' + two hex digits
    s/[\Wx]/'x'.uc unpack("H2",$&)/ge;
    return $_;
}
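As a quick check, the following sketch applies the same substitution to
the two example strings from the previous section. It is written with a
lexical variable and a capture group instead of $_ and $&, which is only
a stylistic variation, and it assumes the input is a byte string; the
third test string is made up for illustration.

```perl
use strict;
use warnings;

# Same rule as in the text: every non-word byte and every 'x' becomes
# 'x' followed by two uppercase hexadecimal digits.
sub encode_w {
    my ($s) = @_;
    $s =~ s/([\Wx])/'x' . uc(unpack("H2", $1))/ge;
    return $s;
}

for my $s ("hello world", "hexadecimal numbers", "a_b-c") {
    print "$s => ", encode_w($s), "\n";
}
# hello world => hellox20world
# hexadecimal numbers => hex78adecimalx20numbers
# a_b-c => a_bx2Dc
```

For strings that may contain characters above code point 255, one would
probably encode them to bytes (for example as UTF-8) first, since the
scheme works on single bytes.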
W-plus Decoding in Perl
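W-plus decoding of a string held in $_ can be done in Perl with the
following substitution (the unsigned template "C" is used so that
byte values above 127 do not trigger a warning):
s/x([0-9A-Fa-f][0-9A-Fa-f])/pack("C",hex($1))/ge;
or with the following function decode_w:
sub decode_w {
    local $_ = shift;
    # replace each 'x' + two hex digits by the byte with that code
    s/x([0-9A-Fa-f][0-9A-Fa-f])/pack("C",hex($1))/ge;
    return $_;
}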
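Since decoding is meant to invert encoding, a round-trip check over all
256 byte values is a cheap sanity test. The following is a minimal
sketch under the assumption that strings are byte strings; it repeats
both functions so that it runs on its own, and it uses the unsigned
"C" template in pack so that byte values above 127 do not cause
warnings.

```perl
use strict;
use warnings;

sub encode_w {
    my ($s) = @_;
    $s =~ s/([\Wx])/'x' . uc(unpack("H2", $1))/ge;
    return $s;
}

sub decode_w {
    my ($s) = @_;
    $s =~ s/x([0-9A-Fa-f]{2})/pack("C", hex($1))/ge;
    return $s;
}

# A byte string containing every value from 0 to 255 once.
my $all_bytes = join '', map { chr } 0 .. 255;
my $encoded   = encode_w($all_bytes);

print "encoded matches /\\w+/\n" if $encoded =~ /\A\w+\z/;
print "round trip ok\n"          if decode_w($encoded) eq $all_bytes;
```

The decoding is unambiguous because every literal `x' in the original
is itself escaped, so an `x' in the encoded string always starts an
escape sequence.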
created: 2020-05-17, last update: 2020-05-18