Primer on Data Encoding

by daniel packard

A Primer on Number Systems and Data Representation

Encoding the same data in different ways requires an understanding of how different encodings represent the same information. For example, the number 127 (i.e. the number one-hundred-twenty-seven, as represented in decimal, or base10) can also be represented in other numerical bases.

The following are all different representations of the same data:

  • Decimal (base10) = 127
  • Binary (base2) = 0b01111111
  • Hexadecimal (base16) = 0x7f
  • Base64 = fw==

A detailed explanation of the decimal (base10) representation of a number

In decimal, numbers are represented with ten unique symbols:

  • 0, 1, 2, 3, 4, 5, 6, 7, 8, 9

This makes it easy to represent numbers from zero to nine inclusive... After nine, you run out of symbols. The only way to keep going without introducing additional symbols is to add a second digit in your representation. The second digit represents an order of magnitude increase in value (in the case of base10, "one order of magnitude" means a 10x increase).

The number ten, represented by its decimal form 10, means something like... "I counted all the way up to the number of symbols I have, and I ran out of symbols one time."

The number thirteen, in decimal 13, means... "I counted up to the number of symbols I have one time... and then I counted three more."

The number twenty-seven, 27, means "I counted up to the number of symbols available, ran out twice, and then counted seven more."

This works up to ninety-nine... but then you add a third digit to your representation to represent one hundred as 100.

Through this process of counting and grouping, it becomes clear that your representation builds up into a sequence of digits, each representing an "order of magnitude" increase in the number it represents. In base10, you run out of symbols in groups of ten, so each "order of magnitude" is a multiplier of ten (1, 10, 100, 1000, etc).
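Here's a minimal C sketch of that idea: peeling the digits off a decimal number, one order of magnitude at a time. (The number 127 is used here just to match the running example.)

    #include <stdio.h>

    int main(void) {
        int n = 127;
        int place = 1;  // value of the current column: 1, 10, 100, ...

        // Peel off digits from least to most significant;
        // the printed contributions (7 + 20 + 100) sum back to 127.
        while (n > 0) {
            int digit = n % 10;
            printf("%d x %d = %d\n", digit, place, digit * place);
            n /= 10;
            place *= 10;
        }
        return 0;
    }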

There are less detailed explanations for a few other bases ahead.

Binary (base2)

Binary has two digits: 0, 1

Therefore you run out of digits every other count, and each "order of magnitude" is a 2x increase. The numerical value attributed to each column in the binary representation doubles with each step (1, 2, 4, 8, 16, etc).

[figure: the decimal number 127 represented in binary]
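Spelled out, that breakdown is just arithmetic: each 1 in 0b01111111 contributes its column's value, and the contributions sum back to 127.

    0b01111111 = 0*128 + 1*64 + 1*32 + 1*16 + 1*8 + 1*4 + 1*2 + 1*1
               = 64 + 32 + 16 + 8 + 4 + 2 + 1
               = 127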

Note: when writing binary numbers, it's conventional to prefix them with 0b and pad them to a width of four or eight digits with leading zeros, e.g.

  • 127 = 0b01111111
  • 8 = 0b00001000
  • 3 = 0b00000011

More on why we use four or eight digits later.
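printf has traditionally had no conversion specifier for binary (C23 finally adds %b), so if you want to generate these padded strings yourself you have to peel the bits off manually. Here's a minimal sketch, using a helper defined just for this example:

    #include <stdint.h>
    #include <stdio.h>

    // Print a byte as 0b-prefixed, zero-padded, eight-digit binary.
    void print_binary(uint8_t value) {
        printf("0b");
        for (int bit = 7; bit >= 0; bit--) {
            putchar(((value >> bit) & 1) ? '1' : '0');
        }
        putchar('\n');
    }

    int main(void) {
        print_binary(127);  // 0b01111111
        print_binary(8);    // 0b00001000
        print_binary(3);    // 0b00000011
        return 0;
    }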

Hexadecimal (base16)

Hexadecimal has sixteen symbols: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f

Therefore you can count all the way up to fifteen (i.e. the number represented by the hexadecimal symbol f) before running out of symbols and adding a second digit.

[figure: the decimal number 127 represented in hexadecimal]
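Spelled out: the 7 sits in the sixteens column and the f (fifteen) sits in the ones column.

    0x7f = 7*16 + 15*1
         = 112 + 15
         = 127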

Note: it's conventional to write hexadecimal numbers using the prefix 0x and padding on the left with zeros to two characters.

e.g.

  • 127 = 0x7f
  • 8 = 0x08
  • 3 = 0x03

More on why hex numbers are written with two digits later...
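In C, printf can produce this representation directly via the standard %02x conversion (the 0x prefix is added by hand here). A quick sketch:

    #include <stdio.h>

    int main(void) {
        // %02x: hexadecimal, zero-padded on the left to two characters
        printf("0x%02x\n", 127);  // 0x7f
        printf("0x%02x\n", 8);    // 0x08
        printf("0x%02x\n", 3);    // 0x03
        return 0;
    }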

A Primer on Data Representation in Computers

This post focuses on data types and encodings found in C-style languages (char, uint8_t, etc).

Why binary?

Modern computers run on digital hardware — every transistor is either ON (1, HIGH, etc) or OFF (0, LOW, etc). This electrical implementation detail makes binary the most natural internal representation for data in a computer. If transistors had three states, a ternary representation would work better.

Each digit in a binary chunk of data is called a bit - it represents a single on/off or 0/1 value.

Bytes

For practical reasons, data in a computer is not addressed as individual bits; it's more efficient to access and load data in larger chunks. Instead, bits are grouped together into addressable chunks of memory called bytes.

For historical reasons, a byte contains eight bits — that's enough to represent the counting numbers from 0 to 255. When you overlay a "text encoding" on top of those 256 unique data points, that's more than enough to represent all of the characters in the Western alphabet and all punctuation.

Note: some (much improved) text encodings, such as UTF-16 and UTF-32, use more than one byte per character, which enables international character support, emoji, and more.

The Interpretation of Bytes

As Unsigned Decimal Integers:

Because of the nature of bytes and binary arithmetic described above, one very common interpretation of bytes is a direct map onto the decimal counting numbers (i.e. integers greater than or equal to zero: 0, 1, 2, etc) ...

  • 0b00000000 = 0
  • 0b00000001 = 1
  • ...
  • 0b11111110 = 254
  • 0b11111111 = 255

... up to the maximum binary number that can be represented using eight bits (2^8 - 1 = 255).

This set of integers (0 .. 255) is sometimes called the unsigned eight-bit integers. In C/C++, it is referred to by the type annotation uint8_t.
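A minimal sketch of uint8_t in action (the type comes from the standard stdint.h header):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint8_t b = 0xff;   // all eight bits set: 0b11111111
        printf("%u\n", b);  // prints 255, the largest value a byte can hold

        b = b + 1;          // 256 doesn't fit in eight bits...
        printf("%u\n", b);  // ...so the stored value wraps around to 0
        return 0;
    }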

As Unsigned Hexadecimal Integers:

Each of the sixteen hexadecimal symbols can be represented in binary using four bits.

  • 0b0000 = 0x0 (0)
  • 0b0001 = 0x1 (1)
  • ...
  • 0b1110 = 0xe (14)
  • 0b1111 = 0xf (15)

Therefore each byte can be thought of as a pair of hexadecimal digits. Since each hex digit represents half a byte, it's sometimes cleverly referred to as a nibble. Each byte then consists of two hex digits: a "high nibble" and a "low nibble".

  • 0b0000 0000 = 0x00
  • 0b0000 0001 = 0x01
  • ...
  • 0b0001 0001 = 0x11
  • 0b0001 0010 = 0x12
  • ...
  • 0b1111 1110 = 0xfe
  • 0b1111 1111 = 0xff
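In C, splitting a byte into its two nibbles is a matter of shifting and masking. A small sketch:

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint8_t byte = 0xfe;                // 0b1111 1110

        uint8_t high = (byte >> 4) & 0x0f;  // high nibble: shift the top four bits down
        uint8_t low  = byte & 0x0f;         // low nibble: mask off the top four bits

        printf("0x%x 0x%x\n", high, low);   // prints "0xf 0xe"
        return 0;
    }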

As Characters (Text Encoding):

When bits/bytes are interpreted as characters, it is done using a text encoding (basically a map from the counting numbers onto printable text characters). For a loooong time, ASCII dominated the text-encoding world. ASCII maps the lower half of the unsigned 8-bit integers (0 to 127) back onto the Western alphabet and punctuation.

  • 48 = '0'
    • (i.e. the byte corresponding to the decimal unsigned integer 48 can be interpreted as the ASCII character '0')
  • 65 = 'A'
  • 66 = 'B'
  • ...
  • 97 = 'a'
  • 98 = 'b'
  • ...
  • etc.

See the [full table of ASCII character mappings here][0]

It's important to note that the ASCII text encoding does not map (as one might expect) the numerical values 0,1,2... onto their character equivalents '0', '1', '2'.... Instead, the digit characters start at 48: '0' is 48, '1' is 49, and so on.
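That offset is easy to see in C, where a char is just a small integer. A quick sketch:

    #include <stdio.h>

    int main(void) {
        printf("%d\n", '0');      // the character '0' as an integer: prints 48
        printf("%c\n", 65);       // the integer 65 as a character: prints A
        printf("%c\n", '0' + 5);  // shifting within the digit characters: prints 5
        return 0;
    }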

One final note on text encodings and eight-bit integers: for eight-bit characters, such as those from ASCII or UTF-8, the C/C++ type annotation is char. Both char and uint8_t occupy the same number of bits in memory, so they are sometimes used interchangeably in C/C++ programs. But if you're being a pedant (as I often am), uint8_t refers to an unsigned eight-bit integer, and char refers to a text-encoded printable character.
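To see that duality directly, here's the same byte interpreted both ways (the value 104 is chosen only because it maps to a printable ASCII character):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint8_t n = 104;    // one byte of data
        char    c = n;      // the same bits, viewed as a character

        printf("%u\n", n);  // interpreted as an unsigned integer: prints 104
        printf("%c\n", c);  // interpreted as an ASCII character: prints h
        return 0;
    }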
