It's all bytes...
Every developer knows that computers and programming languages operate on bytes. Hard disks are measured in gigabytes, memory is addressed down to a single byte, a double takes 8 bytes, and so on. A byte comprises 8 bits and can hold a value from 0 to 255. But when it comes to some very specific questions and actions we halt and ask Google, then copy ready-made recipes from Stack Overflow without really understanding all the nuances. For example:
- How to read an entire file into a string?
- How are primitive data types stored in memory?
- Why does binary data from one computer look like gibberish on another?
Let's try to describe some basic terms and see how one can express them in C++, Python (2.x) and C#.
This post intends to shed some light on how data represented as bytes looks in different media. But how is data turned into bytes in the first place? That is the job of an encoding, and there is at least one for every type of data. Here encodings are mostly assumed to be known, and relevant details are explained along the way. Also, to keep the post shorter, some advanced aspects of the discussed concepts are deliberately ignored.
Notation
In some cases hexadecimal C-style notation will be used for a byte value: \x00, \x01, …, \xff. So a sequence of 3 bytes with decimal values 10, 16, 254 will be written as \x0a\x10\xfe. This notation serves both for non-printable characters inside strings and for byte arrays.
Basic terms
Endianness
An unsigned "short" 16-bit integer requires 2 bytes. For example, since 1025 = 4 * 256 + 1, the number 1025 can be represented with 2 bytes whose values are 4 and 1. But how exactly do we write them? Is it \x04\x01 or \x01\x04? It turns out that both are possible, and this is what endianness is all about. The first variant, \x04\x01, is called big-endian; it is used by the Motorola 68000 and PowerPC processors and by many network protocols. The second variant, \x01\x04, is called little-endian and is used by Intel x86 and AMD64.
Now stop for a second and think: how is data stored in memory or on disk? Yes, that's it. Usually it is stored in the native binary format the processor operates on. But when data is sent over a network, it may be not in the native order but in the order dictated by the protocol. So your Intel Core i7 processor will take the number 1025 as \x01\x04 and reorder the bytes to \x04\x01 before using it for the Total Length field of an IP datagram and sending it over the Internet. (As a side note, this order is specified by RFC 791 from 1981, but not by the earlier RFC 760 from 1980.)
Similar ordering extends to larger types including 32-bit and 64-bit integers.
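As a quick illustration, here is a minimal C++ sketch (not part of the original example) that prints the raw bytes of the 16-bit value 1025 in memory order: a little-endian machine such as x86 prints 01 04, while a big-endian one prints 04 01.

#include <cstdint>
#include <cstdio>

int main() {
    std::uint16_t n = 1025;  // 0x0401
    // Reinterpret the integer as raw bytes and print them in memory order.
    const unsigned char* p = reinterpret_cast<const unsigned char*>(&n);
    std::printf("%02x %02x\n", p[0], p[1]);  // 01 04 on x86, 04 01 on big-endian
    return 0;
}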
Alignment
Let's move from primitive data types to structs. While the computer addresses single bytes, operations are usually performed on larger chunks: a 64-bit processor natively adds 8-byte numbers, SSE2 instructions take 128-bit operands, cache lines can be 64 bytes long. From all these examples it should be clear that data layout should be optimized for such access patterns. A struct of a 1-byte character and a 2-byte "short" integer needs three bytes in total, but those 3 bytes should fit into the 4-byte chunk that memory access commands operate on. Two memory accesses would be required to load a 3-byte struct that crosses a 4-byte boundary, and this is exactly what happens in an array of 3-byte structs squeezed together.
So the compiler pads the 3 bytes to 4 by adding one meaningless byte. Similarly, the "short" integer inside the struct lands on an even memory address, which facilitates other operations.
All these actions are called alignment, and this is the reason why bytes can appear in "unexpected" locations in memory or in a binary format.
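A simple way to observe such padding is sizeof. The sketch below is only illustrative and the exact sizes depend on the compiler and target ABI, but on a typical x86-64 build the char + short struct takes 4 bytes instead of 3.

#include <cstdio>

struct char_only  { char c; };           // 1 byte
struct char_short { char c; short s; };  // usually 4 bytes: 1 + 1 padding + 2

int main() {
    // On most compilers the short must start at an even address,
    // so one padding byte is inserted after the char.
    std::printf("%zu %zu\n", sizeof(char_only), sizeof(char_short));
    return 0;
}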
Character
This is arguably the most difficult concept of all discussed here. Probably the reason is its illusory simplicity. Indeed, looking at the letter A we know that it maps to the decimal code 65. Every keyboard, even a non-American one, has such a key, and the console knows how to render this letter from the code 65 found in memory or read from a file. Why complicate simple things?
And indeed, mainstream computer languages before the 90's did not make a big deal of characters: a byte stood for a character and a byte array for a string. The name for it is ASCII. But wait, ASCII encodes only 128 characters, not 256, and there are other languages than English. Hmmm. Yes, today we mostly work with Unicode. It covers most alphabets and many other symbols; the latest versions describe more than 110,000 characters, a number far beyond the capacity of a single byte.
The story begins with the term charset (also called code page or charmap). It lists the characters in our domain and gives each of them a number. Some characters, like \n (end-of-line), are functional and not printable; others do not appear in text files at all. There are many extensions of ASCII that add national letters and symbols with numbers 128-255, a perfect fit for a one-byte code. But in the case of Unicode the domain is significantly larger, and in general 4 bytes per character are needed. Encodings define how Unicode characters are stored in a file; UTF-8 is the dominant encoding on the Internet.
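To make this concrete, here is a small C++ sketch (not part of the original examples) that dumps the bytes of a UTF-8 string: the ASCII letter A remains a single byte \x41, while the Cyrillic letter у (U+0443) takes two bytes, \xd1\x83.

#include <cstdio>
#include <string>

int main() {
    // 'A' followed by the UTF-8 bytes of Cyrillic 'у' (U+0443),
    // written as explicit escapes to avoid source-encoding issues.
    std::string s = "A\xd1\x83";
    for (unsigned char b : s)
        std::printf("%02x ", b);  // prints: 41 d1 83
    std::printf("\n");
    return 0;
}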
To code!
C++
Let's start with a simple program:
struct data {
    char c;
    int i;
};

int main() {
    int i = 1025; // = 256*4 + 1 = \x01\x04
    short s = 16;
    char c = 127;
    data d; d.c = 65; d.i = 31;
    return 0;
}
Compile it with g++ -g 1.cpp, fire up the debugger with gdb a.out and execute the following (note: I added comments to each line and tags to the memory output):
(gdb) break 12 // set breakpoint at return statement
(gdb) start // run the program, stop at main entrance
(gdb) continue // run to breakpoint
(gdb) print &i // print memory address of the first variable
$5 = (int *) 0xbffff054
(gdb) x/24b 0xbffff050 // examine 24 bytes in memory starting here
0xbffff050: -60 c:127 s:16-------0 i:1-------4-------0-------0
0xbffff058: d.c:65 -121 4 8 d.i:31-------0-------0-------0
Look how little-endian byte order and alignment manifest themselves in this example. The 4-byte integers are placed at memory addresses divisible by 4, and there is a 3-byte pad between the one-byte d.c and d.i. You may also notice that the compiler reordered the local variables but left the layout of the struct members intact; this is normal behavior. The padding bytes in between are left with random values.
The char data type in C/C++ is one byte long. It is convenient for working with ASCII files or for a direct mapping between memory, a byte array and file content. And since a C string is essentially a byte array, one may easily read an entire file, or part of it, into memory and use it as text or as binary data.
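For completeness, one common idiom for slurping a whole file into a std::string looks like this (file.ext is just a placeholder name):

#include <fstream>
#include <iterator>
#include <string>

int main() {
    // Open in binary mode and copy every byte into the string,
    // so the same code works for text and binary files alike.
    std::ifstream f("file.ext", std::ios::binary);
    std::string s((std::istreambuf_iterator<char>(f)),
                  std::istreambuf_iterator<char>());
    return 0;
}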
Python 2.x
From the "bytes" perspective, Python 2.x is very similar to C. Default strings are also byte arrays, so one can read an entire file with s = open("file.ext").read() and process it as an ASCII string or as binary data. Python is, however, more limited in raw memory access: while in C every single byte is reachable with pointers, Python is closer to humankind. It has a dedicated struct module that packs and unpacks bytes. See for example:
from struct import pack, unpack
p = pack("!ih", 65, 66) # ! = network order; i = integer; h = short
print repr(p) # string '\x00\x00\x00A\x00B'
open('data.dat', 'w').write(p)
s = open('data.dat').read()
t = unpack("!ih", s)
print t # (65, 66)
Note that we specified the byte order (network order, i.e. big-endian), so this script will save the same "binary" data to the file on all architectures. With the same ease one may do other byte manipulations with this module, e.g. split a 32-bit integer into two 16-bit shorts, just like in C.
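A minimal sketch of such a split with the same module (the value 0x00410042 is just an arbitrary example):

from struct import pack, unpack
n = 0x00410042                         # an arbitrary 32-bit integer
hi, lo = unpack("!HH", pack("!I", n))  # repack 4 bytes as two unsigned shorts
print hi, lo                           # 65 66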
Python 2.x also provides a unicode data type for Unicode strings. Given a file with the content
aaa угу bbb
(note the Russian letters in the middle) we may do:
s = open('text.txt').read() # incorrectly reads as ASCII / byte array
print repr(s) # 'aaa \xd1\x83\xd0\xb3\xd1\x83 bbb\n'
import codecs # encoder/decoder
u = codecs.open("text.txt", "r", "utf-8").read()
print repr(u) # correct: u'aaa \u0443\u0433\u0443 bbb\n'
C#
C# is a modern language, and as such it is fully aware of byte quirks and of the difference between a character and a byte. The char data type explicitly represents a Unicode character; it is 16 bits long, which is enough for many applications. So when working with data we have to tell .NET plainly whether raw bytes should be read or written:
public static void Main()
{
    int i = 65; short s = 66;
    var bytes = Enumerable.Concat(
        BitConverter.GetBytes(i),
        BitConverter.GetBytes(s)
    );
    File.WriteAllBytes("bytes.bin", bytes.ToArray());

    byte[] read = File.ReadAllBytes("bytes.bin");
    Console.WriteLine(
        "Read: int {0} and short {1}",
        BitConverter.ToInt32(read, 0),
        BitConverter.ToInt16(read, 4)
    );
}
And what happens if a text file is read into a string? Let's try:
public static void Main()
{
    string s = File.ReadAllText("text.txt"); // same file as in Python
    Console.WriteLine(s); // aaa угу bbb
}
File.ReadAllText() automatically detects the UTF-8 encoding and loads the file content correctly. C# makes the distinction between a byte array and a string clear.
Bottom line
All data is made of bytes. Advanced operations on such basic types as integers, structs or strings require awareness of endianness, alignment and character encodings. We have seen how to handle these concepts in three mainstream programming languages: C++, Python and C#.