Points on plane compression
Background
On our website we need a “live” 2D points viewer. Probably the best-known example of such a control is Google Maps, but ours is much simpler. While it does need zooming and panning, it mostly presents a single kind of object, like “homes”, whose appearance is defined solely by its 2-dimensional coordinates (x, y). But there are a couple of thousand of them; 10K - 100K is the actual range. For the rest of the post we’ll assume 50K points as a number close to the middle of that range.
Previously, for a rich client, the viewer was built with OpenGL. After the move to the web, a nice and simple control was developed on top of the canvas element, powered by a standard JavaScript library running in Chrome with no extensions. After some not-too-heavy optimizations it became very fluid and responsive in terms of presentation, pan and zoom times.
And with all that speed, we suddenly became aware of communication times. The rich client had loaded all the data, including the points, at startup; but now loading the web page with the viewer started taking dozens of seconds, and one generally doesn’t wait that long for a browser. Several factors contribute to that time: reading the data from the database, encoding it in JSON, sending it over HTTP and decoding it back. The data sent came to 20-25MB, as each point has some fields in addition to its coordinates. The issue was temporarily mitigated with compression, which reduced the total bulk to 2MB. But can one do much better for this specific kind of data?
The task
Suggest an optimal representation for 50K 2D points whose coordinates on both axes are in the range [-30,000 .. 30,000] (here I cheat a little; see below why this range is so convenient). The result should be generated on the server side, parsed on the client side and used by the points viewer, which must be able to pan and zoom. The solution should make the viewer startup time as fast as possible.
Approaches
First of all, the normal or native representation might be a CSV file with two comma-separated doubles in each line. Something like:
6672.1173417,6023.36126150
940.5201417,4767.60126150
...
4653.0588,1881.1905
This comes to about 1.5MB for 50K lines and can serve as a baseline.
Image
Probably the easiest compression would be a PNG file with all the points already depicted. The PNG format uses lossless compression, is easy to generate, doesn’t require parsing and easily supports pan and zoom. Unfortunately, image quality quickly degrades when zooming in. The generated image can be made very large, which allows for better zooming but at the cost of increased file size. A pretty good-looking picture fits into a 100KB file, but you can’t really enlarge it.
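For illustration, such an image could be produced with any server-side imaging library; here is a minimal sketch in Python with Pillow (the library and names are my assumption, the original stack isn’t specified):

from PIL import Image, ImageDraw

def points_to_png(points, path, size=2048, lo=-30000, hi=30000):
    # Map the coordinate range onto a size x size bitmap;
    # zoom quality is capped by `size`.
    img = Image.new("L", (size, size), 255)        # white, 8-bit grayscale
    draw = ImageDraw.Draw(img)
    scale = (size - 1) / (hi - lo)
    for x, y in points:
        px = int((x - lo) * scale)
        py = (size - 1) - int((y - lo) * scale)    # flip y so larger y is higher
        draw.point((px, py), fill=0)               # one black pixel per point
    img.save(path)                                 # PNG compression is lossless

# points_to_png([(6672.12, 6023.36), (940.52, 4767.60)], "points.png")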
Binary format
Double numbers are long in textual representation, while occupying only 8 bytes in binary. Given that, we may simply output the number pairs without any delimiters: 8 bytes for the x coordinate, 8 bytes for y, and so on. In total: 8 bytes * 2 coordinates * 50K points = 800KB.
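A sketch of such an encoder (Python is used here purely for illustration; on the client a Float64Array or DataView over the response buffer would read the pairs back):

import struct

def pack_doubles(points):
    # x and y as consecutive little-endian 64-bit floats, no delimiters
    out = bytearray()
    for x, y in points:
        out += struct.pack("<dd", x, y)   # 8 bytes + 8 bytes per point
    return bytes(out)

# 50,000 points * 16 bytes = 800,000 bytes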
Approximate doubles with ints
Next, when presenting many points in a wide rectangle, you don’t really need the precision of a double, unless one gets to very fine zooming, which we, in our specific project, don’t need. So doubles can simply be replaced with integers, and even - note the range! - with signed short integers of 2 bytes each. I’m using here some details specific to the project, but this can easily be generalized: simply rescale a different set of points to fit into short numbers. In any case, such precision is mostly enough for modern screen resolutions. So now we stand at 2 bytes * 2 coordinates * 50K points = 200KB. Not bad progress. So far, so good!
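The change to the encoder is trivial; a sketch under the same illustrative assumptions as above:

import struct

def pack_shorts(points):
    # Round to integers and store as signed 16-bit values; the coordinates are
    # assumed to already fit into [-30000 .. 30000] (otherwise rescale first).
    out = bytearray()
    for x, y in points:
        out += struct.pack("<hh", round(x), round(y))   # 2 bytes + 2 bytes per point
    return bytes(out)

# 50,000 points * 4 bytes = 200,000 bytes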
Split into 256x256 squares
Is it possible to use only one byte per coordinate, that is, to reduce to the range [0 .. 255]? For this we should split the plane into 256x256 squares. Assuming our points don’t require the entire range [-30000 .. 30000] but crowd inside [-1280 .. 1280], they fit precisely into 100 squares of 256x256 each. Inside each such square we need only one byte per coordinate. To define a square, one needs a header: (X, Y) for the square corner and the number of points in the square, which delimits its region in the output stream. Then an array of twice that number of bytes follows, like this:
X,  Y,  n:  x1, y1; x2, y2; ...; xn, yn
2B  2B  2B  1B  1B  1B  1B       1B  1B
X,  Y,  n:  x1, y1; x2, ...
2B  2B  2B  1B  1B  1B
The number of points in one square is less than 50K, hence fits nicely into two bytes. In total we have 100 squares * 6 bytes per square header = 600B. Then all the points together require 1B * 2 coordinates * 50K points = 100KB, which together with the headers gives about 101KB.
The encoding part was implemented and can be found in this Gist.
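For the record, the idea boils down to something like the following sketch (an illustration of the format in Python, not the Gist code itself):

import struct
from collections import defaultdict

def encode_squares(points):
    # Group integer points by the bottom-left corner of their 256x256 square,
    # then emit a 6-byte header (X, Y, n) followed by one byte per coordinate.
    squares = defaultdict(list)
    for x, y in points:
        x, y = round(x), round(y)
        cx, cy = (x // 256) * 256, (y // 256) * 256
        squares[(cx, cy)].append((x - cx, y - cy))      # offsets in [0 .. 255]
    out = bytearray()
    for (cx, cy), pts in squares.items():
        out += struct.pack("<hhH", cx, cy, len(pts))    # X: 2B, Y: 2B, n: 2B
        for dx, dy in pts:
            out += struct.pack("BB", dx, dy)            # x: 1B, y: 1B
    return bytes(out)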
Split into 16x16 sub-squares
Can we reduce even further? Half a byte per coordinate, one byte for an entire 2D point, going down to the [0 .. 15] range. Let’s assume a distribution of points that is close to uniform. Then the 50K points split into about 500 per 256x256 square. Inside each square there are 256 sub-squares of 16x16, with 2-3 points in every sub-square. Under the near-uniformity assumption we may take every sub-square to be occupied, hence we can write each of them out in some fixed order, e.g. from left to right, from bottom to top. Then the only header needed is the number of points in the sub-square, like this:
(0, 0)     n:  x1, y1; x2, y2; ...; xn, yn
           4b  4b  4b  4b  4b       4b  4b
(0, 16)    n:  ...
(0, 32)    n:  ...
...
(240, 240) n:  ...
where 4b stands for 4 bits, or half a byte, and the sub-square corners in parentheses are implied by the traversal order rather than stored. Now there are 100 squares * 256 sub-squares in each * 0.5B per sub-square header = 13KB. Adding 1KB for the square headers and 100KB / 2 = 50KB for the points themselves leaves us with a mere 64KB in total!
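A sketch of packing one 256x256 square this way (again only an illustration; the 4-bit count silently assumes no sub-square holds more than 15 points, in line with the near-uniformity assumption):

def encode_subsquares(square_points):
    # square_points: (x, y) offsets inside one 256x256 square, each in [0 .. 255].
    # For every 16x16 sub-square, in a fixed order, emit a 4-bit point count
    # followed by 4-bit x and y offsets; pack the nibbles two per byte.
    subs = [[] for _ in range(256)]
    for x, y in square_points:
        subs[(y // 16) * 16 + (x // 16)].append((x % 16, y % 16))
    nibbles = []
    for pts in subs:
        nibbles.append(len(pts))          # sub-square header: the count n
        for dx, dy in pts:
            nibbles.extend((dx, dy))
    if len(nibbles) % 2:
        nibbles.append(0)                 # pad to a whole number of bytes
    return bytes((nibbles[i] << 4) | nibbles[i + 1]
                 for i in range(0, len(nibbles), 2))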
Sort then compress
It can be noticed that, in the end, the ability to spend fewer bits on each single point was achieved by organizing the whole set into smaller units covering a smaller area. And this is the essence of the Huffman code on which popular archiving algorithms are based. So perhaps we don’t need any code of our own? Maybe the usual compression programs will do? Indeed, take the sample file ints.csv with 50K lines and a size of 500KB, looking like:
-960,1619
2592,805
...
4382,3104
Next, compress it at the maximal compression level, then sort it and compress again:
zip -9 ints.zip ints.csv # 231KB
7zr a ints.7z ints.csv # 210KB
cat ints.csv | sort -n > ints.sort.csv # 516KB
zip -9 ints.sort.zip ints.sort.csv # 188KB
7zr a ints.sort.7z ints.sort.csv # 126KB
Meaning, 7-zip indeed succeeds in exploiting the sorted data and reaches a result comparable with our 256x256-square encoding: 126KB. One might expect even better results if the file were arranged the way the encodings arrange it, so that all points belonging to the same 256x256 square sit next to one another. But the experiment shows the opposite: the archive size grows to about 150KB.
An idea: describe with “patterns”
A completely different approach would be to discover “patterns” in the points and describe them in a compact manner. Here probably even more precision would be lost, but a more compact encoding could be achieved, and the decoded picture would likely look very similar to the real view and be easily resizable. In any event, this is just an idea and hasn’t been thoroughly checked yet.
Summary
Apparently the optimal solution, requiring minimal investment while reaching a good compression level, is “sort and zip”. Every modern browser supports compression via the Content-Encoding: gzip HTTP header, and it is seamless for the application on the client side. So virtually the only thing needed for the implementation is to sort the data. To make things even faster, the precompressed data can be saved in advance in the database.
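So the server-side preparation can be as small as the following sketch (Python for illustration; the web layer is then assumed to send the stored bytes with the Content-Encoding: gzip header):

import gzip

def precompress(points):
    # Sort the integer points, render them as CSV and gzip once;
    # the result can be stored in the database and served as-is.
    rows = sorted((round(x), round(y)) for x, y in points)
    csv = "\n".join("%d,%d" % row for row in rows).encode("ascii")
    return gzip.compress(csv, compresslevel=9)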