Compression and prediction as finite and infinite games
> Men are born for games. Nothing else. Every child knows that play is nobler than work.
Recently I watched a lecture by Ilya Sutskever; his topic was a question: why does unsupervised learning work? He proposed a solution based on an analogy between prediction and compression. Suppose that we have two datasets, and that we concatenate and jointly compress them—what would an optimal compressor do? It would use patterns in one to help compress the other and vice versa, and this would mean that the joint compression was more efficient than the compression of either alone.
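A crude sketch of this effect, using nothing beyond Python's standard library: here the "shared pattern" is simply a block of bytes common to both datasets, and lzma stands in for the optimal compressor.

```python
import lzma
import os

def c(data: bytes) -> int:
    """Compressed size in bytes, with lzma standing in for an optimal compressor."""
    return len(lzma.compress(data))

shared = os.urandom(50_000)              # a "pattern" common to both datasets
a = shared + os.urandom(50_000)          # dataset A: the shared block plus its own content
b = shared + os.urandom(50_000)          # dataset B: the same shared block, different content
unrelated = os.urandom(100_000)          # a dataset with nothing in common with A

print(c(a) + c(b), c(a + b))                  # joint compression exploits the shared block
print(c(a) + c(unrelated), c(a + unrelated))  # no comparable gain when nothing is shared
```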
Here he introduces the notion of Kolmogorov complexity. The name refers to Andrey Kolmogorov, the Soviet mathematician, and the concept is this: for a given object (e.g., a string), the Kolmogorov complexity is the length of the shortest program that would reproduce this object in full. This is a measure of complexity in the sense that a very simple object, for instance, can be reproduced by a very short program.
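A trivial illustration of the idea (a toy of my own, not from the lecture): a patterned string a million characters long is fully specified by a program of about twenty characters, so its Kolmogorov complexity is tiny relative to its length.

```python
# A million-character string with an obvious pattern...
s = "ab" * 500_000

# ...is reproduced in full by a program far shorter than the string itself,
# so its Kolmogorov complexity is at most a couple of dozen characters.
program = 'print("ab" * 500_000)'

print(len(s), len(program))  # 1000000 21
```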
Cilibrasi gives an example: 1415926535897932384626433, and so on—what is the shortest possible program to reproduce this in full? If we do not recognise the nature of this sequence, then we may think it impossible to reproduce it with a short program. If we recognise, however, that this is a slice of π, then we could quite easily write a simple program to reproduce precisely the same string.
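One way to see how short such a program can be—a sketch using Gibbons' unbounded spigot algorithm, which streams the decimal digits of π using only integer arithmetic:

```python
def pi_digits():
    """Yield the decimal digits of pi one by one (Gibbons' unbounded spigot algorithm)."""
    q, r, t, k, n, m = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            yield n
            q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
        else:
            q, r, t, k, n, m = (q * k, (2 * q + r) * m, t * m, k + 1,
                                (q * (7 * k + 2) + r * m) // (t * m), m + 2)

digits = pi_digits()
next(digits)  # skip the leading 3
print("".join(str(next(digits)) for _ in range(25)))  # 1415926535897932384626433
```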
Kolmogorov complexity can be considered a measure of randomness, in that the randomness of a string corresponds to the length of the program necessary to reproduce it. We might imagine that some strings require a program of almost equivalent length to reproduce them. The idea here is that these strings lack underlying patterns which can be leveraged to reproduce them from simpler primitives.
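The same intuition, with a standard-library compressor as a rough, computable proxy for the uncomputable ideal: a patterned string shrinks to almost nothing, while a random string of the same length barely shrinks at all.

```python
import os
import zlib

patterned = b"ab" * 500_000      # generated by one short rule
noise = os.urandom(1_000_000)    # no structure to exploit

print(len(zlib.compress(patterned)))  # a few kilobytes: the pattern is leveraged
print(len(zlib.compress(noise)))      # about a million bytes: essentially incompressible
```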
We can return here to the case with which we began, that of jointly compressing two datasets. Here we can measure the gain in efficiency that comes from concatenating and jointly compressing them, and we can understand this increase in efficiency as corresponding to similarities between the two sets.1 The idea is that there are shared primitives which can be reproduced using simpler functions, and that these can then be combined to reproduce each whole more efficiently.
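Cilibrasi and Vitányi normalise exactly this gain into a distance between objects, the normalised compression distance. A minimal sketch, reusing the same kind of byte-level toy data as above, with lzma again standing in for the ideal compressor:

```python
import lzma
import os

def c(data: bytes) -> int:
    """Approximate the Kolmogorov complexity of data by its lzma-compressed size."""
    return len(lzma.compress(data))

def ncd(x: bytes, y: bytes) -> float:
    """Normalised compression distance: near 0 for very similar objects, near 1 for unrelated ones."""
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

x = os.urandom(20_000)
y = x[:10_000] + os.urandom(10_000)  # shares half of its content with x
z = os.urandom(20_000)               # shares nothing with x

print(ncd(x, y))  # well below 1: the shared half is compressed away
print(ncd(x, z))  # close to 1: no shared structure to exploit
```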
Sutskever speaks of Kolmogorov complexity as “the ultimate compressor”—and it is this, if only analogically, in that it is not computable; yet it is a crucial notion, an important idea.2 We can, however, approximate this Kolmogorov compressor. Compression algorithms do this, but our interest here is in artificial neural networks.
He goes on to describe the neural network architecture as constituting a possibility space of compression programs, and stochastic gradient descent as a search algorithm which operates within this space.3 The training process thus seeks to approximate a Kolmogorov compressor for a given dataset, and it succeeds in this to the extent that it can leverage simpler patterns and recombine them.
At this point we can diverge from Sutskever’s line, although still tracing the rough trajectory that we have followed so far. The idea is to further investigate the implications of this way of thinking about unsupervised learning. Here our specific interest is in the difference between compression and prediction.
When it comes to compression, we are dealing with a finite object. The object is compressed, and we can consider this—even for stream-based compression algorithms, even given that ultimately all objects are at their basis composed of strings—as fundamentally concerned with the two-dimensional form of the object. The object is bounded, and it is defined by these bounds.
Prediction, in contrast, is an infinite game; and here we are dealing instead with a one-dimensional object, a string. The relevance of all we’ve spoken about above changes slightly in this case, because we don’t simply want to compress and then decompress a given object. We are concerned instead with creativity, with generalisation.4
The opening from Sutskever is an argument for why this generalisation is possible: that the Kolmogorov compressor would leverage basic patterns in the dataset, that the extent to which it learns these primitives and the principles of their recombination corresponds to its success during training.
Training is the closest thing to compression, in that it can be considered a funny form of lossy compression: we slice the end off a dataset, and we ask the model to infer the next token based on what it has seen previously. This aims merely to reconstitute the existing string, hence it is compression.
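A minimal sketch of that equivalence (my own toy, not anything from the lecture): a model that predicts the next character assigns it a probability, and the negative log of that probability is roughly the number of bits an entropy coder would spend encoding it—so the better the prediction, the shorter the code.

```python
import math
from collections import Counter, defaultdict

def fit_bigram(text: str):
    """Count character-bigram statistics: a (very) small predictive model."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return counts

def code_length_bits(text: str, counts, alphabet_size: int = 256) -> float:
    """Roughly the bits an arithmetic coder would need if driven by the model's
    predictions: -log2 p(next | previous), summed over the string (add-one smoothing)."""
    total = 0.0
    for prev, nxt in zip(text, text[1:]):
        ctx = counts[prev]
        p = (ctx[nxt] + 1) / (sum(ctx.values()) + alphabet_size)
        total += -math.log2(p)
    return total

text = "the cat sat on the mat. the dog sat on the log. " * 100
model = fit_bigram(text)
print(len(text) * 8)                         # raw size in bits
print(round(code_length_bits(text, model)))  # far fewer bits: better prediction, shorter code
```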
Inference is different: here prediction is not used for compression but rather is valued for its generalisation. The idea is that now we can feed the model partial strings that it has never before seen, and that—based on the primitives and principles of combination that it has learnt—it will continue these in a meaningful manner.
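The same toy model from the sketch above, used generatively rather than for coding (re-fit here so the snippet stands alone): sampling from its predictions extends a prefix that never appears verbatim in the training text.

```python
import random
from collections import Counter, defaultdict

text = "the cat sat on the mat. the dog sat on the log. " * 100
counts = defaultdict(Counter)
for prev, nxt in zip(text, text[1:]):
    counts[prev][nxt] += 1

def continue_from(prefix: str, length: int = 40, seed: int = 0) -> str:
    """Extend a prefix by repeatedly sampling the next character from the bigram counts."""
    rng = random.Random(seed)
    out = prefix
    for _ in range(length):
        successors = counts[out[-1]]
        chars, weights = zip(*successors.items())
        out += rng.choices(chars, weights=weights)[0]
    return out

# A prefix that never occurs verbatim in the training text:
print(continue_from("the dog sat on the m"))
```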
This differs fundamentally from compression, and it does so in the way we described above: compression is finite, whereas prediction is an infinite game.
1. The Cilibrasi dissertation is a fascinating use of this principle: “In each case, the individual compressed sizes of each object are calculated. Then, some or all possible pairs of objects are combined and compressed to yield pairwise compressed sizes. It is the tiny variations in the pairwise compressed sizes that yields the surprisingly powerful results of the following experiments. The key concept to realize is that if two files are very similar in their contents, then they will compress much better when combined together prior to compression, as compared to the sum of the size of each separately compressed file. If two files have little or nothing in common, then combining them together would not yield any benefit over compressing each file separately.” He goes on to use this method to analyse similarities in a wide variety of data types: written, musical, genetic, etc.
2. Per Legg: “The biggest problem with Kolmogorov complexity is that the value of K is not in general computable. It can only be approximated from above. The reason for this is that in general we cannot find the shortest program to compute a string x on U due to the halting problem. Intuitively, there might exist a very short program p∗ such that U(p∗) = x, however we do not know this because p∗ takes such a long time to run. Nevertheless, in theoretical applications the simplicity and theoretical power of Kolmogorov complexity often outweighs this computability problem. In practical applications Kolmogorov complexity is approximated, for example by using a compression algorithm to estimate the length of the shortest program (Cilibrasi and Vitányi, 2005).”
3. Sutskever makes an interesting point about variations in architectures, that usually the new can simply be simulated by the old. The trick is to find a shape which can fit something further, as with the Transformer allowing us to overcome the hidden state bottleneck in RNNs.
4. The idea is something like decompression, but for an object that was never compressed: within the bounds of a given model’s context window, we can expand any sequence.