Universal pre-training by iterated random computation

pre-print

A monkey behind a typewriter will produce the collected works of Shakespeare eventually. But what if we put a monkey behind a computer?

Figure: Zero-shot performance on various datasets as universal pre-training progresses (evaluated every 100 000 pre-training instances). Note that the model never sees this data during pre-training, apart from the small amount encountered in the context.

The monkey behind the typewriter needs to be lucky enough to type every character of Shakespeare's collected works correctly. The monkey behind the computer only needs to be lucky enough to type a program that generates them. Since human language contains a lot of structure, such a program will be much shorter than the collected works themselves, so the second monkey will get there faster.

This suggests that if we pass random noise through a random computation, we enrich it: we make it more likely for interesting structure to appear. Data generated this way could therefore be useful for pre-training: we sample a random computer program, pass random noise through it, and pre-train a large machine learning model on the output. We then see how well the model does on real-world data, whether by in-context learning or by finetuning.
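To make the recipe concrete, here is a minimal sketch in which a "random program" is stood in for by a randomly initialized deterministic transducer and the noise is a sequence of uniformly random bytes. The transducer, the byte alphabet, and the number of iterations are illustrative assumptions, not the specific family of random computations used in the paper.

```python
# Minimal sketch of the data-generating idea: stand in for a "random program"
# with a randomly initialized byte-level transducer, feed it random noise, and
# collect the output as synthetic pre-training data.
import numpy as np

def random_program(num_states=64, vocab=256, seed=0):
    """Sample a random deterministic transducer: a transition table and an
    output table, both filled with uniformly random entries."""
    rng = np.random.default_rng(seed)
    transition = rng.integers(0, num_states, size=(num_states, vocab))
    emission = rng.integers(0, vocab, size=(num_states, vocab))
    return transition, emission

def run_program(program, sequence):
    """Pass a sequence of bytes through the random program, producing an
    output sequence that mixes randomness with computation."""
    transition, emission = program
    state, out = 0, []
    for symbol in sequence:
        out.append(emission[state, symbol])
        state = transition[state, symbol]
    return np.array(out, dtype=np.uint8)

# Generate one synthetic pre-training sequence by iterating a few
# independently sampled random programs over an initial noise sequence.
rng = np.random.default_rng(42)
sequence = rng.integers(0, 256, size=1024)  # raw random noise
for _ in range(3):
    program = random_program(seed=int(rng.integers(1 << 30)))
    sequence = run_program(program, sequence)
```

Sequences like this can then be tokenized and fed to a standard sequence model in place of natural data.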

This may seem like a violation of the No-Free-Lunch principle: how can we pre-train before we know what the task is? Wouldn't that suggest that there is some universal way of pre-training? The answer is that this kind of pre-training isn't universal over all tasks, but it is universal over all tasks that emerge from a mixture of computation and randomness. Since this covers all data that we are likely to encounter in machine learning, the term universal is still suitable.

In this paper, we first approach this idea theoretically and show, among other things, the following:

We then test the approach experimentally by pre-training a transformer model, and show that on a range of synthetic and real-world datasets the model outperforms a random baseline through zero-shot in-context learning. Moreover, on several real-world datasets, the model beats a strong in-context Markov model.
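For reference, one simple reading of an "in-context Markov model" baseline is an order-k Markov model whose statistics are estimated only from the tokens already present in the current context. The sketch below, with Laplace smoothing and the hypothetical function name markov_in_context_probs, is such an illustrative reading rather than the exact baseline used in the experiments.

```python
# Sketch of an in-context Markov baseline: estimate next-token probabilities
# from the context alone, with Laplace smoothing. The paper's baseline may differ.
from collections import defaultdict

def markov_in_context_probs(context, vocab_size, order=1, alpha=1.0):
    """Return a next-token distribution estimated from the context alone."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(order, len(context)):
        prefix = tuple(context[i - order:i])
        counts[prefix][context[i]] += 1

    prefix = tuple(context[-order:]) if len(context) >= order else tuple(context)
    seen = counts.get(prefix, {})
    total = sum(seen.values()) + alpha * vocab_size
    return [(seen.get(t, 0) + alpha) / total for t in range(vocab_size)]

# Example: predict the next symbol after the context [0, 1, 0, 1, 0].
probs = markov_in_context_probs([0, 1, 0, 1, 0], vocab_size=2, order=1)
```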
