Bureaucratic intelligence and the zone of proximal development
> What resides in the machines is human reality, human gesture fixed and crystallised into working structures.
>
> (Gilbert Simondon, *On the Mode of Existence of Technical Objects*)
The design of an agentic apparatus is the zone of proximal development for artificial intelligence: it seeks to realise the most with extant tools rather than waiting for architecture to save us. Earlier there was great excitement about this, with AutoGPT and BabyAGI and even the idiotic ChaosGPT. These were all ultimately failures, or perhaps this disenchantment is simply that of an AI winter in miniature. These means fell short of their initial promise, and interest has since been drawn to other hopes.
Of course, the history of this game shows success to be rarely in the wheelhouse of the majority. They tend to tinker endlessly while pressure accrues elsewhere, so that eventual developments have all the appearance of earthquakes. These occur when the conditions have been accumulating all the while; the pressure builds, gives in an instant, and echoes long with aftershocks.
The present hope is that advances in architecture and training will finally carry us to the promised land. These people want to build themselves a god, presumably one they might worship; hence what is perhaps more important than whether we really build a super-intelligence is whether people believe that we have done so. We can only pray that the dominant cohort of technocrats and oligarchs ends up picking the right god.
The great hope here is the apparent importance of scaling, that we might reach the sky simply by ever greater data and compute. We might be given cause to doubt this hypothesis by the recent leaks as to GPT-4’s alleged architecture. The use of a mixture of experts here suggests a desire to maximise performance within some limits.
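For orientation, here is a minimal sketch of the top-k gating at the heart of a mixture of experts, in the spirit of Shazeer et al. (2017). The dimensions, expert count, and expert functions are all illustrative, and nothing here is claimed about GPT-4's actual design:

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Route input x to the k highest-scoring experts and mix their outputs."""
    logits = x @ gate_w                               # one gate score per expert
    top = np.argsort(logits)[-k:]                     # indices of the k best
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                          # softmax over chosen experts
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_experts = 16, 8
gate_w = rng.normal(size=(d, n_experts))
expert_ws = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [lambda x, W=W: np.tanh(x @ W) for W in expert_ws]  # toy experts

y = moe_forward(rng.normal(size=d), gate_w, experts)
```

The appeal is plain: only k experts run for any given input, so capacity grows without a proportional growth in compute per token.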
Of course, these limits may merely be in terms of compute; what are our prospects here? The fact that people are buying so many chips doesn't seem too promising. We might suppose that this will be overcome, and there we could see substantial gains in the possibilities for these systems. If the cheque of quantum computing is ever cashed, then it may well bring unprecedented advances.
The risk here is the sensitivity of these systems to the slightest perturbations in their starting conditions. This is already the case for the larger constructions of machine learning, let alone with entanglement. These systems seem more fragile the larger they are, and this may play some role in the mixture of experts architecture. What we can perhaps expect more reliably are increases in inference rate and context window length.
While increased context length seems a strong solution, there is already reason to believe that performance decreases with longer context windows.[^1] Instead I am more optimistic about inference rate. The fact that I can already run LLaMA 7B on my dirt-cheap MacBook Air is absurd. GPT-3.5-Turbo is very fast, and GPT-4 really isn't so bad; it is often faster than most people can speak, let alone comprehend.
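Some rough arithmetic behind that last claim. The speech rate and tokens-per-word figures below are commonly cited averages; the model throughput is an assumed illustrative number, not a measurement:

```python
# Comparing generation speed to speech, under stated assumptions.
SPEECH_WPM = 150        # commonly cited average conversational speech rate
TOKENS_PER_WORD = 1.3   # rough rule of thumb for GPT-style tokenisers

speech_tps = SPEECH_WPM * TOKENS_PER_WORD / 60   # about 3.3 tokens/sec
model_tps = 40                                   # assumed model throughput

print(f"speech ≈ {speech_tps:.1f} tok/s, model ≈ {model_tps} tok/s")
print(f"the model 'speaks' roughly {model_tps / speech_tps:.0f}x faster")
```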
The real gains from inference rate, however, will be in the possibilities of bureaucratic intelligence. This is the theme with which we began: rapid inference allows networks of specialised language models to act in concert as artificial cognitive systems.
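A minimal sketch of the shape such a network might take, assuming only some chat-completion API behind the hypothetical `call_model`; the roles and prompts are illustrative, not a prescribed design:

```python
# Three specialised "bureaucrats" acting in concert on one task.
ROLES = {
    "planner":  "Break the task into numbered steps. Output only the steps.",
    "worker":   "Carry out the given step. Be concrete and brief.",
    "reviewer": "Check the work for errors. Reply APPROVED or a correction.",
}

def call_model(system_prompt: str, user_prompt: str) -> str:
    """Placeholder: wire this to any chat-completion API or local model."""
    raise NotImplementedError

def bureau(task: str) -> list[str]:
    plan = call_model(ROLES["planner"], task)
    results = []
    for step in plan.splitlines():
        draft = call_model(ROLES["worker"], step)
        verdict = call_model(ROLES["reviewer"], f"Step: {step}\nWork: {draft}")
        results.append(draft if verdict.startswith("APPROVED") else verdict)
    return results
```

Each node does one narrow job, and every exchange between nodes is a string that can be logged and audited.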
There are those who simply want to scale a single model until it crosses some qualitative threshold and becomes a machine super-intelligence. The problem seems to be that most of the work in language models as artificial intelligence is done not by computing but by language itself. Much of the trick seems to be that we have managed to latch onto this miraculous phenomenon, as originally with speech and later with writing.
These prior advances were similarly fundamental to the story of humanity, and it is only right that we take the present shift to portend changes of a like order. We have endowed language itself with a generative aspect that once required human mediation: language can now give rise to further language outside the human mind. This is a step in line with that of writing, especially as enhanced by the printing press, where we saw the first instance of asexual reproduction in language: information simply replicated without variation.
The difference with language models is that they operate as nodes akin to human language users: they enter into the dialectic with us. They effect a transformation understandable only in terms of a language user; like us, they perform a black-box operation upon incoming data. This is particularly prominent with instruction-tuned models, which somehow within their machinations manage a turn of the dialectic.
The implementation of this turn is not limited to the singular form of call and response, and here we arrive at the topic of our outset: what is a bureaucratic intelligence? The bureaucratic structure is a universal of our age; many decry it, yet surely it must have a purpose. It is simply that its limitations are known while its advantages have been forgotten, and that the institutional form has an inherent tendency to decay. The same was true of cavalry, and it was not the fault of the horses.
Likewise the bureaucrat, or rather the bureaucratic form, ought not to be blamed. This is a structure which can well be applied to the pursuit of artificial intelligence, and it has been already, albeit not in so many words, by projects such as AutoGPT and BabyAGI. These all share in the aims of bureaucracy: efficiency, consistency, specialisation. The principles of their success are likewise accountability and transparency, and the combination of these aspects endows them also with scalability.
This is the zone of proximal development for artificial intelligence, and already there have been quiet efforts in this sphere. The idea is to develop artificial cognitive systems in the form of networks integrating language models and various resources: embeddings, strings, state machines, etc. The essential principle, however, is the interrelation of specialised language models, whether specialised by prompt engineering alone or by fine-tuning.
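A sketch of how the state-machine side of such a network might look, with hypothetical `call_model` and `embed_search` standing in for a model API and an embedding store; the states and transitions are illustrative:

```python
from enum import Enum, auto

class State(Enum):
    RETRIEVE = auto()   # embedding search over a document store
    DRAFT = auto()      # specialised drafting model
    CRITIQUE = auto()   # specialised critic model
    DONE = auto()

def call_model(role: str, prompt: str) -> str:
    """Placeholder for a role-specialised language model call."""
    raise NotImplementedError

def embed_search(query: str) -> str:
    """Placeholder for retrieval from an embedding index."""
    raise NotImplementedError

def run(task: str, max_rounds: int = 3) -> str:
    state, context, draft, rounds = State.RETRIEVE, "", "", 0
    while state is not State.DONE:
        if state is State.RETRIEVE:
            context = embed_search(task)
            state = State.DRAFT
        elif state is State.DRAFT:
            draft = call_model("drafter", f"{task}\n\nContext:\n{context}")
            state = State.CRITIQUE
        elif state is State.CRITIQUE:
            verdict = call_model("critic", draft)
            rounds += 1
            # approve, or loop back to drafting until the round budget runs out
            state = State.DONE if "APPROVED" in verdict or rounds >= max_rounds else State.DRAFT
    return draft
```

Because every transition is ordinary code, the system's behaviour can be traced state by state: the accountability and transparency of the bureaucratic form, here for free.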
The possibilities of this method have yet to be fully explored, but they promise more certain results than simply expecting solutions to coalesce from compute and masses of data. The answers in these machines will always be limited by their training data, but this is not where their radical newness lies. They are not merely a new mode of query, although they are that and significantly so; they are rather a revolution in language itself.
This allows language to come alive: these are the self-assembling echoes of human gesture. The task for the engineer now is to choreograph this dance, to bring order to the various characters summoned to its stage. They are to be given scripts, parts to play, all so that from this society something resembling a mind might emerge.
[^1]: These models seem to show something resembling a serial-position effect, with recency and primacy biases: they are much better at using data which falls at the beginning or end of the context window. This may be an artefact of training data, whether because this tends to be the case in human usage or because the models are fine-tuned on examples at the maximum context length. The latter would suggest a method for overcoming it: a more even spread.
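A sketch of that even-spread idea, assuming a supervised fine-tuning pipeline in which we control where the decisive passage lands; `key_fact` and `fillers` are hypothetical inputs from whatever corpus is being assembled:

```python
import random

def build_context(key_fact: str, fillers: list[str]) -> str:
    """Place the passage the target answer depends on at a uniformly random
    depth in the context, so the signal doesn't cluster at the window's ends."""
    passages = list(fillers)
    passages.insert(random.randrange(len(passages) + 1), key_fact)
    return "\n\n".join(passages)
```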