Theoretical Limitations of Autoregressive Models
Thanks to Alex Cai for the discussion that inspired this post.
The Autoregressive Type Signature
Let’s say you want to train an AI chatbot. It needs to be able to take in text (the user’s question or instructions) and output text (its response). Intuitively you would think that the model, as a function \(\phi\) from inputs to outputs, should have the following type signature:
\[\phi: \mathrm{Text} \to \mathrm{Text}\]or more precisely,
\[\phi: \Sigma^* \to \Delta(\Sigma^*)\]where \(\Sigma\) is the set of all possible ASCII characters (or in practice, “tokens”), \(*\) is the Kleene star, and \(\Delta(S) \subset \mathbb{R}^{\mid S \mid}\) denotes the space of all probability distributions over a set \(S\).
Unfortunately, it’s basically impossible to directly design an architecture of this form. Any neural-network based model needs an output space of constant dimension, which \(\Delta(\Sigma^*)\) is not (because there are an infinite number of possible strings).
To get around this, people came up with the ingenious idea of generating text autoregressively. Instead of having the model generate its entire response at once, we just train it to output the very next character of its output, given the user’s question and its output so far. This has the type signature
\[\phi: \Sigma^* \times \Sigma^* \to \Delta(\Sigma).\]For a given question \(Q \in \Sigma^*\), we can generate a response character by character. First, sample
\[x_0 \sim \phi(Q, \text{""}).\]Then, sample
\[\begin{align*} x_1 &\sim \phi(Q, x_0) \\ x_2 &\sim \phi(Q, \overline{x_0x_1}) \\ x_3 &\sim \phi(Q, \overline{x_0x_1x_2}) \\ &\quad \vdots \end{align*}\]repeating this process until the model outputs a special stop token \([\mathrm{STOP}] \in \Sigma\) that indicates it’s done with its output. Under this procedure, \(\phi\) induces a function of the form \(\Sigma^* \to \Delta(\Sigma^*)\). So why am I making such a big deal about the autoregressive type signature being different than a general text-to-text function?
The point of this post is show that there are many distributions of responses that can easily be outputted a general text-to-text algorithm, but can’t be performed by an autoregressive model in polynomial time (given some standard cryptographic assumptions like “factoring semiprimes is hard on average”).
A Note on Architecture
Modern language models are almost always transformers, but I won’t restrict my analysis to a particular choice of architecture for this post. The results here hold for any polytime algorithm with the autoregressive type signature mentioned above. In particular, I’ll allow the algorithm to be randomized and perform looping, as long as it terminates in polynomial amount of time. Both of these are in contrast with transformers, which are deterministic1 and have bounded circuit depth.
Randomized Tasks
First, I want to distinguish two types of of tasks we may give a model. Most of the time when we prompt a language model, we don’t really care about its distribution of responses, as long as the response satisfies our request with high probability. An example is “Write a proof of Fermat’s Last Theorem in the style of Shakespeare” – the model could place 100% probability on a single, high quality response for all we care. Contrast this with what I’ll call randomized tasks, in which we care about a property of the model’s response distribution more than its particular response. A simple example is “give me a random 6-digit number.” We’d be upset if the model always outputted \(392819\) for this prompt (even if we wouldn’t be able to tell from just one sample). While for the first type of task we might measure model’s performance as an expectation of a preference function over its response distribution, performance on randomized tasks are better expressed as a function of the response distribution itself.
Autoregressive models can do many randomized tasks just fine. For the previous example, all it needs to do is output a uniform distribution over \(1, 2, \dots, 9\) for the first character, then a uniform distribution over \(0, 1, \dots, 9\) for the next five characters, and then the \([\mathrm{STOP}]\) token2.
Autoregressive models can also solve more complicated randomized tasks like “Give me a random \(n\)-digit prime number” (for some large value of \(n\)). Here’s where it really helps to allow the model to perform a randomized algorithm. Given a prefix of already-outputted digits, to generate the next digit the model can randomly sample \(n\)-digit completions of the prefix until it finds a prime. It can then output a next-character probability distribution that places 100% of its mass on the next digit of that prime. It’s straightforward to see that this process induces a perfect uniform distribution over the primes, and it also runs in expected \(\mathrm{poly}(n)\) time. (See details about a deterministic solution in the Appendix.)
Factoring
Next, consider the task: “Choose two random \(n\)-digit prime numbers \(p, q\), then output the tuple \([p, q, pq]\) and nothing else.” Given the ability to do the prime generation task in the previous section, this should be a no-brainer. The model first generates appropriate distributions for the digits of \(p\), and does it again for \(q\). Then when generating the digits of \(pq\), it looks back at the values of \(p\) and \(q\) it has previously generated, multiplies them in its head, then outputs the digits of this product one at a time (always placing 100% probability on the single correct next digit). The resulting distribution of responses perfectly matches the desired one.
But now, consider this seemingly trivial variation of the task: “Choose two random \(n\)-digit prime numbers \(p, q\), then output the tuple \([pq, p, q]\) and nothing else.” All we’ve done here is changed the order of the tuple so the model is forced to generate the digits of \(pq\) before either of its factors.
Claim 1: No polytime randomized algorithm can output \([pq, p, q]\) autoregressively with the correct distribution.
The insight here is that an autoregressive algorithm has no place to “store its work” apart from the tokens it directly outputs. When generating the first digit of \(p\), the algorithm \(\phi: \Sigma^* \times \Sigma^* \leadsto_{r} \Delta(\Sigma)\) only has access to what’s written in the output stream – in this case \(pq\). It has no way of knowing the values of any intermediate variables it used to generate \(pq\) in the first place.
In fact, we can use this observation to prove the claim. Assume for the sake of contradiction that there exists such an algorithm \(\phi\). Then, we can use \(\phi\) to factor any semiprime in polynomial time (a semiprime is an integer with exactly two prime factors).3 Given a semiprime \(s\), we simply prompt the model with:
\[\phi(\text{"Choose two primes } p, q \text{ and output }[pq, p, q]\!", "\![s, ")\]and then autoregressively generate the rest of its response, which will have the form: \(a, b]\). We’re then guaranteed that \(a \cdot b = s\) because conditioning the true distribution of \([pq, p, q]\) on the first element of the tuple forces the next two integers to be the correct factors. But since factoring arbitrary semiprimes is hard, no polytime \(\phi\) is believed to exist.
This demonstrates that some output distributions which are easily generated by a general text-to-text algorithm (of the form \(A: \Sigma^* \leadsto_r \Sigma^*\)) cannot be generated autoregressively. See the Appendix for a stronger result which shows that the correct output distribution cannot even be approximated autoregressively.
Where this comes up
At this point, you might rightfully think: OK, but why would you ever ask an autoregressive language model to perform such a weird task? If for some reason you really need a random semiprime and its factors, why not just ask the model to output the tuple in the original order \([p, q, pq]\)?
This example naturally extends to a more general class of randomized tasks which involve the model needing to make and remember randomized decisions that are not visible to the user until later in the conversation. For example,
-
Games. Imagine playing Twenty Questions against an autoregressive model, where you’re asking the questions and the model’s job is to respond yes or no. Since the model is autoregressive, it isn’t able to “choose” an object at the start to refer back to. Instead, every time you ask a question, it has to look back at the game’s transcript and estimate the probability that a random object, conditioned on satisfying all previous answers, also satisfies the new question. This is hard in general – much harder than fixing an object at the start and answering questions about it. In fact, in the extreme case where you ask questions like “Is the \(i\)-th bit of its SHA-265 hash 1?”, it’s computationally intractable.
One can also imagine other game-like tasks, like generating a Sudoku puzzle or a riddle, where it is difficult to come up with the problem without first generating a solution and then erasing information. Autoregressive models will fail at these.
-
Educational use. Students, even now, use ChatGPT to help them study. For most topics this is fine (“test me on the history of 20th century Spain”), but we once again run into issues when certain questions are better generated by backtracking from a solution. For example, “I’m learning about polynomials. Can you give me a polynomial with \(n\) integer roots for me to try to factor?” It’s important that the model’s responses here be randomized because the student may want to get multiple examples of the problem using the same prompt.
-
Storytelling. It seems possible that the skill of telling a good story cannot be captured autoregressively. Consider mystery novels in which the culprit is only revealed at the end. Throughout the story, there are clues that cryptically point towards a certain suspect, but they’re hard to pick up on unless you know what you’re looking for. In particular, it’s hard to figure out the culprit from the clues, but it’s easy to generate plausible clues if you know the culprit.
It would be unheard of for an author to begin their creative process by first coming up with the crime scene and clues, and then choosing who’s most likely to have been the culprit. In the real world, causation flows in the other direction: the identity of the culprit determines which clues appear. An autoregressive model wouldn’t be able to do this.
Ways to get around this
Do these considerations present a barrier to achieving human-level AI systems? Not at all. There are many simple ways to adapt the autoregressive type signature that get around this problem.
-
Hidden tokens. The simplest option would be to allow the model to output tokens to a separate output stream that the user doesn’t see. When generating new tokens, the model would be able look at what it wrote in both the hidden stream and the public stream. This would allow an autoregressive algorithm to simulate any general text-to-text algorithm (with the hidden output stream acting as the algorithm’s internal memory). In our example, the model could first autoregressively generate \(p\) and \(q\) in the hidden stream, then easily compute and write \([pq, p, q]\) into the public stream.
-
Nondeterministic activation functions. This is similar to the previous option, but specific to the transformer architecture. If we were to allow activation functions in internal layers of a transformer to be nondeterministic, the model could make randomized choices hidden from the user. When generating future tokens, the model could attend back to these activations to “remember” these choices.
-
Random seed. At the start of a session, prepend a random string of bits to the prompt (hidden from the user). The model could then make “random” choices by applying deterministic functions to these bits. In fact, if you did this you would be able to remove all other forms of randomness from the model if you wanted – even randomness during the autoregressive sampling step.
Conclusion
While I don’t expect this observation about the limitations of autoregressive type signatures to be that important in the design of contemporary ML systems, it does provide an interesting theoretical idea to consider. There are various ways to adapt the autoregressive type signature to address these limitations, but understanding the fundamental differences between general text-to-text functions and autoregressive models remains valuable. Thank you for reading, and I encourage you to share your thoughts or point out mistakes!
Appendix
Generating primes deterministically
Say we want to solve the task “Give me a random \(n\)-digit prime number,” but with a deterministic algorithm like a transformer. We’re no longer able to make random choices (like choosing random numbers until we stumble upon a prime) to generate digits. Instead, for any prefix \(\overline{d_1\!\dots d_k}\) we have to deterministically output \(10\) probabilities that correspond to the proportion of primes starting with prefixes \(\overline{d_1\!\dots d_k0}, \overline{d_1\!\dots d_k1}, \dots, \overline{d_1\!\dots d_k9}\).
I think this can roughly be done with the following procedure. The distribution over the first character should be slightly weighted towards lower digits because primes get rarer the bigger they are (to determine how much to weight smaller digits, the model could integrate the average prime density \(\log^{-1} x\) over the appropriate ranges). The second character’s distribution would depend on what it outputted for the first character, but it can also be approximated with an integral. The model could repeat this process up until the last few characters, at which point the integral approximation wouldn’t be precise enough. For these last few digits, the model would have to manually think through all possible endings it could create, run primality tests on each of them, and then weight its distribution to be proportional to the number of primes it found in each range. Clearly, this process won’t exactly achieve the distribution we want – as far as I know, it’s impossible to compute the exact proportion of \(n\)-digit primes that start with a given prefix in polynomial time unless you assume some crazy things like \(P = \#P\). However, modulo some messy details, it seems like there exists a polynomial-time autoregressive algorithm that can output the desired distribution with some exponentially small error, as measured by total variational distance4.
Even approximating is hard
Let \(S_n\) be the set of all valid tuples \([pq, p, q]\) for \(n\)-digit primes \(p\) and \(q\). We’ve shown with Claim 1 that no polytime \(\phi\) can autoregressively produce a uniform distribution over \(S_n\). But maybe there’s a way to get the distribution approximately correct? Here we show that this is also impossible. First, we observe the following.
Claim 2: Say that a polytime randomized algorithm \(\phi\) has an autogressive response distribution with support \(\mathrm{supp}(\phi) \subset S_n\) when prompted as above. Then \(\frac{\mid \mathrm{supp}(\phi) \mid}{\mid S_n \mid} = o(1)\) (it must be vanishingly sparse).
This follows from our belief that no polytime algorithm exists that can factor even a constant fraction of semiprimes. Since \(\phi\) only ever outputs correct factorizations, any semiprime that appears in its support must be factorizable by the previous reduction with probability \(1\).
Next, we show that even if we allow \(\phi\) to sometimes output incorrect answers, the distance between the two distributions must still be large.
Claim 3: For any polytime randomized algorithm \(\phi\), the total variational distance4 between its autoregressive response distribution and \(\mathrm{Unif}(S_n)\) must be \(1 - o(1)\).
Proof. This is a trickier; feel free to skip it if you’re not interested (this is an appendix after all). Let \(N = \mid \!S_n\mid\), i.e. there are \(N\) relevant semiprimes \(s_1, \dots, s_N\). WLOG, assume that \(\phi\) only ever outputs tuples of the form \([s, a, b]\) where \(s \in \{s_1, \dots, s_N\}\) (note that we’re not requiring \(a\cdot b = s\)). Doing anything else would only increase its TVD from the desired distribution without affecting the proof, since we always condition \(\phi\) on a semiprime.
Say that when we try to get \(\phi\) to factor \(s_i\), it outputs the right factorization with probability \(p_i\). By standard cryptographic assumptions, factoring semiprimes is “hard on average”, which means picking a random semiprime and trying to factor it with \(\phi\) has a vanishing probability of working. We can write this as:
\[\tfrac{1}{N} \sum_{i=1}^N p_i < \epsilon\]However, \(\phi\) may try to cheat by outputting certain “easier” semiprimes more frequently than others. Say that \(\phi\) chooses to output \(s_i\) as the first element of its tuple with probability \(a_i\). Then the “overlap” between \(\phi\)’s autoregressive response distribution and \(\mathrm{Unif}(S_n)\) can be written as:
\[\mathrm{Overlap} = 1 - \mathrm{TVD} = \sum_{i=1}^N \min(a_ip_i, \tfrac 1N).\]We want to prove an upper bound on this overlap. How large can this expression be? Think of this in terms of the following setup: You have \(N\) types of drinks, and each drink has a certain sugar ratio \(p_i\). You want to mix the drinks together in some way, with drink \(i\) making up \(a_i\) proportion of the mixed drink (\(\sum a_i = 1\)). The sweetness of the mixed drink is the appropriate linear combination of sweetnesses of its component drinks, but with the strange restriction that the absolute sugar contribution from any given drink type is capped at \(\frac 1N\) (to mirror the expression above). How should you combine the drinks to get the sweetest mixture?
It’s clear that you should start by pouring in the sweetest drink until its absolute sugar contribution reaches \(\tfrac 1N\). Then, pour in the next sweetest drink until its contribution reaches \(\tfrac 1N\), and so on. Keep doing this until you’ve used a total volume of \(1\) (say that it requires going through \(k\) drink types before this happens – there’s a potential off-by-one error here but it doesn’t matter). If we assume \(p_1 \geq p_2 \geq \dots \geq p_N\), this corresponds to \(\tfrac{1}{p_iN}\) volume of drink \(i\), for \(i = 1, \dots, k\). The sweetness of the mixture is:
\[\sum_{i=1}^k \frac{1}{p_iN} p_i = \frac{k}{N}\]and we have the constraint that the total volume is \(1\):
\[\sum_{i=1}^k \frac{1}{p_iN} = 1.\]How large can \(k\) be? Well, given the constraint that \(\sum_{i=1}^k \frac{1}{p_i} = N\), the smallest possibility for \(\sum_{i=1}^k p_i\) is achieved when \(p_1 = \dots = p_k = k/N\) by concavity. But this means
\[\frac{k^2}{N} \leq \sum_{i=1}^k p_i \leq \sum_{i=1}^N p_i < N \epsilon\] \[k^2 < N^2 \epsilon\] \[\frac{k}{N} < \sqrt{\epsilon}\]So we’ve proven the sweetness of any mixed drink, and thus the overlap between \(\phi\)’s autoregressive response distribution and \(\mathrm{Unif}(S_n)\), must be vanishingly small. Thus, \(\mathrm{TVD} = 1 - o(1).\) \(\blacksquare\)
-
Don’t confuse the randomization that comes from the sampling procedure with internal randomness of the model. Transformers are completely deterministic functions from a sequence of tokens to a vector of output logits. The only randomness during the text generation process comes from the process of sampling from the next-token distribution at every step. Constrast this with general random algorithms, for which the output logits themselves are allowed to be randomly generated. Of course, we can convert any such randomized algorithm \(\phi: \Sigma^* \times \Sigma^* \leadsto_{r} \Delta(\Sigma)\) into a deterministic algorithm \(\gamma: \Sigma^* \times \Sigma^* \to \Delta(\Sigma)\) by averaging its output distribution over all possible sequences of internal coin flips (this is a “conversion” in the sense that their response distributions on any input, as induced by autoregressive generation, are equivalent). However, \(\gamma\) may no longer be polytime. ↩
-
In practice though, GPT-4 is really bad at this. I asked this exact prompt a bunch of times at temperature 1 and got the following responses: 253489, 572931, 523749, 935271, 754823, 543219, 523891, 523689, 537218, 247856, 527893, 537891, 523198, 523817, 523481, 254839. I’m sure we could solve this with training techniques better suited for randomized tasks, although it’s very interesting to think about how the current RL from human feedback methods would have to be modified to accomplish this. ↩
-
…as long as both factors have the same number of digits \(n = \lceil \tfrac 12 \log_{10} s \rceil\). But this can be fixed with a trivial restatement of the prompt. ↩
-
Given two ditributions \(D_1, D_2 \in \Delta(S)\), we define their total variational distance to be \(\delta_{TVD}(D_1, D_2) = \tfrac 12 \sum_{s \in S} \mid\!D_1(s) - D_2(s) \mid\). Note that this distance is always between \(0\) and \(1\). ↩ ↩2