What AI Fails to Understand – For Now

by Ali Minai

Most people see understanding as a fundamental characteristic of intelligence. One of the main critiques directed at AI is that, well, computers may be able to “calculate” and “compute”, but they don’t really “understand”. What, then, is understanding? And is this critique of AI justified?

Broadly speaking, there are two overlapping approaches that account for most of the work done in the field of AI since its inception in the 1950s – though, as I will argue, it is a third approach that is likelier to succeed. The first of the popular approaches may be termed algorithmic, where the focus is on procedure. This is grounded in the very formal and computational notion that the solution to every problem – even the very complicated problems solved by intelligence – requires a procedure, and that if this procedure can be found, the problem can be solved. Given the algorithmic nature of computation, this view suggests that computers should be able to replicate intelligence.

Early work on AI was dominated by this approach. It also had a further commitment to the use of symbols as the carriers of information – presumably inspired by mathematics and language. This symbolic-algorithmic vision of AI produced a lot of work but limited success. In the 1990s, a very different approach came to the fore – though it had existed since the very beginning. This can be termed the pattern-recognition view, and it was fundamentally more empirical than the algorithmic approach. It was made possible by the development of methods that could lead a rather generally defined system to learn useful things from data, coming to recognize patterns and using this ability to accomplish intelligent tasks. The quintessential models for this are neural networks – distributed computational systems inspired by the brain.

Over the last three decades, neural networks have become exceptionally good at many tasks that the algorithmic approach had failed on, and gone on to even bigger things such as beating grandmasters at chess and Go. However, neural networks have run into their own limitations. It can be argued that not every aspect of intelligence can be reduced to pattern recognition. To be sure, neural networks can do other things too, such as organizing data into useful representations, storing and recalling complex memories from partial cues, finding structure in complex data, and so on – all important aspects of intelligence. However, every task requires its own kind of network, trained on its own data, and in its own way. Ultimately, each neural network too is just an algorithm, albeit one that is not pre-designed by a clever human, and of a form that can be adapted flexibly to multiple tasks and implemented in distributed hardware. These are big improvements, but can they lead to the kind of general intelligence that one expects even from a rat or a lizard, let alone to human intelligence? The difficulty of doing so can be seen in the fact that, while neural networks – and their latest form, deep learning – are being applied widely to information-driven tasks such as image processing, recommendation, text analysis, etc., they are meeting with considerably less success in more performative ones such as building intelligent robots. This is largely because behavior is not a specialized task, but a very general one. A robot that can only do a few things is not much more intelligent than a backhoe. Intelligent behavior requires engaging with the complexity of the world “in the wild”, not in controlled situations or on specific tasks. There is no deep learning system for that – so far.

But why not? And can there never be such a system?

Insight into this can be gained by looking at how AI is tackling what is arguably its most challenging area: Language. After decades of evolution, the AI of language – usually called natural language processing or NLP – has branched into two approaches mirroring the algorithmic-symbolic vs. pattern recognition divide of AI in general. The first approach looks at language in terms of symbol manipulation, leading it to focus on grammar, syntax, prosody, etc., as fundamental to turning meaningful words into meaningful expressions. This is an extremely laborious task, requiring a great deal of knowledge and organization of information, not all of which is easily automatable. Dictionaries must be constructed, ontologies built, and texts parsed to make sense of them. As a result, systems that adhere to this approach do not scale easily, and are typically limited to specific tasks. The other approach is data-driven and based on the assumption that a large, flexible learning system such as a powerful neural network, applied to vast amounts of natural language data, will infer meanings, learn analogies, translate texts, and thus understand language without explicit design: Dictionaries and ontologies will emerge; syntax will be accounted for implicitly; grammar will be sublimated after serving its purpose without ever being dealt with explicitly.

To some extent, this has been uncannily successful, as exemplified in methods such as word2vec and fastText – and most spectacularly in the language model GPT-2, which can generate extremely complex, comprehensible, and grammatically correct texts on arbitrary topics after being primed with a brief snippet. The approach has shown that, in many cases, linguistic tasks that might have been thought to require sophisticated analysis can indeed be accomplished in a rather simple – even mindlessly automated – way. However, when one compares even the most sophisticated deep learning methods to human performance, the results are dismal. Nowhere is this seen better than in machine translation, where much-lauded systems such as Google Translate or the translator built into Facebook fail miserably and comically when faced with translating languages outside a narrow set. Douglas Hofstadter and others have suggested that this is precisely the result of a failure to operate from understanding, which, they say, is very different from just recognizing patterns. And this brings us back to the original “computers don’t understand” critique. If they could, they would surely be able to behave intelligently, engage in complex conversation, solve difficult problems, and be creative. In other words, they would demonstrate artificial general intelligence (AGI), not just the narrow types of AI seen in today’s systems. So what is understanding, and why is it important?
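As a concrete illustration of the “priming” just described, here is a minimal sketch, assuming the Hugging Face transformers library and its publicly hosted “gpt2” model weights – neither of which is mentioned in the essay itself:

```python
# A minimal sketch of priming GPT-2 with a brief snippet, assuming the
# Hugging Face transformers library and its hosted "gpt2" model weights
# (both are assumptions of this sketch, not part of the essay).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "The relationship between language and understanding is"
result = generator(prompt, max_length=60, num_return_sequences=1)
print(result[0]["generated_text"])
# The model continues the prompt with fluent, grammatical text, without any
# explicit grammar, dictionary, or ontology having been built into it.
```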

Two related points need to be made before looking at what understanding might be. First, the critics who dismiss every performance of a complex task by a machine as “mere calculation” need to specify how exactly one can know that something is not calculation. What can one ask of a putatively intelligent system beyond that it generate intelligent responses in a general sense, i.e., not just on narrowly specific tasks? To ask for anything more is just a kind of dualism – a demand for a disembodied mind, an immaterial soul. Second, one reason why it is difficult to extend the presumption of understanding to machines is a primordial notion of agency embedded in the human psyche, which assigns agency to those systems – mainly humans and animals, but not necessarily all of these – that are considered to be “like us” – i.e., of the same category – and denies it to the rest. Both these issues are exactly what the Turing Test seeks to address.

But even with these qualifications, there is clearly something missing from the current AI systems. One possibility is that we are just early in the process, and AGI will emerge from the current way of doing AI at some point – that we have the principles right. I expressed this optimistic opinion some time ago, but it is becoming clearer that fundamental changes are needed that move the AI enterprise towards a less reductionistic, less goal-driven paradigm, and give greater space to the emergent, creative aspects of intelligence that are fundamental to understanding.

The good news for a pattern-recognition based approach to intelligence is that, in fact, understanding is a form of recognition*. We understand something when we find a resonance for it in our own mind; when the relationships involved click into place by fitting some template that already exists within our cognitive repertoire. The feeling of understanding is the internal perception of this resonance. But there is a very important difference between the recognition involved in understanding and the pattern recognition performed by most neural networks. The latter respond to recognizing a pattern declaratively, saying “that is a car” or “there is a dog in the lower left corner of the image,” or by classifying a review as good or bad. This, clearly, is a necessary component of understanding, but not a sufficient one. The pattern recognition involved in understanding is internal to the animal, and goes beyond the declarative: It is fundamentally generative.

When some piece of information clicks into a mental template, it immediately makes possible the generation of many new things: Ideas, analogies, intuitions, insights, expressions, etc. This is because the web connecting such things already exists within the mind, for which an understood idea acts as a stimulus. Integrated intelligent behavior is possible because, as information flows continuously into the brain-body system, it is processed within this unitary but multi-level web of perception, memory, and behavior that has been put in place by genetics, development, and previous experience. Intelligence arises only in the context of this web. If we want to produce general intelligence, we have to focus less on teaching machines to do very specialized tasks, and concentrate instead on understanding how the web of the mind comes to be, and how it is instantiated in the physiology of the brain and the body. Ultimately, perception, cognition, and behavior are all memory – a deep, complex, layered memory from which the environment continuously generates patterns. Some of these patterns are perceptions; others are thoughts or ideas; and some are behaviors. A thought is a pattern of activity across the cells of the brain; an action is a pattern of activity across the muscles of the body. Both are intimately connected and in continuous coordination.

Ultimately, the data-driven, machine-learning-based view of intelligence is not very different from the earlier algorithmic-symbolic view. Both concentrate on procedure, albeit at different levels and in different ways. The algorithmic view makes the procedure explicit. The neural network approach embodies it in the architecture of the network and the learning procedure – algorithm – it uses to acquire its intelligence. Both visions are fundamentally reductionistic, seeking to solve the problem of intelligence piecemeal, in the hope of eventually putting it all together. This did not work with the algorithmic-symbolic approach, and it likely will not with neural networks – especially since researchers in the field have every incentive to focus on finding useful and lucrative solutions to particular tasks, and almost none to address the larger problem. Even those who try to look at the big picture end up focusing on issues such as mental architectures or cognitive maps – general and important, to be sure, but still far from the level of integration needed. Most importantly, too much (though not all) of today’s data-driven AI uses supervised learning, i.e., learning in a goal-driven manner from data labeled with correct answers. Almost none of the learning done by animals is supervised. Rather, it is based on the perception of regularities in the world (unsupervised learning), or the experience of consequences (reinforcement learning). Such learning is more uncertain, but also potentially more creative.
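To make the distinction between these learning modes concrete, here is a minimal sketch, assuming NumPy and scikit-learn (both assumptions of the sketch, not part of the essay); reinforcement learning is only described in a comment, since it requires an environment to interact with:

```python
# A minimal sketch of the learning modes contrasted above, assuming
# NumPy and scikit-learn (both are assumptions of this sketch).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))             # raw observations
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # labels: the "correct answers"

# Supervised learning: fit a mapping from inputs to the given labels.
classifier = LogisticRegression().fit(X, y)

# Unsupervised learning: find regularities in the data with no labels at all.
cluster_ids = KMeans(n_clusters=2, n_init=10).fit_predict(X)

# Reinforcement learning (not implemented here) would instead interact with
# an environment and learn from the consequences of its actions -- a reward
# signal -- rather than from labeled examples.
```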

Notably, the most successful AI models today, such as GPT-2 and AlphaZero, make substantial use of these learning modes, for which well-developed methods exist in the field of neural networks. This is one way in which the neural approach has a decisive advantage over the earlier algorithmic methods. Another one is that, however imperfectly, it is grounded in the principles of biology and biophysics. As such, there is the possibility that more sophisticated and biologically informed versions of neural networks could, in principle, get closer to general intelligence. They will do so, not by running as abstract programs on disembodied GPU clusters, but by being embedded in behaving bodies that can experience the world directly and build minds for themselves. And, as we know from biology, minds are not built from scratch: Much of the framework for this is pre-configured by evolution into the genetic code, and instantiated physically by development. The biases – the priors – configured into the body by these processes serve as the scaffolding on which experience builds the mind through continuous learning. Somehow, from the arrangements of sensory receptors, neurons, muscles, tendons and joints, there emerges a behaving animal with intelligence. If we want to build artificial intelligence, we will need to build bodies with minds, not just neural networks that can learn to play Go.

The limitations of current AI methods show up in many ways, but one simple example will suffice to illustrate the main issues. One of the most successful recent advances in automated processing of language has been the development of word embedding methods such as word2vec, which allow words in any language to be represented as vectors in a semantic space that can be inferred automatically by a neural network using a large amount of text (e.g., the entire content of Wikipedia). This technique has also been extended to phrases, sentences, documents, and even to networks. One astonishing – and purely emergent – capability of the representations produced by word2vec is the automatic production of analogies. Given the pair of words “man : woman” and queried with the word “king?”, the system returns “queen” as the likeliest response. It also works for “France : Paris”, with “Italy?” yielding “Rome”. Very impressive! And most reports make this point before moving on (read this for a nice overview).
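Here is a minimal sketch of such an analogy query, assuming the gensim library and its downloadable pretrained “word2vec-google-news-300” vectors – both assumptions of the sketch, not something used in the text; any sufficiently large word2vec model would serve:

```python
# A minimal sketch of the analogy queries described above, assuming the
# gensim library and its downloadable pretrained word2vec vectors
# ("word2vec-google-news-300"); both are assumptions, not part of the text.
import gensim.downloader as api

# Load pretrained word2vec vectors (a large download on first use).
wv = api.load("word2vec-google-news-300")

# "man : woman :: king : ?" -- add the vectors for "woman" and "king",
# subtract the vector for "man", and return the nearest words.
print(wv.most_similar(positive=["woman", "king"], negative=["man"], topn=3))
# Expected to rank "queen" first, as described above.

# "France : Paris :: Italy : ?"
print(wv.most_similar(positive=["Paris", "Italy"], negative=["France"], topn=3))
# Expected to rank "Rome" first.
```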

However, it is instructive to look at the results reported by one user of the method in a recent blog post. He trained the word2vec model on a very large body of hotel reviews – clearly a biased source – and then queried it for various analogies. Sure enough, analogizing “man : woman” with “king?” produced “queen” as the likeliest match, but the second likeliest match was “kingsize”, followed by “double”, “twin”, and “queensize”! It is obvious what has (hilariously) happened here (or is it?). The poor network, with its knowledge limited to the pros and cons of hotels, associates the words “king” and “queen” with mattress sizes, not with humans. But then it is interesting that the “man : woman” template did return “king : queen”. A complete explanation of this would be hard to find – it is unlikely that the hotel reviews included anything about real kings and queens or even men and women! – but perhaps the system inferred something about size. In any case, the opacity of the system is not the issue – humans too are often at a loss to explain why they think something. The issue is the kind of error being made, and why.
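For readers who want to see how such a domain-biased experiment is set up, the following is a hedged sketch: the corpus file name and parameter values are hypothetical, and the gensim library is again an assumption rather than what the blogger necessarily used.

```python
# A hedged sketch of training word2vec on a domain-biased corpus (hotel
# reviews) and repeating the analogy query. The file "hotel_reviews.txt"
# and the parameter values are hypothetical; gensim itself is an assumption.
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# One review per line, tokenized into lowercase words.
with open("hotel_reviews.txt", encoding="utf-8") as f:
    sentences = [simple_preprocess(line) for line in f]

# Train a small word2vec model on the reviews alone.
# (In gensim versions before 4.0, the vector_size argument was called size.)
model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)

# "man : woman :: king : ?" -- with knowledge limited to hotels, the nearest
# neighbours mix "queen" with mattress sizes such as "kingsize" and "double".
print(model.wv.most_similar(positive=["woman", "king"], negative=["man"], topn=5))
```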

The limitations of the system become even clearer when “human : house”, queried with “bird?”, produces “charlie”, “brandt”, “horse”, and “frankies”! And “grass : green”, queried with “sky?”, leads to “woodside”, “ravenscourt”, “beihei”, and “belsize”. Clearly, something is amiss, and it is not hard to guess what it is. The system is being asked to do the impossible in two ways. First, it is being asked to use knowledge about a single domain – hotels – to answer general questions. But a human would never be in that position except in very peculiar circumstances. The problem humans face is usually the opposite: Being asked to say something on a specialized topic based on generic knowledge – as is on painful display at most US Congressional hearings! This is because the human experience is automatically that of the world at large. The “general” in “artificial general intelligence” comes – in part – from the generality of experience that any human (or other animal) undergoes. Only computer programs and animals in captivity are limited to narrow experiences. But a second – related – point is even more important: The knowledge in the human mind is connected and grounded across many modalities of experience.

Even to a person who has never met a king, the word “king” conjures up feelings, thoughts, images, stories, legends, visions of pomp and grandeur, of war and peace, and many other things based on their own subjective experience. Crucially, a person thinking of a “king” automatically identifies him as a fellow human, which links him immediately to the person’s own perceptions, needs, emotions, and desires. Their understanding of “queen” is in the context of all this, and is thus a far richer informational object than that which emerges in the computer – even in one that has been trained on all of Wikipedia rather than on hotel reviews. Of course, the word “king” also activates many things – called features – in the computer, and it would be incorrect to say that these are different in principle from those activated in the human mind, but there is a difference in the source of the features.

The features in word2vec come from the analysis of word statistics; those in the human brain are laid down by genetics, development, and experience of the physical world. In other words, the latter are grounded in experience – not only the experience of the individual, but that of countless generations distilled into the genome by evolution – rather than in relationships between words alone. The hope is that human language, in all its complexity, has captured enough of the world’s complex reality that one can infer most of human experience purely from the patterns of language – or from images, or audio, or video data. This hope is not as vain as it might appear, and is the inspiration behind some of the most ambitious computational projects in history, such as Google’s Knowledge Graph. Even more ambitiously, GPT-2 embodies a generative approach to learning language that actually begins to resemble the processes of the mind, but still learns all its knowledge from text rather than experience. It can thus generate plausible texts – e.g., fake news – and engage in natural conversations. Most importantly, it can make inferences and statements that go beyond what it was taught explicitly, thus evincing a degree of thought and creativity. Will this approach bootstrap its way to truly general intelligence, or remain confined to superficial tasks?

At some point, it may become impossible to answer this question, since general plausibility of response is indistinguishable from understanding. However, the likelier scenario is that, as long as systems such as GPT-2 remain confined to learning only from text rather than from actual experience, they will not become sufficiently grounded to achieve the elusive goal of AGI. A system such as GPT-2 but using much more than just text, implemented in a complex robot with multiple modes of sensing and the capability to act, would be another matter – and one that requires very careful thinking.

In his masterpiece, “The Lady of Shalott”, Alfred Tennyson describes a character from the Arthurian legend who is cursed to remain in a tower, looking at the world only through a mirror, and weaving the “mirror’s magic sights” into her web. AI today is, it seems, in its Lady of Shalott stage, trying to weave four-dimensional reality into a two-dimensional web by looking into the dim, distorting, and often deceptive mirror of data. Eventually, even the Lady of Shalott had to exclaim, “I am half-sick of shadows,” but it was only when Sir Lancelot showed up in her mirror that she finally decided to look upon the world rather than its reflection. This encounter with reality did not end well for the Lady of Shalott, and perhaps it will not for the builders of AI, since creating an autonomous embodied intelligence with open-ended learning capabilities is ultimately the path to Skynet and Ultron. That seems far off for now, but you never know…

* * *

* This comes close to the assertion of Recognition Theory that “all cognition is recognition” – a phrase used by Jung in the context of unconscious knowledge, but much discussed by epistemologists and psychologists with regard to cognition in general. However, here the reference is only to the feeling of understanding, which is certainly not all of cognition.