On LLMs and the way they work
Ever since ChatGPT shook the world in late 2022, the question I, as a data scientist and linguist, have been asked most is some variant of "What do you think about LLMs? Do you think they'll still be around in five years? Are they worth investing in?" Having studied neural networks and transformers in a data science context, I did know some things about how LLMs work, but I didn't feel expert enough to form firm beliefs at the time. As we draw close to the five-year mark, I see the picture more clearly, and my opinions have changed.
Then
Back in 2023, I was in the camp of "LLMs are just a fad". After all, they show proof of neither cognition nor inherent understanding of language; "they are just next-token predictors", as more and more people have learned to say in the past few years. Language is recursive; the sequential nature of LLMs can't build sentences the way humans do, or at least the way Chomsky believes we do. The very way they work is an abomination to the linguistic models we've been carefully building for decades. At the time, my vision for a "true AI" was still a one-to-one model of human cognition:
- For an AI to be truly "talking", it must process language the way humans do, i.e., recursively parse the string and derive semantics from it.
- For an AI to be truly "thinking", it must have some structured representation of the world, and have a "thought process" that manipulates this representation. The language generator is only connected as an output.
- For an AI to be truly "perceptive", it must have a unified sensory system of the world that doesn't rely on discrete modalities.
The analogy I liked to draw was this: a computer needs a keyboard, a monitor, and a CPU to function. Suppose someone connected the keyboard directly to the monitor and found a way to show "the most likely screen" for each keystroke (think of F4 causing the screen to flash, etc.); many people would be convinced that they were interacting with a real computer, when in reality they were just watching convincing recordings of a computer screen, with no real computation going on.1 It so happens that LLMs are far more deceptive than image-generating models, because people tend to associate language with thought and intelligence (just as they associate facial features with personhood), but now we have a language-wielding being that is actually devoid of thought.
Indeed, this was more or less true as of 2022, or earlier. I had played with GPT-2 and witnessed its evolution from its primitive state, so I was hard to impress. I even conducted some "experiments" with GPT-3.5 in early 2023, pretending to know what I was doing. At that time, it still felt like a toy.
Technology has advanced since then: fine-tuning, RAG, chain of thought, multimodal attention, agentic AI, and more. Many pain points I had in 2023 no longer exist, and conversing with AI has become steadily less frustrating. But we all know that, deep down, nothing has changed about the way LLMs work. So perhaps it's time to reevaluate.
Now
As of 2026, I'm no longer convinced that there exists a single way to do cognition. Humans have some abstract notion of "thought" that's independent of language, and we pretend to know something about it, but in fact we know very little, because everything we think must be manifested externally, through language, muscle movement, or something else; brain imaging works only occasionally, with poor interpretability. How can we build a machine that thinks like us if we don't even know how we think? On the other hand, perhaps language is a viable engine for driving thought. Think about it this way: most machines operate on a "working medium" whose movement or phase change transfers energy from the power source to the output. It can be water, steam, air, oil; it doesn't matter. All that matters is that something moves to carry the energy from point A to point B. Similarly, perhaps language can serve as that working medium, on which thought is executed.
I'm not sure how true this is for humans, but I don't think it's completely off, either. It's tempting to bring up deaf and nonverbal people as examples of "people thinking without language", but remember that losing spoken language doesn't mean losing the language faculty. Aphasia patients' cognitive capabilities may be worth looking into, but it's hard to tell when they can't express themselves and can't understand complex instructions.2 For most of us, "thinking out loud" is a real phenomenon, and it isn't far from how LLMs work as a cognitive process. As another example: I know people who can never answer questions like "does K come before N in the alphabet?" on the spot; they always have to chant the alphabet in their head to figure it out. This is akin to LLMs driving their thought by generating tokens: the tokens by themselves don't mean much, but they carry the thought from point A to point B, until we arrive at a plausible answer.
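The alphabet-chanting strategy can be sketched in a few lines of code. This is an illustrative toy, not a model of an LLM: each emitted letter plays the role of a generated token, individually uninteresting but collectively carrying the reasoning from question to answer.

```python
# Toy sketch of "chanting the alphabet" to answer "does K come before N?".
# Each letter visited is like a generated token: the answer emerges only
# from walking the sequence, not from any stored comparison table.

import string

def comes_before(a: str, b: str) -> bool:
    """Chant A, B, C, ... and report which of `a`, `b` is reached first."""
    for letter in string.ascii_uppercase:  # emit one "token" at a time
        if letter == a:
            return True   # reached `a` first, so `a` comes before `b`
        if letter == b:
            return False  # reached `b` first, so `b` comes before `a`
    raise ValueError("expected uppercase letters A-Z")

print(comes_before("K", "N"))  # → True
```

The point of the sketch is that the procedure never "knows" the answer until the chant has carried it there, much like a chain-of-thought trace.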
Of course, there's also the question of whether LLMs actually know how to speak. On this point, I'm more certain: yes, they do. There's no recursive data structure being built, but the very way the attention mechanism works is reminiscent of how we process language. We don't keep a single "stack" of words that we build up and tear down as we parse sentences; instead, we track dependencies between words, most of them closed instantly (such as "closed" and "instantly") with no need to be memorized, while only a few stay open for a long time (such as "stack of words" and "tear down"). In this sense, attention also enables effect at a distance, a hallmark of human language. If an AI learns that "an" can only be followed by a word starting with a vowel, that "he" must be followed by a verb ending in -s, and that a "what" must be paired with a gap somewhere later in the sentence, then it's hard to claim that it hasn't learned the language, even if we can't point to a specific data structure in its "brain" that corresponds to that process.
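The effect-at-a-distance property can be demonstrated with a minimal sketch of scaled dot-product attention. The vectors below are hand-made toys (a "what" query and a matching "gap" key are my invented stand-ins, not anything from a trained model); the point is only that attention scores depend on content, not on how far apart the tokens are, so the dependency survives any number of intervening words.

```python
# Minimal scaled dot-product attention with hand-made toy vectors,
# illustrating effect at a distance: the query-key score depends only
# on content, so the "gap" wins the attention no matter how far away it is.

import numpy as np

def attention_weights(Q, K):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # content-based similarity
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)      # softmax over keys

wh_query = np.array([[1.0, 0.0, 0.0, 0.0]])  # "what", looking for its gap
gap_key  = np.array([[1.0, 0.0, 0.0, 0.0]])  # the gap (high dot product)
filler   = np.array([[0.0, 1.0, 0.0, 0.0]])  # unrelated word (dot product 0)

short = np.vstack([gap_key, filler])                        # gap is adjacent
long  = np.vstack([np.repeat(filler, 10, axis=0), gap_key])  # gap is 10 words away

w_short = attention_weights(wh_query, short)[0]
w_long  = attention_weights(wh_query, long)[0]
# The gap receives the single largest weight in both cases.
print(w_short[0] > w_short[1], w_long[-1] > w_long[:-1].max())  # → True True
```

A stack- or window-based parser would pay a growing cost for the longer dependency; here the score between "what" and its gap is identical in both sentences, which is the sense in which attention keeps a dependency "open" for free.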
Future
Suppose that LLMs are indeed the way to go. MCP already provides the capability to interface with arbitrary tools, similar to motor control. Multimodal attention already provides the capability to process varied sensory input: visual, textual, and (to a limited extent) auditory. The whole agentic architecture is apt for coordinating complex tasks. But for LLMs to truly be "thinking", I think we need at least two more things: internally, a sense of "truth"/"doctrine"/"belief" (whatever you'd like to call it) to fix hallucinations, and externally, a sense of "causality" to understand the evolution of states. These are grand problems to solve, but I think we are solidly moving in that direction.
I do know that there are people actively researching the cognitive science of LLMs—for example, Paul Smolensky, from whom I heard a colloquium talk on whether AI falsifies generative linguistics. I won't pretend to know more about this than the average person, because my linguistics training did not include much of either cognitive science or LLMs. I also know that many AI researchers continue to be unconvinced by LLMs, most famously Yann LeCun, who pursues things like "world models". Of course, that would be the ultimate dream for all of us, but until we get there, I've at least persuaded myself that LLMs have a promising future ahead.
Footnotes
1. By the way, in late 2024, someone actually implemented this idea: it was Oasis AI. ↩
2. Case in point: I've read reports of Broca's aphasia patients being able to understand the thematic relations in "The dog chases the cat", but not the same relations in "The cat is chased by the dog". This may suggest degraded syntactic processing, but also degraded cognitive processing in general. ↩