In Generative AI, the capabilities of LLMs come close to human behavior in many cases. LLMs can do creative writing and respond to messages with a human-like tone. The general hype around AGI (reasoning) and the anthropomorphized design further set expectations, leading us to believe that the AI's performance is human-like. These human-like capabilities of LLMs invoke positive feelings toward them; however, the very nature of LLMs also undermines this by introducing inconsistencies. It may seem like an LLM is reasoning, but it is just a probabilistic machine, a stochastic parrot. LLMs can be good at something specific in a complex domain (making it believable that they reason) yet fail badly on simple tasks, such as counting the number of 'r's in 'strawberry'.
Broken Metaphors:
GenAI is not deterministic software in the traditional sense, so it doesn't fit the mental framework most people have for software. People must adopt new frameworks to understand how AI works. The most common generalization is that LLMs learn from interaction by default. However, that is not the default behavior of foundation models, unless teams build a scaffold around the LLM to manage it, e.g., using RAG or knowledge graphs.
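To make that concrete, here is a minimal sketch of such a scaffold in Python. The apparent "learning" lives in an external store that is retrieved from and pasted back into each prompt; the model's weights never change. The call_llm function is a hypothetical stand-in for whatever completion API a team actually uses, and the keyword-overlap retrieval is a crude placeholder for embeddings or a knowledge graph.

```python
memory: list[str] = []   # past exchanges, kept outside the model

def retrieve(question: str, k: int = 3) -> list[str]:
    # Crude keyword-overlap retrieval; real systems would use embeddings or a graph.
    words = set(question.lower().split())
    ranked = sorted(memory, key=lambda m: -len(words & set(m.lower().split())))
    return ranked[:k]

def answer(question: str, call_llm) -> str:
    context = "\n".join(retrieve(question))
    prompt = f"Relevant history:\n{context}\n\nQuestion: {question}"
    reply = call_llm(prompt)                     # the model only ever sees what we inject
    memory.append(f"Q: {question}\nA: {reply}")  # the "memory" lives here, not in the weights
    return reply
```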
Most models are wrong, but some are useful. However, even useful models break down, because non-deterministic behavior is hard to generalize. The metaphor "AI is not good software. It is pretty good people" gives us a good grasp and helps users cope with an LLM's probabilistic behavior. However, the same metaphor can also lead to KIA: the uncanny valley creeps in again.
Useful Framing of LLM Capabilities:
The "AI Stone soup" metaphor is based on the old fable, where a villager declares making a delicious stone soup – a parallel that LLM algorithms are akin to stones. Then they add other ingredients like vegetables, and spices to make the soup tastier. In the case of LLM, the additional ingredients are the source data, fine-tuning data, or even humans-in-the-loop to course correct when training a model. Where does the real flavor of the soup come from? From the stones or other ingredients or cooking process?
Adopting such metaphors to think about LLM capabilities helps us avoid falling for the hype around AGI and other false narratives. Teams that lack a realistic framework often fail to productionize LLM-based systems, because they overestimate the capabilities of the LLM and fail to build the guardrails needed for real-world use.
Improving Human-AI Coordination:
To achieve high levels of coordination between people and machines, cooperating automation must be both observable and directable. As we improve these aspects, we can reduce automation surprises.
Observability is the degree to which the user can understand how the machine works; it is the ability to predict what the machine will do. Directability is the ability to control the machine, so that the human can direct it to a specific action or course-correct early, based on that prediction of its behavior. Both aspects are a challenge for Generative AI and LLMs.
As probabilistic models, LLMs function as black boxes, making their inner workings difficult to understand. The capabilities of LLMs are also not cohesive, so generalizing about them fails as well. We can improve observability by designing LLM-based systems to surface the steps and assumptions an LLM agent makes as it navigates the data to arrive at an answer. With its o1 model, OpenAI designed the interface to show the reasoning steps as the model thinks. Even a simple listing of the sources used to derive an answer helps humans review and cross-validate the outputs.
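As a sketch of what that could look like in practice, one can ask the model for a structured answer that separates the claim from its cited sources and stated assumptions, and then render those parts for a reviewer. The prompt wording and JSON keys below are illustrative assumptions, not a standard.

```python
import json

OBSERVABLE_PROMPT = """Answer the question using only the documents below.
Return JSON with keys: "answer", "sources" (ids of documents you relied on),
and "assumptions" (anything you inferred that the documents do not state).

Documents:
{documents}

Question: {question}
"""

def review_view(raw_model_output: str) -> str:
    # Render the structured output so a reviewer can cross-check it quickly.
    parsed = json.loads(raw_model_output)
    lines = [
        "Answer: " + parsed.get("answer", ""),
        "Cited sources: " + (", ".join(parsed.get("sources", [])) or "none"),
        "Stated assumptions:",
    ]
    lines += ["  - " + a for a in parsed.get("assumptions", []) or ["(none declared)"]]
    return "\n".join(lines)
```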
Prompts and context length are some of the levers that improve the directability of LLMs. A prompt that works for a given problem may need continual tweaking to remain performant. A larger context length does not make a model proportionally better, either; sometimes larger contexts make the LLM's behavior harder to debug.
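One small, concrete way to keep a prompt directable is to cap the injected context to an inspectable budget instead of filling the whole window. This is a sketch under two stated assumptions: the snippets arrive pre-ranked by relevance, and the word count is a deliberately crude stand-in for real token counting.

```python
def build_prompt(instruction: str, snippets: list[str], budget_tokens: int = 800) -> str:
    # Keep only as much context as fits a small budget; a shorter, curated prompt
    # is easier to reason about and debug than a fully stuffed context window.
    used, kept = 0, []
    for snippet in snippets:              # assumed pre-ranked, most relevant first
        cost = len(snippet.split())       # crude stand-in for real token counting
        if used + cost > budget_tokens:
            break
        kept.append(snippet)
        used += cost
    return instruction + "\n\nContext:\n" + "\n---\n".join(kept)
```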
Building Reliability on Unreliable Foundations:
Reviewing every output of an LLM is not feasible in terms of effort. On the flip side, if we automate a large chunk of tasks, it becomes impossible to verify whether the outputs are correct. The assumption that AI can completely replace a task as done by humans isn't proven. Humans are the adaptive part of the system, able to course-correct toward success even when the system underneath is unreliable. We have to keep humans in the loop of the automation; the question is at what level, and how much, for it to be effective.
People are building approaches to cope: providing prompt templates that the LLM can follow for a task and, to a limited extent, instructing the LLM to cross-check its own work. We also see patterns like using another LLM as a judge to validate an LLM's outputs. RAG is a pattern in which additional constructs, such as retrieved source material, are used to reduce an LLM's hallucinations. Evals are another technique for validating models in production, asserting that they continue to work as closely as possible to how they were designed. Agent architectures build a network of agents that work from each other's outputs to achieve a specific outcome.
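As an illustration of the LLM-as-a-judge pattern mentioned above, the sketch below makes a second model call that scores the first answer for faithfulness and routes low scores to a human instead of shipping them silently. The rubric, the threshold, and the call_llm function are assumptions, not a fixed recipe.

```python
JUDGE_PROMPT = """You are reviewing another model's answer.
Question: {question}
Source material: {sources}
Candidate answer: {answer}

Score the answer from 1 to 5 for faithfulness to the source material only.
Reply with a single integer followed by a one-sentence justification."""

def generate_with_check(question: str, sources: str, call_llm, threshold: int = 4) -> dict:
    answer = call_llm(f"Using only this material:\n{sources}\n\nAnswer: {question}")
    verdict = call_llm(JUDGE_PROMPT.format(question=question, sources=sources, answer=answer))
    first = verdict.split()[0] if verdict.split() else ""
    score = int(first) if first.isdigit() else 0   # unparseable verdicts count as failures
    # Low scores are flagged for human review rather than silently shipped.
    return {"answer": answer, "verdict": verdict, "needs_human_review": score < threshold}
```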
The current AI landscape covers many of these techniques and tools. In a way, it is analogous to building a reliable network layer on top of unreliable data transmission by adopting error-correction techniques, or to building reliable cloud services on top of failing hardware through redundancy and scaling, as we have done in the past. Similarly, it is viable to build useful systems with some degree of reliability on top of probabilistic models, but not in the same sense as deterministic systems, and certainly not without addressing the biases they inherit from the data.
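To make the analogy concrete, the error-correction idea can be as simple as wrapping the probabilistic call in a deterministic acceptance check with bounded retries, escalating to a human when no attempt passes. Here call_llm and validate are placeholders for a real completion API and a task-specific check, such as schema validation, unit tests, or a judge model; this is a sketch, not a complete reliability strategy.

```python
def reliable_call(prompt: str, call_llm, validate, max_attempts: int = 3):
    # Wrap a probabilistic step in a deterministic acceptance check, bounded by retries.
    last = None
    for _ in range(max_attempts):
        last = call_llm(prompt)
        if validate(last):
            return last
    # No attempt passed validation: hand off to a human rather than guessing.
    raise ValueError(f"No valid output after {max_attempts} attempts. Last output: {last!r}")
```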
How do we tackle this challenge of navigating the uncanny valley?
We all must agree that humans in the loop are the only way to design safe and reliable systems out of LLMs. Mental models are the hidden architects, and their influence is greater than we imagine as we develop or use LLM-based systems.
Embrace AI's Stochastic Nature: Avoid falling into the trap of thinking that the AI can reason; provide guided exploration for users. Establish guidance, inform users about the limitations of the technology, and prepare them for the idiosyncrasies of LLM design.
Reduce Operational Surprises: Learn from the history of automation and design the system so that it is easy for humans to make sense of, focusing on the observability and directability of the system. It is possible to build reliable systems, to an extent, on top of unreliable foundations, with early and continuous validation.
Look Beyond the Task at Hand: Look for second-order effects of tasks that are automated with LLMs. For example, LLMs can generate code quickly, but is that code easy for humans to read when they must debug it in production? Adopting LLMs improves the skills of experts but breaks the cycle of learning for novices.
Design for Humans in the Loop: Do not build automation to replace human tasks; rather, design for human-machine coordination. Consider the cognitive load on users and its impact when the system is used in production. Train users with the framing and guidance they need to handle ineffective outcomes arising from LLM use.
References:
AI is not good software. It is pretty good people.
How to Make Automated Systems Team Players
Ironies of Automation
After AI beat them, professional go players got better and more creative