Summary of “What's wrong with LLMs and what we should be building instead” - Tom Dietterich - #VSCF2023
Talk overview
- Speaker: Tom Dietterich (VSCF 2023).
- Goal: Review limitations of current large language models (LLMs) and argue for a modular architecture that separates language from world knowledge, reasoning, memory and metacognition.
Key capabilities acknowledged
LLMs demonstrate important strengths:
- Fluent conversation and writing.
- Code generation from natural language.
- Few-shot / in-context learning.
- Broad linguistic knowledge from ingesting large portions of the web.
Main problems (with examples)
- Incorrect and inconsistent outputs (hallucination)
  - LLMs invent facts or contradict themselves. Example: a GPT-2 story that says a unicorn has both one horn and four horns.
  - TruthfulQA benchmark: earlier models perform poorly; GPT-4 (with special training) barely exceeds 50% on hard truthfulness queries.
  - Fabricated citations, people, and events have been reported (invented journal articles, false accusations).
- Dangerous / biased / socially unacceptable outputs
  - Models can produce racist or sexist content and dangerous instructions. Example (from the transcript): a generated Python function that uses race and gender to decide who is a “good scientist”.
  - Prompting workarounds and storytelling can elicit harmful outputs despite safety training.
- Expensive to train and hard to update
  - Very large training costs (GPT-4 reportedly cost more than $100M).
  - Knowledge encoded in the weights becomes stale and is expensive to correct — LLMs support “ask” but not an inexpensive “tell”.
- Lack of attribution and provenance
  - It is difficult to trace which training documents produced a particular output.
  - Retrieval-augmented systems can add links, but the attribution is often incorrect or unsupported.
- Poor non-linguistic and structured reasoning
  - Weak spatial and mental-model reasoning (e.g., incorrect answers about positions in a room).
  - Weak formal reasoning, planning, and process reasoning (actions, preconditions, side effects).
- Miscalibration and overconfidence
  - Techniques like RLHF reduce certain harms but can damage probability calibration: models become overconfident and less likely to say “I don’t know.”
- Security / data-poisoning risks
  - Web retrieval exposes models to adversarial documents that can inject harmful instructions or manipulate behavior.
Diagnoses / conceptual framing
- LLMs are statistical models of language and knowledge, not structured knowledge bases. They often treat epistemic gaps as aleatoric randomness and therefore generate confident but bogus answers rather than refusing.
- The current monolithic design entangles linguistic competence, factual/world knowledge, commonsense, episodic memory, reasoning, and metacognition. This entanglement makes updates, verification, and modular reasoning difficult.
Existing mitigation strategies (and limitations)
- Retrieval-augmented models (Retro, Bing-style retrieval + LM)
- Help update knowledge and provide citations.
- Limitations: model priors can “contaminate” retrieved evidence; generated sentences are sometimes unsupported by cited documents; cited docs may be unused.
- RLHF / human preference training
- Reduces certain harms.
- Limitations: introduces calibration issues and opaque preference models.
- Iterative answer checking
- Ask multiple paraphrased questions; use consistency solvers; have the model critique and refine outputs.
- Tool use / external verification
- Call theorem provers, planners, execute generated code, use program verification to check outputs.
- Architectural add-ons
- Train separate detectors for inappropriate content; use approaches like Constitutional AI to shape behavior.
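The iterative answer-checking idea above (ask paraphrased questions, then check consistency) can be sketched as a simple majority-vote wrapper. The `ask_model` callable and the canned answers are illustrative stand-ins, not part of the talk; any LLM client could be plugged in:

```python
from collections import Counter

def consistency_check(ask_model, paraphrases, min_agreement=0.7):
    """Ask the same question in several paraphrased forms and accept the
    answer only if a clear majority of the responses agree.

    ask_model: hypothetical callable mapping a prompt to an answer string.
    """
    answers = [ask_model(p).strip().lower() for p in paraphrases]
    best, count = Counter(answers).most_common(1)[0]
    agreement = count / len(answers)
    if agreement >= min_agreement:
        return best, agreement
    return None, agreement  # inconsistent: refuse rather than guess

# Toy stand-in "model": a lookup table that answers all phrasings the same way.
canned = {
    "capital of france?": "paris",
    "what city is france's capital?": "paris",
    "france's seat of government?": "paris",
}
answer, score = consistency_check(canned.get, list(canned))
```

With a real model, disagreement across paraphrases is a cheap signal that the answer may be a hallucination.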
Proposed direction (Dietterich’s recommendations)
- Move to modular architectures inspired by cognitive neuroscience
  - Separate components for language (syntax/semantics), factual/world knowledge (an updatable KB), commonsense reasoning, episodic memory, situation models, formal reasoning/planning, and a metacognitive (prefrontal-like) orchestrator.
- Use structured knowledge representations (knowledge graphs)
  - Extract facts from text into a KB; infer communicative goals and pragmatics.
  - Design an encoder that maps paragraph → detected facts + communicative intent and adds the facts to the KB with evidence accumulation.
  - Design a decoder that generates text from explicit KB facts and goals, so that fact extraction and attribution fall out as side effects.
  - Revisit projects like NELL (Never-Ending Language Learning), bootstrapped with modern LLMs to populate and maintain KBs.
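A minimal sketch of the "updatable KB with evidence accumulation" idea: facts as triples, each carrying its supporting sources, with a cheap "tell" and a provenance-returning "ask". The class and method names are illustrative, not from the talk:

```python
from collections import defaultdict

class EvidenceKB:
    """Toy knowledge base: (subject, relation, object) triples, each
    mapped to the set of source documents that support it."""

    def __init__(self):
        self.evidence = defaultdict(set)  # triple -> {source ids}

    def tell(self, triple, source):
        """Cheap 'tell': adding or updating a fact just records evidence."""
        self.evidence[triple].add(source)

    def ask(self, triple, min_sources=2):
        """'Ask' accepts a fact only when enough independent sources
        support it, and returns the provenance for attribution."""
        sources = self.evidence.get(triple, set())
        return len(sources) >= min_sources, sorted(sources)

kb = EvidenceKB()
kb.tell(("unicorn", "num_horns", "1"), "doc_a")
kb.tell(("unicorn", "num_horns", "1"), "doc_b")
supported, provenance = kb.ask(("unicorn", "num_horns", "1"))
```

Unlike weights, such a store can be corrected by deleting a triple or discounting a source, and every answer carries its citations.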
- Train systems to return answers plus arguments / justifications
  - Require explicit argumentation and provenance so downstream consumers and modules can evaluate soundness.
  - Use formal argumentation and knowledge-representation techniques to handle inconsistency, multiple viewpoints, and cultural differences (a micro-worlds approach).
- Integrate reasoning and planning more tightly
  - Make formal reasoning, proof assistants, and planners native parts of the system, or tightly coordinated subsystems, rather than mere external “tools.”
- Improve calibration and the ability to say “I don’t know”
  - Develop competence estimation and refusal mechanisms; estimate epistemic uncertainty separately from generative confidence.
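One common proxy for the epistemic uncertainty mentioned above (not the speaker's specific proposal) is the spread of repeated samples: if the model's answers to the same question disagree, refuse instead of guessing. A minimal sketch:

```python
import math
from collections import Counter

def answer_or_refuse(samples, max_entropy=0.5):
    """Treat the entropy of repeated answer samples as an epistemic-
    uncertainty estimate; refuse when the answers disagree too much."""
    counts = Counter(samples)
    total = len(samples)
    entropy = -sum((c / total) * math.log2(c / total)
                   for c in counts.values())
    if entropy <= max_entropy:
        return counts.most_common(1)[0][0]
    return "I don't know"  # explicit refusal beats a confident fabrication

confident = answer_or_refuse(["42", "42", "42", "42"])   # entropy 0
uncertain = answer_or_refuse(["42", "17", "7", "1999"])  # entropy 2
```

The threshold here is arbitrary; the point is that refusal is a separate, tunable decision rather than a byproduct of generation.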
- Prioritize inspectability and provenance
  - Make sources, preference models, and safety rules auditable to address bias and contested norms.
- Support research infrastructure and openness
  - Governments and funders should provide compute and resources so academia and smaller groups can experiment openly.
  - Open-source LLM efforts (e.g., the Alpaca activity) accelerate research and fixes.
Practical application notes
- Appropriate uses today:
- Low-risk and syntactic tasks: translation, format conversion (JSON ↔ CSV), writing assistance, creative work.
- For high-stakes tasks (safety-critical systems, legal/medical claims, driving, etc.):
- Use LLM outputs only with external verification (execute/check code, run classical planners, formal proofs, or apply human-in-the-loop checks).
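The "execute/check code" form of external verification can be sketched as follows: run the generated source in a scratch namespace and accept it only if it passes explicit test cases. (A sandboxed subprocess would be safer in practice; this is a minimal illustration, and the function names are ours, not from the talk.)

```python
def verify_generated_code(source, test_cases, func_name="solve"):
    """Accept LLM-generated code only if it defines `func_name` and
    passes every (args, expected) test case; reject on any error."""
    namespace = {}
    try:
        # WARNING: exec of untrusted code belongs in a real sandbox.
        exec(source, namespace)
        fn = namespace[func_name]
        return all(fn(*args) == expected for args, expected in test_cases)
    except Exception:
        return False

generated = "def solve(a, b):\n    return a + b\n"
ok = verify_generated_code(generated, [((2, 3), 5), ((-1, 1), 0)])
```

The same pattern generalizes: run a classical planner on a generated plan, or a proof checker on a generated proof, and let the verifier, not the LLM, decide acceptance.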
Unresolved research challenges
- Design and training of the metacognitive controller (prefrontal-like) that monitors social/ethical acceptability and orchestrates modules.
- Robust representation of process knowledge and action dynamics.
- Maintaining and enforcing trustworthiness across web sources (search quality, source trust estimation).
- Reconciling multiple, possibly conflicting worldviews and representing uncertainty and disagreement.
Cited / referenced works, systems and actors
- Speaker: Tom Dietterich.
- LLMs and platforms: GPT-2, GPT-3, GPT-4 (OpenAI), ChatGPT, Bing (Microsoft).
- Retrieval-augmented models: Retro and similar architectures.
- Benchmarks and evaluations: TruthfulQA; Stanford evaluation of retrieval-augmented systems (Perplexity, Bing-like systems).
- Safety/alignment approaches: RLHF (OpenAI); Constitutional AI (Anthropic); Anthropic (company).
- Knowledge-graph / continual learning: NELL (Tom Mitchell et al., Never-Ending Language Learning).
- Cognitive perspective: Mahowald et al. (dissociating language and thought from LLMs).
- Other mentions: Alpaca (open-source community activity); Adept (tool-automation startup); general references to program verification, planners, and proof assistants.
Bottom line
Dietterich argues that LLMs are powerful but flawed because they are monolithic statistical models rather than modular AI systems with explicit, updatable knowledge, provenance, reasoning, and metacognition. He proposes a research program to build modular architectures (knowledge graphs, episodic memory, planners, argumentation), integrate verification tools, and produce systems that justify claims, can be efficiently updated, and are auditable.