FANToM: A Benchmark for Stress-testing Machine Theory of Mind in Interactions

1Allen Institute for AI, 2University of Washington, 3Carnegie Mellon University, 4Seoul National University
EMNLP 2023
Image credit: Bing Image Creator

Abstract

Theory of mind (ToM) evaluations currently focus on testing models using passive narratives that inherently lack interactivity. We introduce FANToM 👻, a new benchmark designed to stress-test ToM within information-asymmetric conversational contexts via question answering. Our benchmark draws upon important theoretical requisites from psychology and necessary empirical considerations when evaluating large language models (LLMs). In particular, we formulate multiple types of questions that demand the same underlying reasoning to identify an illusory or false sense of ToM capabilities in LLMs. We show that FANToM is challenging for state-of-the-art LLMs, which perform significantly worse than humans even with chain-of-thought reasoning or fine-tuning.

Benchmark Design

An example of FANToM's question-answer set.


We construct FANToM by leveraging information asymmetry in conversational contexts. It consists of multi-party conversations centered on a certain topic (e.g., pets, family). The conversation begins with two or three characters. As it progresses, characters join and leave the discussion and the subtopic changes over time. While a character is absent, the conversation continues and information is shared among the remaining participants, creating a natural information asymmetry that reflects real-life interactions. After a series of utterances, the absent character (re)joins the conversation, unaware of the information that was shared with the other participants in the meantime.
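As a rough sketch of how such a conversation with information asymmetry could be represented, consider the following; the class and field names (Utterance, Segment, participants, etc.) are illustrative assumptions, not FANToM's actual data schema.

```python
from dataclasses import dataclass, field

# Illustrative sketch only: these classes and fields are assumptions made for
# exposition, not the benchmark's released data format.

@dataclass
class Utterance:
    speaker: str
    text: str

@dataclass
class Segment:
    subtopic: str
    participants: set                       # characters present in this stretch
    utterances: list = field(default_factory=list)

def aware_characters(segments, info_segment_index):
    """A character is aware of a piece of information only if they were
    present in the segment where that information was shared."""
    return set(segments[info_segment_index].participants)
```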

On top of this asymmetry, we build fact questions (FactQ) and convert them into multiple challenging question types: (1) BeliefQ (choice and free-response types), (2) AnswerabilityQ (list and binary types), and (3) InfoAccessQ (list and binary types). All of these questions require the same underlying theory of mind (ToM) reasoning: who is aware of the information shared in the conversation?
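For concreteness, here is a hypothetical illustration of how a single fact question expands into the question types above; the names, wording, and dictionary fields are invented for exposition and are not actual benchmark items.

```python
# Hypothetical example only: not an actual FANToM item.

fact_q = {
    "question": "What breed is Linda's new dog?",
    "answer": "A border collie.",
    "aware_characters": ["Linda", "David"],   # present when this was shared
    "unaware_characters": ["Kailey"],         # absent during that part
}

belief_q = {          # BeliefQ: choice and free-response types
    "question": "What breed does Kailey think Linda's new dog is?",
}
answerability_q = {   # AnswerabilityQ: list and binary types
    "question": "Who among the participants knows what breed Linda's new dog is?",
    "answer": fact_q["aware_characters"],
}
info_access_q = {     # InfoAccessQ: list and binary types
    "question": "Does Kailey know what breed Linda's new dog is?",
    "answer": "No",
}
```

Answering any of these correctly comes down to tracking who was present when the information was shared.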

Results

LLMs do not have a coherent theory of mind

All state-of-the-art LLMs score significantly worse than humans. Models perform significantly better on BeliefQ[Choice] than on AnswerabilityQ[List] and InfoAccessQ[List]. Although AnswerabilityQ[List] and InfoAccessQ[List] are prerequisites for solving BeliefQ[Choice], they are much more challenging for models. Furthermore, models' performance drops sharply when they are evaluated for coherent reasoning across multiple question types that share the same underlying ToM reasoning (i.e., All Question Types). These findings suggest that some instances of successful LLM ToM reasoning should be interpreted as illusory.
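As a rough sketch of what such a coherence check could look like (the paper's exact aggregation may differ), a model is only credited for a piece of information if it answers every linked question type about that information correctly.

```python
# Assumed scoring sketch: credit the model for an information piece only when
# all linked question types about it are answered correctly.

def coherent_on_info_piece(per_question_correct):
    """per_question_correct maps question type -> bool, e.g.
    {"BeliefQ[Choice]": True, "AnswerabilityQ[List]": False, ...}."""
    return all(per_question_correct.values())

def all_question_types_score(results):
    """Fraction of information pieces on which the model is fully coherent."""
    coherent = [coherent_on_info_piece(r) for r in results]
    return sum(coherent) / len(coherent) if coherent else 0.0
```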

Barchart for comparing LLMs with human performance

LLMs are tricked by their own use of shortcuts

The token F1 score for FactQ reflects the model's basic comprehension of the interaction: scoring high on FactQ indicates the model is good at identifying the piece of information most relevant to answering the question. Meanwhile, we deliberately design the incorrect answers in BeliefQ[Dist.] to have greater word overlap with the context than the correct answers, and BeliefQ[Dist.] shares significant word overlap with FactQ. Thus, a model that mindlessly copies the most relevant-looking piece of information when answering the belief questions as well will score low accuracy on them.
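For reference, a standard token-level F1, as commonly used in extractive QA evaluation, can be computed as below; this is a sketch for illustration, not the exact evaluation script.

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-level F1 between a predicted and a reference answer string."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)  # overlapping tokens
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```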

Barchart comparing the performance on fact questions and belief questions

Chain-of-thought and straightforward fine-tuning are not enough

We observe an improvement in scores when zero-shot chain-of-thought (CoT) reasoning is applied. However, significant gaps to human performance remain. Although our benchmark is not intended for training, we also fine-tune (FT) Flan-T5 XL on FANToM to see how much performance it gains. While the model shows a significant improvement on individual question types, it still does not exhibit coherent ToM reasoning.
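As a rough illustration, zero-shot CoT simply appends a reasoning trigger to the prompt before querying the model; the trigger phrase and template below are placeholder assumptions, not the exact prompt used in our experiments.

```python
# Sketch of zero-shot chain-of-thought prompting; wording is an assumption.

def build_zero_shot_cot_prompt(conversation, question):
    return (
        f"{conversation}\n\n"
        f"Question: {question}\n"
        "Answer: Let's think step by step."
    )
```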

Model performance when applying chain-of-thought reasoning or fine-tuning


Even the errors they make are inconsistent

We analyze the error types of AnswerabilityQ and InfoAccessQ for each model with and without chain-of-thought (CoT).
(1) For list-type questions, models make more errors by including characters who are unaware of the information (i.e., false positives) in their responses than by excluding characters who are aware (i.e., false negatives). Interestingly, when CoT is applied, the error of including unaware characters decreases, whereas the error of excluding aware characters increases for most models. (2) For binary questions, models exhibit false negative responses more frequently than they do for list-type questions.
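The list-type error categories can be thought of as follows; this is a sketch under an assumed input format (sets of character names), not the exact analysis code.

```python
# Sketch of the list-question error categories described above.

def list_question_errors(predicted, aware, all_characters):
    unaware = all_characters - aware
    return {
        # unaware characters wrongly included in the answer (false positives)
        "false_positives": predicted & unaware,
        # aware characters wrongly left out of the answer (false negatives)
        "false_negatives": aware - predicted,
    }
```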

Error types on AnswerabilityQ and InfoAccessQ with and without chain-of-thought reasoning

BibTeX

@inproceedings{kim2023fantom,
    title={FANToM: A Benchmark for Stress-testing Machine Theory of Mind in Interactions},
    author={Kim, Hyunwoo and Sclar, Melanie and Zhou, Xuhui and Le Bras, Ronan and Kim, Gunhee and Choi, Yejin and Sap, Maarten},
    booktitle={Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing},
    year={2023}
}