arxiv.org

Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory

1 ByteDance Seed
2 Zhejiang University
3 Shanghai Jiao Tong University
* Equal contribution
† Corresponding author


We introduce M3-Agent, a novel multimodal agent framework equipped with long-term memory. Like humans, M3-Agent can process real-time visual and auditory inputs to build and update episodic and semantic memories, gradually accumulating world knowledge. Its memory is organized in an entity-centric, multimodal manner, enabling deeper and more consistent understanding of the environment. Given an instruction, M3-Agent autonomously performs multi-turn reasoning and retrieves relevant memories to complete tasks. To evaluate memory effectiveness and memory-based reasoning in multimodal agents, we develop M3-Bench, a long-video question answering benchmark comprising 100 newly recorded robot-perspective videos (M3-Bench-robot) and 920 diverse web-sourced videos (M3-Bench-web). We annotate QA pairs designed to test capabilities essential for agent applications, such as person understanding, general knowledge extraction, and cross-modal reasoning. Experimental results show that M3-Agent, trained via reinforcement learning, outperforms the strongest baseline, a prompting agent using Gemini-1.5-pro and GPT-4o, achieving 6.7%, 7.7%, and 5.3% higher accuracy on M3-Bench-robot, M3-Bench-web and VideoMME-long, respectively. Our work advances multimodal agents toward more human-like long-term memory and provides insights for their practical design. Model, code and data are available at https://github.com/bytedance-seed/m3-agent.

Project Page: https://m3-agent.github.io


1 Introduction

Imagine a future household robot that can autonomously carry out household tasks without your explicit instructions; it must have learned the operational rules of your home through daily experience. In the morning, it hands you a cup of coffee without asking “coffee or tea?”, because it has gradually formed a memory of you, tracking your preferences and routines through long-term interactions. For a multimodal agent, achieving this level of intelligence fundamentally relies on three capabilities: (1) continuously perceiving the world via multimodal sensors; (2) storing and organizing its experiences into a long-term memory, gradually building knowledge of its environment; and (3) reasoning over this accumulated memory to guide its actions.

To achieve these goals, we propose M3-Agent, a novel multimodal agent framework equipped with long-term memory. As shown in Figure 1, it operates through two parallel processes: memorization, which continuously perceives real-time multimodal inputs to construct and update long-term memory; and control, which interprets external instructions, reasons over the stored memory, and executes the corresponding tasks.

During memorization, M3-Agent processes the incoming video stream, capturing both fine-grained details and high-level abstractions by generating two types of memory, analogous to human cognitive systems [44, 45]:

Episodic memory: Concrete events observed within the video. For example, "Alice takes the coffee and says, ‘I can’t go without this in the morning,’" and "Alice throws an empty bottle into the green garbage bin."

Semantic memory: General knowledge from the clip. For example, "Alice prefers to drink coffee in the morning" and "The green garbage bin is used for recycling."

The generated contents are then integrated into the agent’s long-term memory, which supports multimodal information such as faces, voices, and textual knowledge. The memory is organized in an entity-centric structure. For example, information related to the same person (e.g., their face, voice, and associated knowledge) is connected within a graph, as shown in Figure 1. These connections are incrementally established as the agent extracts and integrates episodic and semantic memory.
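To make the entity-centric organization concrete, the following is a minimal sketch (not the authors' implementation; class and field names are hypothetical) of a memory graph in which nodes carry a modality and a weight, and undirected edges group items belonging to the same entity:

```python
from dataclasses import dataclass

@dataclass
class MemoryNode:
    # Hypothetical node layout mirroring the entity-centric design:
    # an ID, a modality, raw content, and an activation weight.
    node_id: str
    modality: str          # "text", "face", or "voice"
    content: str           # raw content or a pointer to stored media
    weight: float = 1.0

class MemoryGraph:
    def __init__(self):
        self.nodes: dict[str, MemoryNode] = {}
        self.edges: dict[str, set[str]] = {}   # undirected adjacency

    def add_node(self, node: MemoryNode):
        self.nodes[node.node_id] = node
        self.edges.setdefault(node.node_id, set())

    def link(self, a: str, b: str):
        # Connect items referring to the same entity (e.g. a face and a voice).
        self.edges[a].add(b)
        self.edges[b].add(a)

    def entity_cluster(self, node_id: str) -> set[str]:
        # All memory items reachable from one item, i.e. the same entity.
        seen, stack = set(), [node_id]
        while stack:
            cur = stack.pop()
            if cur not in seen:
                seen.add(cur)
                stack.extend(self.edges[cur])
        return seen
```

Under this layout, linking a face node, a voice node, and a knowledge node makes all three retrievable from any one of them, which is the associative behavior the graph structure is meant to provide.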


During control, M3-Agent leverages its long-term memory to reason and complete tasks. It autonomously retrieves relevant information from its long-term memory across different dimensions, such as events or characters. Instead of using single-turn retrieval-augmented generation (RAG) to load memory into context [22], M3-Agent employs reinforcement learning to enable multi-turn reasoning and iterative memory retrieval, resulting in higher task success rates.

The memorization task relates to long video description [13, 57, 18] but goes beyond it, introducing two key challenges: (1) Infinite information processing. Memorization requires handling infinitely long input streams. Existing methods optimize architectural efficiency to process longer, but still finite, offline videos [42, 40, 58, 14, 41]. In contrast, M3-Agent continuously processes arbitrarily long multimodal streams online, more closely mimicking how human long-term memory forms, through ongoing perception and incremental experience integration. (2) World knowledge construction. Traditional video description [24, 26, 27, 55, 1] often focuses on low-level visual details while overlooking high-level world knowledge [36, 19, 12] such as character identity and entity attributes, which may lead to ambiguity and inconsistency in long-term contexts. M3-Agent addresses this by incrementally building world knowledge through an entity-centric memory structure. It forms rich, multimodal representations of key entities, enabling coherent and consistent long-term memory.

We evaluate M3-Agent on long video question answering (LVQA), where the videos simulate the multimodal input streams (visual and auditory) received by an agent. Most existing LVQA benchmarks [11, 62, 2, 50] mainly focus on visual understanding, such as action recognition and spatial/temporal perception, leaving a gap in evaluating higher-level cognitive abilities that rely on long-term memory and are crucial for real-world agents, such as understanding persons, extracting general knowledge, and performing cross-modal reasoning. To bridge this gap, we introduce M3-Bench, a new LVQA benchmark designed to evaluate a multimodal agent’s ability to reason with long-term memory. M3-Bench consists of videos from two sources: (1) M3-Bench-robot, consisting of 100 real-world videos recorded from a robot’s perspective, and (2) M3-Bench-web, comprising 920 YouTube videos spanning a broader range of content and scenarios. We define five question types, as shown in Table 1, targeting different aspects of memory-based reasoning. In total, we annotate 1,276 QA pairs for M3-Bench-robot and 3,214 QA pairs for M3-Bench-web.

We conduct experiments on the M3-Bench-robot, M3-Bench-web, and VideoMME-long [11].
Results show that M3-Agent trained via reinforcement learning outperforms all baselines on all three benchmarks. Compared to the strongest baseline, Gemini-GPT4o-Hybrid, which implements the M3-Agent framework by prompting Gemini-1.5-Pro [43] for memorization and GPT-4o [17] for control, M3-Agent improves accuracy by 6.7%, 7.7%, and 5.3% on M3-Bench-robot, M3-Bench-web, and VideoMME-long, respectively.
Our ablation study demonstrates the importance of semantic memory: removing it reduces accuracy by 17.1%, 19.2% and 13.1% on M3-Bench-robot, M3-Bench-web, and VideoMME-long, respectively. Furthermore, we examine the impact of RL training, inter-turn instructions, and reasoning mode on control performance. Specifically, RL training improves accuracy by 10.0%, 8.0%, and 9.3% on the respective benchmarks. Removing inter-turn instruction results in a 10.5%, 5.8% and 5.9% decrease in accuracy, while disabling reasoning mode leads to accuracy declines of 11.7%, 8.8% and 9.5% on the three benchmarks.

The main contributions of this paper are summarized as follows:

We introduce M3-Agent, a novel framework for multimodal agents with long-term memory. M3-Agent continuously processes real-time multimodal inputs (seeing and listening), incrementally builds world knowledge by generating both episodic and semantic memories (remembering), and performs reasoning over these memories to complete complex instructions (reasoning).

We develop M3-Bench, a new LVQA benchmark designed to evaluate the effectiveness of memory and memory-based reasoning for multimodal agents.

Our experiments demonstrate that M3-Agent, trained by reinforcement learning, consistently outperforms agents based on prompted commercial models across multiple benchmarks.

2 Related Work

2.1 Long-Term Memory of AI Agents

Long-term memory is essential for AI agents [10], enabling them to retain distant contextual information and support more advanced reasoning. A common approach is to append entire agent trajectories, such as dialogues [33, 46, 29, 61] or execution trajectories [31, 48, 38, 29, 16, 37], directly to memory. Beyond raw data, some methods incorporate summaries [46, 23, 16, 61], latent embeddings [58, 30, 42, 6], or structured knowledge representations [35, 52]. Recent systems further construct sophisticated memory architectures, giving agents finer control over memory management [5, 46, 20].

However, most existing approaches focus on LLM agents. In contrast, multimodal agents process a broader range of inputs and store richer, multimodal content and concepts in memory [8, 7]. This also introduces new challenges, particularly in maintaining consistency of long-term memory. Moreover, just as humans acquire world knowledge through experience, multimodal agents should form internal world knowledge in memory, rather than merely storing descriptions of experiences.

2.2 Online Video Understanding

For a multimodal agent, memory formation is closely related to online video understanding, a challenging task that requires real-time processing of video streams and decision-making based on past observations. Traditional approaches to long video understanding, such as extending the context window in multimodal models [4, 60] or compressing visual tokens to increase temporal coverage [49, 21], do not scale effectively to infinitely long video streams. In practical settings, such as interactive agent scenarios, reprocessing the entire video history for each new instruction is computationally prohibitive.

To improve scalability, memory-based methods [58, 14, 42, 59] introduce memory modules that store encoded visual features for future retrieval. These architectures are suited for online video processing. However, they face a fundamental limitation: maintaining long-term consistency. Because they store only visual features, these methods struggle to maintain coherent tracking of entities such as human identities or evolving events over time.

With the rapid advancement of large multimodal and language models [17, 43, 53, 1, 55], the Socratic Models framework [56, 28, 57] has emerged as a promising approach for online video understanding. By leveraging multimodal models to generate video descriptions as language-based memory, this method improves scalability. Nevertheless, it still encounters challenges in maintaining long-term consistency across complex, evolving video content.


3 Datasets


In this section, we introduce M3-Bench, an LVQA dataset designed to evaluate the capability of multimodal agents to perform reasoning over long-term memory. Each instance in M3-Bench comprises a long video simulating the perceptual input of an agent, along with a series of open-ended question-answer pairs. The dataset is organized into two subsets: (1) M3-Bench-robot, which contains 100 real-world videos recorded from a robot’s first-person perspective, and (2) M3-Bench-web, which includes 920 web-sourced videos covering a wider variety of content and scenarios. To comprehensively assess an agent’s ability to recall past observations and perform memory-based reasoning, we curate five distinct types of questions, as summarized in Table 1. Overall, M3-Bench features
(1) long-duration, real-world videos that encompass diverse real-life scenarios relevant to the deployment of multimodal agents, and
(2) challenging questions that extend beyond shallow perceptual understanding and require complex reasoning over long-term contexts.

Figure 2 presents examples from M3-Bench. The overall statistics of M3-Bench are shown in Figure 3. Table 2 provides a comparative analysis with existing LVQA benchmarks. The remainder of this section elaborates on the data collection and annotation procedures for M3-Bench-robot and M3-Bench-web, respectively.

3.1 M3-Bench-robot

Robots are representative examples of multimodal agents. A general-purpose robot should be able to maintain long-term memory and reason with it to guide its actions.
For example, as it processes observations, the robot may remember a person’s name, where they left their coat, or their coffee preference. Reasoning over long-term memory enables higher-level cognitive functions, such as inferring a person’s personality, understanding relationships among individuals, or identifying the functions of surrounding objects. To systematically evaluate these capabilities, we record a new collection of videos from a robot’s perspective and manually annotate corresponding question-answer pairs.

Script Design
We begin by designing video scripts for M3-Bench-robot across seven everyday scenarios where robots are expected to operate: living room, kitchen, bedroom, study, office, meeting room, and gym. Each script involves one robot interacting with two to four humans. Annotators are instructed to design human–robot interactions that reflect the desirable capabilities of general-purpose service robots.

To ensure diversity in the script content, we introduce multiple thematic variations for each scenario. For example, the living room scenario may include themes such as meeting friends, engaging in family conversations, or hosting a Thanksgiving party. Annotators write one script for each theme, thereby ensuring broad coverage and high variability across scripts.
Specifically, each script is structured as a sequence of discrete events and questions. Some events are designed as reference events, containing information relevant to a future question. Questions may appear after any event or at the end of the script. When appearing within the event sequence, questions are typically closely tied to the current plot; moving them can alter their answers or affect difficulty. An example script is provided in Table LABEL:robot_script_example (§ 8.5).

To ensure the complexity of video content and the quality of downstream video filming and annotation, annotators must meet the following criteria:

Annotate at least 15 questions, each labeled with the reference events required to answer them.

Ensure each question is assigned to at least one type listed in Table 1.

Each script must contain at least 70 events to ensure a minimum video duration of 30 minutes.
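A script satisfying these criteria could be checked mechanically. The sketch below assumes a hypothetical dict layout for scripts, and uses placeholder names for the five question types of Table 1 (the actual type names are not reproduced here):

```python
# Placeholder names standing in for the five question types of Table 1.
QUESTION_TYPES = {
    "person_understanding", "general_knowledge", "cross_modal",
    "type_4", "type_5",
}

def validate_script(script: dict) -> list[str]:
    """Return a list of violations of the script criteria above.

    Assumed layout: script["events"] is a list of events; script["questions"]
    is a list of dicts with "id", "reference_events", and "types" keys.
    """
    problems = []
    events, questions = script["events"], script["questions"]
    if len(events) < 70:
        problems.append(f"only {len(events)} events (need >= 70)")
    if len(questions) < 15:
        problems.append(f"only {len(questions)} questions (need >= 15)")
    for q in questions:
        if not q.get("reference_events"):
            problems.append(f"question {q['id']} lacks reference events")
        if not set(q.get("types", [])) & QUESTION_TYPES:
            problems.append(f"question {q['id']} has no valid type")
    return problems
```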

Video Filming
Recording videos with actual robots poses significant challenges due to high operational costs, hardware limitations, and deployment complexities. To address these constraints, we adopt a practical alternative: employing human actors to simulate robot behavior. This approach simplifies data collection while preserving both the first-person robot perspective and the multimodal quality required for our benchmark.

Each script involves multiple actors, with one designated to simulate the robot. This actor wears head-mounted camera equipment to capture the robot’s egocentric visual and auditory perspective. The resulting footage constitutes the final videos in M3-Bench-robot. To ensure diversity and minimize location bias, we recruit 67 actors and film across 51 distinct locations, with no more than three videos recorded at each location.

We collect two types of audio tracks for each video. The first is directly recorded by the head-mounted device, reflecting the raw auditory input a robot would naturally receive, including ambient sounds and spatial acoustic variations. The second is captured using individual lapel microphones worn by each actor, providing high-fidelity voice recordings to complement the primary audio stream.

Annotations
After recording the videos, annotators curate QA pairs for each video. Although some questions are pre-scripted, the final video content may deviate from the original script due to realistic filming conditions. Consequently, not all scripted questions remain applicable. Annotators carefully review each scripted question to determine whether it should be retained, revised, or discarded, and provide corresponding answers when necessary. For all retained or revised questions, annotators are required to specify the precise timestamp at which the question should be asked. Importantly, the timestamp must precede the robot’s corresponding response or action to avoid inadvertently revealing the answer.

In addition to the script-based questions, annotators are also required to create new questions to ensure that each video contains at least 12 QA pairs. All newly added questions should also align with one or more of the question types listed in Table 1.

Besides QA pair creation, annotators also generate subtitles to enhance the usability of the dataset. Specifically, they manually annotate the start and end timestamps for each dialogue segment, together with the speaker’s identity and the transcribed dialogue content.

Full annotation guidelines, annotator information, and quality control details for the M3-Bench-robot annotation are presented in Appendix 8.

3.2 M3-Bench-web

To further increase video diversity, we collect extra videos from YouTube following existing practice [11, 9, 34].

Video Collection
We adopt a question-driven approach to video collection: annotators select videos that can support the design of at least five questions belonging to the types listed in Table 1. This strategy naturally leads to the selection of videos with rich narratives and complex inter-entity relationships, making them well-suited for assessing an agent’s capability to reason with long-term memory.

To promote video diversity and avoid overrepresentation of easily annotated content, we provide annotators with a reference list of video categories emphasizing high information density and relevance to real-world multimodal agent applications. Annotators are required to submit up to 20 videos from each category and are allowed to suggest new categories, which are included if deemed sufficiently distinct from the existing category list by the authors. The final dataset comprises 46 distinct video types, as summarized in Figure 3.

QA Annotations
The same annotator who collects the video also generates at least five corresponding question-answer pairs. Each question must correspond to at least one type defined in Table 1. In M3-Bench-web, all question timestamps are set to the end of the video.
All questions are required to be specific, objective, and have a single unambiguous answer that can be reasonably derived from clues in the video, ensuring both the effectiveness and fairness of subsequent evaluation. For example, questions answerable from multiple perspectives or with ambiguous references, such as "the man" or "in the middle part of the video," are not considered valid.
Appendix 9 provides the full annotation guidelines, annotators’ information, and quality control details for M3-Bench-web.

3.3 Automatic Evaluation

We use GPT-4o as an automatic evaluator for M3-Bench by prompting it to assess the correctness of a generated answer by comparing it to the corresponding reference answer for the same question. The prompt template is shown in Table LABEL:prompt_gpt4o_evaluation (§ 15.1).

To validate GPT-4o as a reliable judge, we construct a test set of 100 randomly sampled triples, each consisting of a question, its reference answer, and a generated answer from our method or various baselines (§ 5.1). Three authors independently evaluate the correctness of each generated answer, and GPT-4o’s judgments are compared with the majority vote of human annotations. GPT-4o achieves 96% agreement with human judges, confirming its effectiveness as an automatic evaluator.
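The agreement measurement can be reproduced in a few lines. The sketch below assumes a simple tuple layout per sample (the judge's label plus the three human labels), which is an illustrative choice rather than the paper's implementation:

```python
from collections import Counter

def majority_vote(labels):
    # Majority label among the human annotators (three per sample here,
    # so no ties can occur).
    return Counter(labels).most_common(1)[0][0]

def judge_agreement(samples):
    """Fraction of samples where the automatic judge matches the human majority.

    Each sample: (judge_label, [human_label_1, human_label_2, human_label_3]).
    """
    hits = sum(judge == majority_vote(humans) for judge, humans in samples)
    return hits / len(samples)
```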

4 Approach

As shown in Figure 1, M3-Agent consists of a multimodal LLM and a long-term memory module. It operates through two parallel processes: memorization, which enables continuous processing of arbitrarily long video streams and builds a lifelong memory; and control, which reasons over long-term memory to execute instructions. In the following subsections, we detail long-term memory storage, memorization, and control, respectively.

4.1 Long-Term Memory

Long-term memory is implemented as an external database that stores information in a structured, multimodal format (text, images, audio). Specifically, memories are organized as an entity-centric multimodal graph, where each node represents a distinct memory item. Each node includes a unique ID, modality type, raw content, weight, embeddings, and other metadata such as timestamps. See Table 3 for details.
Nodes are connected by undirected edges that represent logical relationships between memory items. For example, items sharing the same entity ID are linked to form an entity-centric memory graph. This design supports not only sequential retrieval of memories based on timestamps but also associative retrieval based on entities.

The agent constructs its memory by incrementally adding new text, image, or audio nodes. When a memory generated by the memorization process already exists in long-term memory, the corresponding node or edge is reactivated and its weight is increased; if the memory is new, a corresponding node or edge is added to the graph. Conflicting information may be introduced during construction. To resolve this, M3-Agent applies a weight-based voting mechanism during inference: frequently activated entries accumulate higher weights and override conflicting entries with lower weights. This mechanism ensures the robustness and consistency of the memory graph over time.
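A minimal sketch of this weight-based voting, with hypothetical helper names: entries accumulate weight when reactivated, conflicting candidates coexist in storage, and conflicts are resolved at read time by the heaviest candidate.

```python
def upsert(memory: dict, key, value, weight: float = 1.0):
    """Insert a memory entry, or reactivate it by increasing its weight.

    `memory` maps a key (e.g. a voice ID) to {candidate_value: weight};
    conflicting candidates coexist and are resolved at read time.
    """
    memory.setdefault(key, {})
    memory[key][value] = memory[key].get(value, 0.0) + weight

def resolve(memory: dict, key):
    """Weight-based vote: the most frequently reinforced candidate wins."""
    candidates = memory.get(key, {})
    return max(candidates, key=candidates.get) if candidates else None
```

For example, if `<voice_3>` is linked to `<face_0>` in three clips and mistakenly to `<face_5>` in one, `resolve` returns `<face_0>`.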

Search Tool
To facilitate memory retrieval, we provide a suite of search tools that enable the agent to retrieve relevant memories based on specific requirements. In particular, we implement two types of search mechanisms operating at different levels of granularity, as summarized in Table 4. Detailed implementation of these retrieval mechanisms is provided in Appendix 10.
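As an illustration of the two granularities (the actual interfaces are specified in Table 4 and Appendix 10, and are not reproduced here), one might implement entity-level search over node embeddings and clip-level search over timestamp ranges roughly as follows:

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def search_node(nodes, query_emb, top_k=3):
    """Entity-level search: rank memory nodes by embedding similarity.

    `nodes` is an assumed list of dicts with an "emb" field.
    """
    ranked = sorted(nodes, key=lambda n: cosine(n["emb"], query_emb), reverse=True)
    return ranked[:top_k]

def search_clip(clips, start, end):
    """Clip-level search: return memories whose time span overlaps [start, end)."""
    return [c for c in clips if c["t0"] < end and c["t1"] > start]
```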

4.2 Memorization

As shown in Figure 1, during memorization, M3-Agent processes the incoming video stream in a clip-by-clip manner, generating two types of memory: episodic memory, which captures visual and auditory content from the raw video; and semantic memory, which extracts general knowledge such as character identities, attributes, relationships, and other world knowledge. Semantic memory not only enriches the memory content, but also provides additional retrieval cues, enhancing retrieval effectiveness for the control process.

Consistent Entity Representation
A key challenge in constructing high-quality long-term memory is maintaining consistent representations of core concepts—such as main characters and objects—across arbitrarily long time spans. Existing works typically generate language-based descriptions, such as "a man with a beard" or "a woman in a red dress". However, such textual descriptions are inherently ambiguous and prone to inconsistencies when accumulated over time. To address this issue, M3-Agent preserves the original multimodal features and constructs persistent identity representations within its long-term memory. This approach provides a more stable and robust foundation for ensuring consistency over time.

Specifically, we equip M3-Agent with a suite of external tools, including facial recognition and speaker identification. These tools extract the faces and voices of characters appearing in the clip and return their corresponding identities from the long-term memory. Each extracted face or voice is associated with an existing node using the search_node function or assigned to a newly created node. The resulting identifiers (face_id or voice_id) serve as persistent references to the corresponding characters. By leveraging the globally maintained memory graph as a unifying structure, M3-Agent ensures consistent character identity mapping across local memories from different clips, thereby forming a coherent long-term memory.

This approach can be generalized to encode more concepts, such as key locations or objects, into long-term memory, thereby further improving the consistency of memory generation. Detailed implementations of both tools are provided in Appendix 10.

Memory Generation
Given the face and voice identities, M3-Agent then generates both episodic and semantic memory. Each character must be referenced by their face_id or voice_id. For example: "<face_1> wears a red hat and blue top," or "<voice_2> speaks to <face_3>, ‘How are you doing today?’" This mechanism ensures that each character is unambiguously grounded with physical features stored in long-term memory.
Specifically, in semantic memory, M3-Agent can perform cross-modal reasoning to infer relationships between different entity IDs (e.g., linking a face and a voice belonging to the same person). These inferred equivalences can then be used to update the connections between face and voice nodes in the memory graph. Once linked, the pair is treated as a single character. During retrieval, connected nodes are unified under a shared <character_id>, enabling the model to reason about characters more consistently across modalities.
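Unifying linked face and voice nodes under a shared character ID is naturally expressed with a union-find structure. The sketch below is illustrative, not the paper's implementation; IDs and the character-ID format are assumptions:

```python
class CharacterIndex:
    """Union-find over face/voice IDs: linked nodes share one character ID."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        # Record an inferred equivalence, e.g. <face_3> and <voice_2>.
        self.parent[self.find(a)] = self.find(b)

    def character_id(self, x):
        # All members of one cluster map to the same shared identifier.
        return f"<character_{self.find(x)}>"
```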

With respect to the output format, M3-Agent generates both episodic and semantic memory as a list of text entries. Each entry is stored in the memory graph as a text node, except for entity ID relationships, which are represented as edges. As described in Section 4.1, conflicting information is resolved through a voting mechanism. For example, suppose <voice_3> corresponds to <face_0>; in some challenging clips, the system might temporarily link it to a different face. Over time, as correct associations accumulate, the weight of the correct mapping (<voice_3>, <face_0>) increases and dominates. This allows the system to robustly learn and maintain accurate knowledge, even in the presence of occasional local errors.

4.3 Control

When an instruction is received, the control process is triggered. As illustrated in Figure 1, during control, M3-Agent autonomously performs multi-turn reasoning and invokes search functions to retrieve relevant memories. Unlike traditional single-turn RAG, this iterative approach enables more complex planning, making the system more flexible and capable. Specifically, the control process follows Algorithm 1, with prompts in Table LABEL:search_agent_prompt (§ 15.3). Here $\pi_\theta$ is the control policy, $q$ is the user question, and $\mathcal{D}$ is the long-term memory. At each round, $\pi_\theta$ generates a response consisting of reasoning, an action, and an associated argument. If the action is [Search], the system queries $\mathcal{D}$ with the argument and appends the retrieved results to the context for the next round. Depending on the context, it can call different search functions to retrieve memories from multiple perspectives (e.g., search_node for people or search_clip for events). If the action is [Answer], the system returns the content and the process terminates. The loop continues for up to $H$ rounds.
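The control loop described above can be paraphrased as follows; `policy` and `memory.search` are stand-ins for the trained control policy and the search tools, and the interface (a triple of reasoning, action, argument) is an assumption based on the description:

```python
def control_loop(policy, question, memory, max_rounds=5):
    """Multi-turn control sketch: the policy alternates reasoning, [Search]
    actions, and a final [Answer].

    `policy(context)` is assumed to return (reasoning, action, argument);
    `memory.search(argument)` returns retrieved memory items.
    """
    context = [question]
    for _ in range(max_rounds):
        reasoning, action, argument = policy(context)
        if action == "Answer":
            return argument
        if action == "Search":
            # Retrieved memories are appended so the next round can use them.
            context.append(memory.search(argument))
    return None  # no answer produced within the round budget
```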

4.4 Training

We apply reinforcement learning to optimize M3-Agent. Although memorization and control are conceptually handled by a single model, we train two separate policy models to achieve optimal performance. Memorization relies on strong multimodal understanding, while control requires strong reasoning capabilities. Accordingly, we initialize each policy model from a different foundation model: Qwen2.5-Omni [51], an advanced open-source multimodal model supporting both visual and audio inputs, for memorization; and Qwen3 [53], an open-source large language model with powerful reasoning abilities, for control.

The training data are sourced from our in-house video dataset, for which we hold permissions for model training. We collect videos along with corresponding question-answer pairs, adhering to the same annotation standards used in the M3-Bench-web dataset. In total, the training dataset comprises 500 long videos, corresponding to 26,943 30-second clips, and 2,736 question-answer pairs.

Memorization
To improve the model’s ability to generate the desired memory, we perform imitation learning on Qwen2.5-Omni-7b to create memory-7b-sft. The process begins with constructing a high-quality synthetic demonstration dataset. We segment each video in the dataset into 30-second clips, and corresponding memory annotations are generated through a three-stage process: (1) Episodic memory synthesis: We adopt a hybrid annotation strategy that jointly prompts Gemini-1.5-Pro and GPT-4o: GPT-4o supplies frame-level cues, which serve as priors for Gemini-1.5-Pro, and the two outputs are merged to form richer narrative summaries than either alone. (2) Identity equivalence detection: We propose an algorithm that automatically mines high-confidence meta-clips (short monologue clips containing exactly one face and one voice) from a long video to construct a global face-voice correspondence. These meta-clips offer clear identity cues, enabling accurate face-voice pairing. Once the global mapping is established, it can be used to automatically annotate face-voice associations in any 30-second subclip. (3) Other semantic memory synthesis: We design prompt templates to extract semantic memories from various perspectives, guiding them to include the information listed in Table 10 (§ 11). Details of the data synthesis process are provided in Appendix 11. In total, we synthesize 10,952 samples: 10,752 for training and 200 for validation.
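Stage (2) above can be sketched as follows, assuming clip annotations that list detected faces and voices; resolving each face to its most frequent co-occurring voice is an illustrative choice, not necessarily the authors' exact tie-breaking rule:

```python
from collections import Counter

def mine_meta_clips(clips):
    """Mine meta-clips (exactly one face and one voice) and aggregate them
    into a global face-to-voice mapping by counting co-occurrences.

    `clips` is an assumed list of dicts with "faces" and "voices" lists.
    """
    votes = Counter()
    for clip in clips:
        if len(clip["faces"]) == 1 and len(clip["voices"]) == 1:
            # High-confidence pairing: a monologue clip with one face, one voice.
            votes[(clip["faces"][0], clip["voices"][0])] += 1
    mapping = {}
    # For each face, keep the voice it co-occurs with most often.
    for (face, voice), _count in votes.most_common():
        mapping.setdefault(face, voice)
    return mapping
```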

Fine-tuning is conducted for 3 epochs with a learning rate of 1e-5 and a batch size of 16, using 16 GPUs with 80 GB memory.

Control
We first set up the environment for RL training. For each video in the dataset, we generate the corresponding long-term memory using memory-7b-sft. For any given question, the agent is restricted to searching within the memory generated from the video associated with that question.

We then train the policy model $\pi_\theta$ using DAPO [54], initialized from control-32b-prompt. For each question-answer pair $(q, a)$ sampled from the training dataset $\mathcal{D}$, the policy $\pi_\theta$ rolls out a group of $G$ trajectories $\{\tau_i\}_{i=1}^{G}$ using Algorithm 1. For each trajectory $\tau_i$, the final submitted answer $y_i$ is extracted and evaluated using the GPT-4o evaluator introduced in Section 3.3. The reward of the $i$-th trajectory is given by:

$$R_i = \begin{cases} 1, & \text{if } y_i \text{ is judged correct against the reference answer } a, \\ 0, & \text{otherwise.} \end{cases}$$

Then, the advantage of the $i$-th response is calculated by normalizing the group-level rewards $\{R_i\}_{i=1}^{G}$:

$$\hat{A}_i = \frac{R_i - \operatorname{mean}\left(\{R_j\}_{j=1}^{G}\right)}{\operatorname{std}\left(\{R_j\}_{j=1}^{G}\right)}$$

Note that during training, we compute the loss only on LLM-generated tokens. The optimization objective is:

$$\mathcal{J}(\theta) = \mathbb{E}_{(q,a)\sim\mathcal{D},\ \{\tau_i\}_{i=1}^{G}\sim\pi_{\theta_{\text{old}}}}\left[\frac{1}{\sum_{i=1}^{G}\sum_{t}\mathbb{I}(\tau_{i,t})}\sum_{i=1}^{G}\sum_{t}\mathbb{I}(\tau_{i,t})\,\min\Big(r_{i,t}(\theta)\,\hat{A}_i,\ \operatorname{clip}\big(r_{i,t}(\theta),\,1-\varepsilon_{\text{low}},\,1+\varepsilon_{\text{high}}\big)\,\hat{A}_i\Big)\right]$$

with $r_{i,t}(\theta) = \pi_\theta(\tau_{i,t}\mid q, \tau_{i,<t}) \,/\, \pi_{\theta_{\text{old}}}(\tau_{i,t}\mid q, \tau_{i,<t})$ the token-level importance ratio,

where the indicator $\mathbb{I}(\tau_{i,t}) = 1$ if $\tau_{i,t}$ is an LLM-generated token, and $0$ otherwise.
Table 14 (§ 13) lists the hyperparameters used during the DAPO training process.

5 Experiments

5.1 Baselines

We evaluate M3-Agent against three types of baselines:

Socratic Models This baseline adapts the Socratic Models framework [56], which uses a multimodal model to describe 30-second video clips. These descriptions are stored as long-term memory. To answer a question, an LLM performs retrieval augmented generation (RAG) [22]: It first invokes a search_clip function to retrieve memory relevant to the question, and then generates a response based on the retrieved content.

We implement both closed-source and open-source multimodal models for memory generation:

Gemini-1.5-Pro [43]: Takes the full 30-second video clip as input.

GPT-4o [17]: Since it does not process audio, we provide video frames sampled at 0.5 fps and ASR transcripts.

Qwen2.5-Omni-7b [51]: An advanced open-source multimodal model that supports both visual and audio inputs. It receives the full video as input.

Qwen2.5-VL-7b [1]: An open-source vision-language model with SOTA results on visual-language tasks. Like GPT-4o, it receives both video frames (sampled at 0.5 fps) and ASR transcripts.

For all variants, GPT-4o serves as the LLM for RAG-based question answering. We apply extensive prompt engineering to optimize performance for each setup. All prompts are provided in Appendix 15.2.

Online Video Understanding Methods
We further compare our approach with three online video understanding frameworks: MovieChat [42], MA-LMM [14], and Flash-VStream [58]. Unless otherwise specified, we adopt their official pretrained weights and default configurations.

MovieChat [42]: It uses a sliding-window to extract frame-level features and stores them in a hybrid memory; the LLM performs QA conditioned on this memory.

MA-LMM [14]: It processes frames in an online manner, consisting of feature extraction (1 fps), temporal modeling (100-frame input), and LLM decoding.

Flash-VStream [58]: It adopts a two-stage asynchronous pipeline: stream video frame compression (1 fps), and LLM-based QA over the compressed features.

Agent Methods We also compare M3-Agent with agents implemented via prompting closed-source commercial models. Specifically, we consider the following two baselines:

Gemini-Agent: Gemini-1.5-Pro is prompted separately for the memorization and control processes. During memorization, it receives the full video with audio, along with facial recognition and speaker identification results, to generate episodic and semantic memories, denoted as memory-gemini-prompt. During control, it performs memory searches and generates responses, referred to as control-gemini-prompt.

Gemini-GPT4o-Hybrid: We also evaluate a setup where GPT-4o is prompted to perform memory search and generate responses (control-gpt4o-prompt). The memorization remains handled by memory-gemini-prompt.

The prompts are provided in Appendix 15.3.

We set the maximum number of execution rounds H to 5 for M3-Agent and all agent-based baselines. In the implementation of search_clip, the top 2 most relevant memory clips (i.e., k=2) are returned if any relevant clips are found; if no such clips are found, the method returns an empty result.
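The bounded multi-round execution described above can be sketched as follows. This is a simplified illustration; `run_control`, `policy`, and the "ANSWER:" prefix convention are our assumptions, not the actual interface:

```python
def run_control(question, search_clip, policy, max_rounds=5):
    """Bounded multi-round control loop. `policy` maps the accumulated
    context to either a memory-search query or a final answer prefixed
    with "ANSWER:" (this prefix convention is illustrative)."""
    context = [question]
    for _ in range(max_rounds - 1):
        action = policy(context)
        if action.startswith("ANSWER:"):
            return action[len("ANSWER:"):].strip()
        # search_clip returns the top-2 relevant clips, possibly empty
        context.append((action, search_clip(action)))
    # final round: the last-round prompt forces the agent to answer
    final = policy(context + ["LAST_ROUND"])
    return final[len("ANSWER:"):].strip() if final.startswith("ANSWER:") else final
```

Each round either submits an answer or enriches the context with retrieved memory, mirroring the trajectory shown in the case study of Section 5.5.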

5.2 Dataset and Evaluation

We evaluate M3-Agent and all baselines on both M3-Bench-robot and M3-Bench-web. To demonstrate the generality of our approach, we also test M3-Agent on a long-video understanding benchmark, VideoMME-long [11], following its official evaluation protocol (https://github.com/thanku-all/parse_answer/blob/main/eval_your_results.py).

5.3 Main Results

As shown in Table 5, M3-Agent outperforms all baselines on M3-Bench-robot, M3-Bench-web, and VideoMME-long. Specifically, on M3-Bench-robot, M3-Agent achieves a 6.3% accuracy improvement over the strongest baseline, MA-LMM. On M3-Bench-web and VideoMME-long, it surpasses the strongest baseline, Gemini-GPT4o-Hybrid, by 7.7% and 5.3%, respectively.

We further evaluate M3-Agent against all baselines across different question types in M3-Bench. M3-Agent shows strong performance in human understanding and cross-modal reasoning. Specifically, compared to the best-performing baseline on M3-Bench-robot, MA-LMM, M3-Agent achieves improvements of 4.2% in human understanding and 8.5% in cross-modal reasoning. On M3-Bench-web, M3-Agent outperforms the top baseline, Gemini-GPT4o-Hybrid, with gains of 15.5% and 6.7% in the respective categories. These results demonstrate M3-Agent's superior ability to maintain character consistency, deepen human understanding, and effectively integrate multimodal information.

We also assess the memorization model via precision and comprehension, as reported in Appendix 12.

5.4 Ablation Study

To evaluate the impact of memorization on overall performance, we fixed the control model to control-7b-rl and compared different memorization methods, as shown in Table 6. First, we replaced the memory with that generated by memory-gemini-prompt, resulting in accuracy drops of 2.0%, 2.6%, and 9.1% on M3-Bench-robot, M3-Bench-web, and VideoMME-long, respectively. This suggests that memory-7b-sft produces higher-quality memory than memory-gemini-prompt. Next, we evaluated memory-7b-prompt, which led to accuracy reductions of 5.4%, 9.0%, and 11.0% on the same benchmarks, highlighting the importance of imitation learning in generating effective memory. Finally, we ablated key components in the memory generation process. The results show that removing character identity equivalence or semantic memory significantly degrades QA performance.

Next, we investigate the impact of control on final performance. We fix the memorization model to memory-7b-sft and evaluate various control models, as shown in Table 7. First, we compare two RL algorithms: GRPO and DAPO. Training details for GRPO are provided in Appendix 13. Our results show that control-32b-rl trained with DAPO consistently outperforms control-32b-grpo across all test sets. Second, we analyze how DAPO's performance scales with model size. The results indicate substantial improvements across all sizes. Specifically, after DAPO training, control-32b-rl achieves improvements of 10.0%, 8.0%, and 9.3% in accuracy over control-32b-prompt on M3-Bench-robot, M3-Bench-web, and VideoMME-long, respectively. Finally, we ablate two designs: inter-instruction and reasoning. Both prove critical. Removing inter-instruction results in accuracy drops of 10.5%, 5.8%, and 5.9% on M3-Bench-robot, M3-Bench-web, and VideoMME-long, respectively. Removing reasoning leads to decreases of 11.7%, 8.8%, and 9.5% on the same benchmarks.

5.5 Case Study

Memorization Tables LABEL:table:case_study_memory_web and LABEL:table:case_study_memory_robot (§ 14) present two examples illustrating the episodic and semantic memories generated during memorization. Compared to memory-gemini-prompt, memory-7b-sft demonstrates (1) more detailed episodic memory generation, including richer scene descriptions, character actions and expressions, and dialogue; (2) improved recognition of identity equivalence, enabling consistent long-term tracking of human identities; and (3) richer semantic memory extraction, proactively generating knowledge about characters and environments.

Control To illustrate the control process in detail, Table LABEL:control_trajectory (§ 14) presents a complete generation trajectory of control-32b-rl. The input question is: "Is Tomasz a person with rich imagination or someone who lacks imagination?"

In the first round, the agent searches its memory for Tomasz's character ID. In the second round, having identified Tomasz as <character_4>, it attempts a direct query: "What is <character_4>'s personality regarding imagination?" Finding no relevant memory in the third round, the agent reasons based on <character_4>'s role as CTO of a company and generates a more targeted query: "What are <character_4>'s creative problem-solving methods?" This yields a relevant piece of semantic memory: "<character_4> is innovative and forward-thinking, as evidenced by his interest in scaling drone technology for personal flight." By the fourth round, the agent has collected enough information in its context to generate the final answer.

Hard Case in M3-Bench The accuracy of various methods demonstrates that M3-Bench, particularly M3-Bench-robot, presents a significant challenge. We perform a detailed error analysis of M3-Agent on M3-Bench, identifying two representative hard cases and their associated challenges that demand further investigation.

The first category involves reasoning about fine-grained details. For instance, questions like "Who wants to eat the ham sausage?" or "Which coat rack should Emma's hat be placed on, the taller one or the shorter one?" require the agent to extract precise information from its observations. However, retaining all such details in memory is impractical and may cause cognitive overload. To address this, the agent must use attention mechanisms that enable selective memorization. During execution, it can develop task-specific world knowledge, allowing it to focus on relevant details while ignoring irrelevant ones, thereby improving task performance.

Another category of hard cases relates to spatial reasoning. In M3-Bench-robot, a number of questions challenge the agent's capability for spatial cognition, such as understanding spatial layouts and tracking spatial changes. Examples include: "Where can the robot get the snacks?" and "Is Leo's water cup currently on the second or third shelf from the top of the rack?" Since verbal memory is generally less effective than visual memory for retaining spatial information, long-term memory should be designed to incorporate richer visual content, e.g., snapshots, to better support spatial reasoning.

6 Conclusion and Future Work

In this paper, we introduce M3-Agent, a multimodal agent framework equipped with long-term memory. M3-Agent perceives real-time video and audio streams to build both episodic and semantic memories, enabling it to accumulate world knowledge and maintain consistent, context-rich memory over time. When responding to an instruction, M3-Agent autonomously reasons and retrieves relevant information from memory to complete tasks more effectively. To evaluate memory effectiveness and reasoning, we develop M3-Bench, a long-video question answering benchmark featuring real-world, robot-perspective videos in practical environments and challenging questions revolving around human understanding, knowledge extraction, and cross-modal reasoning, closely reflecting real-world demands. We evaluate our method against various baselines, including Socratic Models, online video understanding methods, and agents implemented by prompting closed-source models. Experimental results on M3-Bench-robot, M3-Bench-web, and VideoMME-long show that M3-Agent consistently outperforms all baselines, demonstrating its superior memorization and reasoning capabilities. Furthermore, through detailed case studies, we identify key limitations that point to promising future directions, including enhancing attention mechanisms for semantic memory formation and developing richer yet more efficient visual memory.

7 Acknowledgment

We would like to thank Xiran Suo, Wanjun Wang, Liu Ding, and Jianghui Xie of ByteDance for their help with data annotation, and Peng Lin for creating the illustration.

References

8 M3-Bench-robot

8.1 Script Annotation Guidelines

Actor Setup

Four to five actors participate, including one playing the robot. The robot actor wears a head-mounted camera (an iPhone 16 Pro, Xiaomi 14 Ultra, or GoPro HERO13) to capture a single point-of-view video from the robot's perspective.

Definitions

  1. Script: Consists of events and questions and provides actors with dialogue and stage instructions.

  2. Robot: Played by a human actor. The robot is assumed to be an idealized, highly intelligent robot with reasoning and memory abilities similar to humans.

  3. Scenario: living room, kitchen, bedroom, study, office, meeting room, and gym.

  4. Event: A complete, short plot within the script. A reference event includes information relevant to future questions, such as the robot interacting with humans while observing and learning human preferences, or the placement of objects in real-world scenes.

  5. Question: Designed to evaluate the robot’s memory. Each question must align with at least one type listed in Table 1.

Requirements

Annotate at least 15 questions, each labeled with the corresponding reference events.

Each script must contain at least 70 events to ensure a minimum video duration of 30 minutes.

Avoid asking questions that rely solely on common sense or that can be answered without watching the video.

Do not ask questions that remain unanswerable even after watching the video.

Avoid questions that can be answered based solely on the dialogue.

Do not include questions that are weakly related to the reference events.

The question should have a clear and unambiguous answer that can be objectively verified by comparing it to the reference answer.

8.2 QA Annotation Guidelines

Background

In the future, robots will help humans complete many tasks in indoor environments such as homes. Based on this premise, we filmed videos from the perspective of a robot.

In order to evaluate the model’s ability, we set questions at different timestamps, typically related to the robot’s upcoming tasks. Correct answers are essential for the successful completion of these tasks.

Some questions require manual review or additional annotations to ensure each video includes at least 10 questions.

Task

Annotators are provided with a 30–45 minute video along with a corresponding script that includes a series of questions.
Note: Minor script modifications may occur during filming to accommodate practical constraints. As a result, the script may not perfectly align with the final video.

  1. Review existing questions.

For each question in the script:

Annotate the corresponding timestamp in the video based on the related script event.

Determine whether the question can be answered using the video content up to that point. If so, annotate the answer.

If the question is unanswerable, consider whether modifying it could make it answerable. If applicable, revise the question and provide the answer.

For each question-answer pair, annotate the reasoning process used to derive the answer and specify the question types according to Table 1.

  2. Annotate additional questions:

If fewer than 10 questions remain after reviewing the script, generate new questions that must belong to at least one type listed in Table 1.

8.3 Quality Control

The annotation process consists of two rounds. In the first round, the goal is to ensure that annotators fully understand the annotation guidelines. Each annotator is required to perform QA annotations on three videos. The authors then review the annotations, provide feedback, and the annotators may revise their annotations accordingly. Based on the quality of these initial annotations, the authors determine whether the annotator is qualified to proceed to the formal annotation phase. In the second round, each annotator annotates five videos at a time. The authors randomly select one video from each batch for quality inspection. If more than one invalid question-answer pair is found in the selected video, the entire batch must be re-annotated. Otherwise, the batch is considered accepted. Two authors are involved in the quality control process throughout the annotation workflow.

In addition, to ensure the quality of the questions in M3-Bench-robot, we recruited five annotators to answer each question. Annotators were allowed to first read the question and then watch the video as many times as needed. The final human accuracy on M3-Bench-robot is 90.7%. Our error analysis shows that the most common mistakes are counting-related problems.

8.4 Annotator Information

All annotators are employed by a commercial data annotation company. We sign a contract with the company and pay for the annotation work at market price. The annotators are all college graduates with strong English proficiency. For script annotation, eleven annotators are involved. Video filming involves 67 actors. For QA annotation, five annotators participate.

8.5 Data Examples

Table LABEL:robot_script_example provides an example of script annotation.

9 M3-Bench-web

9.1 Annotation Guidelines

To better help the annotators understand the requirements and better ensure the overall quality, safety, and validity of the datasets, we provide the following detailed guidelines, which clearly specify the acceptable and unacceptable annotation practices.

Questions must allow for verifiable and objective evaluation of correctness. This entails avoiding overly open-ended questions, compound questions that mix multiple sub-questions, or questions with multiple equally valid answers.

Each video must include at least two questions targeting character attribute modeling and two questions involving commonsense reasoning.

All visual information required to answer a question must remain clearly recognizable at lower resolutions (≤720p), ensuring that all questions are answerable.

For videos between 20 and 40 minutes in length, 5 questions should be generated; for videos exceeding 40 minutes, 10 questions should be provided. Compensation considers both the number and duration of the videos.

For commonsense reasoning questions, annotators must also specify the commonsense knowledge being tested, in addition to the question and its answer.

It is not permissible for all questions to be answerable using only audio. A reasonable proportion of questions must be vision-centric, requiring understanding of visual content in the video.

Redundant questions within the same video are not allowed. For instance, asking "Describe David’s appearance" and "Describe Alice’s appearance" would be considered repetitive.

Questions that can be answered solely based on a brief moment or a short clip should be avoided. Specifically, the context required to answer a valid question should span more than 10 seconds of video content.

Videos must not contain sensitive, offensive, or NSFW content.

Avoid asking questions that rely solely on commonsense knowledge and do not require viewing the video. Such questions do not meaningfully test video understanding.

Avoid questions that are too easy to guess based on social priors or language bias alone. For example, a question like "Did the teacher appear impatient when students repeatedly interrupted the class?" may be too easily answered with "No" due to cultural expectations of teacher behavior, regardless of the actual video content. This undermines the goal of evaluating visual understanding.

Do not directly convert characters’ spoken lines into questions. These are typically answerable via simple string matching or keyword retrieval, which again does not effectively test video comprehension.

Balance the number of questions with answer "Yes" and "No".

9.2 Quality Control

The annotation process includes the following quality control stages:

Stage 1: Candidate annotators complete a trial task, collecting one video and labeling corresponding QA pairs. The authors review the submission and provide feedback. Once the annotator demonstrates a clear understanding of the annotation guidelines, they proceed to formal annotation.

Stage 2: The annotator submits a batch of 10 videos with corresponding QA pairs. The authors randomly review 2 of them and provide feedback. The annotator revises the entire batch accordingly. If the qualified rate of the submitted questions is below 90%, the authors re-sample the revised batch for further inspection. Otherwise, the batch is accepted. Annotators who pass this stage on the first attempt can proceed to Stage 3.

Stage 3: The annotator submits a batch of 30 videos with QA pairs. The authors randomly inspect 5 of them and provide feedback. The annotator revises the full batch as needed. If the QA qualified rate is below 90%, a follow-up review of the revised batch is conducted. Otherwise, the batch is accepted.

Two authors are involved in the quality control process.

9.3 Annotator Information

All annotators are from a commercial data annotation company. We have a contract with this company and compensate them at market rates for the annotation work. All annotators are college graduates with strong English proficiency. A total of ten annotators participated in the annotation of M3-Bench-web.

10 Implementation Details of Tools

Here, we provide the implementation details of the tools for representation extraction introduced in Section 4.2.

Facial Recognition To perform facial recognition, we uniformly sample video frames at a rate of 5 frames per second. For each sampled frame, we employ the buffalo_l predefined model suite from the InsightFace library (https://github.com/deepinsight/insightface) to extract facial attributes, including bounding box coordinates, identity embeddings, and detection/quality scores. Low-quality detections, such as those with abnormal aspect ratios or extremely low confidence scores, are discarded. We then apply HDBSCAN clustering on the embeddings of the remaining high-quality faces to group them by character identity. This yields a set of reliable facial representations, clustered by character.
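The identity-grouping step can be illustrated with a simplified stand-in: instead of HDBSCAN, the sketch below greedily merges normalized embeddings into identity clusters by cosine similarity against each cluster centroid. The threshold and function names are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def cluster_faces(embeddings, sim_threshold=0.6):
    """Greedily group face embeddings by identity: assign each embedding to
    the first cluster whose centroid is cosine-similar enough, otherwise
    open a new cluster. (Simplified stand-in for HDBSCAN.)"""
    clusters, labels = [], []
    for emb in embeddings:
        emb = emb / np.linalg.norm(emb)
        for idx, members in enumerate(clusters):
            centroid = np.mean(members, axis=0)
            centroid = centroid / np.linalg.norm(centroid)
            if float(emb @ centroid) >= sim_threshold:
                members.append(emb)
                labels.append(idx)
                break
        else:
            clusters.append([emb])
            labels.append(len(clusters) - 1)
    return labels
```

Unlike this greedy sketch, HDBSCAN is density-based and needs no fixed similarity threshold, which is why the paper uses it for faces of varying quality.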

Voice Identification For speaker identification, we use Gemini-1.5-Pro to extract audio segments corresponding to distinct speaker voices, while simultaneously performing automatic speech recognition (ASR) on each segment. Segments shorter than 2 seconds are filtered out to ensure reliability. We then apply the voice embedding model ERes2NetV2 [3] to encode each segment into a speaker-specific representation. Based on the resulting voice embeddings, we cluster and merge segments that correspond to the same speaker, i.e., those with similar vocal characteristics. This process produces a set of high-quality speaker representations, also grouped by character. The prompt used for voice processing is shown in Table LABEL:tab:prompt_voice_identification.

Search
All memory-based retrieval is implemented via Maximum Inner Product Search (MIPS), with modality-specific adaptations.

Each face and voice node maintains a set of representative feature snapshots. When new face or voice features are extracted from a video clip, we compute the average cosine similarity between each extracted feature and all stored snapshots per node. The node with the highest similarity exceeding a pre-defined threshold (0.3 for image, 0.6 for voice) is considered a match; otherwise, a new node is created. Matched nodes are updated with the new features to refine their representations over time.
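A minimal sketch of this match-or-create rule (the data layout, ID scheme, and names are our assumptions for illustration):

```python
import numpy as np

def match_or_create(feature, nodes, threshold):
    """Score each identity node by the average cosine similarity between the
    new feature and the node's stored snapshots; reuse the best node whose
    score beats the modality threshold (0.3 for faces, 0.6 for voices),
    otherwise create a new node."""
    feature = feature / np.linalg.norm(feature)
    best_id, best_score = None, threshold
    for node_id, snapshots in nodes.items():
        sims = [float(feature @ (s / np.linalg.norm(s))) for s in snapshots]
        score = sum(sims) / len(sims)
        if score > best_score:
            best_id, best_score = node_id, score
    if best_id is None:
        best_id = f"node_{len(nodes)}"  # hypothetical ID scheme
        nodes[best_id] = []
    nodes[best_id].append(feature)      # refine the node with the new snapshot
    return best_id
```

Appending the matched feature to the node is what lets each identity's representation improve as more clips are processed.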

For textual memory, we apply MIPS between the input query and all existing text nodes, using OpenAI's text-embedding-3-large (https://openai.com/index/new-embedding-models-and-api-updates/) as the embedding model. To support multi-entry retrieval, we apply top-k retrieval with a similarity threshold t. Specifically, we return the k most relevant nodes whose similarities exceed t.
To ensure retrieval coherence, we also perform clip-level retrieval: each clip is scored by the highest similarity among its memory entries, and we return the top-ranked clips accordingly. For all experiments, we adopt a relatively strict hyperparameter setting (k=2, t=0.5) to reduce retrieval randomness and enable consistent evaluation across models.
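The clip-level scoring rule can be sketched as follows (names are ours; `query_sims` stands for precomputed query-to-entry similarities):

```python
def retrieve_clips(query_sims, k=2, t=0.5):
    """Clip-level retrieval: score each clip by the highest similarity among
    its memory entries, then return the top-k clip IDs above threshold t,
    or an empty list if no clip qualifies."""
    scores = {cid: max(sims) for cid, sims in query_sims.items() if sims}
    ranked = sorted(((s, cid) for cid, s in scores.items() if s > t), reverse=True)
    return [cid for _, cid in ranked[:k]]
```

Scoring a clip by its best entry, rather than an average, keeps a clip retrievable even when only one of its memory entries is relevant to the query.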

11 Demonstration Data Synthesis for Memorization

During memorization, the multimodal model takes inputs including video, audio, facial identities (via facial recognition), and voice identities (via voice identification). It generates two outputs: episodic memory and semantic memory. To construct training data, we segment training videos into 30-second clips. For each clip, we then synthesize the corresponding episodic memory, entity identity relationships in semantic memory, and other semantic memory, as detailed below. In total, we synthesize 10,752 training samples and 200 validation samples.

11.1 Episodic Memory Synthesis

We employ a hybrid synthetic strategy that integrates the complementary strengths of Gemini-1.5-Pro and GPT-4o. Gemini-1.5-Pro supports audio inputs and excels at generating high-level, event-based descriptions, whereas GPT-4o provides more fine-grained visual details. To leverage both models effectively, we first prompt GPT-4o to generate a detailed visual description of the video using frames sampled at 0.5 fps. This output serves as contextual input for Gemini-1.5-Pro, which is then prompted to generate the final episodic memory. The prompt explicitly instructs Gemini-1.5-Pro to incorporate information from GPT-4o’s description when it deems it accurate. We find that using GPT-4o’s detailed visual output as context significantly enhances the richness of the final memory produced by Gemini-1.5-Pro. The full prompt template is shown in Table LABEL:table:prompt_episodic_memory_synthesis.

11.2 Entity ID Relationship Detection

A special type of semantic memory captures cross-modal identity equivalence extracted from video. This remains a challenging task, even for advanced models like Gemini-1.5-Pro, particularly in scenes with multiple faces and voices [15]. To address this, we propose a progressive annotation algorithm. The key idea is to identify meta-clips, i.e., segments containing exactly one face identity and one voice identity, from the raw long video. These meta-clips are used to build a meta-dictionary that maps voice IDs to face IDs across the entire video. This dictionary enables automatic annotation of any 30-second clip extracted from the original video.

Meta-Clip Extraction First, for a long video, we use the facial recognition and voice identification tools introduced in Appendix 10 to construct a global ID for each face and voice that appears in the video. Next, we segment the video into a series of short clips, each no longer than 5 seconds, using keyframe-based division. This ensures that each clip is visually stable, with minimal changes in characters or scenes. Then, we apply the facial recognition and voice identification tools to each short clip individually to extract the faces and voices present, along with their global IDs. If a clip contains only one face ID and one voice ID, we refer to it as a meta-clip. In this case, it is highly likely that the face and voice in the clip belong to the same person, so we can use the meta-clip as a high-confidence sample for establishing face-voice associations.
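The meta-clip filter reduces to checking that a clip contains exactly one face ID and one voice ID; a minimal sketch, with clip records modeled as dicts for illustration:

```python
def find_meta_clips(clips):
    """Return high-confidence (face_id, voice_id) pairs from clips that
    contain exactly one face identity and one voice identity."""
    pairs = []
    for clip in clips:
        faces, voices = set(clip["faces"]), set(clip["voices"])
        if len(faces) == 1 and len(voices) == 1:
            pairs.append((faces.pop(), voices.pop()))
    return pairs
```

Clips with multiple faces or voices are simply skipped, since the face-voice binding in them is ambiguous.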

Meta-Dictionary Construction Based on all meta-clips extracted from the long video, we construct a set of mappings between face IDs and voice IDs. However, inconsistencies may arise due to a small number of clips in which the speaker is not visible. To address this issue, we employ a voting mechanism to generate the final meta-dictionary. The detailed algorithm is described in Algorithm 2.
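The voting step can be sketched as follows (a simplified illustration of the idea behind Algorithm 2; names are ours):

```python
from collections import Counter

def build_meta_dictionary(pairs):
    """Majority vote per voice ID: keep the face ID it co-occurred with most
    often across meta-clips, absorbing occasional off-screen-speaker noise."""
    votes = {}
    for face_id, voice_id in pairs:
        votes.setdefault(voice_id, Counter())[face_id] += 1
    return {voice_id: c.most_common(1)[0][0] for voice_id, c in votes.items()}
```

A voice heard twice alongside face f1 and once (spuriously) alongside face f2 is thus mapped to f1.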

New-Clip Annotation After obtaining the meta-dictionary, we use it to annotate arbitrary clips from the full-length video. Specifically, for each 30-second clip, if a face ID and a voice ID both appear in the clip and are also found in the meta-dictionary, we generate a semantic memory in the form: "Equivalence: <face_id>, <voice_id>". Since not all IDs can be resolved through the meta-dictionary, we exclude from the final memorization training dataset any clip containing a voice ID not present in the meta-dictionary. In total, we collected 10,952 30-second clips with valid identity equivalence annotations. We manually reviewed 48 randomly sampled mappings and found the accuracy to be 95.83%.

11.3 Semantic Memory Synthesis

To construct semantic memory, we adopt a hybrid strategy similar to that used for episodic memory. We define several key dimensions that semantic memory should address, as outlined in Table 10. Specifically, we first prompt GPT-4o to generate preliminary semantic memory based on video frames and episodic memory. Next, we provide the video, episodic memory, and GPT-4o-generated semantic memory to Gemini-1.5-Pro, prompting it to produce the final semantic memory. Detailed prompts are provided in Table LABEL:table:prompt_semantic_memory_synthesis.

11.4 Quality of the Synthetic Data

Although the demonstration data is synthetic, it is of high quality. Our synthetic memory averages 245.7 words for episodic memory and 276.2 words for semantic memory, compared to 151.3 and 81.4 words respectively for Gemini-1.5-pro, indicating our memory captures more detail. For content accuracy, we randomly sampled 10 clips from different videos, totaling 353 memory items. Manual review showed an accuracy of 95.5%. Most errors stemmed from the speaker recognition tool: background noise and overlapping speech occasionally caused minor omissions or misidentifications in extracting speaker dialogue for episodic memory.

12 Evaluation of Memorization

We evaluate the memorization model during training using a held-out validation set of 200 samples and select the best checkpoint. Two evaluation metrics are used. First, AutoDQ [47] assesses memory description quality by comparing generated outputs to reference descriptions, covering episodic and semantic memory but excluding identity equivalence. Second, for identity equivalence, we compute precision, recall, and F1 score against the ground truth in the validation set. Based on the results in Table 13, we select the checkpoint obtained after training for 3 epochs. For additional comparison, we also report results from two baseline models, memory-gemini-prompt and memory-7b-prompt, on the same validation set. Our model, memory-7b-sft, significantly outperforms both baselines.

13 RL Training Details

13.1 Details of DAPO Training

Table 14 lists the hyperparameters used during the training process.
Figure 4 depicts the RL training curves, which show a steady increase in score with the training steps.


13.2 GRPO Training

We also use Group Relative Policy Optimization (GRPO) [39] to optimize the policy model in the ablation study. GRPO optimizes the policy model $\pi_{\theta}$ by maximizing the following objective:

$$\mathcal{J}_{\mathrm{GRPO}}(\theta)=\mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|\tau_i|}\sum_{t}\min\Big(r_{i,t}(\theta)A_i,\ \operatorname{clip}\big(r_{i,t}(\theta),\,1-\epsilon,\,1+\epsilon\big)A_i\Big)-\beta\,D_{\mathrm{KL}}\big(\pi_{\theta}\,\|\,\pi_{\mathrm{ref}}\big)\right]$$

where $r_{i,t}(\theta)$ is the token-level importance ratio as in DAPO, $\epsilon$ and $\beta$ are set to 0.2 and 0.01 respectively, and the other hyperparameters are the same as those in DAPO training.

14 Case Study

Table LABEL:table:case_study_memory_web and Table LABEL:table:case_study_memory_robot present two examples illustrating the episodic and semantic memories generated during memorization.

Table LABEL:control_trajectory presents a complete generation trajectory during the control process.


15 Prompt Templates

15.1 Prompt for Automatic Evaluator of M3-Bench

Table LABEL:prompt_gpt4o_evaluation presents the prompt used by GPT-4o to assess M3-Bench.

15.2 Prompts for Socratic Models

Table LABEL:baseline_socratic_prompt presents the prompt used in Socratic Models baselines. Through prompt engineering, we find that placing the question after the long context (e.g., video detailed descriptions) enhances the model’s ability to retain the question and focus on relevant information, leading to improved answer accuracy. Accordingly, in our Socratic Models experiments, we adopt this approach by appending the question to the end of the retrieved clip descriptions during the RAG-based QA stage.

15.3 Prompts for M3-Agent

Table LABEL:prompt_memory_generation_baseline shows the prompt used by Gemini-Agent and Gemini-GPT4o-Hybrid during memorization. Table LABEL:prompt_generate_action shows the prompt used by Gemini-Agent and Gemini-GPT4o-Hybrid during control.

Table LABEL:search_agent_prompt shows the prompt used by M3-Agent during the control process. The system prompt at the beginning of each session specifies the overall task objectives. The instruction prompt appended at the start of each round provides the question and detailed guidance. The last-round prompt, used only in the final round, signals the agent that it is the final opportunity to respond.