
Prosaic Continual Learning

Lesswrong

Or: When Memories Get Good -- The Default Path Without Theoretical Breakthroughs

Epistemic status: Fairly confident in the core thesis (context + memory can substitute for weight updates for most practical purposes). The RL training loop is a sketch, not a tested proposal. I haven't done a thorough literature review.

Suppose there are no major breakthroughs in continual learning -- that is, suppose we continue to struggle at using information gathered at runtime to update the weights of a given instance of an AI model. If you try to update the weights at runtime today, you usually end up with catastrophic forgetting, or you find you can only make very small updates with the tiny amount of useful data you have [1].

So, if you can't train a day's worth of information into the model, how could you end up with something that functions as if it were learning on the job?

Long Context Lengths, High Quality Summaries, and Detailed Documentation [2] [3]

It's a straightforward idea, and basically done today, just not particularly well yet. Laying it out:

- The model does some task. In doing so, it gathers a tonne of information -- say, a dozen novels' worth. It can fit all of this in its context at once.
- The model finishes its task. In concluding, it writes:
  - Short notes that it should always remember (we'll call these memories), for example "This company prefers me to communicate in German" and "Documentation is available in this folder path", and
  - Detailed documentation about everything it did and everything it learnt [4]. This can be quite verbose.
- The memories are kept in the context window. The documentation is available on disk and can be accessed on demand.

That's it.

Why Doesn't This Work Now?

Firstly -- it kind of does. In my own software projects I maintain a concise Claude.md file (which gets passed to each new agent on spawn), as well as extensive documentation which the Claude.md points to (and which the Claudes can search at will).
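The loop laid out above can be sketched as a minimal harness. Everything here is illustrative -- the file names, prompt format, and function names are invented for the sketch, not taken from any real agent framework:

```python
from pathlib import Path

# Short "memories" ride along in every prompt; verbose documentation
# lives on disk and is only read on demand. Names are illustrative.
MEMORY_FILE = Path("memories.md")   # always injected into context
DOCS_DIR = Path("docs")             # searched/read only when needed

def build_prompt(task: str) -> str:
    """Prepend the always-remembered notes to a fresh instance's prompt."""
    memories = MEMORY_FILE.read_text() if MEMORY_FILE.exists() else ""
    return f"Memories:\n{memories}\n\nTask:\n{task}"

def finish_task(new_memories: list[str], doc_name: str, doc_body: str) -> None:
    """At the end of a task: append the short notes, write the verbose docs."""
    with MEMORY_FILE.open("a") as f:
        for note in new_memories:
            f.write(f"- {note}\n")
    DOCS_DIR.mkdir(exist_ok=True)
    (DOCS_DIR / doc_name).write_text(doc_body)
```

The asymmetry is the whole point: the memory file is paid for on every single call, while documentation costs nothing until the model chooses to read it.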
Claude and ChatGPT already produce and store 'memories' in this way through their existing harnesses. These work okay, and we know that models can effectively learn in context. But it doesn't work that well yet. I suspect this is because current models just aren't very good at writing or at using these notes.

It's actually a very hard task. We're basically having the model ask itself "What do I know that a fresh instance doesn't, that would be useful for it to remember across all future instantiations?" and then asking it to write this down using as few tokens as it possibly can. For a model to do a good job, it needs to understand whether the things it knows are coming from its current context or its weights, and accurately guess how a future instance will respond to the memories. Basically, it needs to have a good theory of mind.

I think the difficulty of this task is the main reason memory especially sucked when it first came out. There are plenty of examples of irrelevant memories being created inside ChatGPT, for example. It also took some time to train models which understood what the memories were and how to use them. Previously, models would attend too strongly to memories in irrelevant contexts, bringing up notes where they don't belong. Kimi K2.5 still struggles with this, in my experience, seeing notes at the start of its context window as very important and relevant, even in situations where they shouldn't be. Claude ignores the apple note. Kimi always finds a way to bring it up.

But memory is getting much better, and newer models use it more successfully. I expect that as models get more intelligent their use of memory and documentation will continue to improve, especially in the world where this is trained for explicitly.
Models are also getting better at handling the retrieval of dense information across their long context windows, so a mundane prediction that these trends continue should point us towards prosaic "continual learning" becoming quite useful over 2026 and 2027.

It should also be noted that memories like this are functionally the same as compaction (summaries written by the AI when reaching the end of the context window, so it can continue working). In both cases the model is writing compressed information to pass to a future instance to (hopefully) perform better. This is already an optimisation target for frontier labs.

How We Could Make It Work Better

We can easily train models to create and use memory as an RL task. To sketch out a simple method -- suppose that when finishing a task, instead of scoring the model's performance immediately, we have the model write memories and documentation, and then we run a new instance on the same, similar, and dissimilar tasks [5] with those memories and documentation, and have a reward function which scores on the combined performance (with some small penalty for the length of the memories). This looks like:

The reward function used for the actual parameter updates would be a function of the scores across each of the models, plus some penalty relative to the length of the memories and the total context length of the model.
Reward = f(score_inst_1, score_inst_2, ..., score_inst_n) - c * length(memories) / max_context_length

There are several other ways to do something like this, of course, and some would be much more efficient than what I have laid out here. I'm mainly trying to get across a few key ideas:

- You can train for memory and documentation quality automatically and without major changes to the current post-training regime.
- You can also train the model to make iterative improvements to the memories and documentation (editing and removing unnecessary or wrong sections) by scoring the performance across many sequential runs.
- You should score performance on both similar and dissimilar tasks when passing through the memories and documentation, in order to teach the model when to actually use the information passed through [6].
- You should penalise memory usage (and maybe also documentation [7]) by length, otherwise the memories will get too long to fit in the context window at some point, and you don't want a discontinuity in performance when that happens.
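A concrete, entirely illustrative sketch of such a reward function -- the equal-weight averaging, the penalty weight, and all names are my assumptions, not a tested design:

```python
import statistics

def memory_reward(
    same_scores: list[float],
    similar_scores: list[float],
    dissimilar_scores: list[float],
    memory_tokens: int,
    max_context_tokens: int,
    length_weight: float = 0.1,   # illustrative penalty weight
) -> float:
    """Combine the follow-up instances' task scores into one reward,
    minus a penalty proportional to how much of the context window
    the written memories consume."""
    all_scores = same_scores + similar_scores + dissimilar_scores
    performance = statistics.mean(all_scores)
    length_penalty = length_weight * (memory_tokens / max_context_tokens)
    return performance - length_penalty
```

Because dissimilar-task scores sit in the same average, memories that hurt unrelated tasks drag the reward down, which is exactly the pressure that teaches the model when not to use its notes.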
Overall I would expect this to reward both the model's ability to write AND to understand its memories and documentation, with some risk of pushing the model towards very dense, difficult-to-read memories (à la linguistic drift). I haven't spun up an experiment to test this empirically, but may do at some point. If anybody else would like to, or has done so already, please let me know!

Could This Replace Real Continual Learning? What About Intelligence Gains From Having The Information In The Weights?

There are two things going on here that we need to untangle. The first is about the model having the correct information to achieve its goals. This is what gets put into the memories and the documentation, and what is addressed by prosaic continual learning. The second is how to increase the intelligence of the model: how it can do more with less information, figure out new things that it wasn't told, or get better at acting in the world in a general sense.

With prosaic continual learning, the real intelligence gains only happen in the next generation of AI models. Suppose Claude 5 is launched with a 1m context window, and it is smart enough to write good [8] documentation and memories. If a task uses about 500k of context, and produces about 1000 tokens of new memories, then doing ten tasks a day, every day, you can run the model for 50 days before you hit the ceiling on how many memories you can store [9] [10].

Then, 50 or so days later, Claude 5.1 is launched, with improved capability by the usual process. Claude 5.1 inherits the existing memories and documentation and immediately works on improving and compressing them [11]. Combined with a longer context window, the new Claude 5.1 might buy another 50 days of memory [12]. Repeat ad nauseam, or at least until Claude N solves true continual learning with parameter updates at runtime.
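The back-of-envelope arithmetic above (all numbers are the made-up illustrations from the text) works out as:

```python
# Runway for prosaic memory, using the post's made-up numbers.
context_window = 1_000_000   # assumed Claude 5 context window, in tokens
task_buffer = 500_000        # context reserved for doing the task itself
memory_per_task = 1_000      # new memory tokens written per task
tasks_per_day = 10

memory_budget = context_window - task_buffer               # 500,000 tokens
days = memory_budget // (memory_per_task * tasks_per_day)  # 10,000 tokens/day
print(days)  # 50 days before the memory budget is exhausted
```

Halving the per-task memory, or doubling the context window, doubles the runway -- which is why the cycle only has to outpace the gap between model releases.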
In this way, the lessons from a particular deployment (say, by a model that has been answering phones for a particular company) are trivially passed from one generation to the next while capabilities continue to improve via regular training. In practice, is there anything more we need true continual learning to do? [13]

Can We Have A Human Brain Analogy, Please?

One of the reasons continual learning is so popular a concept is that humans do it, which makes it a very attractive answer to the question "What can't AIs do yet?". The human learning process looks something like the above chart, where we have an explicit, discrete, and extremely small working memory, which holds somewhere on the order of 10 objects at a time. This probably exists as activations in the prefrontal cortex. It's analogous to an LLM's context window, being lossless and explicit, but is far, far smaller.

Then, humans have a kind of buffer, where information is stored on the order of hours to weeks in a lossy but easily accessible way. This seems to be held in the hippocampus. You can draw a weak parallel to AI reading documentation here, it being some partially processed summary of what has happened, accessible with a few seconds of thought. Humans can read documentation too, of course, but the read speed is extremely slow in comparison. AI is able to read documentation at a speed that is more comparable to a human recalling a specific memory.

Next, humans have long term memory, which is slowly updated on the order of days, probably by reading and updating against the hippocampus' "buffer" [14]. This is the analogue of the missing piece for LLMs' continual learning, if we knew how to properly update an instance's parameters at runtime.

Finally, even humans don't become more intelligent after reaching full adulthood [15]. We rely on evolutionary selection to make any significant changes to human intelligence.
The analogy here is to the next generation of AI models being trained, although that happens far, far faster.

Laying it out like this, you can see the 'long term memory' update step is missing, but the 'context window + documentation' is so ridiculously larger in storage capacity than human working and short-term memory, and the 'intelligence gain' step so much shorter, that skipping a weight update at runtime might be viable. Humans require memory-related parameter updates because we can't store much information in working or short term memory, but if our working memory were so large it didn't fill up within our lifetimes, you can see how the situation changes.

Conclusions

Having now thought through this, I have updated away from continual learning being a real issue for AI capabilities in the near future [16]. It doesn't seem like it is needed for general purpose capability improvements, where the regime of releasing a new model every few months works fine. It doesn't seem like it's needed for company specific work, where you can store all of the needed information in documentation and in context.

I think the fact that it has to be written and used explicitly by the models is a satisfying answer to why it hasn't worked well so far -- the models simply haven't been smart enough to do a good job at it. I'm also bullish on progress on this problem being fast, given that this performance is something that can be straightforwardly optimised with unsupervised RL, including training models to handle and edit stale memories.

Overall... damn, I guess we're making continual learners now.

[1] People think about the goal of continual learning as being 'the model can learn on the job', so, practically speaking, the main use case is for specific, non-generalisable data unique to this deployment of the model.
When I say you don't have enough data to do this usefully, I mean that one day's (or one month's) recording of work is a tiny amount of data to try to fine-tune a model on. You can't reliably learn new things this way, though you might be able to elicit existing knowledge in the model.

[2] This is not a new idea. Dario spoke about it on Dwarkesh, and a quick Claude search reveals several different papers talking about the concept, most of which I haven't read in detail. I am writing this post because I haven't seen it clearly, publicly combined in one place before, and maybe there's some interesting exploration of the RL training loop and why explicit memory has been a hard thing for models to get right.

[3] We also have versions of all of this today, which is why it's "prosaic" continual learning.

[4] You could also include things like tools the model has built for itself, information it's found online and wants to make a note about, and really anything that is created or curated for the model's use without the entire thing being stored in the active context window.

[5] Same task means literally the exact same task [17]. Similar task means tasks pulled from the same narrow distribution -- for example, the set of things a particular employee might do in their work for a single company. We want to encourage memories that are useful across this somewhat narrow domain. Dissimilar task means tasks pulled from more radically different distributions: coding, psychological support, creative fiction, etc. I think we need to include some probability of dissimilar tasks in the batch in order to train the model to not rely too strongly on memory. At deployment time, the model may indeed be given memories that are irrelevant for the task at hand.
If I had to take a random stab at the proportion of each type of task assigned for a given batch, I would weight the distribution so that the N+1th task is about 89% likely to be from the similar distribution, 10% likely to be from the dissimilar distribution, and 1% likely to be the exact same task repeated.

[6] The model should write memories and documentation on both successful and unsuccessful attempts at the problem -- it likely has useful information about what to try or not to try either way. I'm also imagining that there is some penalty for overall token usage when training for inference efficiency reasons -- that would incentivise the passing of useful tips and lessons via memories and documentation, if it can make the later instances more efficient. It is even fine to pass the entire solution via memory, so long as the model has learnt when it doesn't apply, and has been suitably penalised for the memory length. I think we can get this result by tuning the proportion of same, similar, and dissimilar tasks being scored together -- that is, if we run similar tasks n times, and dissimilar tasks m times (and possibly the same task p times), with the memory and documentation passed through for a given reward calculation, we can select n, m, and p such that generally useful tips are favoured over long and specific instructions.

[7] I'm unsure whether documentation should be length penalised or not. You get this to some degree by measuring the performance of the model using the documentation. I'd lean towards probably not, on the principle of allowing the training to choose whether short or long documentation is better. I'm assuming we use a tool which allows the model to choose to read some reasonable number of tokens at a time, rather than risk breaking things by dumping entire files in, or only clipping them when they become very long.
[8] In the memory case, 'good' means that the model can figure out what would be useful to know in all future runs, and can recover from bad or missing memories by editing them later. In the documentation case, it means it can include all the relevant information accurately, avoid including slop, and then use the information to be much more effective than it would be without it.

[9] I made up numbers here just to show how much room there is. In this case, I get a 1 million token context window, minus a 500k task buffer, leaving 500k tokens for memories. At 10,000 tokens per day, we get 50 days of memory buffer. This is also kind of a worst-case scenario. A thousand tokens of memories for each task is very high, since most memories could simply be pointers to where the real detail lives, and you would quickly run out of new things to write. Do you memorise ~6000 words' worth of new information every day, and keep it memorised for the rest of your life? If you can compress your new memories to only 1000 new tokens per day instead of 10,000, you get over a year of runtime. Alternatively, increasing context length from the current 1m tokens also provides wiggle room.

[10] Different tasks will have very different profiles here. For example, coding might require only very short memories, whereas piloting a robot through a factory might require memories that include a map and descriptions of every mistake the model had made on previous trips.

[11] We can expect a new version to be better at the difficult task of creating and using the memories & documentation, especially if it's trained explicitly for this. Some possibilities here, which point towards shorter and fewer memories:

- Is the model able to figure out what a memory is trying to say more easily, letting it compress existing memories into fewer tokens?
- Is the model better at writing the information more densely?
- Does the model know more information intrinsically, allowing it to remove that information from its explicit context?
- Is the model better at knowing what it knows, allowing it to cut unnecessary memories?
- How does the new version change the tradeoff between memory and documentation caching? E.g. is it faster at reading documentation? Is it better at knowing when and which documents to read given its particular goals?

[12] I am pretty confident that memory usage should be able to grow slowly enough that a Claude working for a particular company can fit everything it needs into context and explicit documentation. For this not to be the case, you have to assume that extremely large amounts of information are needed (multiple books' worth), that you discover new information that must be held in context (rather than in documentation you can look up) at a rate faster than the context window grows, and that future models won't be able to significantly compress existing memories or move existing memories into documentation by virtue of being better at knowing when to look things up.

[13] In the limit, this process is functionally identical to continual learning, as far as I can tell. Just imagine the 50 days between model releases shrinking to some short period, like a day or an hour, and the written memories that are passed forward becoming denser and denser -- an abstract initialisation pattern that is loaded in for a deployment (like a static image). Putting the same scenario the reverse way, imagine a model with traditional, weight-updating continual learning. Rather than updating its weights directly, it (like humans) uses a short term memory buffer to store new information and isolate private information from the weights. Every hour, the relevant lessons from the previous hour's work are trained into a copy of the model, which is then seamlessly switched out, and the buffer updated.
[14] I don't know if you've ever noticed your long term memory updating; I feel like I have. Have you ever had a major event happen, and then only some days later cemented a behaviour change, even though you knew the change was necessary from the moment of the event?

[15] They continue to learn more, which makes their crystallised intelligence (knowledge and skills) go up, but their fluid intelligence (ability to reason abstractly, solve new problems, etc.) declines after early adulthood.

[16] I'm even coming around on continual learning being worse for most mundane uses -- suppose you have your own version of a model, with the weights updated to store information specific to you and your use case. What happens when a new model is released? You have to retrain? What happens to the optimisations from batching?

[17] I actually think it's debatable whether you should include the literal same task as an option for the nth instance (with the memories and documentation prepared by the (n-1)th instance) to be assigned. If you do this, the model could just include the whole solution in its memories, but honestly, for some production usage and types of task, that could be a reasonable and viable strategy. I think in general we should try to train on the same distribution as the deployment, so whether to include the literal same task (vs just similar tasks) as a possible option here depends on whether you think that's a situation that is likely to occur in practice (maybe setting up the same programming environment many times?), and whether you get anything from doing this (quickly using the cached procedure?).

newsence

平凡的持續學習

Lesswrong
3 天前

AI 生成摘要

文章認為長上下文長度與高品質的 AI 生成文檔可以替代權重更新,在不需要理論突破的情況下實現實用的持續學習。我提出了一個強化學習框架,用以訓練模型在不同任務中有效地編寫與利用記憶及文檔。

Claude and ChatGPT already produce and store "memories" in this way via their existing scaffolding. These features work okay, and we know models can in-context learn effectively.
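The loop laid out above (do the task, write memories plus documentation, hand both to the next fresh instance) can be sketched as a minimal harness. The file names, formats, and prompt layout here are all my own assumptions, not anyone's actual scaffolding:

```python
import json
from pathlib import Path

# Assumed layout: short always-in-context notes vs verbose on-disk docs.
MEMORY_FILE = Path("memories.json")   # memories: kept in every prompt
DOCS_DIR = Path("docs")               # documentation: read on demand

def finish_task(new_memories: list[str], doc_name: str, doc_text: str) -> None:
    """Called as an instance concludes: persist its memories and documentation."""
    existing = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else []
    MEMORY_FILE.write_text(json.dumps(existing + new_memories, indent=2))
    DOCS_DIR.mkdir(exist_ok=True)
    (DOCS_DIR / f"{doc_name}.md").write_text(doc_text)

def build_prompt(task: str) -> str:
    """A fresh instance starts with all memories in context; docs stay on disk."""
    memories = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else []
    header = "\n".join(f"- {m}" for m in memories)
    return f"Memories:\n{header}\n\nTask: {task}\n(Search ./docs for details.)"
```

Everything real sits in the quality of what `finish_task` is given, which is exactly the hard part discussed below.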

But it doesn't work very well yet. I suspect that's because current models aren't good at writing or using these notes.

This is actually a seriously difficult task. We're essentially having the model ask itself "what information do I know that a fresh instance doesn't, that will be useful to all future instances?", then asking it to write that down in as few tokens as possible.

To do this well, the model needs to understand whether a given piece of information it knows comes from its current context or from its weights, and to guess accurately how future instances will react to the memories. Basically, it needs good theory of mind.

I think the difficulty of this task is the main reason memory features were especially bad when they first launched. Bad examples generated inside ChatGPT, for instance, were everywhere.

Training models that understand what memories are and how to use them also takes time. Models used to over-attend to their memories, bringing up the contents of their notes in places they didn't belong. In my experience, Kimi K2.5 still struggles with this, treating the notes at the start of its context window as highly important and relevant even when they don't apply.

Claude ignores the note about apples. Kimi always finds a way to bring it up.

But memory features are getting better, and newer models use them more successfully. I expect models' use of memories and documentation to keep improving as they get smarter, especially where they're explicitly trained for it. Models are also getting better at dense information retrieval over long context windows, so I predict that these trends continuing will make prosaic "continual learning" quite useful through 2026 and 2027.

It's also worth noting that memories of this kind are functionally the same as compaction (the summary an AI writes as it reaches the end of its context window so that it can keep working). In both cases, the model is writing compressed information to hand to a future instance in the hope that it performs better. This is already an optimisation target at frontier labs.

How We Make This Work Better

We could easily train creating and using memories as an RL task. To sketch a simple approach -- suppose that on task completion, rather than scoring the model's performance immediately, we have the model write its memories and documentation, then run an instance on the same, similar, and dissimilar tasks [5] with those memories and documentation attached, and have the reward function score the combined performance (with a small penalty for memory length). The flow looks like this:

The reward function used for the actual parameter updates would be a function of the individual models' scores, plus penalties relative to memory length and the model's total context length.
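As a toy illustration of that reward (the penalty coefficient and the equal weighting across task types are my own placeholder choices, not anything from the post):

```python
def reward(task_scores: list[float], memory_tokens: int,
           context_tokens: int, penalty_coeff: float = 0.1) -> float:
    """Combined reward for the memory-writing instance.

    task_scores: scores of the downstream instances run on same/similar/
    dissimilar tasks with the memories and documentation attached.
    The penalty scales with memory length relative to total context length.
    """
    mean_score = sum(task_scores) / len(task_scores)
    length_penalty = penalty_coeff * memory_tokens / context_tokens
    return mean_score - length_penalty
```

The per-type weighting (how many same/similar/dissimilar runs feed into `task_scores`) is the main knob, as discussed in the footnotes.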

Of course, there are several other ways to achieve something similar, some more efficient than what I've laid out here. I mainly want to convey a few core ideas:

  • You can automatically train the quality of memories and documentation without major changes to the current post-training regime.

  • You can also train models to iteratively improve their memories and documentation (editing and deleting unnecessary or incorrect parts) by scoring performance over multiple consecutive runs.

  • When passing on memories and documentation, you should score performance on both similar and dissimilar tasks, to teach the model when it should actually use the passed-on information [6].

  • You should penalise memory use (and possibly documentation [7]) by length, otherwise the memories will eventually grow too long to fit in the context window, and you don't want a performance cliff when that happens.

Overall, I expect this to simultaneously reward the model's ability to write and to understand its memories and documentation, though there's a risk of pushing models towards extremely dense, hard-to-read styles of memory.

I haven't started experiments to test this empirically, but I might in future. If anyone's interested, or has already done it, let me know!

Can This Replace True Continual Learning? Do You Lose the Intelligence Gains if the Information Isn't in the Weights?

There are two things to untangle here. The first is the model having the right information to achieve its goals. That's what goes into the memories and documentation, and it's the problem prosaic continual learning solves.

The second is that we care about increasing the model's intelligence: how it can do more with less information, or discover new things it wasn't told, or get better at acting in the world in a general sense.

In prosaic continual learning, true intelligence gains only arrive with the next generation of AI models.

Suppose Claude 5 releases with a 1-million-token context window, and is smart enough to write high-quality [8] documentation and memories. If a task uses ~500k of context and generates ~1,000 tokens of new memories, then at ten tasks per day you could run the model for 50 days before hitting the cap on stored memories [9] [10].
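A quick back-of-the-envelope check of those numbers (all of them assumed, as stated):

```python
# All figures are the post's illustrative assumptions, not measurements.
context_window = 1_000_000   # Claude 5's assumed context window, in tokens
task_buffer = 500_000        # tokens a single task is assumed to occupy
memory_per_task = 1_000      # new memory tokens written per task
tasks_per_day = 10

memory_budget = context_window - task_buffer            # 500,000 tokens
days_until_full = memory_budget // (memory_per_task * tasks_per_day)
print(days_until_full)  # → 50
```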

Then, around 50 days later, Claude 5.1 releases, with capabilities improved through the regular process. Claude 5.1 inherits the existing memories and documentation, and immediately sets about improving and compacting them [11]. Combined with a longer context window, the new Claude 5.1 has perhaps bought itself another 50 days of memory space [12].

And so the cycle continues, or at least it does until Claude N solves true continual learning via runtime parameter updates.

In this way, the lessons from a specific deployment (say, a model answering phones for a particular company) carry over trivially from one generation to the next, while capabilities keep improving through regular training. In practice, what more would we need true continual learning for? [13]

Can I Get a Brain Analogy?

Part of why continual learning became such a popular concept is that humans can do it, which makes it a very attractive answer to the question "what can't AI do yet?"

Human learning looks a bit like the diagram above. We have an explicit, discrete, and extremely tiny "working memory", able to hold only around 10 objects in mind at once. This probably exists as activations in the prefrontal cortex. It's analogous to an LLM's context window: lossless and explicit, but much smaller.

Next, humans have a kind of "buffer", where information is stored lossily but is easy to access, on timescales from hours to weeks. This seems to be stored in the hippocampus. You can draw a loose analogy to an AI reading its documentation: a partially processed summary of what happened, accessible with a few seconds of thought.

Humans can read documentation too, of course, but comparatively very slowly. An AI reading its documentation is closer in speed to a human recalling a specific memory.

Then humans have "long-term memory", which updates slowly, on the order of days, probably by reading out the hippocampal "buffer" and updating on it [14]. This is the missing piece of the analogy in LLM continual learning, the one we'd have if we knew how to properly update an instance's parameters at runtime.

Finally, even humans stop gaining intelligence in adulthood [15]. We rely on evolutionary selection for any major changes to human intelligence. The analogy here is the training of the next generation of AI models, though that happens far faster.

Laid out like this, you can see that although the "long-term memory" update step is missing, "context window + documentation" stores absurdly more than human working and short-term memory, and the "intelligence gain" step is much shorter, so skipping runtime weight updates may well be workable. Humans need memory-related parameter updates because we can't store much in working or short-term memory, but things would look very different if our working memory were too large to fill in a lifetime.

Conclusion

Having thought this through, I've cooled on continual learning being a real problem for near-term AI capabilities [16].

For general capability gains it doesn't seem necessary; the regime of releasing a new model every few months is working well.

For company-specific work it doesn't seem necessary either; you can store all the information you need in documentation and context.

I think the fact that it has to be explicitly written and used by the model gives a satisfying answer to why it hasn't worked well so far: previous models simply weren't smart enough to do it well.

I'm also bullish on the rate of progress here, because this performance can be optimised directly with unsupervised RL, including training models to handle and edit stale memories.

So... damn, I guess we're building continual learners now.

  • The goal people have in mind for continual learning means that, practically speaking, the main use case is specific, non-generalisable data particular to that model deployment. When I say you don't have enough data to do this effectively, I mean that a whole day's (or month's) worth of work logs is a tiny amount of data for fine-tuning a model. You can't reliably learn new things this way, though you may be able to elicit knowledge already present in the model.

  • This isn't a new idea. A quick Claude search turned up prior writing on the concept, most of which I haven't read in detail. I'm writing this because I hadn't seen it pulled together clearly in public before, and because there's perhaps some interesting exploration in the RL training loop and in why explicit memory is hard for models to get right.

  • We also have versions of all of this today, which is why it's "prosaic" continual learning.

  • You could also include tools the model has built for itself, information it found online and wanted to note down, and anything created or curated for the model's use, without storing the whole thing in the active context window.

  • The same task means the literally identical task [17].

Similar tasks means tasks drawn from the same narrow distribution. For example, the range of things one employee might do in their job at a company. We want to encourage memories that generalise across this relatively narrow domain.

Dissimilar tasks means tasks drawn from wildly different distributions: programming, emotional support, creative fiction, and so on. I think we need some proportion of dissimilar tasks in the batch to train the model not to over-rely on its memories. At deployment, models genuinely may be handed memories irrelevant to the task at hand.

If I had to guess at the proportion of each task type in a given batch, I'd weight the distribution so that the (N+1)th task is 89% likely to come from the similar distribution, 10% likely to come from a dissimilar one, and 1% likely to be a literal repeat of the same task.
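That guessed mixture can be written down directly (the probabilities are, as stated, a random guess):

```python
import random

def sample_task_type(rng: random.Random) -> str:
    """Draw the (N+1)th task's type: 89% similar, 10% dissimilar, 1% identical."""
    r = rng.random()
    if r < 0.89:
        return "similar"
    if r < 0.99:
        return "dissimilar"
    return "identical"
```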

  • The model should write memories and documentation for successful and failed attempts alike -- either way, it likely has useful information about what to try or not try. I also imagine training with some penalty on overall token use, for inference-efficiency reasons, which would incentivise passing on useful tricks and lessons through the memories and documentation where they make later instances more efficient.

Even passing entire solutions through memory is fine, so long as the model has learnt when that's applicable and has been appropriately penalised for memory length. I think we can get this outcome by tuning the ratio of same, similar, and dissimilar tasks scored together -- that is, if we run similar tasks n times and dissimilar tasks m times (and perhaps the same task p times) with the memories and documentation passed into a given reward calculation, we can pick n, m, and p such that generalisable tricks are favoured over long, specific instructions.

  • I'm not sure documentation should be length-penalised; you get some of this implicitly by scoring how the model performs while using it. I lean towards probably not, on the principle of letting the training process decide for itself whether short or long documentation works better. I'm assuming we use a tool that lets the model choose to read a sensible number of tokens at a time, rather than risking breaking things by stuffing in whole files, or truncating only once a file has grown long.
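Such a tool might look like the following sketch, where the model passes an offset and a token budget. The function name is made up, and "tokens" are crudely approximated by whitespace-separated words:

```python
from pathlib import Path

def read_doc(path: str, offset: int = 0, max_tokens: int = 2_000) -> tuple[str, bool]:
    """Return a bounded chunk of a documentation file plus a has-more flag,
    so the model reads a chosen amount at a time instead of whole files."""
    words = Path(path).read_text().split()
    chunk = words[offset : offset + max_tokens]
    has_more = offset + max_tokens < len(words)
    return " ".join(chunk), has_more
```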

  • In the case of memories, "good" means working out which information will be useful across all future runs, and being able to recover from wrong or missing memories by editing them later. In the case of documentation, it means accurately including all the relevant information, avoiding including junk, and then using that information to work more efficiently than without it.

  • I've made up some numbers here just to show how much headroom there is. In this case I get a 1M-token context window, minus a 500k buffer for the task, leaving 500k tokens for memories. At 10,000 tokens a day, that's a 50-day memory buffer.

This is also something of a worst case. 1,000 tokens of memories per task is very high, since most memories can just be pointers to where the actual details live, and you quickly run out of new things to write. Would you memorise ~6,000 words of new information every day and retain it for life? If you can compact each day's new memories down to 1,000 new tokens instead of 10,000, you get over a year of runtime. Alternatively, growing the context length beyond today's 1M tokens also buys headroom.

  • Different tasks will have very different characteristics here. Programming might need only very short memories, while guiding a robot through a factory might need memories containing a map and a note for every mistake the model made on previous trips.

  • We can expect each new version to be better at the difficult task of creating and using memories and documentation, especially if it's explicitly trained on it. There are a few possibilities here, all pointing towards shorter and fewer memories:

  • Can the model more easily understand what a memory is getting at, letting it compact existing memories into fewer tokens?

  • Is the model better at writing information densely?

  • Does the model innately know more, letting it remove that information from its explicit context?

  • Does the model know better what it knows, letting it prune unnecessary memories?

  • How does the new version shift the trade-off between memories and the documentation cache? For example, does it read documentation faster? Does it know better when, and which, documentation to read for a given goal?

  • I'm very confident that memory use can grow slowly enough for a Claude working for a particular company to fit everything it needs into context and explicit documentation. For this not to hold, you'd have to assume that an enormous amount of information (multiple books' worth) is required, that the rate of finding new information which must stay in context (rather than in consultable documentation) outpaces the growth of context windows, and that future models will be unable either to significantly compact the existing memories or to move them into documentation by being better at judging when to look things up.

  • As far as I can tell, in the limit this process is functionally identical to continual learning. Imagine the 50 days between model releases shrinking to something short, say a day or an hour, and imagine the written memories being handed down growing denser and denser, until they're an abstract initialisation schema loaded for the deployment (like a static image).

Run the same scenario the other way: imagine a model with traditional weight-update continual learning. Rather than updating its weights directly, it (like a human) uses a short-term memory buffer for new information, keeping private information isolated from the weights. Every hour, the relevant lessons from the previous hour's work are trained into a copy of the model, which is swapped in seamlessly, and the buffer is updated.

  • I don't know if you've ever noticed your long term memory updating, I feel like I have. Have you ever had a major event happen, and then only some days later have cemented a behaviour change, even though you knew the change was necessary from the moment of the event?

  • They continue to learn more, which makes their crystallised intelligence (knowledge and skills) go up, but their fluid intelligence (ability to reason abstractly, solve new problems, etc) declines after early adulthood.

  • I'm even coming around on continual learning being worse for most mundane uses -- suppose you have your own version of a model, with the weights updated to store information specific to you and your use case. What happens when a new model is released? You have to retrain? What happens to the optimisations from batching?

  • I actually think it's debatable whether you should include the literal same task as an option for the nth instance (with the memories and documentation prepared by the (n-1)th instance) to be assigned. If you do this, the model could just include the whole solution in its memories, but honestly, for some production usage and types of task, that could be a reasonable and viable strategy.

I think in general we should try to train on the same distribution as the deployment, so whether to include the literal same task (vs just similar tasks) as a possible option depends on whether you think that's a situation likely to occur in practice (maybe setting up the same programming environment many times?), and whether you get anything from doing it (quickly using the cached procedure?).