Why We Need Continual Learning

https://www.a16z.news/p/why-we-need-continual-learning

In Christopher Nolan’s Memento, Leonard Shelby lives inside a fractured present. After a traumatic brain injury, he suffers from anterograde amnesia, an affliction that prevents him from forming new memories. Every few minutes, his world resets, leaving him stranded in a perpetual now, untethered from what just happened and uncertain of what comes next. To cope, he tattoos notes on his body and snaps Polaroids - external props that remind him of what his brain cannot retain.

Large language models live in a similar perpetual present. They emerge from training with vast knowledge frozen into their parameters, but they cannot form new memories - cannot update their parameters in response to new experience. To compensate, we surround them with scaffolding: chat history as short-term sticky notes, retrieval systems as external notebooks, system prompts as guiding tattoos. The model itself never fully internalizes the new information.

There’s a growing belief among some researchers that this is not enough. In-context learning (ICL) is sufficient for problems where the answer, or pieces of the answer, already exist somewhere in the world. But for problems that require genuine discovery (like novel mathematics), for adversarial scenarios (like security), or for knowledge too tacit to express in language, there’s a strong argument that models need a way to write new knowledge and experience directly into their parameters after deployment.

ICL is transient. Real learning requires compression. Until we let models compress continuously, we may be stuck in Memento’s perpetual present. Conversely, if we can train models to learn their own memory architectures - rather than offloading to bespoke harnesses - we may unlock a new dimension of scaling.

The name for this field of research is continual learning. And while the idea is not new (see: McCloskey and Cohen, 1989!), we think it’s some of the most important work happening in AI right now. With the astounding growth in model capabilities over the past 2-3 years, the gap between what models know and what they could know has become increasingly obvious. So our goal with this post is to share what we’ve learned from top researchers working in this field; help disambiguate different approaches to continual learning; and advance this topic in the startup ecosystem.

Note: This article was shaped by conversations with an extraordinary group of researchers, PhD students, and startup founders who have shared their work and perspectives on continual learning openly with us. Their insights, from the theoretical foundations to the engineering realities of post-deployment learning, made this piece sharper and more grounded than anything we could have written on our own. Thank you for your generosity with your time and ideas!

First, Let’s Talk About Context

Before making the case for parametric learning - i.e., learning that updates the model’s weights - it’s important to acknowledge that in-context learning absolutely does work. And there is a compelling argument that it will keep winning.

Transformers are, at their core, conditional next-token predictors over a sequence. Give them the right sequence, and you get surprisingly rich behavior, without touching the weights. That is why context management, prompt engineering, instruction tuning, and few-shot examples have been so powerful. The intelligence lives in the static parameters, and the apparent capabilities change radically depending on what you feed into the window.
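
To make this concrete, here is a minimal sketch using Hugging Face’s transformers pipeline, with the small gpt2 checkpoint purely as an illustrative stand-in: the same frozen weights produce very different behavior depending on what sits in the window.

```python
from transformers import pipeline

# Same frozen weights; behavior depends entirely on what's in the window.
# "gpt2" is used only as a small, convenient illustrative checkpoint.
generate = pipeline("text-generation", model="gpt2")

bare = generate("Translate English to French: cheese ->", max_new_tokens=5)

few_shot = generate(
    "Translate English to French: sea -> mer\n"
    "Translate English to French: dog -> chien\n"
    "Translate English to French: cheese ->",
    max_new_tokens=5,
)

# The few-shot prompt usually elicits the translation pattern; the bare
# prompt often does not. No parameter was updated in either call.
print(bare[0]["generated_text"])
print(few_shot[0]["generated_text"])
```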

Cursor’s recent deep-dive on scaling autonomous coding agents gives a nice example of this point: “A surprising amount of the system’s behavior comes down to how we prompt the agents. The harness and models matter, but the prompts matter more.” The model weights were fixed. What made the system work was careful orchestration of context: what to include, when to summarize, how to maintain coherent state across hours of autonomous operation.

OpenClaw is another great example. It broke out not because of special model access (the underlying models were available to everyone) but because of how effectively it turns context and tools into working state: tracking what you’re doing, structuring intermediate artifacts, deciding what to re-inject into the prompt, maintaining persistent memory of prior work. OpenClaw elevates agent harness design to a discipline in its own right.

When prompting first emerged, many researchers were skeptical that “just prompting” could be a serious interface. It looked like a hack. Yet it was native to the transformer architecture, required no retraining, and scaled automatically with model improvements. So as models got better, prompting got better. “Janky but native” interfaces often win because they couple directly to the underlying system rather than fighting it. And so far, that’s exactly what’s happening with LLMs.

State Space Models: Context On Steroids

As the dominant workflow moves from raw LLM calls to agentic loops, pressure is building on the in-context learning paradigm. It used to be relatively rare to fill up context completely. This usually happened when LLMs were asked to do a long sequence of discrete work, and the app layer could prune and/or compress chat history in a straightforward way. With agents, though, a single task can consume a significant portion of total available context. Each step in the agent’s loop relies on context passed from prior iterations. And agents often fail after 20–100 steps because they lose the thread: their context fills up, coherence degrades, and they stop converging.

As a result, the major AI labs are now contributing significant resources (e.g., large training runs) to develop models with very large context windows. This is a natural approach to take because it builds on what’s working (in-context learning) and maps cleanly to the broader industry shift toward inference-time compute. The most common architecture intersperses fixed-size memory layers with normal attention heads: state space models and linear attention variants (we will refer to all of these as SSMs for simplicity). SSMs offer a fundamentally better scaling profile than traditional attention for long contexts.
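
To see why the scaling profile differs, consider a toy linear state-space recurrence (an illustrative sketch, not any specific production architecture): the entire history is folded into a fixed-size state, so per-token compute and memory stay constant no matter how long the loop runs.

```python
import numpy as np

# Toy linear SSM recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t
d_state, d_model = 16, 8
rng = np.random.default_rng(0)
A = 0.95 * np.eye(d_state)                  # state decay: old info fades
B = 0.1 * rng.standard_normal((d_state, d_model))
C = 0.1 * rng.standard_normal((d_model, d_state))

h = np.zeros(d_state)                       # the whole "context" lives here
for t in range(20_000):                     # 20,000 steps, constant memory
    x_t = rng.standard_normal(d_model)
    h = A @ h + B @ x_t                     # O(d_state) update per token
    y_t = C @ h                             # output read from compressed state

# Full attention would instead keep keys/values for all 20,000 past tokens,
# so its memory and per-step compute grow with sequence length.
```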

The goal is to help agents maintain coherence over loops that are several orders of magnitude longer, from, say, ~20 steps to ~20,000, without losing the breadth of skills and knowledge afforded by traditional transformers. If it works, this will be a major win for long-running agents. And you could even consider this approach a form of continual learning: while you’re not updating the model weights, you’ve introduced an external memory layer that rarely needs to be reset.

So, these non-parametric approaches are real and powerful. Any assessment of continual learning has to start here. The question is not whether today’s context-based systems work - they do. The question is whether we are looking at the ceiling, and if new approaches can take us further.

What Context Misses: The Filing Cabinet Fallacy

“The thing that happened with AGI and pre-training is that in some sense they overshot the target… A human being is not an AGI. Yes, there is definitely a foundation of skills, but a human being lacks a huge amount of knowledge. Instead, we rely on continual learning. If I produce a super intelligent 15-year-old, they don’t know very much at all. A great student, very eager. You can say, ‘Go and be a programmer. Go and be a doctor.’ The deployment itself will involve some kind of a learning, trial-and-error period. It’s a process, not dropping the finished thing.”

— Ilya Sutskever

Imagine a system with infinite storage. The world’s biggest filing cabinet, every fact perfectly indexed, instantly retrievable. It can look up anything. Has it learned?

No. It has never been forced to do the compression.

This is the centerpiece of our argument, and it draws on a point that Ilya Sutskever has made before: LLMs are, at their core, compression algorithms. During training, they compress the internet into parameters. The compression is lossy, and that is precisely what makes it powerful. Compression forces the model to find structure, to generalize, to build representations that transfer across contexts. A model that memorizes every training example is worse than one that extracts the underlying patterns. The lossy compression is the learning.
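
The information-theoretic version of this claim can be made explicit. Under arithmetic coding, a model that assigns probability p to a sequence can encode it in roughly -log2(p) bits, so better prediction literally is better compression. A minimal illustration:

```python
import math

def description_length_bits(next_token_probs):
    """Bits needed to encode a sequence, given the model's probability
    for each token that actually occurred (arithmetic-coding view)."""
    return -sum(math.log2(p) for p in next_token_probs)

# A model that has found the underlying structure predicts sharply:
print(description_length_bits([0.9, 0.8, 0.95, 0.85]))    # ~0.8 bits
# A model with no structure spreads its probability mass evenly:
print(description_length_bits([0.25, 0.25, 0.25, 0.25]))  # 8.0 bits
```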

The irony is that the very mechanism that makes LLMs powerful during training - compressing raw data into compact, transferable representations - is exactly what we refuse to let them do after deployment. We stop the compression at the moment of release and replace it with external memory. Most agent harnesses, of course, compress context in some bespoke way. But wouldn’t the bitter lesson suggest that the models themselves should learn to do this compression, directly and at scale?

One example Yu Sun shares to illustrate the debate is math. Consider Fermat’s Last Theorem. For over 350 years, no mathematician could prove it - not because they lacked access to the right literature, but because the solution was highly novel. The conceptual distance between established mathematics and the eventual answer was simply too vast. When Andrew Wiles finally cracked it in the 1990s, after seven years of working in near-total isolation, he had to invent powerful new techniques to reach the solution. His proof relied on successfully bridging two distinct branches of mathematics: elliptic curves and modular forms. While earlier work by Ken Ribet had shown that proving this connection would automatically resolve Fermat’s Last Theorem, no one possessed the theoretical machinery to actually construct that bridge until Wiles. A similar argument can be made about Grigori Perelman’s proof of the Poincaré conjecture.

The central question is: do these examples prove that something is missing from LLMs, some ability to update their priors and think in truly creative ways? Or does the story prove the opposite - that all human knowledge is just data available for training/recombination, and Wiles and Perelman simply show what LLMs could do at even greater scale?

This question is empirical, and the answer is not known yet. But we do know there are many classes of problems where in-context learning fails today and where parametric learning could have an impact: problems that demand genuine discovery, adversarial settings that punish static behavior, and knowledge too tacit to be put into words.

What’s more, in-context learning is limited to what can be expressed in language, whereas weights can encode concepts that a prompt cannot relay in text. Some patterns are too high-dimensional, too tacit, too deeply structural to fit in a context. For example, the visual texture that distinguishes a benign artifact from a tumor in a medical scan, or the micro-fluctuations in audio that define a speaker’s unique cadence, do not decompose into exact words. Language can only approximate them. No prompt, no matter how long, can transfer that kind of knowledge; it lives in the latent space of learned representations, not in words, and can only be held in the parameters.

This may help explain why explicit “the bot remembers you” features, such as ChatGPT’s memory, often trigger user discomfort rather than delight. Users don’t actually want recall per se. They want competence. A model that has internalized your patterns can generalize to novel situations; a model that merely recalls your history cannot. The difference between “Here is how you responded to this email before” (verbatim recall) and “I understand how you think well enough to anticipate what you need” is the difference between retrieval and learning.

A Primer on Continual Learning

There are various approaches to continual learning. The dividing line is not “memory features” vs. “no memory features.” It is: where does compaction happen? The approaches cluster along a spectrum from no compaction (pure retrieval, weights frozen) to full internal compaction (weight-level learning, the model gets smarter), with one important middle ground (modules).

Context

On the context end, teams build smarter retrieval pipelines, agent harnesses, and prompt orchestration. This is the most mature category: the infrastructure is proven and the deployment story is clean. The limitation is depth: the context length.

One emerging extension worth noting here: multi-agent architectures as a scaling strategy for context itself. If a single model is bounded by a 128K-token window, a coordinated swarm of agents, each holding its own context, specializing on a slice of the problem, and communicating results, can collectively approximate unbounded working memory. Each agent performs in-context learning within its window; the system aggregates. Karpathy’s recent autoresearch project and Cursor’s example of building a web browser are early examples. It is a purely non-parametric approach (no weights change), but it dramatically extends the ceiling of what context-based systems can do.
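
A minimal sketch of the idea, with run_agent as a hypothetical stand-in for a single LLM call that holds its own context window:

```python
from concurrent.futures import ThreadPoolExecutor

def run_agent(prompt: str) -> str:
    """Hypothetical stand-in for one LLM call with its own context window."""
    return f"[summary of: {prompt[:40]}]"

def swarm(task: str, subtasks: list[str]) -> str:
    # Each agent does in-context learning inside its own window on one
    # slice of the problem; only compact summaries flow back upward.
    with ThreadPoolExecutor() as pool:
        summaries = list(pool.map(run_agent, subtasks))
    # The synthesizer sees a few short summaries rather than the raw
    # material, so effective working memory exceeds any single window.
    return run_agent(f"Synthesize for task '{task}':\n" + "\n".join(summaries))

print(swarm("build a web browser", ["rendering", "networking", "UI"]))
```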

Modules

In the modules space, teams build attachable knowledge modules (compressed KV caches, adapter layers, external memory stores) that specialize a general-purpose model without retraining it. An 8B model with the right module can match a 109B model’s performance on targeted tasks using a fraction of the memory. The appeal is that it works with existing transformer infrastructure.
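
One common concrete form is a LoRA-style adapter: the base weights stay frozen while a small low-rank delta carries the specialization. A minimal PyTorch sketch (illustrative, not any particular vendor’s module format):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight plus a small trainable low-rank delta. The
    module can be trained, swapped, or updated independently of the base."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # base model stays intact
        self.A = nn.Parameter(0.01 * torch.randn(rank, base.in_features))
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + x @ self.A.T @ self.B.T  # base + learned delta

layer = LoRALinear(nn.Linear(512, 512))  # ~8K trainable params vs ~263K frozen
```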

Weights

On the weights end, researchers are pursuing genuine parametric learning: sparse memory layers that update only the relevant fraction of parameters, reinforcement learning loops that refine models from feedback, and test-time training that compresses context into weights during inference. These are the deepest approaches and the hardest to deploy, but the ones that actually allow models to fully internalize new information or skills.
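
As a flavor of the simplest variant, here is a hedged sketch of test-time training, assuming a Hugging Face-style causal LM that returns a next-token loss when given labels: the model takes a few gradient steps on the incoming context itself before answering.

```python
import torch

def test_time_update(model, context_ids, steps: int = 4, lr: float = 1e-4):
    """Sketch of test-time training: compress the incoming context into
    the weights with a few gradient steps, then answer from the updated
    model. Assumes model(input_ids=..., labels=...) returns a loss."""
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.SGD(params, lr=lr)
    for _ in range(steps):
        loss = model(input_ids=context_ids, labels=context_ids).loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model  # the context is now partially internalized
```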

There are multiple parametric mechanisms for performing the update. To name a few research directions:

The weight-level research landscape spans several parallel lines of work. Regularization and weight-space methods are the oldest: EWC (Kirkpatrick et al., 2017) penalizes changes to parameters in proportion to their importance for previous tasks, and weight interpolation (Kozal et al., 2024) blends old and new weight configurations in parameter space, though both tend to be brittle at scale. Test-time training, pioneered by Sun et al. (2020) and since evolved into architectural primitives (TTT layers, TTT-E2E, TTT-Discover), takes a different approach: run gradient descent on test-time data, compressing new information into parameters at the moment it matters. Meta-learning asks whether we can train models that learn how to learn, from MAML’s few-shot-friendly parameter initialization (Finn et al., 2017) to Behrouz et al.’s Nested Learning (2025), which structures the model as a hierarchy of optimization problems operating at different timescales, with fast-adapting and slow-updating modules inspired by biological memory consolidation.
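
For concreteness, the EWC idea reduces to a quadratic penalty that makes important weights stiff and unimportant ones plastic. A minimal PyTorch sketch in the spirit of Kirkpatrick et al. (2017):

```python
import torch

def ewc_penalty(model, anchor_params, fisher, lam: float = 1.0):
    """Penalize drift on parameters in proportion to their estimated
    importance (diagonal Fisher information) for previous tasks."""
    penalty = torch.tensor(0.0)
    for name, p in model.named_parameters():
        penalty = penalty + (fisher[name] * (p - anchor_params[name]) ** 2).sum()
    return (lam / 2.0) * penalty

# Training on new data then minimizes:
#   loss = new_task_loss + ewc_penalty(model, anchor_params, fisher)
# so plasticity is spent on unimportant weights and old skills survive.
```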

Distillation preserves prior-task knowledge by matching a student to a frozen teacher checkpoint. LoRD (Liu et al., 2025) makes this efficient enough to run continuously by pruning both model and replay buffer. Self-distillation (SDFT, Shenfeld et al., 2026) flips the source, using the model’s own expert-conditioned outputs as the training signal, sidestepping the catastrophic forgetting of sequential fine-tuning. Recursive self-improvement operates in a similar spirit: STaR (Zelikman et al., 2022) bootstraps reasoning from self-generated rationales, AlphaEvolve (DeepMind, 2025) discovered improvements to algorithms untouched for decades, and Silver and Sutton’s “Era of Experience” (2025) frames agents learning from a continuous, never-ending experience stream.
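
The primitive underneath these distillation methods is simple: match the student’s temperature-softened distribution to a frozen teacher’s. A minimal sketch of the classic soft-target loss (the generic objective, not LoRD or SDFT specifically):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T: float = 2.0):
    """Soft-target distillation: the frozen teacher's softened distribution
    carries prior-task knowledge the student is trained to preserve."""
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
```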

These research directions are converging. TTT-Discover already fuses test-time training with RL-driven exploration. HOPE nests fast and slow learning loops inside a single architecture. SDFT turns distillation into a self-improvement primitive. The boundaries between these categories are blurring: the next generation of continual learning systems will likely combine multiple strategies, using regularization to stabilize, meta-learning to accelerate, and self-improvement to compound. A growing cohort of startups is betting on different layers of this stack.

The Continual Learning Startup Landscape

The non-parametric end of the spectrum is the most familiar. Harness companies (Letta, mem0, Subconscious) build orchestration layers and scaffolding that manage what goes into the context window. External memory stores and RAG infrastructure (e.g., Pinecone, xmemory) provide the retrieval backbone. The data exists; the challenge is getting the right slice of it in front of the model at the right time. As context windows expand, the design space for these companies grows with them, particularly on the harness side, where a new wave of startups is emerging to manage increasingly complex context strategies.

The parametric side is earlier and more varied. Companies here are attempting some version of post-deployment compression, letting models internalize new information in the weights. The approaches cluster into a few distinct bets about how models should learn after release.

Partial compaction: learning without retraining. Some teams are building attachable knowledge modules (compressed KV caches, adapter layers, external memory stores) that specialize a general-purpose model without touching its core weights. The shared thesis: you can get meaningful compaction (not just retrieval) while keeping the stability-plasticity tradeoff manageable, because the learning is isolated rather than distributed across the full parameter space. An 8B model with the right module can match far larger model performance on targeted tasks. The upside is composability: modules work with existing transformer architectures out of the box, can be swapped or updated independently, and are far easier to experiment with than retraining.

RL and feedback loops: learning from signals. Other teams are betting that the richest signal for post-deployment learning already exists in the deployment loop itself: user corrections, task success and failure, reward signals from real-world outcomes. The core idea is that models should treat every interaction as a potential training signal, not just an inference request. This is a close analog to how humans improve at a job: you do the work, you get feedback, you internalize what worked. The engineering challenge is converting sparse, noisy, sometimes adversarial feedback into stable weight updates without catastrophic forgetting. But a model that genuinely learns from deployment compounds in value over time in a way that context-only systems cannot.
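
A heavily hedged sketch of what such a loop might look like; Interaction and conservative_update are hypothetical placeholders, since the hard part (turning noisy feedback into trusted, stable updates) is exactly what these teams are building:

```python
from dataclasses import dataclass

@dataclass
class Interaction:
    prompt: str
    response: str
    reward: float | None  # e.g. user correction, task success/failure signal
    verified: bool        # passed trust and anti-poisoning filters

def conservative_update(model, batch):
    """Hypothetical guarded weight update, e.g. a KL-regularized RL step
    that limits drift from the current policy."""
    return model

def deployment_learning_loop(model, interactions):
    # Every interaction is a candidate training signal, but only trusted,
    # scored feedback ever reaches the weights (the poisoning surface).
    batch = []
    for ix in interactions:
        if ix.verified and ix.reward is not None:
            batch.append(ix)
        if len(batch) >= 64:
            model = conservative_update(model, batch)
            batch = []
    return model
```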

Data-centric approaches: learning from the right signal. A related but distinct bet is that the bottleneck isn’t the learning algorithm but the training data and surrounding systems. These teams focus on curating, generating, or synthesizing the right data to drive continual updates, the premise being that a model with access to a high-quality, well-structured learning signal needs far fewer gradient steps to meaningfully improve. This connects naturally to the feedback-loop companies but emphasizes the upstream question: not just whether the model can learn, but what it should learn from, and to what degree.

Novel architectures: learning by design. The most radical bet is that the transformer architecture itself is the bottleneck, and that continual learning requires fundamentally different computational primitives: architectures with continuous-time dynamics and built-in memory mechanisms. The thesis here is structural: if you want a system that learns continuously, you should build the learning mechanism into the substrate.

All the major labs are also active across these categories. Some are exploring better context management and chain-of-thought reasoning. Others are experimenting with external memory modules or sleep-time compute pipelines. Several stealth startups are pursuing novel architectures. The field is early enough that no single approach has won, and given the range of use cases, none should.

Why Naive Weight Updates Fail

Updating model parameters in production introduces a cascade of failure modes that are, so far, unsolved at scale.

The engineering problems are well-documented. Catastrophic forgetting means that models sensitive enough to learn from new data destroy existing representations - the stability-plasticity dilemma. Temporal disentanglement fails because invariant rules and mutable state get compressed into the same weights, so updating one corrupts the other. Logical integration fails because fact updates don’t propagate to their consequences: changes are local to token sequences, not semantic concepts. And unlearning remains impossible: there is no differentiable operation for subtraction, so false or toxic knowledge has no surgical remedy.
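
Catastrophic forgetting is easy to reproduce even at toy scale. In this deliberately tiny sketch, a linear model fit to task A is then fine-tuned on task B, and its task-A solution is destroyed:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Linear(1, 1)
opt = torch.optim.SGD(net.parameters(), lr=0.1)

def fit(x, y, steps: int = 300):
    for _ in range(steps):
        loss = ((net(x) - y) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()

xa = torch.randn(64, 1); ya = 2.0 * xa    # task A: y = 2x
xb = torch.randn(64, 1); yb = -3.0 * xb   # task B: y = -3x

fit(xa, ya)
print(((net(xa) - ya) ** 2).mean().item())  # task A loss after A: ~0
fit(xb, yb)
print(((net(xa) - ya) ** 2).mean().item())  # task A loss after B: large
```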

But there is a second set of problems that gets less attention. The current separation between training and deployment is not just an engineering convenience - it is a safety, auditability, and governance boundary. Open it, and several things break at once. Safety alignment can degrade unpredictably: even narrow fine-tuning on benign data can produce broadly misaligned behavior. Continuous updates create a data poisoning surface - a slow, persistent version of prompt injection that lives in the weights. Auditability breaks down because a continuously updating model is a moving target that can’t be versioned, regression-tested, or certified once. And privacy risks intensify when user interactions get compressed into parameters, baking sensitive information into representations that are far harder to filter than retrieved context.

These are open problems, not fundamental impossibilities, and solving them is as much a part of the continual learning research agenda as solving core architectural challenges.

From Memento to Memory

Leonard’s tragedy in Memento isn’t that he can’t function: he’s resourceful, even brilliant within any given scene. His tragedy is that he can never compound. Every experience remains external - a Polaroid, a tattoo, a note in someone else’s handwriting. He can retrieve, but he cannot compress the new knowledge.

As Leonard moves through this self-constructed maze, the line between truth and belief begins to blur. His condition does not just strip him of memory; it forces him to constantly reconstruct meaning, making him both investigator and unreliable narrator in his own story.

Today’s AI operates under the same constraint. We have built extraordinarily capable retrieval systems: longer context windows, smarter harnesses, coordinated multi-agent swarms, and they work! But retrieval is not learning. A system that can look up any fact has not been forced to find structure. It has not been forced to generalize. The lossy compression that makes training so powerful, the mechanism that turns raw data into transferable representations, is exactly what we shut off the moment we deploy.

The path forward is likely not a single breakthrough but a layered system. In-context learning will remain the first line of adaptation: it is native, proven, and improving. Module mechanisms can handle the middle ground of personalization and domain specialization. But for the hard problems such as discovery, adversarial adaptation, knowledge too tacit to express with words, we may need models that compress experience into their parameters after training. That means advances in sparse architectures, meta-learning objectives, and self-improvement loops. It may also require us to redefine what “a model” even means: not a fixed set of weights, but an evolving system that includes its memories, its update algorithms, and its capacity to abstract from its own experience.

The filing cabinet keeps getting bigger. But a bigger filing cabinet is still a filing cabinet. The breakthrough is letting the model do after deployment what made it powerful during training: compress, abstract, and learn. We stand at the cusp of moving from amnesiac models to ones with a glimmer of experience. Otherwise, we will be stuck in our own Memento.
