Context Engineering Our Way to Long-Horizon Agents: LangChain’s Harrison Chase
Harrison Chase, cofounder of LangChain and pioneer of AI agent frameworks, discusses the emergence of long-horizon agents that can work autonomously for extended periods. Harrison breaks down the evolution from early scaffolding approaches to today's harness-based architectures, explaining why context engineering - not just better models - has become fundamental to agent development. He shares insights on why coding agents are leading the way, the role of file systems in agent workflows, and how building agents differs from traditional software development - from the importance of traces as the new source of truth to memory systems that enable agents to improve themselves over time. Hosted by Sonya Huang and Pat Grady
- Published
- Published Jan 21, 2026
- Uploaded
- Uploaded Jun 11, 2026
- File type
- Podcast
- Queried
- 00
Full transcript
Showing the full transcript for this episode.
AI-generated transcript with timestamped sections.
[00:00] People use traces from the start to just tell what's going on under the hood. And it's way more impactful in agents than in single LLM applications. Because in single LLM applications, you get some bad response from the LLM. You know exactly what your prompt is. You know exactly what the context that goes in is because that's determined by code. And then you get something out. In agents, they're running and repeating. And so you don't actually know what the context at step 14 will be because there's 13 steps before that that could pull arbitrary things in. So like what exactly is everything's context engineering? [00:30] I wish I came up with that term. Like it actually really describes like everything we've done at Langchain without knowing that that term existed. But like traces just like tell you what's in your context. And that's so important. [00:40] Thank you. [00:57] Welcome to Training Data. [00:58] Harrison, you were our very first guest on Training Data, and the AI space has moved so quickly in the 18 months or so since we originally interviewed you. And so I'm delighted to get you on the show today. Topics of the moment, I think there's nobody better than you to talk about some of these topics. We're going to talk first about long horizon agents and agent harnesses. Pat and I had this blog post on this yesterday. I know this is something that you are... [01:20] deeply fluent in. And then we're going to talk about what's the difference between building long horizon agents versus building software and the role that you see LinkedIn playing in that ecosystem. And then finally, I just want to chat with you about the future. I think you single handedly, you know, kind of saw the agent opportunity, I think before anybody, you know, we were back in the GPT three days and I think you see the future for what's happening with agents. And so I'm just excited to chat with you open-endedly about the future as well. I am really excited as well. Thank you guys for having me back. It's quite an honor. I'll
[01:50] tell my mom again that I'm on top of it. Wonderful. Okay. Let's start with long horizon agents. Yes. That was a great term. You guys wrote a great article. So he's good at naming things. We're not going to get into the backstory there. What do you think? What do you agree with? What do you disagree with? I mean, I agree that they're starting to finally work. I think like the idea of running an LLM in a loop and just having it go was always the idea of agents from the start. [02:20] off and captured so many people's imagination because it was just an LLM running a loop, completely deciding what to do. The issue is the models weren't really good enough and the scaffolding and harnesses around them weren't really good enough. And I think the models got better. We learned more about what makes a good harness over the past few years. And now they start to like really, really work. And you see this in coding first. And I think that's the domain where they're taking off the most and that's spreading to other domains. But you can give a task to an agent and you still need to communicate to it what you want it to do. And it needs to have [02:50] but it can actually operate for longer and longer periods of time. And so that, yeah, the long horizons kind of like framing of it, I think is really, really apt and really, really good. [02:59] Awesome. What are your favorite examples of long horizon agents? And I guess what shapes do you see them taking? So coding is the place where there is the most. I think that's the one that I probably use. Yeah, that's the one that I use the most. Adjacent to that, I think like really good ones are AISREs. So Traversal, I think, is a Sequoia company and they have an AISRE that operates over longer time horizons.
[03:29] logs, like research in general is a really, really good task because it ends up producing like a first draft of something. And the issue with agents is they aren't like reliable to nine nines of reliability, but they can do a ton of work and more and more work over longer time horizons. So if you can find these framings where they run for a long period of time, but produce like a first draft of something, those to me are like the killer applications of long horizon agents right now. So like coding is an example of that. Like coding, you usually put up a PR. [03:59] also starting to get better and better. AISREs usually surface it to a human who comes in and then reviews it. Report generation, you don't send it out to all of your followers right away. You look at it, you edit it, it creates like a first draft of something. So we see this in finance a bunch. This is a huge research opportunity. Customer support, we see a lot of things pivoting from kind of like the initial customer support was like first line response. Like someone messages, you just respond really quickly and there's still that and that's going [04:29] a great example of this where it's like humans and ai working together when the first line fails you escalate to a human you don't just have the human handle it you have this [04:38] Long horizon agent run in the background, produce a report of everything that happened and then hand it off to the to the to the agent there, to the human agent there. Agent starts to get confusing in customer support. [04:50] So I think the killer use case of all of these is places where you have like this first draft type of concept. [04:56] And then how much of the why now do you think is the models themselves are just so good versus people are doing really smart things on the harness side?
[05:05] And maybe even before we get to that, can you say a word for our listeners on how you frame the harness versus the model in terms of the actual composition of an agent? Yeah. And I'll maybe bring in framework as well, because I think early on, I mean, that's how we describe Langchain. That's what Langchain is. It's an agent framework. And now we have deep agents, which I'd call an agent harness, and we get asked about, what's the difference? [05:35] abstractions around that. So making it easy to switch between models, adding abstractions for other things like tools and vector stores and memory and things like that, but pretty like unopinionated about what actually goes in there. The value is more in abstractions, which can be good, can be bad. Harnesses are more like batteries included. So when we talk about deep agents, we're talking about, we actually give it a planning tool by default. So it has a tool that comes built into the harness. That's pretty opinionated that like, this is the right way to [06:05] So you have these long horizon agents that are running for longer periods of time. Context windows are larger, but they're still not infinite. And so at some point you need to compact that. How do you do that? There's a lot of research going on there right now. One of the other sets of tools that we and a lot of people are giving to these agents are tools for interacting with the file system, whether directly or via bash. And it's [06:27] It's kind of tough to separate from the models because the models are being trained on a lot of this data as well. And so there's this kind of evolution between like
[06:36] I don't know if we could have known that like these file system based harnesses are the best thing. Like if we fast, if we go back two years ago, I don't think we could have known that because models weren't really being trained on that as much as they are now. And so they're kind of evolving together. So I think it's like a combination of things. It's the models absolutely are getting better. Reasoning models are helping, helping a lot. But it's also the fact that we're figuring out all these primitives around compaction and and planning and these file system tools being really useful. [07:06] I do think it's a combination of both. [07:09] I remember in that very first episode we did together, you described... [07:14] Thank you. [07:15] laying graph, I think, as almost the cognitive framework of the agent. [07:22] Is that the right way to think about what the harness is? [07:25] Yeah, I think that's right. Yeah. So we build deep agents on top of LandGraph. It's one particular kind of like LandGraph instance. It's very opinionated. It's more general purpose. And so I think early on, we talked about general purpose architectures and more specific architectures. And what we've seen is that a lot of the... [07:44] specificity for tasks, [07:46] Previously, that might have been in lane graph because you need to put more structure on the models. Now that specificity is moving into the tools and the instructions. So there's still the same level of complexity. It's just in natural language. And so prompting and editing those prompts and and and automatic, maybe automatically updating those is becoming a part. But the harness is remaining a little bit more fixed. What's the hardest thing to get right on the harness side?
[08:16] harness engineering side of things? Who do you admire there? I think a lot of the companies that are doing the best harness engineering are coding companies, honestly. I think that's the place where it's taken off a bunch. I mean, you look at CloudCode, I would argue a big reason for the popularity of CloudCode is the harness itself. Does that, by the way, imply that harnesses are better built by foundation model companies than by [08:46] company I was going to mention is Factory, which is another coding company. And I think you look at the hardest they've done there. Amp is another coding company, has a really good harness. And [08:56] I think there's... [08:58] Pros and cons... [08:59] they... [09:00] There definitely is some aspect of the harness [09:04] being tied to a model. And maybe not just not a specific model, but a family of models. So like all Claude models, like Anthropic fine tunes on some specific tools, OpenAI fine tunes on different ones. So like, I think probably, probably, probably when we were doing this last time, we maybe talked about how prompts need to be different for one model versus another. Harnesses also need to be slightly different for one family of things versus the other. But there are similarities. All of them use the file system in some sense. So I think this is, [09:34] really interesting thing. We see that a lot of the coding, everyone who's building a coding company is basically building their own harness right now. [09:44] And there's all these leaderboards and you can see it's actually kind of interesting. If you go to Terminal Bench 2, which I think is probably one of the more kind of like popular coding benchmarks right now, you can actually see they have like the...
[09:56] the agent harness and then the model. And so you can see the variation in performance and cloud code is not at the top of that. So there's there's differences, but I think it. [10:05] it doesn't necessarily mean that the model labs are... [10:09] better at it, it just means that you have to understand how the models work and people who [10:14] look at the [10:15] at what makes a harness tick around the model can get some performance gains there. What do you think goes into making the harness tick? What do you think the guys at the top of the leaderboard are doing exceptionally well? I think part of it is definitely understanding what tools the model's trained on. So I think OpenAI trains really heavily on Bash. I think Anthropic has some explicit kind of file editing tools. And so I think leaning into that is part of it. Compaction's becoming more and more of a thing. [10:45] horizon tasks, like you start to fill up the context window. And so what do you do there is a really big question. And there's a bunch of strategies for kind of like approaching that. I'd argue that's part of a harness. I mean, so all of these harnesses also, this is where like skills and MCPs and sub agents start to come into play as well. And you can use those in like different ways and... [11:07] I don't know how, I don't think a ton of like skills or sub agents are trained into the models yet. Like those are still pretty new. And so like, [11:15] One of the things that we see in our harness is like when you have a sub agent, the main model needs to communicate with it like, well, it needs to give it all the appropriate information. It needs to let the sub agent know that it needs to like give it its final response out. So like we would see some failure modes where the sub agent because basically what happens is you kick off a sub agent and then only the final response is passed back to the main agent. And so we'd see some failure modes where the sub agent would do a bunch of work and then it would be basically like, look at my work above. And then, you know, pass that back to the main agent and it can't see. And it's like, what are you talking about?
[11:45] that type of prompting to get these pieces to work together is a big part of it. So skills, subagents, MCP, there are... [11:53] prompts in all of these harnesses that make them work well or don't make them work well. And there are hundreds of lines long if you look at some of the ones that are out there. Can I ask you a question on how this has evolved? And since you've always been... [12:10] really kind of on the bleeding edge of what are people doing around the models to make them work in the real world, right? If we think about [12:17] in our simplistic view on like what the big inflection points over the last five years have been. It feels like there's a big inflection point around pre-training when ChatGPT came out. It feels like there's a big inflection point around reasoning when O1 came out. It feels like just recently there's been a third big inflection point around these long horizon agents with Cloud Code and Opus 4.5. In your world, the world of all the stuff around the models that makes them work in the real world, would you have a different set of inflection points? Like what [12:47] a couple of years ago, and now we're talking about frameworks and agent harnesses. Like, what are the major... [12:52] leaps in sort of the design around the model yeah what have they been so I think there's maybe like three eras I would say I'd say like early on this is when Langchain was just started like these were still the raw like text in text out like not even chat based models and so they didn't have any of the tool calling they didn't have any content blocks any reasoning at all they were really just like
[13:17] really really basic uh and so the the things that people were doing mostly like single prompts or like chains um and and it wasn't even possible to do anything like that complicated then a lot of the model labs started training in a lot of like the tool calling into the models and they got really good at kind of they tried to make them good at like thinking and planning and they still weren't [13:38] They still weren't good yet. They sort of weren't good as they are today. But they were good enough to like [13:44] decide what to do and this is where like the custom cognitive architectures would come in more into play because you'd ask it explicitly like what do i do here but it was like a very like point in time and then you go down this branch and i'm like what do i do here and maybe there's a loop and there started to be some loops but it's still a little bit more kind of like scaffolding around it [13:59] And then there was an inflection point. And I don't know where exactly that was. I would say, I think we noticed it probably in like June, July of this year, where we saw Cloud Code taking off, Deep Research taking off, Manus taking off. And these all use the same architecture under the hood of just the LLM running in the loop. But like cleverly, like a lot of a lot of harness is context engineering, like everything around contraction. [14:20] Context engineering, sub-agents, context engineering, skills, context engineering. So we basically saw them using the same core algorithm, but making just like improvements on context engineering. And we're like, oh, that's interesting. That's pretty different than before. And so that's when we started working on deep agents. I think for a lot of people in the coding community. [14:37] I think probably like Opus 4-5 was when they started to like really feel this. It might have also just coincided with winter break when everyone went home and started using Claude code. Yeah, yeah. That's how good it was. But I think like around like November, December, like I think there's has been this like
[14:53] At least I sense a pretty big like vibe shift and like people just like, yeah, you throw hard problems at these things and you get long horizon agents. And so I don't know whether it was early 2025 or late 2025, but at some point the models got good enough. And that's when we moved from like scaffolds to harnesses. And what's next on this arc? [15:11] I wish I could tell you. I mean, I do think that like this algorithm of just running the LM in a loop and letting it orchestrate its own, letting it really choose what to pull into context and doing stuff there. That is like, it's so simple and so general purpose. Like, I mean, that was the core idea of agents and all along and we're finally there. [15:41] where they let the model decide when to compact things. [15:44] We don't really see a ton of people using that. Maybe that'll be a part that's next. Part of what we're really interested in is memory as well. If you think about memory in the context of this, that's also context engineering, right? It's context engineering over longer time horizons, and it's a slightly different set of context, but it's still giving that to the LLM. And I think... [16:05] uh, [16:06] I think like the... [16:07] The core algorithm is [16:09] is [16:09] is pretty... [16:11] It's pretty simple. It's run the LLM in a loop and we're finally there and it kind of works. And so I think there'll be a bunch of context engineering tricks around it. And maybe some of that is giving the context engineering actually to the LLM, like the anthropic thing. Maybe some of that is just pulling in new types of context. The models will probably get better. They'll probably get better and better at these types of longer horizon tasks. That'll be great as well.
[16:41] where we first started to really see these long horizon agents take off. And even for non-coding tasks, I think you can make an argument that writing code is really useful and can be general purpose. I was going to ask you, are coding agents, is that a subcategory or are coding agents just agents? Meaning the job of an agent is to figure out how to get a computer to do useful stuff. And code is a pretty good way to get a computer to do useful stuff. [17:08] I don't know. This is one of the big things. So I very, very strongly believe that right now, if you're building a long horizon agent, you need to give it access to a file system. There's so many things you can do with a file system in terms of context management. When we talk about compaction, one strategy is to summarize, but put all the messages in the file system so that if it needs to look it up, it can. Another strategy is when you have big tool call results, don't pass it all to the model, put it in the file system and let it look it up. [17:38] file system, actually, without letting it write code. So we have a concept of like a virtual file system where it's just backed by Postgres or something like that. It's more scalable. But there are obviously things you can do with code that you can't do with a virtual file system. You can't run code in a virtual file system. So like writing scripts is like really useful for that. And I think a coding agent can be general purpose, but I don't know if that means that today's coding agents are, if that makes sense, because I think a lot of the coding agents today
[18:08] that a general purpose agent is a coding agent. [18:11] But I don't know if the reverse is true, if that kind of makes sense. Yeah, yeah, yeah. We're thinking about that a lot as well. Are all agents coding agents? Yeah, that's one of the biggest things that we're thinking about right now. Yeah. Maybe can we transition into talking about what goes into building... [18:26] a long horizon agent versus building software. Can you maybe describe the software development stack for 1.0 code development and what's different now? I thought you had a really good [18:36] x article on this maybe maybe just summarize the the punchline i've been sure i need to think about this a bunch because we like to say that built and i think a lot of people would agree that like building agents is different than building software but like what exactly is different because i think it's it's easy and lazy to say that it's different but what actually is different these might sound obvious but hopefully that's good and they're not controversial but like um when you're building software [19:00] all of the logic is in the code in the software, and you can see it there. When you're building an agent, the logic for how your applications works is not all in the code. A large part of it comes from the model. And so what this means is that you can't just look at the code and tell exactly what the agent would do in a specific scenario. You actually have to run it. And so what does that mean? And I think that's the biggest difference, by the way. We're introducing these non-deterministic systems into it, and it's a black box, and it lives outside. And I think all that's true. That's the biggest difference. [19:30] does that mean? I think like one thing that that means is that in order to tell what [19:35] the [19:36] application is actually doing, you can't look at the code, you have to look at actually what it does in real life. And so I think one of the
[19:43] the [19:44] one of the things that [19:46] One of the things that we do that is most popular is Langsmith. One of the core parts of that is tracing. Why are traces so popular? Because they tell you exactly what goes on inside your agent at every step. And it's different than software traces where in software, you kind of have your system over here and it emits a bunch of like stuff. And you look at it when maybe there's some errors, but you don't need like everything. And you usually only turn that on when you put it in production because if it's local, you just put a break point or something like that. [20:15] In agents, like... [20:16] people use traces from the start to just tell what's going on under the hood. And it's way more impactful in agents than in single LLM applications. Because in single LLM applications, you get some bad response from the LLM. You know exactly what your prompt is. You know exactly what the context that goes in is because that's determined by code. And then you get something out. In agents, they're running and repeating. And so you don't actually know what the context at step 14 will be because there's 13 steps before that that could pull arbitrary things in. [20:46] is such a good term. I wish I came up with that term. Like it actually really describes like everything we've done at Langchain without knowing that that term existed. But like traces just like tell you what's in your context. And that's so important. And so then and so like, what does that mean? That means that the source of truth for software is in code. And for agents, it's a combination now of code and traces are where you can see the source of truth. It's technically in, you know, all those millions, billions of parameters, but like you can't really do anything
[21:16] that traces become a place where you start to think about testing because now you can't, you can test, you can set some parts still of the harness and you can do some unit testing offline. But like in order to get the, what the test cases are, you probably want to use the traces to construct that. You probably want to be testing online. That's probably more important in agents than it is in software is online testing because behavior doesn't emerge until it's actually being used with real world inputs. We see traces becoming a point of collaboration for teams. [21:46] Because if something goes wrong, it's not, oh, let's go look at the code in GitHub. It's let's go look at the trace. We see this in our open source as well. When people are being like, hey, deep agents went off the rails here. What happened? Our response is like, send us a Langsmith trace. We can't really help you debug if it's not that. Previously, it would be like, show me the code, right? So I think there's a transition there. [22:04] And I think the other thing that's and so that was the blog post that I wrote on next, which got a lot of good feedback on them and still kind of figuring out how to like phrase it. But I think that's that's a big part of it. The other thing which I'm still trying to think through as well is I think building agents is more iterative and. [22:21] We used to say that and I would kind of roll my eyes because building software is iterative as well, right? You ship it, you get feedback and it's constant iteration. That's like what it is. I think the difference is that in software, you're... [22:34] kind of like iterating based on what you want the software to do. Like you have some idea, you ship it, you get feedback. Oh, maybe this, you know, button is confusing. Maybe this, maybe users actually want to do X instead of Y, but you know what the software does before you ship it. With agents, you don't know what the agent does before you ship it. You have an idea, but you don't really know what it does before you ship it. And so I think there's way more iteration involved in order to get it like accurate, get it like right and passing like
[23:02] conceptual unit tests basically um [23:05] And... [23:07] Building upon that, like, this is actually why I think memory is really important as well, because memory is like learning from those interactions. And so if now you have a process that's like way more iterative. And so now you have to, like, it's way harder to build as a developer, because I have to like change the system prompt, like way more than I would have to change code in order to get it just [23:25] perform like correctly. Yeah. So that's where memory comes in because if there's a way where the system can kind of like learn by itself, that cuts down the iteration that you have to do as a developer and makes it easier to build these types of agents. So that's another kind of like [23:39] angle that I like, I absolutely think agents are different than building software. I think it's also a little cliche to say that. And so I've tried to think about what exactly is different. And those are like the two things that I've kind of come up. Well, and I'm curious on that, too. [23:51] and [23:52] One of the questions, this is a big public market debate right now, is are the existing software companies going to make it? And if you analogize to an on-prem software went to cloud, very few actually did make it because it turned out that building cloud software was actually quite different than building on-prem software. And since you're in the middle of kind of how people are building with AI, you know, you're not going to be able to do that. [24:14] What's your take on, not necessarily the public market question, but... [24:19] How different is it? Like, do you see... [24:22] Have you seen a lot of people who kind of like [24:24] we're good at building software the old way, and now they're good at building software the new way? Or is it more just you either grow up building it the new way or you never get it? Like, do you think people can make the leap? A lot of young founders out there right now, which makes me think that certainly it seems like the...
[24:39] younger people without a lot of preconceived notions on how to build software, you know, have the blank slate that has allowed them to like pick up on a lot of this stuff. I do think we have consistently heard that a lot of the people who are on these agent engineering teams are more junior developers, more junior builders even, who, yeah, don't have those preconceived notions. Our applied AI team internally definitely skews on the younger side. [25:09] I do think, I mean, in terms of kind of like, I think there's like a, there's like a person aspect to this. There's also like a company aspect to this. Like, I do think that like data is still really, really, really valuable. [25:39] the prompt and the instructions and then the tools that it's connected to. And I think one thing that this is more at the company level now, but like one thing that existing companies have is all the data and all the APIs. If you've done a good job at that, then I think it will actually be pretty easy to plug those in and get real value out of things. We were talking to someone in the finance space and they are saying, yeah, like the value of data is just going up and up and up and up. So if you're a previous software vendor and you have this data that is valuable, like you should
[26:09] though, is the instructions on what to do with that data. And that's probably like more net new in terms of like how to use that data. That's probably, you probably had some ideas about that as a software vendor, but you didn't kind of like consolidate it [26:24] You didn't have it because that was something that humans would still do, like a lot of what agents are doing or humans would still do. So you'd give them the tools to do it, but you wouldn't have tried to like automate that or you wouldn't have successfully automated it before kind of like agents. [26:54] knowledge and and and and and not like world knowledge but like knowledge on how to do specific patterns um so kind of yeah i think there's like [27:02] Are the people who are building software the right people to build agents? [27:06] I think we saw a lot of really senior developers adopt agentic coding. And so I think it's a mindset thing. But like, yeah, there is maybe a younger skew there. And then for companies depends on the data. [27:18] Even Pat's on Cloud Code. Yeah. Even the old guys can get it. Sonya got me on there. [27:25] Okay, so it seems like the trace is a core artifact, you think, in kind of this new world of agent development. And it's something that Langsmith helps a lot with. What other core artifacts do you think are there? And specifically, I'm wondering about evals. [27:38] Yeah, I think maybe artifact is the wrong word. Components. Yeah. I mean, I think one other thing that is different between building software and building agents is that to evaluate software, you could pretty reliably you could rely on tests and assertions of things programmatically. With agents, a lot of what they're doing is things that humans would do. So in order to judge them, you need to bring human judgment into that. And that's another thing that we try to do in Lanesmith is how can you bring you've got these traces?
[28:08] And so that like one obvious way to do that is to bring humans into the equation. And so we see data labeling startups doing really well. We have a concept of annotation cues in Langsmith to bring people in there. And so that actual like actual human judgment is a big part of it. And this is humans annotating the actual trace. So like, oh, the agent did this and this and this and that was good or bad. Yeah. Yeah. And sometimes giving like natural language feedback on it, like this is good. This is bad. Should have done this. Sometimes just like correcting it, like actually like laying out what the, [28:38] what the correct steps were kind of depends on the use case. And it's probably different for model companies doing RL than it is for, for agent companies, building, building agents. Yeah. But it's bringing that human judgment to it. But then another thing we see is trying to build proxies for this human judgment. And this is where LLM as a judge type things come in, where you can run an LLM or something else that, you know, has, [29:00] some semblance of human judgment in it to grade the thing that requires human judgment. And so one of the things that we think a lot about is how to make building these elements of judges easy, because a big part of them is making sure that they're aligned with your human judgment and human preferences. And so and because if they're not, you know, then you're [29:18] then your greater is just bad. And so we have a concept in Langsmith called align evals, where a human goes in, labels some traces, and then that builds an LLM as a judge that kind of like is calibrated against those traces. Because a big part of it is bringing this human judgment. And you just want to make sure that if you're bringing a proxy of it, it's well calibrated. Interesting. I remember when we first got into business with you, we were emailing about LLM as judge. Is it a viable idea or not? So it seems like it's come a long way. Okay. So there's a few different aspects of LLM as a judge, right?
[29:48] like the immediate, like, so what most people use them for in evals is like taking this trace and give it a score of like one to zero or zero to 10 or something like that. And yeah, I think that's viable and people are doing that. They're doing it offline. They're also doing it online because some of these judgments you don't need ground truth for. But I think the other area where this comes into is, I mean, you kind of see this in the coding agents themselves. Like the coding agents will, they'll work, [30:15] up until something, then they hit an error and they get an error and then they have to correct there. And so they're kind of judging their previous work. And so, and we also see this in memory, like a big part of memory is like reflecting on traces and then updating something. And so like, can LLMs reflect on traces that are either like their own or their own from a previous session or someone? Yeah, absolutely. I think they can. We see this all across evals and just like error correcting and memory. It's all kind of the same thing. I see. And then maybe [30:43] Okay, so you have all this... [30:45] You have all the traces. Yep. You have the evals. Yep. Um... [30:49] I think the natural question that comes to mind for me is, is the eval like a reward signal for reinforcement learning or is it a feedback mechanism for a human engineer to improve the harness? [31:00] Or for agent engineers to improve the harness because no one's coding manual anymore. They're all using these. So, yeah, one big thing that we've seen is like... [31:09] Um, we, we have a Langsmith MCP and we have Langsmith fetch, which is a CLI because coding agents are actually great at using CLIs. Um, you give that to an agent and it can pull down traces and diagnose what went wrong. And then, and then it brings those traces into the code base where it can then fix it. But that's absolutely a pattern that we are seeing. And, and we really, really, really want to support that pattern. Oh, that's crazy. Yeah, I know. And it's good. Yeah, yeah, yeah, yeah. It's good. Like it, it, yeah.
[31:39] learning, at least for like the agent app kind of like companies right now. [31:44] That seems like real recursive self-improvement, though. Yeah, I think, again, there's still a human in the loop. So back to the point around things are good when you can do something as a first draft, it changes the prompt and then the human reviews it and it keeps it on the rails. But one of the things we launched was Langsmith Agent Builder, which is a no-code way to build agents. One of the cool things that we have in there is memory. [32:14] when you interact with an agent, so it's not in the background yet, it's not like pulling down its traces, but when you interact with the agent, if you say, oh, you should, instead of X, you should have done Y, it will go to its own instructions, which are just files, and it will edit those files. So then in the future, and so that's also kind of like a version of this. One thing we do want to add is like the thing that runs every night, looks at all the traces for the day, upstates its own instructions. And so I do think- The dreaming thing? Yeah, yeah. Sleep time compute. Sleep time compute, is that what it's called? That's a term. Yeah, [32:44] That was good. I love it. [32:45] Awesome. Okay, let's talk more about the future. [32:49] What are you most excited about? Sounds like you're talking a lot about memory here. I like memory a bunch, yeah. I mean, I think asking the agents to improve themselves is... [32:57] I mean, I think... [32:59] Very, very cool and can be useful in a lot of situations. Not useful in all situations, by the way. Like if I'm chatting... So ChatGPT added memory... [33:06] I don't actually really use that feature that much. And I don't think it's created any more stickiness for me to use the product or anything like that. And I think part of the reason is when I go to chat GPT, I do like...
[33:18] So everything's a one-off thing. Like I don't really repeat myself that much. I'm asking about software. I'm asking about food, trips, like everything. [33:26] In Agent Builder, you build kind of like specific workflows for specific things. So I have an email agent. [33:33] I know. It's been emailing me for two years. [33:37] Well, so, okay. So I had an email agent outside of Agent Builder. And it had this like memory as part of it. We then built Agent Builder and I wanted to move into it. [33:48] my memories. And that was a big, even though it had the same starter prompt and the same tools, and that was actually, I still haven't fully switched over because it kind of sucks now compared to what it was before, like compared to the other one. And if I just interact with it, then it will get better and it will stop sucking. But like, that's where memory I think can be like a real moat. And I absolutely think that we're at a point right now where LLMs can... [34:09] look at traces and change things about [34:13] their code. And I think the question then becomes, how do you do that in a way that's safe and acceptable to users? But I think that's absolutely something that we'll see more for specific scenarios, not all of them. I still don't know if this would be useful in chat GPT, in this form, at least. How do you think the UI around working with long horizon agents will evolve? [34:37] I think there probably needs to be like a sync mode and an async mode. So long horizon agents running for a long time, probably default would be some sort of like async way to manage them. Like if it runs for like a day, you're not just going to sit there and wait for it to finish. You're probably going to kick off another one and another one and do a bunch of work. And so I think this is where like async management of things comes into play. I think things like linear and JIRA and Kanban boards and maybe even email are interesting to look at for inspiration about like what it looks to like to
[35:07] agents. But I think for a lot of these, at some point, you're going to want to switch into synchronous communication with these agents because they come back with a research report and you want to give it feedback that it wrote something wrong. And I actually think chat's reasonably good at that. The only thing that I'll maybe say there is that so many of these agents are now modifying other things like files in a file system that having some way to view that [35:37] is... [35:38] IDs are still used when you want to go in and manually kind of like change code. And, and even when I kick off a cloud code, when it finishes, I, I, [35:46] Sometimes I pull it up and look at the code that it actually wrote. And so I think having a way to view that state is interesting. One of the really cool things that Anthropic did with their Claude co-work launch, when you set it up, you choose the directory that it's kind of like working in. And you're basically saying like, this is your environment. And obviously, like, that's what you do in coding as well. You open your ID to a particular directory. But I think that's a nice mental kind of like framing is like, this is your workspace. That workspace could be a Google Drive. [36:16] page. It could be anything that stores state and then you and the agent are collaborating on that state. You kick it off. You manage maybe a bunch of these running asynchronously. Then you go into sync mode where you chat with it, but you also view the state. And so that's kind of what I see right now. And this is like your agent inbox idea then of, you know, to enable the sync mode, your agent's going to have to need a way of reaching you. Yeah, exactly. And yeah, so the agent inbox is something we launched that about a year ago and had this idea of like ambient agents
[36:46] ground and pinged you. And the first version of that didn't have a sync mode. And so it would ping you and then you'd give a response, but then you'd kind of just wait for it to ping you again. But oftentimes, like when I was switching in to email you and respond to you, I would say very small things and I didn't want to switch out and wait. Like you're really important. So I wanted to like be in the sync mode in this conversation with the agent. And so one of the things we added was now when you open the inbox, you're brought into chat and chat is very synchronous. And that [37:16] in async mode. [37:18] I don't think that really works right now. Maybe in the future, if they get so good that you don't really need to correct them as much, it gets more viable. But at least right now, I think we see people switching from async to sync and back and forth. [37:29] What do you think of code sandboxes? Is every agent going to have access to a sandbox? Is every agent going to have access to a... [37:37] or a computer, is every agent going to have access to a browser? [37:40] Really good question. Something we're thinking a bunch about. I think coding has clearly worked more than browser use so far. So at least in the short term, it seems like if any of those are going to be a key part there, it's going to be this code execution part. [37:56] File systems, I'm completely file system pilled. I think in some form, Adrian should have access to some file system coding. I'm maybe not as pilled, but I'm probably like, I'm like, maybe like 90% there. Like, yeah, I think like it is definitely possible. There are. [38:10] It's maybe for the longer... [38:13] um tail of use cases so maybe there's something where if you're doing something repeated you need code less but i think file systems are still useful because that repeated thing could be generating a lot of context and you need to do context engineering um but for the long tail of things coding is great and there's really no replacement for that browser use um i think the malls just aren't good enough at it right now from what we've seen um
[38:35] you could probably give like a coding agent, a CLI to do browser use. And there's probably some approximation there. There's probably some people doing some, I think I have seen some cool stuff there. And then computer use is like a weird hybrid of the two. So if it, yeah, [38:47] Code sandboxes, I really like code sandboxes. Yeah. Cool. [38:50] Harrison, thank you so much for joining us today. You have consistently seen the future on Agents, and it was really cool to have this conversation and talk about how context engineering has evolved. [39:01] to the current point in time with harnesses and long horizon agents. And so thank you for driving that future. And thank you for always chatting with us about it. Thank you for having me on. I look forward to being back on sometime in the future and being completely wrong about everything I said today. So it's very hard to predict the future. [39:16] Music
Want to learn more?