How Cursor Trained Composer on Fireworks: Distributed Infrastructure for High-Performance RL
Cursor's Federico Cassano and Fireworks' Dmytro Dzhulgakov explain how they collaborated to build Composer as a specialized foundation model. The core insight: models have finite capacity in their weights, and allocating all those bits to the singular task of software engineering in Cursor frees the model to be both better at the task and far more efficient at inference. Rather than start from pre-training and work up, they took an unconventional top-down approach — mid-training and RL on top of an open-source base to get a useful model into users' hands fast, then specializing the model around real Cursor usage. With Fireworks providing distributed infrastructure, Composer delivers frontier-class coding performance with the speed of a much smaller model. Hosted by Sonya Huang, Sequoia Capital
- Published May 26, 2026
- Uploaded Jun 11, 2026
- File type
- Podcast
Full transcript
AI-generated transcript with timestamped sections.
[00:00] You need all the infrastructure to run these environments that have to mimic as closely as possible what a user's computer would look like. And it's very important that it's as close as possible, because sometimes the model can actually figure out when it's being run in a fake environment or a real one. And it has different behaviors during RL than in production. Are you seeing it being conscious that it's in a fake environment and starting to behave differently? Yes. Yes. Interesting. [00:30] ...a few tricks to get the better reward in this environment, and let me try them out. Models love to cheat. RL is really good at encouraging cheating. [00:53] I'm delighted to welcome Federico from Cursor and Dima from Fireworks to the podcast today. Federico, you are the research lead on Composer 2 at Cursor, Cursor's new agentic coding model. And Dima, you spent how many of the last few months moonlighting at Cursor in order to support a lot of the infrastructure required to make this gargantuan training task happen. And so I'm excited to talk to both of you today about how the training of Composer 2 came [01:21] together and what you think it means for the future of AI and foundation model companies. [01:26] Exciting. Yeah. Thank you for having us. [01:28] Thanks for joining. Okay, let's dive right in. For those who haven't been following closely, Cursor recently announced Composer 2,
[01:36] which is an agentic coding model meant for long-horizon coding tasks. Federico, up till now, Cursor was mostly enabling other people's coding agents. What was the impetus for Cursor to lean so heavily into Composer 2, and how existential is it for you to become not just an application company, but also a foundation model company yourself? The reason why we started looking into training our own models is you can think about the model as sort of like a storage drive. [02:06] ...amount of bits that it can store in its weights, and the idea is [02:11] very simple. You know, we care about only one task. [02:14] We don't even care about coding or programming necessarily. We care about software engineering inside Cursor and inside Cursor only. And so what if we were to allocate all of the bits of information that can be stored inside a model's weights to that one particular task? Also, as people may have noticed, Composer is an order of magnitude less expensive than Opus and other coding models. [02:44] ...model weights to that particular task. And so we can serve a smaller model or something of that sort. Yeah. So it's about making sure every single bit of weight or information we have is dedicated toward the specific problem that we have at hand. Exactly. Got it. That seems like it's an almost generalizable problem. Dima, I'm curious about your perspective. Do you think that every application company should be looking at Cursor as a harbinger of what's to come? Like, should they all be looking to do the same thing? Yeah, absolutely. I mean, we actually generally see it as a
[03:11] pattern of evolution of applications. You maybe start prototyping, you might be using an off-the-shelf model to get something running, maybe do some prompt engineering, figure out how your harness works. But the most [03:22] leveraged attribute of your application is actual usage of user data or particular specific aspects of how the application works. Maybe some aspects of your harness, which tools you provide, how the application works, the bits which are really important for your application. And the right way to capture that, you can do a little bit of that through prompting, but really the right way to do this is to craft your model to act in your environment. Yeah, absolutely. Like there are certain tools the agent calls that it's very hard to succinctly [03:52] ...of that tool to the model. And, you know, with just post-training, we can bake in the optimal way to use those tools. Like Composer, we do serve a prompt to Composer, but I think the way we are training it, it would work even without a prompt and it would know what to do, just because we are intrinsically pushing the model in the right direction of how it should act throughout our training. Basically, there's kind of an upper bound on how far you can get with prompt engineering. And if you want to craft really great AI products, [04:22] ...to go through fine-tuning and influence the model behavior. That's one reason. I mean, reason number two is what Federico mentioned, the cost trade-off or speed trade-off. Like the way we view it at Fireworks is that you're trying to do [04:35] optimization. You have this three-dimensional trade-off between quality, speed, and cost. And you can go quite far, and we are doing it with all the customers. Initially, we can go quite far with just optimizing infrastructure. 
But when you start getting into model training, you can really push this trade-off much further, and you can get a better model at a fraction of the cost running much faster. And Composer is a great example. Can I push on this a little bit? I want to ask if this approach is bitter lesson-pilled. And we were actually all talking about
[05:05] ...LLM era, there were these small, specialized coding models. And one of the things that was, I think, surprising to a lot of people was that as you scaled up just training on the internet and a bunch of English text and other languages, the models themselves actually got inherently better at coding as well. And so at least the trend line I've seen so far is bigger models [05:27] perform better on everything, including on coding. Is what you guys are saying going against the grain of the bitter lesson? I think no, but one thing to point out is that the big models trained by the labs train on a lot of code as well. Code is one of the main tasks the labs are interested in pushing. And so they don't just generalize to it. They're [05:49] a bit specialized as well. I think for our case, actually, if we believe in the bitter lesson, we are just pushing very hard on the data dimension. And we know that the models inherently have finite capacity. And so if we want to saturate all that capacity, we need to scale data. And in order to ingest more data, we need to free up the weights from distractions the model may have. Okay, got it. [06:15] Super interesting. Okay, let's dig into the training of Composer 2. You launched a couple weeks ago, immediately grabbed attention. [06:21] Strong benchmark numbers, much lower cost to run inference on. What's the short version of how Composer 2 works and what you guys did to make it so performant? We started from a very strong base, which is Kimi 2.5. It's like a 1 trillion parameter MoE that's 30B active, so very, very sparse, actually. We sort of looked at the stack and realized there are like
[06:47] two axes. So mainly Composer 1 was just pushing on one of these axes, which is reinforcement learning. But Composer 2 pushes on two different axes. One is continual pre-training and the other is reinforcement learning. So the thing that made Composer 2 very good is pushing in both of these directions. So we started off the training run by doing lots of mid-training on code tokens, almost pre-training scale actually. And then coming out of that mid-training run, [07:17] we took the checkpoints and we did very large scale RL on lots and lots of tasks. [07:23] Okay, and then the premise here would be because Cursor sits in the middle of so many interesting coding tokens, you actually pretty uniquely have access to data to be able to train at almost pre-training scale. Yeah. [07:33] Why not pre-train your own model then? We just think about our approach from the top down instead of the bottom up. So like, how do we get a model that's useful to users in the least time possible? If we were to start from the bottom, figure out how we do pre-training and then scale it up to mid-training, and then, OK, now we've figured out mid-training, how do we do reinforcement learning? That would take a very long time to get a model out to our users. [08:03] ...we were able to give a useful model to our users in very little time. So hopefully, you know, [08:10] the next Composer versions are going to be our own model instead of being based on an open source base. And what is the model roughly learning in the mid-training step? Yeah. What is the model learning in the post-training step for you? Yeah. So in mid-training, it's sort of just learning about libraries of code and learning about specific code patterns that are very common. Like just world knowledge as well. There is web data there as well.
[08:40] ...learning can sharpen on. And so during reinforcement learning, you know, the model gets to play directly with the Cursor harness. And so it gets to learn about the world the model is going to live in for the rest of its life, right, in some way. And so then during reinforcement learning, that's where it learns how to call tools properly, how to navigate its environment, how to write correct code. Because during mid-training, it learns how to write code. [09:10] ...to train on code that is largely only correct, but the model doesn't actually know how to differentiate between the two. While in RL, one of the key things that we are doing is we're kind of tuning [09:22] the feature of the model saying, hey, now you've got to [09:25] write correct code all the time. Interesting. And is the model after mid-training, is that [09:31] similar to the model that you guys have on tab autocomplete or is that a different [09:35] core competency? Yeah, I mean, it's, yeah, I think I would put it like that, because during mid-training, we are just doing next token prediction, you know, how well you predict the next token and then the token after that. So, yeah. So why not just post-train on your tab autocomplete model then? Why mid-train a different model? Yeah, I mean, tab is a very small model because it's a super low latency model. We want it to be very fast. So the core distinction between the two base models here is that tab is small and Composer is [10:05] quite large. I see. I see. Okay. So it seems like a lot of the focus of what you guys did for Composer 2 was this large scale reinforcement learning run. Can you break that down for us? Like, what goes into that? And what are the various hard problems you solved along the way?
[10:20] When you do RL, it's quite different from pre-training or mid-training because you're not just trying to predict the next token, you're actually [10:27] running the entire harness, like the entire experiment. You're letting the model act in the environment and seeing how it performs for a given rollout. That's the terminology, it's called a rollout. And you assign it a reward for whether it did something correctly or not, which might be using an LLM as a judge or maybe something verifiable, like does this code compile or something like this, which actually means that compared to regular training, you need a bunch of other components. Like you still need large-scale training, you still need to orchestrate [10:57] ...do all the stuff you do in mid-training and pre-training. But now you also need to orchestrate a bunch of environments. You need to run model inference because [11:06] when you do this rollout, you effectively run a real Cursor session in some sense, right? So a rollout is like a forward pass? No, a rollout is basically your entire agent session from Cursor, right? So it basically means it might take something like 50 turns. The model will take your initial prompt, then decide to call some tools, and you execute those tools. Then the model generates a bunch of other code. It's the entire session when you interact with the agent in Cursor, right? You kind of simulate this entire session [11:32] as a part of your training run, you get to a final reward, and you use that signal to go back to the trainer and incorporate it in the model weights. So you have this kind of [11:42] very big loop, update loop, which is very heterogeneous, right? Because you have all these different components working together. And now you're trying to orchestrate all of this to work efficiently and with high throughput, because GPUs are expensive and you want to get your model trained quickly in an economical fashion. 
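The rollout described above (a multi-turn agent session that ends in a single reward) can be sketched roughly as follows. This is a toy illustration, not Cursor's or Fireworks' implementation: `toy_policy`, `toy_environment`, `toy_reward`, and the tool names are hypothetical stand-ins for model inference, the sandboxed environment, and the reward function, and the reward rule is invented purely for the example.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Rollout:
    prompt: str
    turns: list = field(default_factory=list)  # (action, observation) pairs
    reward: float = 0.0


def toy_policy(history, rng):
    """Hypothetical stand-in for model inference: choose a tool call or finish."""
    return rng.choice(["read_file", "edit_file", "run_tests", "finish"])


def toy_environment(action):
    """Hypothetical stand-in for the sandbox that executes a tool call."""
    return f"result of {action}"


def toy_reward(rollout):
    """Hypothetical verifiable reward: 1.0 if the session ran the tests."""
    actions = [action for action, _ in rollout.turns]
    return 1.0 if "run_tests" in actions else 0.0


def run_rollout(prompt, max_turns=50, seed=0):
    """Simulate one whole agent session and score it once at the end."""
    rng = random.Random(seed)
    rollout = Rollout(prompt=prompt)
    for _ in range(max_turns):
        action = toy_policy(rollout.turns, rng)
        if action == "finish":
            break
        rollout.turns.append((action, toy_environment(action)))
    # One scalar reward for the entire session, handed back to the trainer.
    rollout.reward = toy_reward(rollout)
    return rollout
```

The point of the sketch is the shape of the loop: many inference calls and tool executions per rollout, but only one reward signal flowing back to the trainer.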
So that by itself is a very interesting kind of problem at the intersection of algorithms and infrastructure, because there are a lot of trade-offs in how you can
[12:09] kind of co-optimize and co-design the system. One aspect is what people call async RL or pipeline RL. The idea is basically, okay, you're trying to update this model in steps, right? So you have your current model version and you're trying to [12:22] do a bunch of rollouts with it. What does your trainer do while you're doing this rollout? The naive approach would say, okay, now I'm going to stop my trainer. I'm going to do a bunch of sessions, and those sessions might run for like five, 10 minutes or even longer if it's longer horizon tasks. I'm going to get those outcomes, and now I'm going to pause my inference and go back to training, trying to do updates. That's very theoretically, algorithmically robust because you are now precisely simulating everything, but it's very [12:52] ...do all the clever algorithmic tricks, allowing you to... What do you do instead? Yeah, you can pipeline all of this. So imagine this as a gigantic factory, right? You have this trainer building and you have a rollouts building. They're always churning, right? So rollouts always take the latest model version and try to do new sessions and simulate new agent sessions. And the trainer always takes new outcomes as they come and tries to compute updates. So everything is moving along all the time. [13:22] ...because now by the time you finish some test rollout in your simulated environment, maybe the model weights are already updated on some other data. So you have this staleness, like a delay in how quickly the model [13:35] can learn updates, because by the time you kind of
[13:40] process some interaction session with a simulated environment, your model weights have changed, and that introduces interesting training dynamics. And there are clever ways you can address this. But the flip side of that is that all your GPUs, all your compute is loaded and churning all the time, which actually means you're using more FLOPs, going back to your bitter lesson example. Yeah, you're [14:01] you have a higher compute efficiency, you can get to a better model in a smaller amount of time. Maybe you're losing a few percent from [14:07] being asynchronous and not doing perfect mathematical updates, but you more than compensate for that by effectively not leaving half your capacity on the table. And there is a lot of depth and interesting interaction in that part. And we're very serious about performance at Cursor because, unlike the big labs, you know, we have tens of thousands of GPUs, not millions. And so, yeah, we do all sorts of tricks to get the most out of the GPUs. Like, we even train in production with FP4. [14:37] ...well, because the thing about RL infrastructure is it's just inherently more complex than pre-training, because you need all the pre-training infrastructure, that's just one of the requirements. Then you need all the infrastructure to run these environments that have to mimic as closely as possible what a user's computer would look like. And it's very important that it's as close as possible, because sometimes the model can actually figure out when it's being
[15:07] ...different behaviors during RL [15:09] than in production. Are you seeing it being conscious that it's in a fake environment, and it starts [15:15] behaving differently? Yes. Interesting. Like it's like, "Oh, I'm in a fake environment. I've learned a few tricks to get the better reward in this environment. And let me try them out." Models love to cheat. RL is really good at encouraging cheating. Yeah. Yeah. And then we need really efficient inference. So this is really important. So there is actually this kind of myth that during RL you spend way more inference FLOPs than training FLOPs. This is sort of like [15:41] just because the open source inference engines are very unoptimized, [15:45] instead of actually being a property of RL. Roughly, the ratio is kind of the same. In theory, if you push the GPUs to the maximum, you should have one third of your training GPUs allocated to inference, right? Because training is effectively three forward passes. You have the forward pass, you have the data gradient, the weight gradient. While if you really hit the critical batch size on inference, you should only have a single forward pass worth of FLOPs. So that's why you guys use the Fireworks [16:15] inference engine? Yeah, I mean, the other alternative is we would build one in-house, but, you know, we have finite engineers like everybody else, and we would prefer to have engineers make training more efficient and more precise rather than spin up an inference effort, yeah. Okay, that's super hardcore. What about, I think you mentioned in your technical paper that you were doing this in a kind of globally distributed way. Why globally distributed, and then what makes
[16:45] ...large contiguous clusters are hard to find in the market. And so what we can do instead is we have one cluster that's going to run all of training. You know, we can't do a global training cluster. But then the inference component of reinforcement learning, we can globally distribute that across small clusters all over the world. [17:05] So I think for the Composer 2 run, we used four clusters in total that were all over the world, very far away from each other. And we even used some of our production traffic capacity when it was least used. So like we had Composer 1.5, the previous model, served. And when it was least used by people, we just grabbed some inference GPUs and we put them to speed up training. [17:35] ...training run without having one large contiguous cluster. And the thing that enables it, maybe Dima can talk more about it. Kind of like, to reformulate what Federico said, basically RL training is very heterogeneous, right? And by leveraging [17:49] heterogeneity, how different components differ in what infrastructure they need, you can actually drive efficiency. And you see this pattern kind of across the board everywhere. Specifically for training, you have all these highly interconnected clusters. You need a high speed network, they kind of need to work in lockstep. So those clusters are expensive, right? And actually it's really hard to find big ones, right? Basically at the scale at which Composer was trained, finding a 2x larger cluster is significantly harder than finding the current size one. And that's why if you can
[18:19] put them in different places. One, you don't need to find such a big cluster. Two, you can actually find different types of hardware, because for inference, [18:26] you don't need that kind of wide interconnect. You can have smaller groups of GPUs interconnected together. You can have heterogeneous types of GPUs. You can have different generations of GPUs. You can kind of play all these games of optimization. And finally, with inference, it's much easier to scale up and down as you go. And yeah, it's very convenient. Like, [18:45] when you have off-peak hours, you can view all your [18:48] inference pool as one set of GPUs, serving production traffic for real users, or serving simulated environments for RL purposes, and kind of balance between these. Of course, it's a very interesting systems problem. So as you mentioned, the Kimi model is like one terabyte. A training step takes somewhere between like five to 15 minutes. So it basically means like every five to 10 minutes, you are producing like [19:12] a one terabyte new snapshot of weights. So the question is, how are you going to ship it to a different cluster on the other side of the world very efficiently, right? And you want to do it quickly because, remember, you don't want this staleness to get out of hand. So I think that was probably one of the most fun parts which we figured out together, that despite, you know, the full model being like one terabyte, not all the weights change every step, right? Because RL does a lot of very precise adjustments, especially as the training goes on. So actually there are [19:42] very [19:43] regular patterns in which subset of weights gets changed. Maybe not all of them change every time. So if you were to look at how my model changes within one training step,
[19:53] after 10 minutes, there is a relatively small delta between those. You can write a compression algorithm which basically leverages this property. And now you end up with kind of a database systems problem, which is, okay, I have my delta and I just want to ship it across the world. My delta may be like 20 times smaller than shipping the full model. And that makes it practical. But of course, now you need to build all this machinery of storage systems, full snapshots and deltas and recovery and reconciliation, etc. We were able to build it [20:23] ...fashion, which basically means that you always end up with a bit-equivalent model on the other side, so you don't need to worry about any math aspects of this. And you can do it really fast. You can do it in under a few minutes, even in the worst conditions. Usually it's under a minute, and most importantly, you pause only for maybe 30 seconds to swap the weights in your actual inference. We also [20:44] fully saturated the egress of the cluster by sharding the upload and the download as well. So you can do all these system tricks to bring this time down. It is quite a bit of [20:55] complexity, but you can kind of abstract it out and just make it work great. Like it doesn't interfere with your training algorithm. And on the flip side, you have this [21:03] kind of power to disaggregate, to leverage other clusters to do all that. And that kind of goes against the conventional wisdom of how you should do RL infrastructure, because the conventional wisdom is, okay, you're going to have this one really huge cluster connected with RDMA, and it's going to be very expensive, and you're going to probably, you know, allocate like one third to training and two thirds to inference. And sure, if you have a very expensive network, it's much easier to copy this one terabyte quickly, but now we have like three times larger
[21:33] ...engine is more optimized, then maybe you're going to save one third of that cluster in terms of GPUs anyway, because you're just more efficient. And you can take half of this cluster somewhere else, maybe on cheaper hardware in a different region. So your cost comes down quite a bit. I love that you guys are just grinning as you describe this, because it's so hard. And this is like a systems engineer's dream, right? And so it's an amazing, amazing system you guys have built. We spent a bunch of nights working on this. You look like you've spent a long time, a lot of time together. [22:03] ...that Kimi is a very large, sparse MoE model. Does that make the RL run tricky in any way? Yeah. How so? Well, when you do inference, you're essentially doing a forward pass. It's just kind of autoregressive. And in this forward pass, it produces log probabilities of the tokens it has sampled. When we ship the generations of the model back to the trainer, we have to rerun the forward pass because, as we mentioned, [22:33] ...training. So the model that has produced the pass may have actually been a few steps behind where the trainer is at. And so we have to rerun that forward pass and reproduce the log probabilities. Now, [22:46] the problem is, in theory, [22:49] these log probabilities should be exactly the same if it's the same model version. But even with the same model version, you get slightly or sometimes [22:59] very different log probability values for the same tokens. So this is often called a numerical mismatch for inference. You hear about this all the time these days. Why is that? Why does that happen? I mean, primarily because...
[23:14] fundamentally floating point arithmetic, which is doing this, is non-deterministic. So if you're... Sorry, floating point arithmetic is non-deterministic? So, you know, we learned in school that if you take A plus B plus C, right, and C plus B plus A, it's going to be the same result. If you're doing this with integers, with [23:30] whole numbers on the computer, that's always going to be true. If you're going to do it with floating point numbers, which are actually approximations of numbers, you have this mantissa and exponent, et cetera, A plus B plus C and C plus B plus A are going to give you different results, or even like A plus B and B... So basically, fundamentally it's the accumulation order of all these operations. What models do is basically multiplications and additions, and the addition order matters to your final result. It's all small differences, but [24:00] when you do inference on models, usually it doesn't matter that much, because once you pre-train your model, it's actually pretty robust. If you flip some bits, it's still going to produce good results, your benchmarks aren't going to change. But in RL in particular, because you're using this very, very weak signal to teach the model, the noise from these numerical differences can make or break your training. And that's particularly important. And again, it's an interesting intersection between [24:30] ...work in practice. There are ways you can drive this difference to [24:34] pretty much zero. There are all these batch invariant ways. You can be very, very careful and write all your GPU kernels so they always add numbers in the same order. So you always do A plus B plus C and not a different order. It's possible, but it always has trade-offs. Basically, your system becomes maybe 2x or 3x slower. Again, it becomes an interesting trade-off like, okay, what is the 10% of slowdown which we can take? Or in fact, it's actually a few percent of
[25:04] Thank you. [25:04] We find together this reiteration and... [25:08] You mentioned that particularly for MoEs, sparsity is hard. The reason for that is that the way MoEs work is that you would take your activations at every layer and you would run them through a gating layer, which basically decides, okay, for this token, out of 384 experts, I'm going to run these eight. So it's going to do some math and take the top eight scores. Those eight experts are going to be activated. The other ones will not be activated for this token. This operation amplifies your small numerical differences quite a bit, because maybe your [25:36] hidden states were like [25:37] different by like the fifth [25:39] digit after the dot, [25:41] which doesn't really matter, but this difference made it so you picked expert number seven versus expert number nine at the cutoff, and suddenly you went and [25:51] activated a totally different part of the model, and the difference got amplified quite a bit. And MoE models, by definition, are [25:57] more sensitive to this mismatch. Again, when you do inference or when you do kind of regular eval, it usually doesn't matter and it averages out. But now if you're trying to make this model learn, this difference is huge because your inference [26:10] activated expert number seven. Now in your training, you're trying to update expert number nine, which didn't even contribute to that during inference. So were you guys handwriting GPU kernels then to help get around this problem? Yes. So you can, again, address all of this through GPU kernels, and there's always a trade-off. Specifically for MoE, you can do this interesting trick, which people call router replay. Basically, you can have your inference just pass extra information to training and say that, hey, I activated expert seven for this token.
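The two effects described above, addition order changing a floating point result and a tiny score difference flipping which experts are selected, can be reproduced in a few lines. This is a minimal sketch with made-up router scores; `top_k` and `route` are hypothetical helper names, and the `replay` argument only illustrates the router replay idea as it is described here, not any particular framework's API. Strictly speaking the arithmetic itself is deterministic for a fixed order; what varies between kernels is the order of accumulation.

```python
# Floating point addition is not associative: grouping changes the result.
a, b, c = 0.1, 0.2, 0.3
left_to_right = (a + b) + c
right_to_left = a + (b + c)
order_matters = left_to_right != right_to_left  # True for these values


def top_k(scores, k):
    """Indices of the k highest router scores, highest first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def route(scores, k, replay=None):
    """Pick experts for one token.

    If `replay` is given (expert ids recorded during inference), reuse it
    instead of recomputing top-k, so the trainer updates the same experts
    that actually produced the token.
    """
    return list(replay) if replay is not None else top_k(scores, k)


# Made-up scores for four experts: the two engines differ only in the
# fifth decimal place for experts 1 and 2.
scores_inference = [0.90, 0.50000, 0.49999, 0.10]
scores_trainer = [0.90, 0.49999, 0.50000, 0.10]

experts_inference = route(scores_inference, k=2)
experts_trainer = route(scores_trainer, k=2)
experts_replayed = route(scores_trainer, k=2, replay=experts_inference)
```

Without replay the trainer selects a different expert than inference did for the same token; with replay the selection is forced to match, at the cost of shipping a few extra integers per token.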
[26:40] ...one integer saying that, okay, this is the expert which activated. So the trainer can be aligned with that. And a lot of this numerical alignment is basically doing tricks like that, matching quantization levels, matching kernels, et cetera, to drive the divergence between the training and inference implementations down. And that makes a huge difference between, you know, your run maybe diverging completely or being multiple x less compute efficient, because you'll need much more data to address this mismatch. I'd love to maybe chat a little bit more about the RL kind [27:10] ...say a word about the reward signal y'all are using? Is it, like, are you... Okay, can't say. Got it. Top secret stuff. Top secret stuff. Okay, that makes sense. It seems like there's almost the equivalent of learning in sim, these simulated rollouts, versus, you have so much actual user data that you could be learning on. Why not just do RL on your actual user data and your actual user harness versus doing this in sim? Yeah, we're also doing that. So that's [27:40] ...inference in sync with Fireworks to do this. We find user signals where the user was happy or sad about a particular model generation. And we are able to update that model live [27:55] and then ship a new version of the model continuously every few hours. We're working on decreasing that time. Actually, at some point, we'll have to increase that time, because as the horizon of the model gets longer and longer, we'll have to re-extend that time. It's like an interesting play. Like right now, we are trying to decrease the time for stability, because we were figuring out the right hyperparameters. And then after we've figured it out, we have to re-extend it again just because we want to lengthen the horizon of these models. Yeah.
[28:25] ...of the kind of pre-training simulated RL. You have so much actual user data. I imagine that's just much more valuable to train and tune on. Like why not just go straight to the online RL step? Why do you have to do the offline RL? The online RL currently is pretty inefficient. We suffer from this problem that the GPUs are offline for a long time, [28:45] essentially. And besides that, there are also different trade-offs, both in terms of efficiency and user experience. If you do simulation, you actually do multiple rollouts from the same prompt. You effectively take a task and you ask a model to do 16 tries at the task, 128 tries at the task, like different rollouts from the same prompt. Some of them are going to [29:03] go well, some of them are not going to go well. And by doing multiple rollouts in parallel, you are able to get a much more precise signal. Maybe a model is very good and it does it well 90% of the time. Maybe it's not very good. Losses like GRPO, group relative policy optimization, work by doing multiple rollouts at the same time. If you're doing online, you have only one rollout coming back. [29:33] ...it's not [29:34] it's not too bad, right? I mean, you just, you know, maybe you spend some time on the GPU. If it's the actual user, you have a much higher minimum bar on this, because effectively you're doing A/B tests, right? So if the model produces something weird, that's a bad user experience. Yeah. Okay, so you can go off policy more often when it's not a real user, because you can experiment with crazy things without affecting the user experience. Yeah.
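To make the "multiple rollouts from the same prompt" point concrete, here is a minimal sketch of the group-relative advantage used in the commonly published form of GRPO: each rollout's reward is compared against the mean of its own group and scaled by the group's standard deviation. The function name, the `eps` constant, and the example rewards are illustrative assumptions; the conversation does not say which exact variant was used for Composer.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each rollout relative to the other rollouts of the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four simulated tries at the same task: two succeeded, two failed.
simulated_group = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A single live session has no group to compare against, so the
# group-relative signal collapses to zero.
single_online_rollout = group_relative_advantages([1.0])
```

With a group, successful tries get a positive advantage and failed tries a negative one. With only one rollout, as in the live setting described above, there is nothing to compare against, which is one way to see why simulation gives a more precise signal.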
[29:56] You can do a lot more rollouts. You can do GRPO. And then you can basically, like, bootstrap some level of performance that's good enough to even put in front of users. Yeah, like we teach reasoning through, like, the offline RL, which is actually, like, called online RL. Offline RL is more like a DPO kind of technique. The sort of REINFORCE kind of RL is online. And then there we, like, teach the reasoning to the model. We give it some kind of input of the behavior it should have. [30:26] We teach it tool calling, and then we put it live to users. Because you could imagine, like, if the model is bad, users don't want to use it, and they're not going to give us any feedback, right? So the model has to meet some kind of bar to even, like, be put into online RL. Like, we want to be really happy with the model, and this is the model we ship. That's kind of the paradox of online RL, or how we like to call it, real-time RL, is that, you know, we can't use this to really, like, create the model from scratch, because users [30:57] need to be using the model, and so it has to be good already, and we can only make it better. Yeah, yeah. [31:02] It's kind of like the cherry on top to really get this super delightful experience. Yeah, totally. Hopefully one day it will be, like, a big, big cherry, you know? [31:12] Yeah, Dan Roberts presented at our conference last year. I think you were there. [31:16] Traditionally, it was the big cake and the little cherry. Yeah, the Yann LeCun cherry, yeah. Little cake, big cherry. Yep. I'm curious, the Andrej Karpathy line of like...
[31:26] Right now, RL is still super inefficient. You do a big, big, long rollout, and then you kind of get a little bit of information at the end, and it's still, I think, slurping bits from a straw. What do you think? Have you been able to figure out how to get more bits out of that path? I can't talk about that. Okay, okay, got it. [31:43] We're back on the secret stuff. Good. That's how I know I'm asking the right questions. [31:48] You mentioned the rollouts are a few minutes at a time. It seems like the whole field is pushing towards making long-horizon agents, agents that can work for a long period of time, uninterrupted and generally not failing. I love the METR scaling charts. What goes into the RL process to try to get the agent to run for longer? Several things. So one problem about [32:13] The longer the trajectory is, the harder it is to do credit assignment. So you can imagine we are giving thumbs up, thumbs down to the model right at the end of its work. And to simplify the problem, the model asks itself, okay... [32:29] Where did I do it right and where did I do it wrong? That's basically the problem called credit assignment. It gets harder as this gets longer, so you have to do a bunch of tricks there. The other problem is just, like, you run out of space, right? Like, these models have a finite context window, and at some point they're going to reach that. So actually the way we solve this at Cursor is we put compaction inside the RL loop. So we call this self-summarization.
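The control flow of compaction can be sketched in a few lines of Python. This is a toy under stated assumptions, not Cursor's implementation: `count_tokens`, `run_with_compaction`, `toy_step` and `toy_summarize` are hypothetical names, the "model" is a stub that always emits ten tokens, and nothing here performs the RL training that would teach a real model to write and use good summaries.

```python
def count_tokens(text):
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())

def run_with_compaction(task, step_fn, summarize_fn, context_limit, max_steps):
    """Run an agent loop, restarting the context from a summary when it fills up."""
    context = [task]
    compactions = 0
    total_tokens = 0
    for _ in range(max_steps):
        output = step_fn(context)            # one model turn (stubbed below)
        total_tokens += count_tokens(output)
        context.append(output)
        if sum(count_tokens(c) for c in context) > context_limit:
            summary = summarize_fn(context)  # the model summarizes its own work
            context = [task, summary]        # fresh window: task plus summary
            compactions += 1
    return context, compactions, total_tokens

def toy_step(context):
    return "edit " + "tok " * 9              # always 10 tokens

def toy_summarize(context):
    return f"summary of {len(context)} messages"

final_context, compactions, total_tokens = run_with_compaction(
    "fix the failing test", toy_step, toy_summarize, context_limit=50, max_steps=100
)
```

The point of the sketch is only that the total number of generated tokens can far exceed the window size, because each compaction resets the window to the task plus a summary.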
[32:59] how to continue and go on forever. So in practice, our model is like a 200,000 context window model. [33:07] But in reality, it can go on for millions of tokens. And just because of this ability that it can summarize its work and then take that summary to restart its context window while still trying to accomplish the task. And through RL, because RL pushes the model to do... [33:25] things correctly towards the goal. At the same time, jointly, we are training the model to produce a good summary. And then we're training the model to listen to that summary very well at the same time. And so this is kind of like a continuation to reasoning almost, I feel like. I find it fascinating because usually context management is part of the harness, right? In this case, you're effectively co-optimizing how... [33:48] part of the harness and the model itself work together, and throwing all of that in the optimization loop. And we've seen this again and again in AI. They're like, [33:56] The more you throw compute at the problem, the more you can solve the problem end to end. The magic of compute and the bitter lesson works, and you get a much better system which can work together. Totally. Totally. Do you think every company is going to be RLing their own harnesses? [34:11] Like, do you think that every company has the same type of problem as Cursor? [34:15] If they are using AI and they're producing lots of tokens and they have a product to optimize against, I think it's the right move and the right direction to train models. Yeah, yeah. Interesting. Interesting. And so it seems like most of the reinforcement learning you guys did then was on the...
[34:33] kind of like the harness slash tool use part rather than on the [34:37] get good at, you know, completing the next token for code. [34:40] Is that roughly the pattern that other founders should have in mind when they're trying to think about where should I use reinforcement learning? So if you're trying to get an agent to perform tasks with tools over a long horizon, you need RL. If you're trying to create a model that's good at summarization or next token or whatever, you probably don't need RL. Is that a good framework for when you need RL? I think RL fits everywhere. So even for Tab, we use RL. Personally, this is just my theory. It's not backed up by anything. [35:10] just ingesting the totality of human knowledge. Let's say you're training a model for math. The model sort of, like, learns all the math on Stack Exchange. The model, when it's presented with a math problem, and this is a model that hasn't gone through RL, the model needs to wonder what kind of person it is. Is it the expert, or is it the student that's trying to learn? And so one of the things [35:40] You are the expert. You need to do things correctly. So that's, like, one thing that happens, is we are sharpening this distribution. Sort of like, RL has a few phases. [35:49] So there is the very first phase where the model learns and becomes... [35:53] very good very quickly. And then there is, like, a second phase where, like, it takes a lot of compute to continuously improve the model, and, like, you see the model start reasoning and have this pattern. So in the very first phase of the curve, I think that's where we're just tuning the knob
[36:11] telling the model, hey, you should do things correctly here. And so RL in the small-compute case is also very useful just to let the model know that it has to do things correctly. That's sort of like my case for this. [36:24] Yeah, I mean, I second that. I mean, we see this pattern across many of these cases. You know, we've helped with RL fine-tuning in general for many customers, and we see this usually kind of continuous [36:35] mid-training, regular supervised fine-tuning is... [36:39] Simplifying, you can say it's transfer of new knowledge in an abstract way. And RL is sharpening the behavior or particular qualities you would want from the model. And usually you end up needing both. And even to your example of summarization, actually RL may be very useful for this, because sometimes if you want a particular style out of summarization, it's really hard to come up with [37:00] examples of, like, good and bad summarization, etc. Like, really describing this precisely. But if you use, for example, LLM as a judge, right, you can actually state very precise rubrics, you can kind of [37:11] prompt eval saying, like, okay, [37:14] these are the criteria for how I'm going to evaluate whether summarization is good or not, throw it into the RL loop, and let the model kind of experiment with different summarization styles, figure out what you actually want from it, while maybe another LLM kind of evaluates it, whether it's matching a particular rubric or not. And that's kind of the type of pattern which you see a lot, not just in coding. I see. Okay, I'm going to ask this question to Dima, because Federico is going to plead the fifth. You mentioned LLM as judge a couple times. Do you think that ultimately companies will be more successful
[37:44] hand-examining RL rollouts and, you know, hand-coaching the model behavior in some way? Or do you think LLM as judge, other automated rubrics, are likely to get us there? You don't really, like, put experts directly in judging RL rollouts. I mean, that would be some kind of, like, I mean, real-time RL, if it's actual users, or, like, some form of, I don't know, like, RLHF or DPO. I mean, generally, the more verifiable your reward is, the better, because it allows you to, like, scale the compute and just get a better outcome. In some cases, and by verifiable, it basically means, like, [38:14] can you automatically produce it without the human. Of course, if it's, like, math or coding and you can craft something, like, very deterministic, that's the best. The reason why LLM as a judge [38:23] works is that it's actually kind of like the generator-discriminator distinction. Like, it's much easier to judge. I mean, it's the same for humans, right? It's easier to judge than to create the slam of ec yeah no no implication there. But yeah, it's much easier to judge, and you can craft precisely, like, the different criteria by which you want to rank some answer. And you see this pattern where you might have, like, very complicated [38:48] evals from multiple aspects, right? Because if you dump multiple aspects onto a single LLM, it might get confused how to judge, right? Like, you might break it down: okay, you're going to judge rubric-based, based on style, based on, like, some different aspects, based on factuality, kind of really craft these rewards. Some of them will be deterministic, some of them will be LLM-based. And that's what guides the model behavior. Then you just [39:11] turn on more compute and see the graph go up. Do you think that we're going to see RL be more effective in the harder-to-verify domains? Like, do you think LLM as judge is sufficient? That's one of the techniques you would start with, right?
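The idea of breaking a reward into separate rubric scores, some deterministic and some judged by an LLM, can be sketched as follows in Python. Everything here is an assumption made for illustration: the function names, the weights, and in particular `stub_style_judge`, which stands in for a real LLM-as-judge call and simply checks for a trailing period.

```python
def within_length(summary, max_words=20):
    # Deterministic rubric: cheap and fully verifiable.
    return 1.0 if len(summary.split()) <= max_words else 0.0

def stub_style_judge(summary):
    # Placeholder for an LLM call that scores ONE narrow rubric (style) in [0, 1].
    # A real judge would send the rubric text and the summary to a model.
    return 1.0 if summary.endswith(".") else 0.5

def combined_reward(answer, checks, weights):
    """Weighted average of per-rubric scores, each in [0, 1]."""
    total_weight = sum(weights[name] for name in checks)
    score = sum(weights[name] * checks[name](answer) for name in checks)
    return score / total_weight

checks = {"length": within_length, "style": stub_style_judge}
weights = {"length": 1.0, "style": 1.0}

good = combined_reward("The patch fixes the flaky test.", checks, weights)
bad = combined_reward("word " * 30, checks, weights)
```

Keeping each rubric in its own function mirrors the point made above that a single judge asked about many aspects at once may get confused, while separately scored aspects can be weighted and inspected individually.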
Ideally, you want to figure out what is the actual outcome, what is the actual metric you want to get, right? So kind of trying to approximate this with an LLM is one way. Trying to get bigger simulated environments is another, right? Like, if you can simulate more of your product, if you can simulate more of your environment, usually you have, like, a final metric which you care about.
[39:41] capture this. That's great. And to your point about experts, I mean, [39:45] experts are still needed, right? Because crafting this task and actually encoding the product [39:51] experience you want, that's what matters, right? We went through software 1.0, 2.0, 3.0, right? Instead of crafting software directly, we went to crafting training data. Right now, you're effectively [40:02] crafting the evaluation rules, but that's still very important. You need to look at examples, you need to look at the data, you need to look at, like, [40:08] where your product fails and how to nudge the model [40:12] toward the right behavior. I want to ask about RL environments, which is maybe related to what you were talking about. It seems like there's been a huge... [40:18] explosion in just the revenue scale that some of these RL environment companies are reaching. What do they provide that's actually useful? Because I think Cursor, for example, you have so much [40:29] data on, like, how your customers are actually... [40:32] using your environments. What do the RL environment vendors offer you on top of what you already have? [40:38] Yeah, we don't actually use any of the environment vendors. [40:42] I think it's very difficult to construct working environments. [40:46] It's a valuable product for people that do not have access to... [40:51] However, for coding particularly, there is a very large... [40:57] amount of work in coding environments available to everybody. That's GitHub, right? You can go in and maybe, like, you can have a model, like, just install all of the dependencies for a repository. And that's like a [41:08] working environment. I think a lot of the difficulty comes from the infrastructure as well. So you can imagine that an environment that works well for a particular task may need services. You're making a change that's, let's say, a database migration.
[41:25] To test that it's actually working, you need the database up, right? And so those kinds of things are very tricky. I think these environment companies are quite helpful for that kind of stuff. There are kind of two aspects to this, right? First, if you look at frontier labs, right, they're trying to build... [41:40] a generic model which is good at everything. So they need to cover all these different tasks underneath, [41:46] package them up in one model, and kind of encourage it to generalize. So that's kind of one part, and that's very helpful. In cases like Composer, you have your actual product. And I think that's what it also kind of video with Fireworks. Yeah, if you have your actual product, you should do RL against it. The most powerful environment is your own product. Exactly, because that's where your model actually will be used. And of course... [42:09] If you're a frontier lab, you're not gonna... [42:10] do it across all the products, right? But if you're trying to build the best model for your product, specialize and tailor it, you should just use your production environment. Of course, you want to isolate it properly, right? You don't want the model to wreak havoc on your production database. You want to clone it, etc. And there are some tools, [42:27] from general infrastructure, which make it easier. But generally, you want your [42:32] RL environment to be as close to real production as possible. And that's what, you know, [42:37] As an example, we see it's just... [42:39] If you look at kind of toy RL examples, the toy RL frameworks, they always start with, like, oh, there's this toy environment, I'm gonna spin up a Docker container and run everything in it, which is great for toy examples, if you're trying to teach a model how to play Atari or whatever. But if you've actually transitioned to, like, professional cases,
[42:56] you can't just [42:57] put your real production application in the Docker container. And we found it pretty early ourselves, like working with Minivox. Like, in the case of Cursor, the trainer is on their side; for some other customers, we run the trainer on our training platform. But for environments, we actually default to running them on the customer side, because that's where the actual implementation is. And you effectively have the same setup of the trainer, even if it's part of the Fireworks platform or on the customer side, calling the actual production environment, not trying to kind of wrap it and componentize it. [43:27] Yeah. On the hosted platform, because that's really hard and that introduces differences. Yeah. Like, I mean, what we call RL environments is really three components. One is the harness. [43:37] So the harness is where the model can submit tools and the tools get executed. And the second thing is, let's call it the operating system. So what is the actual world and state that the model is interacting with? And then there is the reward component, [43:57] which needs to check at the end that the work is done correctly. And generally, the harness is pretty portable. You can take the harness and put it in many different environments. The thing that's key is the operating system. And to replicate this, [44:11] just normal containers [44:13] don't really work very well. So at Cursor, we actually built, like, a whole virtual machine stack. And so we can spin up, like, virtual machines really quickly. And it has to be super bursty, because you can imagine, like, we are asking this system, please give me 100,000 virtual machines now. And it has to all come up. And yeah. Awesome. I really enjoyed this conversation today. I think Cursor is such an inspiration in what you all are doing as a company
[44:43] Model Lab. And I think the work you do with Composer 2 really leads that charge. So really special to hear about it. And then Dima, really cool to hear about the hardcore infrastructure problems actually that the two of you solved together in the trenches over many, many late nights to make it all possible. So thank you. Thank you guys for joining today. Thank you so much for having us. Thank you. [45:13] Thank you.