Fireworks Founder Lin Qiao on How Fast Inference and Small Models Will Benefit Businesses

In the first wave of the generative AI revolution, startups and enterprises built on top of the best closed-source models available, mostly from OpenAI. The AI customer journey moves from training to inference, and as these first products find PMF, many are hitting a wall on latency and cost. Fireworks Founder and CEO Lin Qiao led the PyTorch team at Meta that rebuilt the whole stack to meet the complex needs of the world's largest B2C company. Meta moved PyTorch to its own non-profit foundation in 2022, and Lin started Fireworks with the mission to compress the timeframe of training and inference and democratize access to GenAI beyond the hyperscalers to let a diversity of AI applications thrive. Lin predicts when open and closed source models will converge and reveals her goal to build simple API access to the totality of knowledge.

Hosted by: Sonya Huang and Pat Grady, Sequoia Capital

Mentioned in this episode:
PyTorch: the leading framework for building deep learning models, originated at Meta and now part of the Linux Foundation umbrella
Caffe2 and ONNX: ML frameworks Meta used that PyTorch eventually replaced
Conservation of complexity: the idea that every computer application has inherent complexity that cannot be reduced but merely moved between the backend and frontend, originated by Xerox PARC researcher Larry Tesler
Mixture of Experts: a class of transformer models that route requests between different subsets of a model based on use case
Fathom: a product the Fireworks team uses for video conference summarization

Published: Aug 13, 2024
Uploaded: Jun 11, 2026
File type: Podcast
Queried: 00

Full transcript


AI-generated transcript with timestamped sections.

0:00-1:36

[00:00] We thought [00:02] replacing [00:03] other frameworks as a library with PyTorch is going to be simple. [00:07] Is it? [00:07] Just swap the library. How hard can that be? [00:10] We thought it was just a six-month project. It turned out to be a five-year project for us to support Meta's entire AI workload [00:21] building on top of PyTorch, because we had to rebuild the whole entire stack [00:26] from scratch, from the ground up, [00:28] because we had to think about how to load data efficiently, how to distribute inference in PyTorch efficiently, how to scale training efficiently. And then we ended up rebuilding the whole entire inference and training stack on top of PyTorch. Okay. [00:46] When we left, it was sustaining more than 5 trillion inferences per day. [00:50] So that's a kind of massive scale. [00:53] by two to five years. And the Fireworks mission is to significantly accelerate time to market for the whole entire industry, compressing it from five years to five weeks or even five days time to market. So that's our mission. [01:07] *music* [01:24] Joining us today is Lin Qiao, founder and CEO of Fireworks. [01:28] Lin is an AI infrastructure heavyweight who previously led PyTorch at Meta, which is the backbone of the entire global machine learning ecosystem.

1:36-3:08

[01:36] She's taken her experiences at PyTorch in order to build Fireworks, an inference platform for generative AI. [01:42] We're excited to ask Lin about the market trends behind AI inference and how she plans to support and even accelerate the market shift to compound AI systems at Fireworks. [01:53] We're thrilled to have Lin, CEO and founder of Fireworks, with us today. Thanks for joining us, Lin. Thanks for having me. [02:00] We're really excited to talk about a lot of things with you today, from PyTorch to the small model stack that you're building [02:07] to what you're seeing in terms of enterprises building production deployments. But before we get there, [02:11] can you maybe say a sentence or two on what you're building at Fireworks? [02:15] Yeah, so we started Fireworks in 2022. And Fireworks is a SaaS platform, first and foremost, for generative AI inference and high-quality tuning. Especially using our small model stack, we can get to very low latency for real-time applications, very low cost for sustainable business growth, and [02:37] customization, automated customization for tailored high quality for enterprises. [02:42] So that is Fireworks. [02:44] Wonderful. [02:45] I want to maybe start with the PyTorch story. PyTorch is kind of the foundation upon which the entire AI industry runs today, [02:54] and you and Dima and some of your other co-founders [02:57] were integral and leaders of that project at Meta. [03:02] So think about PyTorch as a programming language for [03:05] digital brains.

3:08-4:39

[03:08] And it's designed for researchers to very easily create those digital brains and experiment with them. [03:17] But the challenge of PyTorch is [03:19] it's very fast for people to create various different deep learning models, the digital brains, but [03:25] the brains don't think fast enough. So that's a challenge I took on to address while I was head of PyTorch. [03:32] And you mentioned... [03:35] You've mentioned before that most of the companies that are trying to build something similar to what you're building in Fireworks have chosen to be framework agnostic, [03:42] whereas you very much made a big bet on PyTorch. [03:44] Can you say why make the big bet on PyTorch and what benefits that brings to your customers? [03:49] Thank you. [03:50] That is really based on [03:53] what I saw [03:54] when I operated PyTorch at Meta and also across the industry. And I clearly see a funnel effect: because PyTorch started as a tool for researchers, it started to dominate the top of the funnel [04:08] for model creation. [04:09] And then the next stage of the funnel is people [04:13] doing applied research [04:15] and production work. They take those [04:17] research models and test them out [04:19] in a production setting, try to validate that hypothesis, and then feed into production. So that's a clear funnel effect that's happening. [04:28] And as PyTorch is the de facto for researchers, it takes over the top of the funnel. And it's really hard for people to rewrite into other frameworks for production. And naturally, it just flowed

4:39-6:25

[04:39] down towards the bottom of the funnel. And that's how PyTorch became dominant. And I started to see [04:45] more and more models, especially more nascent models, are all built in PyTorch and run in PyTorch in production, including the generative AI models. That's why we only bet on PyTorch and we don't [04:57] want to distract ourselves to support other frameworks. So researchers like it and it flows downstream from there. What do researchers like so much about PyTorch? [05:04] Simplicity. [05:06] Simplicity scales. [05:08] And that's kind of the lesson learned [05:11] through the journey of PyTorch at Meta [05:14] and also building the community. [05:16] It is a relentless [05:18] journey [05:19] to focus on simplicity. And we are constantly seeking how to make the user experience simpler and simpler, hiding more and more complexity in the back end. [05:28] For example, when I started this journey, [05:31] at Meta there were three different frameworks: [05:34] Caffe2 for mobile, ONNX for server-side production, PyTorch for researchers. Too complicated. [05:40] And the mission was to reduce three frameworks into one, to simplify. [05:46] But it was actually a mission impossible. After I consulted all three teams, there was no consensus on how to simplify and build this one stack. And we took a very idealistic approach: [05:59] take the PyTorch front end and take the Caffe2 back end. And we said, we're going to zip them together. It seems simple, but it's very hard to do because these two frameworks were never designed to work together. And the integration complexity is even much higher than building a framework from scratch. So too complex. And then we said, forget about it. We're going to go all in on PyTorch.

6:25-7:55

[06:25] Keep its beautiful, simple frontend and rebuild the backend. [06:30] So we built TorchScript, that's PyTorch 1.0. So that's really the key: focus on simplicity wins over time. The other interesting thing is we thought [06:42] replacing other frameworks as a library with PyTorch is going to be simple. [06:47] Is it? [06:48] Just swap the library. How hard can that be? [06:50] We thought it was just a six-month project. It turned out to be a five-year project for us to support Meta's entire AI workload [07:01] building on top of PyTorch, because we had to rebuild the whole entire stack [07:06] from scratch, from the ground up, [07:08] because we had to think about how to load data efficiently, how to distribute inference in PyTorch efficiently, how to scale training efficiently. And then we ended up rebuilding the whole entire inference and training stack on top of PyTorch. [07:26] When we left, it was sustaining more than 5 trillion inferences per day. [07:30] So that's a kind of massive scale. [07:33] by 2.5 years. And the Fireworks mission is to significantly accelerate time to market [07:40] for the whole entire industry, compressing it from five years to five weeks or even five days time to market. So that's our mission. [07:47] Maybe when you look at open source standards, [07:50] there's a lot of people that are trying to do it using vLLM or [07:54] TensorRT-LLM.

7:56-9:32

[07:56] How do you think about how Fireworks compares to what's in the open source? [08:00] I really like both projects. And because my heart is in open source based on, yeah, my PyTorch experience, I would say both projects are great projects for the community. [08:12] I think our biggest [08:14] differentiation is, [08:15] first of all, Fireworks off the shelf [08:19] is faster [08:20] than both of the offerings. And second is [08:25] we're building a system, we're not just a library. And our system [08:29] can auto-tune towards [08:32] our developers' or enterprises' workload [08:35] to be [08:36] much, much faster [08:38] and to be much, much higher quality. And that cannot be achieved by just a library. [08:43] And we are building all this complexity, [08:46] back again to [08:48] our journey of PyTorch: we are [08:50] providing a very simple API but hiding a lot of automation, the complexity of automation, the complexity of auto-tuning, behind the scenes. For example, [09:01] when we deliver our inference with high performance, and high performance here means low latency and low cost, we handwrote [09:10] CUDA kernels. [09:12] We [09:13] implemented distributed inference across nodes and disaggregated inference across GPUs, where we chop models into pieces and scale them differently. We also implemented semantic caching, where given the content, we don't have to recompute. And this is a very important thing.
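A minimal sketch of the semantic caching idea mentioned above ("given the content, we don't have to recompute"). This is an illustration under stated assumptions, not a description of the Fireworks implementation, which the episode does not detail. The names `SemanticCache`, `embed`, and `cached_generate` are hypothetical, and the bag-of-words `embed` function is a toy stand-in for a real embedding model.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase bag-of-words counts.

    Assumption for illustration only; a real system would use a learned
    embedding model here.
    """
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Reuses a stored response when a new prompt is similar enough."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # hypothetical similarity cutoff
        self.entries = []           # list of (embedding, response) pairs

    def lookup(self, prompt):
        query = embed(prompt)
        best_score, best_response = 0.0, None
        for embedding, response in self.entries:
            score = cosine(query, embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def store(self, prompt, response):
        self.entries.append((embed(prompt), response))


def cached_generate(cache, prompt, generate):
    """Return (response, was_cache_hit); call the model only on a miss."""
    hit = cache.lookup(prompt)
    if hit is not None:
        return hit, True
    response = generate(prompt)
    cache.store(prompt, response)
    return response, False
```

The trade-off sits in the threshold: a lower value saves more compute but risks returning an answer to a question that only looks similar, which is one reason such caching would need tuning per workload.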

9:32-11:04

[09:32] we capture application [09:34] workload patterns specifically and build them into our inference stack. We... [09:40] Yeah, we have many other [09:44] optimizations we have specifically designed for different use cases that are not [09:53] general purpose or horizontal. So that is being encapsulated. We also have [09:59] complex [10:00] optimization for quantization. You can think, oh, quantization is just one technology, how hard can that be? But you can quantize so many different things. You can quantize the KV cache, you can quantize weights, you can quantize communication across GPUs and across nodes, and they yield different performance gains and quality trade-offs. [10:20] We also automate [10:23] quality optimization. There are many things we are doing behind the scenes to deliver a very simple experience to [10:30] the app developers, so they can concentrate their cognitive bandwidth on innovating on the application side. [10:37] I liked your comment earlier about simplicity scales. [10:41] And as you're talking through everything that you've built to make this such a simple and delightful experience for your customers, it reminds me of the idea of [10:49] conservation of complexity, you know, like the amount of complexity required to deliver any given task [10:54] can be neither created nor destroyed. It's just a question of [10:58] who takes the complexity. That's right. And it feels like yours is a business where you have embraced an enormous amount of complexity...

11:04-12:38

[11:04] to make life simple for your customers. [11:06] Actually, my question is about your customers. [11:09] Where in the AI journey of your customers, [11:12] where in their AI journey do they say, wait a minute, we need something better? And then what brings them to you? [11:18] Yeah, so we've seen a [11:20] pretty consistent pattern. [11:22] Last year, [11:24] people all started from OpenAI, [11:27] because they were in heavy exploration mode. [11:31] Many startups, [11:33] they have some creative ideas, application product ideas, [11:36] and they want to explore product-market fit. [11:39] So they want to start with [11:41] the most powerful model, which OpenAI provides. [11:45] And [11:47] then when they... [11:49] they feel confident they hit product-market fit, and they want to scale their business. [11:53] And then the problem comes in because, as I mentioned, most of the GenAI applications are B2C: consumer, prosumer, or developer facing. They require very high responsiveness. Low latency is a critical part of product viability. [12:08] It's not a viable product otherwise. People are not patient enough to wait [12:12] half a minute for a response. That's not going to work. So they are actively seeking low latency. [12:18] And then another key factor is they want to build a sustainable, viable business. They cannot go bankrupt quickly. And the weird thing is... Not in this market they can't. The weird thing is if they hit a viable product, that means they can scale quickly. And if they're losing money at the end, they're not going to be able to do that.

12:38-14:10

[12:38] small scale, they're going to go bankrupt quickly. Right? So [12:42] bringing down the total cost of ownership is critical for them. So that's why they come to us. So it sounds like, I remember you had this insight a year or so ago when we spoke about, [12:52] you know, [12:52] training tends to scale in proportion to the number of researchers that you have, whereas inference tends to scale in proportion to the number of customers that you have. [13:01] And in the long term, there are probably going to be more customers of AI products than AI researchers out there, and therefore inference is the place to be. [13:08] It sounds like the customer journey sort of begins as people are going from training into inference. [13:15] What sort of applications, what sort of companies are at that point where they're starting to really go into production? [13:23] There are so many ways to answer this question. It's a very interesting question. So first of all, my hypothesis when I started the company was, [13:32] we're going to target startups first, [13:34] because they're the most [13:36] tech advanced. [13:37] There will be a ton of startups built on top of GenAI. Then we will go to [13:43] digital native enterprises because they are tech forward. And then we'll go to traditional enterprises because they are tech conservative. They want to observe and, you know, adopt when the technology and the product ideas are mature. So that was kind of my hypothesis. [14:00] And it totally blew my mind what's happening right now, because we have a lot of inbound startups. We are working with digital native

14:10-15:41

[14:10] enterprises while also simultaneously working with traditional enterprises, including health insurance companies, healthcare companies, and banks. [14:18] And especially for those traditional enterprises, I usually adjust my pitch to be [14:23] very business oriented because, hey, [14:26] that's kind of maybe my bias, and I want to strike up a meaningful conversation with them. But they quickly dive into very low-level details, technical details, with me. And it's very, very engaging. What are the people doing? Like at a traditional enterprise, who are the people that you're engaging with? [14:45] So, we... [14:47] Is it an innovation person, AI person, or is it more of a [14:51] business line leader, somebody who owns a production application? Yeah. So I think it's starting to shift. We are engaging more, starting with CTOs. Hmm. [14:59] They kind of... [15:02] I feel like this business is shifting towards [15:06] innovation-driven [15:08] business transformation, and that's why [15:11] we encounter more CTOs than, like, the [15:13] CIOs or CISOs. So that's kind of an interesting shift. [15:18] But yeah, across the board, I think there are multiple fundamental reasons [15:23] why that's happening. That's my hypothesis. One is, [15:27] um, [15:28] all the leaders realize the [15:29] current GenAI wave is similar to the cloud-first shift, or similar to the mobile-first shift. [15:36] It remaps the landscape of industry. [15:39] Startups are growing really fast,

15:41-17:14

[15:41] and the incumbents feel threatened: if they are not innovating fast enough, they will be obsolete, [15:47] irrelevant. But also across the... [15:50] they are heavily competing with each other. They're competing on how fast they transition their business to create more revenue, to be more efficient, [15:58] using GenAI. So that's one phenomenon. The second phenomenon is [16:03] generative AI is different from traditional AI. I would say this is [16:08] very different. Traditional AI [16:11] is... [16:12] gives a lot of power to hyperscalers, right? Because with traditional AI you always have to train from scratch. [16:19] There's no concept of a foundation model you build on top of. And that means you have to go off and curate all the data, [16:26] and the data-rich companies, usually they are hyperscalers, [16:29] and you need to have a lot of resource investment to train your own models and so on. [16:35] So that is before GenAI. [16:38] It's [16:39] less affordable. It's concentrated in hyperscalers. Post-GenAI, because of the concept of foundation models, people build on top of foundation models, and you don't train from scratch. It's not meaningful. It's all the same data. It's all the internet data you can crawl. It's a more or less similar model architecture. It's a waste of resources if you train from scratch. Instead, you fine-tune. You tune based on your small data set, a high-quality small data set. [17:09] Small model problem, [17:10] and it makes it so much more affordable to everyone

17:14-18:47

[17:14] to access this technology. And that's why everyone is jumping in to embrace it. [17:22] How many of your customers are using you for fine-tuning versus just using a base model? And what do you think goes into building a great fine-tuning product? [17:29] It really depends on the problem they are trying to solve. We actually see the open source models [17:33] are becoming better and better. The quality difference between open source models and closed source models is shrinking. And my prediction is it's going to converge at the same model size. If you go open and closed, it will converge? [17:48] The open and the closed will converge. Do you think there will be a time lag where closed is always six months ahead, or do you think they will just be neck and neck? [17:55] For the same model size, especially, like, [17:59] between 7 and 70 billion, [18:02] or even within 100 billion model size, the quality will converge. That's my prediction. We'll see after a couple of years, and we will come back to this podcast and see how it goes. [18:17] is customization, [18:19] right? Given, like, [18:21] if this trend is true, [18:23] then the key differentiation is how we customize those models towards individuals' use cases, towards individuals' workloads. And is it easier to customize an open source model than a closed source model? [18:35] So I would say it's easier. It's just that open source models tend to have a much richer community. And there are a lot more people working on [18:43] building on top of those models. For example, a Llama 3 model

18:47-20:18

[18:47] It's a very, very good base model. It has very strong instruction following; [18:53] it follows instructions very well. So it's very easy to align the model to solve a specific problem really well. [19:01] And for example, we have been investing in function calling strategically as a direction. [19:08] We can talk more about that. It's a whole topic by itself. But we find fine-tuning a function calling model on top of Llama 3 is so much easier [19:17] compared with fine-tuning based on Mixtral models or the previous Llama 2 and other models. So the open source base model is becoming very, very strong in [19:29] instruction following, in logical reasoning, in many other base capabilities. So it's very easy to morph it [19:35] to become a high-performance model for solving specific business tasks. That's the power of small models. [19:42] If we think about just open source software, open source infrastructure software, [19:47] 20 years ago, open source [19:50] was thought of as a fast follower sort of thing, you know, Red Hat being a canonical example. [19:54] And then more recently, [19:56] open source is not the fast follower, it's actually the innovator. Think about Mongo or Confluent or some of these other great open source businesses that have been created. [20:04] Do you think there are areas in the world of models [20:08] where open source is actually going to lead [20:10] closed source and is actually going to be ahead of the proprietary models? [20:15] Thank you. [20:16] So I think the dynamics are very interesting right now because

20:19-21:53

[20:19] the proprietary closed-source model providers, they're betting on [20:23] very few models, [20:25] right? So [20:26] OpenAI's LLM models are like maybe three models, right? [20:30] Or you can think about that as one model, because what is a model? A model is the model architecture [20:35] and data, training data, right? That defines the model. [20:38] So I'm pretty sure, [20:40] for all the models, they have more or less similar training data. The model architecture is more or less similar, so it's kind of scaled-up parameters and so on. [20:48] It's not just OpenAI. I think Anthropic [20:51] or Mistral and also all these model builders, they have to concentrate their effort [20:58] to focus on a [21:01] specific, like, [21:03] model segment; that's kind of the best ROI. That's the best model. But open source pushes a different dynamic, [21:12] because it enables so many [21:15] researchers [21:16] to build on top of it. [21:18] So... [21:19] So that's kind of the small model [21:21] phenomenon: it's smaller and it's easier to tune, easier to [21:28] improve quality, easier to focus on a specific problem space. So it enables [21:35] thousands of flowers to blossom. Thousands of flowers to blossom. So that's the direction we believe in: [21:42] to solve enterprise problems, [21:44] a thousand flowers blossoming is much better for the enterprise, [21:47] because you just have so many problems. And I bet at a given

21:53-23:33

[21:53] at any problem space, [21:55] there is a solution for you. We further customize towards your use case and your workload. What you get is [22:04] better quality, [22:05] much lower latency for real-time applications, and much lower cost for business sustainability and growth. [22:13] So we believe in that direction. [22:17] Maybe to that point, [22:19] have you seen that your customers so far are able to match the quality that they got with OpenAI [22:24] when they move over to the Fireworks stack? And how are you enabling the, what I call the small but mighty stack, to compete? [22:31] Yeah, so it really depends on the domain. So for some domains, [22:34] actually people don't even fine-tune. They use an off-the-shelf model as is. And it's already very, very, very good. For example, in the domain of coding: code copilot, [22:45] code generation. [22:47] Transcription, [22:48] translation, [22:49] OCR: [22:51] it's just phenomenal. Those models are really, really good. So yeah, [22:56] that's kind of off the shelf and ready to go. [22:58] But for some areas, [23:01] it requires business logic, because every company is defining what is good differently, and then an [23:09] off-the-shelf model will not work off the shelf because it doesn't understand the business logic. For example, classification. [23:15] Different companies want to classify, hey, you know, some marketplace wants to classify whether it's furniture or it's a dish or it's something else. That completely depends on their domain. Or summarization: you think summarization is a very general task, but

23:34-25:12

[23:34] for example, an insurance company wants to summarize into a very special template, [23:40] and there are specific business tasks on... [23:47] Yeah, on many other things. We work across the board with various different problems, and those require fine-tuning. [23:55] And I want to call out that [23:58] fine-tuning sounds simple, but it's actually not simple at all. [24:02] So the end-to-end requires [24:06] enterprises or developers to collect data first, then to trace. After they trace, they need to label. [24:13] After they label, they need to [24:15] pick and choose which fine-tuning [24:18] algorithm to use. There's supervised fine-tuning, there's DPO, there's a [24:22] slew of preference-based fine-tuning, as in [24:25] they don't label an absolute [24:27] good result, they basically say, "I prefer [24:31] this over that." [24:33] They need to pick whether they want to use [24:37] parameter-efficient fine-tuning, like LoRA, or full model fine-tuning. And for some tasks, they need to tune hyperparameters, not just the model weights themselves. So among [24:50] these many technologies, they have to figure out when to use what and so on. It's very important. [24:55] It's very deep, and usually those app developers, they haven't even touched AI yet, and there's a lot for them to pick up. [25:02] And then once they've tuned and they've tested, it's improving in some dimensions. It's still not good in some other cases, and then they need to...

25:12-26:53

[25:12] capture those failure cases and analyze: [25:15] should I collect more data and go through this cycle again, [25:18] or is it actually a product design issue, right? It's very interesting. Some failure cases are not really failure cases; it's just that they haven't designed how the product should react. For example, people are [25:28] building [25:30] um, [25:31] an assistant to auto-generate content [25:34] when people type. And if you're in a table and your cursor is in a cell, what does auto-generate mean? Do you auto-extend what you type in the cell? Do you generate more rows or do nothing? So it's actually a product design question. So that requires a PM to be in the loop [25:52] to think about the failure cases. [25:54] With all this complexity, what we want to do is take away the rudimentary stuff, take away the complexity of figuring out [26:02] which tuning approach to use. Yeah. [26:05] How to automatically label data, how to automatically collect data from production. We want to take away all this and keep a [26:12] simple API for people to use, but leave the design part to our end user. For example, how the product should respond should be completely [26:21] in their realm to figure out and solve. So we want to create that separation. And we started working in this direction, and hopefully we'll announce our product there soon. [26:32] I love that you're kind of liberating people [26:34] to not have to think from the tech out and to actually think from the customer back, [26:39] and sort of use all the stuff that you've built to deal with the underlying technology and really focus on, to your point, the design patterns and the usability and making sure that they're actually solving an important problem end-to-end in a compelling way. What is your vision for the...

26:53-28:38

[26:53] Fireworks platform? And like, [26:55] to Pat's point earlier on conservation of complexity, you know, we started this podcast talking about [26:59] how you're conserving complexity for your customers on the inference stack. [27:02] You just now talked about how you're conserving complexity for your customers in terms of the fine-tuning workflows. [27:07] Like, what are the other pieces that have to come together and what is your ultimate vision for what Fireworks the product is? If everything works, five years from now, what will you have built? [27:22] So what we... [27:24] Like, the North Star for Fireworks is [27:27] simple API access to [27:29] the totality of knowledge. [27:31] Right. So right now, we're building towards that direction. We already have more than 100 models [27:38] we provide, across [27:40] large language models, image generation models, [27:43] audio generation models, video generation models, embedding models, and multimodal models with image as the input to extract information. So that's kind of [27:53] one side, the foundation model coverage. [27:58] Put all the foundation models together, it still has limited knowledge, because its training data is limited. [28:04] The training data has a starting time and an ending time, [28:08] and all the information they can crawl on the internet is still limited, because there is a lot of knowledge that's hidden behind APIs, [28:15] hidden behind the public APIs that you don't have access to, or you just cannot get real-time [28:21] information. There are a ton of private APIs hosted within the enterprises. No way anybody will have access outside of the companies. So then how we get access to the totality of knowledge for the enterprises is to...

28:38-30:14

[28:38] is to have a layer to blend across many different models [28:44] and [28:45] public and private APIs. So that's kind of the vision. And the tool, the vehicle to get there is function calling, [28:56] is the function calling model. Basically, this model is capable [28:59] of [29:01] understanding: here are the APIs you want to access, and for what reason. It can automatically be the router [29:09] to [29:11] most precisely call out to those APIs, whether it's accessing models or accessing non-model APIs, in the most accurate way. So strategically that's extremely [29:24] important [29:25] to build this simplified user experience, because then [29:30] our customers, they don't need to scratch their heads and figure out, oh, I need to fine-tune to be able to access those APIs, and how do I even do that myself? It's kind of a tall order for me to do that. So we want to basically... [29:45] You can think about it this way, because many people are familiar with a notion called mixture of experts. [29:50] So OpenAI is providing mixture of experts, and mixture of experts has become a very popular model architecture. The concept is it has a router system [29:59] sitting on top of a few very big experts, and each is specialized in doing its own thing. And our vision is we want to build a mixture of experts that accesses hundreds of those experts. And each of those experts is much smaller,
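A minimal sketch of the routing idea described above: one router in front of many small experts, where an expert can be either a model or a non-model API. Everything here is an assumption for illustration. The `Expert`, `route`, and `answer` names and the three example experts are hypothetical, and the keyword-overlap scoring is a deliberately crude stand-in for the function calling model, which in the system described would make this choice itself.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Expert:
    name: str
    keywords: frozenset             # words that suggest this expert applies
    handler: Callable[[str], str]   # a model call or a plain API call


# Hypothetical experts; the handlers are stubs standing in for real calls.
EXPERTS = [
    Expert("code_model", frozenset({"code", "function", "bug"}),
           lambda request: "[code model] " + request),
    Expert("translation_model", frozenset({"translate", "translation"}),
           lambda request: "[translation model] " + request),
    Expert("weather_api", frozenset({"weather", "forecast"}),
           lambda request: "[weather API] " + request),
]

# General-purpose fallback used when no specialist matches.
FALLBACK = Expert("general_model", frozenset(),
                  lambda request: "[general model] " + request)


def route(request, experts=EXPERTS, fallback=FALLBACK):
    """Pick the expert whose keywords overlap most with the request."""
    words = set(request.lower().split())
    best = max(experts, key=lambda expert: len(expert.keywords & words))
    # No overlap at all means no specialist applies, so use the fallback.
    return best if best.keywords & words else fallback


def answer(request):
    """Route the request, then dispatch it to the chosen expert."""
    return route(request).handler(request)
```

Because every expert exposes the same `handler` signature, adding one means adding a registry entry rather than changing the caller, which is the sense in which the complexity stays behind a simple API.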

30:15-31:45

[30:15] much more agile, but with high quality at solving specific problems. In that vision, real quick, do those experts live [30:24] in Fireworks, [30:25] in AWS and Hugging Face? Like where do those experts [30:30] come from [30:31] that get put together with Fireworks as the overarching framework? Yeah, our ambition is those experts living in [30:38] Fireworks. That's where we want to curate, [30:41] um, [30:42] curate the models we serve towards that. That's why today we already have more than 100 models. It will take some time to build this layer [30:54] in a very solid way, but we're going to release our [30:57] next-generation function calling model. It's really, really good. A little preview on that: [31:04] it has multiple layers of breakthroughs, and we're going to announce it together with demos and examples that people can leverage and build on top of. Very cool. [31:15] Do you see any viable competition for NVIDIA [31:19] on the horizon? [31:21] That's a very interesting question. First of all, [31:27] I think NVIDIA is operating in a very lucrative market. Any lucrative market invites competition. This is just the economics [31:39] here. [31:40] And also from the whole entire industry point of view,

31:45-33:15

[31:45] in general, industry doesn't like monopoly. So that's kind of another trend, like a pressure coming from industry. [31:54] So I think it just... [31:57] It's not a question of whether there will be competition to NVIDIA. It's just a question of when. Do you think it's coming soon? [32:04] I think it's coming soon. I think, I mean, obviously... [32:09] We can look at NVIDIA's competition in multiple segments. On the [32:15] like, general-purpose [32:17] GPU segment, AMD is coming up. [32:21] That's interesting. And I think also in a specific AI segment where [32:27] the AI model space is stabilized, [32:30] there's no more innovation, like the problem is well-defined and this is the model, then custom ASICs will have their own role. [32:38] So I think I will look at the market that way, and I do think [32:44] there will be competition coming soon. Can I ask you about that, by the way? Because you guys are in this part of the market where... [32:50] Um... [32:52] You are model agnostic to some degree, and it's really about the optimization of those models when it comes to putting them into production. [33:00] Do you... [33:01] Do you think that the returns to scale on the frontier, the models that are out on the bleeding edge, do you think the returns to scale are starting to slow down? [33:10] Do you think that we're going to go into a phase where [33:12] capabilities have started to mature or asymptote,

33:16-34:49

[33:16] and the race is more about the optimization and tuning and application of those capabilities? [33:22] I think both will happen at the same time. One is [33:26] it will start to stabilize and plateau from the model applicability point of view, [33:33] and we'll heavily customize. Our strategy is heavily customized towards the use cases and workloads. So that's one direction. And the second is, I want to caution that, [33:45] because at Meta I also thought, for a certain [33:49] period of time, that this is the model for ranking and recommendation, right? [33:55] And we should heavily, [33:58] like, index on that assumption. [34:02] But then after a few years [34:05] it's not the case. There's a significant amount of model innovation in [34:09] a seemingly stabilized [34:12] modeling space, and that pushed the S-curve for Meta. I think the same phenomenon will happen in the GenAI space, that a new model architecture [34:24] will happen, and we're kind of overdue. So you mentioned, we've talked about competition from other vendors, other direct competitors. [34:31] What about OpenAI? Does OpenAI keep you up at night? They drop prices on their APIs all the time. They're making their models... [34:39] They're also trying to win the better, faster, cheaper race. Like, how do you think about them? [34:45] And [34:45] how do you think about, you know, ultimately what you're going to build that's different from where they are going?

34:50-36:20

[34:50] Right. So... [34:52] Again, I feel like for the... [34:55] They are actually going smaller and cheaper. I think for the same model size, [35:00] for the same model bucket, [35:02] whether it's closed source or open source, [35:05] the quality is going to converge. Again, that's my prediction. And the real meaningful thing to push the boundary here is to have customization, or automated customization, tailored towards individual use cases and individual workloads. I'm not sure if OpenAI has the appetite to do it, because their mission is AGI, if they hold to their mission, which is a great mission, actually, [35:35] rather than solving an enterprise problem, which basically means there are a lot of problems, a lot of [35:42] specific problems that are really good for small models to customize towards. And that's where we want to focus our energy and build on top of open source models, assuming the trend that they are going to converge in quality. [35:56] Love that. [35:57] Our partner Roelof, last time you were here, made the point that, [36:01] you know, in prior technology waves, internet, mobile, [36:04] it was the people that did all the hard work driving down the marginal cost of running the stuff that actually enabled [36:09] all the application development on top and all the end use cases that we get to enjoy every day. And [36:14] I love that you are taking that exact approach with AI, where it's still so cost prohibitive for most people to run in production.

36:21-37:53

[36:21] By just dramatically bringing down that cost curve, [36:24] you're actually enabling the whole industry to blossom. So it's really wonderful. [36:28] Should we close out with some [36:29] rapid-fire questions? [36:31] Yeah, let's do it. Okay, can I go first? No, go for it. Okay, let's see. Favorite AI app? We do a lot of [36:38] video conferencing, and the note taker [36:42] for video conferencing is a game changer for us. Whatever it is, there's so many different varieties, but [36:47] oh, I just love that. Which one do you use? [36:49] I think we use Fathom. [36:52] Yeah, our sales team uses that. It's really good for training and also summarization, so they're in for short hour time. Nice. [37:00] Thank you. [37:01] What will be the best performing models in 2024? [37:04] Yeah. [37:05] My prediction is there will be many, given the rate that every week, every week there's a new model coming up [37:16] and on the LMSYS arena [37:19] they keep competing with each other. So this is all good news for the whole entire industry. [37:25] It's really hard to predict which one. But the one prediction I'm pretty confident in is the model quality will keep improving and keep increasing. [37:36] In the world of AI, who do you admire most? [37:40] I would say Meta. [37:42] It's not one person. [37:44] But [37:45] Meta's commitment to open source... [37:49] Um... [37:51] I think Meta is the most brilliant

37:54-39:11

[37:54] in the journey of GenAI, by continuously open-sourcing a series of Llama models, continuing to push the boundary, continuing to kind of shrink the quality differences. So okay, so what Meta's doing is [38:10] basically decentralizing power from the hyperscalers [38:14] to everybody who has a dream to innovate on [38:19] foundation models, GenAI models. I think that's really brilliant. [38:26] Bye-bye. [38:27] Okay, well, agents: [38:29] perform or disappoint this year? Hmm. [38:31] I'm very bullish on agents. [38:35] It's... [38:36] I think it's going to... [38:38] it's going to blossom. [38:40] That's all we got. [38:41] All right. Thank you. It's really fun to have this conversation. Thanks for having me. Thank you for joining us. [38:47] *music*
