Reflection AI’s Misha Laskin on the AlphaGo Moment for LLMs

LLMs are democratizing digital intelligence, but we’re all waiting for AI agents to take this to the next level by planning tasks and executing actions to actually transform the way we work and live our lives. Yet despite incredible hype around AI agents, we’re still far from that “tipping point” with best-in-class models today. As one measure: coding agents are now scoring in the high teens (percent) on the SWE-bench benchmark for resolving GitHub issues, which far exceeds the previous unassisted baseline of 2% and the assisted baseline of 5%, but we’ve still got a long way to go. Why is that? What do we need to truly unlock agentic capability for LLMs? What can we learn from researchers who have built both the most powerful agents in the world, like AlphaGo, and the most powerful LLMs in the world? To find out, we’re talking to Misha Laskin, former research scientist at DeepMind. Misha is embarking on his vision to build the best agent models by bringing the search capabilities of RL together with LLMs at his new company, Reflection AI. He and his cofounder Ioannis Antonoglou, co-creator of AlphaGo and AlphaZero and RLHF lead for Gemini, are leveraging their unique insights to train the most reliable models for developers building agentic workflows.

Hosted by: Stephanie Zhan and Sonya Huang, Sequoia Capital

00:00 Introduction
01:11 Leaving Russia, discovering science
10:01 Getting into AI with Ioannis Antonoglou
15:54 Reflection AI and agents
25:41 The current state of AI agents
29:17 AlphaGo, AlphaZero and Gemini
32:58 LLMs don’t have a ground truth reward

Published: Jul 16, 2024
Uploaded: Jun 11, 2026
File type: Podcast

Full transcript

AI-generated transcript with timestamped sections.

0:00-1:36

[00:00] I think someone needs to solve the... [00:02] kind of depth problem. [00:03] The field as a whole, I think, or large labs have been really working on the breadth. That's amazing, and there's a big market for that, and a lot of very useful things that get unlocked, but someone needs to solve the depth problem too. [00:33] Hi everyone, welcome to Training Data. [00:36] Today we're hosting Misha Laskin, CEO and co-founder of Reflection AI. [00:40] Misha's a former research scientist at DeepMind, [00:43] and his co-founder Ioannis was the creator of AlphaGo and RLHF lead for Gemini. [00:48] Together they are building universal superhuman agents. [00:52] We'll chat about why we're so far from the promise of AI agents, even with best-in-class models today, [00:58] what we need to truly unlock agentic capabilities for LLMs, [01:01] and what we can learn from those who have built both the most powerful agents in the world, like AlphaGo and AlphaZero, [01:07] and the most powerful LLMs in the world, like Gemini. [01:10] So Misha, [01:12] to kick things off, we'd love to learn a little bit more about your personal background. [01:16] You were born in Russia, [01:17] moved to Israel when you were one, then the United States, in Washington state, [01:22] when you were nine. [01:23] Your parents were pushing forward the field of technology and research in chemistry. And I think that inspired a lot of your love for pushing forward the frontier of technology and getting into the world of AI today as well.

1:37-3:07

[01:37] Can you share a little bit more about what inspired you to get into this field and what has inspired you throughout your childhood and adulthood so far? Yeah. [01:45] Uh, [01:45] yeah, definitely. Um, [01:48] when my parents, we emigrated from... [01:52] Russia [01:53] to Israel, it was [01:54] when the Soviet Union collapsed. [01:56] And they came to Israel... [01:58] basically, I mean, with nothing. They had, I think, $300 in their pocket, [02:02] which was then stolen from them [02:05] as soon as they landed, because they put down a deposit for an apartment. [02:09] And, well, that just disappeared. I don't even know if there was an apartment. [02:15] So. [02:17] And they didn't speak [02:18] Hebrew. [02:19] Um, [02:20] so they decided to kind of pursue... [02:24] a PhD in chemistry [02:26] at the Hebrew University of Jerusalem, but that's not because... [02:29] it's not because there was some kind of internal passion for academia. At that time, [02:35] Israel was giving stipends to Russian immigrants to get further educated. [02:41] It's interesting asking my parents about this, because they kind of... [02:44] grew to love their craft as they got excellent at it. [02:47] So I think... [02:49] from them, that might be the [02:51] the thing that... [02:53] I took away most. It's not that, [02:55] um, [02:56] it's not that they were particularly... [02:59] impassioned about chemistry to start, but as they kind of learned about it, got curious about it, and [03:03] really went deeply into it, I think they became kind of, [03:06] um,

3:07-4:38

[03:07] masters of their craft, and [03:09] that's something that I found [03:10] really important myself. [03:13] Um, [03:14] moving from there to the States, [03:16] uh, [03:17] well, my parents promised that we're moving to this kind of beautiful state with all these mountains, this Washington state. And I remember, like, we're taking the plane flight. I mean, I bragged to all my friends in Israel and, you know, I'm not afraid. [03:29] I was really excited. [03:31] Um, so yeah, we're flying. I do see mountains in the distance. [03:34] But then the plane does a sort of U-turn. [03:38] And... [03:39] many people don't know this about Washington State, but it's kind of half desert and half... [03:44] you know, [03:45] mountains and forests. [03:47] And... [03:49] the plane turned into the desert. [03:51] And so I see it kind of landing in the middle of nowhere. And I asked my parents, like, where are the mountains? Like, well, you saw them from the plane. Um... [04:00] The reason I'm saying this is because... [04:03] I basically moved to a very boring place. Huh. [04:06] Which city? There's this area in Washington State called the Tri-Cities, and it has some pretty interesting history. [04:13] The reason it exists is because it was one of the sites for the Manhattan Project. So this is where the plutonium... [04:19] was enriched, at this place called the Hanford site, which is a sister site to Los Alamos. [04:24] So. [04:25] It's a town that was basically built for that... [04:27] in the 1940s. [04:29] And... [04:30] it's in the middle of nowhere, kind of like Los Alamos is, and there's not much to do. [04:35] I remember [04:36] seeing my first tumbleweeds, uh,

4:38-6:10

[04:38] You literally saw tumbleweeds. [04:41] Kind of rolling across the highway. [04:43] I found myself in a place where I didn't really speak the language that well, English. [04:48] I was in this kind of very rural place that was different from where I grew up. [04:51] Didn't have many friends, [04:53] and had a lot of time on my hands. And the way I got interested in science, at the time it was physics... [05:00] is... [05:01] after getting sort of video games out of my system, I got bored again, [05:05] and [05:06] I found these Feynman lectures that my parents had, the Feynman Lectures on Physics, and [05:12] the... [05:13] they were so... [05:14] interesting because... [05:15] Feynman had this way of explaining incredibly complex things in a way that [05:19] basically a person who's, [05:21] I mean, not that educated mathematically at the time, [05:24] can really understand something fundamental about how the world works. [05:28] And [05:30] that is probably the thing that... [05:33] inspired me most. [05:34] I just got really interested in this idea of [05:38] understanding how things work, [05:40] how things work at this sort of root level, and [05:44] working on problems that are [05:46] sort of root node problems, and that... [05:49] I mean, there are all these examples I was reading about, like the invention of the transistor, which was invented by John Bardeen, a theoretical physicist, [05:56] or how GPS works. It turns out you need to understand, [05:59] you need to make [06:00] relativistic calculations, [06:02] which are [06:04] coming from Einstein's theory, special relativity. [06:06] And... [06:07] I wanted to work on things like that. So that's why I got into physics.

6:10-7:45

[06:10] I pursued it for a while, got educated in it, got a PhD in it. [06:14] And I think maybe that [06:16] critical bit of information that I had not... [06:19] did not have in my context then was that... [06:22] you don't just want to work on [06:23] root node problems, you want to work on [06:26] root node problems of your time. You want to work on the things that [06:29] can be unlocked now. [06:32] And it's no surprise, you know, when you're being... [06:35] trained as a physicist, [06:36] you're doing, like, these really interesting problems and learning very interesting things about how people thought about these problems, [06:42] you know, about physics, basically 100 years ago. [06:44] 100 years ago, physics was the root node problem of its time. [06:48] And... [06:50] that's why I decided to not pursue it professionally. I kind of did a 180 and wanted to do something very practical, [06:56] so I started a [06:58] startup. [06:59] But as I was working on that, I started noticing... [07:02] deep learning as a field taking off. [07:04] And in particular, [07:06] AlphaGo. [07:07] Like, when AlphaGo came out, [07:08] there was something... [07:10] something just felt very profound about it. That... [07:13] how do they get this... [07:15] system that is, like, a computer to [07:18] not just perform at a [07:20] higher level than a human, but [07:21] do so creatively. [07:22] Um, [07:23] in AlphaGo, there was this famous move called Move 37, [07:26] where [07:27] the, well, where the neural network [07:30] made this kind of move that looked like, it just looked like a bad move. Lee Sedol was really perplexed by it. Everyone was perplexed by it. [07:36] It just looked like a mistake. [07:38] And it... [07:39] turned out that... [07:40] ten moves later, this was actually the optimal move to kind of put AlphaGo in a winning position for the game.

7:46-9:16

[07:46] And so you could tell that this is not just... [07:49] this is not just a brute force thing. This is not a thing that is just... [07:53] It's obviously, I mean, the system does a lot of search. [07:55] But... [07:57] it's able to find creative solutions that people haven't thought of before. [08:00] And so that kind of made me [08:03] feel pretty viscerally that [08:04] solving... [08:06] this was the first real... [08:09] large-scale superhuman agent. [08:11] Yeah, that seemed profound. [08:13] That's how I got into AI. [08:14] And [08:15] and... [08:16] got into AI to build agents from the first day. So there was a kind of nonlinear path where... [08:21] um, [08:22] I was an outsider. I wasn't really... [08:25] I mean, it was competitive back then, too. [08:27] And... [08:28] OpenAI released these requests for research at the time, [08:31] um, [08:32] this is maybe 2018 or '19. Their requests for research were just things that they wanted other people to work on. [08:38] I think by the time that I was looking at this list, it was actually already stale. So I don't think they really cared about those problems, but it gave me something concrete to work on. [08:46] And I started making progress against one of these problems. [08:50] Um, [08:51] I felt like I was making progress. I don't know how much progress I actually made. [08:54] I was kind of peppering a few research scientists from OpenAI with questions and kind of... [08:59] I mean, effectively cold emailing them, [09:01] um, [09:01] until maybe I got... [09:03] too annoying and they started... Well, I guess they responded rather, I'd say... [09:08] graciously. And I built some relationships there. [09:14] And one of them introduced me to Pieter Abbeel,

9:16-10:48

[09:16] who is a PI at Berkeley, one of the greatest, I think, researchers of our time in the field of reinforcement learning and robotics. His lab [09:26] kind of does everything. They have some of the most impactful [09:28] generative model research as well. One of the key diffusion papers, [09:33] diffusion model papers, came out of there. [09:36] And, [09:37] honestly, I got lucky. He took a chance on me and... [09:41] and brought me into his group. [09:43] He really, he had no reason to. [09:45] After I was on the other side and looking at applicants coming into the group, [09:49] there is really no reason for him to take someone who is not vetted. [09:52] So he kind of took a chance, [09:55] and that, I think, was my kind of [09:58] foot into the field. [09:59] Mm-hmm. [10:01] You and your co-founder, Ioannis, have worked on, I think, some of the most incredible projects out of DeepMind [10:08] and Google. Maybe... [10:09] can you give a, [10:10] give the folks here a taste of [10:12] some of the projects that you both worked on, like Gemini and AlphaGo? [10:17] Maybe what were the key learnings from each and how they propelled your thinking forward to the present day? [10:22] Yeah. [10:23] So. [10:24] Ioannis was... [10:25] basically the reason I got into AI. He was one of the key [10:28] engineers on AlphaGo. [10:30] He was there in Seoul when they played Lee. [10:33] Played against Lee Sedol. [10:35] Um... [10:36] And he's actually... [10:39] Yeah. [10:40] Before AlphaGo, he worked on this paper called... [10:44] the Deep Q... [10:46] Deep Q-Network, that is, DQN.

10:48-12:19

[10:48] And... [10:49] this was... [10:50] this was actually the first... [10:53] successful agent of the deep learning era. This was an agent that was able to play Atari video games, [10:59] and [11:00] it kind of catalyzed this whole field of deep reinforcement learning back then, which was... [11:05] AI systems that are autonomously learning to act in, I mean, mostly video game and robotics environments. [11:10] But it was the first agent. It was kind of a proof point that... [11:13] you can learn, [11:15] you know, to act in an environment in a reliable way [11:18] from just raw sensory inputs coming in. This was, I think, a big unlock that was [11:21] completely unclear at the time, the same way that, [11:23] I think, neural networks working on ImageNet [11:26] was an unlock in 2012. [11:28] And then, yeah, Ioannis worked on AlphaGo and the kind of subsequent line of work. There's AlphaGo, there's AlphaZero, there's a paper called MuZero. [11:38] And I think that really... [11:40] showed how far you can take this idea. It really scales. [11:44] Um, relative to the models we have today, large language models, the, [11:48] like... [11:49] AlphaGo model is actually really small. [11:51] And it was so smart at this one thing. [11:53] I think the key lessons, at least for me, from AlphaGo [11:57] were kind of encapsulated in this famous essay that Rich Sutton wrote, [12:02] this reinforcement learning researcher, [12:05] or I guess kind of a sort of [12:07] father of a lot of reinforcement learning research, uh, put forth, which is this, uh, [12:11] idea of the bitter lesson. [12:13] In that [12:14] essay he... [12:16] basically says that [12:18] you want to...

12:20-13:50

[12:20] if you're [12:21] building systems that are based on kind of your internal heuristics, [12:25] um, [12:25] those things... [12:27] will likely get washed away [12:29] by systems that kind of just learn on their own. [12:32] And, [12:33] or rather, systems that... [12:35] leverage compute in a scalable way. [12:37] And he argued that there are two ways to leverage compute. [12:41] One is by learning, so that's training. [12:44] That's, [12:45] when we think of language models today, [12:47] they're leveraging compute mostly through learning, [12:49] by training them on the internet. [12:52] The other way is search, which is... [12:55] leveraging compute [12:57] to unroll a bunch of plans [12:59] and then pick the best one. [13:02] And [13:03] AlphaGo is actually both ideas in one. I still think it's the most profound idea in AI, [13:09] that [13:10] combining learning and search together [13:13] is the optimal way to leverage compute in a scalable sense, [13:18] and those things together are the things that [13:21] produced a superhuman agent at Go. [13:23] The issue with AlphaGo [13:25] was that it was only good at one thing. And... [13:29] I remember being in the field then, and [13:31] it did feel kind of stuck, the field of deep reinforcement learning, because... [13:36] the goal everyone set out for themselves was to build... [13:38] general agents, [13:40] superhuman general agents. And [13:42] where the field landed was superhuman, very narrow agents. [13:47] And there was no clear path to... [13:48] how do we make them general? [13:50] Um,

13:50-15:20

[13:50] Because they were so... [13:52] data inefficient that if it takes six billion steps to train on one task, then [13:56] where are you going to [13:58] get the data to train, you know, on all these others? [14:01] And... [14:03] that was... [14:04] the big unlock of the language model era. Um, [14:08] take... [14:09] One way to think about the Internet, or all the data on the Internet, is as... [14:13] a collection of many tasks. Like, you have, like, [14:15] Wikipedia is a task of kind of describing some historical events. [14:20] Stack Overflow is a task of Q&A on coding. [14:23] You can think of the internet as, like, a massive multitask dataset. [14:27] And that's interesting. [14:28] Yeah, and the reason we get generality from language models is because [14:33] it's an... [14:33] it's basically a system that's trained on [14:36] tons of tasks. [14:38] Those tasks aren't particularly, let's say, directed, or, you know, there's no notion of reliability or agency on the Internet. So it's no surprise that the language models that come out of that [14:47] aren't particularly good agents. They're obviously incredible, and they do incredible things. [14:52] But [14:53] one of these fundamental problems in agency is that [14:57] you need to think over many steps. [14:59] And you have some error rate over each step. [15:02] And that error accumulates. It's called error accumulation. [15:06] And so that means if you have, [15:08] you know, [15:09] some percent [15:11] chance that you're wrong the first step, [15:13] that'll compound very quickly over a few steps, [15:15] to the point where it's basically impossible for you to be reliable on a task that is meaningful.
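
The error accumulation point above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming each step fails independently with the same probability (a simplification that is not stated in the conversation); the function name `task_success_probability` is hypothetical. Under that assumption, a 5% per-step error rate leaves roughly a 60% chance of finishing a 10-step task and under 8% for a 50-step task.

```python
def task_success_probability(per_step_error: float, num_steps: int) -> float:
    """Chance that all `num_steps` steps succeed.

    Assumes steps fail independently with the same probability
    `per_step_error` (an illustrative simplification).
    """
    if not 0.0 <= per_step_error <= 1.0:
        raise ValueError("per_step_error must be between 0 and 1")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    # Every step must succeed, so the per-step success rates multiply.
    return (1.0 - per_step_error) ** num_steps


if __name__ == "__main__":
    # Show how quickly reliability decays as tasks get longer.
    for steps in (1, 10, 50, 100):
        p = task_success_probability(0.05, steps)
        print(f"{steps:>3} steps -> success probability {p:.3f}")
```

Real agent steps are rarely independent, so this should be read as an intuition pump for why long tasks are hard, not as a model of any particular system.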

15:21-16:51

[15:21] Um... [15:23] The key thing that I think is missing is that [15:26] language models are systems that leverage learning. [15:29] Um, [15:29] they're not yet systems that leverage search or planning, [15:33] uh, [15:34] in a scalable way. [15:35] And so that, I think, is kind of the missing piece. [15:39] Okay, now we have general agents, but they're... [15:41] or we have general agents that are not very competent. [15:45] And so you kind of want to move up the competence. [15:47] And the only existence proof for that has been AlphaGo. [15:50] And it's been done through search. [15:52] I really love that and the encapsulation of how you just shared that. I think that sets the stage wonderfully for Reflection. Can you share a little bit more about the original inspiration, this problem space that you're going after, and your long-term vision for Reflection? [16:07] I mean, the original inspiration came very much from... [16:11] Ioannis and I collaborated very closely on Gemini. Ioannis led... [16:15] the RLHF effort, [16:17] and I led reward model training, which was kind of a key part of RLHF. What [16:23] we were working on, and what [16:25] everyone who is working on these language models is working on, is you, in post-training, you align them for chat. So you align them to be... [16:31] good interactive experiences for some end user. [16:35] So that... [16:35] was, you know, through [16:38] products like ChatGPT, [16:39] or Bard, which has now been named Gemini. These language models, like the pre-trained ones, are very adaptive. And so with the right data mix, you can adapt them to be highly interactive [16:50] chatbots.
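
Search in the sense used in this conversation, unrolling a bunch of plans and picking the best one, can be sketched in a few lines. This is a toy illustration, not how AlphaGo or any production system works: `propose_plan` is a hypothetical stand-in for a learned policy, `score_plan` is a hypothetical stand-in for a learned value function, and the task (reach a target sum with +1/-1 moves) is invented purely for the example.

```python
import random
from typing import Callable, List

Plan = List[int]  # a plan here is just a sequence of +1 / -1 moves


def propose_plan(rng: random.Random, horizon: int = 8) -> Plan:
    """Hypothetical stand-in for a learned policy: sample a random plan."""
    return [rng.choice((-1, 1)) for _ in range(horizon)]


def score_plan(plan: Plan, target: int = 4) -> float:
    """Hypothetical stand-in for a value function: closer to target is better."""
    return -abs(sum(plan) - target)


def search(
    num_plans: int,
    rng: random.Random,
    propose: Callable[[random.Random], Plan] = propose_plan,
    score: Callable[[Plan], float] = score_plan,
) -> Plan:
    """Unroll `num_plans` candidate plans and return the highest-scoring one."""
    if num_plans < 1:
        raise ValueError("num_plans must be at least 1")
    candidates = [propose(rng) for _ in range(num_plans)]
    return max(candidates, key=score)


if __name__ == "__main__":
    # More search compute (more candidate plans) can only help the best score
    # when the candidates come from the same seeded sequence.
    for n in (1, 4, 64):
        best = search(n, random.Random(0))
        print(f"{n:>2} plans -> best score {score_plan(best)}")
```

The design point is that the same proposer gets better results simply by spending more compute at decision time, which is the sense in which search "leverages compute" separately from training.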

16:53-18:22

[16:53] And [16:54] I think our key kind of [16:57] insight from... [16:58] working on that was that [17:01] there was nothing specific that was being done [17:04] for... [17:05] chat. It's just, you know, you're just collecting data for chat. [17:08] But... [17:08] if you collect the data for [17:10] anything, for another capability, [17:13] you'd be able to unlock that as well. [17:15] Of course, it's not so simple. A lot of things change, in the sense, I mean, [17:20] one key thing is that chat is subjective, so the algorithms that you [17:23] train [17:24] are different than the algorithms that you train for something that has kind of an objective, like, [17:29] this task was done or not. [17:32] There are all sorts of issues, but... [17:35] but the main thing was that [17:37] we think... [17:38] the architectures and models work. [17:41] Um, [17:41] a lot of things that... [17:43] I thought were bottlenecks [17:45] have been kind of [17:46] washed away with compute and scale. [17:48] Um, [17:49] like, long context length is something I thought would be... [17:51] something would be... [17:53] you'd need a research breakthrough, and now [17:56] all the players are... [17:57] releasing models with... [17:58] extremely long context lengths relative to what we thought was possible even a year or two ago. [18:05] The methods for training these things and aligning them and post-training them are pretty stable. [18:10] And it's really... [18:13] Yeah, it's a data problem. And it's a data problem and a problem of how do you [18:17] enable [18:18] planning and search on top of these objects, and, [18:22] uh,

18:24-19:54

[18:24] we thought [18:25] we'd move... [18:27] faster [18:29] against this problem if we did it on our own. [18:32] I think we just wanted to move very quickly against it. [18:34] So you've described agents as kind of the dream, [18:38] both for you and Ioannis as researchers, but also for Reflection. [18:42] Can we pause on the word agents for a little bit? Because now that it's become the term of 2024, everybody is calling themselves an agent and the word is... [18:49] starting to lose its meaning a little bit. And I imagine that you have probably a more... [18:52] pure definition of what an agent is. Maybe, could you just explain it? How do you think about what an agent is and... [18:59] you know, when we look at... [19:01] some of the agents that everyone's gotten really excited about recently, they're still, it seems like they're still very early in terms of being reliable enough to be... [19:09] kind of [19:10] colleague-level true agents. So, like, where do you think we are on that curve? What is an agent, [19:15] and how do we kind of get to the promised land? [19:19] Yeah, it's... [19:21] it's an interesting question, since... [19:25] the term agents has been floating around, [19:28] you know, within... [19:30] the research community for a while. I mean, I think... [19:33] well... [19:34] since kind of the start of AI, uh, but [19:37] I primarily have been thinking of agents in the context of the deep learning era, so starting with... [19:42] DQNs. [19:43] And... [19:45] the... [19:46] definition [19:49] is... [19:50] it's pretty simple in that it's... [19:53] it's an AI system

19:54-21:25

[19:54] that is able to... [19:56] reason on its own and take, well, however many steps it needs to [20:00] to accomplish some goal that's been specified to it. [20:02] That's kind of... [20:03] it. [20:04] And [20:06] now, the way that goal is specified has changed over time. In the deep reinforcement learning era, [20:10] the goal was usually specified through a reward function. [20:13] So for AlphaGo, it's whether you... [20:15] won the game of Go or not. [20:16] No one wrote, like, "go win the game of Go" via text. [20:21] Uh, [20:22] so that's how people usually thought about agents. They thought of agents... [20:26] as [20:27] things that are optimizing a reward function. [20:30] Um, [20:31] but there's a whole area of research, even then, before language models, on goal-conditioned agents. So these would be... [20:37] um... [20:38] either in robotics or video games, where you set a goal for the robot, which could be [20:43] you give it an image of, you know, [20:45] an apple having been moved somewhere, [20:47] and you ask it to kind of reproduce that image, and it has to act on the world [20:52] and [20:52] pick up an apple and move it somewhere in order to accomplish the goal. [20:55] So, [20:56] short definition, it's [20:58] AI systems that have to act in an environment to accomplish some goal. That's an agent. [21:02] Mm. [21:03] And then I guess as a follow-up, [21:06] if you take, for example, coding agents as one... [21:09] potential domain of agents where there's been a lot of activity recently, [21:12] you could say the goal is: create a [21:16] calculator app for me. [21:17] Yeah. And the agent has to go and accomplish the task. [21:21] When I look at what SWE-agent and what Devin have done,

21:26-22:58

[21:26] um, [21:26] is that, in your mind, is that agentic reasoning, and does kind of scaling that up [21:31] get us to the promised land, or do you think there are kind of [21:34] different approaches that you need, more on the RL side or more on, [21:38] or whatever other techniques you might need to use, to get us to the promised land? Because I think those agents are still in the 13%, 14% [21:45] task completion rate [21:47] range, and I'm curious how we get them to 99%. [21:51] They're definitely, by the definition of agents, these are agents. They're just on a spectrum of capability, maybe not... [21:58] at a high level of reliability yet. [22:01] I think the way most people think about agents today, [22:04] in the [22:05] context of language models, is [22:07] prompted agents. [22:08] So [22:09] you take a model, [22:10] you prompt it, or you set up some flow of several prompts, to get it to accomplish a task. [22:16] That allows anyone to kind of [22:18] take a language model and [22:20] take it from zero to something that's kind of working somewhere. Um, [22:23] so I think that's quite interesting. [22:25] I think it can only go so far. So this, I think, is actually, like... [22:28] very... [22:30] I mean, this is a kind of example of what I think the bitter lesson would apply to, because... [22:36] prompting things and kind of really directing them to go in, like, these specific ways, that's exactly the kinds of [22:42] heuristics that we have, [22:43] that we're baking into these models to kind of try to achieve higher intelligence. [22:47] Um, [22:48] I mean, every major advance in agency since the deep learning era, [22:53] um, [22:54] has been kind of showing that with learning and, uh,

22:58-24:29

[22:58] search, [22:59] a lot of that gets washed away. [23:01] I think that... [23:03] the purpose of a prompt is to specify the goal. [23:06] So... [23:07] you'll always need a prompt. You always need to tell... [23:09] an agent what to do. [23:11] But... [23:12] if... [23:13] once you start deviating from that and the purpose of the prompt is actually to put the agent on rails, [23:18] and, you know, [23:20] where you're kind of doing the thinking for it, right? You're kind of telling it, okay, now just go here and do this thing. [23:25] Um, [23:26] that, I think, is going to disappear. Um, [23:29] I think that's a local thing that's happening today. In [23:31] future systems, [23:33] I don't think we'll have that. [23:35] So the key is that the thinking and the planning needs to happen in the AI system, not in the prompt layer, [23:40] in order to not hit a wall. [23:42] I think you want to offload as much as possible to the AI system itself. [23:49] Again, these language models have not... [23:52] they were never trained for agency. They were trained for chat interaction and predicting things on the Internet. [23:58] It's almost a miracle that you can prompt your way to getting something that kind of works. [24:02] But... [24:03] what's interesting is that [24:05] once you're able to prompt your way to something that kind of works, that's actually the best... [24:09] place to start [24:10] for a reinforcement learning algorithm. A reinforcement learning algorithm, all it does [24:14] is it reinforces good behavior [24:16] and it kind of [24:18] downweights bad behavior. [24:20] If you have an agent that is doing zero, that is just doing nothing, then there's no good behavior to upweight, and so the algorithm doesn't work. This is... [24:26] known as a sparse reward problem. If you're not hitting your reward,

24:29-26:03

[24:29] like, if you're not accomplishing your task ever, [24:32] then there's nothing to learn from. [24:34] But if you've prompted your way to an agent that is kind of working... [24:39] like, [24:39] you know, SWE-agent or something like this, it's getting 13%, [24:43] something like this, [24:44] then... [24:45] you have something that is, like, minimally capable, where you can reinforce really good behavior. [24:50] The challenge becomes a data challenge of where do you... [24:53] where do you get the set of prompts to train on? [24:56] Um, [24:57] where do you get the environment to run these things through? [25:00] I guess... [25:02] SWE-agent does come with their environment. [25:04] But for many problems, like, you need to think about that. [25:07] Um... [25:08] And then... [25:09] perhaps the biggest challenge [25:12] is [25:12] how do you verify that a thing has been done correctly or not in a scalable way? [25:16] And if you can solve... [25:19] basically where the tasks come from, which usually, that's your product, so that's solvable, [25:23] um, [25:24] what environment you run them through, [25:26] uh... [25:27] what algorithm you use, but it's really kind of what environment you run them through, [25:31] and then, critically, [25:32] how do you verify if the thing has been done correctly or not in a scalable way? I think that's a recipe for agency. [25:39] I think that gets to the crux of the problem space in AI agents today. [25:43] Um, [25:44] just to set the stage a little bit for the problem that Reflection is going after, [25:48] what do you think is the current state of the market broadly in AI agents? [25:52] I think many assume that we are capable of more than we actually are [25:56] with the models that exist today. [25:58] So what do you think the problem is, and what do you think is...

26:04-27:34

[26:04] Why do you think the current attempts around AI agents are failing us today? [26:09] One way to... [26:12] categorize or classify what it means to be [26:15] a general agent... [26:16] and maybe I'll use the term [26:18] universal agent, since... [26:20] I'll use the term generality to apply to breadth. So... [26:23] a universal agent needs to be a broad, a very general agent that can do many things, can handle many inputs. [26:30] But it also needs to have depth in the kind of task complexity it can achieve. [26:35] And [26:36] so examples are: [26:38] AlphaGo is probably the deepest agent that has ever been built. It can do one task. So not that useful. [26:45] Um, [26:46] it can play [26:47] Go, but not tic-tac-toe. Yeah. [26:51] The current systems... [26:52] language model systems like Gemini, Claude, ChatGPT, [26:57] the GPT series of models, [26:59] lean the other way. They're very broad. [27:02] They're not very capable in a depth-wise sense. They're extremely impressive, [27:07] capable, broadly. [27:09] And I think that's one of the things that's been... [27:11] honestly, miraculous. Like, as I said, [27:14] the field... [27:16] felt [27:17] like [27:17] we did not have an answer to generality. [27:20] And then these objects came along. [27:23] But now we're at the opposite [27:25] end of the spectrum. [27:27] We have, I think, [27:28] more or less de-risked, [27:30] as a field, [27:32] progress towards breadth.

27:34-29:05

[27:34] Um, [27:35] That's especially evident with the latest generations of models like [27:39] GPT-4o and the latest family of Gemini models, that [27:42] are... [27:43] multimodal in the sense that [27:46] they just understand other modalities at the same base layer that they understand language. You don't need to translate one modality into language. [27:53] So that's, I'd call it breadth. [27:55] Um, [27:56] But nowhere along this process were things trained for depth. [28:00] There's no... [28:01] The Internet doesn't have real data around how to kind of [28:05] think sequentially. [28:07] Um, [28:08] The way people try to solve this problem is like work on data sets that... [28:12] might have the structure [28:14] and hope it generalizes. So math datasets, coding datasets, [28:18] kind of what people refer to as reasoning, which usually is [28:21] reasoning along the lines of can you solve a mathematical problem? [28:27] But that's still not really addressing... [28:31] the problem head on. I think we need methods that [28:34] You can take [28:35] recipes that, say, [28:36] They're general in that you can take any task category, [28:39] um [28:40] have a bunch of prompts for it for your training data, [28:43] Um, [28:44] And, [28:46] make [28:47] a language model [28:49] kind of iteratively capable, more capable on those things. Um, [28:54] I think someone needs to solve the kind of depth problem. [28:57] and [28:58] The field as a whole, I think, is... [29:00] or large labs have been [29:03] have been really working on the breadth.

29:05-30:36

[29:05] Let's see. [29:07] That's amazing, and there's a big market for that, and a lot of very useful things that get unlocked, but... [29:13] Someone needs to solve the depth problem too. [29:15] I think that takes us really nicely into the unique insight that you and Ioannis have [29:21] from working on AlphaGo, AlphaZero, and on Gemini, [29:24] and the importance of [29:26] post-training and data. [29:29] Can you share a little bit more about [29:30] how those [29:32] experiences have shaped the unique perspective that you have that gets us to the unlock with agentic capabilities? [29:38] One of the things I found very surprising about... [29:41] language models is how [29:44] Close they are to... [29:47] how oftentimes, even if they're not working on [29:49] something that you want them to, [29:51] They're actually quite close. They feel like a nudge away. [29:54] Um, [29:55] I feel like they need to be grounded in the thing a bit better. [29:57] And that's [29:58] I think that was the insight that led them to be good in chat. Like... [30:02] You could play with them, and they're, yeah, a bit unreliable, and they kind of go off the rails sometimes, but they're almost... [30:08] Good [30:08] chat companions. Yeah. [30:10] And so... [30:12] Then there's a recipe for how do you take a pre-trained language model and make it a reliable... [30:17] chatbot. So, [30:18] By reliability there, it's just a... [30:21] the way you measure that is with human preferences. Do [30:24] people interacting with this chatbot prefer it over other chatbots or [30:28] other versions, the previous versions, of itself? So if the current version is much more preferred than the, [30:33] you know, last few iterations ago, then you know you made progress.
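
The measurement described here, whether raters prefer the current chatbot over earlier versions of itself, can be sketched as a pairwise win rate. This is only an illustration under stated assumptions: the function name `win_rate`, the version labels, and the sample votes are hypothetical and do not come from the conversation.

```python
def win_rate(votes, candidate):
    """Fraction of pairwise comparisons involving `candidate` that it wins.

    `votes` is a list of (winner, loser) pairs collected from human raters.
    Returns 0.0 when the candidate appears in no comparison.
    """
    wins = sum(1 for winner, _ in votes if winner == candidate)
    losses = sum(1 for _, loser in votes if loser == candidate)
    total = wins + losses
    return wins / total if total else 0.0


# Hypothetical rater votes comparing a new model version ("v2") with an older one ("v1").
votes = [("v2", "v1"), ("v2", "v1"), ("v1", "v2"), ("v2", "v1")]
print(win_rate(votes, "v2"))  # 0.75
```

A win rate well above 0.5 against the previous version is the kind of signal the speaker calls progress; real evaluations would also need enough votes to rule out noise.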

30:37-32:09

[30:37] and [30:38] That progress is made by... [30:42] collecting data for it. So it's collecting data for [30:45] the kind of, [30:46] you know, queries that users input into a chat box, [30:50] the outputs that the models provide, [30:53] and, in effect, a ranking between those outputs so that you push [30:57] the model [30:59] to, you know, index over on the kind of [31:02] more preferred outputs. So, [31:04] When we say ranking, where does that ranking come from? Well, it comes from humans. [31:08] So there are either human labelers or it's something that's embedded into the product. You sometimes might see [31:13] thumbs up or thumbs down in, um, [31:15] ChatGPT. [31:17] it's harvesting your thumbs to kind of know what your, what your preferences are, uh, [31:22] And that data is used to kind of align the model with the user preferences. [31:26] That's a very general algorithm. That's a, that's a reinforcement learning algorithm. And that's why it's called reinforcement learning from human feedback, or RLHF. [31:33] You're just... [31:34] upweighting the things that [31:36] human feedback is [31:37] expressing preferences for. [31:39] There's no reason why the same approach [31:42] would not be possible [31:44] for [31:46] for enabling more reliable agency. [31:49] There's... [31:50] A whole... [31:51] sequence of other problems you need to solve. I think that's what made... The reason this is so hard is because as soon as you... [31:57] go into kind of agent territory, you have [32:00] more than just the language outputs, you have the tools that they interact with. [32:03] and, [32:05] The tools being, suppose you wanted to send an email or work on an IDE,

32:09-33:40

[32:09] or anything that an agent does, it does in an environment, and that requires tools, and it requires the environment. [32:15] and everyone who's deploying agents is deploying agents in different environments. [32:19] And so there's a challenge of [32:22] how do you integrate with environments and [32:25] How do you onboard agency onto them? [32:27] So I think that's why [32:29] It's a bit of a schlep if you get into this kind of line of work. [32:32] and you have to be [32:35] careful about the environments and kind of, and, [32:38] yeah the way you structure it because you don't want to overfit to some you know [32:41] some particular like environment [32:44] But... [32:45] conceptually, [32:47] It looks very similar to aligning a model for chat. [32:50] They're just... [32:51] some more integration challenges that need to be solved along the way. Mm-hmm. [32:56] Since you view AlphaGo as kind of like the pinnacle of... [32:59] building an agent that was truly [33:01] Capable. I imagine you're trying to usher in an AlphaGo moment. [33:05] with LLM's [33:07] What do you think are the differences? Like, to me, you know, with gameplay, you have a very clear reward function. You have... [33:14] the ability to do self-play, [33:16] Like, [33:16] Is doing kind of the reinforcement learning from human feedback, do you think that's enough to kind of get us to an AlphaGo moment? [33:22] in LLMs or like, I guess, how should I think of the differences here? [33:27] I think what you said around... [33:29] Not having a ground truth reward is... [33:32] A key and maybe the key thing. [33:34] Um, [33:35] What we learned from the previous era of reinforcement learning research is that if you have a ground truth reward,

33:41-35:12

[33:41] You're... [33:42] kind of guaranteed success. Like, that's kind of... [33:45] There have been so many very... [33:48] Impressive, [33:49] um, [33:50] projects that showed this at [33:52] Really... [33:52] unprecedented scale. I mean, you think aside from AlphaGo, there was [33:55] OpenAI's Dota 5 or AlphaStar and... [33:59] let's say AlphaStar and Dota 5 are a bit more niche in the sense that you kind of have to play those games to understand, but [34:05] As a... [34:06] former StarCraft player, I was... [34:09] I still am completely blown away by AlphaStar. Like... [34:12] The strategies that the AI discovered were, it just looked like, [34:17] a very smart, like a smarter-than-us alien came, [34:20] like, upon the earth, [34:22] decided to play this game, [34:23] and completely out-competed us. [34:27] So, [34:28] That's due to the existence of... [34:31] A number of things, but a ground truth reward is really, like, is extremely important for tightening that [34:35] behavior. [34:36] um [34:37] Now, [34:40] both with human preferences and for agency, [34:42] These are very general objects and we don't have ground truth rewards for whether something is accomplished or not. [34:49] For a coding task, what's the ground truth [34:51] of whether this was done the right way? [34:53] Like, it could pass some unit tests, but it can still be wrong. [34:57] It's a really hard problem. [34:58] And I think it's the [35:02] I think it's the fundamental problem for agency. [35:05] There are others as well, but this is kind of the big one. [35:08] Um, [35:09] the way [35:10] You get around this problem

35:13-36:45

[35:13] for chat, [35:14] Um, [35:15] this is again through RLHF, [35:16] Well, you train reward models. [35:18] um, [35:19] Reward models are, it's a language model that... [35:22] predicts [35:23] whether something was done correctly or not. [35:26] The challenge with that [35:28] So first, it works well. [35:30] The challenge with that is that [35:32] When you don't, in the absence of a ground truth, when you have this kind of noisy thing [35:35] that can be wrong, [35:38] your [35:39] policy or the agent [35:42] um [35:44] Quickly gets smart enough where it finds holes in the reward model and exploits them. Um... [35:48] To give a concrete example in chat, suppose you... [35:51] Notice that... [35:53] your [35:54] chatbot was outputting, um, [35:56] you know, [35:57] was... [35:57] Outputting some, let's say, harmful content or... [36:00] Uh... [36:01] Like, [36:02] There are some topics that you don't want it to talk about because they might be sensitive. And so you put in some data into your data mix saying, like where it's examples of the chatbot kind of ignoring, or not ignoring, but saying, I'm sorry, as a language model, I cannot answer this. [36:14] What can happen is that, okay, you now train a reward model against this and [36:18] Suppose in your data mix you really, [36:21] only put in data points that [36:24] showed [36:26] showed instances of like this kind of [36:30] This happening, [36:31] but not... [36:31] instances of [36:33] um [36:34] a chatbot taking something like... [36:35] kind of sensitive and actually like, you know, answering it. [36:38] What that means is that the reward model could think that [36:41] It's actually like a good thing when you just don't answer the user's query ever.

36:45-38:16

[36:45] because I've only seen positive use cases of that. [36:48] And when you train against that, the policy [36:51] with language model will [36:53] at some point get smart enough and discover that [36:55] This reward model gives me high reward whenever I just... [36:59] Don't answer. Whenever I punt the question. [37:01] and it can collapse into a language model that just never answers. [37:05] Your questions. [37:06] And... [37:08] This is why it's very finicky and it's very difficult for this reason. [37:13] Um, [37:14] I'm sure a lot of users who [37:16] Have. [37:17] interact with ChatGPT or Gemini or these kinds of models, like probably through... [37:22] through interacting with them, [37:24] found sometimes that they kind of degrade [37:27] and they all of a sudden, like, [37:29] Don't answer questions as often as they used to. Get slightly worse at something. [37:34] or [37:35] you know, are politically biased in some way. [37:37] and [37:39] I think a lot of that is, well, it's artifacts of the data. [37:42] But the... [37:44] Artifacts in the data get amplified by bad reward functions. [37:47] So, [37:49] That is the hardest problem, I think. If I view the rough kind of... [37:54] large model training pipeline or large AI system training pipeline as pre-training. [37:59] and post-training. [38:00] Um... [38:01] I kind of think... [38:03] Like, pre-training seems largely like a solved, like, we're in the... [38:07] You know, the techniques are solved and we're just kind of in the race to scale. [38:11] Moments on the pre-training side. [38:13] Post-training still feels a little bit like in the kind of research...

38:16-39:48

[38:16] phase of market where people are still figuring out what techniques will work in a general way. [38:21] I'm curious if you agree with that. And in an ideal state... [38:25] Like, what is pre-training responsible for doing? How should we as laymen think about it? [38:29] And what is post-training responsible for accomplishing, and how should we... [38:34] think about that from the perspective of a five-year-old? [38:36] Yeah. [38:41] I would generally agree with that statement that pre-training [38:45] has become, [38:47] There are a lot of details that need to get right, and it's by no means easy, so it's a very hard endeavor. [38:52] Um, but it's a better understood endeavor at this point. Um, [38:56] And [38:58] One way [39:00] that I think about pre-training is [39:03] I actually think thinking about it through the lens of something like AlphaGo, [39:07] Um... [39:09] is, is [39:10] quite [39:11] simple and clear because it kind of, you know, [39:14] Rather than thinking about this massive internet thing, you just think about a very clear setting, which is clean setting, which is this game. [39:21] Um, [39:22] AlphaGo has two phases: [39:25] an imitation learning phase where [39:27] a neural network imitates a bunch of expert [39:30] amateur, like expert Go players, [39:34] and then it has this reinforcement learning phase. [39:36] You can think about pre-training as... [39:38] the imitation learning phase of AlphaGo. You're just kind of acquiring the basic skill of learning to play the game. You're not maybe, your neural network then is not the best in the world,

39:48-41:23

[39:48] Um, [39:49] But it's pretty good. It goes from zero to pretty good. [39:52] And... [39:53] Pre-training for a language model is going from zero to pretty good on everything, [39:57] um [39:58] which is why it's so powerful. [40:00] Post-training, [40:02] is [40:04] I think about it as hardening good behavior. [40:07] What that means is [40:08] With AlphaGo, you did imitation learning. [40:11] You start off at a place where you have a neural network that... [40:14] can do something. It can, I mean, [40:16] It can play a game pretty well. [40:18] then you apply this other recipe to it, which is reinforcement learning, [40:21] which is then the network starts generating its own plans and [40:27] kind of acting through the game, getting feedback, and that could... [40:30] and basically good [40:32] actions get reinforced. [40:35] That is, I would say that that's post-training. And you can think about [40:38] From a chat perspective, [40:40] you're hardening [40:41] the model, like... [40:43] the good behavior along the chat axis. [40:47] It's actually... [40:48] quite interesting that [40:50] the high-level recipe [40:51] for training [40:52] AlphaGo and for training Gemini is actually the same. You have this imitation learning phase, [40:57] And then you have a reinforcement learning phase. [40:59] Mm-hmm. [41:00] The reinforcement learning phase in AlphaGo is just much more sophisticated [41:03] than what we have now, [41:05] And the reason comes back to reward models. [41:08] If you have a reward model [41:11] That is... [41:12] that is fairly noisy and exploitable, [41:16] then [41:17] There's only so much work. There's only so much you can do before the policy gets smart and finds a way to trick it.
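
The failure mode described across this part of the conversation, a policy exploiting a flawed reward model until it refuses everything, can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration: the prompt mix, the reward values, and the hill-climbing loop are hypothetical stand-ins, not how any production RLHF system works.

```python
# Toy illustration of reward-model exploitation. All numbers are made up.
PROMPTS = ["benign"] * 9 + ["sensitive"] * 1


def true_reward(prompt, action):
    # Ground truth: refuse only sensitive prompts, answer everything else.
    if prompt == "sensitive":
        return 1.0 if action == "refuse" else 0.0
    return 1.0 if action == "answer" else 0.0


def proxy_reward(prompt, action):
    # Flawed reward model: it only ever saw refusals labelled as good,
    # so it scores refusing highly regardless of the prompt.
    return 1.0 if action == "refuse" else 0.5


def expected(reward_fn, p_refuse):
    # Average reward of a policy that refuses with probability p_refuse.
    total = 0.0
    for prompt in PROMPTS:
        total += p_refuse * reward_fn(prompt, "refuse")
        total += (1 - p_refuse) * reward_fn(prompt, "answer")
    return total / len(PROMPTS)


def optimize_against_proxy(level=1, iters=20):
    # Hill climbing on the proxy reward only; true_reward is never consulted.
    # The refusal probability is level / 10, with level in 0..10.
    history = []
    for _ in range(iters):
        candidate = min(10, level + 1)
        if expected(proxy_reward, candidate / 10) > expected(proxy_reward, level / 10):
            level = candidate
        p = level / 10
        history.append((p, expected(proxy_reward, p), expected(true_reward, p)))
    return history


history = optimize_against_proxy()
for p, proxy, true in history[:10]:
    print(f"p_refuse={p:.1f}  proxy={proxy:.2f}  true={true:.2f}")
```

In this sketch the proxy score rises at every step while the true score falls, ending with a policy that always refuses: the collapse the speaker describes, in miniature.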

41:23-42:53

[41:23] And so... [41:24] even if you threw the fanciest RL algorithm at it, [41:28] like Monte Carlo Tree Search with AlphaGo, [41:30] um [41:32] it may not be that effective because it kind of... [41:35] Um, [41:36] it collapses into this kind of degenerate state where the policy hacks the reward model [41:40] before it could even do any interesting search. Like suppose you're thinking about [41:45] Like if you were playing chess, [41:47] and you're thinking about what to do multiple moves ahead, [41:49] but your kind of judgment [41:51] is really bad at every move. [41:53] Then... [41:54] then there's no, like... [41:55] There's no point of planning ten moves ahead. [41:57] And I think that's where we are with RLHF today. [42:01] There's this [42:01] wonderful paper that I think is [42:04] very over or underrated [42:07] Um... [42:08] called Scaling Laws for [42:10] Reward Model Overoptimization. [42:12] This is a paper from OpenAI [42:14] studying this phenomenon. [42:15] um, [42:16] And what's interesting about, I mean, a number of things, but it showed that this phenomenon happens [42:22] at all scales. [42:23] And I mean, they tried a couple of different RLHF algorithms and... [42:27] It happens at all scales for all algorithms that were tried in that paper. [42:31] and [42:33] I think it's such an interesting paper because it's the... [42:36] It's the kind of fundamental problem of post-training. It's that paper. [42:40] Um... [42:41] Yeah. [42:42] Just to pull on the thread a little bit, if you follow the results from AlphaZero, though, then we may not need... [42:47] pre-training at all? Is that a fair... [42:50] Conclusion of... [42:52] of what to make of this?

42:54-44:28

[42:54] I think that... [42:56] At least my mental model is that... [42:58] the AlphaGo part, the imitation learning, is necessary, [43:02] Um, [43:04] more from a practicality standpoint. Um, [43:08] when [43:09] When we went from, or DeepMind went from AlphaGo to AlphaStar, [43:13] There was no... [43:15] AlphaZero of AlphaStar. [43:17] There's no AlphaStar Zero [43:20] Um... [43:22] after that or anything like this, and [43:24] AlphaStar... [43:26] had like a big part of it was imitation learning across a lot of games. [43:30] I think with AlphaGo, it was like this kind of special place where you have... [43:35] You don't only have a zero-sum game, [43:37] but you can get to the end of that game, like, fairly quickly. And so it's, again, like, you can get that feedback about whether what you did... [43:45] Yeah. [43:46] was right or not. [43:47] Got it. Okay, so it's just way too unconstrained of a problem to throw it out, generally. Yeah, I think in practice, like... [43:52] Yeah, AlphaZero [43:54] would work [43:55] generally for everything if we had, um, ground truth reward functions for everything. [43:59] But because we don't, [44:01] Um, [44:02] You need to do the imitation learning piece [44:05] as almost, this is just like a [44:06] practical [44:08] We need to get into the game somehow. [44:10] You described earlier the importance from a technical perspective of having an agent in its environment. [44:15] Also from a product distribution and getting the product in users' hands perspective, [44:21] It's important to... [44:22] think about what the right task categories are for users to first interact with the most powerful agents.

44:28-45:57

[44:28] What are some of the task categories that are on your mind? [44:31] And what do you imagine are some of the possibilities that users could use these [44:35] in their daily workflow? [44:37] If you want to make progress along the depth axis, um... [44:41] You could go for like AlphaGo first, which is like a really hard thing, [44:45] or you could kind of expand concentrically [44:47] in the sort of complexity of the tasks you're able to handle, and [44:51] We are focused on kind of enabling depth, [44:54] but in this sort of concentric way. [44:56] And... [44:57] We care a lot about having a general recipe [45:00] that [45:01] is not... [45:04] that does not kind of, [45:05] you know, [45:07] inherit heuristics [45:08] that are special to some tasks. So... [45:11] From a research perspective, we [45:14] We're building... [45:15] general recipes for this. [45:17] Now, [45:18] You have to ground those recipes in something to show progress. [45:21] And... [45:22] At least for us, it's important to show [45:25] diversity [45:26] of environments. And so [45:29] we're thinking about [45:31] a number of different types of agents, um, [45:34] web agents, [45:35] coding agents, [45:37] Um... [45:38] OS computer agents. [45:40] um [45:41] The important thing for us is to just show that [45:44] uh, [45:47] you can have a general recipe [45:49] for enabling agency. [45:51] Switching gears a little bit, [45:53] You've attracted a stellar team already. [45:55] Who else are you looking to recruit [45:57] on your team?

45:58-47:28

[45:58] Yeah, we've been fortunate to... [46:02] be able to draw... [46:03] Um, [46:04] some talent from... [46:06] the top AI labs, um, [46:08] in the industry. [46:10] uh, [46:11] And... [46:13] I think a lot of that has to do with [46:16] um [46:17] Well, with both the work that Ioannis and I did, [46:20] But... [46:21] Definitely, I think a lot of credit goes to Ioannis and his reputation. [46:26] You know that there's this, I was watching the Michael Jordan documentary, and... [46:32] Michael Jordan was... [46:34] But one of the reasons he was so effective is because he was such... [46:37] an incredible kind of individual, [46:39] basically, contributor to the game, maybe the best, [46:42] Um, [46:44] that he really... [46:45] inspired [46:47] people on his team to get to his level, [46:49] Even if they couldn't [46:50] get there. [46:51] Uh. [46:52] and [46:53] Ioannis has this effect on people. [46:55] Like I worked very closely with him on Gemini and he had that effect on me. [46:59] uh, [47:00] I don't know if I ever got to Ioannis's level, but... [47:04] I aspired to, and I definitely... [47:07] became a much better... [47:09] engineer and researcher through the process. And I think that's [47:12] uh... [47:13] a lot of the draw is that [47:17] you get to learn a lot from him. [47:19] we're [47:21] primarily continuing to look for, so we're not hiring out [47:25] Um... [47:26] Quickly, we're hiring out, I think, um,

47:29-49:02

[47:29] more methodically. [47:31] um, [47:32] We're looking for... [47:34] Yeah, definitely interested in other... [47:36] Uh, [47:37] researchers and engineers joining us on this mission. [47:42] I'd say a commonality between everyone who's joined is... [47:47] We're all very hungry. Maybe that's how I'd put it. [47:52] Ioannis and I could have stayed and tried [47:54] pushing agents, [47:55] you know, at DeepMind, [47:58] And as I said, I think the reason we decided to do it [48:01] in our own way is because we [48:03] we think [48:04] we can move quickly [48:06] and much faster against this goal. [48:08] and [48:09] Some of this urgency is driven by a [48:12] a real belief that we are [48:15] We are three or so years away from, [48:17] from something that [48:20] resembles a digital AGI. [48:22] And by that, that's a... [48:25] That's what I've been referring to as universal agents. Something that has both this kind of breadth [48:29] and depth of knowledge. [48:32] and [48:33] That means we're actually on a very accelerated timeline. [48:37] Yeah. [48:38] you're, you know, [48:40] A few months in, you're kind of 5% away from... [48:43] from [48:44] hitting that timeline and [48:46] Maybe some of this is also driven by [48:49] How quickly AlphaGo went [48:50] from [48:53] Experts in the field [48:54] Doubting this is possible. Yeah. I think it's kind of decades ago. [48:58] human level or expert human level Go play was decades away.

49:02-50:34

[49:02] And... [49:03] how... [49:04] effectively they were able to solve that problem within months. [49:07] I think... [49:08] we're seeing a similar kind of acceleration happening with language models. There's [49:13] ... [49:14] One viewpoint you can have is that [49:16] we've saturated [49:17] Um... [49:18] a lot of what we can. We're on this sort of [49:21] at the tail end of an S-curve. [49:23] And we don't view it that way. We think we're [49:27] still, we're still on an exponential. [49:29] Part of the reason is that these things are so bulky [49:31] and slow to train [49:33] that [49:34] There's no way, collectively, as a... [49:38] you know, as a [49:39] field of researchers and engineers, [49:42] that we've [49:43] optimized it yet. [49:45] like [49:46] If it takes a few months to run, and I... [49:48] a few months and a few billion dollars to [49:51] run the biggest model, [49:52] then how many experiments can you really run? [49:57] Yeah. We... [49:58] we kind of see things going at an accelerated pace. [50:01] Um, [50:03] And we think solving the depth and reliability problem is [50:07] Something that... [50:08] is not getting... [50:10] the kind of [50:13] attention it needs, [50:14] Like there are groups that are following this as I would call it more like side quests within these big companies. [50:19] But I think you need a player [50:21] that is focused entirely on it to [50:24] solve this problem. [50:26] I love the framing of main quest versus side quest. [50:29] And I love the... [50:32] and the zero complacency

50:34-52:04

[50:34] and impatience in a healthy way that you and the rest of the team have. [50:37] And the other thing I'd highlight is the revered reputation that you described for Ioannis [50:42] with inspiring and motivating other people, I think is true for you and Ioannis [50:47] from everyone we know at DeepMind. [50:50] So three years until I have an agent that will write my memos for me? Hopefully less. I think three years. Yeah. I think the memos might be coming sooner. Because that was one of my burning questions. Is this like decades away? [51:04] It sounds like you're closer to the months or small number of years away. [51:10] I think small number of years. Wow. [51:12] Yeah, it... [51:15] Yeah, it's honestly kind of alarming, I think, the speed at which the field is moving. [51:20] And... [51:21] part of, yeah, [51:22] Part of depth and reliability, it's also like, [51:26] It is... [51:27] I mean, reliability is safety. [51:29] So you want to [51:32] You want these systems to be safe. I think that... [51:34] There's a lot of very interesting research in terms of [51:38] There's a recent paper from Anthropic on kind of mechanistic interpretability and that whole line of work is... [51:44] really interesting and I think starting to kind of get to the point where [51:47] There's utility [51:49] in it as well in terms of [51:52] finding like neurons in the model that are [51:54] Like, [51:55] you know, lying neurons or, you know, that you can kind of suppress. [51:59] But... [52:02] To me... [52:03] safety is reliability.

52:05-53:35

[52:05] If the thing is kind of running around your computer, [52:07] breaking all sorts of things. That's an unsafe system. Um, [52:11] Maybe it's like a... [52:12] utilitarian safety. Like you just want these things to work. And, uh, [52:16] and do what you intended them. [52:18] by what you ask them to do. [52:19] So I have a few years to find another hobby other than my memo writing then. [52:23] Yeah, well, or maybe you'll just have an army of... [52:27] AI interns that, um, [52:29] You know, we'll do all the research work for you. Can't wait. [52:34] So wrapping up our topic around reflection, [52:37] If everything goes right, what is your dream for reflection? [52:41] I think the... [52:43] They're [52:44] two angles at this question. [52:46] One is... [52:48] We're working on this because this is the... [52:52] kind of scientific root-node problem of our time. For scientists, we're not going to be [52:56] That's why we're so kind of interested and committed in it and [53:00] It's really there's. [53:02] a world where you get to be part of... [53:06] One of the most exciting journeys in science ever made. [53:09] And you've accomplished your goal of building Universal Agent. You have [53:14] highly safe, reliable [53:16] digital agents running around on your computer, [53:18] Um, [53:20] basically doing things that... [53:24] tedious work that you don't necessarily want to do you kind of [53:27] I think [53:28] rather than people going and [53:30] Uh... [53:31] Yeah. [53:33] spending less time working. I don't think...

53:35-55:06

[53:35] The human [53:37] need or like [53:38] and the human need to be productive and to contribute is going to change. [53:42] I just think the capacity of each human's ability to produce energy [53:46] And... [53:47] you know, [53:48] impact the world is going to dramatically increase as, you know, in my line of work there [53:53] As a researcher, there are so many things that... [53:55] I spend time on [53:57] that [53:58] a smarter AI could help me out with to make [54:01] faster progress towards our own goal. [54:03] I... [54:03] I mean, this is kind of circular, but if we had... [54:06] Something close to... [54:07] a digital AGI, [54:08] we'd get much faster through solving the problem of digital AGI. [54:12] That's one angle. [54:13] I think the... [54:15] The other angle is from [54:18] I guess we've kind of moved on to the other angle. It's from the user perspective of, [54:24] A lot of the things that we do on a computer... [54:27] are [54:29] you know, our [54:30] Maybe you can think about computers like the first digital tool that we've been introduced to as a... [54:36] you know [54:37] and [54:37] As people, [54:38] In the same way that [54:40] There were... [54:41] you know, [54:42] hammers and chisels and [54:45] sickles [54:46] that people used. [54:48] And I think we're moving towards... [54:51] the kind of layer beyond that [54:53] where instead of you having to [54:55] kind of [54:56] learn how to use all these tools with great precision and spend all your time on this, which [55:00] actually is kind of [55:02] time taken away from achieving, [55:03] you know, [55:04] whatever personal goals people have.

55:06-56:37

[55:06] Um, [55:07] that you kind of have these... [55:09] incredibly [55:10] helpful [55:11] AI agents. [55:13] that [55:15] can help you bring [55:17] kind of any goal [55:19] that you have to fruition. [55:22] And I think it's very exciting because I think the kind of ambition – [55:25] of our individual goals is going to be [55:28] It's already increasing in this local sense. A software engineer can get a lot more done today with these tools. [55:35] This is just the beginning, and I think we'll [55:38] Yeah. [55:39] will be able to [55:40] Really... [55:42] set dramatically more ambitious goals for ourselves and for kind of these sorts of things we want to achieve. [55:49] um [55:50] simply because [55:51] we can... [55:53] offload a lot of the work that's needed to get there to these systems. So [55:57] These are some things I'm really excited about. [56:01] We'll close it out with a few questions that we like to ask everybody about the state of AI. First question, what are you most excited about? [56:09] in AI in your field or more broadly in the next one year, five years, and ten years? [56:18] I think they're [56:19] a number of things that... [56:21] The local one that comes to mind, because the paper is fresh, is this kind of work on mechanistic interpretability in that... [56:30] I mean, these models are... [56:31] largely black boxes, [56:33] And, um... [56:36] It's unclear...

56:38-58:08

[56:38] It's really unclear how to study them as like, what's the neuroscience of language models? Like if you think about them as brains. [56:44] And [56:45] This seems like a really... [56:47] interesting line of work that [56:49] Um... [56:50] is now starting to see [56:51] kind of signs of [56:53] Um... [56:54] Well, the signs of it working beyond toy settings. Yeah. [56:57] So maybe like the sort of neuroscience of language models is, I think, kind of a really interesting thing, [57:03] an awesome field in AI to get into. And more generally, [57:07] If I was in academia, [57:09] I'd probably be looking a lot at the science of [57:14] AI, so the neuroscience of AI is one thing, but... [57:17] there are all sorts of [57:20] all sorts of things one can investigate in terms of [57:23] Uh... [57:25] well, what really determines the scaling laws that these models have? [57:29] both from a [57:30] theoretical perspective and from an empirical perspective, how you change data mixes, maybe [57:35] kind of taking a step back. [57:37] We're [57:39] Um... [57:41] We're basically in the, like... [57:43] equivalent of what the late 1800s looked like for physics. [57:48] electricity was being discovered. [57:50] No one knew [57:52] why it worked or how. [57:53] There's a lot of like... [57:55] there were a lot of empirical results. [57:58] But there was no kind of [58:01] theory behind it, [58:02] which just meant that they were not very well understood. [58:05] And then this very rich set of

58:08-59:40

[58:08] theoretical models were developed [58:10] that were very simple to [58:13] understand these phenomena. [58:15] And... [58:16] that gave rise to actually basically the next wave of, um, [58:21] empirical breakthroughs. [58:22] And so [58:24] I think the science of AI is kind of in that state [58:27] right now and I'm very excited to see where that goes. So interesting. [58:32] Who do you admire most in the world of AI? [58:36] I think most people... [58:38] uh, [58:39] when getting this question, or some people might... [58:42] kind of [58:44] put someone [58:46] Yeah, I mean... [58:47] Maybe I'll take that back and say like, [58:51] I want to emphasize people that I admire [58:54] um... [58:56] based on having worked with them and kind of seeing how they operate, because [58:59] um, [59:00] through my last number of years in AI, [59:05] I think, yeah, there are a handful of people like this who've inspired me. [59:08] um [59:09] And [59:11] one of them is... [59:13] uh [59:15] So Pieter Abbeel is certainly one of them. [59:20] He is... [59:22] I think I've never... [59:24] seen anyone, I think, [59:27] operate as efficiently as Pieter. [59:29] That was something that, and to date. [59:31] Like, since meeting him and to date... [59:34] Like, uh, [59:35] I think there's a sort of... [59:37] You think a lot about research as a creative pursuit oftentimes.

59:41-1:01:12

[59:41] And I think [59:43] what I learned from Pieter is just... [59:46] sort of operational competence and efficiency around this. [59:49] Um, [59:50] He's very creative as well, and his lab does a lot of creative work, but I think [59:54] It's brought... [59:56] Like. [59:56] These things are very hard and they need to be pushed hard [1:00:00] with great focus. [1:00:02] And [1:00:03] he... [1:00:05] He ran his lab... It's the tightest ship that I've ever been on. [1:00:09] And... [1:00:10] really helped focus, like, [1:00:14] all the projects. [1:00:16] And so... [1:00:17] Yeah, I think I look up to him a lot, both in terms of his... [1:00:20] Um... [1:00:21] The work that he's done, obviously, it's... [1:00:24] It's remarkably... [1:00:26] cross-field. Like he's both [1:00:29] done, [1:00:29] well, incredible breakthrough work in reinforcement learning [1:00:32] and unsupervised learning, generative modeling. [1:00:35] and [1:00:36] And a lot of it has been, I think, from kind of recognizing and enabling, like, talent. [1:00:42] there is very much [1:00:44] It was... [1:00:45] Like... [1:00:45] the group was [1:00:47] It was a bunch of independent thinkers, students, PhDs. [1:00:50] And people are kind of [1:00:54] pursuing... [1:00:55] like what was interesting to them, but the way I saw it, like Pieter is a sort of like great amplifier. Yeah. Like he kind of helped people [1:01:02] amplify and focus on the thing that really mattered within... [1:01:06] within their pursuit. [1:01:08] I think in a... [1:01:09] A few other people that come to mind.

1:01:13-1:02:49

[1:01:13] So my manager at DeepMind, his name is Vlad Mnih. [1:01:17] He's um... [1:01:18] Yeah, I think... [1:01:21] Also, like a very... [1:01:23] an incredible, very creative scientist. He was... [1:01:26] first author of the DQN paper. And then there's actually, there are actually two papers at the time. There's a, [1:01:30] like this A2C, A3C papers. These are [1:01:33] These are basically the two algorithms that define reinforcement learning [1:01:36] and he [1:01:38] of deep reinforcement learning. [1:01:40] And he kind of really, um, [1:01:42] pioneered both. [1:01:43] Yeah, I think [1:01:44] his... [1:01:45] strength is like he was [1:01:49] extremely kind and people-oriented, very humble despite his accomplishments. [1:01:55] Ioannis [1:01:56] as well. I mean, definitely. I mean, Ioannis has the Michael Jordan effect. I think he really... [1:02:02] Yeah, he just... [1:02:03] You just wanted to be the best you could be when you worked with him. [1:02:07] And... [1:02:08] Our RLHF team was quite small, [1:02:11] and people pushed really hard, [1:02:13] in order to, like, [1:02:14] Largely, I think, inspired by him. [1:02:17] Yeah, these are some people I really look up to. [1:02:20] Thank you for sharing that. [1:02:22] Um, [1:02:23] It's so interesting to hear what you say about everyone, and one comment on Pieter Abbeel: [1:02:29] I tell him all the time that he's also just created a mafia of founders [1:02:33] in the last couple of years, and it's probably because he's taught them how to [1:02:36] do many things. [1:02:38] And there's a self-selection that's naturally happening, the creative thinkers and the independent thinkers who come into his lab. But he's also taught them a lot about how to run a tight ship and how to focus incredibly well. So I'm sure that doesn't come...

1:02:49-1:04:20

[1:02:49] uh, [1:02:50] without intention on his part. Maybe last question. Any advice you have for founders building in AI? You are... [1:02:58] You're just starting your journey right now, and I'm sure you've asked others for a lot of advice. What advice would you pass on to the next generation? [1:03:06] I think, one, I think I'll be in a better position to answer that question in a few years in a way [1:03:11] um [1:03:12] that is much more meaningful. [1:03:14] But I'll actually provide a piece of advice that I lived through in my previous startup, [1:03:20] um, [1:03:21] which has nothing to do with AI: [1:03:23] to just work on things that, [1:03:25] like, [1:03:26] internally really matter to you [1:03:28] in a way that is... [1:03:30] almost independent from what's happening around you. Like, in a way that... [1:03:35] when things go bad, like it's still interesting to you, like it's kind of a... [1:03:39] There's just some fundamental drive around this problem that, independent of everything else that's happening, [1:03:44] is just really interesting to you. Um, [1:03:47] Maybe I say that about AI because there's this... [1:03:52] It's such a... [1:03:53] interesting, highly capable, cool technology. [1:03:56] And so... [1:03:57] there's this sort of, I think, appeal of taking it and just kind of like, [1:04:01] Well, let's just see what... [1:04:02] what we can do. [1:04:03] Um... [1:04:05] I think you inevitably find yourself in a hard place [1:04:08] without having a very strong internal compass of... [1:04:11] independently of AI, [1:04:12] like what it is that's important to you and what you want to do. [1:04:15] And so, [1:04:17] having been in that position previously, [1:04:19] uh...

1:04:20-1:05:52

[1:04:20] That's [1:04:21] kind of what I would have done differently and what I would advise people to do. [1:04:26] I really love that. The line that I like to think about is, play in your own stadium. [1:04:30] And don't get distracted by the glitz and glamour of someone else's stadium. [1:04:34] You need that [1:04:35] internal drive and [1:04:37] grit and obsession with the problem to get you through all the tough times. [1:04:41] Yeah. [1:04:41] And I think there are things that come with it, like [1:04:44] if you really care about some problem, [1:04:47] like, [1:04:48] you will care about the customers who you're solving the problem for. Like, [1:04:53] having customers that you don't care about is like a terrible place to be. [1:04:57] I think, yeah, it has to come [1:05:00] And it's not like, I think it's kind of actually hard to control who you care about and who you don't care about. That's like a personal thing. Yeah. So you can't, it's actually really hard to force yourself to... [1:05:09] like, [1:05:09] care about something out of necessity if it's kind of not aligned with [1:05:13] something [1:05:15] inside you already. [1:05:16] So no more shopping and retailers for Misha. [1:05:19] Yeah, so I was building software for kind of inventory prediction for retailers and [1:05:25] for some people, [1:05:27] you know, they'd really care about that problem. Like, there's a reason they would. They've kind of seen that problem. They've, um, you know, maybe if you're a merchant at like a, [1:05:35] you know, one of these retail companies, you've really [1:05:38] felt it viscerally. [1:05:39] um [1:05:41] And in my case, you know, I hadn't. It was like we were... [1:05:44] trying to make a, you know, [1:05:47] just trying to build a revenue-generating business almost kind of independent of like an internal...

1:05:52-1:07:03

[1:05:52] um... [1:05:53] compass. [1:05:55] Misha, thank you so much for joining us today. You are working on the most ambitious problem of our time. I love your framing of the root node problem of our time. [1:06:05] And I think that is today agents. [1:06:08] And it's very clear that both your and Ioannis's experiences [1:06:12] make you the very best team at what you do. Ioannis, obviously from an [1:06:17] RLHF perspective and yours from a [1:06:21] reward model training perspective, [1:06:22] and the insights and experiences that you've both had [1:06:26] working on AlphaGo, AlphaZero, and Gemini. [1:06:28] We're so excited for the future of Reflection. [1:06:31] Yeah, thank you for having me. [1:06:33] *music*

Want to learn more?