#1: Thomas Larsen on AI Measurement and Evaluation

Capability evaluations, safety evaluations, preparedness frameworks, and why they're important

and

May 22, 2024

Thomas Larsen, former Director of Strategy at the Center for AI Policy, joined the podcast to discuss capability evaluations, safety evaluations, preparedness frameworks, and why they’re important.

Available on YouTube, Apple Podcasts, Spotify, or any other podcast platform.

Our music is by Micah Rubin (Producer) and John Lisi (Composer).

Highlights

Existing Capabilities

Thomas: I think what impresses me most [about language models] is their ability to do in-context learning. They're not that good at this, but they can see a few examples of something that they weren't trained on, and then figure out—within their lifetime without needing to go through more gradient descent—what they should be doing in order to achieve that task. And that seems like a really worrying preliminary capability that sees directly on the path towards generalized intelligence.

Capability and Safety Evaluations

Thomas: Measuring the capabilities, measuring the progress from where we are now to where we might get a few years from now or a few decades from now is extremely critical.

And there are two types of evaluations, which I think you also mentioned. There's capabilities evaluations, which is “How capable are the systems? What can the system do if it was trying to do that? And then there's alignment evaluations or safety evaluations, which are “What is the AI system in fact going to do?” […]

I think to do effective governance, we should do both of these things. We should be measuring the capabilities and checking when can it start doing these dangerous things. And then we also need the safety evaluations, which are “For a given dangerous thing that we're worried about our AI doing, are our safety techniques adequate for mitigating this?”

Possible Future AI Contributions to AI Research

Thomas: So right now, you have about 1,000 people working for OpenAI. I think a little less than that, but, you know, an order of magnitude of 1,000 people working at OpenAI. Suppose they train GPT-5, and now GPT-5 is as good at doing research generally as an OpenAI engineer. That means that they can very rapidly scale up to, let's say they can run 100,000 instances of GPT-5. That means that they've just scaled up by two orders of magnitude. They had a 100x increase in how many researchers they can have working on the problem. And now this could produce GPT-6 in much less time, right? It could figure out the algorithmic improvements in much less time than if it was just humans. And then after that training run is completed, they could immediately upgrade all of those AI research assistants. […]

To get the type of speedups that I'm really concerned about, like a 100x speedup of OpenAI, you really need pretty autonomous AIs that aren't just marginally speeding up how fast a single coder can code their project—they're going off and coming up with and executing on and iterating within entirely new projects that can operate without much human oversight and without humans that are heavily in the loop. So I think there's a lot of things you can do to measure the capabilities from here to there, right?

Building an Emergency Respons

Thomas: I think we should be requiring [AI evaluations]. And I think we should also be creating a body that is looking and taking in the results of these capabilities evaluations, and then making decisions about whether AI labs can continue deploying or training models based on the results of these capabilities. And I think that this basically needs to happen before we get AI systems that are extremely capable.

Jakub: Okay, so a government office that approves or denies permits or some sort of consent to continue with deploying the model or maybe building it.

Thomas: That's right. Yeah. Of course, I think this is a pretty big ask, right? I think it's difficult to do this. And so a more minimal ask I that I quite like is just developing these model evaluations, checking how capable the models are and reporting them to the government. And then having the government build up emergency response capacity in case AI systems get out of control in a certain way. And what this emergency response office would be doing is they'd be processing the results of these evaluations. They'd be building new evaluations and receiving the results from the voluntary ones that labs do as well. And they would be planning and figuring out, you know, if next year, let's say we get an AI system that has this certain dangerous capability or this general capability, here's how we're going to respond. And also, you know, taking steps to brief the rest of government on that, because right now we've got all these government agencies that are, I think, rapidly trying to orient to AI, but no real centralized body that's trying to figure this out and coordinate an organized response.

Relevant Links

Safety Cases: How to Justify the Safety of Advanced AI Systems (Joshua Clymer, Nick Gabrieli, David Krueger, Thomas Larsen)
Model evaluation for extreme risks (Google DeepMind)
Preparedness Framework (OpenAI)
Responsible Scaling Policy (Anthropic)
We need a Science of Evals (Apollo Research)

Transcript

This transcript was generated safely by AI with human oversight. It may contain errors.

Thomas Larsen | 00:00.211

Measuring the capabilities, measuring the progress from where we are now to where we might get a few years from now or a few decades from now is extremely critical.

Jakub Kraus | 00:17.516

Welcome to the Center for AI Policy podcast, where we zoom into the strategic landscape of AI and unpack its implications for U.S. policy. I'm your host, Jakub Kraus, and today's guest is Thomas Larsen. Thomas is Director of Strategy and my colleague at the Center for AI Policy, which designs and advocates for legislation to reduce catastrophic risk from AI. In our conversation, we focus on evaluations, which are experimental tests to learn more about AI systems. Thomas has thought a lot about them, and I hope you enjoy. Thomas, thanks for coming on the show.

Thomas Larsen | 00:56.040

Thanks for having me, Jakub.

Jakub Kraus | 00:58.680

Yeah, I'm really excited to chat with you today about how we can measure dangerous AI capabilities and respond to them. And to kick things off, why... Did you become interested in this area? What led you to start working on it or find it important?

Thomas Larsen | 01:19.596

Yeah, well, so in high school, I was thinking about what career I should pursue, and I was reading all these books. And there was one really vivid moment I had where I was sitting in the... Yeah, just outside of a tent, I was camping in the woods and I was reading this book, Superintelligence, and I realized that AI was likely to be the defining technology of the 21st century and dramatically change the landscape of... Everything about modern society, I think, is not an exaggeration. And from that moment on, I was pretty laser-focused on wanting to make sure that the adaptation of AI technology generally goes well. And that we avoid the risks, and as we all want, avoid the risks and achieve all of the benefits. This was really before scaling. was a widely known thing. This is before language models. I think this is before transformers were invented. So, you know, people were using RNNs and LSTMs to do sequence modeling. And those were just dramatically less effective. And... performance, like state-of-the-art models, you know, we could do chess, we could do Go, we could do various specific narrow tasks where we had really good training data, but we really didn't see anything like the generalization capability that we see right now. So right now, language models in particular, they can do, I think what impresses me most is their ability to do in-context learning. They're not that good at this, but they can see a few examples of something that they weren't trained on, and then figure out, you know, within their lifetime without, you know, needing to go through more gradient descent, what they should be doing in order to achieve that task. And that seems like a really worrying preliminary capability that sees directly on the path towards generalized intelligence.

Jakub Kraus | 03:24.543

Yeah, just for listeners, there's these terms like few-shot learning or in-context learning, and usually the simplified way to think about it is you, in the prompt to ChatGPT or earlier versions of it, say an example of the task you wanted to do or a description. And then even if it wasn't ever trained on making special themed fantasy names for different... varieties of bananas, it can actually pick that up and do so better if you show that in the prompt. And what's especially interesting about it is that this wasn't really a thing in smaller models, from my understanding. But let's quickly go to... Okay, AI is growing pretty general, pretty capable. Maybe autonomy is something you'll talk about. But why do you think this is, you said this was dangerous. What could go wrong here? And how likely is that? What does it actually look like?

Thomas Larsen | 04:28.502

Well, so I mean, that's a really complicated question. Lots and lots of things could go wrong. I think that the central driver of the things that I am most worried about is what I said, general intelligence and AIs that can autonomously achieve real world tasks with a higher degree of competence than humans or equal to humans or higher. And I think it could even become much, much higher. And I think the right reference class for this is like the introduction of a second species onto the planet. I think Geoffrey Hinton used that metaphor and I really like it. It's like we're trying to create a second species that's smarter than us and put that on the planet. And if you're a society of monkeys and you're just figuring out how to build a human or like many, many humans and make many instances of that and are going to deploy them throughout your economy to do everything that they're useful to do, this is going to change how almost everything works, right? and it's also going to introduce loads and loads of risks in ways that are really hard to predict concretely because we can't predict exactly which technologies AIs will invent. We can't predict exactly how AI systems will go about transforming our economy, but I think it's pretty robustly supported that we will get very fast economic growth, we'll get increased centralization of power. And this is mostly because feedback loops are self-reinforcing. And once AI companies develop leads, those leads will allow them to make more money and then invest more into the next round of AI systems, use those AI systems to do better research, etc. And I think that this leads to... a few AI systems with much more power and control over the entire economy of the world than, you know, humans have. And if those AI systems goals are misaligned with humans, I think this ends really badly for humans. Ideally, my hope is that we can build AI systems that are aligned that, like, in fact, just want to do things that are good for humans and want to obey our commands. and as well as create institutions throughout society that can distribute the gains, the gains from AI productivity equitably throughout society so that everyone benefits and not just, you know, one country becomes the global superpower that dominates over everyone else. I'd hope that we can get global coordination here. Yeah, so that was a lot. And I think there's a lot to discuss in there. But that's sort of my overall picture of I think. where I see things going and what I think the bad route looks like and what I think the good route looks like.

Jakub Kraus | 07:25.888

Yeah. Yeah, I guess we're working on AI governance is sort of in the position of the monkeys wondering if they should govern the new humans coming out.

Thomas Larsen | 07:39.208

Yeah, yeah, exactly. And I mean, I very much think we should. And there's a whole bunch of detailed questions on, well, how exactly should that work, right?

Jakub Kraus | 07:46.395

Yeah. Yeah. And measuring what the systems can do, how good they are at it, seems important, but maybe there's even more sophisticated evaluations we can do. Can we? Do you see a way to measure alignment? Measure misalignment? How can we check if the system's doing what we intended it to do?

Thomas Larsen | 08:11.101

Yeah. So... So you alluded to a couple things there, which I think are extremely key to doing effective AI governance. So obviously right now, current systems are somewhat capable, but they don't have nearly the type of capabilities that could cause this type of explosive economic growth that I think will happen. So, as you said, measuring the capabilities, measuring the progress from where we are now to where we might get a few years from now or a few decades from now is extremely critical. And there are two types of evaluations, which I think you also mentioned. There's capabilities evaluations, which is “How capable are the systems? What can the system do if it was trying to do that?” And then there's alignment evaluations or safety evaluations, which are “What is the AI system in fact going to do?” Where, you know, you might imagine you have an AI system that is capable of, let's say, a classic example that people can throw around is create a bioweapon and deploy it widely, right? Maybe your AI system has that capability. If it's aligned, right, if it has, or if it has other adequate safety measures in place, then... it won't in fact do that thing, which is good. Which means we need to, you know, I think to do effective governance, we should do both of these things. We should be measuring the capabilities and checking when can it start doing these dangerous things. And then we also need the safety evaluations, which are “For a given dangerous thing that we're worried about our AI doing, are our safety techniques adequate for mitigating this?” And I think most work so far has been done in the capabilities evaluations part. And I think those are a lot more straightforward to do, right? Just seeing, you know, can my AI system in fact do this task X? But the safety evaluations are more complicated. I'm not seeing much work and I think are also really, really important. And so I'm quite interested in thinking through what safety evaluations should we build and how should we do that. And yeah, so that's sort of the overall picture. Actual safety evaluations that I'm interested in are a couple. So one baseline is just cybersecurity around the model weights. I think that's a very simple one. If you have control over the model weights, and where maybe I'll define control by they're running on your servers in ways that you know about.

Jakub Kraus | 10:52.367

And model weights are sort of the file that you can run the AI system from, but if you interpret it just as text, it will look like a bunch of numbers.

Thomas Larsen | 11:02.027

Yeah, that's right. So security on that, right, is obviously a critical component to a lot of safety stories. And so we can evaluate, right, we can do safety evaluations to check. How robust is our security measures that are keeping our weights from being exfiltrated? And there are two channels we need to work back.

Jakub Kraus | 11:25.238

Can you define exfiltrate?

Thomas Larsen | 11:27.198

Yeah, so when I say exfiltrate, I mean the weights being removed from our safe environment on the lab servers. I'm talking about them going from the servers that the AI lab is running to external servers, either that the AI itself has bought in RAM, or maybe a foreign government has, or some external group of hackers has, just some external hardware that is not under the oversight of whoever's building the AI system. Okay.

Jakub Kraus | 11:58.865

And I'm kind of curious what you're seeing is how that would happen, but maybe that's going to come up.

Thomas Larsen | 12:05.386

Yeah, so... Well, so there are lots of ways this could happen, right? I mean, it could be any of these external groups could try to do weight exfiltration by doing normal, like, a thing of the labs. So trying to, you know, remotely access their servers, and then just send the weight file off. It could also happen with the AI itself, right? The AI may have some output channels to the external world, and it could use those output channels to smuggle copies of itself outside the lab. And there's a particular worry if massive numbers of this AI are allowed to run fairly autonomously without much oversight.

Jakub Kraus | 12:43.834

Yeah, early on people would say, well, we'll never run AIs on the internet to do things autonomously. We'll have them locked up in a box. But I guess that's not really happening for some systems. At least ChatGPT has access to the internet now, and the autonomy part is not super there, because it might fumble if you give it a sufficiently sophisticated task, but that seems like something people will make a fair bit of money on and try to actually do.

Thomas Larsen | 13:18.142

Yeah, I mean, so right now the AI systems are pretty... they're just not very smart, and that's why it doesn't really matter that we've deployed them widely, given them internet access, right? Their systems just aren't very capable. They're not even that useful, right, economically. And so, you know, they definitely just don't pose much risk. This is not a reliable security assumption, though, right? Because it will stop being true, right? It will almost certainly stop being true that these systems are just passively safe due to their lack of capabilities. And so more stringent controls need to be put into place.

Jakub Kraus | 13:59.532

Yeah. Yeah. And so let's just, can you emphasize what is safety evaluation? Sounds nice. What exactly is it? You've been touching on it a lot. And then after that, maybe we can go into some sub questions from there, but can you summarize for the listener? What are these safety evaluations? Why are they important?

Thomas Larsen | 14:23.232

So a safety evaluation is some evaluation that checks whether an AI has some specific safety property. And those safety properties might be security properties around the model weights, or they might be alignment properties that refer to things that the AI system wants, or what actions or plans the AI will choose that refer to the goal that the AI system has. Or they could refer to any other factor about the AI system that relates to its safety.

Jakub Kraus | 14:59.891

Yeah. And what's the state of the art of these today?

Thomas Larsen | 15:07.506

Yeah, so very little work has been done on safety evaluations.

Jakub Kraus | 15:11.845

That seems like a mistake. Why would we have done no work on this?

Thomas Larsen | 15:16.785

Well, so it's extremely hard, I think, to do. to do good work on safety evaluations. And I think the big reason for that is the lack of interpretability that we have within these AI systems. So in particular, I said we want to, you know, so a particularly important type of safety evaluation is these alignment evaluations, right? These evaluations that check what's the goal of this AI system? What's it in fact trying to do? Because given the goal of a system, we can... pretty reliably bound which actions or plans it might actually take. Mostly people have been doing preliminary work, what I would call preliminary work, which sets us up in order to do safety value issues later. And what that preliminary work looks like often is stuff like interpretability research, where we try to figure out, you know, AI language models, they are these gigantic bundles of numbers, and we don't understand how those bundles of numbers correspond to, like which action the AI will take in certain environments. And in order to do good safety evaluation, you usually have to argue about the structure of the system, right? So for example, in nuclear, right? In order to argue that your nuclear power plant won't explode, you need to make all sorts of physical arguments based on your understanding of physics, based on where internally the chemicals are located within the plant. you need to argue that the structure, the safety structure that you've set up within the plant will necessarily mean that the reaction will be stable and not go over some bad safety threshold. We can't do that type of internal analysis of neural networks because we simply don't understand how they work, right? We're building these things, but we don't understand them. And that understanding is a key blocker to making good safety arguments. Instead, we have to rely on what's called behavioral analysis. We just put them in an environment and then we see what they do. And this is useful, right? We can put them in all sorts of environments and see what they do in all sorts of environments. But it's very different from understanding how they work because it's a lot weaker, right? There might be a new environment that we hadn't tried it on where the behavior is just very different.

Jakub Kraus | 17:34.848

And the overall situation is that we just don't have many safety evaluations, yeah? Yeah. And do we have capability evaluations?

Thomas Larsen | 17:45.928

Yeah, so we're starting to have capabilities evaluations. So some initial ones are a bunch of the labs, a bunch of the big AI labs like Anthropic and OpenAI have voluntarily committed to... to do some safety evaluations, or to do some capabilities evaluations, sorry. And those are stuff like checking for bio capabilities, checking for autonomous replication capabilities, right? So can your AI set up a new instance of itself on, let's say, AWS, which is like a remote cloud service? Amazon. Yeah, Amazon. And create... yeah, create new copies of itself. And then, you know, obviously that would be, could be an increasing process, right? Where the new copies themselves spin up new copies. And I think, yeah, so there's been a number of, a number of papers on this stuff. You know, another one let's say is situational awareness. So I think Owain Evans and his collaborators did some work on figuring out, you know, can the model distinguish between when it's in training, in testing, and in deployment?

Jakub Kraus | 18:59.384

And what about, maybe there's more capability evaluations you want to talk about, but specifically, can we test systems for their, right now, increasingly, AI systems are being used to, say, generate training data for another AI system, maybe a smaller one. So in that way, maybe contributing to AI research. There's also NVIDIA's top GPU or AI chip. The H100 had maybe 13,000 circuits in it that were designed by, I think, an AI system, maybe a reinforcement learning system. So AI is being used to help with AI research and development. How can we measure that trend? And what's the picture you see for AI contributing to research and development in AI?

Thomas Larsen | 19:53.907

Yeah, so first I want to zoom out for a second and I want to say, you know, why do we even care about AI doing AI research. And the main reason is roughly that this could become really, really fast. And in particular, it could scale beyond human-level intelligence. So imagine that... So right now, you have about 1,000 people working for OpenAI. I think a little less than that, but, you know, an order of magnitude of 1,000 people working at OpenAI. Suppose they train GPT-5, and now GPT-5 is as good at doing research generally as an OpenAI engineer. That means that they can very rapidly scale up to, let's say they can run 100,000 instances of GPT-5. That means that they've just scaled up by two orders of magnitude. They had a 100x increase in how many researchers they can have working on the problem. And now this could produce GPT-6 in much less time, right? It could figure out the algorithmic improvements in much less time than if it was just humans. And then after that training run is completed, they could immediately upgrade all of those AI research assistants. And this process could continue until they're much, much smarter and the AI systems are producing the vast majority of the cognitive labor that's going into making AI systems better. So how do we measure this? As you said, current AI systems are being used to start automating charts of research. However, the AI systems that are already being used are mostly pretty narrow, right? There are specific RL algorithms that can be used to optimize chip designs, you know, GitHub Copilot can marginally speed up individual coders, but it can't, you know, code autonomously, right? To get the type of speedups that I'm really concerned about, like a 100x speedup of OpenAI, you really need pretty autonomous AIs that aren't just marginally speeding up how fast a single coder can code their project—they're going off and coming up with and executing on and iterating within entirely new projects that can operate without much human oversight and without humans that are heavily in the loop. So I think there's a lot of things you can do to measure the capabilities from here to there, right? So one thing you can do is you can just measure how autonomously capable is the AI system. So right now, AI systems aren't very autonomously capable. And one of the reasons we, one of the ways we know that is because they can't execute on long horizon tasks. So one could just come up with a capabilities dataset that just looks like, here's a bunch of tasks that take a large number of steps to complete. For example, write some program that we think takes at least a thousand lines to complete. That's a pretty long task. It usually takes a human probably a couple days at least. Right? I... Once an AI starts doing that, I think I will be, you know, be a lot more concerned than I am right now, because AI systems seemingly can't do that at all, right? They probably can't even write, you know, 100 or a few hundred files of Python code, because they would just get confused, right? They can't really keep track of long things that don't really form great abstractions. There's a number of things missing from the reasoning, we could just create a data set. So that the long long coding task is one example, you could create all sorts of long horizon tasks, right tasks that take a large number of steps to complete. And then you could create a continuum, right, where you start with, you know, short horizon tasks and increase up to longer horizon tasks. And then you can measure growth. And then you can create trends that you can extrapolate and see, you know, how progress is going, whether it's continuous, whether it's predictable, and in that way, get much better knowledge than we currently have about, you know, when we might get AI systems that can perform those things that I'm concerned about.

Jakub Kraus | 24:41.623

Okay, so the vision is we, for example, have a task that simply cannot be done without a pretty sizable chunk of code being written. And maybe there are things like this where, I don't know, maybe set up a profitable business is a little too far. But do you have other concrete examples?

Thomas Larsen | 25:10.929

So another example I quite like is generating long form media. So for example, books and movies often rely on callbacks to the beginning of the book and foreshadowing at the beginning of the book or the movie to the end, right? It's not a very good book if each chapter is basically generated in isolation and then strung together.

Jakub Kraus | 25:37.948

Yeah, we need some foreshadowing.

Thomas Larsen | 25:39.968

Yeah, exactly. You need foreshadowing, but you also need, you know, character development to happen over time. You need to, like, sort of have some sort of plan, right? You don't want your book to end, like... Or you don't want your TV show to end like Lost. I didn't see it, but apparently it was very bad.

Jakub Kraus | 25:55.148

Ooh.

Thomas Larsen | 25:56.297

Right? You want to actually have a plan for what you're going to do with the characters, or else it ends in pretty catastrophic failure.

Jakub Kraus | 26:04.610

Like Lost?

Thomas Larsen | 26:05.992

Like Lost, exactly. So you want your AI... So, you know, we can test, you know, we can... It's hard to objectively evaluate how good a book is, but we can have some decent measurements of how good a book or a movie or some long-form piece of media is. And we can have our AI systems try to, you know, generate those as best as possible. And once they start doing, you know, books and movies that are as good as humans, or better than humans in some ways, especially in this long form aspect, where it's not just that they're writing individual paragraphs or chapters, it's they're writing, you know, entire books or entire movies or entire TV shows that are even longer than movies. This suggests that they're really tracking, you know, they're doing long term planning, they're doing they're like, you know, updating over time, they're figuring things out. Yeah, I think that's another good indicator.

Jakub Kraus | 27:02.108

Yeah, I would love to see someone test it out. So I want to go into you talking about the labs, the companies like Anthropic and OpenAI, who put out some ways to respond to dangerous capability evaluations or safety evaluations, if they have them by then. But what about for quickly for policy, if you could wave a magic wand, is there a particular policy you want Congress to pass relating to evaluations? Should we be requiring them? Something else? Funding them?

Thomas Larsen | 27:40.163

So I think we should be doing both of those things. And more, I think we should be requiring them. And I think we should also be creating a body that is looking and taking in the results of these capabilities evaluations, and then making decisions about whether AI labs can continue deploying or training models based on the results of these capabilities. And I think that this basically needs to happen before we get AI systems that are extremely capable.

Jakub Kraus | 28:15.501

Okay, so a government office that approves or denies permits or some sort of consent to continue with deploying the model or maybe building it.

Thomas Larsen | 28:26.601

That's right. Yeah. Of course, I think this is a pretty big ask, right? I think it's difficult to do this. And so a more minimal ask I that I quite like is just developing these model evaluations, checking how capable the models are and reporting them to the government. And then having the government build up emergency response capacity in case AI systems get out of control in a certain way. And what this emergency response office would be doing is they'd be processing the results of these evaluations. They'd be building new evaluations and receiving the results from the voluntary ones that labs do as well. And they would be planning and figuring out, you know, if next year, let's say we get an AI system that has this certain dangerous capability or this general capability, here's how we're going to respond. And also, you know, taking steps to brief the rest of government on that, because right now we've got all these government agencies that are, I think, rapidly trying to orient to AI, but no real centralized body that's trying to figure this out and coordinate an organized response.

Jakub Kraus | 29:42.498

Okay, so sort of someone who's going to make crisis management plans so that we can execute them instead of rushing to figure out what to do if we have time. And now if we fail to get things done, we've got these labs who are trying to be proactive. OpenAI recently put out their preparedness framework, preparing for catastrophic risks, and it's resembling another document from their competitor, Anthropic, who has a responsible scaling policy. Both of them sort of outline when the AI gets this capable, we're going to do this to make it safe. And what do you think of these documents? Let's start with Anthropic.

Thomas Larsen | 30:24.769

Yeah, so Anthropic released this thing called AI safety levels in their responsible scaling policy. And overall, I generally like what they've done so far. The big problem I have with it is that the AI safety levels that they've defined only go up to a certain level, right? They only go up to ASL 3, which corresponds to models that can amplify misuse risk, right? So they can help individual humans, you know, let's say, you know, the classic example is build a bioweapon or improve their hacking or whatever. But they don't get into autonomous AI capabilities. And they say, you know, we're going to leave that to ASL 4 and up. And I think since what I'm most worried about is autonomous AI capabilities doing dangerous tasks, mostly without humans in the loop, I really care that this isn't, you know, isn't put in place. And of course, Anthropic... is working on this as we speak and trying to get those commitments out. But as of now, you know, they don't exist. And I think it's really, really important that we get really strong, really strong commitments that, yeah, that would do things like, you know, have the labs commit to stop scaling once we get... sufficiently capable models, right? Like, let's say we get models that are sufficiently capable that they can self-exfiltrate from the labs, given our security measures. I think that should be a condition upon which the labs stop building more powerful models.

Jakub Kraus | 32:05.524

Yep. And OpenAI, I think, has something like this where they say we will stop development under certain conditions. Is that enough?

Thomas Larsen | 32:16.351

Yeah, so... I think if their conditions are sufficiently expansive, yes, that's enough. Right? I think that the OpenAI Preparedness Framework, as of right now, is too vague to evaluate whether this is sufficient. And it relies a lot on sort of corporate governance structures where they've got, you know, they've got their board, they've got their leadership team, they've got their safety team, they've got the preparedness team. And it relies a lot on each of these actors sort of making the right decisions as we get AI systems that are more capable.

Jakub Kraus | 32:50.619

So distilling it down, what is the, how, if you were Sam Altman, which I don't know, new board. What would you do if you were in charge of OpenAI? Would you modify, in terms of their preparedness framework, what would be your main modification?

Thomas Larsen | 33:09.012

Yeah, so my main modification would be, or so as an initial document, I think this is fine. My main modification would be to work to create more concrete commitments that look more like the thing I was talking about, which is... you have specific capabilities evaluations, and then you've got specific commitments that look like once this capability fires, we will pause until we have at least this good safety put into place. And here's how we're going to evaluate whether our safety is good.

Jakub Kraus | 33:42.168

More specific? Is that basically-

Thomas Larsen | 33:45.024

So more specific would be great, yes. Okay. Another great, I think, change might be increased ability to do external feedback. So suppose OpenAI is able to build an AI system that's just much more competent than all of the rest of the world's AI systems and much more competent than humans. And so they have, you know, a strong lead over the rest of the world and can, you know. has like tremendous, let's say, ability to, you know, transform the economy, whatnot, right? I think that in that case, and let's say their safety concerns, right? Let's say their safety team isn't sure whether to go ahead or not. I think in that case, they should have commitments to have external reviewers, right? Not just the internal lab safety team. They should have an external board of AI safety experts that go through their safety solution and evaluate, you know, is this actually safe or not? And I think this, I think having that external board creates a good counter pressure to all of the internal forces and internal profit motives that, you know, force... or at least strongly push OpenAI to continue scaling and continue making their systems more capable and therefore more dangerous.

Jakub Kraus | 35:10.104

Yeah, yeah, basically there's a giant conflict of interest. It's you're evaluating your own product for safety.

Thomas Larsen | 35:18.410

That's right, yeah.

Jakub Kraus | 35:19.638

So running up on time, but the one last thing I wanted to mention is just how these OpenAI and Anthropic are, they're kind of the only ones, right? Are there other AI companies with 100 million plus dollar models that have preparedness frameworks?

Thomas Larsen | 35:39.048

Yeah, so a lot of the big companies, so there were voluntary commitments secured by, I think, both the Biden administration and the UK AI Safety Summit, I think. They've got safety commitments from Meta, Amazon, Google DeepMind, and I think likely a couple others that I'm missing, as well as OpenAI and Anthropic. But I think the other frameworks were less specific and even weaker than the OpenAI and Anthropic ones. And so don't count for much in my book.

Jakub Kraus | 36:10.721

Okay. Well, thanks so much, Thomas, for telling us about AI capabilities, safety evaluations, capabilities evaluations, preparedness frameworks. I really enjoyed it. And thanks so much for coming on again.

Thomas Larsen | 36:27.058

Yeah, thanks, Jakub. Thanks for having me.

Jakub Kraus | 36:31.250

Thanks for listening to the show. You can check out the Center for AI Policy Substack for a transcript, links and more. And if you have feedback, I'd love to hear from you. You can reach me at jakub at AI policy dot US. And looking ahead, next episode will feature Mark Beall discussing the risks AI poses to US national security. I hope to see you there.

A guest post by

Jakub Kraus

Tarbell Fellow writing about AI

Center for AI Policy Podcast

Discussion about this post