#10: Stephen Casper on Technical and Sociotechnical AI Safety Research
Interpretability, robustness, audits, RLHF, incentives and more
Stephen Casper, a computer science PhD student at MIT, joined the podcast to discuss AI interpretability, red-teaming and robustness, evaluations and audits, reinforcement learning from human feedback (RLHF), Goodhart’s law, and more.
Available on YouTube, Apple Podcasts, Spotify, or any other podcast platform.
Our music is by Micah Rubin (Producer) and John Lisi (Composer).
Relevant Links
Eight Strategies for Tackling the Hard Part of the Alignment Problem (Stephen Casper)
Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks (Tilman Räuker et al.)
The Engineer’s Interpretability Sequence (Stephen Casper)
A.I.’s Black Boxes Just Got a Little Less Mysterious (The New York Times)
Explore, Establish, Exploit: Red Teaming Language Models from Scratch (Stephen Casper et al.)
Robust Feature-Level Adversaries are Interpretability Tools (Stephen Casper et al.)
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback (Stephen Casper et al.)
Black-Box Access is Insufficient for Rigorous AI Audits (Stephen Casper et al.)
Goodhart’s law (Wikipedia)
Transcript
This transcript was generated by AI with human oversight. It may contain errors.
(Cold Open) Stephen Casper | 00:00.644
Who are the AI auditors? What systems do they audit? Who hires the auditors? Who pays them? What incentives do they have? Who decides which systems get audited? Who decides what problems are fair game for auditors to look for?
Jakub Kraus | 00:22.818
Welcome to the Center for AI Policy podcast, where we zoom into the strategic landscape of AI and unpack its implications for U.S. policy. I'm your host, Jakub Kraus, and today's guest is Stephen Casper, also known as Cas. Cas is a computer science PhD student at MIT working on research in technical and socio-technical AI safety. We discuss topics like interpretability and explainability in AI, red teaming and robustness, evaluations and audits, reinforcement learning from human feedback, Goodhart's law, and more. I hope you enjoy.
Cas, thank you for coming on the show.
Stephen Casper | 01:11.735
Yeah, good to be with you. Thanks.
Jakub Kraus | 01:13.956
So in your master's thesis, the introduction begins with a section titled "Test Sets Are Not Enough." Researchers commonly evaluate AI systems based on their performance on different testing data sets. But you wrote that, quote, a black box performing well on a test set does not imply that the learned solution is adequate.
And you've also written about how problems with AI systems can be divided into two categories. One is failures that developers might encounter during the normal process of building the system. And the second is failures that are a lot easier to miss during the default development process. And the less observable or unobservable failures are harder to get feedback on. So can you walk through what are some of these less observable failures, and how might normal AI development processes miss them?
Stephen Casper | 02:16.171
Yeah, thanks for asking. So I sometimes describe the difference here as being the difference between what can be thought of as the easy part of the AI alignment problem versus the hard part of the AI alignment problem, right? There are going to be certain classes of failures that we can just stumble into, or we can just find, right? And these are machine learning problems to solve. These are things that test sets can detect. These are things that adversarial examples can find. These are things that red teams can cause to surface, right?
And when we discover problems like this, it allows us to iterate on solutions to them. It allows us to adversarially train against them. And making sure that we're protected from these things is just, you know, a normal machine learning problem, right?
But just contrast that with, you know, what we described as these like unobservable types of problems. These are just things that are kind of harder to find. And they could be harder to find for two reasons, right? One thing that can make a problem hard to find is that it's really hard to come across examples that elicit the problem. Anomalous examples, certain types of adversarial examples, or issues that might involve, you know, deceptive alignment or things like this. These are all conceptually types of problems that just a system could have, but it could be really, really hard for us to find examples that elicit them. And for this reason, they can be really hard to address.
Another type of hard to find or like unobservable problem, though, can emerge not when you can't find examples that elicit it, but when you just don't notice it. Certain types of subtle biases or subtle problems with machine learning systems kind of also fall into this category. Even if you have examples that elicit a type of bias, it might take a very fine-grained analysis or it might take a very comprehensive and socially nuanced understanding of the model to really characterize them or try to figure out how to frame the problem. So that's another problem as well.
So overall, when we look at the difference between, you know, observable failure modes and unobservable failure modes, and when we get more fine-grained and look at the difference between unfindable unobservable failure modes and just easy-to-miss unobservable failure modes, I think the lesson we need to take when we think about AI alignment is that standard machine learning techniques are just not going to be the solution to a lot of the problems that we're very worried about.
And the fact that there are these problems that standard machine learning techniques like adversarial training are not well equipped to address is kind of what motivates lots of research agendas like anomaly detection, latent adversarial training, mechanistic interpretability, white box evaluations of models, right? There's a lot of stuff that people in the AI safety community are really excited about right now. And that excitement largely comes from different methods kind of being geared toward solving these non-standard machine learning problems of hard to find failures.
Jakub Kraus | 05:25.216
And what are your definitions, for people who are less familiar with some of these concepts, like alignment or deceptive alignment, for example?
Stephen Casper | 05:38.081
Yeah, I think that's a really good question because alignment just means a lot of things in a lot of different contexts, right? When I use alignment, I usually am using it to refer to the goal of getting an AI system to act in accordance with the goals and interests of its user, right? So I explicitly don't usually try to use alignment to refer to, like, aligning an AI system with society as a whole, for example, but others will use that definition. So that's what I kind of mean by alignment. It's the problem of making the AI system do what the user wants it to.
When I talk about, um... oh, sorry, what was the other question? Deceptive alignment?
Jakub Kraus | 06:15.432
Right.
Stephen Casper | 06:16.272
Yeah, when I talk about deceptive alignment, this is another term that'll have subtle differences depending on where you hear it. But when I think about deceptive alignment, I guess I'm just thinking of a broad category of failures in which an AI system might be, through some means, kind of tricking its overseers into thinking that it's more safe or more aligned with them than it really is. Deceptive alignment failures are characterized by the potential for a system to, you know, once it finds out it's in deployment, or, you know, once it goes off to do its job in the real world, suddenly behave a bit differently than it did under evaluation settings, right? And that's kind of why deceptive alignment failures are one of these difficult-to-spot failures. These are, by definition, issues in which an AI system is going to behave differently in development than it will in deployment.
Jakub Kraus | 07:12.276
So one way people try to address the less observable failures is using AI interpretability techniques. This is a research area trying to understand what's happening inside AI models, not just giving them text inputs and seeing what text they output, but studying the actual math happening inside the black box. And you've written a literature review of interpretability research focused on deep neural networks where you defined an interpretability method as, quote, any process by which an AI system's computations can be characterized in human understandable terms. And you've also co-authored several papers advancing new forms of interpretability research and have written extensively about the need to bridge the gap between interpretability research and actual applications of it in AI engineering projects. So how would you describe the overall goals of interpretability research? And how effective do you think the field has been so far at advancing towards those goals?
Stephen Casper | 08:25.999
Yeah, thanks for the question. So I don't even usually think of myself as an interpretability researcher. But like you mentioned, I keep just coming back to the topic over and over again over the years. Because it's too interesting, right? I can't stay away apparently. And like you mentioned, I've previously tried to define interpretability as like any process by which an AI system's computations can be characterized to a human. And this has made sense to me and I kind of gravitate toward a very general type of definition.
But sometime after giving that definition, I found myself increasingly dissatisfied with any particular way of thinking about interpretability work. I was kind of confused for a while about what we should really be doing with this whole interpretability thing, which led to a change of opinion of mine about a year and a half ago. Instead of caring about any particular definition of interpretability or trying to define it in an increasingly broad way, I kind of stopped caring about it and tried to just start thinking about "interpretability techniques," in quotation marks, as things that we hope will be useful to engineers who care about safety and beneficial AI in society, right?
So I try to make a point lately of abjuring almost any actual definition of interpretability in favor of just the principle that, you know, anything traditionally thought of as interpretability or not, that helps an engineer accomplish their goal is kind of fair game or something that we should, you know, care a lot about or think is important, right? And in that sense, you know, interpreting a model is part of a very similar process to just, you know, running a test set through the model. Both are really fair game when it comes to learning something important about it in order to help you accomplish your goals with it.
So this led to some work of mine from about a year and a half ago that I titled The Engineer's Interpretability Sequence, which was a sequence of posts I wrote about how interpretability research doesn't always seem to produce techniques that are very useful to engineers, and what directions forward might make interpretability more useful to them.
So what does interpretability need to do to make us safer in the real world? How good has it been at advancing toward this goal? At least recently, I've been thinking it's clarifying to think of interpretability as having four different tiers, or four different things you can do with interpretability tools, or standards of evidence for interpretability tools being good, right? And these standards, I call them the hypothesis standard, the science standard, the engineering standard, and the safety standard. And they go in increasing order of their relevance and their difficulty.
If an interpretability tool kind of meets the hypothesis standard, that means it's good for suggesting to humans that the model might work a certain way, right? And you see lots of kinds of traditional interpretability research from the past, you know, decade and beyond that focuses on this standard. For example, if you visualize a neuron in an image processing network and the neuron looks like a dog and you're like, oh yeah, that's the dog neuron, you know, this is kind of, you haven't accomplished anything yet. It's important to be clear about that. But it is kind of useful to be able to have an interpretability tool that helps give humans ideas or give humans things to kind of look more into.
The next standard is what I call science, where you make a testable prediction and then show that it hopefully validates. So if I look at a dog neuron and then I'm like, OK, I predict that this neuron is going to respond a lot when I put out of distribution dog images through it. And suppose it does. Right. And that's great. Like something more rigorous has been done other than like kind of coming up with a hypothesis. Science has been done. A prediction has been made and validated.
That third standard, the engineering standard, is where you have to show that the interpretability tool is useful for accomplishing some type of real world task that an engineer might be interested in. If we showed that the neuron in the network that seems to be a dog neuron is a neuron that could be manipulated to incisively change the model's behavior when it processes dog images, you know, then we've met this engineering standard, right?
And then there's this fourth and final standard of safety, which emphasizes that just because something's useful to an engineer doesn't mean it makes anyone safer in the real world, right? So showing that something convincingly is going to make us safer in the real world probably needs to be connected to some sort of real-world application, or some sort of argument that a technique is much better for defense than offense.
And I think historically, the interpretability field has been really, really good at analyzing methods and holding them to a standard where they help produce useful hypotheses, right? And I think that historically, the interpretability field has also been okay, and is increasingly getting better, at holding methods to a standard of being useful for basic science, for making hypotheses and testable predictions and validating them. But I think the next two frontiers that the interpretability field is going to really need to grapple with are showing that interpretability tools are useful for engineers and then showing that interpretability tools are uniquely useful for safety. And I think now is a pretty exciting time in interpretability research because we might be kind of on the cusp of some interpretability tools that are competitive and useful in real-world applications for engineering tasks with neural networks.
Jakub Kraus | 14:29.146
And what do you mean by competitive? What's an example of a tool that is not competitive? And do you have an example of an interpretability tool that is competitive, in the sense you're using that term?
Stephen Casper | 14:45.057
Yeah, by competitive, I mean that it's better than some sort of baseline, or better or more appealing in some way, for some use cases, than an alternative approach that might be simpler. For example, maybe you could use some interpretability techniques to understand pretty well what's going on inside of a network, right? And you could maybe play with the weights or the neurons in order to change the model's behavior. And you could have a really academically interesting model editing technique that uses an interpretability tool, right? But is this competitive with just training the model on different data? Or is this competitive with just fine-tuning the model differently? Or is this competitive with any number of representation editing or model editing approaches that have been used in recent years? That's a different question, right? And it won't be time to declare victory from the engineer's standpoint until interpretability tools are shown to not just be able to perform a task, but able to perform a task better than the methods we would turn to otherwise.
Jakub Kraus | 15:53.082
And one case study is Anthropic's recent interpretability work. So in May, they published a new approach and got coverage in the New York Times for it. The approach is called sparse autoencoders. It essentially extracts a small slice of an AI system's internal activity and transforms that activity into a higher-dimensional, bigger mathematical space in a way that incentivizes it to be legible and easier to understand. And with their publication, they used sparse autoencoders to change their Claude AI chatbot so that it would focus intensively on the Golden Gate Bridge. And people thought this was pretty funny. They got some press out of it.
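For readers who want the mechanics in concrete form, here is a minimal sparse autoencoder sketch in Python. The layer sizes, ReLU encoder, and L1 sparsity penalty are illustrative assumptions rather than Anthropic's exact configuration.

```python
# Minimal sparse autoencoder sketch (illustrative sizes and penalty weight).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=512, d_features=4096):
        super().__init__()
        # Encode a captured activation into a larger, sparse feature space.
        self.encoder = nn.Linear(d_model, d_features)
        # Decode the sparse features back into the original activation space.
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))  # sparse, non-negative features
        reconstruction = self.decoder(features)
        return features, reconstruction

def sae_loss(activations, features, reconstruction, l1_coeff=1e-3):
    # Reconstruction error keeps the features faithful to the model's activation;
    # the L1 penalty pushes most features toward zero, encouraging legibility.
    recon_loss = (reconstruction - activations).pow(2).mean()
    sparsity_loss = features.abs().mean()
    return recon_loss + l1_coeff * sparsity_loss

# Usage sketch: 'acts' stands in for activations captured from one layer of a language model.
acts = torch.randn(8, 512)
sae = SparseAutoencoder()
feats, recon = sae(acts)
loss = sae_loss(acts, feats, recon)
loss.backward()
```

The sparsity penalty is what pushes each activation to be explained by only a few features at a time, which is what makes the expanded space easier to inspect.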
Now you made predictions before the paper came out about what it would and wouldn't accomplish, and you were fairly accurate. And then after it came out, you expressed concern that Anthropic might be overselling its progress. So how big of an advance was Anthropic's paper, and what remains to be done in this line of interpretability research?
Stephen Casper | 17:18.356
Yeah, I think that was a good description of what sparse autoencoders are doing, by the way. So thanks. And my short answer, and I'll give the much longer answer in a second, but my short answer is that I think there's a lot of exciting stuff going on with sparse autoencoders right now. And I'm really excited to see what happens next. But I think it's going to be really, really important with what happens next to focus a lot on, you know, meeting the engineering standard and the safety standard.
So what's the long answer, right? Anthropic has been working on this sparse autoencoder agenda for a while, and they have a lot of kind of soft power in the AI safety research space for kind of agenda setting and steering conversations. And as a result, you know, lots of people have been really excited about this sparse autoencoder work for a while. And it makes a lot of sense, right? There are difficulties when you try to interpret the model's actual embeddings. And there are problems that we suspect it has. And maybe if we just make the embedding space larger, it'll be easier to, you know, pick out specific features or isolate specific things that are related to particular behaviors in the network, right? So sparse autoencoders are kind of appealing from this perspective if they work very well.
The sparse autoencoder paper focused mostly on, you know, producing hypotheses and doing easy experiments to kind of show that testable predictions validate. And that's really, really useful, right? And I think there's a lot of promise that comes from this sparse autoencoder paper. But nothing yet was done to kind of show that sparse autoencoders are a useful technique for accomplishing useful tasks, right? And this is what we need to wait on.
I'm pretty excited about what might happen next, but I'm also a little bit worried about the way that this was presented, because Anthropic has, you know, a publicity team or a PR team that their communications kind of have to go through, and that team might be a little bit more focused on the commercial side of Anthropic's success than the safety side of Anthropic's success. And I think maybe that kind of played a role in how this paper was presented. I think some safety washing happened, where even though no useful contributions to safety were made in this paper, it's best viewed as an engineering report about how they scaled up some stuff.
You know, there was a lot of excitement being drummed up about implications for safety, as if the check had already been cashed, which it hasn't. And I'm not super worried about specific claims. I don't want to be a nitpicker, and I have tried not to be. But I am worried about a few empirical examples in the past that seem to have resulted directly from the way that Anthropic has publicized their interpretability work.
There have been some viral social media threads and comments made by non-technical people citing Anthropic's press releases that suggest that much more progress on interpretability is being made than actually has been. There was a document circulated within the UK government, a report that seems to have been alluding to Anthropic's work, when it made an overclaim that the interpretability problem was, you know, on its way to being solved. And I think stuff like this is a good example of how hype, and sometimes overhype, even if it doesn't confuse technical people, can have some pretty adverse effects by confusing non-technical people about how much progress has really been made.
Overall, I think that there has been some safety washing, and I hope there isn't a bunch more in the future. So ultimately, my advice to a policymaker here is to be really critical of what you might hear about mechanistic interpretability in the next little while. And to basically completely ignore the field as a whole until or unless people start using it to modify models better than existing techniques like fine-tuning can, or to red team models better than existing techniques like adversarial attacks can. Until we see this, you know, interpretability is just a really big field that people are working on and have a lot of excitement about for a lot of good reasons. But it's still not something that is going to make us safer in the real world, at least yet.
Jakub Kraus | 21:48.578
And let's talk a little bit about adversarial AI, adversarial machine learning. It studies different attacks that deliberately exploit or find or expose vulnerabilities of AI systems. So one classic example is an AI system might see a picture of a panda bear and call it a panda. But then if an attacker makes a pretty minor, carefully chosen adjustment to the panda bear image, the AI system will call it an orangutan. And another example is these trojans, they're called, or backdoors, which are instances of an attacker manipulating an AI model so that it only fails on a very particular kind of triggering input. One example is Anthropic trained AI models that will write unsafe or buggy computer code if the prompt tells the model that the current year is 2024.
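To make the panda example concrete, here is a minimal sketch of a fast-gradient-sign-style attack in Python. The pretrained model choice, the epsilon value, and the random stand-in image are assumptions for illustration only.

```python
# Minimal sketch of a gradient-based adversarial perturbation (FGSM-style):
# a small, carefully chosen change to the input can flip the classifier's label.
import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

def fgsm_attack(image, true_label, epsilon=0.01):
    image = image.clone().requires_grad_(True)
    loss = F.cross_entropy(model(image), true_label)
    loss.backward()
    # Nudge each pixel slightly in the direction that increases the loss.
    perturbed = image + epsilon * image.grad.sign()
    return perturbed.detach().clamp(0, 1)

# Usage sketch: a correctly classified image plus an imperceptible perturbation
# can produce a confident wrong label.
image = torch.rand(1, 3, 224, 224)   # stand-in for a real preprocessed panda photo
label = torch.tensor([388])          # ImageNet class index for "giant panda"
adversarial = fgsm_attack(image, label)
print(model(adversarial).argmax(dim=1))
```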
And then one other concept here is a red team, which we've talked about a little. So this often can mean a benevolent group that tries to be an attacker to expose safety vulnerabilities and ultimately enhance safety of the system once the errors are fixed. And it can be a team of humans, or it could be a group of automated tools that try to serve this purpose.
So you've written a lot of papers on adversarial attacks and red teaming, and you've found interesting overlaps between this area and interpretable AI research. So what kinds of adversarial machine learning research do you see as most promising for improving AI safety? And why do you see it that way?
Stephen Casper | 23:57.961
Yeah, and I really like this question. And like I mentioned earlier, I don't usually consider myself an interpretability researcher, but I do usually consider myself a researcher who focuses a lot on attacks and robustness. I think it's cool to remind myself every once in a while, just like how weird lots of adversarial attacks really are. You know, there are small perturbations or small modifications, sometimes to inputs, that are sometimes imperceptible, sometimes funny or gibberish looking, you know, just things that you wouldn't expect, things that you don't think should have an effect on the way a network behaves, but they can have like almost arbitrary or just really drastic, really important effects on how they behave.
And obviously, this has a lot of implications for making AI systems function reliably, and specifically making them function more safely. Because, you know, vulnerabilities that come from adversaries and anomalies like this are part of the hard part of the alignment problem, like I talked about. They're pretty important things to avoid. So adversarial machine learning research is a really, really broad and big set of research topics and fields, right? And it's probably one of the only things that can rival the field of interpreting AI systems in how rich and well-researched it is.
And as you might expect, you know, between these two clusters of research fields, there's a lot of overlap. And I've been particularly interested in the past few years in the overlaps and the intersection of adversarial attacks and interpretability, because I think that people with both types of goals can benefit from working with people who research the other.
So right now, you know, there's been over a decade of really interesting research on adversarial attacks and defenses. Same with interpretability too, but where do we stand? What's the most useful? Usually when I ask myself that kind of question, I'm thinking about the foundation models or the biggest large language models, the most intelligent AI systems to date. And there's an arms race going on with attacks and defenses in these models as has gone on in different adversarial sub-literatures in the past, right? Um, teams keep finding new ways to make them fail and developers keep finding new ways to patch old attacks. And from this arms race, we're learning a lot of stuff.
And one of the lessons that we're learning is that large language models have a lot of these really weird failures, right? Pretty concerning ones, too. There are lots of things that these language models are specifically fine-tuned not to do that they can suddenly, you know, do again once they're attacked, right? Behaviors that we tried to get rid of, but that can persist, or capabilities that can persist nonetheless and resurface when the model is attacked. And I think this is a really big ongoing challenge: figuring out what we can do to neural networks in order to actually get rid of bad capabilities, instead of just behaviorally suppressing them when we fine-tune.
But I want to stress one point, because you asked me about what is important with adversarial machine learning research. And I'm actually a bit ambivalent about how valuable making more technical progress on some of this stuff might be. I'm a really big fan of avoiding more failure modes. But I also do have worries about some of the capabilities that could come from, you know, just getting good at arbitrarily making systems robust to arbitrary failures, right? Because this kind of stuff contributes to speeding up AI capabilities progress. And as long as capabilities progress is safe and socially beneficial, you know, I'm a really big fan of this, but I really like to emphasize the point that capabilities progress should only proceed at the pace that our society and governance are able to handle, right? So from that perspective, I'm not always so enthusiastic, and it makes me kind of ambivalent about, you know, more and more technical problems being solved within scaling labs until we have governance structures to handle the results.
The thing that I am more excited about with adversarial research is evaluations. Everyone's been pretty excited about AI governance for a while. And one of the things that everyone seems to agree on is that we should have more evaluations of AI systems, and they should be good and meaningful evaluations, right? And there are a lot of open technical and political problems involved in doing good AI evaluations. And as someone who cares a lot about this, and as someone who works a lot on robustness, something I've found myself increasingly working on in the past year or so has been the technical side, or the tooling side, of doing good evaluations of AI systems' risks.
Jakub Kraus | 29:00.762
Interesting. And can you talk a little more about what that kind of work looks like? How do you make evaluations better?
Stephen Casper | 29:11.279
Yeah, so I think there's a political side of this problem and there's a technical side of this problem, right? The political one involves answering a bunch of somewhat challenging questions and iterating on policy proposals and policies themselves, right? There are a bunch of different things that need to be answered, like who are the AI auditors? What systems do they audit? Who hires the auditors? Who pays them? What incentives do they have? Who decides which systems get audited? Who decides what problems are fair game for auditors to look for? And who decides whether the system can't be deployed after looking at the result of an audit? There are all these really important questions to answer on the governance side. I think a little bit about all of these. And something I've been interested in lately is how different agencies at the national and international level who regulate different technologies answer each of these questions.
On the technical side, you know, there's this problem of like, how do outsiders meaningfully evaluate AI systems in order to figure out what problems they might have? And there are questions involving like, not just doing good science, but like having the access needed to do good science, right? Because there are different things that you can do to assess an AI system if you have query or black box access to it versus if you have full access or white box access to it. And, you know, there's a whole spectrum of gray in between, right? So figuring out how auditors can do the things they need to do with sufficient forms of access is a problem I think about a lot.
Something else that I think about on the technical side relates to a problem I mentioned earlier, where AI systems sometimes can retain these harmful latent capabilities that rarely surface, but when they do surface, it can be problematic. And given that AI systems are so frustratingly good at retaining bad capabilities that we try to retrain out of them, there's a really interesting technical problem to be worked on involving, like, how do we tell when an AI system has capabilities that are hard to elicit? And I've been thinking a lot about the topic of generalized adversarial attacks as an evaluation technique that we could use to maybe answer this type of question.
A generalized adversarial attack just refers to an adversarial attack, but the adversary gets to be extra sneaky or like gets to do some extra mischief. So in a normal adversarial attack, the goal is to come up with an input that makes the system fail, that elicits some sort of bad behavior from the system. The goal of a generalized adversarial attack is to do that, but where you get to do something else. Maybe you can fine tune the model a little bit. Maybe you can mess with the model's internal states. Or maybe you can look for information hidden in the model's representations, even if the model doesn't output that information, and so on and so on.
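As a rough sketch of one such extra affordance, here is what an attack on a model's internal activations, rather than its inputs, might look like in Python. The toy model, the layer choice, and the hyperparameters are assumptions for illustration.

```python
# Minimal sketch of one "generalized" affordance: optimizing a perturbation to a
# model's hidden activations to push it toward a target behavior.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))

def latent_attack(x, target_class=1, layer=0, epsilon=0.5, steps=20, lr=0.1):
    with torch.no_grad():
        hidden = model[: layer + 1](x)          # activations at the chosen layer
    delta = torch.zeros_like(hidden, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        logits = model[layer + 1 :](hidden + delta)
        loss = -logits[:, target_class].mean()  # push toward the target behavior
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-epsilon, epsilon)     # keep the latent perturbation small
    return delta.detach()

x = torch.randn(4, 16)
delta = latent_attack(x)
```

If a small latent perturbation reliably elicits a behavior that ordinary inputs never do, that is evidence the capability is still latent in the model.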
I think something nice about generalized adversarial attacks is that, because they're stronger than regular adversarial attacks, because the adversary has more affordances, they can help us make more conservative estimates about safety, right? This is one way of getting at the same problem we were talking about earlier, the hard part of the alignment problem: figuring out what types of risks a system might have even if they're hard to elicit. And I think generalized adversarial attacks are a really good way to go about this from a technical approach, and something that I hope can be a good solution to put in the toolboxes of real-world auditors.
Jakub Kraus | 33:00.650
Interesting. And you were mentioning a bit about this issue of access, so I wanted to dig into that a little bit more, because it's really relevant for policy. You wrote this paper where you showed how different degrees of access to the AI model itself, or the information about it, or some resources related to it, can make audits stronger. The paper was called Black-Box Access is Insufficient for Rigorous AI Audits.
Black box access is what everyone gets with ChatGPT, where you can send an input and get an output, but you can't see inside of it or see the parameters that are operating it. And then outside-the-box access is if you have some context about how the system was made, like the training data, the code that's running it, how it was created, how it was deployed. And then white box access is having the system itself, especially the model weights, the mathematical parameters. And you mentioned there are shades of gray between all these.
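A tiny sketch of the difference in Python, with a toy model standing in for a real system; the function names are hypothetical and just illustrate what each access level lets an auditor compute.

```python
# Contrasting audit affordances at different access levels (toy model).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))

def black_box_probe(x):
    # Black box: the auditor only sees outputs for chosen inputs.
    with torch.no_grad():
        return model(x).softmax(dim=-1)

def white_box_probe(x):
    # White box: the auditor can read weights, activations, and gradients,
    # e.g. to see which input features drive a decision.
    x = x.clone().requires_grad_(True)
    score = model(x)[:, 1].sum()
    score.backward()
    return x.grad  # gradient-based attribution requires internal access

x = torch.randn(3, 10)
print(black_box_probe(x))
print(white_box_probe(x))
```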
So right now there are no laws that require AI models to be audited. There are some voluntary arrangements. The recent Claude 3.5 Sonnet model was tested before deployment, and the Commerce Secretary, Gina Raimondo, reported that the U.S. AI Safety Institute will also be testing new advanced models before deployment, but it's not clear how much access they'll have. So what are you seeing as the most important advantages of giving the auditors white box, or outside the box, or generally increased access to AI models?
Stephen Casper | 35:03.030
Yeah, I think that's a really good question. And I think the answer to this question, or the best answer to this question, really differs depending on who the audience is, right? If the audience is technical people, you know, then all I have to say is that, hey, remember all of these algorithms, they use white box access. Remember all of this useful work that used outside the box access, etc, etc. As we all know, this enables a lot of really useful, powerful tools from the literature to be used to find issues with the system.
And this paper, which I really enjoyed working on, this paper that you mentioned, is hopefully useful for technical people as a reference, but it's not very surprising for most technical people. Because everyone in machine learning knows that when you can access the parameters and the activations and propagate gradients through a model, etc., etc., etc., you can do a lot more. And everyone knows that when you have outside-the-box access to contextual information, you can design much better evaluations.
But when I answer your question thinking about a policymaker audience, you know, then I think it's a little bit more interesting. Then I think it's much more important, or kind of novel, to understand. Because while the scientific community really, really knows, and has known for a long time, that white box access and outside-the-box access are really useful for evaluations, and it's never been a mystery or an open question, you know, we're not seeing this, right? We are not finding that existing examples of audits are the most rigorous types of audits that we might hope for.
And that's not to say that existing auditing work is bad, or that existing auditing work is, you know, being done a certain way in order to lock in low standards, right? I'm really excited about existing evaluations and auditing work. But I equally stress the importance of trying to work in the right direction, right? This paper that we wrote was the answer that a bunch of very talented co-authors and I came to when we were trying to ask ourselves how we can really raise the bar for what is expected of AI audits in the future. And the hope here is to communicate to semi-technical and non-technical audiences the importance of white box and outside-the-box access.
So to a policymaker, or someone interested in policy, the kind of thing I would be inclined to say here, and the reason why I think you might be interested in this kind of topic, is that from a scientific perspective, it's going to be very useful to have audits with as much access as we possibly can. But from a political perspective, it's going to be a battle to figure out what types of access are plausible and feasible, you know, what trade-offs are really worth making. And this is going to be a big challenge.
Whenever I talk to people about this paper, you know, the first question I would almost always get, the same question every time, would be, what about security, right? What about these trade secrets and proprietary models being protected within a lab, right? And the answer here, fortunately, is that the parameters of the system never need to leave the lab. And there are some different solutions here. In addition to a bunch of legal solutions that many other industries have involving secure oversight, things like non-disclosure agreements and contracts that prevent auditors from going and working for a competitor, there are also technical solutions, right? There are APIs. APIs can give great forms of light gray box access to systems that can allow auditors to effectively use tools that leverage the internals of models. And meanwhile, there are physical solutions too. You can just have auditors work on site, just like employees of the company itself.
And despite some of these problems being solvable, there's this big worry that they just might not be solved because there might not be the political knowledge or institutional will to do so. There's a lot of precedent for low access audits, which are good. I'm not saying they're perfunctory, but they're just not as powerful as the kind of audits that we might really want to stake society's future on. And I just think it's really important to communicate that not only are techniques that involve greater forms of access useful from a technical standpoint, they're also possible. And we should really seriously consider how to make this happen for high-risk systems when they get audited in the future.
Jakub Kraus | 39:52.616
A big topic in AI safety is reinforcement learning from human feedback. This is one of the most widely used methods for steering language models. And you wrote a popular paper last year on open problems and limitations of reinforcement learning from human feedback, or RLHF.
The main way RLHF works is humans give feedback on how desirable a chatbot's responses are, and then they train an AI model that's called the reward model to reflect, or represent, those human preferences. And then lastly, they use the reward model to steer the chatbot towards responses that humans prefer. And in your paper, you highlight some issues with RLHF that you find are tractable and could be fixed.
But you also highlight some issues that are fundamental and can't be solved within the RLHF paradigm. So some examples are humans can be misled; humans can't evaluate performance on certain tasks like very complex or difficult tasks; you need a lot of human labor for getting the high quality human feedback and having it be detailed and from diverse sources and on diverse examples; an individual human’s values might be hard to represent with a reward model; the reward model further could be imperfect, and if it's not perfect, then optimizing heavily for it could lead to problems; and having just this one reward function might be impossible to use to represent everyone's values in a really diverse society of humans. So what alternative approaches do you think researchers can explore to address these issues and build AI models that are more aligned and better serving human values?
Stephen Casper | 42:08.552
Yeah, and a good description of RLHF. Like you mentioned, there are kind of like three key processes that go into it. The first one, like you discussed, was this human preference elicitation part, where humans evaluate how good the model's performance has been. The second part is, you know, training this reward model to be a proxy for a human, and then the third part is updating the model to be better, updating the model's policy.
And just like there are these three algorithmic components of RLHF, there are three sets of associated problems. All the problems with RLHF kind of boil down to one of these three things: issues with the human feedback, issues with the reward model, and issues with the policy. And all the solutions can be sorted the same way.
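As a concrete sketch of the middle component, here is roughly what training a reward model on pairwise human preferences looks like in Python. The toy embedding dimension and the use of pre-embedded responses are simplifying assumptions; real reward models score raw text with a language model backbone.

```python
# Minimal sketch of the reward-model step of RLHF: fit a model so that responses
# humans preferred score higher than the ones they rejected.
import torch
import torch.nn as nn
import torch.nn.functional as F

reward_model = nn.Linear(768, 1)   # maps a response embedding to a scalar reward
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

def preference_loss(chosen_emb, rejected_emb):
    # Bradley-Terry style objective: maximize the margin between the reward of
    # the preferred response and the rejected one.
    r_chosen = reward_model(chosen_emb)
    r_rejected = reward_model(rejected_emb)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Usage sketch with random stand-ins for embedded (chosen, rejected) response pairs.
chosen, rejected = torch.randn(16, 768), torch.randn(16, 768)
loss = preference_loss(chosen, rejected)
optimizer.zero_grad()
loss.backward()
optimizer.step()
# The third step (not shown) uses this reward model as the objective for
# reinforcement learning on the chatbot's policy, e.g. with PPO.
```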
So how can we improve things within this framework? There's definitely exciting stuff that can be done on all three fronts. For example, I think just yesterday, OpenAI put out some new research on using AI assistance in the feedback process to help humans give better feedback or better catch problems with models. And this is a pretty cool example of some pretty simple, low-hanging-fruit work that OpenAI did to make some really meaningful progress on getting better human feedback in the process. And analogously, you know, better oversight of the reward model and better training of the policy using different methods can really help here.
But like you mentioned, there are also fundamental problems with RLHF rather than just tractable ones. And fundamental problems are things that can be mitigated, but solving them is not going to happen no matter how hard we try with any sort of method that we call RLHF, right? An example of a fundamental problem with RLHF is the fundamental fallibility of humans. We have limits on our attention and we have limits on our intelligence and our ability to effectively evaluate things, especially at scale.
So what do we do about this other class of fundamental challenges with RLHF? Like I mentioned, there are issues with human feedback that are just never going to go away. I think thinking constructively about solutions here gets more fun and interesting, actually, because you have to kind of take a system safety lens or a socio-technical safety lens and just try to kind of think outside the box of what could be used to compensate for or catch the failures that we're never going to fully see go away with RLHF.
A common concept from system safety, which has been adopted in AI safety work, is thinking of a safety system as having multiple layers or multiple safeguards that are kind of like Swiss cheese. Imagine a bunch of Swiss cheese slices kind of spaced out but laid parallel with each other. Every one of them is going to have holes, and sometimes those holes are going to line up with each other. And if you think of an attack or a failure mode as like shining a laser through, your goal is to have enough layers of Swiss cheese, and as few holes in each as possible, such that there is no laser that can be shone through all of them at once, right?
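A toy calculation of that intuition in Python, under the optimistic assumption that the layers fail independently; the per-layer miss rates are made up.

```python
# Toy Swiss cheese calculation: stacking safeguards shrinks the chance that a
# single attack slips through all of them, if the layers fail independently.
def prob_attack_passes_all(layer_miss_probs):
    """Probability one attack gets past every layer, assuming independent layers."""
    p = 1.0
    for miss_prob in layer_miss_probs:
        p *= miss_prob
    return p

# Three hypothetical safeguards that miss 10%, 20%, and 5% of attacks.
print(prob_attack_passes_all([0.10, 0.20, 0.05]))  # 0.001 under independence
```

In reality the holes can be correlated, which is the "lining up" worry, so adding diverse, independent layers matters more than adding many similar ones.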
So even though it's kind of not working within the framework, I think some of the most valuable work on making RLHF-trained systems more safe is just going to focus on, you know, how to wrap these systems and use these systems in a way that can give us some more levels of assurance, right? And you can quickly see how this starts to branch out to not just be a technical problem anymore but just be a socio-technical problem of how we really integrate RLHF-trained systems into society.
We're kind of in this chatbot age right now, this ChatGPT dynasty of AI systems where these advanced chat AI assistants are the state of the art. But as AI systems get smarter and more widespread and start providing more essential services across society to us, we're going to kind of leave this age and go into a new one, and it's going to be really, really important to figure out what layers of Swiss cheese we can put between increasingly advanced systems and users, between potentially compromised users and society, and sometimes between the developers of the systems themselves and the rest of society.
Jakub Kraus | 46:42.598
And this starts to get into the intersection between technical knowledge and technical research and human institutions, societal processes. So this is a kind of socio-technical AI safety research. And you mentioned to me that you're working on doing less purely technical research and more socio-technical research. So can you unpack why you're thinking of making this shift?
Stephen Casper | 47:22.882
Yeah, I am an abjectly technical person, or at least that's my background, right? But I've become kind of increasingly curious about doing more work that's socio-technically and governance relevant in recent years. Because I'm just kind of becoming increasingly pessimistic about technical solutions really making the difference or really saving us. I think lots of technical problems might be solvable through enough layers of Swiss cheese. And I think that solving them might not be neglected enough to warrant additional work from me in helping to solve them. And I think that the real challenge is likely to be societally focused.
One thing that I think about here is that, you know, imagine we fully solved all of the technical alignment problems we might ever want to solve. You know, what have we done? Well, we've reduced the risks of accidents a lot, but we haven't necessarily reduced a ton of risks from negligence or misuse. And we might have even increased that risk by kind of making the pace of AI development more rapid. Do you know what I mean? So in order to safeguard against ongoing, perpetual, into-the-future-forever risks from AI, I think we want strong institutions.
I think of AI as being maybe the same thing as nuclear weapons. We've solved nuclear alignment. The nuclear technology makes energy when we want, it blows up when we want, right? But we're still in perpetual crisis and danger mode as a global community from it, for institutional reasons. And I think AI might be the same.
So I've been trying to ask myself the question, what can I do that's not so technically focused anymore, but is kind of societally relevant? Sociotechnical AI safety, as I perceive it, is kind of an emerging field that is not working on anything no one has ever heard of before. It's working on topics that are a bit familiar, but it's doing so in a community and from a standpoint that I think is really nice.
One thing that anyone who spends enough time in the AI safety community and in other communities is probably, unfortunately, all too familiar with is that, you know, there's a lot of competitiveness and sometimes pettiness between different communities that are all trying to focus on making AI better for society or less dangerous for society. And it's arguably unnecessary, right? I think this is all, you know, to a large extent, just kind of a meme perpetuated by Twitter pundits. But there certainly is some degree, or some kernel, of conflict that exists between certain socially responsible AI communities and certain technical AI safety communities.
Sociotechnical AI safety is a field that I think is potentially very well positioned to be a good bridge between communities that haven't gotten along as well as they should have in the past. And I really like its ability to focus on the real AI problems, the ones that are going to emerge in the context of society and power structures and politics.
Jakub Kraus | 50:42.860
Interesting. I can definitely see that there's this historical gap involving people who want to tackle some of the systemic issues surrounding AI: stuff that's not purely about building safe technology, in the sense that it doesn't blow up when you're using it, but also about avoiding problems with misuse or concentration of power. And I think that's a promising way to try to bridge this gap. And there's probably a lot of fruitful forms of socio-technical AI safety research. So what are some kinds that you think are especially important, particularly for technical researchers or people with a more technical background to start contributing to?
Stephen Casper | 51:41.132
I think there are probably three things that come to mind for me here. The first one is evaluations, right? Like I mentioned earlier, as someone with a background in attacks and robustness, I've been thinking a lot about evaluations recently. And earlier, I was talking a lot about this necessity of figuring out how to solve our institutional problems and build institutional infrastructure for doing AI evaluations that properly respects the technical toolbox and the technical limitations of certain pieces of that toolbox. So, more of what I said earlier, right? About the importance of figuring out how to do evaluations.
The second answer to this question I think is very closely related. And it's to evaluate societal impacts of AI rather than just evaluating behaviors or capabilities of AI systems. Technical safety people are really, really good at benchmarking and red teaming AI systems in order to determine the extent to which they have a certain property or tendency that can be understood, in the context of a data set, by looking at their behavior. But a much harder problem, and a problem that is consequently much more neglected and much more poorly understood, is the actual problem: trying to figure out what effects the AI system is going to have in society, in the complex social fabric that it's going to be deployed into.
And there's this challenge right now in the literature, I think, involving a paucity of research, a very small amount of research currently, on evaluating and quantifying socio-technical harms of AI systems, you know, second- and third-degree effects that are kind of hard to monitor until the system's actually deployed. So one area of research I'm really interested in is thinking about how to do societal impact forecasting and societal impact evaluation better. And figuring out how to come to an understanding that we as a society, you know, probably want to control AI such that it proceeds at a pace that allows us to really thoroughly or adequately understand the impacts of the last generation of systems before we deploy the next.
My third answer here, my last one involving socio-technical AI topics, is a little bit meta. It kind of focuses not just on the AI systems but on what the AI systems enable, not just on the power of a system but on the power it confers to humans, and not just on the potential degree of an AI system's misalignment with its user but on the degree of its user's misalignment with society. In machine learning, or in AI safety, we talk a lot about Goodhart's law.
Goodhart's law states that, usually, any statistical proxy for your goal ceases to be a very good proxy once it's optimized for. And there's this worry that we will design AI systems or test AI systems on useful proxies that cease to be useful when the AI system is optimizing against that proxy. And this is how we can get things like deception and misgeneralization and reward gaming and specification gaming.
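A toy numerical sketch of that dynamic in Python; the functional forms are made up purely to show the proxy and the true goal coming apart under heavy optimization.

```python
# Toy illustration of Goodhart's law: a proxy that tracks the true goal at low
# effort stops tracking it once optimized hard.
import numpy as np

effort = np.linspace(0, 10, 101)
proxy_score = effort                      # the measured proxy keeps rising with "gaming"
true_value = effort - 0.15 * effort ** 2  # the real goal peaks, then degrades

i_proxy = int(np.argmax(proxy_score))     # where a pure proxy-optimizer ends up
i_true = int(np.argmax(true_value))       # where we actually wanted to be
print(f"Proxy-optimal effort {effort[i_proxy]:.1f} gives true value {true_value[i_proxy]:.2f}")
print(f"Truly optimal effort {effort[i_true]:.1f} gives true value {true_value[i_true]:.2f}")
```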
I think a really nice thing that the socio-technical AI community is poised to do is really meaningfully do research based on the fact that Goodhart's law doesn't just apply to the AI systems that we build. It applies to us. It applies to all of our institutions, right? Because we ask certain things of institutions and we give institutions certain incentives. For companies, it is political capital and money. For governments, it is votes and more political capital. But optimizing for those things is not the same thing as optimizing for societal benefit. So studying the things that we're all going to get wrong because of the ways that our institutions are set up, you know, governmental ones, private research ones, even academic ones (I can point the finger at myself), is something that I think the socio-technical AI community can do. And it's stuff that I think is pretty important and pretty neglected, because it's a challenge.
Jakub Kraus | 56:35.210
Yeah, Goodhart's Law is a really useful concept. And this example of how there's lots of areas of society that are exemplifying it seems important. If you're running a private corporation, you will be trying to make money and ensure that shareholders are satisfied. And this typically works somewhat well because people will pay money for stuff that they actually want, but then it can run into problems with addiction. So a tobacco company might make a lot of money, but it also could get even children and teenagers addicted to a product that's ultimately harmful. And there's a lot of debate right now around whether social media falls into this class of being harmful for kids, even though people want to use it a lot. So these might be seen as examples of optimizing hard for this profit measure not necessarily translating into societal benefit.
Is there anything else you wanted to say or wish I had asked about?
Stephen Casper | 57:55.041
Yeah. I guess one thing I'll add is that I might be on the faculty job market in maybe two years. Who knows? Wink, wink. Another thing to add is that if anyone wants to contact me, you can go to stephencasper.com and find my email. And another thing in general is that if you want to dive into what I do or get a good idea of what I think about, try to read between the lines of the paper titled Black-Box Access is Insufficient for Rigorous AI Audits. Because I hope that paper can be used to infer that I'm working toward trying to raise the bar on audits a lot. And we'll see about what's next there.
Jakub Kraus | 58:42.283
So if people want to find your work, you mentioned your website, stephencasper.com. Any other places?
Stephen Casper | 58:50.388
I have a Twitter, for better or for worse, as many other people in the machine learning community do. You can find me there as well if you are so interested.
Jakub Kraus | 59:00.775
Great. Cas, thank you so much for joining the show.
Stephen Casper | 59:06.479
Yeah, it's been really nice. And thanks for the chance to chat.
Jakub Kraus | 59:12.743
Thanks for listening to the show. You can check out the Center for AI Policy Podcast Substack for a transcript and relevant links. And if you have any feedback, please feel free to email me at jakub at AI policy dot us. Looking ahead, next episode will feature Professor Ellen P. Goodman, a distinguished professor of law at Rutgers Law School, discussing policies to promote accountability in AI. I hope to see you there.