#8: Tamay Besiroglu on the Trends Driving Past and Future AI Progress
A comprehensive overview of the factors shaping AI progress, and an analysis of AI's potential trajectories over the coming years
Tamay Besiroglu, Associate Director of Epoch AI, joined the podcast to provide a comprehensive overview of the factors shaping AI progress, from algorithmic advances and hardware scaling to data availability and economic incentives, and to analyze the potential trajectories of AI development over the coming years.
Available on YouTube, Apple Podcasts, Spotify, or any other podcast platform.
Our music is by Micah Rubin (Producer) and John Lisi (Composer).
Relevant Links
Will We Run Out of Data? (Epoch)
Training Compute of Frontier AI Models Grows by 4-5x per Year (Epoch)
Compression Represents Intelligence Linearly (Yuzhen Huang et al.)
Situational Awareness (Leopold Aschenbrenner)
Approaching Human-Level Forecasting with Language Models (Danny Halawi et al.)
Transcript
This transcript was generated by AI with human oversight. It may contain errors.
(Cold Open) Tamay Besiroglu | 00:00.847
There’s some amount of compute that will result in a model that is able to automate most of the things that humans are able to do. And this amount of compute is probably not more than 10 orders of magnitude more than we are currently using.
Jakub Kraus | 00:23.368
Welcome to the Center for AI Policy podcast, where we zoom into the strategic landscape of AI and unpack its implications for U.S. policy. I'm your host, Jakub Kraus, and today's guest is Tamay Besiroglu. Tamay is Associate Director of Epoch AI, and we talk about trends and constraints on key inputs to AI research, like hardware, data, algorithms, energy, spending, and so on, and how much AI progress we can expect to see in the coming years. I hope you enjoy it. Tamay, thanks for coming on the podcast.
Tamay Besiroglu | 01:10.972
Great to be here.
Jakub Kraus | 01:13.494
Can you outline a simplified model that the audience might find useful for understanding what's fueling AI progress?
Tamay Besiroglu | 01:26.195
Sure. So this is a question that we at Epoch have worked quite extensively on, decomposing the different factors that are contributing to and shaping the development of AI. And I think the three things that might be most important are, first, the scaling up of the amount of compute that is used in training these systems. This is in large part the result of more spending on larger clusters, larger data centers, more advanced GPUs. And it's also partly driven by innovation in GPUs becoming faster and more efficient. Overall, we see that the amount of compute used in training is being scaled up by about 4x per year, which is much faster than it used to be. Before deep learning, we saw this being scaled up by about 30 or 40% per year, roughly the rate of Moore’s Law, but now it's 4x per year, which is much accelerated. The second important trend is algorithms. So this is architectures, training techniques, a bunch of the primitive objects and structures that are used for deep learning, like activation functions and embeddings and so on. And these contribute the equivalent of about 2 or 3x per year in terms of compute scaling. And so we are making progress faster than just the rate at which we're scaling compute, because of these gains from better algorithms. And then the other part, which is quite central too, and which enables us to scale up these models, is the scaling of data sets and the amount of data that we are using to train these models, which is scaling up at about 2 or 3x per year. And this is largely enabled by the fact that there's an abundance of internet data available that we can use for training.
Jakub Kraus | 03:52.415
Now, data and compute are linked, right? The data is directly proportional to the compute: roughly, compute is six times the number of data points times the number of parameters.
Tamay Besiroglu | 04:09.564
That's right. So the amount of training compute that you need to expend is proportional to the number of tokens seen during training. And there are some results about how to optimally scale the amount of data given some compute budget. And the relationship here is that you should roughly scale the size of your data set, the number of tokens seen in training, with the square root of the amount of compute that you use in training. So that seems roughly right. Even considering some of the things around overtraining, where you might want to use slightly more data than some of the scaling law results might suggest, roughly scaling things with the square root reflects our understanding of how to scale data sets.
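To make the relationship just described concrete, here is a minimal sketch in Python. It assumes the standard C ≈ 6·N·D approximation mentioned above and a Chinchilla-style ratio of roughly 20 tokens per parameter; both constants are rough rules of thumb from the conversation, not exact values.

```python
import math

# Rough sketch: back out a compute-optimal (params, tokens) pair from a FLOP
# budget, assuming C ~= 6 * N * D and D ~= 20 * N (Chinchilla-style ratio).
# Both constants are rules of thumb quoted in the conversation, not exact.

def optimal_params_and_tokens(compute_flop, tokens_per_param=20.0):
    n_params = math.sqrt(compute_flop / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

for flop in (1e24, 1e25, 1e26):
    n, d = optimal_params_and_tokens(flop)
    print(f"{flop:.0e} FLOP -> ~{n:.1e} params, ~{d:.1e} tokens")

# Note how a 10x larger compute budget increases tokens (and params) by
# roughly sqrt(10) ~= 3.2x, which is the square-root scaling described above.
```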
Jakub Kraus | 05:09.961
And you recently published an attempted replication of the Chinchilla paper, which found the optimal way to scale data along with parameters, like a good ratio between them. What were the key takeaways from that? Is the Chinchilla paper wrong?
Tamay Besiroglu | 05:34.377
Yeah, so I think the takeaway from that is not so much that the Chinchilla paper was wrong, but that it had an error. So the Chinchilla paper derived these three scaling laws using three different methods. And these scaling laws tell you how to optimally scale the size of your model and the amount of data given some compute budget. And what they found is that you want to roughly scale your data and model size proportionally. And so you get this square root scaling, given that the amount of compute is the product of these two things. So if you double your compute budget, you want to increase your data set and model size by the square root of two. And that is what they found. And they had a parametric scaling law that told you, given how large your model is and how many tokens you're using, how to predict the loss of the model on predicting the next token.
And they had made a mistake in how they estimated this parametric scaling law. And so this scaling law gave you inconsistent suggestions about how to scale your model optimally, inconsistent with the other methods that they had used. And so their paper was effectively correct, in that scaling these two things proportionally is right, and if you correctly estimate this parametric scaling law, then you do find that you want to scale these two things proportionally. But in their paper, they had this inconsistent result, and they didn't really know how to resolve it. And I think they just said, well, we have these two other methods, and these two other methods tell us that we should scale things proportionally. And that was the overall conclusion of the paper, but they had this other method that was inconsistent.
And so we effectively resolved that inconsistency and unified the three scaling laws that they had. I think the parametric scaling law was also maybe the most important contribution of the paper, had it been correctly estimated, because it enabled you to predict the loss value of your model, like how well your model performed, at any arbitrary combination of model size and number of training tokens. And the other methods didn't do that. So I think it doesn't really overturn any of the main conclusions, but it refines and unifies the scaling laws that they found for doing the scaling.
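For readers who want the functional form being discussed: the parametric law models loss as L(N, D) = E + A/N^α + B/D^β, and minimizing it subject to C ≈ 6·N·D gives a closed-form compute-optimal split. The sketch below uses the coefficient values reported in the original Chinchilla paper purely for illustration; Epoch's corrected estimates differ, which is exactly the inconsistency discussed above.

```python
# Chinchilla-style parametric scaling law and its compute-optimal allocation:
#   L(N, D) = E + A / N**alpha + B / D**beta,  with C ~= 6 * N * D.
# Coefficients are the ones reported by Hoffmann et al. (2022), used purely
# for illustration; Epoch's re-estimation corrects them.

E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def predicted_loss(n_params, n_tokens):
    return E + A / n_params**alpha + B / n_tokens**beta

def compute_optimal_split(compute_flop):
    # Minimizing L subject to C = 6*N*D gives:
    #   N_opt = G * (C/6)**(beta/(alpha+beta))
    #   D_opt = (C/6)**(alpha/(alpha+beta)) / G
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    n_opt = G * (compute_flop / 6.0) ** (beta / (alpha + beta))
    d_opt = (compute_flop / 6.0) ** (alpha / (alpha + beta)) / G
    return n_opt, d_opt

n, d = compute_optimal_split(1e25)
print(f"~{n:.2e} params, ~{d:.2e} tokens, predicted loss {predicted_loss(n, d):.3f}")
```

With a corrected fit the two exponents come out roughly equal, which is what makes the optimal model size and data set each scale with the square root of compute, as described above.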
Jakub Kraus | 08:42.230
You mentioned that you need to increase the size of your data set by orders of magnitude. But Epoch found that we will probably run out of high quality language data before 2026. Can you first unpack the difference between high quality and lower quality language data in your analysis there?
Tamay Besiroglu | 09:12.426
Yeah, so these notions are somewhat rough, because we don't really have a mature literature on measuring the quality of data and how it relates to the performance of these models. So we roughly operationalize high quality data as the training data that labs are actually using to train models, and we basically assume that the data they're filtering out is lower quality data. So you might think about GitHub codebases, Wikipedia, books, and arXiv as the quintessential examples of high quality data, and random internet blogs with random internet users contributing as maybe somewhat lower quality text. But since we originally wrote this paper in 2022, we have developed a better understanding of this distinction, because a lot more research has come out on it.
Jakub Kraus | 10:40.406
How has the understanding evolved since then?
Tamay Besiroglu | 10:44.212
Yeah, so I think it's evolved in a couple of ways. One is that there has been research on how to filter Common Crawl, which is this scrape of the internet, in a way that produces models that perform really well on tasks. And there have been some ablation studies showing that you can actually get away with not filtering super aggressively on these large corpora of internet scrapes and still get pretty good results for performance. And so as some of these techniques have evolved, we've become more optimistic about training on this lower quality data, which effectively extends the runway that we have for scaling these models and finding more useful training data.
Jakub Kraus | 12:00.522
What would be the earliest year that even that would be exhausted, even relying on the lower quality data?
Tamay Besiroglu | 12:12.133
Sure. So I can quickly go through some of the estimates about the stock of data that we have and how much data we're currently using. Right now there's Common Crawl, which is this effort by a nonprofit to periodically scrape the web, and they've been doing this for quite some time. We estimate that it contains on the order of 100 trillion tokens. For reference, some of the largest training data sets that have been used to train models are on the order of 10 or 20 trillion tokens. Now, maybe OpenAI and Google are training on slightly more than 10 or 20. We don't really know. But at any rate, given the size of Common Crawl, there is still some room for expanding our data sets. And of course, there's also data that isn't scraped by Common Crawl. The indexed web, what's indexed by search engines like Google, is potentially quite a bit larger than what is crawled by Common Crawl, and we estimate that to be on the order of 500 trillion tokens, so maybe 5x the size of Common Crawl.
And then there's also some recent work that shows that you can train for multiple epochs, meaning you can show your model the same data multiple times, and that seems to have roughly the same effect as training on new data. So you can effectively train for maybe 4 or 5 epochs, and that has the same effect as having access to 4 or 5x as much data. And so that, again, extends this runway. So we have maybe 500 trillion tokens on the indexed web, and we can train for maybe four epochs, which gives us roughly two quadrillion tokens.
Now, there might be some filtering that you have to do. Almost certainly there's going to be quite a bit of filtering to get at the highest quality text. But the latest research suggests that this doesn't need to be super aggressive, so maybe filtering out half, or at most 90%, of that data. That leaves you with about 50 to 200 trillion tokens worth of good training data. And again, if you train on this multiple times, that gives you a lot of effective data. This would be sufficient for training a model at a scale of about 10 to the 28th or 10 to the 29th floating point operations. So that's about three or four orders of magnitude more than Gemini Ultra or GPT-4, which is roughly a GPT-6 level model.
We probably have enough data to train a model of the scale of GPT-6, considering the roughly two orders of magnitude difference between these GPT iterations. And this would roughly be another four years of scaling that we could effectively do given the amount of data that exists on the indexed web. That is enough to get quite a lot of progress, presumably. Four years in AI is a long time, and a lot of progress happens in four years. And so this runway of about four to five years is actually not terribly short. It's sufficient to enable us to make a lot of progress.
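A quick way to check the arithmetic in this answer is to plug the quoted figures into the same C ≈ 6·N·D, ~20-tokens-per-parameter rule of thumb from earlier. Every constant below is a rough number from the conversation, not a precise estimate.

```python
# Back-of-the-envelope version of the data-stock argument above.
indexed_web_tokens = 500e12   # ~500 trillion tokens on the indexed web
epochs = 4                    # repeating data ~4x is roughly as good as new data

for keep_fraction in (0.1, 0.5):          # filter out 90% vs. 50% of the text
    effective_tokens = indexed_web_tokens * keep_fraction * epochs
    # With C ~= 6 * N * D and D ~= 20 * N, the supportable compute is C ~= 0.3 * D**2
    compute_flop = 0.3 * effective_tokens ** 2
    print(f"keep {keep_fraction:.0%}: ~{effective_tokens:.0e} effective tokens "
          f"-> ~{compute_flop:.0e} FLOP")

# Roughly 1e28 to 3e29 FLOP, i.e. three to four orders of magnitude beyond a
# GPT-4 / Gemini Ultra scale run, consistent with the estimate above.
```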
Jakub Kraus | 16:38.995
Yeah, so if we've got enough data to get to GPT-6 levels of compute, then the natural question is, can we get that much compute? So you were alluding earlier to how the computational resources used in AI training are growing quickly. Up until about 2010, Epoch found that the growth was about in line with the pace of Moore's Law, Moore's Law being the observation that the number of transistors that can fit onto one computer chip doubles about every two years, and Epoch found that the compute in AI training was doubling about every 20 months. But then it started growing faster, as you mentioned, with companies spending more to use more compute in tandem. So let's first talk about spending. Do you think that companies will be able to afford a level of compute for training GPT-6? Or will they need to gradually build enough incentive to spend that much? And when do you think that level of spending for a GPT-6 level model might actually be feasible?
Tamay Besiroglu | 18:04.519
So right now, companies are spending on the order of maybe $100 million to $500 million on a single training run, counting the hardware for the period over which they're using it, and then the researchers, electricity, and a bunch of other things.
Jakub Kraus | 18:34.609
And which of those costs dominate?
Tamay Besiroglu | 18:40.511
Yeah, it depends. I think a large portion is going to be the hardware. These chips are just very expensive. But in some cases the research effort, the number of researchers involved, is just very large, and these researchers make a lot of money. I think the Gemini Ultra paper had on the order of 1,000 authors on it. If they're making multiple hundreds of thousands of dollars per year and spending a good fraction of the year on that project, that's already a pretty sizable wage bill. And so for some training runs, for some of these development efforts, wages may currently be the largest spending segment. Behind that might be the GPUs, and then also the physical buildings of the data centers and so on, which are also very large. So those would be my guesses as to what's happening. Now, these companies are not super transparent about this, so this is just based on my somewhat informed speculation.
Jakub Kraus | 20:01.367
Great. And then how do you see spending evolving?
Tamay Besiroglu | 20:08.147
Sure. So right now, as you said, the amount of spending on, say, hardware is growing by maybe 3x every year. And these training runs cost on the order of maybe $500 million. So next year, we might get a single training run that costs over a billion dollars, and then maybe $5 billion the year after that, and so on. These numbers are very large, and I think it's a good question to ask when these companies might no longer be able or willing to make the next jump to another 3x in spending, as these sums get progressively larger. And I think we are at a fairly low base with how much is currently being spent. Hundreds of millions of dollars is, of course, in some sense a large number. But for these tech companies, which have hundreds of billions of dollars in revenue and similar amounts being spent on other R&D activities, I think it is very possible that they can continue scaling the amount of spending by another 10 or 100x.
I think developing really capable AI might just be extremely valuable. Globally, we spend about $50 trillion per year on wages. And developing a technology that could automate a substantial fraction of labor could be worth some fraction of that wage bill that is spent every year. So that's on the order of trillions of dollars a year. And of course, that's a flow. So if you do a discounted cash flow analysis, it would suggest that building a technology that could automate a large fraction of the work that gets done globally could potentially be worth spending on the order of trillions of dollars on.
Now, it's obviously going to be quite hard for companies to access that much capital. There are some rumors and reports that OpenAI, Sam Altman in particular, is interested in raising on the order of trillions of dollars. And I'm not quite sure what to make of that, but I think it's not entirely implausible that this decade we will see investments, or at least commitments to invest, on the order of hundreds of billions or perhaps even a trillion dollars on all the infrastructure that is needed to produce chips, get them into data centers, and train very large models.
Some other useful reference points are the Apollo program or the Manhattan Project, on which the US government spent on the order of 1% of US GDP. It might be hard for companies to spend one percentage point of GDP. But maybe if they get sufficient support from government, or are able to convince investors with the technology that they have already, I think this is not entirely unlikely.
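As a rough sketch of the compounding being described, assuming the ~$500 million figure for a current frontier training run and the roughly 3x-per-year growth rate (both rough numbers from this conversation, with 2024 taken as the starting year):

```python
# Compounding the quoted ~3x/year growth from a ~$500M frontier training run.
cost_usd = 0.5e9
for year in range(2024, 2031):
    print(f"{year}: ~${cost_usd / 1e9:,.1f}B for a frontier training run")
    cost_usd *= 3.0

# By the end of the decade this reaches the hundreds of billions of dollars,
# which is why the discussion turns to Apollo- and Manhattan-scale reference points.
```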
Jakub Kraus | 24:06.370
Yeah, we saw Schumer's AI roadmap come out, and it was encouraging ramping up to $32 billion a year, and then potentially more than that later on. So I could see governments contributing substantially. Now, the natural next question is, you can spend that money, but what are you going to be able to purchase with it? So starting with the chips themselves, there's a really complex supply chain behind them. SK Hynix makes high bandwidth memory, and they are already mostly sold out through 2025. Same for their competitor Micron. GPUs also require these extreme ultraviolet lithography machines, and some of the most advanced ones, the high numerical aperture ones, are sold out for the next 12 months. So some of these key components in the GPU supply chain seem to be selling out. We can also look at how much of the final product is getting sold: for the best GPU from NVIDIA, maybe hundreds of thousands were sold in 2023. But then an individual company like Meta says they have plans to get 350,000 H100s by the end of the year, and a full compute arsenal equivalent to 600,000 of those. So it seems to me that, potentially, you couldn't, at least today, spend a hundred times as much on compute and actually be able to acquire the chips. Does the supply of AI chips seem like a constraint to you over the next few years?
Tamay Besiroglu | 26:15.131
I'm not sure if it's going to be a major constraint over the next few years. We know roughly how much TSMC is scaling up production. And so we know, for instance, that NVIDIA shipped around half a million H100s in 2023 and that they're expected to ship around 2 million H100s in 2024. So that's 4x per year, which is the same rate at which we're actually increasing the use of compute, which suggests that we were able to meet this growth in demand.
Now, I agree the supply chain is extremely complicated, and there are many difficult, complex pieces of equipment and parts that you need to get ready in the right quantities at the right time to actually be able to supply chips on schedule. And we've seen this production be bottlenecked by packaging these chips. Some of these bottlenecks seem to have been somewhat resolved at this point. If you just look at the number of leading-edge wafers NVIDIA procures, so less than 7 nanometers, I think it's actually much more than they're putting into H100s, which suggests to me that, at least if packaging isn't an issue, they should in some sense be able to produce more H100s. Now, maybe it's just really hard to scale up packaging. It's unclear. My sense is that if you throw enough money at some of these constraints, they will be somewhat resolvable.
I think another consideration here is that companies might just be somewhat unwilling to spend a lot of money given the margins involved. NVIDIA has these ridiculous margins. And so companies might be partly unwilling to spend, say, $100 billion on scaling up compute if 90% of that is just being paid out in dividends by NVIDIA. And so maybe there's some sense in which the lack of competition is resulting in a slump, in less demand than there otherwise would be. Now, maybe this playing field becomes more competitive as some of these chip competitors mature. OpenAI is interested in having their own way of securing chips. Google has their own TPUs, so they don't have to incur the same difficulties involved with dealing with NVIDIA and the margins they're charging. And so my sense is that at least some companies are probably able to scale up the chips that they're receiving very rapidly, and TSMC is able to produce enough wafers to make that happen.
Maybe this becomes a more important bottleneck over a longer horizon, where TSMC is at capacity for how many AI chips it can produce and many of its leading-edge fabs are focused just on producing data center GPUs. At that point it might become quite hard to scale up, because they'll have to build new fabs, which takes on the order of three to four years, and there might be some bottlenecks associated with that. Now, maybe if they plan accordingly and foresee this surge in demand, they can make this a fairly smooth scale-up.
My guess is that there's going to be some occasional hiccups in the supply chain where we will have difficulties with scaling and shortages and so on. I think this is going to be an important problem, but I don't think it really stalls the scaling very substantially.
Jakub Kraus | 31:08.978
Okay. And then there's all this other equipment in the data center, like cooling for the GPUs, because they get very hot. And I haven't researched whether there are any bottlenecks there on the non-GPU equipment in a data center. But one that gets talked about a lot is electricity. So you looked even at the limits of the energy efficiency of AI chips, and you estimated that they can't be more than about 200 times as efficient as they are now in converting energy into computation. Now, even if we were able to get that 200-fold improvement, you'd still need a massive quantity of chips to get something like 10 to the 29 operations in training one AI model. And there are some techniques where maybe you could distribute those chips so they're not all in one place drawing that electricity. But absent that, it seems like a bunch of energy needs to get to a single physical location. And Amazon recently bought a data center near a nuclear power plant. It might use up to a gigawatt of electricity. So will energy potentially, by 2030, become another looming constraint on AI scaling?
Tamay Besiroglu | 32:45.544
Yeah. So just for some reference, a model like Gemini Ultra or GPT-4 requires on the order of tens of thousands, maybe 10,000, state-of-the-art data center GPUs. And they run at around 700 watts. Now, if you add in the electricity needed for cooling and other things, that's maybe double that. So 10,000 of those running at about 1,000 watts is about 10 megawatts, which is, I think, the equivalent of what a thousand American households consume in electricity. And this is growing very rapidly, in line with the amount that we're scaling these training runs. We get some benefit from more efficient chips, but those efficiency gains are not large enough to offset the overall increase that you get from just scaling up the number of GPUs you're using. And as you said, there might be limits to increasing the energy efficiency of these chips.
So it does look like the amount of electricity needed to support this training is going to grow very rapidly. Now, as you said, a key question here is whether you can distribute a single training run across multiple data centers that are geographically separated, so that you can access a bunch of different grids and, as it were, spread out your electricity draw across a bunch of different power plants. If you could do that, then these constraints look a lot less worrisome. Now, my understanding is that it is quite important to have these data centers be closely geographically co-located, or even to have everything in a single data center. And to support that, what you need is a lot of draw from the electricity grid, or a power plant that is co-located with your data center.
And companies, as you suggest, are scrambling to figure out a way to support that electricity draw. So they're buying solar farms. I think Meta bought a couple of solar farms, or might have just contracted with them. And Amazon bought this data center next to a nuclear power plant, which supports about 1,000 megawatts, roughly 100x more than was probably used to train something like GPT-4. So there's still maybe 100x room for more electricity draw, which means you could scale up your models by about two orders of magnitude of compute even with the most energy-intensive data centers that we might have right now. I think the nuclear power plant that Amazon is working with might be able to provide even more electricity than that, so there might be even more room there. So right now, in the near term, I think companies are scrambling to figure this out. My guess is they will probably get something that works reasonably well.
The electricity grid might be a bit of an issue where there are known issues with expanding the amount of electricity that is connected to the grid. So these grids have these long queues where a power plant willing to supply additional electricity takes a bunch of time for them to get hooked up. So it's potentially hard to get a bunch more electricity online. Then there are issues with producing these distribution transformers that are important for these power lines. So I think there's a bunch of uncertainty here. My guess is that we can probably scale up the amount of electricity that is being used by just a couple orders of magnitude, even with the current power plants that we have.
Now, maybe this is not going to be easy. It requires these companies to buy, or at least contract with, these very large, very expensive power plants. But companies are trying to do this. So my guess is things will probably work out in the next couple of years. After that, it becomes much less clear. If you need many power plants to supply the electricity for a single training run, then that becomes harder, because of issues connecting with grids and things like that, and also just because of construction and the regulation involved in building these huge power plants. So I have a lot of uncertainty about this constraint after three or four years. And I think more work by people who actually have expertise in data centers and electricity grids would be quite helpful for trying to figure this out.
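A minimal sketch of the power arithmetic in this answer, using the round numbers quoted (roughly 10,000 H100-class GPUs at about 700 W each, doubled for cooling and other overhead, compared against a plant on the order of 1,000 MW):

```python
import math

# Power arithmetic from the discussion above, using the round numbers quoted.
gpus = 10_000              # roughly a GPT-4 / Gemini Ultra scale cluster
watts_per_gpu = 700        # H100-class board power
overhead = 2.0             # cooling, networking, facility overhead (rough)

cluster_mw = gpus * watts_per_gpu * overhead / 1e6
plant_mw = 1_000           # roughly the nuclear plant discussed

headroom = plant_mw / cluster_mw
print(f"cluster draw: ~{cluster_mw:.0f} MW")
print(f"headroom vs. a {plant_mw} MW plant: ~{headroom:.0f}x "
      f"(~{math.log10(headroom):.1f} orders of magnitude)")
```

This is roughly where the "about 100x, or two orders of magnitude" headroom figure in the answer comes from.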
Jakub Kraus | 39:08.845
Yeah, that 100x sounds like a lot, but earlier we were talking about a four order of magnitude, or 10 to the fourth, jump, rather than a 10 squared jump. So does that mean that we should cap out what we expect in compute scaling at about 100 times what it is today?
Tamay Besiroglu | 39:38.194
Yeah, I don't recommend doing that, just because this is a very new bottleneck, and companies have only just started trying to figure out ways around it in the last couple of years, or maybe even more recently than that. And so, given, again, the value of training these large AI systems, I suspect that there's going to be a lot of effort and money and lobbying involved in trying to figure out ways of getting a bunch of electricity. My guess is this could probably just turn out fine and end up not being a huge bottleneck. It's possible that it turns out to be an important bottleneck, and one that might make it hard to scale beyond 100x from what we currently have. Now, maybe what you could do is look at the largest power plants that we have available in the US or elsewhere, these large hydroelectric dams and so on, and they produce, I think, a lot more electricity than the solar farms that Meta is working with, or even the nuclear power plant that Amazon is working with. So I don't see any reason in principle why it's not possible to scale up beyond that, especially if the incentive is sufficiently large to get these behemoths to take this as a very, very serious problem that they should spend a large fraction of their effort and energy trying to resolve.
Jakub Kraus | 41:28.063
Okay. But still, how many more orders of magnitude can you get even from all that effort? Maybe you could get to 10 to the 29, but are we ever going to be able to go past that?
Tamay Besiroglu | 41:44.580
Right. So my guess is that once you train a model of the scale of GPT-6 or something like that, it becomes a lot easier, at least if the scaling laws continue to hold and you start seeing models that can automate a large fraction of tasks in the economy. I think it'll become a lot easier to raise the financing, and to raise the support from relevant governments, to continue scaling up these systems. And so I agree that if you wanted to train models at that scale today, it might be quite hard to figure this out. But my guess is that you could demonstrate the value of doing the additional order of magnitude of scaling once you have GPT-5 or GPT-6. And at that point, you could have a lot of financing for buying and scaling up power plants, or a lot of governments might be interested in providing access to a lot of power in exchange for some involvement or some access to the technology that's generated. Maybe there are national security interests involved that result in governments being willing to have these large power plants constructed and make that happen quite quickly. On top of that, it seems possible that we figure out how to train these models in a geographically distributed way, and maybe that ends up working. And if so, then this constraint becomes much less limiting.
Jakub Kraus | 43:41.636
Okay. So we've talked a bit about whether compute scaling can continue. One quick last point there is that maybe companies can get a bit more bang for their buck on something like chips, or maybe there will be better optimizations in networking equipment and overall data center design, since there hasn't been as much of an incentive in the past to optimize AI-specialized data centers relative to other kinds of computing data centers.
But specifically on chips, Epoch found that the computation that can be bought at a given price, so the operations per dollar on top-performing GPU chips, doubled about every three years between 2006 and 2021. And I believe in another piece you found that for machine learning GPUs, you might see a doubling every 2.1 years. And then if we just look at NVIDIA's top GPUs, they vary a little bit in price, but just to show that they are seemingly getting faster: the B200 GPU from 2024 is more than twice the speed of its predecessor, the H100 from 2022, which in turn is more than triple the speed of the A100 GPU from 2020. So are these roughly the right numbers to be thinking of in terms of how many operations you can buy per dollar spent on a chip?
Tamay Besiroglu | 45:34.547
Yeah, I think that's right. I mean, there are a lot of other things that you have to spend on. Networking is one, which is maybe about 10 to 20% of the overall hardware cost, and then cooling and the physical buildings of the data centers and so on. But the cost of computation is going down at the rates that you said, maybe even slightly faster if you look at going to lower precision formats. So most training has happened in FP16, half precision, and maybe we could get even lower, to 8-bit precision, and that might give us a 2x increase in the amount of compute per dollar for training. So this is fairly rapid, but it's not as rapid as the scale-up just from additional investment. So I think the thing that matters much more than this technological improvement in chips is just how much labs are able and willing to spend on scaling up these training runs.
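To compare the two growth rates being contrasted here, a small sketch using illustrative numbers: FLOP per dollar doubling every two to three years, a possible one-time ~2x from moving from FP16 to FP8 training, versus spending growing roughly 3x per year.

```python
# Comparing hardware price-performance gains with the growth in spending.
years = 6
price_performance = 2 ** (years / 2.5)   # FLOP/$ doubling every ~2.5 years
precision_bonus = 2                      # possible one-time FP16 -> FP8 gain
spending_growth = 3 ** years             # ~3x/year growth in investment

print(f"FLOP per dollar: ~{price_performance * precision_bonus:.0f}x over {years} years")
print(f"spending:        ~{spending_growth:.0f}x over {years} years")

# Spending growth dwarfs the hardware efficiency gains, which is the point
# made above about investment mattering more than chip improvements.
```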
Jakub Kraus | 47:00.444
And let's say companies scale, maybe they get to 10 to the 27 in a few years, or 10 to the 29 by 2030. There's this extra shadow improvement happening behind the scenes: algorithmic progress. The numbers on that are in Epoch's investigation of both vision models using image data and language models using text data. There are these big improvements in how much performance you can attain with a given level of compute. For language models, the compute required to reach a set performance threshold halved approximately every eight months between 2014 and 2023, and I believe it was nine months for vision models, so that's pretty similar. Do you think these training efficiency improvements are going to continue at roughly the same pace? Could they slow down or speed up?
Tamay Besiroglu | 48:12.063
Yeah, so in the paper on language models, we found that roughly every eight months you get a doubling of effective compute from improvements in algorithms. And we investigated whether this was speeding up or slowing down, and what we found was no evidence of it slowing down or speeding up. So I think that's evidence in favor of the view that we should expect this to continue.
Now, I think we don't fully understand all that's happening here, and our understanding of algorithmic progress is just somewhat early. So we at Epoch are working on trying to get a better sense of where things might be headed in the future. My guess is that this is largely the result of smart people at top labs making improvements to a bunch of training techniques and sometimes publishing them and diffusing them through the field, and also the scaling of hardware and running experiments to test algorithmic innovations, and in part adapting to larger scales of hardware, where, if you scale up your training run, new techniques start to become attractive, you try them out, you figure out that some new trick works really well at this larger scale, and you start adopting it. And so if we expect that we can scale compute, which I think we can, at least in the near future and probably also the medium term, then we should expect algorithmic progress to continue giving us these exponential gains over time at a similar rate to what we've seen historically.
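To put the eight-month doubling time in the same units as the compute trend, a quick conversion, using only the rates quoted in this conversation:

```python
# Converting the quoted rates into per-year effective-compute multipliers.
algo_doubling_months = 8
algo_gain_per_year = 2 ** (12 / algo_doubling_months)   # ~2.8x per year
hardware_gain_per_year = 4.0                            # ~4x/year compute scaling

print(f"algorithms: ~{algo_gain_per_year:.1f}x effective compute per year")
print(f"combined:   ~{algo_gain_per_year * hardware_gain_per_year:.0f}x per year")

# The first number is the "equivalent of about 2 or 3x per year" from
# algorithms mentioned earlier; the second is simply the product of the two trends.
```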
Jakub Kraus | 50:23.152
And before we jump into the actual fruit of all these improvements to algorithms, or ways to scale and build bigger and bigger models trained on more and more data and computation, I want to make sure we've covered everything. Are there any other major constraints or major growth factors, helpful or harmful, that Epoch is looking out for or sees as potentially shifting the picture we've painted so far?
Tamay Besiroglu | 51:02.650
Yeah, so there's one factor that we haven't discussed, which is some technical limitations on just scaling clusters. It's possible that, for reasons to do with the latency of communication within and especially between GPUs, it might become infeasible to scale a cluster to a million or 10 million GPUs. The fact that there's this imbalance between the speed of operations and the speed at which you can communicate information between GPUs might make it really hard to achieve high utilization at very, very large clusters. The other constraint might be occasional failures. If you have a million or 10 million GPUs, even one-in-a-million failures start happening every day, and that's certainly going to be annoying to deal with and might be hard for labs to resolve. So those are some questions we've investigated, and our conclusion is that especially this latency issue might be somewhat tricky, given that memory bandwidth is quite slow. We have some work in the pipeline that suggests it's slightly difficult, with current technology, to scale beyond 10 to the 29 floating point operations. So again, roughly GPT-6 scale models with current technology.
Now, of course, by the time we have GPT-6, we will no longer be working with current technology. And given that we haven't really hit this bottleneck yet, and we haven't really started to build the technology that enables us to overcome it, my guess is that the fact that this currently looks like a bottleneck isn't very strong evidence that it will continue being a bottleneck. And I think there are a bunch of technical approaches that you could potentially use to overcome some of these issues. We will have a paper on this out fairly shortly, which I'm excited to share.
Jakub Kraus | 53:52.655
Okay, so that might stop you beyond the 10 to the 29 level. Is there anything else that might stop you from getting to 10 to the 29, other than what we've discussed, in the next six years?
Tamay Besiroglu | 54:06.764
Yeah, I mean, one thing is just getting the financing. So 10 to the 29, how many H100s would that be? A lot. I would have to work this out, but I think it's going to be a lot of money, probably in the hundreds of billions. And raising the financing is maybe not very straightforward. It's a lot of money, and as a company you need to provide a very strong case that investors are going to see good returns, or at least have a good chance of getting very favorable returns. And convincing investors of this might turn out to be a challenge. It's unclear to me. Now, maybe governments might step in and be able to invest substantially. We've seen, as you said, the U.S. government being increasingly interested in using public funding, and OpenAI reportedly being in conversations with various governments to raise funding. So it's unclear exactly how constraining this might be. My own sense is that if the next scale-up provides a pretty impressive demonstration of capabilities that ends up being useful in various industries, that will convince more people, and then you continue scaling up an order of magnitude or two, and that convinces even more people, and gradually you're able to continue this train of raising an additional round of much larger financing each time, after having convinced investors of the value of doing so with the previous scale-up.
Jakub Kraus | 56:13.701
Okay, I can see that. Now let's make a prediction. So let's say we get to 10 to the 29, about a 10,000-fold increase. And then if algorithmic progress is happening a bit more frequently than a doubling every year, maybe that could, after six years, take you to about a million-fold increase in effective compute. And this is roughly what the Center for a New American Security found in their report on predicting future AI progress. They did two estimates. One found that by 2030, if current trends continue with no serious stoppages, you could see around a 40-million-fold increase in effective compute. And then they did a more pessimistic estimate, accounting for the possibility that spending tapers off and hardware improvements hit constraints, and they found around a 3-million-fold increase by 2030. So does this figure of a million times more effective compute by 2030 sound about in the right ballpark?
Tamay Besiroglu | 57:42.781
I think it's somewhat tricky to think about effective compute, the gain from algorithmic improvement, over such a long horizon. It's unclear precisely the extent to which this stacks and compounds over time. The way I think about this is that we know the scaling of physical compute quite well, and we have these somewhat well empirically validated scaling laws for it. And so that 40,000 number seems much more solid to me. For the effective compute scaling, I think there's a lot more uncertainty about the rate at which this is happening and how much it compounds over time, such that I would have much greater error bars on the contribution from algorithmic progress.
Now, if you ask me what my median expectation is, I think, yeah, that seems probably right, maybe something like 10 or 100x additional gains from algorithms on top of the gains from compute by the end of the decade. And you can basically plug this into scaling laws and see exactly how much better those models are. Now, of course, that doesn't tell you most things you want to know, because it just tells you how good the model is at predicting tokens. You can do maybe slightly better by looking at the curves for performance on benchmarks that are more meaningful, like MMLU or BigBench. And there, if you plug that in, you will find that those models should be better than the best human at most of these benchmarks in the usual Q&A setting. And I think that's itself really important and meaningful, and potentially they'd be a lot better at many other things too.
Jakub Kraus | 59:55.708
Okay, just one clarification: their more optimistic estimate was 40 million, but you were saying maybe 40 thousand seems like a conservative estimate.
Tamay Besiroglu | 60:06.200
So the scaling to 1e29, that's 10,000x. And then I think another 10 to 1,000x from algorithmic progress, which has large error bars for the reasons I mentioned. And so that nets out to, let me see, something like 100,000 to 10 million. Those numbers all seem somewhat plausible to me. But I have a lot of uncertainty here, and so I would suggest just having wider error bars around some of these numbers.
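Putting the two contributions together as described, with the wide error bars emphasized in the answer:

```python
# ~10,000x more physical compute (scaling to ~1e29 FLOP) combined with a
# 10x-1,000x equivalent gain from algorithms, as discussed above.
physical_gain = 1e4
for algo_gain in (10, 100, 1_000):
    print(f"algorithmic gain {algo_gain:>5,}x -> "
          f"~{physical_gain * algo_gain:.0e} effective-compute multiple")

# Prints ~1e+05, ~1e+06, ~1e+07: the "100,000 to 10 million" range above.
```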
Jakub Kraus | 60:56.340
And you were just talking about the actual capabilities you get out, which are not necessarily easy to measure. So in the algorithmic progress on language models paper, you looked at perplexity on WikiText and Penn Treebank, which quantifies how well the models can predict text from places like Wikipedia and the Wall Street Journal. So this perplexity per unit of compute is improving exponentially, but does that mean that the resulting general capabilities of AI systems are growing exponentially?
Tamay Besiroglu | 61:43.350
So yeah, just to clarify, this is exponential with time rather than with compute or something. And yes, that is what we find: we see this exponential improvement in performance that is unexplained by the compute scaling component, and so it comes from other things, improving training techniques and architectures and so on. So the invention of the transformer, the development of implementations of key algorithms like Flash Attention and Flash Attention V2. Those things contribute quite a lot. Maybe about 30% of the overall gains in perplexity from 2010 to today is from improvements in algorithms, and then the remainder, 60 or 70%, is due to just the scaling of the amount of compute used in training.
Now, how does this relate to improvements on what are known as downstream performance measures? There's perplexity, predicting the next token, which is the upstream metric that's optimized during pre-training. But these models are usually evaluated on downstream benchmarks: MMLU, Big Bench, HumanEval, and a bunch of other things testing their coding, mathematics, and STEM reasoning abilities. And there it's slightly tricky because, of course, many of these performance metrics are bounded between 0 and 1, and so you get a sigmoidal or logistic curve as a function of time or the amount of compute used in training. But there's some evidence that you get roughly linear gains in downstream performance with the perplexity of your model, at least when you're not at the ceiling, when you're basically around 50% accuracy or so: you improve your algorithms or scale up your compute, and you get linear gains in accuracy, or something like that.
Now, of course, as you get to the ceiling, it can't be linear, because it has to plateau. But there was a recent paper, for instance, that showed a linear relationship between many of these downstream performance tasks and the compression ability of these models, which is closely related to the upstream perplexity measure. So the answer to your question is probably that, as we improve these algorithms over time, we get these exponential improvements on downstream tasks, which is unfortunately slightly masked by the fact that you have bounded measures of performance, and so at the top you get this plateauing effect.
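One way to picture the relationship described here, where downstream accuracy improves roughly linearly as upstream loss falls but necessarily flattens near the ceiling, is a simple logistic mapping. The curve and its parameters below are purely illustrative, not a fitted model.

```python
import math

# Illustrative (not fitted) mapping from upstream loss to a bounded downstream
# accuracy: roughly linear through the middle of the range, plateauing near
# the ceiling, which is the masking effect described above.
def downstream_accuracy(loss, midpoint=2.0, slope=4.0):
    return 1.0 / (1.0 + math.exp(slope * (loss - midpoint)))

for loss in (3.0, 2.5, 2.0, 1.5, 1.2):
    print(f"loss {loss:.1f} nats/token -> accuracy ~{downstream_accuracy(loss):.2f}")
```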
Jakub Kraus | 65:07.102
And to give the audience a more visceral sense of what these benchmarks might be measuring if they're trying to measure general performance: there's this new one that's getting popular, the Google-Proof Q&A benchmark, GPQA. It's a test suite of questions written by PhD students and graduates in biology, physics, and chemistry, and they wrote the questions to try to stump PhDs in unrelated fields, even if they get unlimited time and unlimited access to the internet to research the answers. Leading AI systems get around 50% of those questions right. Another popular test that we were talking about is MMLU, Massive Multitask Language Understanding. That benchmark has questions from a broad range of 57 different subjects, anything from US foreign policy to nutrition to high school statistics to computer security and more, and the leading AI systems get around 85% to 90% on MMLU. So by now, would you say they're probably plateauing on MMLU and we need to find other benchmarks?
Tamay Besiroglu | 66:27.286
Yeah, that's a good question. So I'm not sure. I think one thing you could do to figure this out is just take a random sample of MMLU questions and figure out what fraction of the questions have issues or ambiguities. The ceiling is roughly going to be determined by the rate at which these questions have mistakes in them, or ambiguities where the answer or the question is unclear. And I think we're no longer at the linear portion of the curve, where a doubling of compute gives you roughly the same gain in performance. We're certainly seeing this plateau. And I think this is just mechanistically what happens, but it results in some confusion among observers, where people say, well, we scaled up GPT-3 to GPT-4 and got this huge gain, and now we don't get this huge gain on MMLU anymore. Which is obviously a kind of silly argument, but unfortunately there are people who are convinced by arguments like this.
I think these benchmarks do end up saturating quite quickly. You know, MMLU is, what, like four years old or something? So this has lasted longer than many other benchmarks usually do. And I think it's probably time to move to different benchmarks as those get created. And maybe GPQA is a good benchmark. And I would be excited about seeing a lot more benchmarks that are harder, that have different styles of question, that maybe have tasks involved rather than just giving you the right single number answer or the right multiple choice answer or something like that. And have actual tasks that models have to perform over a long horizon and are graded on a bunch of parts of that task and things like that. There's a lot of room for improving benchmarking, I think.
Jakub Kraus | 68:47.900
Okay. And then you mentioned that within a few years MMLU got close to being saturated, and that this was actually good by the standards of other benchmarks, which might get saturated even more quickly. So it does start to raise this question, even though we were talking about these potential constraints like energy and how fast companies can scale, of what the ultimate limits are on these capabilities. And so you wrote a paper, called the direct approach to AI timelines, that was looking at how we can forecast when AI will be about as good as humans across the board. And my understanding of the insight there was that some papers have looked at estimating how much computation the human brain uses, and we could look at this bio anchors paper, which estimated how much computation might have happened throughout evolution or within a human's lifetime. And your paper was instead looking at how much compute you'd need to predict the next word almost perfectly, in a way that would be indistinguishable from a human. Is that about right, and how would you describe this approach to forecasting?
Tamay Besiroglu | 70:25.308
Yeah, your description was roughly right, but it's slightly different. In our paper, we try to answer the question of how much compute you need in order to produce a very long output, maybe a scientific paper, that is indistinguishable from the relevant distribution, so the distribution of scientific papers that are published or posted to arXiv. And with the scaling laws that we have, you can actually compute this directly from those estimates. The key idea is that a measure of perplexity is closely related to the question of how many tokens you need to observe to be confident that what you're looking at is the distribution that's being modeled versus the output of the language model, or the machine learning model.
Jakub Kraus | 71:29.361
So sort of how much evidence you need to see before you can tell that it's not from that original human-made distribution of content.
Tamay Besiroglu | 71:39.220
Exactly, exactly. So what we can compute is how much training compute you need in order to produce an output that is of the length of, say, a scientific paper or maybe a book, such that an observer, at the end of reading it, would not be confident that it isn't actually just a book or scientific paper. You can think about an output of a certain length such that you would be fairly confident that if an AI system could produce it in a way that's indistinguishable from, say, a human expert, then at that point the model is actually doing cognitive work in a way that can substitute for a human worker. Because if a model is able to produce a scientific paper that is indistinguishable from the corpus of scientific literature, according to experts, then at that point you might think that it's able to do science in some way. And the nice thing about this approach is that it's very simple. It requires just having these scaling laws that are estimated from training runs, and we have those estimates. The only parameters that you need are really the parameters of the scaling laws, which people have estimated before. And so you can directly back that out of these estimated scaling laws, which is the advantage of this approach.
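A sketch of the identity underneath this: the number of tokens an ideal observer needs to tell model output from real text scales like one over the per-token KL divergence, which is roughly the model's cross-entropy minus the intrinsic entropy of the text. The constants below are illustrative assumptions, not the estimates from the paper.

```python
# Rough distinguishability arithmetic behind the "direct approach".
# If the model's cross-entropy exceeds the text's intrinsic entropy by
# kl nats per token, an ideal discriminator needs on the order of
# (confidence threshold) / kl tokens to tell the two apart.

def tokens_to_distinguish(model_cross_entropy, text_entropy, threshold_nats=2.3):
    # threshold_nats ~ 2.3 corresponds to roughly 10:1 odds; illustrative only.
    kl_per_token = model_cross_entropy - text_entropy
    return threshold_nats / kl_per_token

for ce in (1.2, 1.05, 1.01, 1.001):   # hypothetical losses; true entropy = 1.0
    n = tokens_to_distinguish(ce, text_entropy=1.0)
    print(f"cross-entropy {ce:.3f} nats/token -> ~{n:,.0f} tokens to distinguish")

# As the model's loss approaches the entropy of the text itself, the horizon
# over which its output is indistinguishable (a paper, a book) grows without
# bound; scaling laws then translate a target loss into a compute estimate.
```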
Jakub Kraus | 73:30.598
And what were the conclusions of this?
Tamay Besiroglu | 73:34.636
The conclusions... I think to preface this, I should say that this work is somewhat speculative, and it makes a bunch of assumptions about what the relevant distributions are and about the nature of the process generating human text. But our conclusion was that you might need no more than, say, 1e34 or 1e35 operations to train a model that would be able to output something like a scientific paper in a way that's basically indistinguishable from the distribution of scientific papers. And this is, of course, a lot of compute. It's nine orders of magnitude more than we currently use to train frontier models.
Jakub Kraus | 74:30.248
Is that indexed to today's effective compute?
Tamay Besiroglu | 74:35.776
Right. Yeah, it's indexed to today's effective compute, so that's nine orders of magnitude of effective compute, and we can get there by a combination of algorithmic innovations and scaling up our physical hardware. But this approach estimates an upper bound. So it is likely that we end up producing models that are able to produce scientific work in a way other than purely doing this kind of autoregressive training. We might use RL or something like that. The way that we built chess engines that are superhuman wasn't by purely imitating grandmasters; we figured out some way of using RL to get there. And so it's possible that we end up coming up with new techniques that shave off some orders of magnitude of compute.
And there's, of course, a lot of uncertainty about this bound stemming from the uncertainty over what the scaling exponents are for the scaling laws. As we talked about before, unfortunately, these are not always estimated in a super reliable way. There's uncertainty about whether these papers are doing a good job, but also the standard errors of those exponents tend to be non-trivial. And so that injects a lot of uncertainty. But overall, my guess is that there's some amount of compute that will result in a model that is able to automate most of the things that humans are able to do, and this amount of compute is probably not more than 10 orders of magnitude more than we are currently using.
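To connect this upper bound back to the trend numbers from earlier in the conversation: at roughly 4x per year from hardware and roughly 2-3x per year from algorithms, nine to ten orders of magnitude of effective compute corresponds to something on the order of a decade of scaling, if those rates can be sustained. A rough sketch:

```python
import math

# How long 9-10 orders of magnitude of effective compute takes at the rates
# quoted earlier (~4x/year hardware scaling, ~2.8x/year from algorithms).
combined_per_year = 4.0 * 2.8
for gap_oom in (9, 10):
    years = gap_oom * math.log(10) / math.log(combined_per_year)
    print(f"{gap_oom} orders of magnitude -> ~{years:.0f} years at current rates")
```

Whether those rates can actually be sustained is, of course, exactly what the rest of the conversation is about.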
Jakub Kraus | 76:34.391
Fascinating. So this is an estimate of when AIs will reach human-level performance, and the economic impacts of that would be quite transformative, as people sometimes put it. You looked particularly at whether economic growth could speed up very rapidly. From your understanding of that literature, what might be the effects, not necessarily of building this nine orders of magnitude jump in effective compute, but as we get closer to it, how will the economy respond to improvements in AI capabilities?
Tamay Besiroglu | 77:25.204
Yeah. So I think an important part of what will happen is that there's going to be a lot of investment as it becomes clearer that we're able to automate a large fraction of human labor with AI. That is going to fuel a lot of investment in building the next generation of AI systems, given the tremendous value from automation. Just replacing human workers and saving a wage bill is very valuable, but so is speeding up the rate of economic growth and increasing the size of your economy.
So, in the limit, if you're able to automate all labor, then economic models say that this is likely to result in accelerating economic growth. If you have automated all labor, including researchers, if you have automated R&D, then this results in hyperbolic growth, which is faster than exponential. And so this produces a lot of economic value, and the economy is going to be much, much larger.
Getting to this world in which people have a lot more wealth is very valuable, and investing a lot in making sure it happens soon is worth a lot today. If you are able to get a 10 or 100x increase in your wealth in 10 years, and you can make that happen sooner by giving up some of your wealth today to invest in AI, that ends up being a favorable trade. And so this results in a lot of investment in AI.
The other thing is that it's possible to reallocate investment from conventional types of capital, so buildings, machines, and so on, into compute-related capital, and doing that is also really valuable. Maybe 20% or so of US GDP is spent on building up the conventional stock of capital, and it's possible that a large chunk of that ends up being spent on building up compute-related capital: fabs, lithography machines, and various other things. That is, I think, an important part of this trajectory: early on, a lot gets spent.
Now, once you get close to automating half of the tasks in the economy, economic growth might be much higher. I've written before about how this might increase economic growth by about 10x, and potentially somewhat more than that. Once you have automated about half of your economy and your economy is much larger, that results in a lot more resources available for additional scaling of computation: training larger systems and building more compute to run more models. And you might continue accelerating until you've fully automated almost all tasks that humans are doing.
At that point, I think it's very unclear exactly what happens to the rate of economic growth. It depends on which bottlenecks are hit and when they end up being encountered. There might be bottlenecks to do with natural resources, or the amount of energy we can dissipate into space, and we might have to go beyond Earth to use more resources and have an easier time doing more computation. Some of those constraints might end up limiting the amount of economic growth.
Historically, we've seen accelerations of economic growth. From the hunter-gatherer era to the farming era to the industrial era, we've seen about an order of magnitude of speed-up between each of these eras, so 10x. Maybe the next era will be another 10x; maybe it'll be 100x or something. I know there are some people who think it might be much faster than that, and there are certainly a bunch of analogs from biology that suggest growth could, in theory, be much faster in some circumstances. Unfortunately, I don't think we have a very good idea of how fast this might be, and it's just a very hard question to figure out.
Jakub Kraus | 82:38.144
Yeah, there was an analysis from Tom Davidson that tried to estimate roughly how quickly you would go from AI that can do 10% of all cognitive economic tasks, the ones that use brainpower and could be done remotely, to AI doing 90% or 100%. He also looked mainly at the ones that could contribute to AI progress, the AI-related R&D kinds of tasks. And I believe he found it could be anywhere from months to several years to make this leap from AI systems somewhat close to human level to AI systems that are at human level or beyond on just about everything that doesn't require physical labor. Is that something you see as plausible? Because that's a very rapid transition in terms of the societal impacts, I would expect.
Tamay Besiroglu | 83:46.464
Yeah, I think that analysis looks right to me. A very quick way of seeing this would be to suppose it's right that it requires only maybe 10 to the 32 or 10 to the 33 operations of compute to get to full automation. If the starting point is the point at which we have maybe 20% automation from AI, which in some sense we're probably not yet at, that means the gap in the amount of compute between full automation and 20% automation is maybe six or seven orders of magnitude. And we could scale at a rate of maybe 10x a year. So that suggests it's not going to be more than a decade. It could potentially be much faster, especially if you have accelerated economic growth, which seems fairly likely to me: as soon as you get substantial AI automation, you have much faster growth of many of these resources and much faster growth of technology, and that shrinks these times by a factor of two or more. So something on the order of a few years seems fairly plausible to me.
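A minimal sketch of that gap arithmetic, assuming a six-to-seven order-of-magnitude effective-compute gap, roughly 10x-per-year scaling, and the illustrative two-fold speed-up mentioned above:

```python
import math

# Approximate numbers from the discussion above:
gaps_in_ooms = (6, 7)    # effective-compute gap from ~20% automation to full automation
baseline_growth = 10.0   # ~10x effective compute per year
speedup = 2.0            # assumed acceleration once substantial AI automation kicks in

for gap in gaps_in_ooms:
    baseline_years = gap / math.log10(baseline_growth)
    accelerated_years = baseline_years / speedup
    print(f"{gap} OOM gap: ~{baseline_years:.0f} years at 10x/year, "
          f"~{accelerated_years:.1f} years if growth roughly doubles")
```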
Maybe the months scenario seems plausible, but I don't think it's my median outcome that it'll be on the order of months between 20% and 90% automation. I think it'll take on the order of years, just because many of these bottlenecks are important and difficult to deal with, and there are timescales associated with expanding the production of compute, building fabs, and so on.
Jakub Kraus | 86:00.163
Okay, that makes sense, especially if we're trying to scale up hardware more; you were already alluding to all these constraints around energy and other factors just at the 10 to the 29 level. The one reason that gives me pause is that perhaps there are some less physically constrained ways to improve AI, and the main one that stands out is algorithmic progress. If you have a system that can actually outperform humans at coming up with better algorithms, that seems really salient to me.
Tamay Besiroglu | 86:40.394
Sure, I think that's right. There's this notion of a software-only singularity, where at a fixed level of compute, you have an AI system that does AI R&D: it improves training techniques and architectures, runs experiments, and figures out how to build the next generation of AI systems. Then there's a question of whether that results in an accelerating loop, where a doubling of the input to this R&D process results in a greater-than-doubling of gains in efficiency, so that it's able to produce even better AI systems the next time around, which accelerates the production of the next generation of models.
There is a standard way of thinking about this problem from economics, in terms of the returns to research effort. If the returns to research effort are greater than one, such that a proportional increase in the inputs results in a greater-than-proportional increase in the efficiency of the AI systems you produce, then you get this kind of explosion, where you have increasingly short lead times between generations of models, or increasingly powerful AI systems that can accelerate R&D in the future even more.
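One way to make this concrete is a toy law of motion for software efficiency, loosely in the spirit of semi-endogenous growth models; the functional form, the parameter values, and the assumption that research input scales with the software level are all illustrative, not Epoch's model:

```python
# Toy model: dA/dt = k * R**lam * A**phi, with R = A (fully automated software R&D
# at fixed hardware). Define returns to research effort r = lam / (1 - phi).
# If r > 1, the growth rate of A keeps rising (an accelerating loop);
# if r < 1, the growth rate keeps falling (progress peters out).

def sampled_growth_rates(lam, phi, k=0.05, a0=1.0, steps=4000, dt=0.01):
    a, rates = a0, []
    for i in range(steps):
        da = k * (a ** lam) * (a ** phi)    # research input R is set equal to A
        if i % 1000 == 0:
            rates.append(round(da / a, 4))  # instantaneous proportional growth rate
        a += da * dt
    return rates

print("r = 1.4 (lam=0.7, phi=0.5):", sampled_growth_rates(0.7, 0.5))  # rates increase
print("r = 0.5 (lam=0.4, phi=0.2):", sampled_growth_rates(0.4, 0.2))  # rates decrease
```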
We have done some work on estimating similar kinds of returns for software progress, for software projects. If you double the number of researchers on a software project, like improving a chess engine, do you get a doubling of efficiency, a greater-than-doubling, or a less-than-doubling? What we find is that sometimes you get a greater-than-doubling of the efficiency of your system. In the AI case, if you get a greater-than-doubling, you can then have a more-than-twice-as-efficient AI system in the future, improve R&D further, and get this acceleration. So sometimes, for software projects, the returns to R&D are greater than one. But in some cases, perhaps most cases, this doesn't quite happen. And so that might be some evidence against the idea that just automating software R&D is sufficient to cross this finish line of building AI systems that are able to do everything. So that's one part.
The other part is that, historically, rates of progress in software and progress in hardware have been quite tightly linked; the rates of progress are surprisingly similar over very long horizons. My guess is that they are tightly linked in the sense that it's hard to get fast progress in software without also having fast progress in hardware, or fast scale-ups of hardware, because it might be useful to run experiments on increasing hardware budgets to figure out your next software innovation. If that's correct, then you can't just improve software without also scaling up your hardware and expect progress to continue at the pace it has historically. If you don't scale up hardware in tandem with improvements in software, then you might run out of new things you can try with software, and progress in software might end up stalling. That suggests software-only singularities of this type are maybe not super likely.
Jakub Kraus | 91:24.326
Okay, this sounds compelling. I'm just still trying to square it with my intuitive picture of what happens if you have a human-level digital engineer, an AI system that can totally automate the labor of finding better AI algorithms. To put some concrete numbers on it, and maybe there's a smoother transition to this, but let's say you all of a sudden hit this human-level tipping point, where before you couldn't get super big speed-ups from letting the AI system run on its own, and all of a sudden you can. One really crude napkin-math estimate found that if you scale up to GPT-6 or 8, or some point in the future where you might hit this, you might have enough compute just from training the model to then run a billion copies of it. So suddenly you've gone from having very few independent AI engineers to having an entire global population of them, all in your lab.
And you did push back on this in another way in your dissertation, where you wrote about “Are models getting harder to find,” and you find that over time, ML research requires more inputs to get the same outputs. So productivity is declining, but from my math, you would need maybe 60 years before that rate of decline, which was 4% to 26% per year, actually makes models so hard to find that it cancels out some huge improvement in OpenAI's number of engineers. So what's missing from my picture of that?
Tamay Besiroglu | 93:32.524
Yeah, I think the key thing is that a one-time increase from unlocking this ability to deploy a bunch of AI engineers will have an effect, but not a huge effect. It'll be muted by the fact that it's hard to parallelize R&D. We have estimates of this effect: the rate at which you make progress on an R&D problem, in software or in TFP in the broader economy, scales roughly with the cube root of the number of researchers. So increasing the number of researchers by a factor of a thousand only gives you about a 10x increase in the instantaneous amount of progress you're making.
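A quick numerical illustration of that parallelization penalty, taking the cube-root exponent quoted above as given:

```python
# Instantaneous R&D progress scaling roughly with the cube root of researcher count.
def progress_multiplier(researcher_multiplier: float, exponent: float = 1 / 3) -> float:
    return researcher_multiplier ** exponent

for mult in (10, 1_000, 1_000_000_000):  # e.g. up to a billion AI engineers
    print(f"{mult:>13,}x researchers -> ~{progress_multiplier(mult):,.0f}x faster progress")
# 1,000x researchers -> ~10x progress; a billion-fold increase -> only ~1,000x
```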
So the thing that really matters is not this one-time increase, but whether the path you're on is one that's accelerating or decelerating. That is important because if it accelerates, then very quickly you'll end up seeing extremely rapid rates of progress. And my claim is that it is tricky for this to be accelerating given the co-dependence on hardware: you also need to accelerate the amount of hardware you're using. So that's one important consideration, and this parallelization penalty mutes the effect you get from having a billion AI engineers.
The other thing that mutes the effect is just limited access to hardware. You're going to be competing with a billion other engineers for the same compute, and that's going to make it hard to do AI research.
So I think the key question here is that, with R&D, there are two effects. There's the standing-on-the-shoulders effect: if you make technological progress, that makes it easier to come up with further technological progress, because you have better tools, better insights, and a deeper understanding of related problems. But there's also the stepping-on-the-toes effect: if you scale up the number of researchers, you are duplicating efforts, and that reduces the overall effect of scaling, resulting in less-than-proportional gains for a proportional increase in the amount of investment or the number of researchers. The question that matters for the overall dynamics of the system is whether these two effects are such that the returns to research effort are greater than one or less than one. There, I think, our evidence suggests it's maybe greater than one, but it's likely going to depend on hardware improvements happening in tandem.
Jakub Kraus | 97:11.758
Okay. So this is all looking into the distant future of AI. As we wrap up, I wanted to look at the shorter term. How do you see upcoming or existing AI capabilities contributing to forecasting work? There was one paper recently that used language models in a couple of different components of a larger automated forecasting pipeline, and found that the bigger system could forecast a lot of real-world events almost as well as aggregated predictions from competitive human forecasters. So how do you expect AI capabilities to contribute to forecasting, especially for things like your current work on forecasting AI progress or AI's economic impacts?
Tamay Besiroglu | 98:11.977
Yeah, so internally at Epoch we are using AI to help with the work we're doing: parsing papers and extracting details from them that we then turn into data sets, which we use to project and extrapolate where things are going. That seems helpful for forecasting-related research work.
In economics, I know people have used LLMs to go through a huge database of jobs in the US economy and their specific tasks, have an LLM rate how automatable different tasks are, and then come up with scores for how exposed and how likely to be automated specific occupations are. I think these applications are pretty great, and AI will continue making more contributions to that kind of work. I'm pretty excited about that.
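A hypothetical sketch of what that task-exposure pipeline might look like; the occupations, tasks, and the `llm_rate_automatability` placeholder are all made up for illustration, and a real version would call an actual LLM API and a real occupational task database such as O*NET:

```python
# Hypothetical sketch: rate each task's automatability with an LLM, then aggregate
# task scores into an occupation-level exposure score.
from statistics import mean

def llm_rate_automatability(task_description: str) -> float:
    """Placeholder for an LLM call that returns a 0-1 automatability score."""
    return 0.5  # stubbed so the sketch runs end to end

occupations = {  # made-up examples, not real survey data
    "Technical writer": ["Draft product documentation", "Interview engineers"],
    "Electrician": ["Install wiring in new buildings", "Diagnose electrical faults"],
}

exposure_scores = {
    occupation: mean(llm_rate_automatability(task) for task in tasks)
    for occupation, tasks in occupations.items()
}
print(exposure_scores)  # occupation-level exposure derived from task-level ratings
```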
For forecasting itself, like judgmental forecasting or statistical forecasting, one question that seems to bottleneck the effort is just coming up with the right questions to ask. What is the thing we should forecast? What is the question such that, if we had good forecasts for it, it would resolve a bunch of our uncertainty? That is hard, and I don't currently see GPT-4 being able to do a good job at it. It's really about having good research intuitions, and the best systems don't yet have those intuitions at the level of top researchers. So it doesn't quite help with that yet, and I think that's the part of the process that is most important right now, or most bottlenecked.
It is maybe able to do a good job at predicting outcomes for typical questions on these forecasting platforms. So maybe it's able to do a good job once you formulate the question: you give it to the AI, and there's some pipeline for it to do research and figure things out. But of course, the corpus of such questions is quite small, so you might as well just give them to researchers who you expect to be very well calibrated about these types of things. So I expect we will probably not use LLMs for any kind of internal forecasting; we have pretty good forecasters and good research analysts at Epoch. I imagine hedge funds or other organizations that would benefit from forecasts would probably continue relying on internal researchers for these types of things. But maybe in some other domains, where you might be constrained by access to good forecasts, this might be useful. I just don't see GPT-4 contributing a whole lot today.
Jakub Kraus | 101:47.113
Got it. And before we close, were there any last points you wanted to bring up, or anything you wish I had asked about?
Tamay Besiroglu | 101:57.817
No, I think we covered a lot of ground here. And I thought your questions were really great.
Jakub Kraus | 102:03.875
Thank you. Where can the audience go if they want to learn more about your stuff?
Tamay Besiroglu | 102:09.680
Sure. There's Epoch's website, epochai.org; my website, tamaybesiroglu.com; and I'm on Twitter as TamayBes. Those are the places for people to find me.
Jakub Kraus | 102:26.020
Great. Tamay, thank you so much for coming on the show.
Tamay Besiroglu | 102:31.335
Yeah, great to be here.
Jakub Kraus | 102:36.457
Thanks for listening to the show. You can check out the Center for AI Policy podcast Substack for a transcript and relevant links. If you have any feedback, you can send me an email at jakub at AI policy dot us. And looking ahead, next episode will feature Kelsey Piper from Vox discussing OpenAI's recent NDA incident. I hope to see you there.