Interconnects

Nathan Lambert

Science Technology

Latest episode

Available Episodes

5 of 120

Thoughts on The Curve
I spent the weekend debating AI timelines, among other things, at The Curve conference. This translates as spending the weekend thinking about the trajectory of AI progress with a mix of DC and SF types. This is a worthwhile event that served as a great, high-bandwidth way to check in on timelines and expectations of the AI industry.Updating timelinesMy most striking takeaway is that the AI 2027 sequence of events, from AI models automating research engineers to later automating AI research, and potentially a singularity if your reasoning is so inclined, is becoming a standard by which many debates on AI progress operate under and tinker with. It’s good that many people are taking the long term seriously, but there’s a risk in so many people assuming a certain sequence of events is a sure thing and only debating the timeframe by which they arrive.I’ve documented my views on the near term of AI progress and not much has changed, but through repetition I’m developing a more refined version of the arguments. I add this depth to my takes in this post.I think automating the “AI Research Engineer (RE)” is doable in the 3-7 year range — meaning the person that takes a research idea, implements it, and compares it against existing baselines is entirely an AI that the “scientists” will interface with.In some areas the RE is arguably already automated. Within 2 years a lot of academic AI research engineering will be automated with the top end of tools — I’m not sure academics will have access to these top end of tools but that is a separate question. An example I would give is coming up with a new optimizer and testing it on a series of ML baselines from 100M to 10B parameters. At this time I don’t expect the models to be able to implement the newest problems the frontier labs are facing alone. I also expect academics to be fully priced out from these tools.Within 1-3 years we’ll have tools that make existing REs unbelievably productive (80-90% automated), but there are still meaningful technical bottlenecks that are solvable but expensive. The compute increase per available user has a ceiling too. Labs will be spending $200k+ per year per employee on AI tools easily (ie the inference cost), but most consumers will be at tiers of $20k or less due to compute scarcity.Within 3-4 years the augmented research engineers will be able to test any idea that the scientists come up with at the frontier labs, but many complex system problems will need some (maybe minimal) amount of human oversight. Examples would include modifying RL implementations for extremely long horizon tasks or wacky new ideas on continual learning. This is so far out that the type of research idea almost isn’t worth speculating on.These long timelines are strongly based on the fact that the category of research engineering is too broad. Some parts of the RE job will be fully automated next year, and more the next. To check the box of automation the entire role needs to be replaced. What is more likely over the next few years, each engineer is doing way more work and the job description evolves substantially. I make this callout on full automation because it is required for the distribution of outcomes that look like a singularity due to the need to remove the human bottleneck for an ever accelerating pace of progress. This is a point to reinforce that I am currently confident in a singularity not happening.Up-skilling employees as their roles become irrelevant creates a very different dynamic. The sustained progress on code performance over the next few years will create a constant feeling of change across the technology industry. The range of performance in software is very high and it is possible to perceive relatively small incremental improvements.These are very complex positions to hold, so they’re not that useful as rhetorical devices. Code is on track to being solved, but the compute limits and ever increasing complexity of codebases and projects (ie. LLMs) is going to make the dynamic very different than the succinct assumptions of AI 2027.To reiterate, the most important part of automation in the discussion is often neglected. To automate someone you need to outcompete the pairing of a human with the tool too.Onto the even trickier argument in the AI 2027 standard — automating AI research altogether. At the same time as the first examples of AI systems writing accepted papers at notable AI venues, I’m going to be here arguing that full automation of AI research isn’t coming anytime soon. It’s daunting to try and hold (and explain) this position, and it relies on all the messy firsthand knowledge of science that I have and how it is different in academia versus frontier AI labs.For one, the level and type of execution at frontier labs relative to academic research is extremely different. Academia also has a dramatically higher variance in quality of work that is accepted within the community. For this reason, we’re going to be seeing incredible disruption at standard academic venues in the very near future, but the nature of science at frontier labs will remain heavily intertwined with human personalities.Models will be good at some types of science, such as taking two existing fields and merging ideas and seeing what happens, but awful at what I consider to be the most idolized version of science, being immersed in the state of the art and having a brilliant insight that makes anywhere from a ripple causing small performance gain to a tsunami reshaping the field.I don’t think AI will fully automate our current notion of an AI researcher in the next 5-10 years, but it could reshape what science means altogether and make that role far less relevant to progress. The researchers grinding out new datasets at frontier labs will have dramatic help on data processing scripts. The researchers coming up with new algorithmic ideas will not expand the rate at which they come up with ideas too much, but their ability to test them is far higher.A large part of science is a social marketplace of ideas. Convincing your colleagues that you are right and to help you double down on it is not going to change in its core nature. Everyone will have superpowers on making evidence to support their claims, but the relative power there stays the same.At a dinner during The Curve I went through a lot of these points with Ryan Greenblatt, Chief Scientist at Redwood Research, and a point he made stuck with me. He summarized my points as thinking the increase in performance from these largely engineering tooling improvements will be equalled out by challenges of scaling compute, so the resulting progress will feel much more linear rather than exponential. A lot of our discussions on automation we agree on, with slightly different timelines, but it didn’t feel like it captured my entire point of view.What is missing is that I expect an inherent slowdown as our AI models get more complicated. Our models today needs tools, more complex serving systems, products to wrap them, and so on. This is very different than the age when just model weights were needed for the cutting edge of AI. There’s an inevitable curse of complexity, a death by a thousand cuts, that is going to add on top of the obvious compute costs to slow down progress.2026 will be a big year on the compute rollout front, and shipping meaningful improvements to users will be essential to funding the progress that comes after. I’m not sure the economy can keep shifting even more of its weight behind AI progress, where most people bought into fast timelines think of it as a default position. Peter Wildeford wrote a summary of the situation that I resonate with:Here’s how I think the AI buildout will go down.Currently the world doesn’t have any operational 1GW+ data centers. However, it is very likely we will see fully operational 1GW data centers before mid-2026. This likely will be a part of 45-60GW of total compute across Meta, Microsoft, Amazon/AWS/Anthropic, OpenAI/Oracle, Google/DeepMind, and xAI.My median expectation is these largest ~1GW data center facilities will hold ~400,000-500,000 Nvidia Blackwell chips and be used to train ~4e27 FLOP model sometime before the end of 2027. Such a model would be 10x larger than the largest model today and 100x larger than GPT-4. Each individual 1GW facility would cost ~$40B to manufacture, with ~$350B total industry spend across 2026.He continues with estimates for 2028, and saying he’s fuzzy on 2029, but my fuzziness cuts in a bit earlier depending on adoption and performance across the AI industry.Where I feel like in the long run it’ll look like a very consistent pace of progress, that feels like a bunch of big jumps and periods of stagnation in the short-term. I have fairly large error bars on how the price of intelligence — and therefore adoption — is going to evolve over the next 2-4 years, with it obviously becoming far cheaper over the following decades.As for my recent articles on timelines and key debates in the field, I encourage people to comment and dig in on what I wrote below.Interconnects is a reader-supported publication. Consider becoming a subscriber.Other thoughtsSomething crazy about this conference is no one is talking about how the models actually work or are trained, and everyone here is totally convinced that AGI is coming soon.One of my new friends at the conference described this tendency as “an obsession with the problem.” This is a feeling that many AI obsessors are more interested in where the technology is going rather than how or what exactly it is going to be. Helen Toner gave a great talk at The Curve related to this, arguing how the current and future jaggedness of AI — the fact that similarly difficult tasks when assigned to a human will either be easily mastered by AI or barely showing any competence (her will appear later on her great Substack). It is the idea that AI capabilities evolve highly randomly across potentially similar tasks.This original figure on jaggedness comes from work with the popular AI Substacker Ethan Mollick.The relation of Helen’s talk is that she gets many forms of arguments that only the endpoint of AI matters, but that doesn’t account for the messiness of the trajectory and how unsettling that could be for the world.I agree with Helen.One of the things that I am confident will exist in about two years is a Sora 2 style model that can run on a MacBook without copyright, personal opt-in, or other safety filters. On this, Epoch AI has a wonderful plot showing that local models lag behind in capabilities by a fixed amount of time:With trends like this, it is so obvious that we need to stay on the front foot of open models and not reacting to international parties that are far harder to predict and engage with. This is where I get renewed motivation for American DeepSeek / The ATOM Project. For example, I still get many adamant questions that we should consider banning open models altogether. The state of discourse, study, investment, and everything in between on open models in the U.S. is still in a quite underdeveloped state.China’s rise in open models was something I expected to be a bigger topic at the conference, but it seemed like it was too orthogonal to the overall pace of progress to be front of mind. There were many discussions of the Chinese chip ecosystem, but less on what it enables. Not focusing on this could have costly geopolitical consequences as we cede ownership of a global standard to China. This was a large theme of my talk. The recording will be posted here soon and the slides for my talk are here (credit for Florian Brand who helps me with open model analysis here for feedback on the slides). Otherwise:* These messages are very important and I will work to spend a bit more time engaging with the communities they touch and mastering this type of talk (and analysis)* More people should work in the area, it’s crazy it has just fallen on me where it is my side hustle.For now, I’m just landing at the conference on language modeling (COLM) in Montreal, so I may have some technical hot takes to share later this week! This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe
--------
11:58
--------
11:58
ChatGPT: The Agentic App
Ever since ChatGPT exploded in popularity, there has been a looming “how” to its monetization plans. Much has been said about shopping and advertising as the likely paths, especially with Fidji Simo joining as CEO of Applications under Sam Altman. Advertising as a business model for AI is logical but difficult to personalize and specialize. We know tons of people spend a lot of time using AI models, but how do you best get the sponsored content into the outputs? This is an open technical problem, with early efforts from the likes of Perplexity falling short.Shopping is another, but the questions have long been whether AI models actually have the precision to find the items you want, to learn exactly what you love, and to navigate the web to handle all the corner cases of checkouts. These reflect a need for increased capabilities on known AI benchmarks, rather than inventing a new way of serving ads. OpenAI’s o3 model was a major step up in search functionality, showing it was viable; the integration was either a business problem — where OpenAI had to make deals — or an AI one — where ChatGPT wasn’t good enough at managing websites for you.Yesterday, ChatGPT launched its first integrated shopping push with Buy It in ChatGPT, a simple checkout experience, and an integrated commerce backend built on the Agentic Commerce Protocol (ACP). The announcement comes with the perfect partners to complement the strengths of OpenAI’s current models. GPT-5-Thinking is the best at finding niche content on the web, and ChatGPT’s launch partner for shopping is Shopify (*soon, Etsy is available today), the home to the long tail of e-commerce merchants of niche specialties. If this works, it will let users actively uncover exactly what they are looking for — from places that were often hard to impossible to find on Google. This synergy is a theme we’ll see reoccur in other agents of the future. The perfect model doesn’t make a useful application unless it has the information or sandbox it needs to think, search, and act. The crucial piece that is changing is that where models act is just as important as the weights themselves — in the case of shopping, it is the network of stores with their own rankings and API.The ACP was built in collaboration with Stripe, and both companies stand to benefit from this. Stripe wants more companies to build on the ACP so that its tools become the “open standard for agentic payments” and OpenAI wants the long-tail of stores to adopt it so they can add them to their ever-growing internal recommendation (or search) engine. The business model is simple, as OpenAI says “Merchants pay a small fee on completed purchases.” OpenAI likely takes a larger share than Stripe, and it is a share that can grow as their leverage increases over shoppers.I’m cautiously optimistic about this. Finding great stuff to buy on the web is as hard as it has ever been. Users are faced with the gamification of Google search for shopping and the enshittification of the physical goods crowding out Amazon. Many of the best items to buy are found through services like Meta’s targeted ads, but the cost of getting what you want should not be borne through forced distraction.OpenAI will not be immune to the forces that drove these companies to imperfect offerings, but they’ll come at them with a fresh perspective on recurring issues in technology. If this works for OpenAI, they have no competitor. They have a distribution network of nearly 1B weekly users and no peer company ready to serve agentic models at this scale. Yes, Google can change its search feed, but the thoroughness of models like GPT-5 Thinking is on a totally different level than Google search. This agentic model is set up to make ChatGPT the one Agentic App across all domains.The idea of an agentic model, and really the GPT-5 router itself, shows us how the grand idea of one giant model that’s the best for every conceivable use-case is crumbling. OpenAI only chooses the more expensive thinking model when it deems a free user to need it and they have an entirely different model for their coding products. On the other hand, Claude released their latest model, Claude 4.5 Sonnet, yesterday as well, optimizing their coding peak performance and speed yet again — they have no extended model family. The reality that different models serve very different use-cases and how AI companies need to decide and commit to a certain subset of them for their development points to a future with a variety of model providers. Where coding is where you can feel the frontier of AI’s raw intelligence or capabilities, and Anthropic has turned their entire development towards it, the type of model that is needed for monetization of a general consumer market could be very different. This is the web-agent that OpenAI has had the industry-leading version of for about 6 months. Specialization is making the AI market far more interesting, as companies like OpenAI and Google have been in lockstep with their offerings for years. Every company would drop the same model modalities with approximately the same capabilities. Now, as hill-climbing benchmarks are no longer providing immediate user value, especially in text domains, the vision for each AI company is more nuanced. I predicted this earlier in the summer, in my post on what comes next:This is a different path for the industry and will take a different form of messaging than we’re used to. More releases are going to look like Anthropic’s Claude 4, where the benchmark gains are minor and the real world gains are a big step.What I missed is that this applies downward pressure on the number of models labs will release — the value can be more in the integrations and applications than the model itself. Expect releases like today, where Claude released Claude Sonnet 4.5 along with version 2 of Claude Code. The period will still be busy as the industry is on the tail end of the low hanging fruit provided by reasoning models, but over time the hype of model releases themselves will be harder to conjure.Interconnects is a reader-supported publication. Consider becoming a subscriber.Let’s consider the applications that are rolling out today on top of different models. If you haven’t pushed the limits of GPT-5-Thinking, and better yet GPT-5-Pro, for search you really need to, it’s a transformative way of using compute that can find many buried corners of the web. In terms of untapped model capability value, the abilities of search-heavy thinking models like GPT-5 seem far higher than coding agents, which are obviously heavily used. Search-heavy models are an entirely new use, where coding models were the first widespread LLM-based product. As coding agents become more autonomous, they’ll continue to flex and mold a new form for the software industry, but this will be a slow co-evolution. OpenAI is going to focus on its vertical Agentic App where Anthropic (and likely Gemini with Google Cloud) are going to power the long-tail of AI applications reshaping the web and the rest of work. OpenAI will only expand from here. Email, scheduling, travel bookings, and more everyday digital tasks are surely on their roadmap. Their biggest competitor is themselves — and whether their vision can be crafted into something people actually use. If shopping doesn’t work out as the vertical that lets them realize their valuation, they’re positioned to keep trying more. OpenAI has both the lead in the variety of models that power these agentic information tasks and the user base to incentivize companies to collaborate with them.The application paradigm that dominated the mobile era is going to rebound. AI applications started in a form where the user needed to be heavily involved in the work process. The first beneficiaries of this were IDEs and terminal tools. Both of these workplaces allow in-depth and detailed inspection of the process and results. The cutting edge of AI will still work there, but the long tail of casual use will all shift to the standard mode of applications — siloed, simple, and scalable in the cloud. The simpler an AI application is, the wider its potential audience.With this addition of shopping, OpenAI is poised to launch a standalone TikTok-style app with the release of its next video generation model, Sora 2, soon after Meta launched Vibes in their Meta AI app for only AI generated videos with a specific theme to start. At the same time, OpenAI’s Codex web agent is available in the ChatGPT application, which represents an even bigger change in the nature of software work than the addition of coding agents — it allows real websites, and soon businesses, to be built with only a prompt on your phone. In 6-12 months, these agentic applications that feel rough around the edges due to the quality of the AI today, rather than the interface, are going to feel seamless and second-nature to use, despite their complete novelty relative to the past decades of technology.If OpenAI is positioning itself to be The Agentic App, this also opens the door to the near future where many applications we use today shift to an agentic era. Want to schedule a meeting with someone? Let the Google Calendar agent handle that (or some startup that beats them to it). Your email application can find who the next client is and remind them of their appointment. The Banking App will file your taxes in one prompt. The list of these is infinite and across a wide spectrum of difficulty. OpenAI wants to be the one app, The Agentic App, that serves all of these, and the rest of the industry is racing to master their specific vertical before OpenAI gets there. This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe
--------
9:24
--------
9:24
Thinking, Searching, and Acting
The weaknesses of today’s best models are far from those of the original ChatGPT — we see they lack speed, we fear superhuman persuasion, and we aspire for our models to be more autonomous. These models are all reasoning models that have long surpassed the original weaknesses of ChatGPT-era language models, hallucinations, total lack of recent information, complete capitulations, and other hiccups that looked like minor forms of delusion laid on top of an obviously spectacular new technology.Reasoning models today are far more complex than the original chatbots that consisted of standalone model weights (and other lightweight scaffolding such as safety filters). They're built on three primitives that'll be around for years to come:* Thinking: The reasoning traces that enabled inference-time scaling. The "thoughts" of a reasoning model take a very different form than those of humans that inspired the terminology used like Chain of Thought (CoT) or Thinking models.* Searching: The ability to request more, specific information from non-parametric knowledge stores designed specifically for the model. This fills the void set by how model weights are static but living in a dynamic world.* Acting: The ability for models to manipulate the physical or digital world. Everything from code-execution now to real robotics in the future allow language models to contact reality and overcome their nondeterministic core. Most of these executable environments are going to build on top of infrastructure for coding agents.These reasoning language models, as a form of technology are going to last far longer than the static model weights that predated and birthed ChatGPT. Sitting just over a year out from the release of OpenAI's o1-preview on September 12, 2024, the magnitude of this is important to write in ink. Early reasoning models with astounding evaluation scores were greeted with resounding criticism of “they won’t generalize,” but that has turned out to be resoundingly false.In fact, with OpenAI's o3, it only took 3-6 months for these primitives to converge! Still, it took the AI industry more broadly a longer time to converge on this. The most similar follow-up on the search front was xAI's Grok 4 and some frontier models such as Claude 4 express their reasoning model nature in a far more nuanced manner. OpenAI's o3 (and GPT-5 Thinking, a.k.a. Research Goblin) and xAI's Grok 4 models seem like a dog determined to chase their goal indefinitely and burn substantial compute along the way. Claude 4 has a much softer touch, resulting in a model that is a bit less adept at search, but almost always returns a faster answer. The long-reasoning traces and tool use can be crafted to fit different profiles, giving us a spectrum of reasoning models.The taxonomy that I laid out this summer for next-generation reasoning models — skills for reasoning intelligence, calibration to not overthink, strategy to choose the right solutions, and abstraction to break them down — are the traits that'll make a model most functional given this new perspective and agentic world.The manner of these changes are easy to miss. For one, consider hallucinations, which are an obvious weakness downstream of the stochastic inference innate to the models and their fixed date cutoff. With search, hallucinations are now missing context rather than blatantly incorrect content. Language models are nearly-perfect at copying content and similarly solid at referencing it, but they're still very flawed at long-context understanding. Hallucinations still matter, but it’s a very different chapter of the story and will be studied differently depending on if it is for reasoning or non-reasoning language models.Non-reasoning models still have a crucial part to play in the AI economy due to their efficiency and simplicity. They are part of a reasoning model in a way because you can always use the weights without tools and they'll be used extensively to undergird the digital economy. At the same time, the frontier AI models (and systems) of the coming years will all be reasoning models as presented above — thinking, searching, and acting. Language models will get access to more tools of some form, but all of them will be subsets of code or search. In fact, search can be argued to be a form of execution itself, but given the imperative of the underlying information it is best left as its own category.Another popular discussion with the extremely-long generations of reasoning models has been the idea that maybe more efficient architectures such as diffusion language models could come to dominate by generating all the tokens in parallel. The (or rather, one) problem here is that they cannot easily integrate tools, such as search or execution, in the same way. These’ll also likely be valuable options in the AI quiver, but barring a true architectural or algorithmic revolution that multiplies the performance of today’s AI models, the efficiency and co-design underway for large transformers will enable the most dynamic reasoning models.Interconnects is a reader-supported publication. Consider becoming a subscriber.With establishing what makes a reasoning model complete comes an important mental transition in what it takes to make a good model. Now, the quality of the tools that a model is embedded with is arguably something that can be more straightforward to improve than the model — it just takes substantial engineering effort — and is far harder with open models. The AI “modeling” itself is mostly open-ended research.Closed models have the benefit of controlling the entire user experience with the stack, where open models need to be designed so that anyone can take the weights off of HuggingFace and easily get a great experience deploying it with open-source libraries like VLLM or SGLang. When it comes to tools used during inference, this means that the models can have a recommended setting that works best, but they may take time to support meaningful generalization with respect to new tools. For example, OpenAI can train and serve their models with only one search engine, where I at Ai2 will likely train with one search engine and then release the model into a competitive space of many search products. A space where this can benefit open models could be something like MCP, where open models are developed innately for a world where we cannot know all the uses of our models, making something like MCP libraries a great candidate for testing. Of course, leading AI laboratories will (or have already started) do this, but the ranking will be different in a priority stack.Much has been said about tokenomics and costs associated with reasoning models, without taking the tool component into account. There was a very popular article articulating how models are only getting more expensive, with a particular focus on reasoning models using far more tokens. This is overstating a blip, a point in time when serving costs increased by 1000x for models by generating vastly more tokens, but without improved hardware. The change in cost of reasoning models reflected a one-time step up in most circumstances where the field collectively turned on inference-time scaling by using the same reasoning techniques. At the same time as the reasoning model explosion, the size of models reaching users in parameter count has all but stagnated. This is due to diminishing returns in quality due to scaling parameters — it’s why OpenAI said GPT 4.5 wasn’t a frontier model and why Gemini never released their Ultra model class. The same will come for reasoning tokens.While diminishing returns are hitting reasoning token amount for serial streams, we’re finally seeing large clusters of Nvidia’s Blackwell GPUs come online. The costs for models seem well on path to level out and then decrease as the industry develops more efficient inference systems — the technology industry is phenomenal at making widely used products far cheaper year over year. The costs that’ll go up are the agents that are enabled by these reasoning models, especially with parallel inference, such as the Claude Code clones or OpenAI’s rumored Pro products.What we all need is a SemiAnalysis article explaining how distorted standard tokenomics are for inference with tools and if tools substantially increase variance in implementations. People focus too much on the higher token costs from big models with long context lengths, those are easy to fix with better GPUs, while there are many other costs such as search indices or idle GPU time waiting for tool execution results.When we look at a modern reasoning model, it is easy to fixate on the thinking token aspects that give the models their name. At the same time, search and execution are such fundamental primitives to modern language models that they can rightfully stand on their own as pillars of modern AI. These are AI systems that substantially depend on the quality of the complex inference stack far more than getting the right YOLO run for the world’s best model weights.The cause of thinking, searching, and acting all being looped in as a “reasoning model” is that this inference-time scaling with meandering chains of thought was the technological innovation that made both search and execution far more functional. Reasoning was the step change event that set these three as technology standards. The industry is in its early days of building out fundamental infrastructure to enable them, which manifests as the early days of language model agents. The infrastructure pairs deterministic computing and search with the beauty, power, and flexibility of the probabilistic models we fell in love with via ChatGPT. This reasoning model layer is shaping up to be the infrastructure that underpins the greatest successes of the future technology industry. This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe
--------
9:22
--------
9:22
Coding as the epicenter of AI progress and the path to general agents
Coding, due to its breadth of use-cases, is arguably the last tractable, general domain of continued progress for frontier models that most people can interface with. This is a bold claim, so let’s consider some of the other crucial capabilities covered in the discourse of frontier models:* Chat and the quality of prose written by models has leveled off, other than finetuning to user measures such as sycophancy. * Mathematics has incredible results, but very few people directly gain from better theoretical mathematics. * The AIs’ abilities to do novel science are too unproven to be arguable as a target of hillclimbing. Still, coding is a domain where the models are already incredibly useful, and they continue to consistently stack on meaningful improvements. Working daily with AI over the last few years across side projects and as an AI researcher, it has been easy to take these coding abilities for granted because some forms of them have been around for so long. We punt a bug into ChatGPT and it can solve it or autocomplete can tab our way through entire boilerplate. These use-cases sound benign, and haven’t changed much in that description as they have climbed dramatically in capabilities. Punting a niche problem in 1000+ lines of code to GPT-5-Pro or Gemini Deep Think feels like a very fair strategy. They really can sometimes solve problems that a teammate or I were stuck on for hours to days. We’re progressing through this summarized list of capabilities:* Function completion: ~2021, original Github CoPilot (Codex)* Scripting: ~2022, ChatGPT* Building small projects: ~2025, CLI agents* Building complex production codebases, ~2027 (estimate, which will vary by the codebase)Coding is maybe the only domain of AI use where I’ve felt this slow, gradual improvement. Chat quality has been “good enough” since GPT-4, search showed up and has been remarkable since OpenAI’s o3. Through all of these more exciting moments, AIs’ coding abilities have just continued to gradually improve. Now, many of us are starting to learn a new way of working with AI through these new command-line code agents. This is the largest increase in AI coding abilities in the last few years. The problem is the increase isn’t in the same domain where most people are used to working with AI, so the adoption of the progress is far slower. New applications are rapidly building users and existing distribution networks barely apply. The best way to work with them — and I’ll share more examples of what I’ve already built later in this post — is to construct mini projects, whether it’s a new bespoke website or a script. These are fantastic tools for entrepreneurs and researchers who need a way to quickly flesh out an idea. Things that would’ve taken me days to weeks can now be attempted in hours. Within this, the amount of real “looking at the code” that needs to be done is definitely going down. Coding, as an activity done through agents, is having the barriers to entry fully fall down through the same form factor that is giving the act of coding re-found joy.Why I think a lot of people miss these agents is that the way to use the agents is so different from the marketing of incredible evaluation breakthroughs that the models are reaching. The gap between “superhuman coding” announcements and using an agent for mini projects is obviously big. The best way to use the agents is still mundane and requires careful scoping of context. For example, yesterday, on September 17, 2025, OpenAI announced that GPT-5 as part of a model system got a higher score than any human (and Google’s Gemini Deep Think) at the ICPC World Finals, “the premier collegiate programming competition where top university teams from around the world solve complex algorithmic problems.” Here’s what an OpenAI researcher said they did:We competed with an ensemble of general-purpose reasoning models; we did not train any model specifically for the ICPC. We had both GPT-5 and an experimental reasoning model generating solutions, and the experimental reasoning model selecting which solutions to submit. GPT-5 answered 11 correctly, and the last (and most difficult problem) was solved by the experimental reasoning model.These competitions often get highlighted because they’re “finite time,” so the system must respond in the same fixed time as a human does, but the amount of compute used by GPT-5 or another model here is likely far higher than any user has access to. This is mostly an indication that further ability, which some people call raw intelligence, can be extracted from the models, but most of that is limited by scaffolding and product when used by the general population.The real story is that these models are delivering increasing value to a growing pool of people.For followers of AI, coding with AI models is the easiest way to feel progress. Now that models are so good at chat, it takes very specialized tasks to test the general knowledge of models, or many of the gains are in getting the right answer faster than GPT-5-Thinking’s meandering path.I’m not an expert software engineer and the huge differences between models, and improvements that the individual models and systems are making, have been incredibly obvious. I’ve said many times how Claude Code (or now Codex) are far better than Cursor Agent, which is in turn far better than Github CoPilot. GitHub CoPilot feels borderline drunk at the wheel. Cursor often feels a little distracted while still being smart, but Claude Code and Codex seem on topic and able to test the best of a model’s intelligence on the problem at hand. Yes, even the best agents often aren’t good enough in complex codebases, but it removes the need to go back and forth countless times in a chat window to see if a model can reach the end of the puzzle for you. These CLI agents can run tests, fix git problems, run local tools, whatever. The scope is constantly growing.For the nuanced take of Claude Code vs Codex CLI right now, the answer is expensive. The best has been Claude Code forcing Claude Opus 4.1, but Codex is not far behind and comes in at a much cheaper entry point ($20/month) — Opus requires a $100+/month plan. Codex also has nice features like web search, but it hasn’t been a major differentiator yet in my use. The new workflow is to switch to the other agent when one cannot solve the current problem, and let it see the repository with fresh eyes, much like you pasted a question to another chatbot. The agents are just one tab away, just like the competitors for chat. Interconnects is a reader-supported publication. Consider becoming a subscriber.In the comparison of Claude, Cursor, and CoPilot above, the crucial component is that all of these agents can be tested with the same Claude 4 Sonnet model. The gaps are just as wide as I stated, highlighting how so many of the gains in coding agents are just in product implementations. A second version is slightly embarrassing for me, but follows as I hadn’t updated my OpenAI Codex code when trying the new GPT-5-Codex model, which resulted in an immediate massive jump in performance by changing it. It’s a new phenomenon to have a domain at the cutting edge of AI abilities where the software scaffolding of a model is felt so strongly. Product and prompts matter more than ever and this sensation will expand to more domains. The why of these performance differences — even when using the same model — is worth dwelling on. It’s unlikely that the Claude team is that much better at general software engineering and product design — rather, Anthropic has extensive in-house experience in extracting the most from models. The current shift in models has been about how to take a set of models that are designed for question answering and other single-stream text tasks and break down problems. In my taxonomy on next-generation reasoning models, I called this ability “abstraction.” The need to just slightly shift the model to this task explains OpenAI’s recent specialized model for this, GPT-5-Codex. GPT-5 was primarily a release about balancing OpenAI’s books with a user base approaching 1B active users in the chat format. GPT-5 is a honed tool for a different job. The evaluation scores are slightly better than the general reasoning model for this new GPT-5-Codex, but the main gains are in how behavior is different in coding tasks.GPT‑5-Codex adapts how much time it spends thinking more dynamically based on the complexity of the task. The model combines two essential skills for a coding agent: pairing with developers in interactive sessions, and persistent, independent execution on longer tasks. That means Codex will feel snappier on small, well-defined requests or while you are chatting with it, and will work for longer on complex tasks like big refactors. During testing, we've seen GPT‑5-Codex work independently for more than 7 hours at a time on large, complex tasks, iterating on its implementation, fixing test failures, and ultimately delivering a successful implementation.And they included this somewhat confusing plot to showcase this dynamic. I’ve certainly felt these changes when I updated the Codex software and the Codex model.This represents another key problem I presented in my taxonomy — calibration, i.e. not overthinking. Having specialized models and specialized products for a use case could make people think that they’re narrowing in to make progress, but in OpenAI’s case it is rather that their hands are tied financially to support the main ChatGPT application. Claude has already fully committed to code. This is due to the size that the space could expand into.These “coding” agents are definitely going to be seen as doing far more than writing code. Yes, their primary ability is going to be writing the code itself and executing it, but what that enables is an entirely new way of working with your computer. In my post Contra Dwarkesh on Continual Learning, I presented a view where agents are going to be given all your digital working context in order to be a research or editorial assistant available 24/7. I’ve begun putting this to use for Interconnects, where I give the agents all of my articles, metadata, interviews, and details, so I can ask them for relevant references and context for future posts. This is very underbaked and early as a project for searching efficiently over my 400K tokens of writing, but I was prompting it a few times to see any interesting references for this post, and it got me something that was useful! This quote from my Ross Taylor interview was spot on for the vibes of using coding agents in July:My main worry with Claude Code is that... people confuse agents making you more productive versus preventing you from exerting mental effort. So sometimes I’ll have a day with Claude Code where I feel like I use very little mental effort—and it feels amazing—but I’m pretty sure I’ve done less work... Where it becomes really bad is when the file size becomes too long. Then the agent tends to struggle and get into these weird line search doom loops.This sentiment is still definitely true for production codebases that are extremely complex, but the doom loop likelihood is dropping in my tests. At the same time, the joy and mental ease still applies.Some examples of what I’ve built with a mix of Claude Code or OpenAI’s Codex CLI recently include:* A raw HTML site for my RLHF book for comparing the responses of SFT vs. RLHF trained models from the same lineage (and improvements to RLHF book itself).* Making a repository with all of the posts and content from Interconnects so I can use coding agents as editorial assistants while writing.* Improvements to the ATOM Project website.* Stripping my personal website out of Webflow’s systems (which was a mistake to sign up for during graduate school), including CMS entries and other detailed pages.* Other small scripts and tools in my day job training models.It’s not just me building extensively with these. There are multiple open-source projects committed to tracking the public contributions of these models — two are PRArena and Agents in the Wild.PRArena’s dashboard shows over a million PRs getting merged from the Codex web agent, dwarfing many of the competitors. This is the power that OpenAI can wield with distribution, even if the web app version of Codex is far from the zeitgeist that is CLI agents today.This comes with a notable asterisk in methodology that can explain many of the gaps in similar dashboards:Some agents like Codex iterate privately and create ready PRs directly, resulting in very few drafts but high merge rates. Others like Copilot and Codegen create draft PRs first, encouraging public iteration before marking them ready for review.The statistics below focus on Ready PRs only to fairly compare agents across different workflows, measuring each agent's ability to produce mergeable code regardless of whether they iterate publicly (with drafts) or privately.The other dashboard, Agents in the Wild, shows that OpenAI’s coding agent is only one order of magnitude behind humans and other automations in PRs merged.Putting this in perspective relative to Gemini or Claude:The context with this is that Claude Code is far more downloaded than OpenAI’s CLI agent Codex, but it doesn’t name its PRs the same clever way by default with the agent name in the branch. Claude Code has over 20X the downloads of Codex in the last week on NPM.Despite the challenges of measurement, it’s clear that coding agents are taking off. The Codex PRs above actually represent the web agent, which has the default branch name behavior, not the CLI agent. This shows the might of OpenAI’s distribution, and it is impressive how many of the PRs are actually merged (over 80%), when thousands of people are trying a new tool for the first time. The primary difference between the web agent and the CLI agent is a reduction in interactivity. The CLI agents propose a plan and ask for feedback, or let you monitor and interrupt. Codex on the web wraps a similar behavior as the CLI agents in one system that runs all the way until it can open a PR.Over time coding is only going to get more asynchronous and OpenAI is poised to capture this transition if it happens soon. Based on all the above evidence of coding models getting more capable, the move to this new UX for software will happen faster than people expect. The transition to fully autonomous coding will happen soon for types of work where coding models already work near flawlessly — scripts, websites, data analysis, etc. Later, complex production codebases will work best at lower levels of the stack — IDEs, CLI agents, and other things that are both interactive and best for absorbing content.Within a few years, the two trends will converge where autonomous agents are functional and the most complex codebases can be improved with AI. Then everything can return to the chatbot window — you only need to open your IDE when you want to understand what’s going on. For most people, not having to look at the code will be a welcome change.Progress in coding feels slower than the “emergent” abilities between model generations past, which makes it easier to keep track of. This is due to how big the range in behaviors that encompass “coding” is, but results in a fantastic area for learning how AI models evolve and iterate. This playbook will be used many times over by frontier labs in the coming years as AI models are taught to solve more challenging tasks.There’s a quiet revolution happening, and in order to truly understand it, you need to partake. Go build something. This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe
--------
16:18
--------
16:18
On China's open source AI trajectory
Hello everyone! I’m coming back online after two weeks of vacation. Thankfully it coincided with some of the slowest weeks of the year in the AI space. I’m excited to get back to writing and (soon) share projects that’ll wrap up in the last months of the year.It seemed like a good time to remind people of the full set of housekeeping for Interconnects. * Many people love the audio version of the essays (read by me, not AI). You can get them in your podcast player here. Paid subscribers can add private podcast feeds under “manage your subscription” where voiceover is available for paywalled posts.* The Interconnects Discord for paid subscribers continues to get better, and is potentially the leading paid perk amid the fragmentation of Twitter etc.* We’re going to be rolling out more perks for group subscriptions and experimental products this fall. Stay tuned, or get in touch if group discounts are super exciting for your company. For the time being, I’m planning trips and meetups across a few conferences in October. I’ll be speaking at The Curve (Oct. 3-5, Berkeley), COLM (Oct. 7-10, Montreal, interest form), and the PyTorch Conference (Oct. 21-24, SF) on open models, Olmo, and the ATOM Project, so stay tuned for meetups and community opportunities. On to the post!China is maneuvering to double down on its open AI ecosystem. Depending on how the U.S. and its allies change culture and mobilize investment, this could make the dominance of Chinese AI models this summer, from Qwen, Kimi, Z.ai, and DeepSeek, looks like foreshadowing rather than the maximum gap in open models between the U.S. and China. Until the DeepSeek moment, AI was likely a fringe issue to the PRC Government. The central government will set guidelines, rules, budgets, and focus areas that will be distributed and enforced across the decentralized government power structures. AI wasn’t a political focus and the strategy of open-source was likely set by companies looking to close the gap with leading American competitors and achieve maximum market share in the minimum time. I hear all the time that most companies in the U.S. want to start with open models for IT and philosophical reasons, even when spinning up access to a new API model is almost effortless, and it’s likely this bias could be even higher internationally where spending on technology services is historically lower.Most American startups are starting with Chinese models. I’ve been saying this for a while, but a more official reference for this comes from a recent quote from an a16z partner, Martin Casado, another vocal advocate of investment in open models in America. He was quoted in The Economist with regards to his venture portfolio companies:“I’d say 80% chance [they are] using a Chinese open-source model.”The crucial question for the next few years in the geopolitical evolution of AI is whether China will double down on this open-source strategy or change course. The difficulty with monitoring this position is that it could look like nothing is happening and China maintains its outputs, even when the processes for creating them are far different. Holding a position is still a decision.It’s feasible in the next decade that AI applications and open models are approached with the same vigor that China built public infrastructure over the last few decades (Yes, I’m reading Dan Wang’s new book Breakneck). It could become a new area that local officials compete in to prove their worth to the nation — I’m not sure even true China experts could make confident predictions here. A large source of uncertainty is whether the sort of top-down, PRC edicts can result in effective AI models and digital systems, where government officials succeeded in the past with physical infrastructure.At the same time as obvious pro-AI messaging, Chinese officials have warned of “disorderly competition” in the AI space, which is an indirect signal that could keep model providers releasing their models openly. Open models reduce duplicative costs of training, help the entire ecosystem monitor best practices, and force business models that aren’t reliant on simple race-to-the-bottom inference markets. Open model submarkets are emerging for every corner of the AI ecosystem, such as video generation or robotic action models, (see our coverage of open models, Artifacts Logs) with a dramatic evolution from research ideas to mature, stable models in the last 12-18 months.China improving the open model ecosystem looks like the forced adoption of Chinese AI chips, further specialization of companies’ open models to evolving niches, and expanded influence on fundamental AI research shared internationally. All of these directions have early signs of occurring.If the PRC Government wanted to exert certain types of control on the AI ecosystem — they could. This Doug Guthrie excerpt from Apple in China tells the story from the perspective of international companies. Guthrie was a major player in advising on culture changes in Cupertino to better adapt Apple’s strategy to the Chinese market.“When you stake your life, your identity, on and around certain ideas, you sort of fight for them,” Guthrie says. “Xi Jinping kind of broke my heart… I was sitting there, in China, in my dream job, and I’m watching Xinjiang’s internment camps. I’m watching China tearing up a fifty-year agreement over Hong Kong.”Apple, meanwhile, had become too intertwined with China. Guthrie had been hired to help understand the country and to navigate it. And Apple had followed through—very successfully. But it had burned so many boats, as the saying goes, that Guthrie felt its fate was married to China’s and there was no way out. “The cost of doing business in China today is a high one, and it is paid by any and every company that comes looking to tap into its markets or leverage its workforce,” he later wrote in a blog. “Quite simply, you don’t get to do business in China today without doing exactly what the Chinese government wants you to do. Period. No one is immune. No one.”China famously cracked down on its largest technology companies in late 2020, stripping key figures of power and dramatic amounts of market value off the books. AI is not immune to this.The primary read here is that the PRC leadership will decide on the role they want to have in the open-source AI ecosystem. The safe assumption has been that it would continue because the government picked up a high-impact national strategy when it first started focusing on the issue, already seeded with international influence. To formalize these intentions, the Chinese government has recently enacted an “AI+” plan that reads very similarly to the recent White House AI Action Plan when it comes to open models. The AI+ plan idea was first proposed in March 2024 and was just approved in its full text on July 31, 2025. The AI+ plan, when enacted by local officials, lays out goals for the AI industry in how many open models to have at different tiers of performance and some funding mechanisms for nurturing them. This is right in line with other comments from party officials. Chinese Premier Li Qiang, second-ranking member of the Politburo Standing Committee, made comments in March directly supporting open-source models. From the Wall Street Journal:Li pledged that China would boost support for applications of large-scale AI models and AI hardware, such as smartphones, robots, and smart cars.China’s top economic planning body also said Wednesday that the country aimed to develop a system of open-source models while continuing to invest in computing power and data for AI.An excerpt from Beijing’s city plan as part of the overall AI+ initiative, translated by GPT-5 Pro, has interesting, specific goals:By end-2025: implement 5 benchmark application projects at a world-leading level; organize 10 demonstration application projects that lead the nation; and promote a batch of commercializable results. Strive to form 3–5 advanced, usable, and self-controllable base large-model products, 100 excellent industry large-model products, and 1,000 industry success cases. Take the lead in building an AI-native city, making Beijing a globally influential AI innovation source and application high ground.The goal of this is to:Encourage open-source, high-parameter, ‘autonomous and controllable’ base foundation models, and support building cloud hosting platforms for models and datasets to facilitate developer sharing and collaboration.Beyond the minor translation bumpiness, the intentions are clear. The goal of the A+ plan is clear with multiple mentions of both open-source models and an open ecosystem with them where the models can be adopted widely. The ecosystem of models can make the impact of any one individual model greater than it would be alone.The Chinese government having centralized power has more direct levers to enact change than the White House, but this comes with the same trade-offs as all initiatives face when comparing the U.S. vs. China’s potential. I won’t review all of the differences in the approaches here.Where the Chinese Government enacts top level edicts that’ll be harder to follow from the West, there are numerous anecdotes and interactions that highlight in plain terms the mood of the AI ecosystem in China. I’ve routinely been impressed by the level of direct engagement I have received from leading Chinese AI companies and news outlets. Interconnects’ readership has grown substantially in China.Chinese companies are very sensitive to how their open contributions are viewed — highlighting great pride in both their work and approach. The latest case was via our China open model rankings that got direct engagement from multiple Chinese AI labs and was highlighted by a prominent AI news outlet in China — 机器之心/Synced. They described Interconnects as a “high-quality content platform deeply focused on frontier AI research.” (This Synced post was translated and discussed in the latest ChinaAI Newsletter)When intellectuals, influencers, and analysts I follow talk directly to technical members of the AI workforce in China, they sound like what we would expect — people who want to build a great technology. Jasmine Sun had a great writeup on her trip that had some anecdotes on AI in China. She asked “Do you guys worry about AI safety?”“We don’t think about risks at all.” …Continuing from Jasmine:This was the first of several conversations that gave us a distinct impression of the Chinese tech community. Spirits are high, and decoupling policies like export controls only fuel their patriotic drive.At the same time, America still represents a covetable life, despite the current political tumult:To be clear, our researcher friend made clear that working at a top US AI lab was still the most desirable option.In so many ways, trying to precisely map China’s next steps in AI is extremely challenging. Can they convert their lead in energy infrastructure to more total AI compute? Can they build their own AI chips? Will they take the frontier of performance with their talented population and a different approach? All of this is up for debate. The intrigue here is exemplified by the abundant interest in sparse news stories on how DeepSeek is training some AI model with Huawei chips. In many ways, these new chips working would be a bigger story than the original DeepSeek model, but all signs point to expected experiments with domestic chips, where China’s leading AI models are likely to be trained on Nvidia and other Western chips for the foreseeable future. I do not expect DeepSeek R2 to be trained on Huawei’s hardware.China’s hardware investment will take a lot longer to play out than open model strategies, but if China pulls it off — along with its other investments, such as self-driving cars and robots — their practical lead in AI could come for more areas. Open models could be China’s beachhead in a bigger technological resurgence with AI.Without major changes to Western investment in open models, we’re approaching a status quo in 2026 and beyond where:* Chinese open models would continue to increase their lead in performance (and adoption) over American counterparts. This will manifest in many ways. One example is how startups in Silicon Valley built on stronger Chinese models will be offering services that compete with entrenched, handicapped Fortune 500 companies wary of adopting these models in their services. This could make some subareas of AI disruption feel particularly intense. * The Chinese open ecosystem’s density of knowledge and sharing would translate into increased scientific and academic impact. China’s share of conference papers at leading AI conferences is already rapidly on the rise, and having an ecosystem built around substantially better models than their Western counterparts could lead this numerous research growing also to be impactful. Better base models allow more interesting RL and agentic research today, and the list of areas reliant on high-performance models is likely to only grow longer with time.* A proliferation of strong open models would make it difficult to restrict the presence or availability of many forms of AI. We do not have the government tools, incentives, nor culture to successfully prevent digital goods from China (or elsewhere) entering the U.S. economy. Many forms of AI governance and regulation in the United States and the rest of the world may need to be reconsidered, where many jurisdictions have looked to control and understand the development of “frontier AI.” Regulation needs to be approached for a world enmeshed in powerful AI models, rather than trying to control access or the releases of a few.These realities all paint a clear picture that bends the association of open models from “soft power” to just “power.” Continuously releasing strong open AI models could allow Chinese companies to shape the technology interfaces, services and reality around the world. Where 2024 was about research on open models, and 2025 the professionalization of them, 2026 could be where we begin to see clear impacts of their power through endless distribution. This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe
--------
13:37
--------
13:37

More Science podcasts

About Interconnects

Audio essays about the latest developments in AI and interviews with leading scientists in the field. Breaking the hype, understanding what's under the hood, and telling stories. www.interconnects.ai

Podcast website

Science Technology