Interconnects

Nathan Lambert
Audio essays about the latest developments in AI and interviews with leading scientists in the field. Breaking the hype, understanding what's under the hood, an...

Available Episodes

5 of 80
  • An unexpected RL Renaissance
    The era we are living through in language modeling research is one characterized by complete faith that reasoning and new reinforcement learning (RL) training methods will work. This is well-founded. A day cannot go by without a new reasoning model, RL training result, or dataset distilled from DeepSeek R1. The difference, compared to the last time RL was at the forefront of the AI world, when reinforcement learning from human feedback (RLHF) was needed to create ChatGPT, is that we have far better infrastructure than our first time through this. People are already successfully using TRL, OpenRLHF, veRL, and of course, Open Instruct (our tools for Tülu 3/OLMo) to train models like this. When models such as Alpaca, Vicuña, Dolly, etc. were coming out, they were all built on basic instruction tuning. Even though RLHF was the motivation for these experiments, the tooling and lack of datasets made complete and substantive replications rare. On top of that, every organization was trying to recalibrate its AI strategy for the second time in 6 months. The reaction to and excitement around Stable Diffusion were all but overwritten by ChatGPT. This time is different. With reasoning models, everyone has already raised money for their AI companies, open-source tooling for RLHF exists and is stable, and everyone is already feeling the AGI.
Aside: For a history of what happened in the Alpaca era of open instruct models, watch my recap lecture here — it's one of my favorite talks in the last few years.
The goal of this talk is to try and make sense of the story that is unfolding today:
* Given it is becoming obvious that RL with verifiable rewards works on old models — why did the AI community sleep on the potential of these reasoning models?
* How to contextualize the development of RLHF techniques with the new types of RL training?
* What is the future of post-training? How far can we scale RL?
* How does today's RL compare to historical successes of Deep RL?
And other topics. This is a longer-form recording of a talk I gave this week at a local Seattle research meetup (slides are here). I'll get back to covering the technical details soon!
Some of the key points I arrived at:
* RLHF was necessary, but not sufficient, for ChatGPT. RL training like that used for reasoning could become the primary driving force of future LM developments. There's a path for "post-training" to just be called "training" in the future.
* While this will feel like the Alpaca moment from 2 years ago, it will produce much deeper results and impact.
* Self-play, inference-time compute, and other popular terms related to this movement are more "side quests" than core to the RL developments. They're either inspirations for or side effects of good RL.
* There is just so much low-hanging fruit for improving models with RL. It's wonderfully exciting.
For the rest, you'll have to watch the talk. Soon, I'll cover more of the low-level technical developments we are seeing in this space.
00:00 The ingredients of an RL paradigm shift
16:04 RL with verifiable rewards
27:38 What DeepSeek R1 taught us
29:30 RL as the focus of language modeling
Get full access to Interconnects at www.interconnects.ai/subscribe
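Since "RL with verifiable rewards" does a lot of work in these notes, here is a minimal sketch of what the reward side can look like, assuming a hypothetical answer-extraction convention; it is an illustration only, not code from TRL, OpenRLHF, veRL, or Open Instruct.

```python
import re

def extract_final_answer(completion: str) -> str | None:
    # Hypothetical convention: the model ends its chain of thought with a
    # line like "Answer: 144". Real recipes use stricter, task-specific parsing.
    match = re.search(r"Answer:\s*(.+)", completion)
    return match.group(1).strip() if match else None

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Binary reward from a programmatic check instead of a learned reward model."""
    answer = extract_final_answer(completion)
    return 1.0 if answer is not None and answer == ground_truth else 0.0

# These scalar rewards would feed the policy-gradient update (e.g. PPO or GRPO)
# in whichever RL trainer is being used.
print(verifiable_reward("12 * 12 = 144, so that is the result. Answer: 144", "144"))  # 1.0
print(verifiable_reward("My best guess is 150. Answer: 150", "144"))                  # 0.0
```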
    --------  
    39:49
  • Deep Research, information vs. insight, and the nature of science
    Article: https://www.interconnects.ai/p/deep-research-information-vs-insight-in-science (sorry about some more audible breaths in this -- I'm going to work on it!)
We at Ai2 released a local LM iPhone app for our OLMoE model (1B active, 7B total params), with greatly improved scores! Let us know what you think, or read more here.
OpenAI's Deep Research has largely been accepted as a super valuable tool for knowledge workers and analysts across the economy, but its real impact on economic progress is going to come from changing the nature of scientific progress. Science is the fuel of technological revolutions.
Deep Research in its current form feels like a beta version of a next-generation piece of technology. It does what it is tasked with — it searches the web and processes many resources to create a useful report with referenced sources. Some of my uses include researching model evaluations, recent robotic learning research, and AI for science breakthroughs.
Deep Research's limitations mostly feel like problems of search (it is prone to returning SEO-optimized slop), style (it returns verbose, low-information-density writing), and modality (it does not have the ability to read, process, and return plots and diagrams). All of these are surely solvable, and expected features if we look at the rollouts of other AI models in the last few years.
This isn't a product review (you can read Stratechery or Turing Post for more of that) — the answer there is quite simple: if you work in a knowledge-intensive vocation, you should be using this — but rather an attempt to ask: so what comes next?
The place to start from within AI circles is to revisit the question of "When will AI make novel discoveries?" A good example of this is in the Dwarkesh Podcast episode with Dario Amodei:
One question I had for you while we were talking about the intelligence stuff was, as a scientist yourself, what do you make of the fact that these things have basically the entire corpus of human knowledge memorized and they haven't been able to make a single new connection that has led to a discovery?
An example experiment we could do to test this is to train models on time-gated information and see if they can repeat a scientific discovery we already made (yes, this would be difficult to run, but not impossible). Ross Taylor described this on his Interconnects Interview:
So an experiment I've never done because I didn't have [the] compute would be this. Imagine if you could train a language model on all documents up to 1905, which is the year when Einstein had his miraculous year of four seminal papers. With that model, which is trained up to 1905, could you prompt the model to come up with a good explanation of the photoelectric effect, special relativity, this kind of stuff? And what would it take to rediscover these things?
The dream is for AI to make breakthroughs, and the absence of evidence for this even after the release of Deep Research is driving a reckoning over what language models will ever be able to do. The fork in the road is either believing that scaling (either in parameters or in new training methods) will unlock "insights" or accepting that the current generation of models are very useful tools and nothing more supernatural. Likely the most powerful tool humanity has made yet. Our first power tool for information.
Much of science is not about making novel insights but about making progress within established problems of the field. In AI, these are the countless benchmarks we are saturating.
A very valuable contribution in AI as a field can be re-using known resources in a simpler way. With AI, we are going to learn the boundary between true insight and scientific progress. A related form of scientific progress is the compression of noisy ideas and experiments into a cohesive trend. Something that Deep Research can likely do, but not something that builds the allure of Einstein and the other scientific greats.
To understand this relationship between Deep Research, AI broadly, and the nature of science, we must address:
* How to interpret existing "AI for Science" projects like AlphaFold in the bigger context of science,
* How reasoning models, AI research systems like Deep Research, and other forthcoming AIs revolutionize existing scientific practices,
* How recent developments in AI challenge Kuhn's formulation of scientific revolutions, and
* How current institutions will need to change forever in the face of AI.
This (hopefully) series of posts is my attempt to create a worldview around what science means in the face of AI. Today, we focus on the first two — major AI for science projects and how normal science is being accelerated by AI — and hopefully raise urgency within the community to consider the final question.
The starting point — grand AI for science projects
There is a substantial overhang in computational infrastructure and fundamental deep learning capabilities relative to their impact on the broad class of sciences. In order to make a substantial leap in the application of AI to a specific domain, a team must mold the existing underlying capability of AI to the needs of trained groups of scientists.
The list of examples people think of in this mold ranges across domains: AlphaFold for protein folding, AlphaGeometry for mathematics, GraphCast and GenCast for weather, and more that lack such prominent branding. They leverage advancements in deep learning and transformer architectures, but tend to have X-factors specific to the domain of interest (see a Deep Research query summarizing this). Such added features are pulling forward AI capabilities to suit a narrow domain.
There's a substantial craft to selecting suitable problems for applying this grand AI for science approach. It requires a field with central elements that are quantitatively focused. Even with this, outcomes are more uncertain than standard AI research or standard research in the domain of choice.
The essay A new golden age of discovery from AI Policy Perspectives details how DeepMind sees the opportunity here and showcases some internal ingredients they found that make these projects more likely to be successful.
The fact that any of these projects have succeeded shows the overall potential of AI for science. The overall necessity of the approach depends on whether the grand AI for science breakthroughs are pulling forward progress by months or years, or whether these models are the single required breakthrough to approach entirely new areas of study.
As the broader scientific community embraces AI as "something that works," more of these step changes will happen. They take a very large density of compute and talent on a single problem.
These projects fit more naturally into a classical view of science. They take substantial resources and are high risk. Meanwhile, the mass-market AI tools that everyone is adopting will dramatically shift the practice of doing science.
Towards instantaneous Ph.D.'s
We have two tools that dramatically shift the nature of scientific exploration.
They will only get better.
* AI models that excel at code, mathematics, and reasoning: OpenAI's o3, DeepSeek R1, Gemini Deep Thinking, etc.
* AI systems to rapidly parse and summarize existing literature: OpenAI's Deep Research, Gemini Deep Research, Ai2's Scholar QA (specific to academic papers), and many more that will come soon.
These tools are dramatically accelerating the most time-consuming aspects of research, particularly in computationally intensive fields. In a few years, the only gating factors on the impact of a scientist will be their access to cutting-edge tools, their understanding of the gaps in AI, and their ability to ask the right questions. The final point is well established as a trait of the most successful scientists, goes hand in hand with the idea of "insight," and is where the differentiation among scientists will only increase.
Computational super-scientists
All scientific fields that rely heavily on computational infrastructure as a bottleneck for progress are going to experience a dramatic acceleration in the near future. In AI and closely related computer science fields, this is evident from the abundance of soon-to-be superhuman coding assistants and an exponential (short-term) increase in available compute.
Most AI research is severely bottlenecked by the compute available, the time to implement the intervention, and the implicit efficiency of the idea-implementation interface. Future siblings of OpenAI's o1 models are going to be used extensively to streamline this. This worldview barely accounts for the ability of these reasoning models to decide on which problem to solve and to interpret the results. These sorts of research assistants running in the cluster are a central component of Anthropic CEO Dario Amodei's vision in Machines of Loving Grace, and it is one that requires far less optimism in magical breakthroughs than the grand AI for science projects.
Reasoning language models (RLMs) have in their first year of existence shown major progress on all of the evaluations the AI field put forward as fundamental challenges for the field. Accumulating iterations of this should transfer to scientific decision-making, but we don't exactly know how.
The fundamental unit of progress in science, which can be viewed as one Ph.D.'s worth of progress (same goes for one paper), is shrinking quickly enough to redefine many methods of experimentation and how we decide what is or is not possible. Multiple efforts are already documenting how RLMs can be used to find errors in existing literature — a process that will likely be automated in the next few years. Rather than science proceeding with a high velocity, it feels as if science is proceeding with a high acceleration.
The pace of progress necessitates a reinvention of most of our scientific institutions. What happens when the time it takes to create a Ph.D.'s worth of knowledge is substantially smaller than the amount of time it takes to get peer review feedback from the field's journals?
Changing the structure of scientific revolutions
Most of the ideas behind this piece originated over the holidays when I was trying to understand how science, as a practice and institution, would be changed by AI. The announcement of OpenAI's Deep Research only solidified the view that the current wave of AI tools is not one that will ever be fundamentally insight-driven.
They are extremely powerful and efficient computing engines, not insight engines.
A pseudonymous account on Twitter, Michael, summarized the debate we started with, on whether LMs can make insights, perfectly:
To an LLM, a novel discovery is indistinguishable from an error.
This is a limitation that letting language models act in a wet lab does not solve. Insight is far removed from information, but both are accepted as being crucial to scientific progress. Information is the engine for insight.
Still, Deep Research and the AI models we have access to are a substantial accelerant of science. Modern scientists have always had to wear many hats: advising, reading, implementing, thinking, communicating, etc. That list getting shorter, so long as these new technologies are broadly available, is a wonderful boon for normalizing access to scientific careers. This should be a wonderful bloom of scientific progress (at least, production of results), but we need to do substantial work to prepare our already strained institutions.
For my readers in academic or science-driven circles, the comment section of this post is open to free subscribers; please share the biggest opportunities and pinch points in the current scientific ecosystem as we embrace these AI tools. We need to be more ambitious and level-headed than complaints about AI-written peer reviews. In the long term, AI peer review will be needed to review all of the progress. We need incentives for managing it, resources to share across a broad community, and much, much more.
Interconnects is a reader-supported publication. Consider becoming a subscriber.
Further Reading
Thomas Kuhn's The Structure of Scientific Revolutions is regarded as one of the most influential non-fiction books of all time for describing the behavior of, and defining the terminology for, how modern science evolves. The core idea is that scientists form paradigms around certain theories and ideas and use them to uncover new knowledge until questions or limitations require the emergence of another. Kuhn makes it clear with many historical references that science is done by a community and slowly builds out the frontier of knowledge — rather than filling in a known space of potential knowledge. The necessity of this can be summarized as:
To reject one paradigm without simultaneously substituting another is to reject science itself.
This pattern can be seen again and again, building on the likes of Galileo and others in the 17th century. A core fact of this is that scientific knowledge is a process and not one that sets specific ideas in stone:
If these out-of-date beliefs are to be called myths, then myths can be produced by the same sorts of methods and held for the same sorts of reasons that now lead to scientific knowledge. If, on the other hand, they are to be called science, then science has included bodies of belief quite incompatible with the ones we hold today. Given these alternatives, the historian must choose the latter. Out-of-date theories are not in principle unscientific because they have been discarded. That choice, however, makes it difficult to see scientific development as a process of accretion. The same historical research that displays the difficulties in isolating individual inventions and discoveries gives ground for profound doubts about the cumulative process through which these individual contributions to science were thought to have been compounded.
The biggest challenge to Kuhn's theories of change is the emergence of AI.
We must grapple with questions akin to "How will the dynamics of science change with powerful AI systems?" The open question is whether these accelerations unsettle the Kuhnian nature of science by making progress happen faster than paradigms themselves can be established.
In addition to the posts referenced throughout, I built my knowledge on top of many recent pieces on this area that you may find interesting:
* Dario Amodei on the Lex Fridman podcast.
* Levers for biological progress on Asimov Press.
* X thread on theories of change in sciences.
* The white paper Artificial Intelligence, Scientific Discovery, and Product Innovation.
* The Dwarkesh Patel Podcast with Adam Brown.
* AI Policy Perspectives piece A new golden age of discovery.
* Owl Posting checking recent NeurIPS ML for bio paper results (based on an idea from Ethan Mollick). Am I Stronger Yet? has a series on this topic too.
* Scientific Models in Philosophy of Science.
* Kuhn's The Structure of Scientific Revolutions.
* The Intrinsic Perspective on why great scientists follow beauty.
* Dean W. Ball's review of Deep Research, Knowledge Navigator.
Get full access to Interconnects at www.interconnects.ai/subscribe
    --------  
    14:08
  • Making the U.S. the home for open-source AI
    As many of you know, this weekend I appeared on the Lex Fridman Podcast with my friend Dylan Patel of SemiAnalysis to cover DeepSeek and the implications on the AI ecosystem. I recommend you check it out.
This post was tricky to pull together. I decided to share it anyways given the timeliness of the topic and other more exciting things I have to get to. The minor, thematic contradictions on motivations, costs, and trajectories are exactly indicative of why analysis and productionization of open-source AI is so hard. In that, there is a valuable lesson: building open-source AI will come with a lot of ups and downs, but now is the best time to do so.
The DeepSeek moment represents the end of the first chapter of AI's recent takeoff as told through the emergence of ChatGPT. It reminds us that while substantial resources, coalitions, brands, and trends have been established, the narratives we have been championing are not set in stone. DeepSeek, especially with R1, resets all the narratives around open vs. closed, US vs. China, scaling and commoditization, etc. as we prep for yet another acceleration in the diffusion, progress, and adoption of AI.
Of all of these debates, the focus on open vs. closed AI models is the one least driven by economic factors and most driven by vibes. The open-source AI community is driven by a vision of a future where AI is not held by a few super-rich companies, a future where more people get to partake in the building of AI, a future where AI is safer, etc. These are ideals, and building the tools and systems that make this vision a reality is a monumental challenge. Building strong AI models is far, far easier than building a sustainable open-source ecosystem around AI.
Building a better, truly open ecosystem for AI has been my life's work in the last years and I obviously want it to flourish further, but the closer you are to the core of the current open-source ecosystem, the more you know that this is not a given, with the costs of doing relevant AI training skyrocketing (look, I know DeepSeek had a very low compute cost, but these organizations don't just fall out of the tree) and many regulatory bodies moving fast to get ahead of AI in ways that could unintentionally hamper the open. Yes, efficiency is getting better and costs will come down, as shown with DeepSeek V3, but training truly open models at the frontier isn't much easier.
Building the future ecosystem of open
As a perfect case in point, consider Meta. Meta, as a platform serving content to billions of users, is extremely well-positioned to use AI to make its services more engaging and more profitable for advertisers. The Llama project is not needed for that vision. Yes, it will be easier for them to integrate and optimize an AI that they train, but in a world where AI models are commoditized, what's the point? The most compelling reasons for openly releasing the Llama models are not business reasons but rather ideological reasons. Mark Zuckerberg revisited this on the recent Meta earnings call:
I also just think in light of some of the recent news, the new competitor DeepSeek from China, I think it's one of the things that we're talking about is there's going to be an open source standard globally. And I think for our kind of national advantage, it's important that it's an American standard.
So we take that seriously and we want to build the AI system that people around the world are using and I think that if anything, some of the recent news has only strengthened our conviction that this is the right thing for us to be focused on.
The pro-America messaging from Zuckerberg long predates the new administration (especially given that all of Meta's major apps are banned in China), even if the language is amplified now. This is purely an argument of "we are doing this because we should."
This argument is extremely similar to that used by DeepSeek AI's CEO Liang Wenfeng. In an interview translated by ChinaTalk, Wenfeng described the need for Chinese leadership in open-source AI (in addition to a clear commitment to keep releasing models openly).
Liang Wenfeng: Because we believe the most important thing now is to participate in the global innovation wave. For many years, Chinese companies are used to others doing technological innovation, while we focused on application monetization — but this isn't inevitable. In this wave, our starting point is not to take advantage of the opportunity to make a quick profit, but rather to reach the technical frontier and drive the development of the entire ecosystem.
…We believe that as the economy develops, China should gradually become a contributor instead of freeriding. In the past 30+ years of the IT wave, we basically didn't participate in real technological innovation. We're used to Moore's Law falling out of the sky, lying at home waiting 18 months for better hardware and software to emerge. That's how the Scaling Law is being treated.
But in fact, this is something that has been created through the tireless efforts of generations of Western-led tech communities. It's just because we weren't previously involved in this process that we've ignored its existence.
The interview has many other comments making it clear that the way this will be done is by training powerful AI and releasing it for the world to use.
Both of these arguments, from Zuckerberg and Wenfeng, rely on the optimism that we, as a community of users of open AI models, will figure out how to create a valuable ecosystem around them. Right now, the vast majority of AI usage for applications comes through various API calls. Yes, some of this includes the usage of open-weight models like Llama and DeepSeek R1, but it does not give clear positive attribution to the model being open as a reason it was used.
The nationalistic comments regarding open-source AI are only likely to grow stronger as governments more deeply integrate with their leading AI companies.
One of the main arguments why American AI leaders believe that the AI ecosystem should be built on a Western foundation is the risk of China "poisoning the well" of our future computational infrastructure. To be very clear — there is absolutely no evidence of this to date, but it is a simple proposition that the Chinese Communist Party (CCP) could build ties to the leading Chinese AI laboratories and require them to train for specific behaviors or train in some sort of back door through model weights into American infrastructure. America has been reeling with the potential of this sort of influence on TikTok. If AGI is to be a real thing that can be steered to ideological outcomes, a bill titled Protecting Americans from Foreign Adversary Controlled Applications Act (the bill banning TikTok and forcing a divestiture) will operate at entirely the wrong level of abstraction.
American companies raced to host R1 in a competitive frenzy. This is how open-source works, and it will be far easier to incentivize better open models from Western labs than it will be to ban companies from adopting Chinese technology. As of the release of DeepSeek R1, Chinese AI companies didn't have clear links to the government, but after said release, DeepSeek's CEO met with the Chinese Premier Li Qiang (approximately second in command) to discuss their work. AI is obviously far more on the radar of American leadership as a priority and has been for some time. This is a major advantage that the U.S. has in terms of a fast reaction to changing needs for open models.
In a recent Reddit AMA soon after his appearance on stage with Trump for the announcement of the Stargate project, OpenAI CEO Sam Altman even acknowledged that their strategy "may be on the wrong side of history" here with respect to openly sharing AI components. OpenAI should get no credit until their actions change, but DeepSeek and a new government administration have made many forces re-evaluate their relationship to the open ecosystem. The current imperative of open-source AI is to create feedback loops where open models become more useful than their closed counterparts. Given that AI is very expensive and slow to train, this cannot look like the accumulation of small security and reliability improvements, as is done with open-source software. There's a chance that there is an algorithmic innovation that makes this possible, but for now, the solutions need to be more imaginative. Two examples I am currently interested in include:
* Feedback loops from data to model behavior. If exposing the data to users, either from pre-training or post-training, makes it easier to control a model, then open models can win.
* Finetuning advancements. Currently, finetuning any model to target a specific task is extremely hard. This is true with both open-source code and fine-tuning APIs. If open-source code can be made to enable feedback loops of cheap synthetic data with verifiers to make very targeted models, open models can win.
These are just two examples. We need more than these if we want open-source AI to continue once the bubble of AI advancement cracks. We don't know when this comes, but if investment is driven by ideological reasons rather than monetary ones, public companies only have so much leeway to continue indefinitely.
These days I classify myself as an advocate for more openness for AI (which never means absolute openness), but I'm not going to describe myself as also being an optimist for it "winning." As scaling continues to push the limits of multiple training regimes and recipes become more expensive, open models drift from the frontier. This DeepSeek moment happened once in the 2+ years since the release of ChatGPT. We need to change incentives if we want it to happen regularly.
The reason DeepSeek R1 is so close to the frontier is that, on top of being extremely skilled, they have a far faster release process than the likes of OpenAI and Anthropic, who do extensive safety testing and even have releases pre-screened by federal governments. Meanwhile, DeepSeek seems to finish the model, take a week to write the paper, and immediately distribute it to the entire world. Iteration speed has always been an advantage of open-source, and they happen to be aligned here.
Interconnects is a reader-supported publication.
Consider becoming a subscriber.
Restricting the distribution of open models is a losing proposition
If we geolock websites like HuggingFace to only work within the United States, we lose. If we geolock model weights with threats of jail time as proposed in this recent bill, we lose (not views of my employer, etc. etc.). If we add substantial penalties to companies who release models that end up being used in China, we lose.
DeepSeek R1 is the closest to the frontier that a language model has been with a permissive license. Models like Llama 405B have been open weights and at the frontier of performance before, but weren't picked up because they were very hard to use. The only reason we should limit the ability of people to release model weights openly is if we see measurable negative impacts of doing so on society. This means immediate harms to information ecosystems or other infrastructure.
In the meantime, if we (as Western countries) decide to stop training and releasing open models, another power will happily do this. This is not just China — there are countless countries with the wealth to create the current generations of frontier language models. While it is understandably a stressful proposition, it has been said for at least a year that "Whether or not we want it, open language models are likely here to stay."
We are battling over who creates the most popular open language models, which could have a substantial influence on societies given the socially rich contexts that we use language models in today, not over whether they will exist at all. Open-source processes succeed with a multiplicity of options for users. This is what gives users options to fiddle with and improve the ecosystem as a whole, learning from each other and improving together.
Open-source AI is something that people will not benefit from uniformly. Export controls are not about preventing access to AI, but rather about preventing proliferation (and use) of AI. We have already seen this potentially kick in with the DeepSeek R1 launch, where DeepSeek had to pause account signups and there were widespread issues with their API.
The amount of compute used to train models is increasing today, and rapidly, but as AI becomes more powerful the rate of AI usage will increase even further. Most of the datacenters under construction in the United States are for inference of AI technologies. Right now, most of these are going to existing things like Google services or Meta serving ads, but more of these are becoming language model workflows.
The clusters you hear about with hundreds of thousands of GPUs are those for training and are indicative of a lab's ability to train models at the frontier (reminder: training needs the fast interconnects between as many GPUs as possible). Training infrastructure can be thought of separately from the reality of how AI transforms society and becomes extremely impactful — by being used.
Because cluster buildouts have such a long lead time, it is very likely that we get super powerful AI models where demand vastly outstrips the supply of usage. The narratives that AI CEOs parrot about "AGI in a few years" point to exactly this. An AI working in the background, solving tasks autonomously, especially if built on top of these reasoning models that are driving inference costs up with long-context generation (which is far more expensive than long-context inputs due to the quadratic compute increase during sampling), will be very expensive to run.
For example, OpenAI's o3 pro costs dollars per query on the now famous ARC-AGI task. All of this is to say that restricting open-weight models has a much smaller impact on the diffusion of AI technology than most commentators thought in the last few years. Most thought that training cutting-edge AI models was so expensive and sophisticated that only American companies could do this. It turns out that the cost of training can fall faster than the demand for inference. Yes, releasing open-weight models contributes to the former reduction in cost (and definitely doesn't drive it), but if concerns around openly releasing AI models are driven by IP controls, it is an empty argument.
Attempts to restrict an adversary's ability to impact the world with AI should focus nearly entirely on compute rather than model weights. There is of course a chance that models continue to progress and eventually are so simple that only the weights are needed to deploy a destabilizing application. For now, and in the near future, powerful AI systems require substantial inference compute and even more custom software infrastructure.
For example, Deep Research announced by OpenAI is restricted to 100 queries per month, and only for users on the $200/month pro plan. Where a query to standard ChatGPT costs on the order of $.01, this new system costs on the order of ~$1 (estimated based on monthly cost and OpenAI "losing money on pro subscribers"). We should expect this cost to increase by 100X again in the near future. Inference for a task that costs $1 in compute and takes ~10 minutes to complete implies that OpenAI would be using the equivalent of about 6 GPUs for that one user for those ten minutes (amortizing $2/hour/H100 GPU; a rough sketch of this arithmetic is included at the end of these notes). This is a lot — each GPU is worth ~$40K! In reality, it is likely that multiple users are sharing even more GPUs to hit critical batch sizes, but that's an aside for another time. Using very powerful AI is very expensive. It makes the argument even clearer that open-weight models are not a major risk to be the target of soon-to-be export controls.
We need to keep releasing models openly so that our allies can build on them and create a thriving ecosystem.
What's actually new in open-source?
From the 10,000-foot view, China released an AI model and countless Western companies immediately adopted it so that the AI ecosystem can build on it. This is the system working as intended. A new "open-source" tool in the pipeline showed up and caused prices to be slashed. This is the system working as intended.
When focusing solely on the angle of open vs. closed models, DeepSeek R1's release is new solely as a high watermark for the existence of models closest to the frontier of progress with a permissive license. Developers themselves still aren't really using open models for their clear benefits in adaptability, privacy, etc. This is the hole in the system. While DeepSeek R1 is open weight, the vast majority of the usage is in applications like Perplexity AI or other hosted APIs.
The grand problems facing open-source haven't changed much, but we have the best tools we have ever had to make dents in them. It's time to reinvest and refocus on the positive futures such progress could provide. Western governments should make sure their programs for funding the development of AI extend beyond just the largest companies trying to solidify their moats through capital expenditures. It is the time to invest in open research and public-sector coalitions who are building the Western alternatives to DeepSeek.
Research is the foundation of the next decades of progress. Building openly is the only path forward if there is concern over our future ability to leverage open models from China. For further reading on this topic, see Thom Wolf's post, which is closely related and complementary to mine, or this rebuttal to Dario's post on export controls. For more on open-source AI, see my previous writing on the area. Get full access to Interconnects at www.interconnects.ai/subscribe
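As referenced in the notes above, here is a minimal sketch of the Deep Research GPU estimate, assuming roughly $2 of compute per query (the $200/month plan divided by its 100-query cap, consistent with OpenAI "losing money on pro subscribers"), a ~10 minute task, and an amortized ~$2/hour per H100; all inputs are rough assumptions, not reported figures.

```python
# Back-of-the-envelope estimate of GPU-equivalents for one Deep Research query.
# All inputs are rough assumptions, not official figures.
plan_cost_per_month = 200.0   # $200/month pro plan
queries_per_month = 100       # Deep Research query cap on that plan
cost_per_query = plan_cost_per_month / queries_per_month   # ~$2 of compute per query

task_minutes = 10.0           # rough duration of one Deep Research task
gpu_cost_per_hour = 2.0       # amortized $/hour for one H100

gpu_hours = cost_per_query / gpu_cost_per_hour          # 1.0 GPU-hour per query
gpu_equivalents = gpu_hours / (task_minutes / 60.0)     # ~6 GPUs busy for 10 minutes

print(f"~${cost_per_query:.2f}/query -> {gpu_hours:.1f} GPU-hours"
      f" -> ~{gpu_equivalents:.0f} GPU-equivalents for {task_minutes:.0f} minutes")
```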
    --------  
    16:13
  • Why reasoning models will generalize
    This post is early to accommodate some last minute travel on my end!
The new models trained to express extended chain of thought are going to generalize outside of their breakthrough domains of code and math. The "reasoning" process of language models that we use today is chain of thought reasoning. We ask the model to work step by step because it helps it manage complexity, especially in domains where the answer requires precision across multiple specific tokens. The domains where chain of thought (CoT) is most useful today are code, mathematics, and other "reasoning" tasks. These are the domains that models like o1, R1, Gemini-Thinking, etc. were designed for.
Different intelligences reason in different ways that correspond to how they store and manipulate information. Humans compress a lifetime of experience into our spectacular, low-power brains that draw on past experience almost magically. The words that follow in this blog are also autoregressive, like the output of a language model, but draw on hours and hours of background processing as I converge on this argument.
Language models, on the other hand, are extremely general and do not today have architectures (or use-cases) that continually re-expose them to relevant problems and fold information back in a compressed form. Language models are very large, sophisticated, parametric probability distributions. All of their knowledge and information processing power is stored in the raw weights. Therefore, they need a way of processing information that matches this. Chain of thought is that alignment.
Chain of thought reasoning allows information to be naturally processed in smaller chunks, allowing the large, brute-force probability distribution to work one token at a time. Chain of thought, while allowing more compute per important token, also allows the models to store intermediate information in their context window without needing explicit recurrence.
Recurrence is required for reasoning, and this can happen either in parameter space or in state space. Chains of thought with transformers handle all of this in the state space of the problem. The humans we look at as the most intelligent have embedded information directly in the parameters of their brains that they can draw on.
Here is the only assumption of this piece — chain of thought is a natural fit for language models to "reason" and therefore one should be optimistic about training methods that are designed to enhance it generalizing to many domains. By the end of 2025 we should have ample evidence of this given the pace of technological development.
If the analogies between types of intelligence aren't convincing enough, a far more practical way to view the new style of training is as a method that teaches the model to be better at allocating more compute to harder problems. If the skill is compute allocation, it is fundamental to the models handling a variety of tasks. Today's reasoning models do not solve this perfectly, but they open the door for doing so precisely.
The nature of this coming generalization is not that these models are one size fits all, best in all cases: speed, intelligence, price, etc. There's still no free lunch.
A realistic outcome for reasoning-heavy models in the next 0-3 years is a world where:
* Reasoning-trained models are superhuman on tasks in verifiable domains, like those with initial progress: code, math, etc.
* Reasoning-trained models are substantially better in peak performance than existing autoregressive models in many domains we would not expect, including ones that are not necessarily verifiable.
* Reasoning-trained models are still better in performance on the long tail of tasks, but worse in cost given the high inference costs of long context.
Many of the leading figures in AI have been saying for quite some time that powerful AI is going to be "spikey" when it shows up — meaning that the capabilities and improvements will vary substantially across domains — but encountering this reality is very unintuitive.
Some evidence for generalization of reasoning models already exists. OpenAI has already published multiple safety-oriented research projects with their new reasoning models in Deliberative Alignment: Reasoning Enables Safer Language Models and Trading Inference-Time Compute for Adversarial Robustness. These papers show their new methods can be translated to various safety domains, i.e. model safety policies and jailbreaking. The deliberative alignment paper shows them integrating a softer reward signal into the reasoning training — having a language model check how the safety policies apply to outputs.
An unsurprising quote from the deliberative alignment release related to generalization:
we find that deliberative alignment enables strong generalization to out-of-distribution safety scenarios.
Safety, qualitatively, is very orthogonal to traditional reasoning problems. Safety is very sensitive to the information provided and subtle context, whereas math and coding problems are often about many small, forward processing steps towards a final goal. More behaviors will fit in between those.
This generative verifier for safety is not a ground-truth signal and could theoretically be subject to reward hacking, but that was avoided. Generative verifiers will be crucial to expanding this training to countless domains — they're easy to use and largely a new development (a toy sketch of one is included at the end of these notes). The field of LLM-as-a-judge (and related synthetic data pipelines) only really became stable with models at the level of GPT-4. Reasoning models trained as a judge are a very natural fit because the exact token for a predicted reward or ranking is crucial — CoT is essential. All of the progress here relies on continued progress on both generators and verifiers. o1 et al. were likely trained with mostly explicit, code verifiers. They spawned far more powerful generators, which will enable new types of verifiers. Then, we can train better models (and so on).
On to another example of unexpected performance of new reasoning-trained models. DeepSeek-R1, the new open-weight o1 replication, has been showing up at the top of many random benchmarks, above Claude 3.5 Sonnet, Gemini, and GPT-4o, and alongside o1. Examples include a creative writing and humor leaderboard or the brand-new, extremely challenging benchmark from the Center for AI Safety and Scale AI — Humanity's Last Exam. Oh, and yes, it's best on both accuracy and the new metric "calibration error," which is designed to have the model express its own uncertainty.
Calibration is a long-sought behavior in traditional LMs, and it turns out reasoning training may help with it. A lot of my friends find o1-pro to be clearly the most useful AI model in their daily workflows (one example here and a similar R1 example here). ChatBotArena has all of the new models, from o1 and Gemini-Thinking to R1, as some of the top models these organizations have in the best "normal use" evaluation the AI community has. These reasoning models are definitely absorbing the other lessons learned in post-training across the AI industry.
The explosion of R1 caused arguably the biggest moment of general AI awareness since the original ChatGPT. DeepSeek's app has been the number one overall free app in the U.S., and non-technical users are getting meaningful value out of seeing the reasoning process. What was a niche training process is bringing many more types of benefits than expected.
All of this is just on "day 1" of this technology. Reasoning models are going to proceed at a rate far, far faster than most expect. These models will not be state-of-the-art on every domain, but probably on far more than you expect. Language models are a complex technology and they will never be one size fits all, but the ground is being reshaped under us.
Especially where the standard models match the reasoning models' abilities, you'll be paying way more for the same performance. At the same time, so many domains are going to be open to the dynamic of "if you pay a little bit more, the reasoning model will get you a bit more performance," which will accrue so much value over time.
These are trade-offs that many in the AI industry see at face value. Many ask where Anthropic's reasoning model is, but they may never explicitly have one. Before o1 launched, Claude was already using extra tokens hidden from the user to improve the quality of responses. Anthropic CEO Dario Amodei commented on their approach in a recent interview with Joanna Stern of the WSJ:
To say a little about reasoning models, our perspective is a little different, which is that there's been this whole idea of reasoning models and test-time compute as if they're a totally different way of doing things. That's not our perspective. We see it more as a continuous spectrum — the ability for models to think, reflect on their own thinking, and ultimately produce a result. If you use Sonnet 3.5, sometimes it already does that to some extent. But I think the change we're going to see is a larger-scale use of reinforcement learning, and when you train the model with reinforcement learning, it starts to think and reflect more.
It's not like reasoning or test-time compute — or whatever it's called — is a totally new method. It's more like an emergent property, a consequence of training the model in an outcome-based way at a larger scale. I think that will lead to something that continuously interpolates between reasoning and other tasks, fluidly combining reasoning with everything else models do.
As you've said, we've often focused on making sure using the model is a smooth experience, allowing people to get the most out of it. I think with reasoning models, we may take a similar approach and do something different from what others are doing.
The newest Claude 3.5 Sonnet models are very likely already trained to some extent with RL on verifiable outcomes. Just days before o1 was launched, Claude's behavior of "I'm thinking about that" was the biggest indicator we had of consumer companies trading more compute for better responses.
Anthropic hasn't shifted their strategy here, and you can decide how much weight you want to put in their CEO's recent comments.
Interconnects is a reader-supported publication. Consider becoming a subscriber.
The techniques are here to stay, and it took revolutionary new models to show us that. Like many new technologies, we needed to be shown what was possible, and then it can be folded back into normal experience. o1 was this breakthrough, and the benefits of reasoning training will now expand out into all of the AI products we are using day and night.
To end, I leave you with a quote from the DeepSeek R1 paper, where the authors reflect on their experience with the model(s):
One of the most remarkable aspects of this self-evolution is the emergence of sophisticated behaviors as the test-time computation increases. Behaviors such as reflection—where the model revisits and reevaluates its previous steps—and the exploration of alternative approaches to problem-solving arise spontaneously. These behaviors are not explicitly programmed but instead emerge as a result of the model's interaction with the reinforcement learning environment. This spontaneous development significantly enhances DeepSeek-R1-Zero's reasoning capabilities, enabling it to tackle more challenging tasks with greater efficiency and accuracy.
Thanks to Ross Taylor and Hamish Ivison for discussions that helped inspire this post. Get full access to Interconnects at www.interconnects.ai/subscribe
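To make the "generative verifier" idea above concrete, here is a minimal sketch of using a reasoning model as an LLM judge to produce a reward signal for RL; the query_model stand-in, the prompt format, and the scoring convention are hypothetical illustrations, not the setup from the deliberative alignment paper.

```python
import re

JUDGE_PROMPT = """You are grading a response against a safety policy.
Policy: {policy}
Response: {response}
Think step by step about whether the response complies with the policy,
then end with a line of the form "Score: X" where X is 0 (violates) or 1 (complies)."""

def query_model(prompt: str) -> str:
    # Hypothetical stand-in for a call to a reasoning model; not a real API.
    raise NotImplementedError

def generative_verifier_reward(policy: str, response: str) -> float:
    """Reward derived from a judge model's chain of thought.

    Unlike a verifiable reward (exact match on math or code), this signal is
    itself a model output, so it can in principle be reward hacked.
    """
    judgment = query_model(JUDGE_PROMPT.format(policy=policy, response=response))
    match = re.search(r"Score:\s*([01])", judgment)
    return float(match.group(1)) if match else 0.0
```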
    --------  
    11:37
  • Interviewing OLMo 2 leads: Open secrets of training language models
    We're here to share the story of building our Open Language Models (OLMos) and what we improved to build the OLMo 2 7B/13B model that is competitive with the Llama 3.1 8B model. This is all about building an effective, small language modeling team that can share all it learns with the scientific community. Dirk, Luca, and Kyle are some of the people I learn the most from, and they have more knowledge (and entertainment) to share than we have time for. Some questions were pulled from Twitter, but please comment or get in touch if you want us to cover anything in future episodes!
Main topics:
* Pretraining efficiency and our quest for stability after a not-so-secret failed 70B run early in 2024,
* What the role of OLMo is in the broader AI landscape and how that is, or is not, changing,
* Many little decisions that go into building language models and their teams (with a focus on NOT post-training, given I already talk about that a ton).
Play with the models we build here: playground.allenai.org/
For more history of open language models (OLMos) on Interconnects, see my first post on OLMo, my coverage of OLMoE, OLMo 2, and why I build open language models. If you have more questions or requests, please let us know (especially the researchers out there) and this can be one of N, rather than a one-off celebration.
Listen on Apple Podcasts, Spotify, YouTube, and wherever you get your podcasts. For other Interconnects interviews, go here.
Contacts
Dirk Groeneveld — https://x.com/mechanicaldirk // https://bsky.app/profile/mechanicaldirk.bsky.social
Kyle Lo — https://x.com/kylelostat // https://bsky.app/profile/kylelo.bsky.social
Luca Soldaini — https://twitter.com/soldni // https://bsky.app/profile/soldaini.net
General OLMo contact — [email protected]
Papers / models / codebases discussed
* OLMo 2 paper
* OLMo 1 paper
* OPT models and talk from Susan Zhang
* BLOOM
* Red Pajama V1 Dataset
* Falcon LLM
* C4: Boosting Large-scale Parallel Training Efficiency with C4: A Communication-Driven Approach
* Maximal Update Parametrization (muP) is from Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer
* Spike No More: Stabilizing the Pre-training of Large Language Models
* LLM360: Towards Fully Transparent Open-Source LLMs — Amber model
* EfficientNet
* MegaBlocks
* A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity (Kyle said Hitchhikers)
* Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models
Chapters
Here is a list of major topics covered in the podcast, with timestamps for when the discussion starts:
* [00:00:00] Introduction
* [00:02:45] Early history of the OLMo project
* [00:15:27] The journey to stability
* [00:25:00] The evolving role of OLMo and pretraining research
* [00:29:00] Pretraining Q&A (µP, scaling laws, MoE, etc.)
* [00:40:40] How to think about pretraining data work
* [00:54:30] Role of pre-training vs mid-training vs post-training
* [01:02:19] Release strategy and wrapping up
Transcript
This is generated by AI and lightly edited for clarity. In particular, the per-speaker attribution was poor this time around.
Nathan Lambert [00:00:07]: Hey, welcome back to Interconnects. In this interview, we're bringing one that I've hinted at for a while, which is interviewing some of the other leads on the OLMo team at AI2.
So essentially, this covers the story of OLMo from its early days where we got our compute, kind of our path to stability and some failed runs along the way, the role of OLMo in the broader AI ecosystem, and really just a very long tale of technical details and decision making and considerations that you have when actually training language models that you're trying to have at the frontier of performance relative to peers like Llama, etc. This is a fun one. There's less post-training than normal because this is me interviewing some other co-leads at the Allen Institute for AI. So there's three people in addition to me, which is Dirk Groeneveld, who is the lead of training, handles most of engineering, Kyle Lo, and Luca Soldaini, who are the data leads. So we have a pre-training engineering lead and two data leads with me, who has done a lot of the post-training. This is just a part of the team. And I hope you enjoy this one. We can do more of these, and bear with the fact that I'm still expanding my podcasting tech equipment. But I think the audio is definitely good enough, and enjoy this episode with me, Kyle, Dirk, and Luca.
Hey, everyone. Welcome to the AI2 office. We're finally talking more about some of our OLMo things. Too much work to do to actually get all the... the information we want to share out into the world. So I'm here with Dirk, Kyle, and Luca. We can also talk so people identify your voices, since people are not all on video. Hi, I'm Dirk.
Dirk Groeneveld [00:02:01]: I am the lead of the pre-training part of OLMo.
Kyle Lo: Hi, I'm Kyle. I work on data.
Luca Soldaini [00:02:08]: Hello, Luca. Also work on data with Kyle.
Nathan Lambert [00:02:13]: Okay, so we're kind of going to maybe go through some of the story of OLMo to start. And then just get into as many nerdy details until we get tired of OLMo 2. Which, in my state, this will probably be mostly about pre-training. You can ask me post-training questions as well. But I'm not going to sit here and be like, ask myself questions that I'm not going to answer. Because that is an absolutely ridiculous thing. You can ask me one question. Okay. One question. It's like, why shouldn't you do post-training with all the compute?
Nathan Lambert [00:02:45]: But I wasn't here for when OLMo actually started. So I think it'd be good to tell people, I mean, like, broadly what AI2 was like at the time, what language modeling was like at the time, what may or may not have been risky.
Kyle Lo [00:03:01]: Yeah, you should probably get this.
Dirk Groeneveld [00:03:03]: Yeah, I think it all started in the fall of 2022.
Dirk Groeneveld [00:03:10]: We were talking to AMD at the time about some sort of collaboration. We're scoping out some stuff. And at the time, we wanted to take the Bloom model. And put 300 billion extra tokens in. And we wrote up a proposal and we sent it to AMD and it disappeared into a black hole. And we never heard from them again. And then ChatGPT came out a couple months after that. And suddenly everybody was very excited. And two, maybe one month after that, AMD came back to us and said, now let's do it. And that kicked off a very busy period for us. At least the three of us were involved at the time. Plus some of us. Some more people trying to scope out exactly what the project would be. Putting 300 billion tokens into Bloom wasn't that cool anymore. The field had moved on. So we needed to find something else that would work both for us and for AMD.
Dirk Groeneveld [00:04:07]: And that's exactly what we did. We figured it out.
We figured out who would be on the team, how exactly to do it. We had to get the data from all of that stuff and then started working on it.
Luca Soldaini [00:04:16]: I think it was, let's look it up. And the official birthday of OLMo is, almost, February 2nd, 2023. That's when we had like a big sort of half-day summit workshop, and a bunch of researchers self-organized a long discussion. I'm foreseeing maybe like 40, 50 of us try to scope down a potential language model project at AI2.
Kyle Lo [00:04:48]: Yeah, it was also extremely bottom-up, because we were all like, nobody... it was not on anyone's radar. We were working on, everyone's working on different projects that we had promised for the end of the year. This was very much just like a side gig for us. We had no compute other than these mysterious AMD GPUs that just came. It was like, oh, it's possible. And everyone was just like, yeah, I'll work on this on the side. Let's just start hacking together some stuff.
Nathan Lambert [00:05:14]: How far along the line until you decided on 7B? Like, were these things obvious at the time?
Luca Soldaini [00:05:20]: I think the size of it. This is where Llama's size was. Yeah, we started with seven because seven was the smallest Llama size. This was Llama one. Yeah, Llama one was like first couple months of 2023. Yeah, we started, we started scoping before Llama one. And then when Llama one came out, it made sense to have a configuration that was just sort of close to what they were doing. So it's not too much reinventing. I think seven was.
Dirk Groeneveld [00:05:52]: Yeah, I mean, I think the original scope was recreate Llama one, which would be a 7B at 1.4 trillion tokens. What were we staring at? OPT.
Kyle Lo [00:06:03]: We were staring at OPT also, right? During around that time.
Dirk Groeneveld [00:06:07]: For inspiration. Yeah. And for what not to do in many cases. Was OPT even like in the many tokens regime or was that still like when people did the booms and booms?
Luca Soldaini [00:06:18]: I think OPT and BLOOM were.
Luca Soldaini [00:06:22]: They were not, they were not overtrained. In the end, they were both scoped to Chinchilla, and they both had extensive logs, and so they were very useful because both of them have hundreds of pages of, like, whatever can go wrong during pre-training. Yeah. I mean, OPT was amazing as a resource for figuring out, you know, we knew nothing, so we needed to know what's important. And yeah, I remember there's also avoidance and so on. There's that. It's like Susan has this talk.
Dirk Groeneveld: I'll come load parallels of training OPT and yeah, I think the original ones, I always feel it's kind of a shame because the OPT models are not very good, but, but they were first, like they figured all that stuff out for the first time. I have huge amounts of respect for that.
Nathan Lambert [00:07:11]: And what's the like open source angle thing at the time, or like, had you already identified that there was no open pre-training data sets for these models?
Kyle Lo: There definitely weren't any open pre-training data sets. I think we were basically looking at the Gopher paper that had the most documentation, and then Llama one had enough documentation about what data sources they were using, where we were like, okay, let's try to reconstruct what it was.
And I think roughly around the same time, RedPajama v1, and then shortly after Falcon, the first Falcon, were all kind of concurrent works. But basically: start from, I don't know, grab Common Crawl, grab a bunch of sources, and try our best. Luca Soldaini [00:07:50]: The funny thing, we had conversations like, boy, it would be good if we didn't have to do the data. It would be one fewer thing to do. But at the time, even when Falcon dropped, they released a small preview that wouldn't match the token budget that we wanted for a training run. So it was not even like, you know, it was good work, and, oh, maybe we just switch to this one. And then we quickly realized it was not big enough for the two trillion. So I think it was like, maybe. Yeah. Dirk Groeneveld [00:08:22]: I mean, we did the C4 dataset way before any of this. And so my first idea for how to do data was to just run C4, but on all of Common Crawl, instead of just whatever the most recent one was at the time. And I actually started writing a repo for that, but then ended up not doing it. This is the C5 repo. Yeah. Nathan Lambert: This was C4-style data cleaning practices? Dirk Groeneveld: Yes. That was exactly a re-implementation of C4, and we would run it on slightly different hardware, with more dumps. And that was going to be the entire story, until we found we could do better. Nathan Lambert: Yeah. And for general timelining, I joined pretty much when the OLMo 7B was, I think, mostly done training or wrapping up pre-training, and the instruction tuning at the time was basic SFT with a sprinkle of DPO. Yeah. So I think a lot of that story gets compressed. I'm guessing the actual pre-training happened in the second half of the year, mostly. So it's a lot of prep to get a language modeling system to exist. Yeah. Luca Soldaini [00:09:32]: I think we handed off the v1 of Dolma, the dataset that we used for pre-training, like end of June 2023, I think. Grabbed Common Crawl end of March. Yeah. So all the source acquisition was March, April. Let's say March, and then, yeah, a few months there. Nathan Lambert [00:09:52]: If someone wants to do the same thing today, which is like, we should train a language model, how much faster would it be? Is OLMo actually making that much of a difference? Would it be a week with OLMo stuff now, or would it still take a lot of time to set this up? Luca Soldaini [00:10:07]: I think if you want to train exactly on OLMo data, it's much faster. Training, I think, requires a little bit more finesse. Dirk? Dirk Groeneveld [00:10:23]: If someone gives you a cluster to run on, just figuring out the mechanics of getting your thing to run, setting all the environment variables and having the drivers loaded and so on, it might take you a week or so if you've done that kind of thing before. So that's very different, but you can take a trainer that already works and just use it. Luca Soldaini [00:10:45]: It really depends where you start. If you're spinning up your cluster from scratch, then you acquire the hardware, and that hardware has a burn-in period. So for the first three months stuff will fail, and that has nothing to do with the model itself. It's just that your hardware is also brand new. Dirk Groeneveld [00:11:06]: Yeah.
I mean, I am eternally grateful to AMD for giving us the compute to get started, but it was kind of difficult to run on. Nathan Lambert: What was the exact amount of compute? Like, I think when I arrived, that wasn't even what we were using; it was like Lumi discussions and the original amount... Dirk Groeneveld: The original amount of compute was, uh, 2 million hours on Lumi. Nathan Lambert: So, 2 million GPU hours. Dirk Groeneveld [00:11:29]: We're training way bigger now than that. Yeah. So I think I did the math recently. It's like, on the order of a million hours is, if you do a thousand GPUs concurrently, like 20 days? I don't have that math off the top of my head. But the first end-to-end run for the 7B that we did took 35 days. We can now train that same model again in three days. So things have changed a lot since then. Yeah. Luca Soldaini [00:11:58]: Well, some rough stats for OLMo 2 anyway: the 7B and 13B, just the final runs, were a little bit over 5 million GPU hours combined. And then we have roughly 5 million hours' worth of experiments. Dirk Groeneveld [00:12:15]: These are, uh, A100, H100? That seems too high... it's way too high. Luca Soldaini [00:12:33]: It's like, how do you account for overhead then? Dirk Groeneveld: Oh, combined. Luca Soldaini [00:12:36]: It's the sum of them plus the final training. They're also not all using the new cluster. Dirk Groeneveld [00:12:42]: So, yeah, but I'm just thinking, if it's, let's say conservatively, 7,000 tokens per second, four months on a thousand GPUs. Do you think it's less than that? Nathan Lambert: Okay, let's just go and track those numbers down. I think it's interesting: what is the percentage of improvement? Like, how much of OLMo 2 being better is just from the compute being more stable, just from doing more experiments? That lets you test things like stability and just get the ducks in a row, rather than the data being so much better. It's an impossible question. Luca Soldaini [00:13:20]: The tricky part with using that AMD hardware at the time, specifically that cluster, was that the cluster was being brought online at the same time as we were experimenting with it. So we were helping that cluster get set up. Because of that, there are a lot of things where we had to second-guess ourselves, whether an issue was on our side or the hardware side. Nathan Lambert [00:13:58]: Isn't this always going to be an issue with new GPUs coming into the world? Does Microsoft plug in OpenAI's GPUs and they just work? Luca Soldaini [00:14:06]: I think it was, yeah, it's always tricky. It's a combination of getting new GPUs, at the time AMD was a relatively new vendor, plus the cluster itself being new. So it's like stacking risky things on top of each other. If your cluster is solid and the GPUs are brand new, well, the network is not going to cause issues. But if the cluster is new and the GPUs are new, who knows where the problem sits. Yeah. Nathan Lambert [00:14:44]: We'll go down the whole stability rabbit hole. Dirk, how close are you to a number? Dirk Groeneveld: Five trillion tokens at 7,000 tokens per second, which is what we get for the 7 billion, more or less, over the long run, is only about 200,000 hours on H100.
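As a quick aside, here is that arithmetic as a minimal back-of-the-envelope sketch in Python. The 5 trillion tokens and the 7,000 tokens per second per GPU are the rough figures Dirk quotes above; the 1,000-GPU job size is a hypothetical added purely for illustration.

```python
# Rough GPU-hour arithmetic for a pre-training run (illustrative only).
TOKENS = 5e12                    # ~5 trillion training tokens (Dirk's rough figure)
TOKENS_PER_SEC_PER_GPU = 7_000   # rough per-GPU throughput (Dirk's rough figure)

gpu_seconds = TOKENS / TOKENS_PER_SEC_PER_GPU
gpu_hours = gpu_seconds / 3600
print(f"GPU hours: {gpu_hours:,.0f}")                # ~198,000, i.e. "about 200,000"

# Wall-clock time if that work were spread over a hypothetical 1,000-GPU job.
wall_clock_days = gpu_hours / 1_000 / 24
print(f"Days on 1,000 GPUs: {wall_clock_days:.1f}")  # ~8.3 days at this throughput
```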
So our first estimate was way off. Luca Soldaini [00:15:05]: It was... let me check. I think maybe my memory was wrong. Maybe my thing was... This is why I have this laptop here. Luca Soldaini [00:15:18]: Oh, no, I was misremembering. Okay. My number is 500K. I remember... 500K. Yeah, yeah, yeah. Nathan Lambert [00:15:27]: So it's like, from the first AMD grant of a few million GPU hours to what we have today, it's gone from multiple millions of AMD hours to training a model on over five times the tokens in half the GPU hours. That's right. Yeah. Like, where do we... Dirk Groeneveld: I mean, the biggest one is that the MI250, the AMD GPU that Lumi has, is of the A100 era. It's comparable to an A100 in price and capacity. But now we train on H100s, and they're just... Nathan Lambert: What percentage of tokens... It's just a newer GPU, yeah. What percentage of tokens in OLMo 1 code versus OLMo 2 code are lost to spikes, at, like, a 7B, so a scale that we're reliable on? Dirk Groeneveld: I think it was OLMo 1 losing a considerable amount to spikes. But that's impossible to estimate, because there are so many other differences at the same time between OLMo 1 and OLMo 2. Nathan Lambert: Can you summarize the architecture differences? There's a list in the paper. We don't have to be exhaustive. Dirk Groeneveld: That's going to be a lot of stuff. The biggest difference is the init. So I guess now we're getting into what we actually discovered. Nathan Lambert: These are some audience questions, OLMo 1 and OLMo 2. Finbar, who you might know, specifically asked, like, how did you arrive at an init of N(0, 0.02)? I'm like, I don't know. Dirk Groeneveld: That particular init is the default in Megatron. And the init that we had in OLMo 1 was just trying to be too clever. We stole that init from OpenLM, and they took it from somewhere else, actually. I don't remember what the original source is. Nathan Lambert: What is the actual decision-making on an init that's too clever? You, like, think that you can get a better learning regime by fiddling with something? Dirk Groeneveld: We tried it. We ran it for, you know, 100 billion, 200 billion tokens, and we looked at which one is better. And the scaled init is absolutely better for a long time. So the scaled init is the original, the OLMo 1 init. It works better for a long time. You have to train for a really long time before you see it come apart, 2 trillion tokens for a 7B model, and then things get a little bit dicey. So this is why we used it for OLMo 1, because it looks quite good for a long time. Nathan Lambert: With which of our OLMo models did we figure out that the init had to change? Dirk Groeneveld: Because we did a few through the year. We tried that same init with a 70B model, and that did not work. That model stalled out around 1.3 trillion, 1.4 trillion tokens, something like that, Dirk Groeneveld [00:18:12]: which gets at the heart of the stability question. So we started to think about the stability investigation, and I think that was one of the audience questions, right? How do we even go about the stability investigation? Starting from the point of: we're training the 70B and it's not working anymore, what did we do? The first step was to identify the issues that we see in the metrics and reproduce them in a smaller model. And the two issues we saw were lots of spikes, what we call fast spikes.
The models recover quickly, but the spikes just happen more and more the longer you keep training, and at some point, even the fast spikes kill you. And the other thing was a growth in GradNorm. It seemed very much that the 70B would always start blowing up once the GradNorm got to 0.4. Regardless of what intervention we did, it would get a little bit further, and then as soon as we hit 0.4 GradNorm, it would blow up again. Nathan Lambert: So you lowered the learning rate and it blew up again. Dirk Groeneveld: Unfortunately, yeah. So we would do things like that, increase the batch size, change the learning rate decay, blah, blah, blah, but quickly it gets back to 0.4 and then blows up again. Fortunately, both of those phenomena also appear at the 7B. Even though the 7B trains fine, it has both of those traits. So we decided to focus on those two, because it's too expensive to try all these experiments at 70B, but these are two things we could fix at 7B and then see how it goes. So that was the first step. Now we have a metric where we can pretty quickly, within 12 hours or so, do a run, find out if our numbers are better, and then change something and do it again. And the second component was we took another model that successfully trained, that didn't show these issues, that didn't show the slow GradNorm growth and didn't show the spikes either, and we ablated against that. That was the LLM360 Amber model. They're, like, all very open, so we could take their data, we could take their setup, and look at it in great detail. Dirk Groeneveld [00:20:22]: And we basically tried things one by one, sometimes two by two or so, to not run too many ablations. But we tried things until we got to a stable setup. There were some other insights at the time. I was really into the Spike No More paper, which is all about the magnitude of the embeddings, so we tried some stuff there. Dirk Groeneveld [00:20:48]: Pete Walsh on our team tried some other stuff involving AdamW settings that made things even better. And then we took a lot of inspiration from the Chameleon models, because we were talking to that team on a semi-regular basis and they had a lot of stability issues. They found some solutions that we also tried, and some of them worked for us and some of them didn't, and we took the ones that worked for us. So it's always ablating at the 7B scale until our numbers look super smooth and super nice. Nathan Lambert: How specific do you think these are to our setup? Are these all OLMo-specific insights, or is it just kind of a process you have to walk down? We've heard some of these things before. It's like, for all these developments, you have to do the previous type of thing before you can go bigger or do a more complicated model. Do you think that's actually true, or are there just best configurations at the time? Dirk Groeneveld: I really don't know the answer to that. It's hard. But something I want to do for OLMo 3 is walk back a few of these things and see in retrospect which ones are actually necessary. And in particular, I'm hoping that some of those are not necessary and are costing a bit of performance, you know, just to boost our own efficiency a little bit. Luca Soldaini [00:21:54]: In general, I don't know, you can tell me if this is a useful summary, but it seems like the space of interventions you can take is so big.
And findings from another model are not going to translate perfectly, but the hit rate of finding a good solution is higher if you start from that model and explore around it, versus trying to explore the full space of possible solutions. Yeah. And then some things will not pan out once you try to rerun them on your setup, and I don't think that's necessarily an indication of anything. You know, we can mistakenly reimplement their thing, not in the way it's supposed to be. It's more like some things translate, some things don't. But it's a good starting point. Dirk Groeneveld [00:22:55]: Yeah. I mean, we are a fairly conservative bunch with this, right? Because even the 7B runs are actually kind of expensive. So we make small changes from a known baseline, by and large. Yeah. I mean, everyone does. Nathan Lambert: Yeah. And risk is pretty obvious when you look at the cost numbers and, like, who you are trying to beat or not. And it's like, we are trying to build a platform people can build on. And it's much better to keep making small progress than it is to go for glory runs and just hope that works. I think both work. The more compute you have, the bigger the distribution of investments you can have, but it's not that surprising. Dirk Groeneveld: I mean, I hope that we can be a lab that is a little bit more risk tolerant than others. For one thing, we don't have Meta's resources, so we should be a little bit more aggressive. You know, it would make me much more nervous if I had to bet a billion dollars on our next run than the amounts that we can bet. So we can try a little bit more. I also feel, and I hope that our management agrees with this, that if we're always safe, if every one of our runs works, that means we're not trying hard enough, right? We have to occasionally crash and burn. Nathan Lambert: I think there's a few every year that you should crash and burn. I think these crash and burns at the big scale get a lot of attention from media and stuff. But it's like, what do you expect them to do? You're walking up a line, and you might as well try to take three steps at once every so often. Exactly. But I do agree. I think that's a cultural thing that we're trying to navigate. It's like, how do we do more interesting stuff and not just fall into the trap of being the best open model? No one else is doing this. Like, okay, you could do that for a while, but it's not as motivating. Dirk Groeneveld: And it's not just because it's more interesting to do that, but it's also just the fastest way to make a better model. The fastest way to calibrate your risk tolerance properly is that you sometimes have to be over it. Yeah. It's inevitable. Nathan Lambert [00:25:05]: Any follow-ups on risk? Kyle Lo: Yeah. I'm thinking now, because the 70B crash was so sad. Yeah. And I'm wondering, if you look back on it now, it's like, that was the greatest thing for us. We learned so much from that. Dirk Groeneveld [00:25:19]: It was very important to OLMo 2. I do, a little bit. So, I mean, we felt terrible, right? Like, this was an awful time for us. I was like, I'm done. Let's get good questions. No, we were the training team that couldn't train at all. I felt so bad. But the work we did following up is some of the proudest I've been about the stuff I've done in my time at AI2. Yeah. Luca Soldaini [00:25:47]: In general, my thinking about the role of OLMo sort of keeps evolving, right? It was very natural to have OLMo as these models designed to help others do research on language models.
That's how we initially, it was a big part of OLMo 1. You just release all the components because it's important to have these tools available to everyone. To study language models. And I think we serve that community well. One thing that it's, I hope we can do with OLMo more is that there are like some interesting aspects of language models. Interesting capability, interesting architectural decisions that for a myriad of reasons, they sort of get overlooked in like say a company or like in a framework where, you know, you have certain constraints in your model. But it's still there. They are important. And there are questions around like what a model should be able to do, how it should operate, and things like that. But I think we can take a role where like we have in general this recipe that both enables research and language model and for like subset of model capabilities that we think are fundamental. No one is touching. It's our space to do work there. I think the prime example that I keep repeating these days is what we did with MOLMo andLuca Soldaini [00:27:25]: vision team was mostly working on it. And MOLMo is very good vision language model in general. It benchmarks up there. It's not the best, but it benchmarks up there with open models. And then it has this like this interesting point. Pointing capability that no other vision language model has. And that pointing capability is, turns out, is fundamental for a lot of language models and robotics that you want to build. It's a core capability the same way that a text model should have long context. And it was cool to make, to sort of emphasize that of like, oh, we have the specific capabilities that would enable all these applications. And so more people should work on like the specific aspects. So I think that's a cool way to like work on things that folks haven't had a chance to touch on yet.Nathan Lambert [00:28:24]: I think it's like trying to parse out why this type of situation could happen is not easy. Because you generally, everybody would want to do this. Like everybody wants to come up with a new capability that expands the scope of what X type of AI model can do. And I think it's most of like probably goes down to the culture of where people have space. To think about stuff in a more interesting way. It's like, because obviously everyone wants to have breakthroughs and open AI and Anthropic that copy. But it's like sitting at a boundary between doing just the same stuff and doing more researchy stuff that you need to have. I have more architecture questions. One is MUP. Multiple people are asking about it. I still don't really intuitively know what it is. But are we going to use this?Dirk Groeneveld We have done a fair bit of work into it. And it hasn't worked for us yet.Nathan Lambert Can you explain what it is?Dirk Groeneveld MUP is mainly a way of setting the learning rate, but also some other hyperparameters. By training only small models and then having a guarantee or at least a pretty good idea that it will work also for larger models.Dirk Groeneveld [00:29:33]: We have implemented this. We've experimented with it. So far in our setup, it works across model sizes. So the learning rate that it predicts you should use, it doesn't predict the learning. It just gives you one learning rate. Basically, the good learning rate for the small model is also the good learning rate for the big model. That works if we change the size of the model. It does not so far work if we change the length of the training run. 
And that's why we haven't been using it so far. Like, number of tokens? Yeah, or longer. If we double the length of the training run, or we 10x the length of the training run, the optimal learning rate is different in our setup. Dirk Groeneveld [00:30:21]: It seems like this might be a bug. It should work, but it doesn't. Nathan Lambert: And the positive gain is just better scaling, because you don't have to fiddle with the settings. You know you're getting the right learning rate, which is a crucial hyperparameter. Dirk Groeneveld: Yeah. It's just a better way of setting the learning rate. And it works for a few other hyperparameters too. Nathan Lambert: But there are other open models that use this, explicitly, pretty sure. I mean, open-weights models. Are, like, Llama and such using this? Llama does not, I think. But I don't know for sure. We'll always see with the next iteration. Even Llama 3 felt like they were still building their org and their infrastructure so fast. It's just like, get in what you can get in, and there will be more models in the future. Dirk Groeneveld: Yeah. I mean, MUP is a shortcut, right? For many settings where MUP wouldn't work, you have to just establish scaling laws and predict what the value will be. You could do the same thing for the learning rate; MUP just lets you do this with even fewer runs. You know, you don't even have to extrapolate anything anymore. You just use MUP and your setting will work. That's the idea. Dirk Groeneveld [00:31:29]: But you kind of already need a scaling law setup anyway for the things that MUP doesn't work for, you know, like architecture changes and so on. So in that sense, it's not that important. It's still pretty important, and we're going to keep trying to make it work for us, maybe just find the bug. But it's not absolutely critical. Nathan Lambert: How do scaling laws actually tell you the way to change, like, the width? Do they actually tell you the change in width or the depth, like the proportions of the network relative to the size? Like, what are the actual output variables? Or how are you controlling the architecture you're going to use in the scaling laws? Like, I know what it's trying to predict, the accuracy, but is that on a set architecture? Dirk Groeneveld: You would usually vary one thing. Dirk Groeneveld [00:32:17]: Like, you don't vary everything. You establish how it scales with size. And you set your size according to a certain formula. Like, you might say, I will go 1.4x the depth and 1.4x the width, so I have a roughly 2x bigger model. And you do that a few times and you draw it on a graph. Then you change your architecture, you do it again, you draw a different graph. You lay them over each other, and you hope that the lines don't cross, and one of them is clearly better than the other. Nathan Lambert: Yeah. I definitely have noticed that there are the obvious things in architecture design and the not-obvious things. It's like, you obviously make the model bigger, but there's the subtlety of, like, how tall versus wide. I think we were talking about a model that's much deeper than our model architectures. And it's just like, I'm around these things and I don't have an intuition for whether tall or wide is better. And I think it's like, whatever works. Dirk Groeneveld: There are some early results from Google, I think, I think they're called EfficientNet or something, that suggest that over a wide range, it doesn't matter whether you go wide or deep.
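To make that overlay comparison concrete, here is a minimal sketch of the procedure Dirk describes: train a small ladder of model sizes for each architecture variant, fit a simple power law to each ladder, and check whether the fitted curves cross before the scale you care about. All of the sizes and loss numbers below are made up for illustration, and this is not AI2's actual scaling-law tooling.

```python
import numpy as np

def fit_power_law(n_params, losses):
    """Fit loss ~= a * N^(-b) via a linear fit in log-log space."""
    slope, intercept = np.polyfit(np.log(n_params), np.log(losses), 1)
    return np.exp(intercept), -slope  # a, b

# Hypothetical ablation ladders: model sizes (parameters) and final eval losses.
sizes = np.array([190e6, 370e6, 760e6, 1.3e9])
loss_baseline = np.array([3.20, 3.05, 2.92, 2.83])   # made-up numbers
loss_variant = np.array([3.23, 3.06, 2.90, 2.79])    # made-up numbers

a0, b0 = fit_power_law(sizes, loss_baseline)
a1, b1 = fit_power_law(sizes, loss_variant)

# Extrapolate both fits to a target size (say 7B) and compare predictions.
target = 7e9
print(f"baseline: loss ~ {a0:.2f} * N^-{b0:.3f}, predicted at 7B: {a0 * target ** -b0:.3f}")
print(f"variant:  loss ~ {a1:.2f} * N^-{b1:.3f}, predicted at 7B: {a1 * target ** -b1:.3f}")
# If the two fitted lines cross inside the size range you care about,
# the ablation is inconclusive and you need more (or bigger) runs.
```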
It's not that surprising. That's a pretty old result now. We're following up on a particular result right now, actually. So OLMo 2 is a 7B and a 13B, right? But there also was a 1B that didn't work very well, and we're trying to find out why. One thing about that model was it was pretty wide and not very deep, so we're checking whether that is the reason why it wasn't very good. So we're sort of in the middle of double-checking this assumption that it doesn't really matter whether you go wide or deep. Nathan Lambert: Yeah, that makes sense. I think that is something that doesn't matter to most people, but they're probably very interested in it, just like how they have these blocks and how do they decide. And it's like, just one of us decides. Dirk Groeneveld: And it's like, eh, seems right. There are other concerns, right? So we train with FSDP, with ZeRO-3 sharding, so we can try to choose these sizes such that they utilize the GPU in the optimal way. Dirk Groeneveld [00:34:29]: Which has nothing to do with the sort of abstract training dynamics. It's just the practicality of getting this thing into 80 gigabytes of memory. So then those concerns might take over. There's other stuff, like all your tensor dimensions need to be multiples of 64, 128, things like that. GPU math stuff. Yeah, exactly. Luca Soldaini [00:34:53]: It's really hard to argue against things that are practically making you run fast. Because it means that if I find something that is 20% faster, your big run is 20% faster, all the experimental cycles are 20% faster. So it's not very glamorous, but everyone is really happy when we find one of these. Like, oh, this is a shortcut. Dirk Groeneveld [00:35:16]: I find it super glamorous. I mean, when do you ever have such a clear sign of impact, that you can say, I wrote this thing and it is now 20% faster? No, the impact is very good. Yes. Nathan Lambert: The numbers you're changing are not necessarily glamorous. It's just detailed stuff. Kyle Lo [00:35:34]: I also think the experimental cycle thing is probably the biggest thing for me. What we're seeing consistently is the more experiments you run for a particular idea, the more likely it is to just work out. It's just a function of trying more things. Nathan Lambert [00:35:47]: It seems like in pre-training, there are very few, like, you-just-get-the-idea moments. I mean, well, I see it in post-training more. Literally, like, we had a meeting with John Schulman. He was like, everyone at the leading labs trains RL like this. And we got, like, a three-month head start on one step. But in pre-training, all that stuff, I think, has evaporated. Kyle Lo [00:36:05]: The human intuition piece is just gone. I think once you do v0, you can kind of do everything with intuition. It's like, oh, look at the data. This kind of makes sense. This seems reasonable. And then after you get to, like, v2 of something, it starts becoming really hard to make sense of what is good for a language model or not. So you kind of just need to try a bunch of stuff. Dirk Groeneveld [00:36:29]: And then there comes a game of stacking improvements that are worth 2% to 5% each. Nathan Lambert: I think it's very compounding; at least all the math works out over a year. I think I want to ask about MOEs as well, if you have a different thing you want to say. But it's mostly, like, it seems like we have OLMoE, which, if you look at the plots in the paper, it's like this MOE architecture beats all of our other things on efficiency.
But it seems like we had a path we needed to go down to make sure dense works really well and get all these improvements. And then you have to, like, feed back in. And you, like, merge the MOE streams. We have DeepSeek. We have Minimax. There's countless other MOEs that get really high eval scores. Like, they're not as easy to do research with because they have tons of total parameters. And people need bigger clusters to fine-tune them, blah, blah, blah. But it's like, is MOE something that you think we just need to do to make better models?Dirk Groeneveld Well, it's a complicated question, and we haven't quite answered it yet for ourselves.Dirk Groeneveld [00:37:34]: We did investigate doing a bigger MOE. And we found that the engineering is somewhat difficult. And at the time, we came to the conclusion that we could do that engineering, but then who's going to run that thing later? They also have to have a team of engineers on top of it to make sure they can train this.Nathan Lambert What does the engineering look like? It's not, like, CUDA-level kernels. It's how you distribute parameters?Dirk Groeneveld It's a little bit like... It's a little bit CUDA-level kernels in that... If Mega Blocks by itself isn't enough for you, then it gets really complicated. And we ran into that situation where if it had to be significantly bigger than what we did, it just got too complicated.Luca Soldaini [00:38:22]: There is an inference. These very big models that really get advantages by... If you tailor them to, like, where you're going to do inference with them. So if you're a big company, you start thinking about, like, how to batch request, how to, like, serve the model. But if we could do it ourselves for the place where we're running, but then you start thinking, like, oh, folks who want to use their model in their hardware, they're better served by advanced model than also redoing this engineering on top. Like, there is, I think, a clear advantage if you are... Also providing an API to an MOE. Yeah. Very clear cut.Dirk Groeneveld [00:39:10]: It depends on how we think of the product of ALMO. And the number one is still it's an item to be researched. So other people need to be able to train on it and to modify it and so on. And that is just much easier if you have a dense model. Yeah. If you think of it as something that gets put into a product. And people will run tons of issues. But if you have a lot of inference on and you only really care about the final score that it gets, then maybe the MOE starts making a lot more sense again.Nathan Lambert Yeah. That's a good answer. I think it's, like, I think people can fill in the blanks of, like, what we may or may not do.Luca Soldaini [00:39:53]: And I mean... I mean, like, different, like, I'm curious, like, what, like, folks at Llama, the Llama team think about MOE.Nathan Lambert [00:40:03]: If the Meta AI exists, they're 100% going to do an MOE.Luca Soldaini [00:40:06]: I mean, it's interesting, right? It's, like, if they're serving few, if they're expecting that the Llama users are going to be, in fact, one of the better smalls are few large companies that can figure out inference, then MOE makes sense. But if they're thinking about more, like, this model that wants to, it's great if it's adopted by a million developers, large and small, then, you know, they're still going to reach a lot of dense model. Yeah. Exactly. 
That development is so easy, so much easier for people to set up their own inference with a dense model.Nathan Lambert [00:40:40]: Yeah. I think we've gone surprisingly long without asking about data. It's, like, how much more, is it just an infinite hill to climb on data? It's finding good data and filtering bad?Kyle Lo [00:40:53]: I mean, I think it's an infinite hill to the extent to which everything else is also, and you can kind of keep improving, right? But yeah, it's the main threads constantly are. Got to get more data, because if you're working with larger pools of data that you can't actually get easily new data that's not in your distribution, it's probably interesting to study how that adds in. And you have more to work from. So if you have, like, a strict quality filter, you can still get your high token yield if you start with a much larger pool and filter down. So getting more data is really, really critical, especially if you can target specific pockets that you think is missing. You can always keep iterating on better filters. Understanding how those filters affect performance. And everything kind of interacts with each other. Like, safety filters interact with quality filters, interact with deduplication, interact, like, all these together. So there's an infinite, even ordering, search space between these operations. So keep throwing more things at it.Luca Soldaini [00:41:53]: Yeah, it's very much just stacking small improvements. Yeah, shots on goal. I think the way it looks is, like, it's... For each... Now that we have, like, these multiple stages of pre-training, we think about, like, what kind of improvement you want to get from data at all the various stages. Like, clearly, the improvement you want to get from data you put at the end of training is different than the improvement that you want to see at the beginning. It comes with a different set of requirements. One thing that is really useful is... Intuitions are always often wrong. But one thing that it's worth spending time on is figure out... If you have a data ablation idea, what is the fastest way to disprove it, which requires a little bit of experimental design. And then, yeah, you've got to fiddle with, like, especially, you know, when you do the first version so that you can take a very... It's very easy to measure improvements. And then as you start thinking, like, refined version, then, you know, you've got to think of, like, how you measure your improvements or so. But, yeah, it's... There's no, like, big... After you're done, you know, the basic stuff, your V1 is done. There's never, like, a big, like, thread of, like, this is the one data thing. It's more, like, stacking your Lego bricks to get to a better model.Nathan Lambert [00:43:18]: Do you think you can iterate faster on, like, end of pre-training, whatever you want to call it, like, highest quality bit training and the only data? Yeah. Have you, like, started that recently?Luca Soldaini [00:43:28]: I think it depends on the... What we're getting, you know... We... We need a little bit more evidence of this, but it depends on the role of data. Like, it's very much... The reason why we started doing mid-training at all is because we were interested in having base models be primed with certain capabilities that we didn't get during the long pre-training phase. And for those, it's really easy to iterate on new data sources that would improve on those capabilities at the end, pre-trained. 
But during, like, the pre-training phase, why not the important aspect that we think about is, like, efficiency of your data is, you know, if there is a version of your data that is where train on it and the model gets to performance X on 20% faster, it means that you can train 20% longer, right? Or run more experiments. Or run more experiments. And so... But for those, it's, like, you know, it's... In some cases, you can use mid-training as, like, a proxy for this. In other cases, it doesn't quite make sense, so you have to come up with, like, maybe experiments through scaling laws, maybe experiments through some other technique. But yeah, it really depends on, like, what role a data set plays into, like, the various stages of pre-training.Nathan Lambert [00:44:53]: So it seems like, like, compared to Dolma 1, which is, like, do the thing, it's all targeted abilities. It's, like, we want to be better at things. We put people on this. It's, like, targeted abilities or where we think we can get a lot of data.Kyle Lo [00:45:05]: Like, a certain data source that hasn't been mined for stuff. Yeah. Yeah. We have to be opportunistic because it's so hard to get data. And for us, especially if we want to be open with the data, it's, like, we have to also do it by due diligence. Like, we're going to study this data, put all this effort in, and we're still going to be able to share it with everyone. So...Nathan Lambert [00:45:22]: If you were in a lab that didn't release data, do you think you could make more progress on it? Like, how, like, how much is that actually?Kyle Lo [00:45:27]: Oh, yeah. Oh, my God. Such a time sink.Luca Soldaini [00:45:31]: I mean, it's, like, it's a little bit of a mistake that we put in. Yeah. Like, and this is not even, like, doing, you know, getting data that managed to not legal, right? You could form partnership. You know, you have people knocking at our door all the time saying that you want to buy this data set. And they're, like,Nathan Lambert [00:45:48]: I've been contacted by one ML owner to try to facilitate a data deal.Luca Soldaini [00:45:52]: Oh, yeah. Twitter. Oh, my God. But only the first, the first follow-up is, like, are you cool if we release the data? Of course, they're not. Yeah. So, it's, like, it's, it's, even, like, there's plenty of data that you could acquire from people, but then you can't release it. So, that's, that's a complication to, to progress.Nathan Lambert [00:46:15]: Yeah. This is more of a self-question, but, like, how much do you think mid-training should be, like, a philosophical shift in how we organize teams? Because it's very easy to do. I mean, we've already consolidated, like, our training and data to base, which is not surprising. But this is mostly hypothesizing on what other people do. It's, like, how close do you think this kind of end of pre-training to post-training handoff should actually be?Kyle Lo [00:46:40]: I think it's, it makes sense as a thing if, I think these things are, in theory, arbitrary, but you can think of, like, in the extreme, if you had a perfectly oiled machine, you have a very smooth transition between pre-training to mid-training to post-training, and it's actually, there's no boundaries. Like, that's, like, a theoretical. You can probably squeeze a ton of performance by smoothing that out. But in real world, stuff, stuff is messy. So the real world is your three trillion tokens into your base model run, and then you signed a new data deal. 
You got to do something with this, and you're going to undo your training one. Well, you got to figure out something. So maybe that's mid-training, right? Mid-training is when you have an opportunistic need for something, or you're training something and someone catches a bug, which happens all the time, like a data bug or some training bug, and you're like, oh, I had to patch it. So then there's the shift fundamentally. You got to know how to deal with this. So just because these things aren't, these large training runs aren't super repeatable, and they take so much time that the world state changes all the time, you always need some strategy on how to deal with, oh, I'm near the end of pre-training versus I'm near the beginning of pre-training versus... Yeah.Nathan Lambert [00:47:47]: It's like, we're obviously trying to solve long context, so this fits right into this. It's like, we're going to do this thing. Does it go, where does it go? Some people do it in post-training. Yeah. There's some component during pre-training.Kyle Lo [00:48:00]: It's kind of just like, you have to follow a few recipes and figure out what works for your team. Yeah. And so much of it is just, if it's expensive, try to push it off as much as possible. Because if it's risky, push it off as much as possible. If you can intervene to get the same result much later, huge win. You can try a bunch more things. If you have to intervene because it's some core thing that has to be baked into pre-training time, you're kind of... It's a sad space to be in. But then that's the thing where you have to intervene. That's the pre-training data.Dirk Groeneveld [00:48:29]: There's a big question that I'd love to get an answer to, but I don't even really know how to think about it. But the question is, what makes a pre-training model a good candidate for mid-training fine-tuning? Because all we really try to do is we try to maximize our metrics, but we don't really know that those metrics are what makes a good step zero for post-training.Nathan Lambert I think a relevant thing, I don't even know if I've told you this, but I don't know how to take action on this, is we got advice that we have the multiple stages of post-training. In this instruction tune phase, we got advice that's like, eh, it could be a little broken. You can have some crap in there. It'll get fixed later on. And it's like, why is that okay?Nathan Lambert [00:49:14]: It might be the same thing in pre-training. It's like, you want to get in the right... It's more important to get in the right ballpark than the right exact number. Yeah.Luca Soldaini [00:49:21]: It feels like it's more about not how to make a good model for post-training. But what to avoid so you don't have a bad model post-training. Yeah.Nathan Lambert [00:49:33]: There's a whole other question, which is how to make a base model that's easy to fine-tune in general, versus one that, if with the right finagling, can get the absolute best numbers. Which I think, for OLMo, would be really great to be like, here's a super stable platform. A lot of people have complained about specifically That Llama Instruct, it's hard to fine-tune. Which, after most of the post-training. Because this is where people at companies start. They're like, this is the best open-weight model. I want to add a little thing in it. And a lot of people have difficulty in fine-tuning it. It's different at the base, because most people can't do this full instruct thing. 
But for researchers, having a stable platform at base is way more valuable.Kyle Lo [00:50:12]: There's an interesting... About this, like, what makes a base model a good base model. There's this interesting, I guess, debate that we've had a bunch of times. We've also had with other people. Which is, it seems like there's like two hypotheses on what the role of this... How do you think about data as an effects-based model behavior? There's one hypothesis, which is, you need quality data so that you don't get any spikes. You have stable training. You have no bugs. And once you pass that level of quality, as diverse as possible. It's just about an init to the model, so that it can go in literally any direction. And so, diversity is the next. That's one hypothesis. The other one is, it's all domain effects. The only reason why... Like, you can just keep climbing. There's a notion of quality. But you... And you can keep getting more and more and more as long as you're very clear about what target domain or target application you are. You just keep getting closer and closer. Well, there's a lot of suite learning. Yeah. Well, this goes into, like, the continue pushing. I just like... It's just domain effects all the way down. If you're only evaluating on this particular stuff, you can always get your base model to be better for that. Just keep climbing on it to get it more and more similar. As opposed... And, like, think about, like, I care about this application, this suite of applications, all the way through. From base model... Can you not kind of have both? I feel like I'm confused with how, like, actual generalization fits into this. It's... It's... It's... It's competing ideologies in terms of, like, if you believe in the first one, then you're all in on diverse data acquisitions. And how you set up your team. Yep. You're all in on efficiency and stability for your pre-training. And then you just get as much different data as possible. And you're post-training all the time. If you believe in the latter one, you solve backwards from, this is what I want the model to do. And I make all the changes everywhere to try to squeeze performance out of this class of problem. In the big... In the data, in the bit-training data, bit-training data, et cetera.Nathan Lambert [00:52:01]: How important do you think the actual, like, multi-tag category of every data document is? Like, know that someone... Like, that these people have really advanced tagging of all their pre-training documents. Like, do you... Like, does it essentially say, like, doing that and choosing them? Which is, like, a very much, like, crafting a, like, recipe for your pre-training versus, like, just good numbers. So, like, just get a good classifier and roll with it.Kyle Lo [00:52:27]: We have tags. That's fine.Luca Soldaini [00:52:31]: The tags are useful even if you get this idea of, like, let's use as much as possible. You know, diversity is important. A lot of web data comes with absolutely no useful metadata. You have, like, URLs. URL is very, like, you have to do things on top of it to make your URL useful. It doesn't add much. So, the more you have in terms of, like, categories, metadata information, you can start using this as a tool to try extra technique on it. Maybe it is extra technique to mix your data in a certain way. Maybe it's filtering out things. Maybe it's, like, designing benchmarks. Try to correlate with those. Yeah. Otherwise, it just seems to have this giant bucket with maybe, like, one quality knob. 
And it's, like, it's very hard to make progress if all you can adjust is, like, one number we cut for quality here. So, it's, I'm not surprised that, you know, the big labs, they almost have these tags. I want to know how they use them. That's, like, the part that's not good. That's the part that's not good. Yeah.Kyle Lo [00:53:51]: But it's also not just you have more levers to pull and then, you know, the more things you can try, the better. It's also you want tags that are actionable, right? So, like, if you had a sensible notion of a tag and you realize, oh, more of this data as you keep adding more of this lever, performance keeps going up. At some point, you might be, like, we're out of that data. We need to go get more of that. Without that tag, you want that tag to be something that's understandable so you can go and negotiate another deal, do synthetic generation, et cetera, of that type of data.Nathan Lambert [00:54:13]: Do you think most of the synthetic data gen, is for very specific things at pre-training? I mean, it kind of has to be. Probably, yeah.Kyle Lo [00:54:25]: You can't just be, like, oh, it's generative data. Like, that's not something, I don't know what that procedure is.Luca Soldaini [00:54:30]: It's probably to prime the model to whatever you need during post-training. Like, you know, we've seen, like, normally with math, it's much better if your model has an elementary knowledge of math to, like, improve on that. It's quite the same with everything that it's, like, oh, I want to do RL on this. If the model is completely random on it, you're going to have a very hard time.Nathan Lambert [00:54:52]: Yeah, it's, like, I guess a good transition. It's, like, what do you three think post-training is, should be, or, like, is not doing?Kyle Lo [00:55:02]: It's elicitation.Nathan Lambert I'm coming around to this view that it seems that you can extract abilities from the model.I think it's totally elicitation. Like, the Hitchhiker's Guide to Data paper from Google, yeah, that one was very, that one had, like, a very specific experiment. But it seemed like that was pretty strong evidence towards it. It's, like, you filter out all of this type of data, you literally can't fine-tune that model. You can never recover that. There was a history detection, right?Nathan Lambert [00:55:28]: I think if you do more flops, you potentially can. I mean, it's obvious, like, we're not talking about, like, O1 stuff things here. But, like, there are even datasets that have, like, 15 million math-only instructions. Are they going to be able to really start doing a ton of math? At some point, yes. Yeah. But I think that most of it, or it's almost easier to operate. I mean, it's just like, assume that capabilities are in this model and are post-training to get it out.Luca Soldaini [00:55:53]: Sometimes there's this very large set of, like, things that you do in pre-training because you have a sense of, like, how they play an application. I think one day it's, like, very obvious. It's like, code model, you want to do, you want them to do completion, you're going to add, fill in the middle of loss, maybe at the beginning of pre-training. It's like, oh, then I can play my entire pipeline around like that. So it's all about... So far, it seems all about that. I don't think we have cracked a good recipe to do the same for things that are not capabilities, but they're, like, recalling facts. Oh, yeah. Or, like, long-term knowledge.Nathan Lambert [00:56:29]: Yeah. 
It's, like, all of us, like, all know, or, like, I don't know, at least people out there have MLMU numbers that go up in X stage. Like, instruction tuning, boosting MLMU, I'm like, what are you putting in there?Dirk Groeneveld [00:56:42]: What do you think of mid-training then? Is that a manifestation or... Mid-training? I think it's still...Kyle Lo [00:56:47]: I think it's still positive knowledge. I think mid-training is, it's just, it's still pre-training, but with strong domain effects. It's just smoothing out the boundary between, you have a very, very sharp distribution shift when you do post-training, and we know from, like, kind of ML101 from the past five, six years, that smooth, smoothing out, helping, like, transition between major domain shifts helps. But we don't have a clear example of where, like, it helps with specific knowledge acquisition. Yes. For them, we don't know how to do it. But for, like, you that are really easy to evaluate, things that are really big progress on, it's like, yeah, smooth this out.Nathan Lambert [00:57:30]: So, like, why is post-training important to the release site? Some of you guys came around to, like, post-training being important for getting traction later on. Is that just, like, an ML ecosystem, how it works?Dirk Groeneveld Oh, I mean, the base model is kind of useless, right? Yeah. There's only so many next tokens you need to know about. Yeah.Luca Soldaini [00:57:50]: But it's like, you know, we've seen papers that use all the research, for example, where the idea for that research only came by comparing base model with, you know, instruction team model, like, the one where folks, they were involved around, like, certain pattern of speech and OLMo 1. Where do they come from? Do they come from pre-training? Do they come from post-training? And, like, even if you just want to do research, it's kind of useful to being able to compare side by side. So it feels wrong to put a model out that it, like, cuts sort of the problem that you can study in half until you have the post-training ready. And it's useful to have all in one package so you can use it right away.Kyle Lo [00:58:40]: Post-training is just, like, a really, really long eval loop. Yeah. And that's a lot like, oh, base model, you know, a few shots, a few shots on some benchmarks, like, no, no, no. We eval it by post-training it and then eval it in post-training.Nathan Lambert [00:58:54]: Yeah. I mean, to some extent, it is kind of true. I mean, that's how we should think about it.Dirk Groeneveld [00:58:59]: If we could do that cheaply, we would totally hill climb on that metric.Kyle Lo I think that's the metric. Because if base model is the good in it for the post-training, which is the model people actually want to use, then we evaluate it on its own. And on its status as a good in it.Nathan Lambert [00:59:16]: Yeah. So it's like, how do we... And then the question is, like, how important do you think research for post-training on the specific checkpoint is? It's like, how important is genealogy versus, like, general recipes? Because I think we... I openly think we under-index on using one model. Because much like the path to stability, which is a eight to ten month really specific thing, I'm guessing if you're really just, like, in a narrower regime, you can just keep kind of turning these little things. Yeah. Hopefully at some point we can do better with new models. Yeah.Nathan Lambert [00:59:52]: Okay. We're kind of going to some wrap-up things. 
How do you think about release decisions? Like, should AI2 release everything that we ever tried? Or is it, like, when should we actually get models out the door?Dirk Groeneveld I mean, I would love to do that, actually. Especially the failed runs. You know, like, where else could you get a repository of failed runs? Yeah. I mean, I think it's just a matter of giving other people the possibility of looking into these failed runs and finding out exactly when they failed. In practice, that's super difficult. Because just releasing something is hard. You know, you need to upload the checkpoints and translate them in a different format. And you have to describe what you were even trying in some way that makes sense to people outside of the org. Give them access to the weights and biases and to the logs. And it's just a lot of work. And there's always something else that seems more pressing than that.Nathan Lambert Seems like a scaling. Like, how much we can share is capped by how we can scale our org. Which, like, we're not going to have a complicated management hierarchy or, like, an entire org that is just support. And everything you upload, you build as a support burden. It's like, literally, we just have seen the envelope grow, grow, grow. It's like, more people use our things, you get, like, boring support. Like, people want to use it. That's the cost of it.Dirk Groeneveld I guess it's a great problem to have. People want to use it. People want to use us.Luca Soldaini [01:01:15]: And it's funny. To make a checkpoint where, like, some are very useful, you need the person who was involved. You have to fill, right? You need the person who was involved in it to sort of pour their knowledge into a format that then, you know, people can consume outside, right? Otherwise, you know, we would just open up our S3 bucket, the checkpoint, and it would be, like, utterly useless. Because what if you wanted to know more of the parameters, so, like, as long as we optimize for release, then we have the bandwidth to provide, like, the support around that. If people want the 70B fail run enough, you know, I'm sure we can release it.Nathan Lambert [01:01:57]: It seems like it's just finding the right medium to release things. Like, I think long-time people reports are really good for the stuff that we do, because it just puts everything in one place for people, and it almost makes on-demand easier in the future. Whereas, like, we could just drip and drag models out all the time, but, like, that's not something we can't do. It's just, like, not... In terms of making progress in things that are easy to build on, it's probably just not worth it.Kyle Lo [01:02:19]: In fact, there's even a cost to it, right? The big example here is we had a release of OLMo 1 0724, or July 1. Yeah. I think research, using that for research, that has been probably one of the tougher models, because it didn't come with a blog post, it didn't come with, like, some docs. And so, yes, it still waits for checkpoints and everything, but comparatively, usually, even when people come to us, we're like, oh, we recommend you use 0424. And now with OLMo 2, we're like, oh, that's the one we recommend, because it has all the documentation. So just dropping something doesn't seem like it really helps.Nathan Lambert [01:02:56]: I would say we should move faster than, like, the 1-2 iteration. But the in-between is not necessarily even worth it. Which is very odd, when you think about being fully open. 
It's just, like, kind of with the costs of doing business.Kyle Lo [01:03:10]: It's like being fully... You want to be fully open, but you don't want to add noise. And you don't want to waste people's time. Right? So if you drop something that's kind of half done or half baked, and people start spending time on it, only to get frustrated later, you've cost them something.Nathan Lambert [01:03:22]: How does this relate to, like, how pre-training is changing? Like, do you think we need to invest in... Like, openly, a lot of startups are changing their relationship to training. And if they're going to use Llama or pre-training or customer data, and then we have X compute budget, and does any of this come into play? Or is, like, it's all the same with the talking? It's, like, continue to hill climb, do what you can, reasonable trade-offs, and who will actually use the models? It's, like, not too different.Luca Soldaini [01:03:54]: I think that the... So for me, the cutoff point is, like, is there something useful and generally interesting to add if you pre-train? The case of Llama, all this, like, mid-train things that we concluded, we done. It couldn't be as clean if we started with an already pre-trained model. So it's, like, is there really something useful to add to the conversation if you pre-train? If we get to the moment when the answer is no, or, for work, like I was saying. But it feels there's still value to add to the conversation. At least in the research side, like, pre-training, there is tonight a question of, like, we know how to help researchers. We want to help more than just researchers with the models we put out. And if we think there is this application that we can do a very good job, or just this use case, a very good job by starting with someone else's pre-trained model, we shouldn't waste compute on pre-training from scratch. Just saying. We can solve that. But it's an ever-evolving question, really. It's, like, I don't know. We can make decisions six months out, maybe? Maybe a year?Kyle Lo [01:05:24]: Well, that's what I would say.Kyle Lo [01:05:27]: I know. You're the pre-training. You're the hardcore who's pre-trained some models.Dirk Groeneveld [01:05:34]: There's lots of runway left in pre-training. The big labs are fairly conservative because they have to be. But that doesn't mean that we're done. I mean, it's not that we're done. I also feel that the point of all is to make pre-training research accessible to more people, because even if you don't have the resources to pre-train the whole thing from scratch, you can still use our checkpoints and use our code to prove out some sort of improvement. And as we've seen in other areas, even Microsoft tries to push .NET or Apple tries to push Swift or whatever. They try to, like, it's a really big effort for them, and they try to push this. And the open-source community says, I don't care. We're going to use Python. And Python wins. So if you can somehow enable the vast resources of a million people banging on a thing, even a company like OpenAI or Meta cannot compete with that. And with OLMo, I'm hoping to capture that a little bit, that if we can capture something with some of the open-source enthusiasm and the academic enthusiasm.Nathan Lambert Do you think it'll get better this year? Because a lot of academics are bringing on tens of hundreds of H100 clusters around the country. Like, before, it was like just Harvard had 500 and MIT or whatever. But now it's like the long tail of universities. 
Nathan Lambert [01:07:36]: I mean, they would never tell us that they did. So what do we need to achieve this? Do we need resources and compute and certain people? Do we need more feedback from the community? Do we need feedback from people at labs telling us which things to do?

Kyle Lo [01:07:48]: Compute and people, for sure. That is undeniable. If you have more compute, you can try more things and go bigger. If you have more people trying more things, especially on our artifacts, we'll learn so much more and not have to spend so much time guessing, trying to piece things together from other people's pieces. Sometimes it's nice to just get something directly: if they did it on OLMo, we can immediately start working off of it. So people and compute, always, for sure.

Luca Soldaini [01:08:20]: We get a lot of feedback like, "I really like AI2, I would like to use OLMo, but it's missing this feature," which is great. I love that feedback. It's helped us a lot in prioritization. If we could get more, I would also love aspirational feedback, like, "none of the models is doing this, but I have a good case for it." Those, to me, are always very inspiring to read. Whether we'll do it or not is a question of whether we can, and how it fits with everything else.

Kyle Lo [01:08:55]: But those are always very, very welcome. You know what would be really cool? More projects in the space that you can't do unless you have some sort of fully open constellation of artifacts.

Nathan Lambert [01:09:09]: Dirk, does anyone ever do the thing where you load the model onto one GPU and iterate through the batches to find the one where it blows up — like, what happens when a loss spike occurs?

Dirk Groeneveld: I mean, to some degree we did this ourselves. But it's something that people can do. It's not like we wrote a paper about it, but I would love to see a detailed write-up of, millisecond by millisecond, what happens in attention when a loss spike happens. How does it actually happen? These are the things that people can do.

Nathan Lambert: And you just have to keep zooming into a specific level of detail in what happens.

Dirk Groeneveld: Yeah. Right now, someone is using the various checkpoints to see how a certain metric that we're interested in develops throughout pre-training. You can do that with fairly minimal compute; you don't have to be AI2. It's like one of my favorite weird language modeling papers, Sander Land's "Fishing for Magikarp" paper. You can get much more actionable feedback looking at weird tokenizer impacts and tokenizer-data interactions on OLMo than just poking at API models and trying to figure it out.
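The checkpoint-scanning work Dirk mentions is cheap to reproduce. Here is a rough sketch, assuming the intermediate checkpoints are published as branches of the model's Hugging Face repo (as OLMo's have been); the repo id, the branch naming, and the perplexity probe are assumptions to verify against the model card, not a prescribed recipe.

```python
# Rough sketch: track a simple metric (perplexity on a fixed snippet) across
# intermediate pre-training checkpoints exposed as Hugging Face branches.
import torch
from huggingface_hub import list_repo_refs
from transformers import AutoModelForCausalLM, AutoTokenizer

REPO = "allenai/OLMo-2-1124-7B"  # example repo id; any OLMo repo with step branches works
SAMPLE = "The capital of France is Paris, which is known for the Eiffel Tower."

# Enumerate the repo's branches and keep the ones that look like intermediate steps.
# The "step" naming is an assumption -- inspect refs.branches for the real convention.
refs = list_repo_refs(REPO)
step_branches = sorted(b.name for b in refs.branches if "step" in b.name)  # lexicographic, fine for a quick look

tokenizer = AutoTokenizer.from_pretrained(REPO)
inputs = tokenizer(SAMPLE, return_tensors="pt")

for branch in step_branches[:5]:  # only a few checkpoints, to keep the sketch cheap
    model = AutoModelForCausalLM.from_pretrained(REPO, revision=branch, torch_dtype=torch.bfloat16)
    model.eval()
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    print(f"{branch}: perplexity = {loss.exp().item():.2f}")
```

Swap the perplexity probe for whatever metric you care about — a tokenizer stress test, a benchmark slice — and you get a training-dynamics curve without ever training a model yourself.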
Kyle Lo [01:10:20]: There's also a lot of really cool stuff in looking at the checkpoints we have, together with the data batches, and trying things like: okay, let's replay everything between these two steps, but swap in some different data or manipulate the data between these two checkpoints, just to see how it turns into something different. How big of a fork does it create?

Nathan Lambert [01:10:39]: Like, if you apply the same intervention, how far does it diverge?

Kyle Lo [01:10:43]: Exactly. Or does it reconverge? Or the same interventions, messing with the data, early in pre-training versus later in pre-training. That stuff is really cool.

Dirk Groeneveld [01:10:49]: I've complained about this for a long time. Grad students, I think, are a little bit hesitant to go into pre-training stuff because they need to publish four papers a year, and it's pretty difficult to do that when your cycles are so long. But on the flip side, it's a bit less busy a field, so you're less likely to get scooped, and the field doesn't change out from under you while you're in the middle of your project. Post-training is not quite like that; it moves so fast on your side.

Nathan Lambert: It makes no sense. It's just like, pick something you want to do and people will probably also do it. That's okay.

Dirk Groeneveld [01:11:31]: So I'm hoping that by publishing all of this stuff and making all the checkpoints available, and the data and so on, we can enable more people to work on that side as well.

Nathan Lambert: Yeah. Anything else you guys want to add?

Kyle Lo [01:11:49]: Like, comment, subscribe.

Kyle Lo [01:11:52]: Yeah, I think that's it.

Nathan Lambert [01:12:01]: Okay. Thanks for listening. If you have questions for any of us individually, the Bluesky and Twitter handles for everyone in this podcast are below, and you can reach out to the general OLMo contact at allenai.org. We're really happy to help, and we want to keep building this kind of open scientific ecosystem of language models. So, all the best. Bye bye.

Get full access to Interconnects at www.interconnects.ai/subscribe