Open app Discover Prime

Top stations Podcasts Live sports Near you Genres Topics

Podcasts TechnologyGoogle SRE Prodcast

Listen to this podcast in the app for free:

radio.net

Sleep timer

Save favourites

Download for free in the App Store

Google SRE Prodcast

Salim Virji

Google SRE Prodcast

Latest episode

51 episodes

The One With Damion Yates and Building AI systems
26/02/2026 | 31 mins.
How do you introduce Site Reliability Engineering to an AI research lab, bringing concepts of scale to engineers who are at the leading edge of AI systems?
In the latest episode of The Prodcast, hosts Steve McGhee and Florian Rathgeber chat with Damion Yates, who helped establish the reliability engineering culture at Google DeepMind. Damion shares his journey of bringing scalable infrastructure to DeepMind, supporting massive machine learning experiments.
Discover the unique challenges of supporting AI research, such as managing highly expensive "lockstep" training models where a single machine failure halts the entire process. Damion also explains why he believes "luck is our enemy" in systems engineering, and why protecting a research scientist's time is the ultimate metric for success.
The One With Carla Geisser and Crisis Engineering
11/02/2026 | 25 mins.
Join us for a discussion with Carla Geisser of Layer Aleph, a company focused on "crisis engineering". Carla distinguishes a crisis from a standard incident by noting that a crisis is novel and lacks a playbook. She outlines five criteria for a true crisis: fundamental surprise, broken critical functions, high visibility, a rigid deadline (unlike internal tech deadlines), and perception breakdown. Crises often arise in organizations that struggle to admit computers control core decisions, leading to complex, glued-together systems. Carla emphasizes that SRE-adjacent skills are essential for connecting the dots and exposing the full system. The key takeaway for SREs is to recognize when a true crisis is happening, as leadership will only be willing to "break rules" and enable substantive change once three of these criteria are met.1
The One with Parker Barnes, Felipe Tiengo Ferreira, and AI
05/02/2026 | 36 mins.
This episode of the Prodcast tackles the challenges of maintaining AI safety and alignment in production. Guests Felipe Tiengo Ferreira and Parker Barnes join hosts Matt Siegler and Steve McGhee to discuss AI model safety, from examining content to emerging security risks. The discussion emphasizes the vital role of SREs in managing safety at scale, detailing multi-layered defenses, including system instructions, LLM classifiers, and Automated Red Teaming (ART). Felipe and Parker dive into the evolving world of AI safety, from core product policies to the groundbreaking Frontier Safety Framework. The guests explore the need for SRE principles like drift detection and context observability. Finally, they raise concerns about the velocity of AI development compressing long-term research, urging the industry to collaborate and share vocabulary to address rapidly emerging risks.
The One With Shannon Brady and Operating Systems
28/01/2026 | 24 mins.
In this episode of the Prodcast, guest Shannon Brady speaks with hosts Jordan Greenberg and Florian Rathgeber about managing Google's vast fleet of internal devices. Shannon explains how Google's Linux platform uses core SRE principles—specifically testing, canarying, and monitoring—for weekly stage rollouts of its Debian-based distribution. Configuration is efficiently managed using Puppet to ensure the right setup for a diverse user base. The conversation pivots to "the year of Linux everything," underscoring its widespread adoption. Discussing AI, Shannon identifies its greatest utility for SREs in rapidly analyzing signals and generating complex queries to resolve outages. This episode reinforces that practicing SRE fundamentals is paramount, demonstrating that you can be an SRE at heart, regardless of your official title.
The One With Denia Del Cid and AI
21/01/2026 | 29 mins.
Curious about the real impact of AI on Site Reliability Engineering? In this episode of The Prodcast, Google SRE Denia del Cid breaks down how her team is leveraging AI to transform production workflows.
Denia details practical applications like early outage detection, incident similarity analysis, and toil reduction. She explains the critical importance of validating against "golden data sets" and keeping humans in the loop to build trust. Discover how SREs are evolving from skepticism to strategic adoption with Gemini.
Tune in for a pragmatic, measured look at the future of reliability.

More Technology podcasts

About Google SRE Prodcast

SRE Prodcast brings Google's experience with Site Reliability Engineering together with special guests and exciting topics to discuss the present and future of reliable production engineering!

Podcast website

Listen to Google SRE Prodcast, The AI Daily Brief: Artificial Intelligence News and Analysis and many other podcasts from around the world with the radio.net app

Get the free radio.net app

Stations and podcasts to bookmark
Stream via Wi-Fi or Bluetooth
Supports Carplay & Android Auto
Many other app features

Get the free radio.net app

Stations and podcasts to bookmark
Stream via Wi-Fi or Bluetooth
Supports Carplay & Android Auto
Many other app features

Google SRE Prodcast

Scan code,
download the app,
start listening.

Company

Legal

Service

Contact
Apps
Help / FAQ

Apps

Social

v8.7.2 | © 2007-2026 radio.de GmbH

Generated: 3/4/2026 - 2:28:02 PM