
Data Engineering Podcast

Tobias Macey
Latest episode

509 episodes

  • Data Engineering Podcast

    The AI-First Data Engineer: 10–50x Productivity and What Changes Next

    07/04/2026 | 59 mins.
    Summary
    In this episode, I sit down with Gleb Mezhanskiy, CEO and co-founder of Datafold, to explore how agentic AI is reshaping data engineering. We unpack the leap from chat-assisted coding to truly agentic workflows where AI not only writes SQL and dbt models but also executes queries, debugs, runs tests, and ships production-ready outcomes. Gleb explains why teams that master this AI-first loop can see 10–50x gains, how security/compliance concerns can be addressed with platform-native LLM endpoints, and why the role of data engineers is shifting from code authors to operators of autonomous agents. We dig into the consolidation of the modern data stack, the economics driving more data products (Jevons paradox), and why product thinking, domain knowledge, and cross-functional skills will define the next wave of standout data professionals. We also cover practical steps for leaders and ICs: modernizing off legacy platforms, establishing safe AI adoption paths, codifying reusable “skills” and context for agents, and building validation utilities that keep the inner loop fast and trustworthy. Finally, Gleb shares how Datafold moved to fully AI-driven software delivery and why “outcomes over tools” is the emerging model for complex initiatives like data platform migrations—and how this reframes data quality for the AI era, emphasizing broad data access plus rich context over brittle human-centric tests.
    Announcements
    Hello and welcome to the Data Engineering Podcast, the show about modern data management
    If you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data—while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks,' and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest—we all need to Retool how we handle data requests.
    Your host is Tobias Macey and today I'm bringing back Gleb Mezhanskiy to talk about our predictions for the impact of AI on data engineering for 2026

    Interview
    Introduction
    How did you get involved in the area of data management?
    What are the concrete steps that teams need to be taking today to take advantage of agentic AI capabilities?
    What are the new guardrails/constraints/workflows that need to be in place before you let AI loose on your data systems?
    How do you balance the potential cost savings and productivity increases with the up-front investment and variability in inference spend?

    Contact Info

    LinkedIn

    Parting Question

    From your perspective, what is the biggest gap in the tooling or technology for data management today?

    Closing Announcements

    Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
    Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
    If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

    Links

    Blog Post
    Datafold
    Claude Opus 4.5
    Harry Potter - Muggles
    Jevons Paradox
    Modern Data Stack
    Dagster Compass
    Gravity Orion
    MCP == Model Context Protocol
    Qwen

    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
  • Data Engineering Podcast

    Treat Metering Like Finance: Building Data Platforms for Consumption Economics

    29/03/2026 | 50 mins.
    Summary
    In this episode Himant Goyal, Senior Product Manager at Salesforce, talks about how data platform investments enable reliable, accurate metering for consumption-based business models. Himant explains why consumption turns operations into a real-time optimization problem spanning metering, cost attribution, billing, governance, and cross-functional ownership. He explores the richness required in usage data to support sophisticated pricing, the importance of treating metering like a financial system, and the architectural foundations - event schemas, durable ingestion, normalization/validation, a usage ledger, and clear serving layers - needed to power near-real-time visibility with fine-grained drilldowns. He also digs into anti-patterns and reliability concerns such as late or duplicate data, time zone pitfalls, SLAs, and automated policy decisions for pipeline failures. Himant shares practical guidance for capturing usage events from products and logs, balancing push vs. pull and real-time vs. batch processing to manage costs. He highlights configurable metering and rate-card versioning for rapid onboarding of new products, and the cultural shift required for finance, product, and engineering to co-own metering.
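    The ledger-and-deduplication pattern Himant describes can be sketched in a few lines. This is an illustrative toy under assumed names (the `UsageEvent` fields and `UsageLedger` class are invented for the example, not Salesforce's implementation); it shows how a deterministic event key makes at-least-once delivery upstream safe for billing downstream:

    ```python
    import hashlib
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class UsageEvent:
        customer_id: str
        meter: str        # e.g. "api_calls", "gb_stored"
        quantity: float
        occurred_at: str  # ISO-8601 timestamp from the emitting service

    class UsageLedger:
        """Append-only usage ledger that deduplicates on a deterministic key."""

        def __init__(self):
            self._events = {}

        @staticmethod
        def event_key(e: UsageEvent) -> str:
            # Replays of the same event hash to the same key, so at-least-once
            # delivery upstream still yields exactly-once accounting here.
            raw = f"{e.customer_id}|{e.meter}|{e.quantity}|{e.occurred_at}"
            return hashlib.sha256(raw.encode()).hexdigest()

        def ingest(self, e: UsageEvent) -> bool:
            """Returns True if the event was new, False if it was a duplicate."""
            key = self.event_key(e)
            if key in self._events:
                return False
            self._events[key] = e
            return True

        def total(self, customer_id: str, meter: str) -> float:
            return sum(e.quantity for e in self._events.values()
                       if e.customer_id == customer_id and e.meter == meter)
    ```

    A real system would persist the ledger durably and window the dedup key store, but the invariant is the same: billing totals are computed only from the deduplicated ledger, never from the raw stream.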

    Announcements
    Hello and welcome to the Data Engineering Podcast, the show about modern data management
    Your host is Tobias Macey and today I'm interviewing Himant Goyal about how data platform investments support consumption based business models
    Interview
    Introduction
    How did you get involved in the area of data products and data management?
    Can you start by outlining the types of businesses and products that are "consumption based" and the impact that it has on the economics of the company?
    What are the unique operational challenges that are presented by having consumption as the unit of cost?
    How does the availability and accessibility of metering data impact the level of detail/nuance that the business can employ in their pricing strategies?

    When we talk about the infrastructure for usage tracking, it often feels like a high-stakes stream processing problem. What are the core architectural components required to build a reliable metering pipeline?
    How do you think about the trade-offs between "push" models (application emits events) vs. "pull" models (the platform scrapes resource usage)?

    Accuracy is non-negotiable when data is tied directly to revenue. What are the strategies for ensuring idempotency and handling deduplication in the ingestion layer?
    How do you address the "late-arriving data" problem in a usage-based world, especially when dealing with monthly billing cycles or credit exhaustion?

    From an uptime and reliability perspective, should the metering system be in the critical path of the service itself?
    If the metering service is down, do you "fail open" and provide free service, or "fail closed" and impact availability? How do you build for that kind of resilience?

    One of the common pitfalls is treating metering like logging or observability. How do you ensure that usage metering is treated as a first-class product priority rather than an afterthought for the platform team?
    What does the interface look like for product engineers to "register" a new billable event without breaking the downstream data contract?

    Once you have this data, there is often a requirement for real-time visibility for the end user. What are the data modeling requirements to support both "high-volume ingestion" and "low-latency querying" for customer-facing billing dashboards?
    How do you bridge the gap between the raw event stream and the aggregated "billable unit" in the data warehouse or lakehouse?

    What are the most interesting, innovative, or unexpected ways that you have seen usage-based metering used?
    What are the most interesting, unexpected, or challenging lessons that you have learned while working on building consumption-based data platforms?
    When is usage-based metering the wrong choice? (e.g., When does the complexity of the data platform outweigh the economic benefits?)
    What are your predictions for the future of consumption-based data architectures?

    Contact Info

    LinkedIn

    Parting Question

    From your perspective, what is the biggest gap in the tooling or technology for data management today?

    Links

    Hackernoon Post
    COGS == Cost of Goods Sold
    Medallion Architecture

    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
  • Data Engineering Podcast

    Beyond the PDF: Rowan Cockett on Reproducible, Composable Science

    22/03/2026 | 42 mins.
    Summary
    In this episode Rowan Cockett, co-founder and CEO of CurveNote and co-founder of the Continuous Science Foundation, talks about building data systems that make scientific research reproducible, reusable, and easier to communicate. He digs into the sociotechnical roots of the reproducibility crisis - from data integrity and access to entrenched publishing incentives and PDF-bound workflows. He explores open standards and tools like Jupyter, Jupyter Book, and the push toward cloud-optimized formats (e.g., Zarr), along with graceful degradation strategies that keep interactive research usable over time. Rowan details how CurveNote enables interactive, reproducible articles that spin up compute on demand while delegating large dataset storage to specialized partners, and how community efforts like the Continuous Science Foundation and initiatives with Creative Commons aim to fix credit, licensing, and attribution. He also discusses the Open Exchange Architecture (OXA) initiative to establish a modular, computational standard for sharing science, the momentum in computational biosciences and neuroscience, and why true progress hinges on interoperability and composability across data, code, and narrative.
    Announcements
    Hello and welcome to the Data Engineering Podcast, the show about modern data management
    Your host is Tobias Macey and today I'm interviewing Rowan Cockett about building data systems that make scientific research easier to reproduce

    Interview

    Introduction
    How did you get involved in the area of data management?
    Can you describe what your interest is in reproducibility of scientific research?
    What role does data play in the set of challenges that plague reproducibility of published research?
    What are some of the notable changes in the areas of scientific process, and data systems that have contributed to the current crisis of reproducibility?
    Beyond technological shortcomings, what are the processes that lead to problematic experiment/research design, and how does that complicate the work of other teams trying to build on the experimental findings?
    How does a monolithic approach change the types of research that would be possible with more modular/composable experimentation and research?
    Focusing now on the data-oriented aspects of research, what are the habits of research teams that lead to friction and waste in storing, processing, publishing, and ultimately consuming the information that supports the research findings?
    What are the elements of the work that you are doing at the Continuous Science Foundation and Curvenote to break the status quo?
    Are there any areas of study that are more susceptible to friction and siloing of their data?
    What does a typical engagement with a research group look like as you try to improve the accessibility of their work?
    What are the most interesting, innovative, or unexpected ways that you have seen research data (re-)used?
    What are the most interesting, unexpected, or challenging lessons that you have learned while working on reproducibility of scientific research?
    What are the next set of challenges that you are focused on addressing in the research/reproducibility space?

    Contact Info

    LinkedIn

    Parting Question

    From your perspective, what is the biggest gap in the tooling or technology for data management today?

    Closing Announcements

    Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
    Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
    If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

    Links

    Continuous Science Foundation
    Curvenote
    Zenodo
    Dryad
    HDF5
    Iceberg
    Zarr
    Myst Markdown
    Jupyter Notebook
    ArXiv
    Journal of Open Source Software (JOSS)
    Data Carpentry
    Software Carpentry
    openRxiv
    bioRxiv
    medRxiv
    Force 11
    JupyterBook
    Open Exchange Architecture (OXA)

    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
  • Data Engineering Podcast

    Beyond Prompts: Practical Paths to Self‑Improving AI

    16/03/2026 | 1h 1 mins.
    Summary
    In this episode Raj Shukla, CTO of SymphonyAI, explores what it really takes to build self‑improving AI systems that work in production. Raj unpacks how agentic systems interact with real-world environments, the feedback loops that enable continuous learning, and why intelligent memory layers often provide the most practical middle ground between prompt tweaks and full Reinforcement Learning. He discusses the architecture needed around models - data ingestion, sensors, action layers, sandboxes, RBAC, and agent lifecycle management - to reach enterprise-grade reliability, as well as the policy alignment steps required for regulated domains like financial crime. Raj shares hard-won lessons on tool use evolution (from bespoke tools to filesystem and Unix primitives), dynamic code-writing subagents, model version brittleness, and how organizations can standardize process and entity graphs to accelerate time-to-value. He also dives into pitfalls such as policy gaps and tribal knowledge, strategies for staged rollouts and monitoring, and where small models and cost optimization make sense. Raj closes with a vision for bringing RL-style improvement to enterprises without requiring a research team - letting businesses own the reasoning and memory layers that truly differentiate their AI systems.
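    The "intelligent memory layer" Raj positions between prompt tweaks and full reinforcement learning can be illustrated with a minimal sketch. Everything here (the `MemoryLayer` class name and record shape) is hypothetical, meant only to show the loop of recording action outcomes and replaying them as context for the next attempt:

    ```python
    class MemoryLayer:
        """Stores outcomes of past agent actions and replays them as context."""

        def __init__(self, max_items: int = 50):
            self.max_items = max_items
            self.records = []  # list of (task, action, succeeded) tuples

        def record(self, task: str, action: str, succeeded: bool) -> None:
            # Keep a bounded window of recent outcomes so the context that
            # gets injected into prompts stays small and current.
            self.records.append((task, action, succeeded))
            self.records = self.records[-self.max_items:]

        def advice_for(self, task: str) -> list:
            # Surface prior successes and failures for the same task as
            # prompt context: cheaper than RL fine-tuning, richer than
            # static prompt engineering.
            return [f"{'WORKED' if ok else 'FAILED'}: {action}"
                    for t, action, ok in self.records if t == task]
    ```

    In a production agent the lookup would use semantic similarity rather than exact task matching, but the feedback loop is the same: outcomes accumulate, and each new run starts with the distilled lessons of the previous ones.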

    Announcements
    Hello and welcome to the Data Engineering Podcast, the show about modern data management
    Your host is Tobias Macey, and today I’m interviewing Raj Shukla about building self-improving AI systems — and how they enable AI scalability in real production environments.

    Interview

    Introduction
    How did you get involved in AI/ML?
    Can you start by outlining what actually improves over time in a self-improving AI system? How is that different from simply improving a model or an agent?

    How would you differentiate between an agent/agentic system vs. a self-improving system?
    One of the components that are becoming common in agentic architectures is a "memory" layer. What are some of the ways that contributes to a self-improvement feedback loop? In what ways are memory layers insufficient for a generalized self-improvement capability?

    For engineering and technology leaders, what are the key architectural and operational steps you recommend to build AI that can move from pilots into scalable, production systems?
    One of the perennial challenges for technology leaders is how to build AI systems that scale over time.
    How has AI changed the way you think about long-term advantage?
    How do self-improvement feedback loops contribute to AI scalability in real systems?
    What are some of the other key elements necessary to build a truly evolutionary AI system?
    What are the hidden costs of building these AI systems that teams should know before starting, particularly for enterprises deploying AI into internal mission-critical workflows?
    What are the most interesting, innovative, or unexpected ways that you have seen self-improving AI systems implemented?
    What are the most interesting, unexpected, or challenging lessons that you have learned while working on evolutionary AI systems?
    What are some of the ways that you anticipate agentic architectures and frameworks evolving to be more capable of self-improvement?

    Contact Info

    LinkedIn

    Closing Announcements

    Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
    Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
    If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

    Parting Question

    From your perspective, what are the biggest gaps in tooling, technology, or training for AI systems today?

    Links

    Symphony AI
    Reinforcement Learning
    Agentic Memory
    In-Context Learning
    Context Engineering
    Few-Shot Learning
    OpenClaw
    Deep Research Agent
    RAG == Retrieval Augmented Generation
    Agentic Search
    Google Gemma Models
    Ollama

    The intro and outro music is from Hitman's Lovesong feat. Paola Graziano by The Freak Fandango Orchestra/CC BY-SA 3.0
  • Data Engineering Podcast

    Orion at Gravity: Trustworthy AI Analysts for the Enterprise

    08/03/2026 | 1h 5 mins.
    Summary
    In this episode of the Data Engineering Podcast, Lucas Thelosen and Drew Gilson, co-founders of Gravity, discuss their vision for agentic analytics in the enterprise, enabled by semantic layers and broader context engineering. They share their journey from Looker and Google to building Orion, an AI analyst that combines data semantics with rich business context to deliver trustworthy and actionable insights. Lucas and Drew explain how Orion uses governed, role-specific "custom agents" to drive analysis, recommendations, and proactive preparation for meetings, while maintaining accuracy, lineage transparency, and human-in-the-loop feedback. The conversation covers evolving views on semantic layers, agent memory, retrieval, and operating across messy data, multiple warehouses, and external context like documents and weather. They emphasize the importance of trust, governance, and the path to AI coworkers that act as reliable colleagues. Lucas and Drew also share field stories from public companies where Orion has surfaced board-level issues, accelerated executive prep with last-minute research, and revealed how BI investments are actually used, highlighting a shift from static dashboards to dynamic, dialog-driven decisions. They stress the need for accessible (non-proprietary) models, managing context and technical debt over time, and focusing on business actions - not just metrics - to unlock real ROI.
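    As a rough illustration of what a semantic layer contributes to agent context, here is a minimal sketch of a metric definition being compiled into governed SQL. The metric name, table, and fields are invented for the example and do not reflect Orion's actual design:

    ```python
    # A semantic layer entry binds a business metric name to a governed SQL
    # definition plus the context an agent needs to use it correctly.
    SEMANTIC_LAYER = {
        "monthly_recurring_revenue": {
            "sql": "SUM(subscriptions.amount)",
            "table": "subscriptions",
            "grain": "month",
            "filters": ["status = 'active'"],
            "description": "Sum of active subscription amounts per month.",
        },
    }

    def compile_metric(name: str) -> str:
        """Expand a metric name into a concrete query an agent could run."""
        m = SEMANTIC_LAYER[name]
        where = " AND ".join(m["filters"]) or "TRUE"
        return (f"SELECT DATE_TRUNC('{m['grain']}', created_at) AS period, "
                f"{m['sql']} AS {name} "
                f"FROM {m['table']} WHERE {where} GROUP BY 1")
    ```

    The point of the indirection is that the agent never guesses at join logic or filter conditions: it asks for "monthly_recurring_revenue" and receives SQL that encodes the organization's agreed definition.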

    Announcements
    Hello and welcome to the Data Engineering Podcast, the show about modern data management
    Your host is Tobias Macey and today I'm interviewing Lucas Thelosen and Drew Gilson about the application of semantic layers to context engineering for agentic analytics

    Interview

    Introduction
    How did you get involved in the area of data management?
    Can you start by digging into the practical elements of what is involved in the creation and maintenance of a "semantic layer"?
    How does the semantic layer relate to and differ from the physical schema of a data warehouse?
    In generative AI and agentic systems the latest term of art is "context engineering". How does a semantic layer factor into the context management for an agentic analyst?
    What are some of the ways that LLMs/agents can help to populate the semantic layer?
    What are the cases where you want to guard against hallucinations by keeping a human in the loop?
    Beyond a physical semantic layer, what are the other elements of context that you rely on for guiding the activities of your agents?
    What are some utilities that you have found helpful for bootstrapping the structural guidelines for an existing warehouse environment?
    What are the most interesting, innovative, or unexpected ways that you have seen Orion used?
    What are the most interesting, unexpected, or challenging lessons that you have learned while working on Orion?
    When is Orion the wrong choice?
    What do you have planned for the future of Orion?

    Contact Info

    Lucas: LinkedIn
    Drew: LinkedIn

    Parting Question

    From your perspective, what is the biggest gap in the tooling or technology for data management today?

    Closing Announcements

    Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
    Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
    If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

    Links

    Gravity
    Orion
    Looker
    Semantic Layer
    dbt
    LookML
    Tableau
    OpenClaw
    Pareto Distribution

    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


About Data Engineering Podcast

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Podcast website

