Gilad Lotan
Gaurav Saxena
Celina Wong
James Winegar
Josh Wills
Junran Yang
Søren Bramer Schmidt
Benn Stancil
Stay tuned for announcements on additional speakers, workshop registration, and the full agenda!
Hands-On Workshops
@ Convene, 4th Floor
12:00 PM
welcome lunch
Grab some lunch and settle in for some action-packed workshops!
1:00 PM
SESSION #1
Register for your top choice.
EFFICIENT DATA WAREHOUSING with DuckDB, MotherDuck & dbt
In this workshop, we will dive deep into the art of building simple yet powerful data warehouses using DuckDB, MotherDuck & dbt. By the end, you'll be equipped to quack the code of efficient data warehousing without the overhead and turn simplicity into your greatest strength!
LET YOUR AI MODEL WATCH VIDEOS AND REACT TO THEM with Tigris, Ollama, and Fly.io
Want to get started using video in your AI apps? We'll build an app that has an AI narrate a video or scene of your choice. You'll get examples of how to do video processing, frames extraction, and sending frames to AI models optimally, and leave with a deployed app that costs $0 to run. We'll talk about how to keep costs low as you scale.
MAKING LLMS GO VROOM: FINE-TUNE AND DEPLOY A FUNCTION-CALLING CAPABLE LLM with Fireworks.ai
Go through the process of preparing a dataset, uploading it to Fireworks.ai through the CLI, fine-tuning a model (e.g., Llama 3.1) for conversation, and finally deploying and publishing it. Participants will learn the process and best practices of fine-tuning an LLM for function/tool use. We'll also walk through the role of inference engines like Fireworks.ai in compound systems and agentic workflows.
FROM NOTEBOOKS TO DASHBOARDS with Quarto
Quarto is an innovative, open-source publishing system from Posit that transforms Jupyter Notebooks and plain markdown into polished, professional outputs. This workshop will show how Quarto helps data scientists create high-quality, shareable dashboards with static and interactive features while following an accessible and reproducible workflow.
4:00 PM
4:30 PM
SESSION #2
Register for your top choice.
GENERATIVE BACKENDS: AI-POWERED APPLICATION DEVELOPMENT with Outerbase
Ever wondered what it takes to create a complete backend? In this workshop, you will use AI to generate entire backend applications from simple prompts. Together, we will create database schemas, populate them with realistic data, and build working API endpoints, without writing a single line of code. All you need is a laptop with a web browser; no programming experience required.
By the end, you'll have a fully functioning backend ready to power your next project, plus a new perspective on how AI is reshaping the development landscape.
SCALING SQLITE: BUILDING PER-USER AND PER-TENANT APPLICATIONS with Turso
In this hands-on workshop, participants will learn how to implement per-user and per-tenant database applications using Turso. You'll build a sample application inspired by popular platforms like Reddit and Mastodon, demonstrating crucial concepts such as:
Participants will gain practical experience with Turso's APIs and SDKs, learning how to:
By the end of this workshop, you'll have the skills to architect scalable multi-tenant applications using SQLite and Turso.
Prerequisites
Users should have some familiarity with SQL/SQLite and APIs. Examples shown during the workshop will use TypeScript and React, but you can use any language since we'll be working through the API/CLI.
INSANELY INTERACTIVE DATA APPS with Evidence & DuckDB
In this workshop, you'll learn how to build an end-to-end data app, including connecting to a public data source, interacting with a DuckDB-powered SQL engine, building polished visualizations, and deploying your app online.
BUILDING RAG DATA PIPELINES FROM SCRATCH with Dagster and MotherDuck
In this workshop, we will explore the creation of Retrieval-Augmented Generation (RAG) data pipelines from the ground up using Dagster for orchestration and MotherDuck for efficient, scalable data processing. Attendees will learn how to integrate these tools to design data pipelines that enhance the performance and accuracy of generative AI models. We will cover key concepts, including data ingestion, pipeline management, and optimization techniques to streamline the deployment of RAG-based systems.
7:30 PM
evening reception
Join us for some beverages and light bites to close out Day 1 of Small Data SF.
Workshop registration is now OPEN!
Technical Talks and Sessions @ Convene, 5th Floor
8:00 AM
breakfast & registration
Check-in, grab breakfast, have your questions answered by technical experts.
It's going to be an awesome day.
9:00 AM
BIG IS NOT A NUMBER: DISPELLING THE MYTHS OF BIG DATA
Over the last decade, Big Data was everywhere. Let's set the record straight on what is and isn't Big Data. We have been consumed by a conversation about data volumes when we should focus more on the immediate task at hand: simplifying our work. Some of us may have Big Data, but our quest to derive insights from it is measured in small slices of work that fit on your laptop or in your hand. Easy data is here, so let's make the most of it.
Jordan Tigani, MotherDuck
9:20 AM
RETOOLING FOR A SMALLER DATA ERA
In this talk, I will offer my perspective on the modern data tools landscape, in particular user-facing tools for interactive data science and data exploration.
The latest trends of composable data systems and embeddable query engines like DuckDB and DataFusion present both challenges and opportunities for building a more coherent and productive stack of tools for end-user data scientists and developers building data systems. You'll come away with tactical tips and lessons learned from my own experience to carry forward in your work!
Wes McKinney, Posit PBC
9:40 AM
AN EVOLVING DAG FOR THE LLM WORLD
Directed Acyclic Graphs (DAGs) are the foundation of most orchestration frameworks. But what happens when you allow an LLM to act as the router? Acyclic graphs now become cyclic, which means you have to design for the challenges resulting from all this extra power. We'll cover the ins and outs of agentic applications and how to best use them in your work as a data practitioner or developer building today.
Julia Schottenstein, LangChain
10:00 AM
break
Refill your coffee at the espresso bar, get your questions answered by technical experts, and grab some sweet threads in the Swag Shop. What additional surprises await you? There's only one way to find out.
10:30 AM
GIVE EVERY USER THEIR OWN DATABASE! UNLEASHING THE UNTAPPED POWER OF SMALL DATA
Is big data a thing? Or can it be seen as a collection of small data? If we were more intentional about what that collection entails, could we build bigger and faster?
In this talk, we will explore how massive multitenancy with SQLite - the king of small databases - can be used to deliver fast and memorable OLTP experiences for your API.
Glauber Costa, Turso
10:50 AM
BUILD BIGGER WITH SMALL AI: RUNNING SMALL MODELS LOCALLY
It's finally possible to bring the awesome power of Large Language Models (LLMs) to your laptop. This talk will explore how to run and leverage small, openly available LLMs to power common tasks involving data, including selecting the right models, practical use cases for running small models, and best practices for deploying small models effectively alongside databases.
Jeff Morgan, Ollama
11:10 AM
SQUEEZING MAXIMUM ROI OUT OF SMALL DATA
As a head of data or a one-person data team, keeping the lights on for the business while running all things data-related as efficiently as possible is no small feat. This talk will focus on tactics and strategies to manage within and around constraints, including monetary costs, time and resources, and data volumes.
Lindsay Murphy, Women Lead Data
11:30 AM
BI'S BIG LIE
How we thought we had Big Data and we built everything planning for Big Data but then it turns out we didn't have Big Data and while that's nice and fun and seems more chill, it's actually ruining everything and I am here at this conference asking you to please help us figure out what we are supposed to do now.
12:00 PM
lunch
1:00 PM
KNOW THY CUSTOMER: WHY TPC IS NOT ENOUGH
Data warehouse benchmarks, just like database systems, need to become more holistic and stop focusing solely on query engine performance in favor of customer-centric indicators of usability and performance. Database research and development is heavily influenced by benchmarks, such as the industry-standard TPC-H and TPC-DS for analytical systems. However, these twenty-year-old benchmarks neither capture how databases are deployed nor what workloads modern cloud data warehouse systems face today. This talk will summarize well-known, confirm suspected, and unearth novel discrepancies between TPC-H/DS and actual workloads using empirical data and telemetry from Amazon Redshift.
Gaurav Saxena, AWS
1:20 PM
PADDLING IN CIRCLES: THE RETURN OF EDGE COMPUTING
The Small Data movement is being driven in part by long-running cyclical trends in edge computing. Richard's talk will explore this history and connect it to his work at Tableau and DuckDB Labs.
Richard Wesley, DuckDB Labs
1:40 PM
PYSHEETS: THE SPREADSHEET UI FOR PYTHON
Spreadsheet and Python lovers, rejoice! PySheets is an open-source, browser-contained technology that lets you run Excel-style spreadsheets in Python. It installs from PyPI and runs data science workflows in the browser using PyScript and WebAssembly (Wasm). Sheets are stored in the browser using IndexedDB, while code completions for Pandas analysis and Matplotlib visualizations are generated using Ollama. The result? A local-first data science environment without kernels, cloud storage, or remote AIs, where all data stays entirely local on users' machines; unlike Jupyter notebooks, your workflow dependency graph can finally be visualized as a familiar spreadsheet.
Chris Laffra, PySheets
2:00 PM
GEMMA 2: ENABLING NEXT-GEN CONVERSATIONAL AI ON SMALLER DEVICES
Imagine building powerful AI applications that run seamlessly on your laptop or even your phone. With Gemma 2, that vision becomes a reality. This session explores the Gemma 2 family of lightweight, high-performance open models designed to unlock new possibilities on smaller devices. With models ranging from 2B to 27B parameters, Gemma 2 delivers performance comparable to much larger models, while requiring significantly fewer resources.
Kathleen Kenealy, Google Gemma
2:30 PM
WHERE DATA SCIENCE MEETS SHREK: HOW BUZZFEED USES AI TO CAPTIVATE AUDIENCES WORLDWIDE
By introducing a range of AI-enhanced products that amplify creativity and interactivity across our platforms, Buzzfeed has been able to connect with the largest global audience of young people online to cement its role as the defining digital media company of the AI era. Notably, some of Buzzfeed's most successful tools and content experiences thrive on the power of small, focused datasets. Still wondering how Shrek fits into the picture? You'll have to join us at Small Data SF to find out.
Gilad Lotan, Buzzfeed
3:00 PM
break
3:20 PM
BUILDING LARGE APPS WITH TINY DATABASES
Modern applications are often huge, complex engineering projects - but they don't have to be. Applications increasingly need to work with data on local devices to support real-time collaboration and offline use. 2025 will be the year of Local-First development, and we'll demonstrate how new ways to deploy infrastructure can help us make this a reality.
Søren Bramer Schmidt, Prisma
3:40 PM
ENHANCING THE SCALABILITY AND USABILITY OF VISUALIZATION TOOLKITS
Data visualization should allow users to quickly understand their data and intuitively communicate key insights that inform decision-making. However, the process is still not as straightforward, performant, or intuitive as it should be.
With increased data volumes and usage in industry and science, visualization toolkits must support scalable computation for interactive exploration and adapt to audiences with diverse design and technical backgrounds. This talk will discuss common scalability and usability challenges related to visualization toolkits and propose enhancements to make them easier to use, more flexible, and ultimately more beneficial for systems built on top of them.
Junran Yang, University of Washington
4:00 PM
DATA MINIMALISM: DELIVERING BUSINESS VALUE FOR THE 99%
Businesses run on data, from Big Tech to Consumer, the Enterprise, and the Public Sector. But is data volume really the key to success? Moderated by Ravit Jain, our panelists will share insights and stories from the trenches as users, leaders, and data people building with analytics and AI to deliver meaningful business value.
Panelists: Jake Thomas, Okta; Celina Wong, Data Culture; James Winegar, CorrDyn; Josh Wills, DatologyAI
Moderated by: Ravit Jain, The Ravit Show
4:30 PM
closing reception & happy hour
Join us for a closing happy hour to mingle with the small data and AI community!
Need help convincing your boss to let you attend?
It's time for 'Boss Mode' - we've got you covered.