Explore23: Web application for exploration of a large genomic research cohort
Speaker

Teague is Director of Computational Biology Systems at 23andMe, where he works on making the breadth of the multimodal 23andMe research-consented cohort explorable and interpretable to researchers of different backgrounds. His team has used DuckDB and DuckDB-based tools extensively to process, summarize, and ultimately visualize the many billions of data points in the dataset. Previously he worked in genetics and informatics at BioMarin Pharmaceutical and was the lead architect and developer of the decidedly not-so-small ZINC chemical database at UCSF. He has also begun authoring multiple DuckDB community extensions, including duckdb_mcp (for AI integrations), webbed (for XML and HTML parsing), and duckdb_yaml (for YAML file compatibility). Despite a career in massive biological datasets, he maintains that the right small dataset beats a big one every time.
0:11 I'm Teague Sterling, and I had a funny thing happen between when I signed up for this and today. How many of you have heard of 23andMe?
0:22 How many of you have heard of the 23andMe Research Institute? Okay, a lot fewer of you, probably because that's only existed for a couple of weeks now.
0:32 We went through a little bit of a change. Well,
0:38 we'll leave it at that, but there will also be maybe a few hints along the way.
0:42 This will be a story of how we used some of the ideas of big data, huge data, and making it small to meet some of the challenges we faced, and maybe part of why we've become a different organization. In the spirit of transparency, my colleague Larry Hangel, who is somewhere out here in the audience and has maybe
1:07 talked to some of you, is really the one responsible for most of this work. I just get to take credit for it and talk about it up here, so point all criticism at him.
1:21 All right. I've been working with biomedical datasets, by and large, for 15 or so years now, and I've always been reminded of this mantra of making things small to solve the problem. That's not a surprise to anyone here. So when I saw this small data conference I was
1:44 really excited. I like small data. I have big data. I like oxymorons.
1:49 And we did have a problem, so let's think small. I'll talk a little bit about the problem we have, why we built the little app I'll walk you through, a small aside about tiny Parquet, which I know Larry has heard me talk about ad nauseam but you haven't, so I'm sorry, and then a
2:13 discussion about visualizing large numbers of genetic variants. So we built this tool. It's called Explore23. It lets researchers understand what the 23andMe Research Institute has available, from a very privacy-forward perspective, in a way that is also easy to use and, coincidentally, powered by DuckDB pretty much everywhere.
2:38 It might be a little surprising to hear about genetic data at a small data conference. Who here has worked with genetic data?
2:50 It's not really small. We have 11 million customers, and not only do we have genetic data, we also have a lot of phenotypic data that is used for really exciting health research and validation.
3:05 So, we've probably all heard that big data is hard. We have 11 million participants, all consenting or withdrawing consent constantly. We have four billion data points from our survey data alone. We cover almost two hundred million genetic variants, and we've already done a lot of research on them.
3:25 We've run thousands of analyses across all of those 172 million variants, over and over again, in all different populations. It's really exciting data, especially if you are into GWAS results at all. But we had a problem: it's actually hard to find and understand that data, and not many people even knew we had it.
3:49 So we had to solve that problem without raising any privacy concerns.
3:57 But first, I want to thank everyone who has participated or continues to participate in the 23andMe research cohort. It makes a really big difference, and it continues to push really exciting health advances forward. Spitting in a tube and then filling in a survey saying, "Yeah, I have asthma. No, I don't have cancer," has actually had a huge impact when it's
4:20 done over more than 10 million people. And there are real advances in medicine that have come out of that.
4:28 So, we built a little app to focus on big data. Larry built an app, and it's not really big data: we have small data representing big data.
4:38 That's probably not a surprise to many of you; we've all built those kinds of apps. We have survey answers: 150
4:45 surveys that have been answered by millions of people, covering 10,000 questions. We have electronic health records that have been donated to us to use in this research. We have what we call cohorts, which assemble those survey answers into different traits that we can then analyze to produce GWAS
5:06 results, where we can look at the association of a trait with genetic variation, potentially find causes or related factors, and then use that for more research. We've done this a lot of times. We have probably hundreds
5:23 of GWAS runs that cover thousands of results. All of that needed to be accessible. It's stored and archived in databases, Parquet files, and flat files somewhere, but unless you literally knew where to find it, you wouldn't know that it's there. We also have a lot of genetic variants, which are really the meat of what we're going to talk about
5:44 today. I mentioned we have 172 million
5:48 variants across 11 million people in our research cohort. That's almost two quadrillion genotypes, but we're not dealing with that. That's way too much data, and we're not going to deal with it directly. We will come back to the variants. So what did we do here? Well, as you've probably all heard, we just used DuckDB and Postgres almost
6:09 everywhere. But also, we precomputed
6:13 a lot of it, basically all of it, and also pulled from data that was already precomputed.
6:21 The whole stack is materialized data along the way. Again, probably not a surprise to a lot of you here. Anybody surprised to hear that?
6:32 Okay, cool. So we built an architecture using DuckDB at pretty much every stage of the process, and again, this is done from a very privacy-centric perspective. We need to make sure that data, as it's analyzed at an individual level, can never be identified back to that person, and frankly we don't even want to include
6:55 that data in the app at all. That's part of the reason we materialize everything into summaries: it makes things a lot easier. In our research environment, which is an enclave, data can't leave without literally being manually reviewed, whether by external parties or by different parts of the organization. We generate a ton of
7:18 summaries, and I'll talk a little bit about one of my favorite techniques we've come up with. Then we assemble those into a DuckDB database that we package alongside Larry's beautiful application code and deploy as a single container.
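As a concrete sketch of that packaging step, with hypothetical directory and table names rather than the real build script:

```python
# Sketch: bundle reviewed summary Parquet files into one DuckDB database
# file that ships inside the application container. Paths and table names
# here are hypothetical, not the actual Explore23 layout.
import duckdb
from pathlib import Path

con = duckdb.connect("explore23_summaries.duckdb")

# One table per summary domain; each directory holds the exported Parquet.
for domain in ["survey_summaries", "cohort_summaries", "gwas_results"]:
    parquet_glob = str(Path("reviewed_exports") / domain / "*.parquet")
    con.execute(
        f"CREATE OR REPLACE TABLE {domain} AS "
        f"SELECT * FROM read_parquet('{parquet_glob}')"
    )

con.close()  # the resulting .duckdb file is copied into the Docker image
```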
7:35 Conveniently, Larry's application code is a pretty small application. It's just a thin API layer that gives easy access to the data, plus a
7:46 WASM DuckDB instance in the browser that populates itself from that API. This works surprisingly well and lets us not think about the backend at all.
7:59 I promised you I was going to talk about tiny Parquet. I can't help it; it's my favorite thing. We have 11 million people, and we have thousands and thousands of what we call phenotypic data points.
8:10 It turns out, maybe surprisingly to me, that Parquet compresses really, really well, especially Parquet V2. So we can pack a huge number of these data points, millions of rows, into positionally aligned Parquet files with one column each, and it takes up a couple of megabytes.
8:33 And you can merge those back together with the convenient positional join. In a fraction of a second, you can summarize everyone who has reported having anemia, not having anemia, or no answer, across genetic ancestry, age, and genetically reported sex, consuming only 16 megabytes: down to about a byte and a half per participant, which was pretty cool to me.
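As a rough illustration of that technique, assuming hypothetical file and column names and using DuckDB's Python API, the anemia summary might look something like this:

```python
# Sketch of the "tiny Parquet" summarization: each phenotype/covariate lives
# in its own single-column, positionally aligned Parquet file (rows in the
# same participant order). File and column names are hypothetical.
import duckdb

con = duckdb.connect()
summary = con.execute("""
    SELECT
        anc.ancestry,
        age.age_bucket,
        sex.genetic_sex,
        ph.anemia_reported,          -- e.g. 'yes' / 'no' / NULL (no answer)
        count(*) AS n_participants
    FROM read_parquet('phenotypes/anemia.parquet')                  AS ph
    POSITIONAL JOIN read_parquet('covariates/ancestry.parquet')     AS anc
    POSITIONAL JOIN read_parquet('covariates/age_bucket.parquet')   AS age
    POSITIONAL JOIN read_parquet('covariates/genetic_sex.parquet')  AS sex
    GROUP BY ALL
""").df()
print(summary)
```

POSITIONAL JOIN stitches the files back together by row number rather than by a key, which is what makes the one-column-per-file layout workable.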
9:00 And this is really fast and really flexible. There's a whole other talk, which I look forward to one day collaborating on or talking to somebody about, on how this can be combined with DuckPGQ and graph databases to do some really creative exploration of the phenotypic space of self-reported medical data along with, say, inheritance. All
9:23 right. Why did we do that? It didn't take a lot of work to create this dataset, and it was really, really fast: a fraction of a second for each phenotype, and we have a few thousand of them, so with a nice big HPC cluster that's done pretty quickly. It bundles nicely into a Docker container. We don't
9:43 have to think about running a backend, and it's literally less than 100 megabytes. It's client-first. We care about making this explorable and available to our collaborators, our internal and external scientists. We want to make it look pretty for them so that they can find what they want and what they need, not focus on a backend server. And we need something
10:05 that, as I mentioned before, can be verified. A small dataset is a lot easier to have somebody review than a big dataset, and there's a lot less room for error, especially when you have a strict schema to verify against.
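One way to keep that review cheap, sketched here with an illustrative expected schema rather than the real one, is to assert the bundled database against it before anything leaves the enclave:

```python
# Sketch of a pre-release schema check on the bundled summary database.
# The expected schema below is illustrative, not the real Explore23 schema.
import duckdb

EXPECTED = {
    "gwas_results": {
        "variant_id": "VARCHAR",
        "trait": "VARCHAR",
        "p_value": "DOUBLE",
        "beta": "DOUBLE",
    },
}

con = duckdb.connect("explore23_summaries.duckdb", read_only=True)
for table, expected_cols in EXPECTED.items():
    actual = dict(
        con.execute(
            "SELECT column_name, data_type FROM information_schema.columns "
            "WHERE table_name = ?",
            [table],
        ).fetchall()
    )
    assert actual == expected_cols, f"Schema drift in {table}: {actual}"
con.close()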
10:17 As I mentioned, we have a lot of variants. For those of you who have worked with genetic data, visualizing genetic variants can be a pretty tricky problem. We're looking at a position in the genome, and we're looking at the allele frequency of the minor allele, which is the less
10:39 common variant. That frequency changes, and there can be a lot of these variants. It's a challenge when researchers really care about maybe one or a small number of these variants, know a lot about them, and for the most part don't care about the others, and what they care about changes based on the question they're asking. It's a
11:00 challenging problem for small data, because you need to have everything available to answer whatever questions are
11:09 expected.
11:17 We test for half a million to a million of these and then impute up to 170 million more. The frequencies and impact depend on ancestry and location, and researchers will care a lot about a specific gene.
11:36 Because we also have an AI component to this, I decided to ask Claude to turn that whole slide into something that's a little easier to look at. So we go from the genome to genes, some of which may overlap, and then into Parquet files, which really is the data model we're talking about for this variant
11:57 viewer. Pretty conveniently, because it's just files, we can store them in Parquet, and it's less than 100 gigabytes. We can put that onto an EBS volume that can be shared, and then we can stream it right into the browser through an Arrow stream, and it works pretty
12:25 darn well, making it possible to do all the filtering in the client for a given gene. Part of that required some exciting and sometimes frustrating work getting Mosaic and Svelte, the UI framework we decided to use, to play nicely together.
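The Arrow hand-off itself is simple on the data side. Here is a minimal sketch, with hypothetical file names, column names, and helper, of how a per-gene slice of the variant Parquet could be serialized as an Arrow IPC stream for a DuckDB-WASM client to ingest; this is not the actual Explore23 code.

```python
# Minimal sketch (not the Explore23 backend): slice the variant Parquet for
# one gene window and serialize it as an Arrow IPC stream for the browser.
# Paths, column names, and the helper name are hypothetical.
import duckdb
import pyarrow as pa

def gene_variants_ipc(gene_symbol: str, window: int = 500000) -> bytes:
    con = duckdb.connect()
    table = con.execute(
        """
        SELECT v.*
        FROM read_parquet('variants/*.parquet') AS v
        JOIN read_parquet('genes.parquet')      AS g USING (chromosome)
        WHERE g.symbol = ?
          AND v.position BETWEEN g.start_pos - ? AND g.end_pos + ?
        """,
        [gene_symbol, window, window],
    ).fetch_arrow_table()

    # Arrow IPC stream bytes; DuckDB-WASM (or pyarrow) can read these directly.
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    return sink.getvalue().to_pybytes()
```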
12:49 The Mosaic and Svelte integration is Larry's work. It took a lot
12:53 of back and forth to get the data readiness right, but it works pretty well: updates to different UI elements come back to Mosaic as the coordinator, which then updates the different plots. Again, please send all exciting questions to Larry, who can raise his hand.
13:14 I'm sorry, Larry. Then we can take a look at what actually comes out of this at a large scale. Here we have a gene, PCSK9, a popular gene among people who like to demonstrate genes.
13:29 We are displaying it across the genome. The green box represents the gene boundary, the purple sections are the different exons, and then we show all variants within 500,000 positions of the gene boundaries.
13:45 In our dataset this covers almost 65,000 variants, but you don't want to look at 65,000 dots. So we've used WASM, Mosaic, and Svelte to create this toolkit to quickly render what's really of interest here.
14:03 We also have nice faceting on the side, because sometimes you might care whether a variant caused a change,
14:12 or a particular type of loss of function, or not. You can filter down to just those.
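That facet is ultimately just a SQL predicate on the streamed-in variant table. A rough sketch of the kind of query involved, written against the Python API rather than DuckDB-WASM and with illustrative column names, consequence terms, and thresholds:

```python
# Sketch of the kind of filter the facets drive: keep only rare, putatively
# loss-of-function variants in the currently loaded gene window.
# File, column names, and thresholds are illustrative.
import duckdb

con = duckdb.connect()
of_interest = con.execute("""
    SELECT variant_id, position, minor_allele_freq, consequence
    FROM read_parquet('variants/PCSK9.parquet')
    WHERE consequence IN ('stop_gained', 'frameshift_variant', 'splice_donor_variant')
      AND minor_allele_freq < 0.01
    ORDER BY position
""").df()
print(len(of_interest), "variants of interest")
```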
14:18 Then, as we zoom out, the genome is really large and drawing more and more dots becomes problematic, so I don't think you want to do that. Instead we've turned it into a heat map, with a nice toggle, which really highlights how well the connection between Mosaic, Svelte, and the
14:38 database can all play together to render these very, very quickly.
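Under the hood, the heat-map toggle amounts to swapping per-variant dots for an aggregation; Mosaic generates the actual queries, but conceptually the binning looks something like this sketch (bin widths and column names are made up):

```python
# Sketch: when zoomed out, aggregate variants into 2-D bins (genomic position
# x minor allele frequency) and draw counts as a heat map instead of one dot
# per variant. Bin widths and column names are illustrative only.
import duckdb

con = duckdb.connect()
bins = con.execute("""
    SELECT
        floor(position / 100000)                    AS pos_bin,   -- 100 kb bins
        floor(log10(minor_allele_freq + 1e-9) * 2)  AS freq_bin,  -- log-frequency bins
        count(*)                                    AS n_variants
    FROM read_parquet('variants/PCSK9.parquet')
    GROUP BY 1, 2
""").df()
```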
14:45 The slowest part of this was drawing dots. Now it can just be transmitting data over the wire, which is, I guess, pretty cool when you only need to get a small amount of data; pulling in more just adds transit time. A small fun aside, again, I told you I'm obsessed with compressed Parquet, I'm sorry: I said we
15:06 have 172 million variants. That packs down into 96 gigabytes, which I think is also pretty incredible when you think that that's 16 bytes per variant.
15:16 And all this nonsense that I hope you don't read is what we are including for every variant. So that's a lot of data to be packing into 16 bytes. I'm not quite sure how it's working, but it's amazingly efficient, and it makes transmitting that data to the client really easy.
15:32 All right. So, this was a lot of
15:37 fun for us. Part of the reason we decided to use DuckDB is, well, we like it. But we learned some things along the way, some painful and some convenient. As you can probably imagine, we didn't start with "don't have a backend, just stream data." We tried a backend. It was painful. It didn't work. It caused
15:57 problems. So my advice: if you can,
16:01 just precompute it and send it to the client. I hope I've gotten across the point that, at least I think, partitioned and compressed Parquet is pretty incredible as a file format, and that you can send it over Arrow right into WASM and have even more fun.
16:21 Nobody really wants to look at 500,000 elements. So if you're going to make your data small, make it even smaller when you render it. Use your facets and filters, because the slowest part of this whole thing when we first started was simply drawing 50,000 dots, or even a thousand dots, on the screen. It was not pleasant. And
16:41 Mosaic worked really, really well. We had to do some magic to get the reactivity model and Svelte to play nicely, and there was some boilerplate to write, but it made filtering and coordination work together; there's probably some exciting potential for open-source enhancements or contributions there. For those who have worked with genomics,
17:05 there are other tools out there that are faster, that hold more data, maybe 40% more than ours, and that load faster. But ours runs with almost no backend. It's a single small instance: there's no GraphQL, there's no Elasticsearch, there's nothing you need to deploy aside from a single container. You can run it off a flash drive; it's
17:28 100 gigabytes of data for the variants plus the application code. It's locally usable, we develop it locally, and it's interactive: you can explore and expand all the variants along the way and look at their allele frequencies across all the different populations, which is one of our really
17:47 exciting features. So if I'm to leave you with something that I have had to learn over the years, over and over again, it's this: if you have big data, try to make it as small as possible as soon as possible. And then, as I learned recently, if you're trying to put it in a UI, make it even
18:06 smaller, because you will crash your browser until you accept that fate. And also, nobody wants to look at a billion rows, or a million rows, or even a thousand rows. That's just not useful.
18:17 It's not what people, especially scientists, care about. Thank you.


