From 1-parameter models to 100-billion-parameter models: how progress on one benefits all
Speaker

Ravin has been applying generative models to real-world problems since before it was trendy. He focuses on large language models (LLMs), trust and safety, and the responsible deployment of AI systems. With over a decade of experience applying mathematics, statistics, and computation to real-world challenges, his work spans diverse domains including aerospace, food systems, and open source. Ravin is a core contributor to PyMC and is deeply committed to making complex math accessible and practical.
0:00[music]
0:05[music] Hello folks. I'm Ravin. I'm a senior researcher at Google DeepMind, and I'm going to tell you how small models and big models share the same DNA.
0:17And so in this talk I'll just briefly go over what I'm going to cover.
0:21I'm going to talk about how we started with models, how it's going right now, and then at least my predictions of what's to come.
0:29And so when putting this talk together, I was really just asking, what should I cover? The theme seemed pretty fun, but the conference is focused on engineers, so I went with a technical tutorial. But I figure I'll also try to give a history lecture at the same time, and at least a
0:48motivational speech about where things are going to go. So we're going to jam that all in.
0:53And so who am I? In my small data era, I was a Bayesian data scientist. I worked at SpaceX and Sweetgreen, and I wrote this book, which is free online and covers a lot of the stuff that I did when I was at those companies. I'm now in my massive data era. I work at Google DeepMind, as I said,
1:11and I work on the Gemini and Gemma series. And so sometimes I know it can feel like this: AI and LLMs are just taking up so much space in the conversation. They're taking up all the space on your GPUs.
1:25And small data can feel like it's this little thing to the side, and we're really just talking about massive, massive models.
1:33But the takeaway I want you to take from this particular talk is that small models and big models all share the same fundamentals. There really isn't this tension; big and small aren't mutually exclusive. There's plenty of space for both. And I really do believe it is the best time for small models, better than it's ever
1:51been. And so I was going to try to live code, but I couldn't use my laptop here, so all the code and everything is at these links. You can also just use the tiny URL. There's nothing to hide, and it all runs in one Colab, in the spirit of small models.
2:07So let's talk about where it started. There was an era before the current big AI era; it was maybe 1956. This is the famous Dartmouth summer where a bunch of folks said they were going to solve AI. Clearly they haven't yet, and we're still waiting. But I'm going to go back even
2:26further, to the 1700s, with Thomas Bayes. I said I was Bayesian, and I wasn't kidding.
2:32This is Bayes' formula; I write it down because it's the basis for everything that I believe in. And what we're going to do is build a language model together. This is a small data set; there are five or so data points here. It's a language model because we're going to be working on language.
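For reference, the formula being shown is the standard form of Bayes' rule; in the prompt-and-completion framing of this talk it reads:

$$
P(\text{completion} \mid \text{prompt}) = \frac{P(\text{prompt} \mid \text{completion})\,P(\text{completion})}{P(\text{prompt})}
$$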
2:47And if I create a prompt that is "welcome to", the question is: what is it going to predict? It's 200 years ago; we don't have computers, so we're going to use pen and paper. And so this is our corpus again. You have three things that say "welcome to small data SF", and then a couple that say "hello from". We
3:06can just list out the probabilities. This one is 2/3, this one's 1/3, this one's 1/2, and this one's 1/2. That didn't take too long. But it is a language model. You have these prompts,
3:16you have completions. It came from Bayes' formula. There's nothing crazy or magical or even new about this. It came from an idea that's 300 years old at this point.
3:25And so it doesn't take modern TPUs to do all this stuff. You could build, if you had a lot of paper and a lot of pencils, a large language model using very, very old ideas.
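As a rough sketch of that pen-and-paper exercise, here is what the counting looks like in Python. The corpus strings are a hypothetical reconstruction of the slide's five examples (not the exact ones), chosen so the counts come out to the same 2/3, 1/3, 1/2, 1/2.

```python
from collections import Counter

# Hypothetical reconstruction of the five-line corpus from the slide.
corpus = [
    "welcome to small data SF",
    "welcome to small data SF",
    "welcome to the conference",
    "hello from the stage",
    "hello from San Francisco",
]

def completion_probs(corpus, prompt):
    """Count each completion that follows the prompt, then normalize the counts."""
    counts = Counter(
        line[len(prompt):].strip()
        for line in corpus
        if line.startswith(prompt)
    )
    total = sum(counts.values())
    return {completion: count / total for completion, count in counts.items()}

print(completion_probs(corpus, "welcome to"))  # 'small data SF': 2/3, 'the conference': 1/3
print(completion_probs(corpus, "hello from"))  # 'the stage': 1/2, 'San Francisco': 1/2
```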
3:39And so the fundamentals are the same. One thing I struggled with on this talk is that I could come and talk about some technology that comes up today, but I think the better takeaway is that if you learn the fundamentals, which I think I heard in the panel as well, you're going to be good for the rest of
3:53even this large data era. It's not like everything is being reinvented. The first principles really stick around. So now let's talk about how it's going.
4:03I'm going to talk about, say, 2012 to about now. If you remember the 2010s, which feels like forever ago, almost 300 years ago if you ask me, there was this job title: data scientist, the "sexiest job of the 21st century". There was a lot of hype about data science style stuff. There also,
4:18though, was this AI neural network track as well. This is AlexNet, which also came out in 2012. This was one of the first models in this age of modern, larger neural networks. It happened to win an image competition and kicked off this sort of dense model boom.
4:38And so we had different tools. There were these DS tools: scikit-learn, XGBoost, PyMC, things like that. And you also had the stack of AI and neural network tools, like PyTorch or TensorFlow, that were used to build dense neural networks.
4:53And so for me, my track at the time was building rockets at SpaceX. And the tools that I used were whatever got the job done. It didn't really matter whether it was a big tool or a small tool.
5:04At the time, these are rockets that were actually built at SpaceX. These are first stages from a Falcon 9 rocket in Hawthorne, in Los Angeles. This was my data set. So what's funny is I talked to a lot of folks here and they're telling me, "I have small data, it's only like 60 terabytes." I'm
5:20like, guys, there was a point where this literally was all the data that I had.
5:23Your data is massive. The question, though, which always came up is: how long does it take to build a rocket? Because I had a famously impatient boss who had really strict deadlines and a lot of questions about things getting done faster, and this is how long it took to build these rockets. One of them took 5 days, one of
5:40them took 10 days, one of them took eight, one got delayed and took 20, then 43, then 23.
5:45I needed to make sense of this so we could make big business decisions. And the tool that came up for me was PyMC.
5:49It's a Bayesian probabilistic programming language and software, and it was really applicable to my use case.
5:56So I built a model like this. This is a small data model built on, again, literally six data points. And what this model helped me do is take those six data points and come up with this, which is a posterior predictive distribution of lead time. And so what this helps me do is say: what are the
6:11probabilities that the rocket's going to take 10 days or 20 days or 60 days or 100 days? And we can make business decisions. Is it worth paying $20,000 to speed up a part? Is it worth investing $20 million to get a new machine? Or are we going to spend $100 million and get a new factory? Because,
6:29again, my particular boss at the time was very impatient and would want things to go quicker and quicker and quicker. And one of those ways was building more factories or new tooling or things like that. But we couldn't just do everything. We had to make decisions, and ideally decisions using data.
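This is not the exact model from the talk, but a minimal PyMC sketch of the idea: a Gamma likelihood over the six lead times (my assumption), with the posterior predictive answering "how long might the next build take?"

```python
import numpy as np
import pymc as pm

# The six lead times (in days) mentioned in the talk.
lead_times = np.array([5, 10, 8, 20, 43, 23])

with pm.Model() as lead_time_model:
    # Weakly informative priors; the Gamma likelihood is my choice here,
    # not necessarily the parameterization used in the original model.
    mu = pm.HalfNormal("mu", sigma=30)        # typical lead time in days
    sigma = pm.HalfNormal("sigma", sigma=30)  # spread of lead times
    pm.Gamma("lead_time", mu=mu, sigma=sigma, observed=lead_times)

    idata = pm.sample(random_seed=0)
    # Posterior predictive: the distribution of how long the next rocket takes.
    idata.extend(pm.sample_posterior_predictive(idata, random_seed=0))

# One business-style question: what's the chance the next build takes over 30 days?
draws = idata.posterior_predictive["lead_time"].values
print("P(next build > 30 days) ≈", (draws > 30).mean())
```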
6:46And so what we're going to do is take our small model, our small language model this time, and rebuild this thing here with the 2/3s, 1/3s, and 1/2s. So here's what it looks like in PyMC again. You have our two-parameter model, a Dirichlet distribution prior on a categorical likelihood, and you get
7:06pretty much the same answer. You get 50% for the top two; you get 1/3 and 2/3 at the bottom. You get some more uncertainty because we get to do sampling, and life's amazing.
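Here is a minimal sketch of that two-parameter model, assuming the completions after "welcome to" are coded as categories 0 and 1; the variable names are mine, not the slide's.

```python
import numpy as np
import pymc as pm

# Completions observed after the prompt "welcome to":
# 0 = "small data SF" (seen twice), 1 = the other completion (seen once).
observed = np.array([0, 0, 1])

with pm.Model() as tiny_lm:
    # Flat Dirichlet prior over the two possible completions.
    p = pm.Dirichlet("p", a=np.ones(2))
    pm.Categorical("completion", p=p, observed=observed)
    idata = pm.sample(random_seed=0)

# The posterior mean roughly recovers the 2/3 vs 1/3 split from the
# pen-and-paper count (nudged a little toward 50/50 by the flat prior),
# with uncertainty around it because we sampled.
print(idata.posterior["p"].mean(dim=("chain", "draw")).values)
```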
7:17The great news, and this is part of it as well, is that the tools that were used to build the big language models (Theano and JAX and TensorFlow and PyTorch) helped build PyMC. PyMC wasn't their first customer; that wasn't why those tools were originally built. TensorFlow was built for neural networks. But because these tools
7:36existed, because they were invested in, we can reapply them to the small data domain. And this is the same thing I'm doing now at Google. We have Gemini. It's a large model, and nobody would debate that, but we're taking that and packing it into Gemma, which is our smaller series of models. It's the same story again:
7:52a lot of innovation and investment happens in the big space, but the benefits then flow out to everyone, even at the smaller scale.
8:01And so, Gemma 3. I also see a lot of things about single node, small data, single node. We really target Gemma to be the best model that can run on a single node, an H100 or below. It's a competitive open ecosystem, but with 24 million downloads we're pretty positive that somebody likes our models. And
8:18we also build even smaller variants. So we have 4B and 1B, which run on my Pixel device, and 270M also runs on our Pixel device. So single node could mean a single Android phone that's 3 years old. Again, it takes the big stuff from Gemini and we really cram it into a smaller parameter model.
8:37And so you can download these now. I've seen the Ollama logo in a lot of places;
8:40I think some folks have the swag and you see the llama logo. That's their cute logo, and it's a super great place to run models. If you want to grab the Gemma 3 models, that's where the URL is.
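If you want a feel for what grabbing and running one of these locally looks like, here is a minimal sketch using the ollama Python client. The gemma3:270m tag and the response fields are my assumptions, so check the Ollama docs for the version you have installed.

```python
# A hedged sketch: assumes `pip install ollama`, a running Ollama server, and
# that Gemma 3 270M is published under the "gemma3:270m" tag.
import ollama

ollama.pull("gemma3:270m")  # download the weights once; they are small

response = ollama.chat(
    model="gemma3:270m",
    messages=[{"role": "user", "content": "Welcome to..."}],
)
print(response["message"]["content"])
```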
8:57So small models are awesome because big models became awesome, and we're really, again, distilling those big models into smaller models here. And I haven't talked to Shelby; I haven't met her yet, but she has a talk, which I'm glad I didn't give because we would have had the same talk, about how you can take small models. And I believe, although don't quote me on this, her talk is going to be
9:12a lot about how small models get built and the techniques you can use to make them have a bigger impact. I think it's about an hour; I would highly recommend it. And so the takeaway here I want to stress again is that big model innovations benefit small models. So there isn't this dichotomy.
9:28It's not like big models and small models need to be in tension. I cheer on big models now because I know a lot of that knowledge will flow into the smaller space. And so now let's talk about what's to come, or at least what I predict is to come.
9:45One is that I really do believe small models will keep doing amazing things. Even at Google, you know, we cheer on all the innovations we get at the Gemini scale, but we're also seeing many innovations at the smaller scale. I think just two weeks ago, a Gemma 27B model was heavily fine-tuned to be really good at predicting, I think,
10:03signals for early cancer. This is the Cell2Sentence model. I didn't personally work on this model, but I thought it was really cool that you could take a language model, really shift it into a different domain, and produce original results.
10:16And cancer, of course, is something that we all want to get better at understanding and addressing as a society.
10:25I also see a world where small models and big models work together. So this was outside of Google, but Bayesian models are used very heavily in cancer sorts of situations, because you don't have big data sets. For a lot of reasons, we're trying to make the data set of cancer as small as possible. We
10:40don't want more data points in that area. So you need small models, and particularly interpretable models, in this domain to treat people and to understand the outcomes. And so a bigger model like Gemma 27B can be used to predict how we can treat cancer, and then smaller Bayesian models can be used to assess whether that prediction
10:59and those mechanisms worked or not. And the other one is that big models accelerate small data. So this is the same model that we had earlier. In my SpaceX days, I think this would have taken me about 45 minutes to an hour to write. I'd have to write every single one of these characters by hand. I'd have to debug some colon and
11:18some shape error, things like that. But these days with Gemini, I can almost one- or two-shot these things. Instead of focusing on the little minutiae of making sure that I didn't mix up a double parenthesis and a single parenthesis, which has burned me a ton of times, I can prompt Gemini and focus on the
11:34important part of the problem rather than all the minutiae of the code. I also heard this in the panel as well: a number of the CEOs and folks that were here were thinking about how these big models can help in small data situations.
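As a rough illustration of that workflow, here is a sketch with the google-genai client; the model name, client setup, and response handling are assumptions on my part, so adapt them to whatever the current SDK looks like.

```python
# A hedged sketch of "let Gemini write the small-model boilerplate".
from google import genai

client = genai.Client()  # assumes an API key is set in the environment

prompt = (
    "Write a PyMC model for six rocket lead times [5, 10, 8, 20, 43, 23] days "
    "with a Gamma likelihood, and sample the posterior predictive."
)
response = client.models.generate_content(model="gemini-2.5-flash", contents=prompt)
print(response.text)  # review the generated model before running it
```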
11:51And the last thing I want to bring up is just the size of the models themselves.
11:55This talk and this conference were about small data, and I was like, what is small data? What size is small? How do we even bring that up?
12:07And so at the time, AlexNet, which we saw at the very beginning, from 2012, was 60 million parameters, and this was considered large scale. One of the authors on this paper had to write GPU code by hand across, I think, two GPUs that had 6 gigabytes, and it was considered massive. Then
12:27we get to GPT-1. It feels like forever ago, but that was 2018. This was an early large language model. But here we are: I just released this model two months ago, Gemma 270M, and we called it a tiny model, yet it is three times bigger than the model from a decade ago, and it is twice as big as the large
12:43language model that kicked off this revolution. The idea is that today's big becomes tomorrow's small. And this idea of what small data is, like I talked about with my SpaceX example, keeps shifting. And so this was one of the template slides we got. It said it's time to think small. But to me, it's always been time to think
13:03small. It's really more that small is better than ever, with better tools, with the small of today being much bigger than the small of yesterday, and with all the ideas and ecosystem and attention that data is getting in general. And so the other part, the motivational part: I came into this conference and
13:24every two seconds I kept seeing or hearing something from folks. I grabbed this one off of one of the screens: "Small data and AI is more valuable than you think." Fully agree with this one. I've personally seen the value of small data.
13:37Even in a company like Google, which has a lot of big data, some of the biggest data, small data still helps with business decisions. Small data, in the right area, can still push models in certain ways that a massive corpus can't. So I absolutely agree and believe this is true.
13:54"Build something awesome." Totally agree with that as well. I'm going to pitch you again: you can download Gemma and build many things off of the Gemma series. The supply, though, is not limited. The good news with models is you can download as many as you want, and especially with small models, the
14:10bandwidth cost is negligible and you can play with them a lot. They run on your MacBook or your phone, so you're not paying extra inference cost. Just mess around with the models as much as you want and you won't be racking up any sort of bill.
14:26And I fully agree with this as well. Even though I work at Google and I have the opportunity to work with planet-scale data now, where I didn't before, I really do believe in the joys of simple and small models. I still have fun going home, playing with small models, and seeing results in 20 or 30
14:42minutes versus sometimes waiting a month; large model pre-training runs can take a long, long time where you're kind of just sitting around. So there are situations where less is more, where if you really focus in and come up with a small data set with the right model, you get mighty results. And so I hope I covered
14:58everything. I hope you got a bit of a historical, technical, and motivational talk. But I really want to emphasize again that in this era of big models and small models, there's not this tension.
15:08They both can live together and both help each other. So thank you for the time.
15:15[music]


