Better Data, Smaller Models, Bigger Impact
Speaker

Dr. Shelby Heinecke leads an AI research team at Salesforce, focusing on cutting-edge AI for product and research in emerging directions including autonomous agents, LLMs, and on-device AI. Some of her team's most notable agentic AI works include the open-source multi-agent platform AgentLite and the "Tiny Giant," a competitive 1B model for function calling. Shelby earned her Ph.D. in Mathematics from the University of Illinois at Chicago, specializing in machine learning theory. She also holds an M.S. in Mathematics from Northwestern and a B.S. in Mathematics from MIT.
0:08 It's the end of the day, everyone. I know you heard some great talks about small data, right? And how small data is going to drive efficiency in analytics, going to drive low latency in analytics. Great. And today, what I want to tell you is that in the AI model world, a similar movement is forming.
0:25 It's taking place right now, and it's powered by what we call small language models. Who here has heard of small language models?
0:32 Awesome. A good number of you. Great. So, that's really what I want to talk to you about today. I want to tell you what small language models are.
0:39 You're going to see a lot of parallels to the small data that you've heard about all day. I want to tell you about their benefits, and then I want to tell you how you can build small language models that are competitive, that can even beat large language models. So, let's get started, everyone.
0:53 Before we get started, I want to give a quick intro about myself. I'm Shelby. I lead an AI research team at Salesforce.
1:00 And as an AI research team, our job is to think ahead for the customer. We're always thinking 6, 12, 18 months ahead for the customer, sometimes farther. And in doing that, we're publishing research papers. We're open sourcing a lot. So, what I'm excited to share is that everything I'm going to talk about today is completely open source.
1:18 Everyone, I'm going to give you QR codes. You can check out the small models that we built and the training data. We'll get to that. But ultimately, as the AI research team, we shoot for deployment. And on the way to getting there, we're open sourcing a lot.
1:29 We're sharing a lot of our work. My team focuses on AI agents, pushing the boundaries there. Robotics, super new.
1:36 We're thinking about robotics now. And small language models, which is the focus of the day, but which also power robotics, agents, and on-device AI. I'm going to talk about that a little bit today, because if we can make models small enough, we can now have LLMs on our phones, on our laptops. We're going to get into all of that today.
1:55 So, let's get started, everyone. Small language models. Some of you here already know about them. They're called small in comparison to something I know you all know: large language models. And if you look here, you can see a lot of very well-known large language models. Some of these names are familiar to you.
2:17 I know you've heard of DeepSeek. You may have heard of Mistral Large. I know you've heard of GPT; I have GPT-3 up here. And what I want you to see is the size of these large language models. You can see DeepSeek has 671 billion parameters. When I say parameters, I mean the number of weights
2:39 in its deep neural network. 671 billion. Mistral Large, 123 billion. GPT-3, 175
2:48 billion. What I'm trying to tell you is that a lot of the LLMs, the large language models that you're used to, that you see in the news, are often hundreds of billions of parameters in size. Now, what I'm telling you about today are small models. They're a fraction of the size, everyone. Now, there's no formal definition of a small
3:05 language model, but I like to think of small language models as roughly no more than 13 billion parameters. And today, we're seeing even less. We're seeing models of hundreds of millions of parameters emerge. So, this is really exciting. They are a fraction of the size, and with that fraction of the size come so many benefits. And this is where you're going
3:26 to see a lot of parallels to small data. First and foremost, if there are fewer weights in the deep neural network, then it consumes less compute, right? Fewer weights need to be stored in RAM. Fewer computations, so fewer GPUs are needed, fewer CPUs are needed, and less disk space is needed. So these small models consume
3:47 less compute, which makes it easier for all of us to deploy.
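To make that concrete, here's a quick back-of-the-envelope sketch of what it takes just to hold the weights in memory. This is a rough illustration, not an exact figure; the real footprint depends on precision, architecture, and runtime overhead:

```python
def weights_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate RAM needed just to store the weights (fp16 = 2 bytes/param)."""
    return num_params * bytes_per_param / 1e9

# Weights alone, before activations and KV cache:
print(f"DeepSeek (671B): ~{weights_memory_gb(671e9):,.0f} GB")  # ~1,342 GB
print(f"GPT-3    (175B): ~{weights_memory_gb(175e9):,.0f} GB")  # ~350 GB
print(f"1B small model:  ~{weights_memory_gb(1e9):,.0f} GB")    # ~2 GB, phone-sized
```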
3:53 Now, another benefit is low latency. If there are fewer model weights, then when I put a prompt into that model, it's going to be a lot faster than plugging that prompt into a huge model, because there's simply less math to do, right? The output is going to be generated more quickly. Low latency is extremely important, especially today as
4:09 we're deploying LLMs and agents. We don't want people waiting for output, right? So low latency is a huge benefit. Now, because they consume less compute, as we talked about, they can be deployed anywhere. And everyone, this is a really exciting direction. You're going to see more and more of this in the coming months and years. We don't
4:28 only have to deploy LLMs on the cloud
4:32 anymore. Yes, the cloud is always going to be there. But we can also deploy LLMs on our phones now, on laptops, on on-prem clusters, on smartwatches, on glasses, on robots, you name it. That's all because the models are small enough to fit on this type of hardware now. And because of that, we're going to
4:53 get more privacy. That's the beauty of small language models. If I have a model on my phone and I pass in a prompt, that prompt never leaves my phone. That prompt stays on my phone. It never gets passed to a third-party server. Any output generated also stays on my phone.
5:08 It never gets passed to a third-party server. That's amazing. Imagine you want a personal LLM, a personal agent navigating maybe personal documents. You don't want that on a third-party server, ever. So now, with small models, this is all possible.
5:22 The other aspect to consider is that, in general, this means we all have more control now. Right? With small models, we can all have our own models. If we're deploying models, we can own them, right? We can serve them. We can fine-tune them. We can control them. Lots of benefits to small models.
5:39 Now, I know you all agree with me that this sounds amazing, but none of this matters if the performance isn't there, right?
5:44 The performance has to be there for all of that to matter. So, that's the big question for today: can we get small models to perform competitively, as well as large models?
5:56 I have some great news to share with you. With the right training, and that is the big point here, small models can be just as powerful as large models on specific tasks. So that's what I want to get into today. I want to tell you how we're doing it at Salesforce Research.
6:17 So what are we doing at Salesforce? Well, we're all about agents. Agents, agents. Everyone is. I know you hear a lot about agents. So that we're all on the same page: an agent is an LLM system. You can think of an agent as an LLM system that uses the LLM as its brain. An
6:39 agent gets a task, and given that task, it calls the LLM and says, "Hey LLM, write a plan; break this task down into manageable steps." The LLM does that. Then the agent calls the LLM over and over again to execute each step in that plan. That's a very basic definition of how an agent works.
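As a minimal sketch of that loop in Python: `call_llm` and `execute` here are hypothetical placeholders for an LLM API and an action executor, not any particular Salesforce interface.

```python
def run_agent(task: str) -> str:
    """Bare-bones agent: the LLM is the brain; the loop takes the actions."""
    # 1. Ask the LLM to break the task into manageable steps.
    plan = call_llm(f"Break this task into steps:\n{task}").splitlines()

    results = []
    for step in plan:
        # 2. Call the LLM again for each step to decide which action to take.
        action = call_llm(f"Task: {task}\nStep: {step}\nWhich action should I take?")
        # 3. Execute that action and carry the result forward.
        results.append(execute(action))

    # 4. Ask the LLM one last time to turn the results into a final answer.
    return call_llm(f"Task: {task}\nResults: {results}\nWrite the final answer.")
```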
7:00 Now, one of the most important steps for an agent, and really what differentiates an agent from an isolated LLM, is the fact that an agent can actually take action.
7:12 Okay, so let's look at this example. In this example, we're asking: what meetings are on my calendar this afternoon?
7:21 Now, if I pass that to an LLM without any ability to take action, that LLM can't get this right. The LLM is trained on historical data. There's no way it's going to know what's on your calendar this afternoon. But if we give this LLM the ability to take action, then it can get this question right. So, in other
7:40 words, imagine an agent gets this prompt. The agent can say, "Hey, LLM, what action do I need to take to answer this prompt?"
7:51 And this is what we call function calling. This is the function calling task for LLMs.
7:57 You can see here that the LLM is able to generate a completely executable function call. It selects the correct app and it selects the correct function. By the way, this is a challenging task, because you can imagine any agent is going to have access to tons of apps, each of which has
8:13 tons of functions. It also completely defines the variables, the parameters, so that this action is completely executable. This is the key step to what makes an agent an agent: the fact that it can generate this function call that the system can then execute. That's how agents can move autonomously through your system.
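As an illustration, a generated call for the calendar question might look like the following. The app and function names are hypothetical, just to show the shape of the output: the model picks the app, picks the function, and fills in every parameter so the call is directly executable.

```python
# Prompt: "What meetings are on my calendar this afternoon?"
# A fully specified, executable function call (hypothetical schema):
function_call = {
    "app": "calendar",
    "function": "list_events",
    "parameters": {
        "date": "2025-06-03",   # resolved from "this afternoon"
        "start_time": "12:00",
        "end_time": "18:00",
    },
}
```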
8:35 So, with this step being so important for an agent, we asked: can we make this step more efficient, and can we make it fast? Because today we're starting to get pretty sophisticated with agents. Customers and people want agents to do tasks that take multiple steps now. Now, if each one of those steps is slow or taking time, that's
8:58 wait time, right? We want our agents to move fast. And now, with agents being deployed everywhere, we need them to be efficient, cost-efficient, right?
9:08 So that's exactly what we did. We thought about how we could make this function calling capability faster and more efficient. And what we've built at Salesforce AI Research is what we call large action models. We built a series of open-source models that are experts at this function calling task, everyone.
9:28 And in particular, the ones I want to tell you about today are our small action models, the smaller models of this suite. These small models come in sizes 1B, 3B, and 7B. If you remember from that first slide, that is a fraction of the size of many LLMs that you're familiar with. Very, very small, faster.
9:50 At that size, these are lightning fast. And again, the most important piece here, everyone: we call these tiny giants because at 1B, 3B, and 7B, they outperform
10:04 models more than 10 times their size on agentic tasks. That's the beauty of this. So let me show you what I mean. What I'm going to show you here is a well-known agentic leaderboard. This is a totally public leaderboard. It's called the Berkeley Function Calling Leaderboard, one of the first that were released. It tests
10:23 models' ability on simple agentic tasks and complex ones, ones that take multiple steps, ones that require multiple conversation turns with users. And this is a screenshot from when we first released these models, several months ago. Now, what I want you to see on this leaderboard, first, I want you to see that
10:45 numbers one and two are our models, our xLAM models. But I want you to notice these are our 70-billion and 32-billion parameter models.
10:53 That's great. Large models are here to stay; large models are powerful. This shows that the large model just has more learning capacity. It's great to see that they're one and two. But here's what I really want you to look at. I want you to look at that red box. Our 8B model, everyone. 8 billion parameters is
11:11 number four. 8 billion parameters. Think about that trade-off: 8 billion parameters, only slightly worse than the top two, right? And beating, at the time, GPT-4o.
11:26 So these small models, when trained well, when trained correctly on quality data, and we're going to get into that, they have so much power. Now I want to also show you where our other models stand.
11:38 Again, this 3B model, we're not expecting the 3B model to be number one, but the fact that the 3B model is hanging up there with the big giants is pretty impressive. And finally, where does that 1B model stand? Yeah, that 1B model is lower on the list. But look at the giants around it, and look at the giants it's beating,
11:56 everyone. One billion parameters is beating models that we know are at least 10 times larger, if not more.
12:05 So, this is a lot of potential, right? So, how did we do it? I alluded to it already. It came down to the training data.
12:15 So, we started with open-source pre-trained models, everyone. We didn't pre-train from scratch. We started with open-source pre-trained models, and then we fine-tuned on very, very high-quality data. So, let me tell you how we did that.
12:32 So, when we think about fine-tuning, we're training the model, right? We're training the model to behave the way we want it to. And if we want this model to be an expert at taking action, at calling functions, we've got to train it on data that shows it how to take action, how to
12:48 write function calls. So here's an example of one data point, everyone. Now, this is completely simplified and made up, but for demonstration purposes, you can see here the task is: what is the phone number and email of my friend Astro? Imagine an agent gets that. Paired with that task, we have the correct actions to take to execute that
13:10 task. In this case, it would be navigating contacts and getting the phone number of Astro, so that's the correct API call. Then, second, we need another API call: navigating contacts and getting the email of Astro. Both of these are the correct API calls to complete that task. Now, this is one data point, a single data point. We generated thousands of these data points.
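In rough form, that single data point might be represented like this. This is a simplified sketch with illustrative field names, not the exact xLAM training schema:

```python
data_point = {
    "task": "What is the phone number and email of my friend Astro?",
    # Ground truth: the two correct, executable API calls, in order.
    "actions": [
        {"api": "contacts.get_phone_number", "parameters": {"name": "Astro"}},
        {"api": "contacts.get_email", "parameters": {"name": "Astro"}},
    ],
}
```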
13:31 So what you just saw is an example of a single, very simple task. Sometimes we call it single-step function calling: a task that requires maybe one function call, maybe a couple, but single-step, very simple.
13:48 Our training data has to include these, because a lot of agentic tasks are simple, right? It also includes the more complex multi-step, multi-turn function calls. More and more today, agents are dealing with tasks that require multiple steps and a lot of engagement with the user. So we have a lot of data similar to
14:08 what you just saw that takes more steps, that takes more interaction with the user. And finally, our dataset had to be diverse across a wide range of tools, a wide range of APIs, a wide range of domains, so that the model could really learn how to navigate lots of different tools.
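And a multi-turn data point, sketched the same illustrative way (again, not the exact training schema), interleaves user turns, assistant replies, and ground-truth function calls:

```python
multi_turn_point = {
    "conversation": [
        {"role": "user", "content": "Book me a table for tonight."},
        {"role": "assistant", "content": "Sure, which restaurant and what time?"},
        {"role": "user", "content": "Bistro 24 at 7pm."},
        # The ground-truth action, only issuable after the clarifying turn:
        {"role": "assistant", "function_call": {
            "api": "restaurants.book_table",
            "parameters": {"name": "Bistro 24", "time": "19:00", "date": "today"},
        }},
    ],
}
```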
14:30 Now, that type of data you saw, action data, it's not everywhere. It's actually pretty new. So the big question we had was: how do we get that data? Think about it. If you were to brute force this, you could yourself think of a few tasks your end users do and write the
14:52 API calls. Maybe with some annotators, you could think of some ways to generate a few examples like this. But we need this at scale, because we're training models, right? We need to create thousands of these, and we need to
15:08 be able to create thousands of these for lots of new use cases all the time, for lots of new domains. And so what we created is an open-source data generation pipeline for exactly this, everyone. It's called APIGen.
15:27 And this is our framework. This is the pipeline that we used to train the models that you just saw on that leaderboard. Step one of this framework involves LLM generation of tasks and tool calls. So we do use LLMs to help synthesize some of this data. But as you can imagine, LLMs
15:46 make mistakes, LLMs hallucinate. We can't just rely on that. That would be low-quality data. We have a second step, and that's where all the work gets done.
15:54 Lots of different levels of verification, like actually executing the API calls that the LLM suggested to make sure they're valid.
16:01 Lots of different error checking. There are LLM committee evaluations. That's where all the quality happens.
16:08 And in that step, data points are actually going to be removed, going to be thrown out, if they are not at a certain quality. Then, in step three, we added in simulated conversations as well.
16:20 That's how we make these action data points look realistic and capture the nuances of actually interacting with the user.
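Here's a minimal sketch of what that stage-two filtering could look like. It's illustrative only: `execute_api_call`, `llm_committee_score`, and `QUALITY_THRESHOLD` are placeholders, and the real pipeline is described in the APIGen technical report and GitHub repo.

```python
QUALITY_THRESHOLD = 0.8  # placeholder quality bar

def verify(candidates: list[dict]) -> list[dict]:
    """Keep only candidate data points whose generated calls actually work."""
    kept = []
    for point in candidates:
        try:
            # Execute each suggested call to check that it's valid.
            outputs = [execute_api_call(a) for a in point["actions"]]
        except Exception:
            continue  # invalid or failing call -> throw the data point out
        # An LLM committee then scores semantic quality; low scores are dropped.
        if llm_committee_score(point, outputs) >= QUALITY_THRESHOLD:
            kept.append(point)
    return kept
```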
16:31 And everyone, as I mentioned, these are both open source. Feel free to scan these QR codes to check them out yourself. The xLAM one will take you to our Hugging Face page.
16:40 You'll be able to see all the models that you can download and try. The APIGen one will take you to our APIGen technical report. You can see how we built it, and you can check out our GitHub to actually try the framework yourself.
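For instance, pulling one of the smaller xLAM models down with the Hugging Face transformers library looks roughly like this. The model ID below is an example; check the Hugging Face page from the QR code for the current list:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example model ID; see Salesforce's Hugging Face page for the full list.
model_id = "Salesforce/xLAM-1b-fc-r"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "What is the phone number and email of my friend Astro?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```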
16:52 And there are samples of what this data actually looks like, the data we used to train these top-performing small language models. Thanks so much, everyone.


