2025 Small Data SF

Uncharted: Building a Semantic Layer + MCP to Map 1.7M Songwriter Connections with Claude Code.

In 2018, the Music Modernization Act set up the Mechanical Licensing Collective (MLC), which also makes its musical works database, a catalog of songwriter data, public. Using this unprecedented treasure trove of songwriting data and modern local data tools, I ask the question: can small data tools allow a single person to map the pop songwriter family tree and connect the dots from Kanye West to Taylor Swift?
Speaker
Sam Alexander

AI Engineer, Knapsack


Sam Alexander is an independent software consultant with two decades of experience. He has developed data products and pipelines for clients from indie record labels Epitaph and ANTI Records, to Fortune 500 brands like Nike and Intel.

0:00[music]

0:05[music] So, I am really excited to be here. Uh, I am going to do something I've seen a lot of folks do. If you've heard of uh Kanye West, not a fan of, but if you've heard of Kanye West, can I see a hand?

0:25Okay. What about Taylor Swift? Okay. Uh, Quincy Jones. Yeah. Awesome. Uh, what about Justin Vernon?

0:36Okay. Awesome. I'm here to talk about a problem that's really interesting to me. Uh, one that kind of touches a lot of folks in ways they don't understand or know or maybe pay attention to, and the ways that that problem is addressed with small data. Another working title for this that I put together is called

0:54Uncharted, and it's building a semantic layer and MCP to map two million songwriting connections with Claude Code.

1:02So this is an ego graph of Quincy Jones's songwriting network. Um, that is the official term for what uh we're looking at here. And there's a few interesting things. First of all, Quincy's there in the middle and you see uh 569 collaborators kind of surrounding him.

1:23These are folks who have more than two uh songwriting connections with Quincy. Um and over here you see, in

1:31kind of this uh softer gold, uh long-term collaborators,

1:35people who have collaborated multiple times over many different works. This is a similar ego graph for Prince, uh Prince Rogers Nelson. He had this really core group of musicians and songwriters who worked with him throughout his career, uh and that made a big difference in his collaboration.
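[Editor's sketch, not from the talk] The ego graphs described above can be approximated in a few lines of Python. This is a minimal sketch under stated assumptions: the input is a list of hypothetical (work_id, writer) pairs rather than the actual MLC columns, the names are placeholders, and "more than two connections" is read as at least three shared works.

```python
from collections import Counter

# Hypothetical toy credits: (work_id, writer) pairs. These are NOT real MLC rows.
CREDITS = [
    ("w1", "Center"), ("w1", "A"),
    ("w2", "Center"), ("w2", "A"), ("w2", "B"),
    ("w3", "Center"), ("w3", "A"),
    ("w4", "Center"), ("w4", "B"),
    ("w5", "A"), ("w5", "B"),
]


def ego_graph(credits, center, min_shared=3):
    """Return {collaborator: shared_work_count} for one central writer.

    Only collaborators with at least `min_shared` works in common with
    `center` are kept (the threshold is an assumption for illustration).
    """
    # Group writers by the work they are credited on.
    writers_by_work = {}
    for work_id, writer in credits:
        writers_by_work.setdefault(work_id, set()).add(writer)

    # Count how many works each other writer shares with the center.
    shared = Counter()
    for writers in writers_by_work.values():
        if center in writers:
            for other in writers - {center}:
                shared[other] += 1

    return {w: n for w, n in shared.items() if n >= min_shared}


print(ego_graph(CREDITS, "Center"))
```

Lowering `min_shared` widens the graph to include one-off collaborators, which is roughly the difference between the tight and the sprawling networks shown in the talk.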

1:55And these are both kind of older songwriters. And this is interesting to me. And here's a more modern model. And when you look at this, this is Taylor Alison Swift. When you look at this, you see a songwriter who has some really core writing partners, right? And who does work with these partners, but then also a lot of

2:16different collaborators who might guest on a particular song or track. And then there's this.

2:23When I got this spit out, I probably thought the same thing as a lot of you, which is like, I have a lot of questions.

2:31There's almost 2,000 collaborators. Why does it look like the eye of Sauron? Uh, I can't read a single one of these, but it does build up these questions, these curiosities of, oh, wait a second.

2:43So, how does sampling play into this? How does hip-hop featuring play into what this network looks like? So, I am not a graph database engineer. I'm hardly a data engineer. Uh, there's a lot of things that I'm not, but one thing I am is curious. And uh I've been working in the industry of data in

3:03music and in media for uh about 10 years. And there's this moment where I felt like my curiosity might actually pay off. In 2018, the Mechanical Licensing Collective released a public data set of 44 million works, which includes 2 million songwriting credits and connections. This includes what percentage of that song uh was written by that songwriter as far as royalties

3:28go and mechanical licensing goes. Uh it includes data about uh were they a composer, a lyricist, a composer lyricist? Uh royalties and ownership were all part of this data set. [snorts] And back in 2018, I thought yes, I'm going to get to explore my curiosity and and actually dive into this question. Now, the reality was in 2018,

3:52I had no way to do this. I I was not capable of actually exploring that curiosity. But here we are in 2025 and the game has changed. Someone who has that uh instinct of saying I want to explore a particular question, their ability to do so has really shifted.

4:08[snorts] This is an embarrassing laptop that I didn't want to put on this. But I I I built the graphs that you just saw here.

4:16I parsed through that data set, and it was myself and uh my laptop, which is an M2. I have a work machine that's a little bigger, but this is a personal project. Um, uh, three weeks. I had months to actually get this done, but then obviously I waited till the last minute and worked on it

4:34for this talk. Uh, I had no team, no permission, uh, and just curiosity. So, I came to this with a question. How do I map the family tree of songwriters? Why ask this question? I I'll tell you a little bit about myself. I in a past life and and at night I still am a songwriter myself.

4:53Um I've been a lifelong musician. Uh I've been a fractional CTO though as well for creative startups uh media companies kind of small mid-market folks who don't have a full-time CTO. If you've ever been in sales and you've been on a call or a sales cycle and and you're working well with a mid-market CTO and then someone shows up with an

5:13email address that doesn't actually match the company and they show up and they say, "Yeah, we're not doing any of this. We need to talk about this stuff." That was me. Uh, I was the one who got called in to help protect my customers' interests, because they might not know this realm and I want to make sure

5:27you're not selling them a car that's too much car for them. Uh, but I'm also fascinated with collaboration and I want to tell you if if you had duck in your name, it was like a free pass. I would call you up and say, "Hey, let's get you in here and let's get to work." Uh, so there's a lot

5:41of folks that I I loved working with, but then I saw a lot of problems that were being sold to my customers, my clients that were big data problems that they didn't have a big data team or big data capability to actually maintain or or or work through those problems.

5:57So I came to uh when I saw the data set released uh it really just triggered in me a curiosity and I I I wanted to approach this and find out can someone using small data tools uh which in a way to me just means I can do it right it means no permission there's this idea of

6:17access, which I think we are all talking about and some of us are saying, but really one of the things that is enabled here is access to data. So I decided yes, I'm going to go for this. And it was immediately challenging. In the old world, a lot of the steps, the process of building out an answer uh to a

6:36curiosity actually removes some risk at every step. And we might take that for granted, but that data pipeline that takes that raw data, transforms it, applies some hygiene [clears throat] to it, it's actually trying to take risk out of every step. And I ran smack dab, face-on, into a lot of risk. So the data wasn't big. It was 285 GB. Certainly way

6:58too big to fit in my 16 GB of memory, and it was hella messy. Um, it

7:05wasn't really necessarily uh in a shape that was very useful. There's obscure columns. This was written by a committee of uh music protocol and standard

7:16writers. And so they did a great job of making a uh data set that would be able to transfer across different music companies, but not great for parsing uh insight from. Uh so this was all kinds of broken when I first started looking at this problem.

7:35One of the benefits that I had, though, is that even though this is a list of songwriters and songs, you know, content of a certain size is data. I would say this might be row by row by row of content, a release or a songwriter, but once you hit a certain size, everything starts to look a little bit like data.

7:55Ben actually referenced a Mao Zedong quote, which is quantity has a quality of its own. And that's one of the things that I was able to take advantage of because I had data of a certain size. I could start pointing small data tools at it and start getting information out of it. Um, and this is an amazing slide.

8:14Uh, a gentleman, Matt Turck, who I don't know if he's here or not, has a landscape of what's in the ML, AI, and data landscape. So, I looked at this and I'm just like, I don't know where to start with this. This actually opens up way more questions than answers. Doesn't give me a place for a

8:34jumping off point. Um, there's a lot of things in here I do know. I've worked as a software engineer and have worked with data. I know who Ralph Kimball is and have been able to build data pipelines for my customers. But when looking at this, saying, I have a laptop, I have a burning curiosity,

8:53Where do I start? I didn't find answers here yet. And I'm I'm hoping that the people in this room are able to help people like me who had this kind of problem because I I think that there is a a a world of industries and niches where you have folks who have a curiosity but they don't yet have the capability. This is

9:14that gap. So, uh, working with folks at, uh, companies, uh, like record labels, or I worked at a mental health company where they had this question of, well, we are connecting people who are seeking mental health with providers of mental health, and we want to know, for the last month, how many people who are looking for an anxiety therapist found

9:36an anxiety therapist in Vermont? Well, you look at that and you're like, "Okay, well, actually, you don't track what state the seeker comes from. They can enter an optional zip code. What does anxiety mean?" So, therapists have a list of specialties. There's actually 12 different ways they might talk about anxiety. These are the kind of questions

9:55that, even though that CEO has that curiosity, the answer to it was, "Okay, let's build a data pipeline. Here's your prescription. Um, we'll build a data pipeline. It'll take us six months and then you can start asking questions." At that point when I went about this, I knew that that wasn't the process. That was the old way of doing things. I

10:12wanted to skip building a team, getting a data expert, actually building out the pipeline itself. Um, how could I go about doing that with what I knew, what I had access to at that point? And this is not to pooh-pooh data pipelines. I actually sold this model for about 5 years, helping folks

10:32get up to speed, from having no data sophistication to having some degree of data sophistication. But what I found was that really a lot of folks just come in at this point, this question mark, and say, hey, I just have these questions. Um, and the answer says, okay, let's build all this, build this

10:51backend to supply you answers. But what if they could just ask the question and start getting that answer? So I used my laptop. I reached for DuckDB, which is my Swiss Army knife. You've heard a lot about that today, so I won't uh go too deep into it. Python, I'm very comfortable with that. Got a uh

11:10solid state drive and converted a huge amount of TSV files to Parquet. And then I used a methodology that uh a good colleague of mine, Hoyt Emerson, has talked about, called the SLAM stack. And so what this looks like, and why it works for someone kind of sitting solo, um, I don't know if this acronym will take off, but I

11:30hope it might, you never know, uh is defining a semantic layer, using a large language model, pulling in an agent, or a harness uh around an LLM and tools, and building out an MCP. And then you kind of supply that with context.

11:48So what I'm going to talk through for the rest of this talk is really how I went about this approach and what I learned by doing it. And [snorts] I hope that this kind of is a little bit of a free uh product research if you're trying to reach folks like me who might not be in uh in the traditional data

12:06industry but work across industries and are trying to build things for themselves to scratch their own curiosity itch. So I started with the semantic layer. Here's the thing. I I know enough about what I'm looking for.

12:17I know what a songwriter is in this parlance. I know that they have a particular ID and a particular way that they are connected to other songwriters, composers or lyricists. And so on my end, what I want us to get to is what's here on the right hand side, which is I want to be able to uh

12:35make a query that says what's the relationship between Taylor Swift and Kanye West? Behind the scenes, I want you to grab those songwriters, pull up information about them, and then do the back-end work that is magic that lets me see their connection. On the left hand side, you see kind of the raw approach to this. And what I found was that if

12:56you supply the LLM, your agent, with this semantic layer, you get far less hallucination of column titles in uh the queries that it's actually writing. And I know there's a lot of work happening in this field, but the semantic layer basically kind of became my uh data pipeline. From a technical perspective, this really just kind of

13:16looks like Pydantic models, uh, MCP tool definitions that you can call from your LLM, and some pure functions. Then there's a large language model, which we all know what this is, so I can keep going. Um, there's a harness, which for me was Claude Code. If you haven't used Claude Code, it's like a text adventure that never

13:36ends. And the MCP tool. So at my day job now, where I work at Knapsack, our job is to connect design system creators uh and uh design system administrators uh with an incredible system to design their design system, but then also to share the outcome of that design system. So if I have a brand

13:55and I I need to build up a new page or a new product, I can use that design system as an input. And MCP has been amazing for that. It's it's like the USB. I've heard this before uh of being able to take your external uh your wherever you need that data, wherever you need those tools and plug it into an

14:13LLM, and the portability around that has been huge. So, uh, this is, if you haven't seen it before, what an MCP uh list of tools looks like to the LLM. A lot of users never get here. They never see this, but these look like instructions you share with your LLM.

14:30And these are, as you see, these are written not as like API calls, but kind of like instructions. It tells you what to do, what to do before you actually call this. And then there's the codex.
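[Editor's sketch, not from the talk] The semantic layer described above (typed models, pure functions, and tool descriptions written as instructions) might look roughly like the following. This is a minimal sketch under stated assumptions: stdlib dataclasses stand in for the Pydantic models the speaker mentions so the example has no dependencies, every field name is hypothetical rather than an actual MLC column, and `TOOL_DEFINITIONS` is an illustrative dictionary shape, not the MCP wire format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Songwriter:
    writer_id: str  # hypothetical identifier, not a real MLC field name
    name: str


@dataclass(frozen=True)
class Credit:
    work_id: str
    writer_id: str
    role: str         # e.g. "composer", "lyricist", "composer/lyricist"
    share_pct: float  # percentage of the work attributed to this writer


def find_songwriter(writers, name):
    """Pure function: case-insensitive exact-name lookup."""
    wanted = name.strip().lower()
    return [w for w in writers if w.name.lower() == wanted]


def shared_works(credits, writer_a, writer_b):
    """Pure function: sorted work ids credited to both writer ids."""
    works_by_writer = {}
    for credit in credits:
        works_by_writer.setdefault(credit.writer_id, set()).add(credit.work_id)
    common = works_by_writer.get(writer_a, set()) & works_by_writer.get(writer_b, set())
    return sorted(common)


# Tool descriptions phrased as instructions for the LLM (illustrative shape only).
TOOL_DEFINITIONS = [
    {
        "name": "find_songwriter",
        "description": "Look up a songwriter by name. Call this first to get "
                       "a writer_id before calling any other tool.",
        "parameters": {"name": "string"},
    },
    {
        "name": "shared_works",
        "description": "List works two writers are both credited on. Requires "
                       "two writer_id values returned by find_songwriter.",
        "parameters": {"writer_a": "string", "writer_b": "string"},
    },
]

# Hypothetical toy data for demonstration.
WRITERS = [Songwriter("w1", "Writer One"), Songwriter("w2", "Writer Two")]
CREDITS = [
    Credit("song-1", "w1", "composer", 50.0),
    Credit("song-1", "w2", "lyricist", 50.0),
    Credit("song-2", "w1", "composer/lyricist", 100.0),
]

print(find_songwriter(WRITERS, "writer one"))
print(shared_works(CREDITS, "w1", "w2"))
```

Because the agent only ever sees the named tools and typed fields, it has less room to guess at raw column titles, which is the effect the speaker attributes to the semantic layer.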

14:40So this is the combination of research that's out in the wild that I pulled into the project to help me cover all the things that I didn't know. Uh, one of the things I'm showing here, there's this little button, this little thing called forward as an attachment.

14:56And I love this because I can highlight everything in a Substack, forward that as an attachment, download that zip, put it in my folder, my repo, for context. I essentially have that newsletter, that Substack, as an adviser at 2 a.m. So, here's the result. I was able to build out this map. Giving it a single songwriter, I could start

15:19looking at what relationships they have with other songwriters. There were some questions that I had, burning questions. This is a moment in 2009 that stuck in uh our consciousness, of one songwriter and another songwriter coming to beef. He's the aggressor here, so she was kind of in the clear.

15:39Uh, but I had this moment of, what would it take, who could have mediated this? Who would be able to arbitrate this conversation? And I looked at that: who's the brave cousin who could like break up this fight over Thanksgiving? And I found that answer.

15:55This answer is Justin Vernon. He has a unique role of collaborating with both of these artists. And that was found not through me doing a breadth-first search uh in the data set, but by me describing the problem that I had, iterating over that problem, and actually having the LLM help me build this toolkit.
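[Editor's sketch, not from the talk] For readers curious what the breadth-first search mentioned above would look like, here is a minimal sketch. It is not the speaker's toolkit. The two edges involving Justin Vernon reflect the connection stated in the talk; "Writer X" and "Writer Y" are hypothetical placeholders, and the edge list is a toy stand-in for the real collaboration data.

```python
from collections import deque

# Toy collaboration edges. The Justin Vernon links come from the talk;
# "Writer X" and "Writer Y" are hypothetical filler nodes.
EDGES = [
    ("Kanye West", "Justin Vernon"),
    ("Justin Vernon", "Taylor Swift"),
    ("Kanye West", "Writer X"),
    ("Writer Y", "Taylor Swift"),
]


def build_graph(edges):
    """Build an undirected adjacency map from (a, b) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the shortest path as a list, or None."""
    if start == goal:
        return [start]
    parents = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in sorted(graph.get(node, ())):
            if neighbor in parents:
                continue
            parents[neighbor] = node
            if neighbor == goal:
                # Walk the parent links back to the start.
                path = [neighbor]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


GRAPH = build_graph(EDGES)
print(shortest_path(GRAPH, "Kanye West", "Taylor Swift"))
```

The interior nodes of the returned path are the candidate "mediators" between the two endpoints.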

16:17So the LLM provided me with the skills that I didn't have. Uh, and when we talk about kind of human in the loop, this is to me something that's really exciting. I'm very happy to

16:31admit the things that I don't know and lean on the LLM for the things that uh it knows. This is an example of me talking with the LLM about what this graphic looks like that I'm going to put into this talk. I asked it to make fonts larger, but what I brought to the table was to say this is not accessible.

16:48There's not enough contrast in this particular uh graphic. How do we make this more accessible? Um, and what I want to highlight is I brought that to the table. I bring that knowledge about songwriting. I bring some knowledge about data, a little bit about UI and viz. And I'm right here in this Venn diagram. And I think that we can't

17:09lose sight of that human who sits in this Venn diagram, even though AI is now showing up in every aspect of this.

17:18And similarly we had Bon Iver's Justin Vernon, who was uniquely positioned to be able to maybe take two people who are at odds or at ends and collaborate with both.

17:32Uh, as an innovator I often said this. I would teach people about this sweet spot of innovation between feasibility, desirability, and viability. There's this mythical sweet spot: if you had decided that you were going to be a data engineer, if you had learned data, machine learning, and AI 10 years ago, I think all of us think, if I had

17:52just done that, I would be in that sweet spot. Most of us here probably did do that, so well done. I'd be in that sweet spot of being able to be uh relevant, like, forever. But I think that this sweet spot is actually, there's many of them. I have a client who's into podcasting, daily

18:10fantasy sports, and data. Um, I have a

18:14neighbor who's into machine learning, figurines like Warhammer, and cataloging. And I want to fight against kind of this idea that we're all kind of moving in the same direction. What I see is that there's all sorts of sweet spots being built up, where people who are curious enough to be able to ask a question right there in this

18:35middle piece will be able to expand their relevance and maintain a spot that's unique to them that no one else necessarily has a a particular need or want to be there in the same way that they are.

18:50So, I've got some tips. This will be uh on the slides later on, but you know uh I think that one of the great things I learned through this process is how to go about remaining curious. I'm going to skip to this statement which is I um as

19:07professionals, a lot of us are trained to find answers, but for a lot of folks I think that you need to remember that it's more human to really explore those questions. There's some people that I recommend following. Uh, some of these people are here today.

19:24These are people who occupy their own niche. And I really appreciate your time. Thank you. [music]
