DSPy Primer by Mike Taylor


Listen to this behind-the-scenes impromptu training between Mike and the Every.to team. Mike gives an introduction to the DSPy framework.

September 29, 2025

Transcript

I actually met up with the founder last week, a really smart guy who came out of Stanford. They've used it internally at Stanford for a few other projects; Storm, which was a precursor to deep research, was built on DSPy. Oh, interesting. So I'd say DSPy is one of those open secrets that a lot of people are using. Someone went sleuthing and found that the CEOs of all the major vibe coding apps, Replit, Windsurf, follow DSPy, but they never talk about it. And the reason, I think, is that the documentation is very academic. It doesn't talk much about real-world business use cases, and it assumes you know what things like multi-hop are, or that you're familiar with a specific academic dataset that nobody uses in the real world. Once you trudge through all that, you'll find it's actually pretty simple. Okay. It just looks very complicated because the people who made it are really smart, and therefore I think they have a bit of a blind spot in terms of making it accessible. Yeah. They also settled on this weird domain-specific language that is not very Pythonic. Here is a DSPy signature, which, if you extract it out, is literally just defining what the inputs and outputs of your prompt are. Okay. And these are the instructions, right? "Extract structured information from text." Right. So you can give really simple instructions, and that ends up going into the final prompt. That's weird. Yeah, really weird. And then you have to specify the types, and there's `dspy.InputField`, `dspy.OutputField`, right? Once you've created a signature, that is your prompt; it builds the prompt from that. But once you've got it in that format, you can swap out models really easily, because it's literally the same signature, and it has an adapter underneath that makes it run on any of the different LLMs.
So you don't have to worry about how Google does formatting of inputs and outputs, or how OpenAI does. But the really big thing is the optimizer. Once you've told it the inputs and the outputs, and you've given it an evaluation metric and a dataset to evaluate on, it can automatically improve the prompt. So that's the real payoff. Yeah. Once you see that magic in action, okay, now it's worth it to go through all this extra effort. Yeah. And you can pretty regularly see a 10-percentage-point increase: you can go from, say, 40% accuracy to 50% accuracy, and in some cases I've gone from 50% accuracy to 90% accuracy. The really difficult thing about DSPy is that it forces you to really think about what you actually want out of the program. Quite often, when people struggle with DSPy, it's actually because they're struggling to specify what they want the program to do: what evaluation metric do I want to use to define whether this program is working or not, and formalizing that metric. Most people building AI tools are doing vibe-based evals, right? This is what I do, and what you should do, in the beginning. You say, hey, I think the output looks good, or I think the output looks bad, or my CEO thinks the output looks good, or he's noticed it does badly in these scenarios. So you start vibe-based, and it's very hard to go from "my CEO likes this" to "we have a formal dataset of inputs and expected outputs, and an evaluation metric that can check whether the outputs are good given the inputs." That is actually the hard part, but it's the hard part of building AI applications. It's not DSPy's fault. Yeah, that makes sense. Yeah, that is where my mind trips up. I'm like, okay, so this is "describe something," but I'm used to being in the details. And this is actually saying, no, don't be in the details. Let us do the details.
But I'm like, but how do I even steer it? I think a lot of the tension with DSPy comes from the fact that they've almost accidentally built the world's greatest prompt optimizer, but what they really want to be is a full framework. They want you to use this domain-specific language and not care about the details, because you can just compile the program with a different model if you want, and not worry about what the program did, just whether the outputs have good evals. Yeah. So one question, because LangChain was also "oh, don't worry about the details," but at some point no one even understood what was going on. It does something, but no one knows how it works. I'm a little bit worried that this is a LangChain thing: sure, we promise a great library, but if no one knows what the hell is going on, what's the value? So I'm curious what your take on that is as well. Exactly. So I used LangChain a lot early on, and then I started digging into it, and I was like, wait, hold on, they've just wrapped a basic Python statement in 100 lines of code for no reason. Yeah, I always had this love-hate relationship with LangChain. But I would say where DSPy is fundamentally different is that LangChain says "just use our framework and it will make everything easy, don't worry about the details," but sometimes you really do have to worry about the details, and they're not really giving you anything in return. DSPy is giving you the optimizers. You genuinely can ignore the details if you have a good evaluation metric, and that's the key. And I would say where DSPy is really useful is when you have a formal evaluation metric that you trust. Before, we had one, which is "did it guess the category correctly?" Exactly. Categorization, which is super easy. So for a use case like that it's probably ideal. Right. Exactly. Yeah.
There are a million different ways to build AI applications, and what I found, because I was working as a prompt engineer for the past few years, is that there's basically a single pattern you can use and ignore everything else: the evaluator-optimizer pattern. What this solves is that quite often the task you want to do is fuzzy; there's no real formal evaluation. If you're doing a blog post generator, you could check the length of the blog post, but how do you check whether it's good? What are you optimizing for? Yeah, exactly. So the way I would typically approach that with clients is to ask, okay, who's your domain expert? It might be the CEO, it might be someone else. I worked with a team of psychologists, for example, on a personality-quiz-type thing, and their time is really valuable, so you don't want them manually evaluating every single response. What I started to do was this LLM-as-a-judge pattern. Don't worry about evaling the actual generator, the thing doing the task; instead, build a judge to replace the domain expert. Yeah. And do you prompt that, or fine-tune that, or do you actually use DSPy for it? Exactly, so that's also a DSPy program. Okay, so you do that as well. Yeah, exactly. The judge will check whether it passes the test and give you some sort of score. So in my DSPy metric, the LLM judge is my DSPy evaluator. It lets me handle the fuzzy cases: as long as the judge agrees with me, or with the domain expert, most of the time, 80% or 90%, then you can trust it to do the optimization. And it doesn't cost too much, especially if you can get the judge working with GPT-4o mini or one of the cheaper models, and that's usually possible. It's actually a much easier job to evaluate whether something is good than to create something that is good.
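The "trust the judge once it agrees with the expert 80-90% of the time" check can be sketched in plain Python; label the same items with both the judge and the domain expert, then compare (names and the threshold are illustrative):

```python
def judge_agreement(judge_labels: list[bool], expert_labels: list[bool]) -> float:
    """Fraction of items where the LLM judge's verdict matches the domain expert's."""
    assert len(judge_labels) == len(expert_labels), "label both lists on the same items"
    matches = sum(j == e for j, e in zip(judge_labels, expert_labels))
    return matches / len(judge_labels)

def judge_is_trustworthy(judge_labels, expert_labels, threshold: float = 0.8) -> bool:
    """Only hand optimization over to the judge once agreement clears the threshold."""
    return judge_agreement(judge_labels, expert_labels) >= threshold
```

Once this passes, the expensive expert only needs to re-label occasionally to re-validate the judge.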
Plus, this is the sense in which you need LLM-as-a-judge, because obviously for categorization you don't need that. Yeah, exactly. Categorization is itself a judge-type task. The fuzzier type of task, the example I'm using today in a notebook I'll share afterwards, is telling a joke: getting AI to tell a funny joke. If you just ask it, it uses dad jokes, not real, standard, comedian-type jokes. The first thing you want to do is train a judge to check whether the joke was good or not, just one or zero. It can be really simple. Are you always binary, or? I actually found binary works much better, and it's better to stitch together a bunch of binary evals than to use a scale like "rate this one out of five," because LLMs tend to be too positive and everything's a four. Yeah. So I just do one and zero quite often, and you can build up with weights. Say you're generating an article for Every: you could ask, what are all the different things I care about in a good article? It needs to have a catchy hook at the beginning, it needs to be this length, and so on. You can string all of these together into one master eval and weight them, and say a catchy hook is 80% of the value whereas length is 10%, whatever. So that's the way I think about it. Cool. So let's jump into the code. Install DSPy. And what about sentiment, like positive, negative, neutral, or is that too fuzzy? The way I see it, that is still, in a way, a classification. Okay. I would say it's important that the categories are mutually exclusive and collectively exhaustive. Yeah, as in there's no ambiguity between them. Exactly. Yeah. Okay, that makes sense. The way you set up DSPy is very easy: it's just `provider/gpt-4o-mini`, or whatever model you're using.
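The "master eval" of weighted binary checks described here might look like this sketch; the individual checks and the weights are illustrative placeholders (in practice the fuzzier ones would be LLM-judge calls returning 0 or 1):

```python
def has_catchy_hook(article: str) -> bool:
    # Placeholder heuristic; in practice this would be a binary LLM-judge verdict.
    return article.strip().endswith("?") or article.lower().startswith("imagine")

def good_length(article: str, lo: int = 500, hi: int = 2000) -> bool:
    return lo <= len(article.split()) <= hi

def master_eval(article: str) -> float:
    """Weighted sum of binary checks, normalized to the 0..1 range."""
    checks = [
        (has_catchy_hook(article), 0.8),  # the hook carries most of the value
        (good_length(article), 0.1),
        # ... further binary checks fill out the remaining weight
    ]
    total_weight = sum(weight for _, weight in checks)
    return sum(weight for passed, weight in checks if passed) / total_weight
```

Stitching binary checks together like this avoids the "everything's a four" problem of 1-to-5 scales while still producing a graded score to optimize against.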
You can set up multiple models, and use one as the teacher for the optimizer and one as the student doing the task; I'll show you that in a second. But you just generate, and you can put your prompt in. So you can actually just use DSPy as a scripting language, right? You don't actually need to use the optimizers at all, and I use it quite often now for throwaway programs and Python experiments, which is nice. There are a couple of nice features DSPy has straight out of the box. One is that it's probably the easiest way to get something working on Azure or AWS or the like; it's literally just adding a couple more environment variables. And if you wanted to see how this runs on another LLM, you could just add in Anthropic, right? It also has caching built in. You can see how fast that ran, right? But if I change the temperature (it defaults to zero), if I want one, it'll run again, because I've changed one of the parameters, so it will skip the cache. You can also pass cache equals false if you want. But you see how fast that just loaded from the cache? It can save you a lot of money when you're doing experiments, even if you didn't remember you had used that combination before. Yeah. Or in optimization, if it happens to get a cache hit, it will just use the cached version; it won't cost you anything. Cool. Then, to create a program, you can actually do it in one line if you want. This is a basic joke program: it just takes a topic and gives you a joke, right? Everything on the left side of the arrow it turns into an input variable, and everything on the right side it turns into an output. Oh, is this their DSL? Exactly. They've added that as well. Again, extra confusing, but for convenience it's quite fun.
And if you wanted to add, I don't know, a comedian input, you could do that, and the program will just work. So I really don't like that. Yeah, it is terrible. I use Ruby, I don't like Python, so I use DSPy.rb, the Ruby port, which has a much nicer way to do this: they don't have a DSL in strings. Yeah. So, a couple of hacks and workarounds for doing it the more Pythonic way. You can see the type of dad joke it gives: "Why do Python programmers prefer dark mode? Because light attracts bugs." It's a classic. Now, one of the fun hacks to get use out of DSPy is that you can just see what prompt was run last. So this is n equals one, the last thing that was run on the LM, and that's a global thing, so you can see the actual prompt here. This is what goes in. Actually, it's probably better if we scroll. It's turned that whole thing into literally just the system message here: "Your input fields are topic. Your output fields are joke," and it's given the types. "All interactions should be structured in the following way." By default it uses this markdown-type format rather than JSON, but you can change it to JSON if you want. The default does tend to work better, actually; JSON mode, people have found in testing, ends up making it a little less intelligent. But yeah, then it puts in the user message for you, and then you get the response. The really nice thing is it's all automatically formatted, and then all automatically parsed afterwards for you. Do you look at these normally before running, to see if you can optimize it, and how do you tell if it's good or not? I used to really hate this because I don't like the style they use, but it works. Now I just use whatever they have straight away, and it tends to work.
Okay, I'll micromanage my really important prompts. The core prompt at Rally is not using DSPy, but I run all of my experiments with DSPy, and when I find something that works, I'll incorporate it into my main prompt. Yeah. Interesting. So, because it's in this format, it can go and optimize this. One of the optimizers adds few-shot examples. It'll test a bunch of the examples you give it, see which ones are best, and add them, and it automatically puts them in as user message, response, user message, response. So you don't have to write all the convenience methods you might normally need, which is quite nice. Cool. So basically you run the optimizer first and then store that version, or something like that? You can, yeah, exactly. But this is my more Pythonic way of creating DSPy programs. I just create the fields and the instructions, and then, buried deep in their library, there's this `dspy.make_signature`, which doesn't appear in the documentation anywhere but works. You can give it whatever signature name you want, the instructions, and the fields, and it uses all of this. Some of the optimizers are program-aware, as in the optimizer shows itself what the program looks like, even the names of things. So what you name things is actually super important to the DSPy prompt. It's funny, because that's how the Ruby version works: you have a class that is the signature, with an input, an output, and a description, and you run that, which makes more sense to me. It's like this. Yeah. So you see here I put `.predict`; that's just the base predictor, but you can change it to chain of thought. And then now you have... Yeah, so I've just printed the output joke.
But if I just print the output itself, it's a prediction object with reasoning and a joke; it's put the reasoning in here. A really cool thing is they have a bunch of built-ins: the ReAct pattern for agents and tool use; sampling, where it generates five versions and chooses the best; things like that. They have a lot of these prompt engineering techniques built in, which is quite nice, and it's just a one-liner to add them, which is useful. Cool. Can you use multiple predictors and then go from there, like in a multi-step? Yeah, exactly. Once you've created the signature, then essentially any valid Python object can be a module; you can string loads of different predictors together and create complex workflows. That's where you get more into the LangChain-type stuff. Yeah, like multi-step synthesis and all that. Exactly. Just to show you: I just brought in Gemini Flash, and all I needed was my Gemini API key. I don't need to worry about how Google currently does things. It's really nice. And you can see here, by the way, you can have a globally configured LM: I configured the LM up here somewhere with `dspy.configure`, and it will just use that one automatically. But you can also run it in context. You say `with dspy.context(lm=gemini_lm)`, and it will use Gemini for anything in there, while the OpenAI configuration stays completely untouched. For some of the modules you can also pass in the LM directly, but I found that's inconsistent, actually. So yeah, you can do different steps in pipelines with different models and things like that. Exactly. Yeah. This is a chain-of-thought example, which we've already put in there, and you can print out the reasoning. Because it's all automatically parsed, you can just do `output.reasoning`, `output.joke`. So you really don't have to worry about the internals.
And is there a feedback loop, where you give feedback and it looks at the reasoning and sees if it can optimize from the feedback and the reasoning? Yeah, some of the optimizers work that way. Yeah, I'll cover that. But it's more of a static thing: you run an optimization job; there's no active learning. Yeah. But, for example, if I have hundreds of people saying this summary needed something more, or something like that, and the chain of thought and everything is also saved, I could run it once in a while to update the prompt with an optimizer, to optimize more for that output? Yeah, exactly. And then you start to get into this really healthy virtuous loop, because you have the eval metric: you can improve the eval metric to cover that new use case you've seen coming up and then run the optimization job again; but you can also go back and change how the program works, or use a different optimizer altogether and see if that works, or try it with Gemini and see if Gemini does better, or whatever it is. Yeah, cool. Again, this looks silly, but this is actually how you build a real DSPy program. You create the class, with the instructions, the input field, and the output field, and then you create another class for the module. It took me a long time to get my head around why this is necessary, but essentially it allows you to build programs of arbitrary complexity. `dspy.Module` is the base class that `ChainOfThought` and `Predict` inherit from, so it's a way for you to add your own prompt engineering techniques.
So you define an `__init__`, and it inherits from the module, and in this case I just create the joke generator, right? Then you just have to define a `forward` method, which takes the inputs, in this case just topic, and gives you the outputs. This is a simple one: it does the prediction, then ignores the chain of thought and just returns the joke. Yeah, but you could call another predictor or do whatever you want there as well, right? Exactly. You can see the result is basically the same: "Why did the Python script need therapy? Because it had too many deeply nested if statements and couldn't handle the indentation." Yeah. But yeah, cool. So that's just getting a script working. This is where it gets really powerful. Here, what I'm doing is training my optimizer, because the way I think about it is: we used to have test-driven development; now it's eval-driven development. You have to formalize your eval, and if you formalize your eval, then you can train an arbitrary program, whether through prompting or fine-tuning. Actually, DSPy does fine-tuning for you as well; they have this BetterTogether optimizer, which does both fine-tuning and prompt optimization. Crazy. But, funnily enough, I find that prompt optimization almost always beats fine-tuning. Yeah, I've never had fine-tuning work in my work. Yeah, unless you've got thousands of examples. I think there was a paper that said that until you have around 2,000 input-output pairs, prompting beats fine-tuning. Okay, great. What I did, and this is another little hack I use to build a dataset, is I just went online and found a lot of funny jokes; I used deep research, and then told it to give the results to me as Python. So I got the topic, the joke, and then the comedian, just for attribution.
So I got it to create... Well, you look like Ricky Gervais, if you... Yeah, I'll take that as a compliment, not that you meant it as one. He's great. And then I took the unfunny jokes. It sounds like a complicated task to build a dataset, but quite often you just need to go and select things that you think are good, and you don't even have to explain why you think they're good. Then, for the things that are not good, you just generate a bunch of GPT answers, because that's what you're moving away from: you're trying to get it to not sound like ChatGPT. Yeah. So here I literally said to ChatGPT (I used mini just to make sure it was especially dumb), just give me a bunch of jokes on these topics, and that's how I got these. That's a really good hack for building the negative side of a dataset. "Why was the belt arrested? It held up a pair of pants." It's so bad. There are a couple of useful things to think about here, and the naming is inconvenient: the documentation doesn't really explain what it means or why you would have a training set versus a validation set. As best I understand it, the training set is the data you give the optimizer to train on. It can see that data, it will evaluate on it, and it will use some of those examples when it's optimizing, adding them as few-shot examples. The validation set you can also give to the optimizer (you don't have to), and it uses that as the test set: it will look at the data, try a bunch of things, and then run a test on the validation set. If you don't give it a validation set, it will just evaluate on the full training data, so it can end up costing quite a lot. That's one of the primary reasons.
Also, I think the results don't generalize as well if you don't have a separate test set. And the reason you need a development set is that the optimizer also sees the test set while it's running; sometimes things bleed across, and you won't get really good generalization. You'll get good optimization scores, but when you try a new joke or whatever, it won't be very good. The development set is the one the optimizer never sees: you just run an evaluation yourself after the optimizer is done. That's really important, I think, for proving to the client or to stakeholders that it actually did a good job. It's part of the loop. Yeah, you need the evaluation step at the end, but it's just running; it's not training or optimizing anything. Exactly. The way I split it, usually, is I'll try to get more than 100 or 200 examples of good versus bad, and then I'll split it 60/20/20. Yeah, on new examples, I assume, right? Yeah, exactly. So you can see here we split this out. A couple of weird things: the way DSPy wants the examples is also a bit odd. You see here it's `dspy.Example`, and then you just give the different parameters. In this case, because it's the judge, we want `funny=True` to be the output, so you do `.with_inputs` and list the ones you want as inputs, and it will assume funny is the output. And you always use examples for optimization, right? Yeah, that is what a dataset is for DSPy; that's how they format it. Exactly. And they have internal machinery that essentially turns that into a pandas DataFrame. Yeah. Cool. So we've got our dataset. Here's an example: we've got the topic, the joke, and whether it's funny or not.
So, we create our judge, and this is a case where it's actually convenient to do the inline thing, because we don't care that much about this judge's internals; we just care about whether it's accurate, not how it gets to accuracy. We're just giving it a topic and a joke, and it says whether it's funny or not. You can also set types in here, so I set this as a boolean, because I found that otherwise it gives you back a big spiel about why it's funny or not. You can see here, in this case, the judge says it's funny, true, but the actual ground truth is false, so the judge was wrong. And I made it chain of thought, because chain of thought tends to really improve the results of judges. It thinks it's funny, but it's not, so we need to train our judge. So we can actually do an evaluation of our judge. The `Evaluate` convenience method is really helpful because it runs in threads, in parallel, which makes it run a lot faster. It'll run eight at a time in this case, but you can set that to whatever you want, which is nice. Then you just pass it a metric and the development set, the one we kept aside that the optimizer won't see, if that makes sense. The final judgment: we can see here it's 51% accurate. And you can see the actual examples. This one wasn't funny, but the judge predicted true, that it was funny, so it got that wrong. Right. This one was funny, and it predicted it was funny, so it got that one correct. The nice thing about judges is they can have an exact-match score. And the way I've set the metric up, again specific to DSPy, you have to have this `trace=None`. That allows you to set a different metric response depending on whether the optimizer is evaluating or learning from the failures.
So if trace is not None, it's learning, and you could give it stricter evaluation criteria there. The reason you might want that: in this case it's just exact match, so it's easy, but maybe you have an evaluation metric like "rate this one out of five," so you have a graded score; when the judge is learning, though, it's deciding which few-shot examples to put in, so you might want to say I only want five-star ratings as my examples, right? You don't have to do that, but that's why it's there. It automatically passes in the prediction, and then the gold, which is essentially the ground truth from the dataset. In this case, we're just checking whether `prediction.funny` equals `gold.funny`, the simplest possible metric. Cool. So we have a judge that's wrong half the time. How do we make it better? This is where we bring in an optimizer, and this is the simplest one: it's called BootstrapFewShot. All it does is add, or actually create, a couple of few-shot examples, then check them against the evaluation metric; if the evaluation metric passes, it will add them into the prompt. So when you run that, it synthesizes examples that pass? Yeah, exactly. So here we go: we have `max_bootstrapped_demos` and `max_labeled_demos`. Max labeled demos means demonstrations, as in examples, from your dataset: you're allowing it to add up to 16 few-shot examples from your training set. And max bootstrapped demos is allowing it to generate new examples that aren't in your dataset but still pass your metric. So would you ever run this without examples, or do you always add examples? Always add examples, yeah.
Because it needs them. I guess you could just make max labeled demos zero and it should still work, but you would never do that. Yeah. I would even go to the length of running the program 100 times to generate the dataset, and then... Yeah, and then you manually pick the examples, or whatever. You can also set a few things: you put the metric in there, but you can also set a threshold. You can say I just care about 80% accuracy, so if it gets 80% accuracy, stop optimizing. You can also set the teacher. You can say I want o3 to be the teacher, but I still want mini to do the task, so it's distillation: whenever it generates a bootstrapped demo, it will use o3 to generate it. Yeah. That's quite an effective approach. And how do you choose the number of examples, and which examples, and when do you change them? I find it's very much an art rather than a science; I've seen really weird results sometimes. But broadly speaking, I'll try it with the defaults and then diagnose. There are a couple of different optimizers. By the way, this one doesn't change the instructions; it keeps your instructions and only changes the few-shot examples, and usually that's enough for classification tasks. Typically, I'd say you want at least five, maybe more like 10 examples in the final prompt, maybe more than that, to get good accuracy on a classification task, but you don't want too many. Obviously, there's a cost to running this. This just happened instantly because it's a deterministic optimizer and I've already run it before, so it pulled from the cache, but it can actually take 10 minutes or so to run, and it can make a few hundred or a few thousand API calls.
So it can actually cost into the hundreds or maybe even thousands of dollars, depending on how big your dataset is. Yeah. DSPy has forced me to basically be as dumb as possible: don't care about the prompt, just do the inputs and outputs and see if it works straight off the bat. If it doesn't, run an optimizer and see if that works. If it doesn't, run a more powerful optimizer, and then go, okay, I'm going to run a few experiments: I'll do it with 16 demos, or with five demos, and just see what works. So I only apply complexity on demand; it's almost like just-in-time addition of complexity. Yeah. And, for example, say you made one a year or half a year ago: do you ever go back to it? Because people give feedback, things change, and models change. When do you decide to redo this or add other examples? Yeah, I would say typically I rerun a prompt when we discover a new dangerous edge case: oh, everyone's complaining about this now, so I'd run a job specifically to solve it. Or when I've collected more than 50 new user responses, because you could have a whole system with a thumbs-up, thumbs-down button (that's what I have), which generates the dataset for you, and then you just rerun it and see how well it does. Yeah. It is a continuous thing. And the dream, what I want to get to with our system, is that it's just running overnight, every night, and it keeps improving, and I don't have to worry about it. Yeah. And really, it's the examples. So how do you then choose the examples? Or do you just give all the examples and say, you figure it out; the optimizer chooses the examples? Exactly. You don't really need to deal with any of it. You just use whatever the default is, 10 or 16, whatever it is. Yeah. Exactly.
And I find that for simple tasks, BootstrapFewShot works pretty well, and it's the cheapest one to run — it's pennies. BootstrapFewShotWithRandomSearch works much better for more important tasks. Again, it only chooses examples, but it adds a random-search component. The basic one just goes through the examples and picks, which might land on a local maximum; random search literally hops around your dataset and picks out candidates. So it's more compute intensive, but it does a really good job. Yeah. And then — so that's optimizing the judge. And here we go. Obviously the eval would run a lot slower if it weren't cached, but just by adding a few few-shot examples we have 92% accuracy now. That's cool. Wow. So that's amazing, right? We didn't even have to think about it, we didn't even look at the prompt, and we got an 80% improvement with a couple of lines of code. That's the magic. Yeah, I'm for sure going to use this for categorization next week. Yeah, categorization is perfect for it. But then every fuzzy problem is also a categorization problem, right? So now the judge is the categorization problem. We can use that judge as our metric — and you'll see it won't work as well, and it's more intense, but it gives you a path to get the fuzzy task under control. So now I've created a new metric, which is the judge score. And again, I'm literally just calling the judge — my bootstrap-optimized judge, the program that we created. And that's the optimized one? Yeah, exactly — the final optimized judge that we created here. I'm just using it as a program: I pass in the topic and the joke, and it gives a one or a zero. So that's really great. And then you can create a dataset.
In this case, I basically filtered the examples and changed the inputs a little, because before, if you remember, we had the topic and the joke and then we checked whether it was funny or not. This time I'm just passing in the topic and getting a joke out — so I had to create a new dataset, the topic dataset, where the only input is the topic, and it generates a joke. The other thing I did is filter it, so I'm only giving few-shot examples where the joke was funny, because I find it's actually really detrimental sometimes to give negative examples. You want it to have only positive examples, because when you give it negative examples, it tends to follow them. Weirdly, it's like saying "don't put your hand in the toaster" and it goes and puts its hand in the toaster. But you can experiment with that. What about edge cases — how would you choose to include edge cases in the examples, or do you just include them and they get handled? Yeah, what I would do is take the edge case, get your domain expert to rewrite what the correct answer should have been, and then use that as the few-shot example. Yeah, because otherwise it's confusing — it will throw it off. Exactly. Or you could do a rewriting-type prompt, which would be another DSPy program that takes the wrong answer and rewrites it into the correct one. And then, in terms of optimization, what I've done here is take the topic dev set — those same examples we had before, just given the topic — and it found that out of 51 topics, only 20 of them resulted in funny jokes. So we got a 39% funny score from our joke generator baseline.
So now we're using the big guns: MIPRO, which is a Bayesian optimization algorithm that very smart people worked on, and it doesn't just optimize the few-shot examples — it also changes the prompt instructions. You can actually run it as a pure instruction optimizer if you want, by setting max_bootstrapped_demos and max_labeled_demos to zero. In this case I set it to heavy — you can choose light versus medium versus heavy. Because it's non-deterministic, you want to set a seed as well, so you get the same results; otherwise every time you run it, the result could be worse or better. You can also set the initial temperature — it starts at that temperature and then tries different temperatures on either side, so it optimizes those parameters for you too, which is quite nice. So pass that in and — I must have messed something up, because it's running again. There we go. Great. Oh no, sorry, it was just running through the cache. Cool. Yeah, I'm using mini, so it's fine — but that could be a $100 mistake if you're using o3 or Pro or something. Just to show you — again, a lot of this won't make sense, but you can see it's supposed to optimize the instruction text and the few-shot examples, and balance between different optimization strategies. First it bootstraps traces, meaning it generates new few-shot examples, with a certain number of attempts to get a funny joke, over a bunch of different rounds. Then it proposes instruction candidates, using the few-shot examples from the previous step, a generated dataset summary, and a summary of the program code — so it's program-aware: it can see what the code of the full module looks like. And then it randomly selects a prompting tip to propose instructions with, right?
You can actually override those as well if you want, by subclassing MIPRO, but I'm too afraid to try. And then, so, it proposed instructions: it did have "tell a funny joke about the topic," but then it came up with this one: "generate a humorous joke related to the specified topic, suitable for a general adult audience; be mindful of potentially sensitive content." Right? So that's its new prompt. And which model was it using to generate this prompt? By default it just uses the same model you passed in, but you can set it: you can choose the prompt model and the teacher model. The difference is that the prompt model rewrites the prompt and creates candidates; the teacher model generates new few-shot examples. Okay. Yeah. So it goes through all this, and you can see it runs for a long time — lots of trials here, and you can set the number of trials and so on. But once you have that, you can run the evaluation, and then: 49%. Yeah. So that's actually better, for some reason — maybe I didn't run it with that seed before, but it actually went up from 39% to 49%. It did get a lot funnier. It still does a few dad jokes and things like that, is what I found. But the way you would go to improve this further is to improve the judge, right? Because we only had about 92% accuracy there, and we also had pretty easy examples — the dad jokes are very obviously bad and the comedian jokes are very obviously good, so we don't really have many in the middle. So I would go a bit more fine-grained now if I wanted to improve it further, and I would give it some harder examples to train on. Can you see some of the jokes? Yeah, I'll show you that in a sec. This looks great — show me. Yeah, show me the money. Yeah, this is the basic one. The topic was Python: "Why do Python programmers prefer dark mode? Because light attracts bugs."
MIPRO's version is: "Why did the programmer quit his job? Because he didn't get arrays." It's not hilarious, but it's better. That's better. Yeah. So this one's a little bit better: "Why did the coffee go to the police? It got mugged." "I like my coffee how I like myself: dark, bitter, and too hot for you." Okay, this is a bit better: "Why did the bicycle fall over? Because it was two-tired." And then MIPRO said, "I hate when I lose my motivation to exercise. It's like, where do these extra 10 pounds keep coming from?" I think that's much better. But yeah, anyway — you can see it's a little bit harder for these fuzzier tasks, but what we did is not to be sniffed at, right? We went from 39% funny to almost 50% funny. You'd obviously keep working on this, but the fuzzier tasks are just harder. It was definitely interesting. So what I'm taking from this is: the judge is everything. Yeah. Actually, I'm radicalized on this to the point where I don't think data matters. Everyone says it's all about your data, your unique data. I say no — you can generate infinite synthetic data if you have a good judge. But you also need data for a judge, right? You need good data to create a good judge. Yeah — you only need the data to create a good judge. Once you've got a good judge, you don't really need more. And that's what you hear from the labs as well: a lot of them are doing synthetic data now. I have two questions. One: how do you productionize this? Do you export the prompt and run that? Yeah. So I don't productionize this — I've never put a DSPy program into production for a client. Exactly. I could never convince them to do it, and I wouldn't want to — it's kind of complicated. Though maybe I'm warming to it; I actually have an internal tool that uses DSPy end to end.
Yeah. Okay. There is a way to get it running with FastAPI. One of the things that's important is that you can save the program locally — that creates a folder with all the metadata and a pickle of the program. If you share that folder with someone, they can literally just load the joke program and it runs perfectly on their machine. So can I export a prompt? Is that something you do? Yeah, so this is my special hack — that's what I would want too. I pestered Omar insistently about this, because he really hates that people just want to get the prompt out, and it's not that easy: even inspect_history shows the outputs and things, and it's hard to reconstruct it programmatically. Anyway, he said you could just do this, and okay, fine: you pass the signature and the demos for each of the named predictors into the format function of the ChatAdapter, which is what it uses under the hood. It's a — yeah, don't worry about it too much; your eyes will bleed if you look at it. But basically, if you run this, you get the full prompt, and it's exactly the same — exactly the same. It's got the system message, and it turns the demos into user and assistant messages as well. And for variables — I normally use Liquid templates, but it uses something else, I assume? Or, importantly, it doesn't use variables at all, which is interesting. So how do I make it use variables — how does it even work? It just has instructions, then a user message, and then it follows up with the reasoning and so on from the user message. So how do I trigger this thing, then? Yeah, exactly — you can see here it doesn't use variables; even this isn't really a variable, it's just formatting. And the only place it uses the variables...
So here it just puts in the inputs and the outputs. Yeah. And then, when you generate, you can see that the user message is just "topic" like this, and then "religious satire". So I have to use the exact same format? Yeah, you'd have to use the same format — or you could use the DSPy ChatAdapter if you wanted to. Or I replace those things with Liquid templates, like how I normally do it. Yeah, you can also use a different chat adapter. You can create your own chat adapters, so it automatically converts everything into your own format, the way you like to do it — whether that's markdown or whatever. So you can see we've got the prompt out here — this is just the messages array, essentially, so you could actually pass it straight into ChatGPT, or into the OpenAI API. So I have one other question. It's interesting to see that it does its few-shot examples as alternating user and assistant messages. Is there a way to say: I want the examples inside the system prompt? You just have to change the chat adapter. And it's all open source, right? So you can see what they're doing and just subclass it, which is quite nice. But yeah, I've actually found that it works better to put the few-shot examples in the system message — it tends to follow them better. So I don't fully agree with the way they've done it, but this is the way that you're supposed to do it. Again, it's one of those hacky things where I've found that doing it the "wrong" way sometimes works just as well. And again, the nice thing is that you can still benefit from everything by just changing the chat adapter: if you change the chat adapter and optimize with that, you can recompile the program, and it will change the few-shot examples — all of the inputs and outputs.
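Short of subclassing the adapter, the "examples in the system prompt" preference can also be done as a post-processing step on the exported messages. This helper is an assumption of mine, not a DSPy API — a minimal sketch of folding the alternating user/assistant demo messages into the system message:

```python
# Assumed helper (not a DSPy API): collapse the adapter's alternating
# user/assistant demo messages into the system prompt, since some models
# follow few-shot examples better there.
def demos_into_system(messages: list[dict]) -> list[dict]:
    system, rest = messages[0], messages[1:]
    demo_msgs, live = rest[:-1], rest[-1]  # the last message is the live input
    demo_text = "\n\n".join(
        f"{m['role'].upper()}: {m['content']}" for m in demo_msgs
    )
    merged = dict(system)  # copy so the original message list is untouched
    if demo_text:
        merged["content"] = system["content"] + "\n\nExamples:\n" + demo_text
    return [merged, live]
```

A proper version of this would live in a custom chat adapter subclass, so the optimizer itself sees the same format — which is the point made next about recompiling with your own adapter.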
It would change everything to match your format — because the chat adapter is used when running the optimization? Exactly. Every single time any prompt runs, from any of the optimizer processes, it goes through the chat adapter to the LM and back. And one last question from me. I have memories tied to specific users, and I inject memory into the context to personalize things. How would memory work with an optimizer — or do you just leave it out completely? Yeah, I've thought about this a lot, and the way you do it is to give it as a specific field in DSPy. They actually have a history field — there is a concept of memory, and they've got a tutorial about it. We have the same thing with Rally: the past chat history. There's a specific dspy.History that will format it the correct way, as a message stream. Okay. But then — we have 5,000 users, so how...? The way you've got to think about it is like a spreadsheet: for every interaction with a user, you have the same columns. The input from the user is this column, the memory for that user is this column, and maybe you did some RAG, so the RAG result is this column. And you can actually build RAG into the program as well — you can do all that stuff. I think they've got an example here. Yeah, here's an example. Oh no, sorry — yeah, RAG. So here they've used ColBERT as the retriever and passed it in as search_wikipedia. Or they have agents, where they define evaluate_math and search_wikipedia and pass them in as the tools, right?
You can actually build whatever program you want — you just need to make sure your examples contain the user ID or something like that, and then it can do the memory lookup for that user, and then it's included in the optimization. Exactly. And that's nice, because it will make it extra generalizable: since the memories are different every time, it will, I think, truly learn the task, if that makes sense. Yeah. So every row is just a user interaction, plus whatever context the interaction had at the time. Yeah, and you include it however you want. Exactly. So I have a question, if you have the time. You say you don't use DSPy in production — how do you go from the prompt that you have in production to DSPy? Good question. So typically what I'll do is run the optimization. There are a few things I tend to use it for. One is just as a scripting tool to try different things and different models. So I'll ask: does this even work? For example, I was trying to make a focus group where the agents talk to each other, and it was much, much easier to just give them a talk tool, make a ReAct agent in DSPy, let it run, and just see if it worked. That gives me the broad strokes of whether an interactive format works or not — and it was crappy, so I abandoned it. It's really good for that: I use it to test theories of what might work, and if it works in DSPy, then I'll go into my codebase and build it properly. The other way is that I'll actually run the optimization — I'll create the eval metric — but then I'll export the prompt, like you saw here: I'll get the system message and so on, and then translate it into my format. So I'll just say: here's how I structure my prompt.
It's not perfect, because you might actually change the evaluation score when you change the format, but it's pretty easy to just paste it into ChatGPT and say, "Hey, I got this program from DSPy — can you just adapt the few-shot examples?" So how do you go the other way? To be more specific: I have my prompts in production. Say I have a very fuzzy task that I want to evaluate against. How would I take that and put it into DSPy to evaluate? Yeah. The canonical DSPy answer would be: don't worry about all the crazy stuff you put in your prompt. Just describe the task as the instructions, define what the inputs and outputs are, dump in your eval dataset, and focus on the eval metric. I've had pretty bad results doing that — if you've got a really lovingly handcrafted prompt, it actually takes quite a while to get back up to its level. I actually have a joke prompt that's not safe for work because it's very good at doing Dave Chappelle — I even trained it on his voice, though I didn't use his name because I didn't want to get sued — and it literally passes the Turing test. This was back with GPT-4, before 4o, and it's unbelievable. I still haven't created a DSPy joke generator that beats it, but that took me days of effort — I just love comedy, so I put a lot into it. So I'd say I quite often use DSPy for the things I don't care that much about, which is 80-90% of my prompts. For Rally, we have really lovingly handcrafted prompts for querying the personas and for generating the personas, but we also have prompts for things like generating the title, and I don't care about that — I just use DSPy. Or I have a prompt for classifying the diversity of thought, and I don't even really know what diversity of thought means in a lot of cases.
So I just use DSPy for that, and I put cases where users have complained and sent them to me into my dataset. So that's how I think about it: build your classifiers in DSPy — it's amazing at that. Build your utility tasks in DSPy — it's amazing for those too, if you don't care that much about them. But for things like AI writing, I still hand-write. My Every articles are actually the only thing I really write myself anymore, because when something goes out to 100,000 people, it's actually worthwhile to spend a day on it — and it takes me a day. Who cares? Would my life really be better if I managed to get that down to a couple of hours? Maybe the quality wouldn't be as good. So for things you really care about, I think you can still beat DSPy — but for an increasing number of things, I just don't care. Yeah. So if I have a categorization task, would it be better than doing it by hand, or do you think hand-crafting would still win? I haven't met a classification task that DSPy didn't completely blow out of the water. Okay. So I should use it for core email classification. Yeah, I think so — I think that would be a really good place to start. The only thing there is: how do I train it for thousands of users? All users have different rules and different nuances — my prompt is dynamic, it changes per user — and I'm worried it will generalize to something that's only really good for some of the people. Yeah, to some degree.
I guess, if there is a coherent meta-task — if most of your users have similar preferences — then it will do a pretty good job of teasing out the pattern. Or maybe you break it up into steps, where one step is the generalized version and then a secondary pass handles the per-user things — but that's more costly, so think about it. Yeah, exactly. But what you could do, if you have enough data per user, is just have a DSPy program per user, right? I've seen that with fine-tuning: Writer did a really good job of this. They made their own AI models — a base model that was really good at writing — and then, because they were enterprise clients, they trained a custom model for each enterprise client on their brand values and tone of voice. Or you see AI image companies do this: they have a custom LoRA per brand. Yeah, I think that might be the key, and I am going to try that. How costly is it normally to run this, do you know? I'd say you basically don't have to worry too much if you're using GPT-4o mini or Claude Haiku or Gemini Flash — you're never going to cost yourself more than a couple of bucks, even with the heavy optimizers. Where it gets really dangerous and scary is when you're using anything big from Anthropic, like Sonnet and particularly Opus, or the big OpenAI models — that could be a few-hundred-dollar mistake. I actually did a big project for a client — 5,000 AI personas — and we were just using GPT-4o, but I ran this big optimization and my credit card got declined, because we'd spent like $900 in an hour. So I had to ring up my client and ask, "Do you have a credit card I could use to continue?" Yeah, that's pretty funny. Yeah, this was early, early on.
So that was okay, but our product went down because we couldn't afford any more tokens. Again, very fun. But yeah, my margins are not that big — my product is $15 per month and I already spend around $2.50 per user per month on inference, so a dollar is important. Yeah. So you might want to do more of a meta-task, then: if you can optimize the prompt once and it gets 50% better for all users, that's better — you can amortize those costs over the whole user base. Yeah. At least I need to experiment and see whether it's worth a dollar or two to run it once. To some degree, as well, you can actually really bring down your running costs with DSPy. The optimization is an investment, but I've seen it happen where you take GPT-4.5, because it's good at creative writing, use it as the teacher model — or use it to generate a synthetic dataset — and you get mini to be almost as good, like 80-90% as good as 4.5, for that task. So distillation, I think, is one of the two major wins: classification and distillation. I'm already on the cheap models anyway — I already use Flash 2.0 for classification. Yeah, Flash is great. Unbelievable. I hope Google keeps subsidizing it. Cool. This is amazing, thank you so much. It'll be good to get the recording afterwards as well — we could turn this into... maybe I don't have to write my next Every column now. Yeah, I'm definitely going to try this out in the next week or two, because this is exactly the problem I'm trying to solve and I'm too lazy to work on the prompt.
Yeah, it's one of those things where, similar to vibe coding — when I first started using Copilot, I was checking every line, and now I'm like, oh, I'll just see if Claude Code can do it — I used to care a lot about my prompts, and now I'm like, ah, we'll just see if DSPy can do it. Yeah. Cool. All right, thanks guys. Thank you. Take care. See you.
