Transcript
So we're going to talk about a specific pattern you can use when you're building AI products, called the evaluator-optimizer pattern. I'm going to explain how it works, and I'm going to actually run through some code with you and show you what I did. I used this to get an AI to tell funny jokes, so you can judge whether you think the jokes are funny or not. Typically ChatGPT jokes are not great: they're dad jokes, and I am a dad, so I appreciate them, but most people don't. I wanted to use that as an example of a fuzzy evaluation. There are a lot of creative tasks we're asking AI to do, and it's very hard to formally define whether it's doing a good job or not. This is a really good pattern for that. So, who am I? I'm Mike Taylor. This is an AI-generated headshot of me. The book I wrote last year was Prompt Engineering for Generative AI, with James Phoenix, who's unfortunately sick today. It still holds up, thankfully, despite everything happening in AI. People predicted prompt engineering would be dead by now because the AI models get smarter and smarter, but it turns out that the smarter they get, the more the prompt matters. You really need to make sure you understand what goes into the context, and now we've rebranded prompt engineering as context engineering, because why not, we want to sound cool. It's largely the same sort of thing with a few differences, and I'm going to walk through those as we go. I do have a copy of the book here, and we're going to do questions at the end of both talks: me and Leo, who's talking next, will stand up here, and whoever asks the best question will get a free copy. Okay, and yes, I'll use an LLM as a judge for that. All right.
So, what is context engineering? Some of you might have heard about it, some of you might not have heard about it at all, and others of you are probably sick of it already; it's not evenly distributed across the audience. Tobi Lütke from Shopify has talked about context engineering, and Andrej Karpathy has talked about it as well, but the person I think coined it was Ankur Goyal of Braintrust, one of the evaluation tools. He said: as the models get more powerful, I find myself focusing more effort on context engineering, which is the task of bringing the right information, in the right format, to the LLM. I think that's a really good, self-explanatory definition. But bringing the right information to the LLM at the right time is really difficult, and the first mistake most people make is they don't give it enough context. They don't give it enough information to actually do the job, the sort of information you would give to an intern on their first day. That's usually pretty easily solvable. Once you understand that LLMs aren't magic, that they need context to understand how your organization does that task, or how your application should do that task, then it's usually straightforward to identify that the intern would have access to Gmail, the intern would have access to this document, or maybe we need to write up a brief on how to do this task. That's the very beginning of prompt engineering and context engineering: not having enough context. But you very quickly run into the much worse problem of too much information. There are all sorts of crazy things that can happen when you have too much in your context window.
On purpose, I've stuffed this slide with information so you can experience what it's like to have a full context window. These are all the different problems I've seen with my clients and with my own applications. Context poisoning: that's when the model hallucinates or gives you a weird answer because there's some incorrect information in the prompt itself and it's going off of that. That can happen quite easily, especially when you give tools access via MCP servers, where they can pull in information you don't necessarily need for the task. Context distraction: as the prompt gets really long, the model starts to pay less attention to some parts and more to others. LLMs are really good at deciding what to pay attention to, but even the best LLMs still make mistakes, and they tend to pay more attention to the beginning and the end of the prompt, while the stuff in the middle gets lost. That's the "lost in the middle" problem. There's also context clash: sometimes you'll have conflicting information, and this happens with interns too. One boss says this, the other boss says that, the official documents say something else, and that happens with LLMs as well. A lot of the problems we face building effective AI agents, we also face just training people. Before I got into AI I ran a marketing agency, and trust me, it's exactly the same problems coming up again and again, but this time with AI employees that never sleep, which is quite useful. Then you also have problems like prompt injection. Sometimes you want to control what type of information goes into the LLM, because if it's accessible to the AI agent, then anyone using that system could get that information by tricking the agent into oversharing. That's a whole big topic on its own.
But to be honest, the main reason I don't like having too much information in the prompt is that it costs too much money. AI is really efficient compared to humans, but once you're running thousands of AI agents, or one agent is doing thousands of tasks, the cost adds up. I have a small startup, just two and a half people right now, and the highest-paid person on my team is OpenAI. So cost really becomes important, and then latency as well, for applications that need to be real time: the longer the prompt, the longer it takes to get a response, because the model has to process the whole thing. Cool. So, how do we solve this? DSPy will solve all of your problems. Now, I'm not affiliated with DSPy. Actually, I really hated the framework when I first encountered it, because there was a lot of publicity when it came out saying prompt engineering is dead, and as someone writing a book on prompt engineering, I was really hoping it wasn't dead. I didn't take it too personally, though. I dove into the framework, and once I got through the weird documentation and the overly academic tutorials, I started to realize how powerful it is. Weirdly enough, they designed the perfect library for context engineering two or three years before context engineering became a term. Tobi Lütke, the co-founder of Shopify, confirmed this recently as well: he said DSPy is what he personally uses for context engineering. So it's having a bit of a moment now because of that. I've been messing around with it for a while, I've moved all of my scripting over to DSPy, and I'm even putting some projects into production. So I think it's a good time to learn.
But you can do this in any framework, LangChain or whatever; I don't want to hate on the other frameworks, and you can do it without a framework at all. The code is just easier for you to see in DSPy, and there's less boilerplate to write. Cool. So the pattern we're going to use is the evaluator-optimizer pattern. I use this for a lot of different things, because a lot of the tasks I'm doing, especially coming from a marketing background, are creative tasks: write a blog post or a social media post, or read this report and tell me what decision to make. These are creative tasks that are pretty fuzzy, where you don't always have a right answer. One of the nice ways to solve that problem is to train a classifier first. We're going to use the example of telling jokes: we're going to have an AI tell jokes. The first thing you need to do is train an audience-member AI that can tell whether a joke is funny or not. It's much easier to build a collection of funny jokes and a collection of not-funny jokes, give the funny jokes a one and the not-funny jokes a zero, and then you have a dataset pretty quickly that you can train the judge on. Once you have the judge, then you can train the real task you're actually trying to train. So the LLM-call evaluator is trained on the jokes that you think are funny and not funny, and then you use that as the evaluation metric when you're optimizing your original task, which is telling the joke in the first place. Cool. Okay, so we're going to dive into code here, and there's a lot of it. I'm going to zoom in a little so you can see. Is that good? All right. Don't worry if you don't know how to code.
I think most of the room skews technical, but just close your eyes if it's scaring you. It's not that scary, I promise; I'm going to explain each step. I'm not going to run any of this live, by the way, because a DSPy optimization can take anywhere from ten to thirty minutes, sometimes an hour. My longest-running optimization was actually 16 hours, and it cost $800. That was paid for by a client, so don't worry, I didn't get my credit card out. One of the things I like about DSPy is that it abstracts away a lot of the complexity of calling AI models and getting the responses back. This is how you set up the model, GPT-5 mini here: I set the parameters, I call it, and it literally just sends the prompt. Super simple. If I wanted to use Google or whatever, I could put Gemini in here and change to whatever model I want. Upgrading to a new model is literally one line of code, which is really nice; it makes the programs more portable. There are always new models coming out, and quite often you want to move the program down to a cheaper model once you've optimized it, and that's something DSPy does quite well. Cool, that's my little pitch. Now, I don't use the standard DSPy structure. If you want to see what that looks like, it's like this, which is weird, and it took me a long time to get over the weirdness of it. Your prompt is the docstring, and these are your inputs for the prompt, the pieces of context you're sending it. In this case, this is a classifier.
It's going to classify the sentiment of a given sentence: you pass in the sentence as an input field, the sentiment is an output field with three possible values, and it also gives a confidence score. So that's a standard DSPy program. I don't use that style, because I really hate it. What I do instead: deep in the DSPy codebase there's a convenience method called make_signature, and it's the same stuff, but it feels a bit more Pythonic to me. I write my instructions as a string and then pass in the different fields; you can see the input field and output field here. What this does is create a prompt for me. I don't have to write a big long list of instructions; I give it a small amount of instruction, and it sets up things like the structured outputs we need, and it handles output parsing for me. It pulls the outputs back and puts them in the structure I specified, so my output fields are right there on the result object, and I'll show you that. You can see that when I output the joke, I don't have to go digging. I don't know if you've done this: you look at the OpenAI documentation and go, hold on, how do I get the output again? It's nested somewhere inside an array, you get the first thing in the array, and I think it's called output_text now in the response. You don't have to worry about any of that, because I told it ahead of time that I have one output field here, called joke, so when I get the response back I just go output.joke, which is really nice. Cool. So, I told you you need a dataset.
You can actually use DSPy just as a framework; you don't have to do optimization, because it formats that prompt quite nicely for you, and you'll see an example of a DSPy prompt later on. You could use it without a dataset and without an optimizer. But you get a lot of the power from being able to optimize. The reason I pushed through all the weird documentation and structure, reading all this stuff and trying to understand the codebase, is that the optimizers can give you a reliable 10 to 100% uplift in performance, and you're going to see that here today. So here I've gathered a bunch of funny jokes from different comedians, and that's just in an array, and then I have a bunch of unfunny jokes, which were generated by ChatGPT. All I've done is set them up as DSPy Example objects and split them into three buckets. I have the training data, which is what I'm going to optimize on; the evaluation data, which is how it tests whether it did a good job during training; and the development set, which is for the very end, once I've built the model, to check that it generalizes to things it hasn't seen before. That's why we hold that back. You can see here we have 152 jokes, with 51 to test against, and this is the format. So, as I said before, how do you evaluate that? When you're doing a DSPy optimization it might call the API a thousand times. Are you really going to sit there manually and say, yeah, that's a funny one, that's not a funny one, that's a funny one,
that's kind of funny? That's really painful. So instead, we're going to train a judge to do it for us. Again, same structure, except we input the topic of the joke and the joke itself, and we check whether it's funny or not. I'm using make_signature here again, and I'm calling this one audience. And you can see our judge isn't very good yet. "Why did the MacBook look sad? Because it had too many problems." I mean, that is kind of funny, and I did get a laugh, but the judge said this is a classic family-friendly pun that plays on the double meaning of problems, and it thought it was funny, and I said no, that's not funny, that was in the unfunny group. I wanted it to be a bit edgier than that. So you set up a metric. In this case the metric is very easy: we have a dataset of funny and not-funny jokes, and we know whether each was a one or a zero, so we literally just check whether the program we made in DSPy outputs the same as what we had in our dataset. If it was a one, did it say one? And you can see the result here: it got 47% accuracy. It's a coin flip; it's not very good as a judge. So we need to train our judge. You can see the ones it got wrong: these ones were correct, and these ones weren't, like this one where the label was false and it predicted true. Cool. So, the optimization: everything we've done so far was literally just to get to this point. There are a few different types of optimization you can do, and broadly they fall into two buckets. One is adding examples of the task being done correctly to the prompt; that's called few-shot examples in the literature.
I haven't run a big optimizer here; I've literally just used LabeledFewShot, which takes eight examples from your dataset and adds them into your prompt, formatted the right way as user and assistant turns. Even just doing that, which took a second, got us to 82% accuracy straight away, so you can see the power. Yeah, go ahead. [Audience: eight examples?] Yes, this is eight examples, k equals 8. It's literally just taken eight examples and added them in, so you can see how quickly DSPy went from roughly 50% to 80%. Normally you'd have to write custom code to add all those examples in and choose which to use. But the one we're all here for, the one that's really cool, is actually rewriting the instructions in the prompt. This is what prompt engineers do quite often: you rewrite the instructions, you see where it's going wrong, you look at the data and go, it's getting this wrong, it's getting that wrong, and then you write something into the prompt manually. Well, you don't have to do that anymore, which is nice. AI can do that pretty well, and it can look at thousands of examples in the time it takes you to do one. It can iterate toward better instructions by trying lots of different strategies. This is the GEPA optimizer. It's the newest one, it's very good, and I use it for everything now. I've only run this with one optimization loop; you could run it for ten, but that would take about an hour and cost maybe ten or twenty dollars, so it's not cheap. I'm using GPT-5 here, a bigger model, to train a smaller model.
So I'm using GPT-5 mini to tell the jokes, but I'm using GPT-5 to teach GPT-5 mini how to tell jokes. Once my optimization is done and I've invested that money and time, I can run my whole program on GPT-5 mini and it's pretty good. Let's see how that works. You can see some of the prompts it's writing here, and look at this: it's human level. It's giving decision guidelines, and this isn't me writing; this is what GEPA has figured out. It's given some concrete examples, some instructions and rules. It says things like keep the reasoning tightly tied to joke structure. It has basically written a guide on how to tell funny jokes. What is a funny joke, what isn't? I don't have a strong opinion on that. All I had to do was choose which jokes I thought were funny and which I thought were not, and it has learned what my preferences are without me having to state them, which makes building these applications a lot quicker. Cool. You can see it's done a lot here, and when it comes out the other end, it now has 98% accuracy. It agrees with my judgment 98% of the time. We went from roughly 50%, to 80%, to 98%, and that's really powerful. This took about half an hour; actually, about 20 minutes. [Audience: how many examples did you use for training?] It's 51 to test, but it was just over a hundred for training. And it chooses which ones to use during training, so it hops around; there's this crazy evolutionary algorithm behind it, written by people much smarter than me, but I get to benefit.
Now we have a judge. We have to make our new evaluation metric, but we can use the judge to do the evaluation. This time we want to evaluate whether the joke is funny, so we're using our optimized audience, the program we trained before that has 98% agreement with me. Now I trust it to go and rate a thousand examples on my behalf, because it agrees with me 98% of the time. Perfect. So again we're going to get the response and return whether it was funny or not, but we're also going to return feedback, because of the way the GEPA algorithm works: it takes into account the structure of the program, and it actually gets to look at the dataset and see what it got wrong, but you can also give it feedback in the evaluation metric, and it will take that feedback into account as well. In this case I'm using the reasoning of the judge: the judge thinks before making its decision, and I'm passing that thinking process along, giving the optimizer a look into the mind of the judge. That makes it much quicker to hill-climb to a good outcome. Cool. So you can see here the metric. Oh, sorry, not 100% accuracy; it's got 91% accuracy. What we've done, just like with the other one, is add some few-shot examples to the comedian, so now the comedian has examples of what is a good joke and what is a bad joke. Then we've run the same optimization process again with GEPA, but this time on the comedian, using the judge as the metric. You can see we go through here, and now it tells jokes that the judge is, say, 95% happy with, and you can look at some of these jokes.
We're still seeing some errors. One persistent thing is that it says "I can't write in this comedian's voice, but here's the joke." That's something we haven't specifically put into the evaluator, but it's the next thing we would do: if it says this, then give that feedback. Exactly. So optimization is not a magic process, but I think it's pretty good. Let's see: this is telling a joke about AI engineering in the style of Ricky Gervais. You can see it has that kind of refusal at the beginning, which we'd want to deal with, but: "AI engineering is teaching machines to be human, and it's gone too well. They now procrastinate, make excuses, and ask for a raise." That's much funnier, I think. So hopefully that helps you understand the process. It's iterative; I only optimized this for one turn, and normally I'd do, say, ten. But I wanted to help you understand how the structure works. Just one more thing: I'm going to show you what the DSPy prompt looks like and how to export it and use it, and then we'll get Leo on and do questions at the end, if that makes sense. Cool. So, the DSPy prompts: they kind of hide them. They want you not to use the prompts directly afterwards; they want you to build the program in DSPy, which is something that also put me off.
I've been burned a lot of times with LangChain and a few other frameworks; I built my first product in LlamaIndex, and then LlamaIndex wasn't cool anymore, and I keep going back and forth between frameworks. One thing I'd say is a downside of DSPy is that they do make it a little harder to port your programs over. But because the optimizers are so powerful, I've now moved over to doing everything in DSPy, so fair enough. I do have a hack for getting the prompt out, though. You can investigate here: you can see what the prompt is, and you can see it wrote the input fields and so on. I actually got this from the founder of DSPy, because I was bugging him a lot and he relented. Basically it outputs the prompt in OpenAI format, so you can take it with you: here's the system role with its content, these are the few-shot examples it's added, and then the ending at the end. Just to get a sense of the system prompt, this is what it's written. It has turned your signature into input fields, so all of that formatting is consistent, and the model is told to expect a topic, a comedian, and a joke. It's using this weird structure of theirs, but you can actually change that: you can change the chat adapters they use, though it works pretty well out of the box. The real power is that you have a whole prompt here now, which says: you're writing a single funny joke given the topic; the goal is to deliver one concise, original joke clearly tied to the topic; if the comedian is specified, evoke only high-level characteristics of their style, and do not imitate their exact voice.
Keep it tight, it says: one to three sentences with a clear setup and punchline. Make it feel fresh; avoid stale, overused dad jokes. It's learned all of this without me telling it. I didn't tell it I don't want dad jokes; I just gave it a bunch of dad jokes and marked them as zero. That's really, really powerful. Now, think about taking that outside of joke telling, which is just a fun example, into other creative tasks: you now have the power to build a whole AI application that can learn your preferences, and all you have to do is say, I like this one, I don't like this one, I like this one, I don't like this one, and then you have a dataset. Yes, there's a lot of complicated setup, but that's hopefully the takeaway: it's worth going through this to use a framework like DSPy, which I think is really slept on because of all the weird documentation. Now you can build a whole AI application that writes itself, and you don't have to worry about writing those rules manually, or about what's conflicting or not conflicting, because DSPy will figure that out for you. It will take something out if it's not improving the evaluation score, and add something in if it is. So you can keep adding to your dataset, keep retraining, and then you have this kind of endless loop of improvement. So yeah, hopefully that all makes sense. I'll be around afterwards for questions; we'll do questions after Leo's talk, though. And remember, there's a free book if you ask a good question, so keep thinking. All right. Thanks, guys. [Applause]