A few weeks ago I made an automated solution that processes emails, pulls out any actions it thinks are for me, and adds them to my task management system. It worked pretty well when I tested it on a few dozen emails, and I was pleased with the results. I then sent it a blank email and it assigned me two tasks: “Hire a web developer” and “Agree scope for new trouser menu”. I no longer use this automation.
To be fair, in this case the problem was probably that the model’s temperature was set too high. Perhaps, with a bit of tweaking, I could have got better results. But what if I hadn’t? What if I’d ended up wasting a day scoping a new trouser menu? Would that mean AI wasn’t a useful technology after all?
Obviously not. The problem here is that I was testing the technology, not the concept. The temptation to do this is high, mostly because it’s fun, but it leads to inconclusive and non-actionable results. Below I’ll set out the steps for creating a meaningful test of this new technology, not just one that proves “it sort of works like everyone says it does”. I’ll draw on our experience carrying out these PoCs for customers both here in Europe and in Africa.
Let’s imagine a company that sells many different types of widgets to large companies with multiple premises.
A solution idea has inputs and outputs. It has an overall purpose. It achieves something. Making these choices gives you something to test. This testing will be useful even if you never take the solution a step further. The process of whiteboarding a solution will flush out dependencies and risks. This will help us decide what needs to be proved.
In our example we might decide that a solution that supports the logistics team would be a huge win. The new system will look at orders placed and the locations of the required widgets, produce all the necessary import and export documentation and shipping manifests, and submit them to the correct authorities. It will also send a summary of all this activity to the client and a summary of shipping costs to the finance department.
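To make that concrete, here’s a minimal sketch in Python of what the inputs and outputs of such a system might look like. All the names and fields below (OrderLine, StockLocation, ShippingPlan and so on) are hypothetical - the point is simply that pinning them down gives you something definite to test.

```python
from dataclasses import dataclass

# Hypothetical inputs: what the logistics solution would consume.
@dataclass
class OrderLine:
    widget_sku: str
    quantity: int
    destination_site: str      # one of the customer's premises

@dataclass
class StockLocation:
    widget_sku: str
    warehouse: str
    country: str
    quantity_on_hand: int

# Hypothetical outputs: what we expect the solution to produce.
@dataclass
class ShippingPlan:
    export_documents: list[str]   # references to generated documents
    import_documents: list[str]
    manifest_id: str
    estimated_cost: float
    estimated_days_to_deliver: int
```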
Let’s say the big benefits in our scenario would be quickly creating documentation and manifests that are accurate, and let’s say the biggest risk is failing to balance customer waiting time against the cost of shipping. We now have something worth proving. All the notifying of the customer and the finance department can be ignored - we already know how to do that.
It’s worth noting that the PoC we’re imagining here may require us to build some non-AI technology. Maybe we don’t need clever technology to work out the total shipping price and the customer waiting time, and to check those against some agreed parameters. This simple tech can act as a safety valve, and we can prove that it will prevent, or at least flag up, overspend or poor customer experience.
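As a rough illustration, the safety valve really can be this simple. Here’s a sketch in Python with made-up thresholds, assuming the proposed plan already carries a cost estimate and a delivery estimate:

```python
# A minimal, non-AI safety valve: flag any plan that breaches the agreed
# limits, regardless of how clever the system that produced it is.
MAX_SHIPPING_COST = 5_000.00   # hypothetical per-order cost ceiling
MAX_WAIT_DAYS = 14             # hypothetical customer waiting-time limit

def check_plan(estimated_cost: float, estimated_days: int) -> list[str]:
    """Return the reasons to block a plan; an empty list means it passes."""
    problems = []
    if estimated_cost > MAX_SHIPPING_COST:
        problems.append(f"cost {estimated_cost:.2f} exceeds limit {MAX_SHIPPING_COST:.2f}")
    if estimated_days > MAX_WAIT_DAYS:
        problems.append(f"delivery in {estimated_days} days exceeds limit of {MAX_WAIT_DAYS}")
    return problems

# Example: a plan the AI proposed
issues = check_plan(estimated_cost=7200.0, estimated_days=10)
if issues:
    print("Plan flagged for human review:", issues)
```

The design point is that this check sits entirely outside the AI, so we can reason about it and prove it works no matter how the clever part behaves.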
We now have to think about data. One thing we should be proving is not just that we can turn data into learning, but that there’s enough data for this to be meaningful in production. We also have to think about data quality (which affects the cost of preparing it for learning) and data variance (sometimes we think we have a lot of data, when in fact we just have the same pieces of data over and over again!).
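One cheap way to sanity-check variance before committing is to measure how much of the data is effectively the same record repeated. A rough sketch, assuming historical orders can be reduced to comparable tuples (the SKUs and routes below are invented):

```python
from collections import Counter

def variance_report(records: list[tuple]) -> None:
    """Crude duplication check: many rows but few distinct patterns
    suggests the dataset is less informative than it looks."""
    counts = Counter(records)
    total = len(records)
    distinct = len(counts)
    print(f"{total} records, {distinct} distinct ({distinct / total:.0%})")
    print("Most repeated patterns:", counts.most_common(3))

# Hypothetical example: (sku, origin, destination) tuples from past orders
orders = (
    [("W-100", "DE", "UK")] * 800
    + [("W-200", "FR", "NG")] * 150
    + [("W-300", "ES", "KE")] * 50
)
variance_report(orders)
```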
At this point we should choose our KPIs. What metrics are we going to use to assess the AI’s performance, and what metrics are we going to use to measure the overall PoC?
We should test performance as we go, so we can see how the data we have is affecting outcomes. Document everything we try: this information could be useful for other projects and helps us build domain-specific understanding. Sometimes in this part of the process we may decide that a different KPI is a better measure of performance. That’s OK - you can add the new KPI in. The only rule is that you can’t stop tracking a KPI from an earlier step just because you’ve identified a better one later on.
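A tiny tracker like the sketch below can enforce that rule: new KPIs can be added part-way through, but every run must still report everything we previously committed to tracking. The KPI names are hypothetical examples from our widget scenario.

```python
class KpiTracker:
    """Record KPI values per test run; allow new KPIs, forbid dropping old ones."""

    def __init__(self, kpis: set[str]):
        self.kpis = set(kpis)
        self.runs: list[dict[str, float]] = []

    def add_kpi(self, name: str) -> None:
        # Fine to add a better metric part-way through the PoC...
        self.kpis.add(name)

    def record_run(self, values: dict[str, float]) -> None:
        # ...but every previously tracked KPI must still be reported.
        missing = self.kpis - values.keys()
        if missing:
            raise ValueError(f"Run is missing KPIs we committed to tracking: {missing}")
        self.runs.append(values)

tracker = KpiTracker({"doc_error_rate", "avg_shipping_cost"})
tracker.record_run({"doc_error_rate": 0.08, "avg_shipping_cost": 412.0})
tracker.add_kpi("customer_wait_days")
tracker.record_run({"doc_error_rate": 0.05, "avg_shipping_cost": 398.0, "customer_wait_days": 9})
```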
It’s really important we try to break it at this point. In our scenario we should try to get the system to produce loss-making suggestions that get through the safety system. Let’s build scenarios that are way outside the training data - how well are those dealt with? If we’re using a generative AI, how can we get it to produce some ‘Scope trouser menu’ gibberish that still gets past the safety valve? This bit is the most fun!
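In code terms, this stage is just feeding deliberately hostile inputs through the pipeline and counting how many produce plans the safety valve still accepts. A rough sketch, with the real PoC pipeline replaced by a random stand-in so it runs on its own:

```python
import random

MAX_SHIPPING_COST = 5_000.00
MAX_WAIT_DAYS = 14

def safety_valve_passes(cost: float, days: int) -> bool:
    return cost <= MAX_SHIPPING_COST and days <= MAX_WAIT_DAYS

def hostile_orders(n: int):
    """Generate orders deliberately unlike the training data:
    empty SKUs, impossible quantities, unknown destinations."""
    for _ in range(n):
        yield {
            "widget_sku": random.choice(["", "???", "W-999999"]),
            "quantity": random.choice([0, -5, 10**9]),
            "destination_site": random.choice(["", "Atlantis", "N/A"]),
        }

escaped = 0
for order in hostile_orders(1000):
    # fake_ai_plan stands in for whatever the real system would propose;
    # in a real test this would call the actual PoC pipeline with `order`.
    fake_ai_plan = {"cost": random.uniform(0, 20_000), "days": random.randint(1, 60)}
    if safety_valve_passes(fake_ai_plan["cost"], fake_ai_plan["days"]):
        escaped += 1  # a plan built from garbage input got past the valve

print(f"{escaped} of 1000 garbage orders produced plans the safety valve accepted")
```

Any garbage order that still produces an accepted plan is exactly the kind of case worth digging into.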
This one’s pretty obvious… But it’s worth mentioning that we should consider the implications of the new solution, not just its performance. Removing the whole contracts and logistics department may seem appealing, but the corresponding loss of knowledge and experience could be disastrous. We may envisage keeping a smaller team to retain that specialist knowledge, but how often will it actually be needed? What will they do between instances of using that knowledge? Will anyone want to work in that kind of environment? We didn’t answer those kinds of questions earlier; maybe we should have.
I hope this helps!