A few days ago I had the opportunity to take part in a panel at the UTTR conference. The topic of the panel was “scoring metrics for chatbots”. This is an interesting topic, first and foremost because there is no straightforward answer.

In other text-related fields, like information retrieval, performance can be measured through metrics such as precision and recall. However, things are not as easy with chatbots. The reason is that chatbots serve many different purposes, so the goal varies from business to business. Some example chatbot use cases are:

  1. Replacement for customer support.
  2. Personal assistants.
  3. Intelligent interfaces for more standard functionalities (e.g. checking the news or the weather).
  4. Replacement for professional services such as medical or financial advice.
  5. Sales tools, e.g. a chatbot salesman or a chatbot that makes smart recommendations.

If your goal is to replace a professional (e.g. a doctor), then you should use metrics that measure how closely the chatbot matches a human’s skills. In the doctor example, accuracy of prediction might be one such metric. Another good metric in this case would be the length of the conversation: the shorter the conversation, the more efficient the chatbot.

However, in other cases the length of the conversation might not be related to performance in any straightforward way. Let’s say that you built a chatbot that recommends clothes. A long conversation might mean that the user is engaged with the chatbot and wants to chat, or it might mean that the user is confused. Similarly, a short conversation might mean that the user lost interest, or it might mean that the chatbot is very efficient at making good recommendations.

Here are some metrics which you might want to consider:

  1. Length of conversation: Do you want a chatbot that engages with the user? Then the longer the better. Do you want a chatbot that simply delivers a service? Then the shorter the better.
  2. Confusion triggers: These are expressions the user might use to show that they are confused, e.g. “I don’t understand.” or “I want to restart the dialogue.”
  3. Sentiment analysis of the user’s dialogue: Angry words can be a clear sign that something is going wrong.
  4. Business-related metrics: E.g. if you are using your chatbot to make product recommendations to your customers, then the sales that took place because of the chatbot is one such metric.
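
To make a few of these concrete, here is a minimal Python sketch of how conversation length, confusion triggers and a simple business metric could be computed from logged conversations. Everything in it is an assumption made for illustration: the `Conversation` class, the `led_to_sale` flag and the `CONFUSION_TRIGGERS` phrase list are hypothetical, and a real system would need its own logging format and a trigger list tuned to its users.

```python
from dataclasses import dataclass

# Hypothetical trigger phrases (assumption); tune these for your own users.
CONFUSION_TRIGGERS = ("i don't understand", "restart the dialogue")


@dataclass
class Conversation:
    user_turns: list[str]      # what the user typed, in order
    led_to_sale: bool = False  # hypothetical business outcome for this chat


def conversation_length(conv: Conversation) -> int:
    """Length of the conversation, measured in user turns."""
    return len(conv.user_turns)


def confusion_count(conv: Conversation) -> int:
    """Number of user turns that contain at least one trigger phrase."""
    count = 0
    for turn in conv.user_turns:
        # Lowercase and normalise curly apostrophes before matching.
        text = turn.lower().replace("\u2019", "'")
        if any(trigger in text for trigger in CONFUSION_TRIGGERS):
            count += 1
    return count


def conversion_rate(convs: list[Conversation]) -> float:
    """Share of conversations that ended in a sale (0.0 if there are none)."""
    if not convs:
        return 0.0
    return sum(c.led_to_sale for c in convs) / len(convs)


if __name__ == "__main__":
    logs = [
        Conversation(["Hi", "I don't understand", "Show me jackets"], led_to_sale=True),
        Conversation(["Hello", "I want to restart the dialogue"]),
    ]
    for conv in logs:
        print(conversation_length(conv), confusion_count(conv))
    print(conversion_rate(logs))
```

Whether a high or low value is good still depends on the use case, so numbers like these are best read together rather than one at a time. Sentiment analysis is left out of the sketch because it usually relies on a separate model or lexicon.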

These are just a few of the metrics you might want to take into account when using a chatbot. I expect that over time, metrics will become more standardised for the different scenarios. So one day we might get the chatbot equivalent of precision and recall in information retrieval.