The Anatomy of a Small Scale Question Classification Engine by David Curran, Machine Learning Engineer, Openjaw Technologies

Another great presentation on chatbots with a focus on question classification and practical issues of deploying chatbots in China.

A great review of the approach to classifying questions so that a chatbot can determine a customer’s intent. Think of it like a spam filter that examines incoming emails and decides whether each one is spam or not spam, except that the classification is across a number of possible intents (ground truths) rather than just two classes.

This is an example of supervised learning, where a data set of possible questions is gathered from customer agents and classified by humans to define the ground truths (intents), such as “I need to change my flight”, “My luggage is lost”, or “I need to book a flight”. Check out “How to improve Natural Language Datasets” to learn more about the k-fold test and improving the quality of the training dataset.
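To make that concrete, a labelled training set for this kind of classifier is simply a list of utterances paired with intent labels. A minimal sketch in Python (the intent names here are illustrative placeholders, not labels from the talk):

```python
# A toy labelled dataset: each training example is an (utterance, intent) pair.
# Intent names such as "change_flight" are illustrative placeholders.
training_data = [
    ("I need to change my flight", "change_flight"),
    ("My luggage is lost", "lost_luggage"),
    ("I need to book a flight", "book_flight"),
    ("Can I move my booking to tomorrow?", "change_flight"),
    ("My bags never arrived at the carousel", "lost_luggage"),
]
```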

David highlights some important points about running chatbots in China: the difficulty of using IBM’s or Google’s machine learning platforms, and the relatively high cost of AI engines in China given the restricted competition, which leads many businesses to build their own AI engine. He also covers the unique aspects of the written Chinese language compared to Roman scripts, for example the lack of spaces between words.
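Because written Chinese has no spaces between words, the “separate out words” step needs an explicit segmenter rather than a simple whitespace split. A minimal sketch using the open-source jieba library (an assumed choice; the talk does not name a specific segmenter):

```python
# Chinese word segmentation with jieba (an assumed library choice; the talk
# does not name a specific segmenter). jieba.lcut returns a list of words.
import jieba

utterance = "我需要更改我的航班"  # "I need to change my flight"
tokens = jieba.lcut(utterance)
print(tokens)  # e.g. roughly ['我', '需要', '更改', '我', '的', '航班']
```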

David runs through the steps in creating the classifier (a minimal code sketch of the whole pipeline follows the list):

  • Read in the data: utterance, label;
  • Separate out the words;
  • Turn them into a machine-compatible format, e.g. word vectors;
  • Carry out manipulations: tf-idf, stopwords, bigrams, stemming, etc.;
  • Test the classifier: blind set, k-fold (validation set), which we covered in the presentation “How to improve Natural Language Datasets”.
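A minimal sketch of those five steps using pandas and scikit-learn (both assumed library choices, as are the file and column names; David’s own implementation may differ):

```python
# A minimal sketch of the pipeline steps above using scikit-learn.
# Library choices, file name and column names are assumptions, not from the talk.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# 1. Read in data: one utterance and one intent label per row.
data = pd.read_csv("utterances.csv")  # hypothetical file with 'utterance' and 'label' columns

# 2-4. Separate out words, turn them into vectors, and apply tf-idf,
#      stop-word removal and bigrams in one featurisation step.
#      (For Chinese text you would plug in a segmenter such as jieba as the
#      tokenizer and a Chinese stop-word list; stemming could be added via a
#      custom preprocessor.)
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])

# 5. Test the classifier with k-fold cross-validation (here k = 5).
scores = cross_val_score(pipeline, data["utterance"], data["label"], cv=5)
print("mean k-fold accuracy:", scores.mean())
```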

Tf-idf is term frequency–inverse document frequency, a numerical statistic intended to reflect how important a word is to a document within a collection of documents. For example, the word “iPhone” appearing five times in one passage, while rarely appearing in others, suggests that passage is about the iPhone.
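A small hedged illustration of this with scikit-learn’s TfidfVectorizer, using an invented three-document corpus:

```python
# A tiny tf-idf illustration (invented corpus): 'iphone' is frequent in the
# first document and rare elsewhere, so it receives a comparatively high
# weight there, while 'the', which appears in every document, is down-weighted.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the iphone screen on my iphone cracked and the iphone will not start",
    "the flight was delayed and the crew were helpful",
    "the hotel lost the booking for the weekend",
]

vectorizer = TfidfVectorizer()
weights = vectorizer.fit_transform(docs)

first_doc = dict(zip(vectorizer.get_feature_names_out(), weights.toarray()[0]))
print(first_doc["iphone"], first_doc["the"])  # 'iphone' outweighs 'the' in document 0
```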

David shows how support-vector machines (supervised learning models with associated learning algorithms that analyze data for classification and regression analysis) and a RASA pipeline can create a Small Scale Question Classification Engine, without giving all your data away to Google. Though in the West the cost of IBM’s and Google’s services is so low, and their engines so well trained, that it is hard to justify this approach outside China.
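For reference, a hedged sketch of such an SVM intent classifier in scikit-learn; RASA’s sklearn-based intent classifier wraps a similar support-vector machine, but the data and parameters below are illustrative rather than taken from the talk:

```python
# A hedged sketch of a small-scale SVM intent classifier in scikit-learn.
# The utterances, labels and parameters are illustrative, not from the talk.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

utterances = [
    "I need to change my flight",
    "my luggage is lost",
    "I need to book a flight",
    "please rebook me on a later flight",
    "my bag did not arrive",
]
labels = ["change_flight", "lost_luggage", "book_flight", "change_flight", "lost_luggage"]

# Featurise with tf-idf (unigrams and bigrams) and classify with a linear SVM.
svm_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("svm", LinearSVC()),
])
svm_pipeline.fit(utterances, labels)

print(svm_pipeline.predict(["can I move my flight to Friday?"]))
```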