How to improve Natural Language Datasets by David Curran

David Curran, Machine Learning Engineer at OpenJaw Technologies.

You can follow along with this clean-up of a natural language dataset here: https://github.com/cavedave/datacleanup.

In many chatbot projects, the basics of good input data are skipped in favour of the more ‘intellectual’ pursuit of funky machine learning models and algorithms. It’s the old adage: garbage in, garbage out. David's advice here is important, and he breaks it down into 15 steps:

  • Load Libraries
  • Label Data
  • Find Topics, Find Verbs
  • Load Data
  • Language specific stuff (spelling, accents, segment words, etc.)
  • Run KFold test
  • Save Data to show improvements
  • Graph questions, Intents and Accuracy
  • Find Duplicates
  • Find Wrongly labeled
  • Find Double Categories
  • Find Nonsense
  • Find Bad intents
  • Look at confusion Matrix
  • Label new data

David goes through a Jupyter notebook showing you how to fix errors in your training data for a bunny (baby rabbit) chatbot.

This is a great walk-through on machine learning. David shows the K-fold test: training on 90% of the data and testing on the other 10% to determine accuracy, repeated so that each 10% segment takes a turn as the test set. From these runs comes the K-fold test result, which in his example is 68%; normally the target for launch is 80%.
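To make the K-fold idea concrete, here is a minimal sketch in plain Python. It is not the code from David's notebook: the word-counting classifier is a toy stand-in for a real intent model, and the function names (k_fold_indices, train, predict, k_fold_accuracy) and the sample questions are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle the item indices and deal them into k roughly equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train(examples):
    """Toy stand-in for a real model: count the words seen under each intent."""
    word_counts = defaultdict(Counter)
    for text, intent in examples:
        word_counts[intent].update(text.lower().split())
    return word_counts


def predict(model, text):
    """Pick the intent whose training words overlap most with the question."""
    words = text.lower().split()
    scores = {intent: sum(counts[w] for w in words)
              for intent, counts in model.items()}
    # Sorting the intent names first makes tie-breaking deterministic.
    return max(sorted(scores), key=scores.get)


def k_fold_accuracy(examples, k=10, seed=0):
    """Hold out each fold once, train on the rest, and score the held-out part."""
    correct = 0
    for held_out in k_fold_indices(len(examples), k, seed):
        held = set(held_out)
        training = [ex for i, ex in enumerate(examples) if i not in held]
        model = train(training)
        correct += sum(predict(model, examples[i][0]) == examples[i][1]
                       for i in held_out)
    return correct / len(examples)


# Hypothetical labeled questions for a bunny chatbot: (text, intent) pairs.
examples = (
    [(f"what food and hay for bunny number {i}", "feeding") for i in range(10)]
    + [(f"which hutch or cage suits rabbit {i}", "housing") for i in range(10)]
)
print(f"K-fold accuracy: {k_fold_accuracy(examples, k=10):.0%}")
```

With k=10, each fold holds out roughly 10% of the data, which matches the 90%/10% split described above. In a real project the train and predict functions would be replaced by the actual intent classifier.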

He then goes through the data to find errors such as duplicates, answers that are high confidence but wrong, cases where the second-ranked answer also has high confidence, and cases where there are no high-confidence answers, and he explains the typical causes of each. Discovering a mislabel can improve the bot's performance as much as the latest and greatest algorithm. Another typical error is classifying an input as multiple questions when it's one question with two related parts. By cleaning up the data he shows an improvement to 73% over the course of the walk-through.
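The checks described above can be sketched roughly as follows. This is an illustrative outline rather than David's notebook code: the function names, the category labels, and the thresholds (high, second_high, low) are all assumptions, and the scores dictionary stands in for whatever per-intent confidences your classifier returns.

```python
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so near-identical questions match."""
    return " ".join(text.lower().split())


def find_duplicates(examples):
    """Return normalized questions that occur more than once, with their labels."""
    seen = defaultdict(list)
    for text, intent in examples:
        seen[normalize(text)].append(intent)
    return {text: intents for text, intents in seen.items() if len(intents) > 1}


def triage(label, scores, high=0.8, second_high=0.35, low=0.3):
    """Flag one held-out example by its confidence pattern.

    `scores` maps each intent to a confidence and needs at least two intents.
    The three thresholds are illustrative guesses, not values from the talk.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (top_intent, top_score), (_, second_score) = ranked[0], ranked[1]
    if top_score < low:
        return "no confident answer"    # possible nonsense or missing intent
    if top_intent != label and top_score >= high:
        return "confident but wrong"    # possible mislabel
    if second_score >= second_high:
        return "two strong answers"     # possible double category
    return "ok"


# Hypothetical examples: the same question labeled two different ways.
examples = [
    ("What do bunnies eat?", "feeding"),
    ("what do  bunnies eat?", "diet"),
    ("How big should the hutch be?", "housing"),
]
print(find_duplicates(examples))
print(triage("feeding", {"feeding": 0.05, "housing": 0.9, "health": 0.05}))
```

A duplicate that carries two different labels, as in the example data, is often a sign that two intents overlap and may need merging or relabeling.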

This is a very approachable review, not only of improving natural language datasets but also of the basics of building a bot.