How to improve Natural Language Datasets by David Curran - Blog @ Telecom Application Developer Summit (TADS)

How to improve Natural Language Datasets.
David Curran, Machine Learning Engineer at OpenJaw Technologies.

You can follow along with this clean-up of the natural language dataset here: https://github.com/cavedave/datacleanup.

In many chatbot projects the basics of good input data are skipped over, with the focus going instead to the ‘intellectual’ pursuit of funky machine learning models and algorithms. It’s the old adage: garbage in, garbage out. David’s advice is important, and he breaks it down into 15 steps:

Load Libraries
Label Data
Find Topics, Find Verbs
Load Data
Language specific stuff (spelling, accents, segment words, etc.)
Run KFold test
Save Data to show improvements
Graph questions, Intents and Accuracy
Find Duplicates
Find Wrongly labeled
Find Double Categories
Find Nonsense
Find Bad intents
Look at confusion Matrix
Label new data

David goes through a Jupyter notebook showing you how to fix errors in your training data for a bunny (baby rabbit) chatbot.

This is a great walkthrough of machine learning. David shows the k-fold test: train on 90% of the data and test on the other 10% to determine accuracy, then repeat so that each of the ten 10% segments is used once for testing. From these runs comes the k-fold test result, which in his example is 68%; normally the target for launch is 80%.
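The k-fold procedure described above can be sketched in plain Python. This is only an illustration, not the code from David's notebook: the word-vote classifier, the function names and the sample questions are all made up for the example, and the per-fold results are combined here by counting correct answers over every test item, which is one common way to report a single figure.

```python
import random
from collections import Counter, defaultdict


def k_fold_splits(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def train_word_vote(examples):
    """Toy stand-in for a real intent model (an assumption for this sketch):
    each word votes for the intents it was seen with during training."""
    word_counts = defaultdict(Counter)
    for text, intent in examples:
        for word in text.lower().split():
            word_counts[word][intent] += 1
    # Fall back to the most common intent when no known word appears.
    fallback = Counter(intent for _, intent in examples).most_common(1)[0][0]

    def predict(text):
        votes = Counter()
        for word in text.lower().split():
            votes.update(word_counts.get(word, {}))
        return votes.most_common(1)[0][0] if votes else fallback

    return predict


def k_fold_accuracy(examples, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, repeat for every fold.
    Returns the fraction of all examples predicted correctly."""
    correct = 0
    for train_idx, test_idx in k_fold_splits(len(examples), k, seed):
        predict = train_word_vote([examples[i] for i in train_idx])
        correct += sum(predict(examples[i][0]) == examples[i][1] for i in test_idx)
    return correct / len(examples)


# Hypothetical bunny-bot training data, invented for this example.
examples = [("what do bunnies eat", "food")] * 10 + \
           [("where should my bunny sleep", "housing")] * 10
print(f"k-fold accuracy: {k_fold_accuracy(examples):.0%}")
```

With k=10 each example is used for testing exactly once and for training nine times, which is why the result is a fairer estimate than a single 90/10 split.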

He then goes through the data to find errors: duplicates, answers that are high confidence but wrong, cases where the second-ranked answer is also high confidence, and cases where there is no high-confidence answer. He explains the typical causes of such errors. Discovering a mislabel can improve the bot’s performance as much as the latest and greatest algorithm. Another typical cause is treating an input as multiple questions when it is really one question with two related parts. By cleaning up the data he shows an improvement to 73% in the walkthrough.
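As a rough illustration of the checks listed above, the sketch below finds duplicate questions and sorts predictions into the three suspicious groups. It is not taken from the notebook: the function names, the sample data and especially the thresholds (0.8, 0.5 and 0.4) are assumptions chosen for the example and would need tuning for a real bot.

```python
from collections import defaultdict


def normalise(text):
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def find_duplicates(examples):
    """Return {normalised_text: [labels]} for every text seen more than once.
    Duplicates with differing labels are likely mislabels."""
    seen = defaultdict(list)
    for text, label in examples:
        seen[normalise(text)].append(label)
    return {text: labels for text, labels in seen.items() if len(labels) > 1}


def triage(true_label, ranked, high=0.8, low=0.5, strong_second=0.4):
    """Classify one prediction for manual review.

    ranked: list of (intent, confidence), highest confidence first,
            with at least two entries.
    The thresholds are illustrative assumptions, not values from the talk.
    """
    (top_intent, top_conf), (_, second_conf) = ranked[0], ranked[1]
    if top_conf < low:
        return "no_confident_answer"      # possibly nonsense or a new intent
    if top_conf >= high and top_intent != true_label:
        return "confident_but_wrong"      # possibly a mislabelled example
    if second_conf >= strong_second:
        return "two_strong_answers"       # possibly overlapping intents
    return "ok"


# Hypothetical labelled questions, invented for this example.
examples = [
    ("What do bunnies eat?", "food"),
    ("what do  bunnies eat?", "housing"),   # same question, different label
    ("Where should my bunny sleep?", "housing"),
]
for text, labels in find_duplicates(examples).items():
    print(text, "->", labels)

print(triage("food", [("housing", 0.91), ("food", 0.05)]))
```

Each group points at a different clean-up job: conflicting duplicates and confident-but-wrong answers suggest relabelling, two strong answers suggest intents that overlap, and no confident answer suggests nonsense input or a missing intent.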

This is a very approachable review, not only of improving natural language datasets but also of the basics of building a bot.

Learn, Share, Code, Create!