We’re living through a paradigm shift in computing.
Traditional software development is quickly losing territory to a new form of development – machine learning. A type of AI, machine learning is a programming technique that allows algorithms to become more accurate at predicting outcomes without being explicitly programmed. Instead, data is used to present examples to the learning algorithm — much like a child is taught by being told something over and over again. Teaching machines also relies on repetitive training data – lots and lots of it.
How much data do we need?
To give you an idea of the data needs, let’s look at a key voice technology. Speech To Text (STT) can be tackled with a machine learning approach. Estimates and experiments suggest that approximately 10,000 hours of audio are required to get a decent STT engine. That is the equivalent of nearly 5 years of someone talking 40 hours a week.
Many of the machine learning frameworks are starting to be released as open source; TensorFlow, which came out of Google research, is an excellent example. However, the data underpinning them is not being released. Without data, a learning framework cannot learn anything. This means that those without access to LOTS of data are getting left behind.
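For the curious, the "nearly 5 years" figure is easy to check. Here is a quick sketch in Python; the 52-week year (no holidays, no breaks) is our own simplifying assumption, and the variable and function names are just for illustration.

```python
# Back-of-the-envelope check of the "nearly 5 years" estimate.
HOURS_NEEDED = 10_000   # audio hours cited for a decent STT engine
HOURS_PER_WEEK = 40     # one person talking full time
WEEKS_PER_YEAR = 52     # assumption: no weeks off


def years_of_speech(total_hours, hours_per_week, weeks_per_year=WEEKS_PER_YEAR):
    """Return how many years it takes to speak total_hours of audio."""
    weeks = total_hours / hours_per_week
    return weeks / weeks_per_year


weeks = HOURS_NEEDED / HOURS_PER_WEEK
years = years_of_speech(HOURS_NEEDED, HOURS_PER_WEEK)
print(f"{weeks:.0f} weeks, or about {years:.1f} years")
# prints "250 weeks, or about 4.8 years"
```

Allow for a few weeks off each year and the figure rounds up to a full 5 years.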
How’s everyone else getting their hands on adequate data?
There are currently a handful of organizations with millions of customers who have no choice but to agree to allow their data to be used if they want to use that private company’s service. This is what’s in those “Terms of Service” notices everyone routinely accepts. Typically, the only option for a user who wishes for privacy is to simply stop using these services. This is true of Amazon, Google and Apple, to name a few.
For anyone who does not have these millions of customers at their disposal, there is usually little hope of collecting the volume of data needed to create new technology. This includes academic researchers, smaller businesses and startups, individuals working on their pet projects and, ironically, open source organizations that are respectful of their users’ privacy.
Are privacy and training data mutually exclusive?
This has been the Catch-22 in the open source world as we enter the machine learning era. Privacy is respected, yet the basic thing needed to allow individuals to have reliable, auditable, trustworthy technologies is data. There has been no ethical way to capture the kind and volume of data needed to allow this to be created.
A Cure for the Catch-22
Here at Mycroft, we’re on the path to changing that, at least in the realm of human-machine interfaces and speech to text. In version 0.8.22, we added a LEARN option on Mark 1 devices that lets users choose to contribute recordings of their device activations. Those recordings help reduce false positives (inadvertent activations) and false negatives (missed activations). Now all users can choose to Opt-In to be part of the Mycroft Open Dataset and share some of their data to help us improve this technology.
All who Opt-In under their basic settings at https://home.mycroft.ai/#/setting/basic can later choose to Opt-Out, stopping not only future data contributions but also removing any previously contributed data from future training sessions. We truly appreciate the help of all in building an AI for Everyone, and aim to safeguard what has been entrusted to us.
Source: Mycroft AI blog