By 2021, over 80% of emerging technologies will be AI-based.
However, even though this technology underpins almost every new tech product that hits the market, there is surprisingly little conversation about what shapes our artificially intelligent systems: data quality.
AI, or machine learning (ML), training data is often compared to textbooks: it educates the artificially intelligent systems, giving them context as well as the prism through which to understand concepts.
This means that AI-powered tech is only as sophisticated and accurate as the data it learns from.
We sat down with a subject expert and CEO of data service agency TechSpeed, Vidya Plainfield, to discuss the importance of AI training data, the consequences of insufficient or poorly selected data sets and some of the trends we can expect to see in the field.
1. Hi Vidya, before we get into the technicalities, tell us a bit about TechSpeed and your background in AI/ML and the business of data?
Vidya: TechSpeed was founded in Portland, Oregon, in 2002 by a data geek (my mother) and an inventor (my father).
While they are both retired now, their spirit of invention, entrepreneurship and family is still very much alive in our growing team of over 100 technicians, developers and managers.
Over our 18-year history we have had the chance to evolve and shape the data industry with our client partners as we mine, sort and harvest insights from data.
What most people don't realize is that there is a huge data engine behind the shiny frontend of AI and those terabytes of data are powered by carefully constructed information.
If you are not careful with your backend data, you can accidentally teach an AI tool something you did not intend to.
TechSpeed foundationally understands data and that has been the bedrock of how we have partnered with clients to help train and audit their AI.
2. Let’s define data quality in the context of AI/ML: How does TechSpeed qualify data?
Vidya: Of course, quality is king: garbage in, garbage out.
It is certainly tedious to clean raw data, to recode missing variables and transform qualitative into quantitative variables.
There is a saying: “Data scientists spend 80% of their time cleaning data and 20% building a model.”
The biggest pitfall we see is that firms underestimate and underfund clean quality data.
This underestimation means that when it comes to building out their program, they are faced with having to choose between having a large enough data set or having a quality data set.
The key is you need both quality AND quantity.
TechSpeed works with clients to help affordably scale their data sets so they don't have to make the trade-off. We offer a wide range of services including single, multi and DEQA processing to ensure that data is qualified in a way that meets the needs of the program.
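To make the cleaning steps mentioned in this answer more concrete (recoding missing values and turning qualitative variables into quantitative ones), here is a minimal Python sketch. It is an illustration only, not TechSpeed's process: the function name `clean_records`, the field names and the sample records are all hypothetical, and median imputation plus one-hot encoding is just one of several reasonable choices.

```python
from statistics import median


def clean_records(records, numeric_field, category_field):
    """Fill missing numeric values with the median and one-hot encode a category.

    Hypothetical helper for illustration; real pipelines need per-field rules.
    """
    # Median of the values that are actually present.
    present = [r[numeric_field] for r in records if r.get(numeric_field) is not None]
    fill_value = median(present)

    # Sorted list of category labels seen in the data, for stable column order.
    categories = sorted({r[category_field] for r in records})

    cleaned = []
    for r in records:
        value = r.get(numeric_field)
        row = {numeric_field: fill_value if value is None else value}
        # One 0/1 column per category turns a qualitative field into numbers.
        for c in categories:
            row[f"{category_field}={c}"] = 1 if r[category_field] == c else 0
        cleaned.append(row)
    return cleaned


# Made-up sample data: one record is missing its height.
raw = [
    {"height_m": 12.0, "species": "oak"},
    {"height_m": None, "species": "pine"},
    {"height_m": 8.0, "species": "oak"},
]
cleaned = clean_records(raw, "height_m", "species")
```

In this sketch the missing height is replaced by the median of the known heights, and the `species` text field becomes one 0/1 column per species.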
3. How would you evaluate the industry’s approach to data quality? Looking at your peers and clients, what are some of the most common mistakes or misconceptions regarding AI/ML training that you have come across?
Vidya: There are a lot of firms out there offering a wide range of promises to well-intentioned companies.
Some providers get things started but expect companies to handle the heavy lifting when it comes to training and ongoing exception management.
The biggest mistakes we see companies making when managing their data plan are:
1. Insufficient Volume
Large data sets across all categories are required to ensure that an even weighting of data is available for both majority and minority parameters. Without that, the algorithms will overweight majority data when trying to respond to a minority situation.
For example, suppose you are looking to categorize images of trees. Let’s say you have lots of good data on all different species of trees, in all kinds of lighting and at all stages of life. However, you don't have a lot of volume on what trees look like after a hurricane.
Of course, these will be the minority instances, but if you have robust data counts for only the majority data, when the tool looks at an image of a tree after a hurricane, it will rely too heavily on data from the majority healthy-tree data set. This can lead to errors.
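One way to see the imbalance described here is to count the examples per class and compute inverse-frequency weights. The Python sketch below is a hedged illustration: the labels and counts are invented, `class_weights` is a hypothetical helper, and reweighting is a common heuristic that can soften imbalance but does not replace collecting more minority data.

```python
from collections import Counter


def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Made-up data set: 95 healthy-tree images and only 5 storm-damaged ones.
labels = ["healthy"] * 95 + ["storm_damaged"] * 5
weights = class_weights(labels)
```

With these invented counts, each storm-damaged image receives a much larger weight than each healthy image, which makes the shortage of minority data visible before any model is trained.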
2. Insufficient Variety
Robust data across a wide range of categories is required to ensure that the tool is able to handle ongoing changes in the data set environment.
For example, let's say you were building a visual analysis tool that looked at images of storage containers. Then, all of a sudden, an upgrade to the camera system was made. Invariably the tool output will be impacted.
The world is a dynamic place. Current and future attributes of customers, environments, attitudes, etc. need to be considered to ensure that tools can accommodate those changes.
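A change such as the camera upgrade in this example can sometimes be caught by comparing a simple statistic of new inputs against the training data. The following Python sketch assumes a single hypothetical feature (average image brightness) and made-up numbers; `drift_score` is an illustrative helper, and real monitoring would usually track many features.

```python
from statistics import mean, stdev


def drift_score(reference, new_batch):
    """Shift of the new batch mean, measured in reference standard deviations."""
    return abs(mean(new_batch) - mean(reference)) / stdev(reference)


# Made-up average brightness values before and after a camera upgrade.
reference = [100, 102, 98, 101, 99]
new_batch = [120, 118, 122, 121, 119]

score = drift_score(reference, new_batch)
# Hypothetical rule of thumb: flag the batch if the mean moved by more than 3 deviations.
drift_detected = score > 3
```

A large score suggests that the new images no longer resemble the data the tool was trained on and that fresh training data may be needed.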
3. Underestimating The Difficulty Of Sourcing Data
Oftentimes firms have a lot of the majority data that they want to classify and a challenge can come when they need to mine for minority data.
For example, let's say you were building a visual analysis tool that looks at smartphone images. You may have a million images sourced from social media, across a wide variety of categories, but what you don't have is all the images that people don't upload.
What I mean is people generally post images to social media that they like, with relatively good quality and clarity.
However, if your tool is meant to review cell phone images, there are lots of images that are blurry, overexposed, tilted, etc. These images are hard to source: where do you find minority test images that people don't post?
Firms often underestimate the number of gaps in their data that will require resources to fill. In that way, a good machine learning partner will not only help you organize the data you have, but also help you source the data you don't.
4. Finally, The “Ron Popeil” Fallacy
In other words: The “set it and forget it” fallacy.
Firms often forget that the human eye is still needed for ongoing management and maintenance.
Be it low-confidence results, exception handling, auditing or optimizing with reinforcement data, these ongoing workflows are key to keeping the tool fresh and enabling ongoing success.
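The human-in-the-loop workflow described here often starts with a simple rule: accept predictions above a confidence threshold and send the rest to a person. The Python sketch below is one possible shape for that rule; the function name, the 0.8 threshold and the sample predictions are assumptions for illustration, not values from the interview.

```python
def route_predictions(predictions, threshold=0.8):
    """Split (item_id, label, confidence) tuples into accepted and review queues."""
    accepted, needs_review = [], []
    for item_id, label, confidence in predictions:
        if confidence >= threshold:
            accepted.append((item_id, label))
        else:
            # Keep the confidence so reviewers can prioritize the weakest results.
            needs_review.append((item_id, label, confidence))
    return accepted, needs_review


# Made-up model output for three images.
predictions = [("img-1", "oak", 0.97), ("img-2", "pine", 0.55), ("img-3", "oak", 0.81)]
accepted, needs_review = route_predictions(predictions)
```

Records in the review queue can then be labeled manually and fed back as new training data or rules.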
4. What are the consequences of poorly handled AI training?
Vidya: I don't have enough fingers and toes to count the times a client has come to us because they underestimated the planning, cost and scope required to develop their machine learning tool.
The worst part is that because the foundation of any program is data, clients can lose valuable time and money as they have to tear down their original data sets and start over.
If you ask a panel of CEOs, they will all tell you that they think leveraging AI is key to competitiveness in the future.
That being said, a very small percentage of firms actually budget for AI or include it as part of the strategic planning process.
So for those firms that have put money aside, they usually only have one shot to make it work.
Poorly handled AI training can sometimes mean that a firm does not have the ability to reinvest after a failed attempt. This can mean they are forever playing catch up with their competition.
5. In your view, what are some of the most important examples of how AI training data impacts us on a societal level?
Vidya: We are at a time in our history where there is an emerging awareness of the bias that has been programmed into our society.
Race, gender, age and so many more false data points have been used for far too long to drive decisions and, I would argue, sub-optimized choices that have prevented us from collective achievement.
Take, for example, a financial firm that wants to use a machine learning tool to help narrow down the field of applicants.
Let’s say the firm used 20 years of its historical employee data to identify those employees who were promoted the most and who had the highest performance evaluations, and then looked at where they went to school, what experiences they had prior to joining the firm, etc.
At first blush this may make a lot of sense: “let’s see who has been successful in our firm and hire more people like that”.
What your HR tool is blind to is the institutional bias that may have historically impacted hiring and promotion decisions.
- Men are more likely to be promoted than women.
- Caucasians are more likely to be interviewed and ultimately hired compared to people of color.
- And historically, low-income minorities are underrepresented in higher education and are disadvantaged on several attributes when it comes to college admission at tier 1 schools.
In this example, the data set was incomplete, and outside performance data must be included along with other selection variables like potential.
The magic of intentionally designed AI that is created from a purposefully diverse team can help us cut through the bias and blindspots.
It is a powerful and liberating thing to realize that we can make machines smarter than us if we choose to.
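A first, very rough check for the kind of bias in the hiring example is to compare how often each group was selected in the historical data before using it for training. The Python sketch below uses invented groups and counts, and `selection_rates` is a hypothetical helper; a gap between groups is a prompt for further investigation, not a complete fairness audit.

```python
from collections import defaultdict


def selection_rates(decisions):
    """Return the share of positive decisions per group from (group, selected) pairs."""
    totals = defaultdict(int)
    selected = defaultdict(int)
    for group, was_selected in decisions:
        totals[group] += 1
        if was_selected:
            selected[group] += 1
    return {group: selected[group] / totals[group] for group in totals}


# Made-up historical decisions for two applicant groups, "A" and "B".
decisions = (
    [("A", True)] * 6 + [("A", False)] * 4
    + [("B", True)] * 3 + [("B", False)] * 7
)
rates = selection_rates(decisions)

# Ratio of the lowest to the highest selection rate; values far below 1 suggest a gap.
ratio = min(rates.values()) / max(rates.values())
```

If the historical data already shows such a gap, a model trained to imitate it is likely to reproduce it.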
6. Does the fact that you are a female-led business differentiate you from your competitors, and if so, how?
Vidya: TechSpeed has always been a minority women-led organization.
Women make up only 5% of all CEOs, and executive-level minority women in technology are virtually non-existent.
Being a minority-women-owned business differentiates us for exactly that reason. In an industry that is heavily male-dominated, we are proud to exemplify how female leadership can bring different perspectives and solutions to the table.
We are in the business of data; we are teaching machines to see the world as it is, with all of the colors and shapes it has to offer.
Our organization reflects the diversity of perspectives that we seek to be reflected in our work.
I am a mother of three racially diverse girls in a blended household.
Diversity and female empowerment are not just things we talk about; they are who we are and how we live.
7. Now, back to training data and looking at the positive side, how does quality training data benefit the AI product, i.e. businesses that own it?
Vidya: Foundationally, well thought out training data means fewer exceptions and errors.
The primary reason to invest in machine learning and AI tools is to be able to solve problems faster and more dependably.
There is a misconception among folks new to the industry that AI is self-propelling and can be fully autonomous. However, the truth is that for most firms out there, a 10-20% rate of errors and exceptions will still exist.
This bucket of low-confidence or exception records is not a curse; it is an opportunity. Exceptions can be processed and analyzed “manually” and then converted into new or better rules or logic.
8. What process would you recommend for continuous data quality assurance? When, if ever, would you recommend machine learning be shifted to fully autonomous functioning? Does training ever end for an AI?
Vidya: Certainly the heavy lifting that is needed during the initial set-up of an AI or machine learning program is very different from what is needed for ongoing maintenance.
What we see is that the most effective ongoing programs include some sort of ongoing auditing and exception processing.
Continual review of processing exceptions and ongoing auditing will identify opportunities and weaknesses in the program.
Without exception, every project and every data set reveals nuances that were not originally planned for and sometimes those nuances need time to emerge.
In this way, planning is everything and yet the plan is nothing. Building in auditing allows the plan to remain flexible and the tool nimble.
While there are of course exceptions for very simple tools, for the most part when it comes to AI the work is never really over; it simply evolves.
9. Finally, what do you predict are upcoming trends in AI training data optimization? What should businesses that rely on AI look out for?
Vidya: There is a surge of off-the-shelf AI/machine learning tools out there and more launching every day.
Access to serve-yourself tools is allowing all sorts of businesses to experiment and start to leverage their data.
This, of course, is great for the industry and businesses. However, as we discussed before, without quality data and ongoing support, it can be problematic for do-it-yourselfers.
Firms want to run their own program, but they rarely have the horsepower to get organized and get learning data sets processed.
This can sometimes result in small or otherwise insufficient data sets and ultimately bad models.
That is where a good data support partner can provide both perspective and scalable support to help lead from behind.
There is an old saying among researchers: The more questions you ask, the more questions you realize you need answers to.
As companies seek to build increasingly complex machine learning programs, they will continue to find that the data sets they had on hand to get started are simply not enough anymore.
The need for data mining to help fill in AI logic will continue to expand. The more mature the industry, the greater the awareness of the data that we don’t have.
While this is not unique to AI or machine learning, I think we are at a time in history when people are re-evaluating how they think about their business, their customers and their community.
The assumptions and expectations that were the backbone of existing products, programs and strategies are all being re-evaluated.
Now is the time for firms to look at existing and future AI and machine learning tools with fresh and inclusive eyes.
Before, it was optional, but now it is expected, and companies that do not evolve will be left behind by consumers who have irreversibly raised their expectations.
Thank you, Vidya!