
AI’s biggest risk factor: Data gone wrong

Bad data is a big issue for artificial intelligence, and as businesses increasingly embrace AI, the stakes will only get higher. Here’s how not to get burned.

Maria Korolov (CIO (US))
13 February, 2018 22:00

Artificial intelligence and machine learning promise to radically transform many industries, but they also pose significant risks — many of which are yet to be discovered, given that the technology is only now beginning to be rolled out in force.

There have already been a number of public, and embarrassing, examples of AI gone bad. Microsoft's Tay went from innocent chatbot to crazed racist in just a day, corrupted by Twitter trolls. Two years ago, Google had to censor image searches for keywords like "gorilla" and "chimp" because they returned photos of African-Americans, and the problem still hasn't been fully fixed in its Google Photos app.

As businesses increasingly embrace AI, the stakes will only get higher.

"We wake up sweating," says Ankur Teredesai, head of AI at Seattle-based KenSci, a startup that applies AI to health care data. "At the end of the day, we're talking about real patients, real lives."

KenSci’s AI platform makes health care recommendations to doctors and insurance companies. If there are errors in the medical records, or in the training sets used to create predictive models, the consequences could potentially be fatal, a situation that sheds light on a key risk factor for AI implementations: the quality of your data practices.

Guardrails against AI going bad

KenSci deals with millions of patient records from partner organizations around the world. The information is in different languages, standards and formats, and is organized around different classification schemes.

To address this issue, KenSci uses home-grown and third-party tools, and it depends on partner health care organizations as well.

"The health care systems have invested significant amounts of effort in putting protocols in place, compliance in place, for ensuring that their data assets are as clean as possible," he says. "Five or ten years ago, this was a big problem. Today, because of the maturity of digitilization in most of the Western world, Asia, and Australia, there is significantly less disparate coding. A lot of the world has moved to standardization."

In mitigating the risks in relying on AI, KenSci has three additional layers of safety. First, there's the front line of defense against errors: the doctors delivering care.

"We don't believe in artificial intelligence," says Teredesai. "We believe in assistive intelligence. We leave the decision on how to act in the hands of well-trained experts, like the physicians."

The KenSci platform just makes recommendations, he says. And in most cases, those recommendations aren't even for treatments. "The majority of our work focuses on cost predictions, workflow analysis, and workflow optimizations. Many times, we are three steps away from a clinical decision."

The company's own medical experts provide a second line of defense, by reviewing the data coming in and the limits of how it can be used. For example, data from the results of treatments of male patients might not apply to women.

"We have a rigorous process for ensuring that models do not get scored if the underlying data is not correct for that model to be scored — garbage in, garbage out," he says.

Finally, there are external peer reviews of the outputs of KenSci’s models, and the factors that went into the platform’s decisions.

"Our researchers here are at the forefront of fairness and transparency of the AI movement," he says. "We believe in open publication, in distributing the parameters on which the model is making the decision, so that experts can not only review the outputs of the models, but the factors and scores that went into that scoring. There's a lot of thought that goes into making sure the KenSci platform is open, transparent, open to scrutiny."

KenSci’s approach shows the kinds of processes companies will need to put in place as they further their dependence on AI.

It's all about the data

Ninety percent of AI is data logistics, says JJ Guy, CTO at Jask, an AI-based cybersecurity startup. All the major AI advances have been fueled by advances in data sets, he says.

"The algorithms are easy and interesting, because they are clean, simple and discrete problems," he says. "Collecting, classifying and labeling datasets used to train the algorithms is the grunt work that’s difficult — especially datasets comprehensive enough to reflect the real world."

Take, for example, apps that provide turn-by-turn driving directions. They've been around for decades, he says, but have become good only very recently — because of better data.

"Google funded a fleet of cars that have driven and digitally mapped every road in America," he says. "They combine that data with satellite imagery and other data sources, then employ a team of human curators manually polishing the data representing every building, intersection and traffic light in the world. As AI is applied to a wider range of problems, the successful approaches will be those that recognize success doesn’t come from algorithms, but the data wrangling.”

However, companies often don't realize the importance of good data until they have already started their AI projects.

"Most organizations simply don’t recognize this as a problem," says Michele Goetz, an analyst at Forrester Research. "When asked about challenges expected with AI, having well curated collections of data for training AI was at the bottom of the list."

According to a survey conducted by Forrester last year, only 17 percent of respondents said that their biggest challenge was that they didn't "have a well-curated collection of data to train an AI system."

"However, when companies embark on AI projects, this is one of the biggest pain points and barriers to moving from a proof of concept and pilot to a production system," she says.

One of the biggest issues that comes up isn't so much that there isn't enough data, but that the data is locked away and hard to access, says Nick Patience, founder and research vice president of 451 Research.

"Machine learning won’t work if your data is rigidly siloed," he says. "If, for example, your financials are in Oracle, your HR data is in Workday, your contracts are in a Documentum repository and you’ve not done anything to try and create connections between those silos."

At that point, the company isn't ready for AI, he says.

"You might as well use standard analytics tools in each silo," he says.

Data issues that can derail AI

Even if you have the data, you can still run into problems with its quality, as well as biases hidden within your training sets.

Several recent research studies demonstrated that popular data sets used to train image recognition AI included gender biases. For example, a picture of a man cooking would be misidentified as a woman because in the training data, cooks were women.
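
One simple way to spot this kind of skew before training is to count how often each label co-occurs with each group. The sketch below is illustrative only and is not the method used in the studies mentioned: the function name, the (activity, group) input format, and the 80 percent threshold are assumptions.

```python
from collections import Counter, defaultdict


def label_skew(examples, threshold=0.8):
    """Flag activities where one group dominates the training annotations.

    examples: iterable of (activity, group) pairs.
    Returns {activity: (dominant_group, share)} for shares above threshold.
    """
    counts = defaultdict(Counter)
    for activity, group in examples:
        counts[activity][group] += 1

    skewed = {}
    for activity, groups in counts.items():
        total = sum(groups.values())
        group, n = groups.most_common(1)[0]
        share = n / total
        if share > threshold:
            skewed[activity] = (group, share)
    return skewed
```

A flagged activity is not proof of a problem, but it tells a reviewer where a model is likely to learn the group rather than the activity.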

"Whatever bias we have, if there are various kinds of discrimination, racial or gender or age, those can get reflected in the data," says Bruce Molloy, CEO of SpringBoard.ai.

Companies building AI systems need to look at whether the data and the algorithms that analyze the data are in line with the principles, goals, and values of the organization.

"You can't outsource judgment, ethics, values to AI," he says.

That oversight could come from analysis tools that help people understand how the AI made the decision it did, from internal or external auditors, or from review boards, he says.

Compliance is also an issue with data sources: just because a company has access to information doesn't mean that it can use it any way it wants.

Organizations have already started to audit their machine learning models and to look at the data that goes into those models, says David Schubmehl, director of IDC's cognitive and artificial intelligent systems research.

Independent auditing firms are also beginning to take a look at it, he says.

"I think it's going to become a part of the auditing process," he says. "But like anything else, it's an emerging area. Organizations are still trying to figure out what the best practices are."

Until then, he says, companies are taking it slow.

"I think we're in the early days where the AI or machine learning models are just providing recommendations and assistance to trained professionals, rather than doing the work themselves," he says. "And AI applications are taking longer to build because people are trying to make sure that the data is correct and integrated properly and that they have the right types of data and right sets of data."

Even perfectly accurate data could be problematically biased, says Anand Rao, partner and global AI leader at PricewaterhouseCoopers. If, say, an insurance company based in the Midwest used its historical data to train its AI systems, then expanded to Florida, the system would not be useful for predicting the risk of hurricanes.

"The history is valid; the data is valid," he says. "The question is, Where do you use the model, and how do you use the model?"

The rise of fake data

These kinds of intrinsic biases may be difficult to identify, but at least they don't involve data sources actively trying to mess up the results. The spread of fake news on social media is a different matter: there, the data is being deliberately manipulated, and the problem is getting worse.

"It's an arms race," Rao says.

While social media companies work to combat the issue, hackers are using their own AI to create bots clever enough to pass for human, whether to influence social media, or to convince advertisers that they are real consumers.

"We're already seeing an impact," says Will Hayes, CEO at Lucidworks. "Look at the elections and the amplification of messaging with bots and other manipulators."

Those manipulators aren't always in Russia or China, either.

"If a brand is looking to do amplification over social media, and a marketing firm wants to prove that they increased your share of voice, it doesn't take an engineer to think of ways they can manipulate the data," says Hayes.

That's where domain expertise and common sense come into play.

"Understanding the mathematics and patterns will only get you so far," says Chris Geiser, CTO of The Garrigan Lyman Group, a marketing firm that helps companies process data from a variety of sources. "The most important thing is to understand all your individual data sources. The more you understand your data, and what you're trying to achieve and your key performance indicators, the more you can point yourself in the right direction."

Triangulate your data sources

If a company has data coming in from multiple sources, it’s important to check the data from one source against another before applying any machine learning.
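
In practice, checking one source against another often starts with something as plain as comparing records that share an ID. The following sketch assumes two hypothetical sources held as dictionaries keyed by record ID; the function name and field names are illustrative.

```python
def reconcile(source_a, source_b, fields):
    """Compare records that share an ID across two sources.

    source_a, source_b: dicts mapping record ID -> dict of field values.
    Returns (agreed, conflicts, unmatched):
      agreed    - records whose compared fields match in both sources
      conflicts - {id: {field: (value_in_a, value_in_b)}}
      unmatched - sorted IDs present in only one source
    """
    agreed, conflicts = {}, {}
    for key in source_a.keys() & source_b.keys():
        a, b = source_a[key], source_b[key]
        diffs = {f: (a.get(f), b.get(f)) for f in fields if a.get(f) != b.get(f)}
        if diffs:
            conflicts[key] = diffs
        else:
            agreed[key] = a
    unmatched = sorted(source_a.keys() ^ source_b.keys())
    return agreed, conflicts, unmatched
```

Only the agreed records would go straight into a training set; conflicts and unmatched IDs are the ones that need a human, or a third source, to settle.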

As one of the largest telecoms in the world, NTT Group generates a great deal of data from its network infrastructure.

"We employ machine learning to analyze network flow data for security purposes," says Kenji Takahashi, global VP for security research in NTT Security. "Our ultimate goal is to gain the complete understanding of malicious botnet infrastructures hidden in our network."

The company is currently investing in technology to improve the quality of training data for machine learning. To do this, NTT uses "ensemble" methods that take a weighted vote of data analysis results from different sources, he says.

That data then goes into a hyperscale database that preps it as training data for machine learning.
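
The article does not describe NTT's implementation, but the general idea of a weighted vote can be sketched in a few lines. Here the source names, the weights, and the labels are all made up for illustration; in a real system the weights would reflect how reliable each source has proven to be.

```python
from collections import defaultdict


def weighted_vote(verdicts, weights):
    """Combine per-source labels into one label by weighted vote.

    verdicts: {source_name: label}
    weights:  {source_name: weight}; sources without a weight count as 0.
    Returns (winning_label, share_of_total_weight).
    """
    totals = defaultdict(float)
    for source, label in verdicts.items():
        totals[label] += weights.get(source, 0.0)

    total_weight = sum(totals.values())
    if total_weight == 0:
        raise ValueError("no weighted verdicts to vote on")

    label = max(totals, key=totals.get)
    return label, totals[label] / total_weight
```

The second return value matters for training data: a label that wins with 55 percent of the weight is a weaker training example than one that wins with 95 percent, and could be held back for review.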

"Just like in classrooms, it is disastrous to learn anything with poor quality textbook with lots of errors," he says. "The quality of training data determines the performance of machine learning systems."

Building the team and tools to tackle the problem

According to a survey released in January by Infosys, 49 percent of IT decision makers say they can't deploy the AI they want because their data isn't ready.

"AI is becoming core to business strategy, but data management remains a persistent obstacle," says Sudhir Jha, senior vice president and head of product management and strategy at Infosys.

Here, leadership is key. For some organizations embarking on an AI journey, the first step may be to appoint a chief data officer, says Marc Teerlink, global vice president for the Leonardo and AI division at SAP. Companies that have a chief data officer do a better job managing their data, he says.

"Garbage in, garbage out," he says. "Data quality, ownership, and governance make all the difference."

Most companies today have to develop their own technologies to prepare data for use in AI and machine learning systems. For that, you need data scientists, and if you don't have the brain power in-house, you can hire consultants to do the work, PricewaterhouseCoopers' Rao says.

Some forward-thinking companies, such as Bluestem Brands, are using AI to process data for use in other AI systems. The company, which has 13 different retail brands, including Fingerhut and Bedford Fair, has taken this approach to help ensure that customers searching for, say, a black dress get all the relevant results, whether the vendor calls the color "black," "midnight," "deep mirage," or "dark charcoal."

"The endless creativity of artists to refer to the shades of the same basic colors — it never stops," says IT director Jacob Wagner. And it's not just colors. "The same problem exists over every attribute that is human parsed and interpretable," he says.

Bluestem built its data-prep system out of pieces that are readily available.

"The search technology is largely becoming commoditized," Wagner says. “Lexical parsing, text matching, all that technology has been codified and polished and open source algorithms are just as effective as any proprietary package.”

And it didn't take PhD-level data scientists to do it.

"With some talented engineers, you can figure out how to wire it into your data stream," he says.

Wagner is a big fan of Apache Spark, a big data engine that can pull in data from many different sources and slice and dice it, and Apache Solr, an open source search engine. Bluestem uses these tools not only on the customer-facing side, but also internally, to help with editorial workflows.

The company also uses commercial products, such as Lucidworks Fusion, which allows business users to customize the search experience with additional business logic — say, to funnel queries related to Valentine's Day to a curated set of recommendations without requiring IT to get involved.