Having available data does not necessarily mean it is useful ‘as-is’ to develop AI and machine learning.
We can take a shortcut to understand the current state of affairs of artificial intelligence ‘in the wild’ with the opinion of Andrew Ng, a leading AI expert:
“Surprisingly, despite AI’s breadth of impact, the types of it being deployed are still extremely limited. Almost all of AI’s recent progress is through one type, in which some input data (A) is used to quickly generate some simple response (B).” — Andrew Ng, Harvard Business Review (November 2016)
There are two things to highlight from that statement.
First, he mentions that the types of AI products deployed are still limited. The deployments are “limited” in the number of tasks they cover and in how general those tasks are. This is not to say they are meaningless or irrelevant. These accomplishments are huge enablers for companies, besides being impressive showcases. To give a trivial example, compare a machine’s capacity to automatically process millions of photos with that of an army of people doing it manually.
The second point to highlight is the down-to-earth approach to artificial intelligence. It’s no longer an impractical field of computer science. There are deployments out there that are already affecting our everyday lives without our having to think about them.
To make the statement concrete, let’s use Andrew Ng’s notation of Input A and Response B, along with a possible application for each:
• Input A is a set of pictures, and we want a Response B of the type: Are there human faces in the pictures? With this sort of AI, we could build a photo-tagging application.
• Input A is a loan application, and we want a Response B of the type: Will the loan be repaid? With this we could build an AI-powered loan approval application.
• Input A is an ad plus user information, and the Response B we want is: Will the user click on the ad? With this we can build an ultra-targeted online-ads platform (not so alien to our common experience, is it?).
Diving deeper, we can add some more examples:
• Input A is audio, and Response B is the transcript of that audio. With this we can build a speech-to-text application, that is, speech recognition.
• Input A is a sentence (in English, for the sake of argument), and Response B is the same sentence translated into French. With this we could build a language translation application.
• What about a self-driving car? For that we would have input from cameras and other sensors like light detection and ranging (LIDAR). The type of outputs could be the speed, trajectory, and positions of other cars and objects.
With so many possible applications and so many players getting into AI, what remains is to determine the key considerations for a successful AI product.
First, successful AI products must be driven by one or both of two fundamental goals: aiding people in accessing and processing information, and facilitating decision-making. The crucial characteristic shared by both is that AI is treated as technology to supplement, not replace, human capability. Developers and applications that fail to grasp the continued necessity of human participation are doomed to fail, while those that do are poised for sustainable success.
Second, there is the need for data, which is unquenchable. Data is increasingly available, and the tools needed to collect and manage it are maturing. Having available data does not necessarily mean it is useful “as-is” to develop artificial intelligence and machine learning. Data “in the wild” usually is not.
Labeled data is at the heart of the majority of current deployments of machine learning. It is the way in which an algorithm can be taught by example which input corresponds to which output. But not just any data will do. It needs to be of good quality if it is going to “teach” anything useful. Quality, in this sense, implies the following:
• Curated so that no wrong or mistaken data is consumed, and it’s processed in a way that makes it suitable for consumption.
• Labeled so that the expected outcome, or lesson to learn, is present, enabling the most powerful methods at our disposal to be used.
• Balanced, which means that the sets of characteristics and outcomes are not skewed in a way that constrains generalisation or leads to biased assumptions.
• Representative of the kind of situation we are trying to assess and not a subset or special case. How good would my go-to-the-beach model (see below) be if all the data points come from winter days?
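Two of these properties, balance and representativeness, are easy to check before any training happens. The following is a minimal Python sketch, not a complete data-quality pipeline. The records, the “season” field, and the function names are all assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical records: the "season" field and every value are invented.
records = [
    {"season": "winter", "outlook": "Sunny", "label": "NoBeach"},
    {"season": "winter", "outlook": "Rainy", "label": "NoBeach"},
    {"season": "winter", "outlook": "Sunny", "label": "NoBeach"},
    {"season": "winter", "outlook": "Sunny", "label": "Beach"},
]
SEASONS = ["spring", "summer", "autumn", "winter"]


def label_balance(rows):
    """Return the share of each label, to spot skewed outcomes."""
    counts = Counter(row["label"] for row in rows)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def missing_categories(rows, field, expected):
    """Return the expected values of `field` that never appear in the data."""
    seen = {row[field] for row in rows}
    return sorted(set(expected) - seen)


print(label_balance(records))
print(missing_categories(records, "season", SEASONS))
```

In this made-up sample, three quarters of the rows share one label and three of the four seasons never appear, which is exactly the all-winter situation described above.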
Following an example used by Steven S. Skiena in The Data Science Design Manual, let’s say you are building a predictive go-to-the-beach model that, depending on several variables, predicts whether people will go to the beach or not as the outcome. In this case you might have an input that looks like this:
Outlook = Sunny, Temperature = High, Humidity = High, Go to the beach = Yes
That input can be labeled as Beach.
Another data point (also known as datum) could be as follows:
Outlook = Sunny, Temperature = Low, Humidity = Normal, Go to the beach = No
That can be labeled as NoBeach. Having that labeled data is critical to train the model that will make the prediction.
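To see how labeled examples “teach” a model, here is a minimal Python sketch. It assumes a very simple nearest-example rule (pick the label of the training example with the fewest differing attributes), which is a stand-in chosen for brevity and not the method used in the book. The first two examples come from the text; the other two, and all function names, are hypothetical.

```python
# Each example pairs the input attributes with its label.
training_data = [
    # The two data points described in the text.
    ({"outlook": "Sunny", "temperature": "High", "humidity": "High"}, "Beach"),
    ({"outlook": "Sunny", "temperature": "Low", "humidity": "Normal"}, "NoBeach"),
    # Hypothetical extra examples, invented for illustration.
    ({"outlook": "Rainy", "temperature": "High", "humidity": "High"}, "NoBeach"),
    ({"outlook": "Sunny", "temperature": "High", "humidity": "Normal"}, "Beach"),
]


def mismatches(a, b):
    """Count the attributes on which two inputs differ."""
    return sum(1 for key in a if a[key] != b.get(key))


def predict(features, examples):
    """Return the label of the most similar labeled example."""
    _, label = min(examples, key=lambda example: mismatches(features, example[0]))
    return label


new_day = {"outlook": "Rainy", "temperature": "Low", "humidity": "Normal"}
print(predict(new_day, training_data))
```

The new day differs from the second example only in its outlook, so the sketch answers NoBeach. Without the labels attached to each example, `predict` would have nothing to return.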
You might be wondering, Why do we need labeled data in the first place? Don’t we have unsupervised learning techniques available?
Recalling the difference between supervised and unsupervised methods, we are not trying to see whether there is some endogenous structure in the data that we can leverage. We have a specific objective. The trick we want our pony to learn is knowing when to go to the beach. If we don’t have the labels, how can we direct it toward that goal?
Without labels, there is no right or wrong answer upfront. Unsupervised techniques would be a helpful approach for exploring the data and making sense out of an untouched (by humans) dataset. But to get a specific response, we need supervised techniques.
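The contrast can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the grouping below is a simple exploratory tally, not a full clustering algorithm, and the rows, labels, and function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical unlabeled observations.
rows = [
    {"outlook": "Sunny", "temperature": "High"},
    {"outlook": "Sunny", "temperature": "Low"},
    {"outlook": "Rainy", "temperature": "High"},
    {"outlook": "Sunny", "temperature": "High"},
]


def group_by(data, field):
    """Exploration without labels: group rows that share a value."""
    groups = defaultdict(list)
    for row in data:
        groups[row[field]].append(row)
    return dict(groups)


def label_counts_by(data, labels, field):
    """With labels: tally the outcomes observed within each group."""
    counts = defaultdict(Counter)
    for row, label in zip(data, labels):
        counts[row[field]][label] += 1
    return {value: dict(tally) for value, tally in counts.items()}


groups = group_by(rows, "outlook")
print({value: len(members) for value, members in groups.items()})

# Labels supplied by people, one per row (hypothetical).
labels = ["Beach", "NoBeach", "NoBeach", "Beach"]
print(label_counts_by(rows, labels, "outlook"))
```

The first result shows structure (how many sunny and rainy days there are) but says nothing about the beach. Only once the labels are supplied can the data answer the question we actually care about.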
Last but not least, it’s important to understand that the current state of the art in artificial intelligence and machine learning enables us to extract useful data out of any kind of “dark” data, or data hidden within a picture, phrase, piece of audio, and so on. Social networks, retail, finance, and other industries are already using this to extract information from these sources and enrich it.
JJ Lopez Murphy is the AI and big data tech director and data science practice lead - data strategy at Globant. This is an excerpt from his book Embracing the Power of AI.