Project Oxford: Microsoft serves up APIs for intelligent apps

Microsoft talks up the IoT potential of the cross-platform Oxford SDKs and face, speech, and vision APIs

Microsoft this past spring announced Project Oxford, a set of SDKs and APIs that allow developers to build “intelligent” applications without having to learn machine learning. Using Oxford’s face, speech, and vision APIs, developers can create applications that recognize facial features, analyze images, or convert speech to text and text to speech.

In an interview with InfoWorld Editor at Large Paul Krill, Microsoft’s Ryan Galgon, senior program manager responsible for the Project Oxford platform and technologies, talked about the goals behind Oxford, emphasizing its potential in the Internet of things.

InfoWorld: Who is building Oxford applications? Who is Oxford for?

Galgon: We’ve had a lot of people come in and sign up for the API services. The exact numbers [are not] something I can get into, but we’ve had lots of Azure accounts created, lots of signups through our Microsoft Azure Marketplace. People are kicking the tires on the services, as well as reaching out about making heavier use of them. Right now, they’re all offered as a limited free tier on a monthly basis, and we’re working to open that up as we’ve gotten feedback about what changes developers want to see made to the APIs and models.

It’s all cross-platform, in the sense that it’s a set of Web services that are accessed primarily through a REST API interface. Anything that can contact a website can call these back-end services. We provide a set of SDKs, which wrap those REST calls and make them easier to use on clients like Android and Windows and iOS. Anything that can make an HTTP Web call can call the services.
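As a rough illustration of that point, the Python sketch below builds the kind of HTTP request a client might send using only the standard library. The endpoint URL, the subscription-key header name, and the JSON body shape are assumptions made for illustration; the interview does not specify them, and the URL is a deliberate placeholder.

```python
import json
import urllib.request

# Assumed values for illustration only; the interview does not give them.
ENDPOINT = "https://example.invalid/face/detect"  # hypothetical placeholder URL
KEY_HEADER = "Ocp-Apim-Subscription-Key"          # assumed header name


def build_detect_request(image_url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST asking the service to analyze an image."""
    body = json.dumps({"url": image_url}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json", KEY_HEADER: api_key},
        method="POST",
    )


def detect_faces(image_url: str, api_key: str):
    """Send the request and decode the JSON reply (needs a real endpoint and key)."""
    with urllib.request.urlopen(build_detect_request(image_url, api_key)) as resp:
        return json.load(resp)
```

Nothing here is platform-specific, which is the argument Galgon is making: an SDK for Android, Windows, or iOS would wrap a call of roughly this shape.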

InfoWorld: Do you foresee Oxford being used primarily on mobile devices or on Windows desktops?

Galgon: It’ll probably be primarily a mix of mobile and IoT devices, in the sense that when people are using desktops, in the vast majority of uses I see, you’re sitting there, you have the keyboard and mouse and that type of input. But when you have a mobile phone, you’re capturing photos and video and audio. It’s so much easier and more natural to capture that with a tiny device. [Project Oxford technology will be used] where the dominant input case is going to be natural data, not only numbers but some sort of visual or audio data type.

InfoWorld: Tell us more about these APIs. What are some of the things developers can do?

Galgon: Because we want to reach as many developers as possible, we’ve really put a lot of work into making them very easy to use, [for] things like face detection or computer vision, image categorization. Those things are trained and modeled, built by people with years of deep research experience in those places and we don’t want developers to have to go become an expert in computer vision. We’ve really tried to say, “Look, we’re going to build the best model we can build and make it available to you and make it accessible within three lines of code for you.”

I can’t talk about how external partners are looking at making use of the Oxford APIs, but the main ones that Microsoft has worked on, that maybe you’ve seen, the first one was the How-old.net site for predicting ages and genders. Then we had TwinsorNot.net, and that was given two photos, how similar are these people? Those were both good examples of the Face APIs. The final one, which used the Face API and some Speech APIs, was a Windows 10 IoT project that a few blog posts were written about where you were able to unlock a door with your face and converse with the door -- or the lock, in that case. I think those are three examples Microsoft has worked on and shared with other people to show the type of application that can be built.

InfoWorld: Under these REST APIs, what makes Oxford tick?

Galgon: The core is machine-learned models that we built for things like speech-to-text. Whether you access it via a REST API -- or with speech-to-text, you can also access it via a Web socket connection -- the magic or the powerful thing there is this model that can take audio of someone speaking and a language that it’s in and translate that into text format. That’s the main thing that makes Oxford tick as a whole.

InfoWorld: Why is Project Oxford separate from the Azure Machine Learning project?

Galgon: In Azure Machine Learning, one of the main components is the Azure Machine Learning Studio, where people can come in with their data, build an experiment, train their own model, then host that model. With Oxford, this is a prebuilt model that Microsoft has, a model we’re going to keep improving in the future, and we let people make use of that model over these REST interfaces.

InfoWorld: What type of enterprise business use do you see for Project Oxford? What is the business case for Oxford applications?

Galgon: There are no specific partners I can really talk about at this time, but I think one of the cases we’ve seen a lot of interest in, where I personally see a lot of use cases, is when it comes to Internet of things-connected devices. When I look at the way that people are looking at building IoT devices, you don’t have a keyboard and a mouse and often even a real monitor associated with all these devices, but it’s easy to stick a microphone on there and it’s pretty easy to stick a camera on there as well. If you combine something like the speech APIs and LUIS (Language Understanding Intelligent Service), then with a device that has only a microphone and no other means of input, you can now talk to it, tell it what you want to do, translate that into a set of structured actions, and make use of that in the back end. That’s where I think we’re going to see a lot of use cases for the Oxford APIs.
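The flow Galgon describes (spoken request, then structured action, then back-end handling) can be sketched in a few lines of Python. This is only an illustration: `fake_understand` is a hypothetical stand-in for the speech and language-understanding services, and the intent and entity dictionary shape is an assumption, not the actual LUIS response format.

```python
from typing import Callable, Dict


def fake_understand(utterance: str) -> dict:
    """Hypothetical stand-in for speech-to-text plus language understanding.

    Returns an assumed structure: an intent name and a dict of entities.
    """
    words = utterance.lower().split()
    if "unlock" in words:
        return {"intent": "unlock_door", "entities": {}}
    if "temperature" in words:
        digits = [w for w in words if w.isdigit()]
        degrees = int(digits[0]) if digits else None
        return {"intent": "set_temperature", "entities": {"degrees": degrees}}
    return {"intent": "none", "entities": {}}


# Back-end actions keyed by intent name (all names are illustrative).
HANDLERS: Dict[str, Callable[[dict], str]] = {
    "unlock_door": lambda entities: "door unlocked",
    "set_temperature": lambda entities: f"thermostat set to {entities['degrees']}",
}


def handle(utterance: str) -> str:
    """Map an utterance to a structured action and run the matching handler."""
    result = fake_understand(utterance)
    handler = HANDLERS.get(result["intent"])
    if handler is None:
        return "sorry, I did not understand"
    return handler(result["entities"])
```

In a real device the stand-in function would be replaced by calls to the hosted services, while the dispatch table on the back end could stay much the same.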

InfoWorld: You mentioned iOS and Android. What’s been the uptake on those platforms?

Galgon: By making the APIs RESTful and providing these wrappers for them, we’ve definitely seen people downloading those wrappers, making use of them. But at the end of the day, it happens to be, “Here’s a Java language wrapper around a Web call,” “Here’s an Objective-C wrapper around a Web call.” We don’t have a lot of insight into what is the exact device that’s making the call.

InfoWorld: Is Oxford going to be open source?

Galgon: We don’t plan on open-sourcing the core models, and I don’t have anything to share about that because we keep updating the models over time. The SDKs that we provide, since they’re wrappers around those REST calls, that source code’s there and available to go download for anyone today from the website. But again, that’s a thin wrapper on things and we’ve actually seen people in MSDN forums that have been providing code snippets in different languages around it.

InfoWorld: How does Microsoft plan to make money off of Oxford?

Galgon: The APIs in the Marketplace are all free today for limited usage, so you get 5,000 API transactions a month. That’s the only plan we have available now. In the future, we’ll roll out paid plans based on usage of the APIs.
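For developers experimenting within that free tier, a client-side counter is one simple way to avoid running past the 5,000-call allowance. The sketch below assumes the allowance resets with the calendar month; how the service itself counts and resets usage is not stated in the interview, and the class name is illustrative.

```python
from datetime import date

FREE_CALLS_PER_MONTH = 5000  # figure quoted in the interview


class MonthlyQuota:
    """Client-side call counter that resets when the calendar month changes.

    Assumption: the free allowance follows calendar months.
    """

    def __init__(self, limit: int = FREE_CALLS_PER_MONTH):
        self.limit = limit
        self.month = None  # (year, month) of the period being counted
        self.used = 0

    def try_consume(self, today: date) -> bool:
        """Record one call if budget remains; return False once the limit is hit."""
        key = (today.year, today.month)
        if key != self.month:
            self.month, self.used = key, 0
        if self.used >= self.limit:
            return False
        self.used += 1
        return True
```

An application would call `try_consume(date.today())` before each API request and skip or queue the request when it returns `False`.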

InfoWorld: What is next for Oxford?

Galgon: Where we go from here is really three areas. The first area is about updating and improving the existing models. We got feedback from developers [about how] one of the APIs might not work great with certain types of images. We’ll improve the core model there.

One of the other things we’ll do is we’ll keep expanding the number of features returned from the models. Today, the Face API gives you predicted age and predicted gender. We’ve seen a lot of requests for being able to recognize other content within images.

The third area is we’ll expand the portfolio of APIs that we have. We have four today, but we’re definitely not done. We don’t think the whole space that we want to provide or the tools that we want to provide is complete yet. We’ll keep adding on new APIs that can deal with different data types or can provide very different types of natural data understanding than what we give today.