Using Cognitive Services from Excel via PowerQuery

I have a client that I regularly consult with about various .NET application development tasks. One time they were describing a particular piece of software to me that supports product logistics. Part of the process is that products are assembled on a conveyor belt. They also showed me the conveyor belts in the factory itself. There I noticed a single guy running around with pen and paper, checking the conveyor belts. When I asked my tour guide about it, they painted an all-too-familiar picture of inefficient red tape.

So the guy was there to check on the conveyor belts. If one of them stopped, he had to note down the identification number of the conveyor belt, the time when it stopped and the time when it started again. He had a sheet for every day with three columns, one for each piece of data. Then, at the end of each day, the paper was scanned and sent to someone who looked at the scanned image and typed it into an Excel sheet :)

And there I was listening to this story in the 21st century. I mean, there has to be a way to make this more efficient.

And there is :) Azure Cognitive Services has a computer vision API that can recognize text on images. It only works in English for now, but numbers are numbers in every language (of the western world, at least).

Image recognition with the API is a two step process:

  1. You send an HTTP request to a specific endpoint with the image as the content of the request. The response body is empty, but the headers contain another URL that you have to use in step 2.
  2. You periodically query the URL you received in step 1. Step 1 basically enqueues your image-processing job, and in step 2 you check the status of that job. Once the response status indicates that the job is done, you also get a JSON body describing the recognized text.
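The two steps above can be sketched in Python (the endpoint, key, header names and response shape mirror the M query later in this post; treat it as a rough sketch, not production code):

```python
import json
import time
import urllib.request

API_KEY = "my api key"  # placeholder, as in the M query below
ENDPOINT = "https://westcentralus.api.cognitive.microsoft.com/vision/v1.0/recognizeText?handwriting=true"

def submit_image(image_bytes):
    """Step 1: POST the image; the job URL comes back in the Operation-Location header."""
    req = urllib.request.Request(
        ENDPOINT,
        data=image_bytes,
        headers={
            "Ocp-Apim-Subscription-Key": API_KEY,
            "Content-Type": "application/octet-stream",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.headers["Operation-Location"]

def extract_lines(result_json):
    """Pull the recognized text lines out of a finished job's JSON body."""
    return [line["text"] for line in result_json["recognitionResult"]["lines"]]

def poll_result(operation_url, delay_seconds=5):
    """Step 2: poll the job URL until the status says the job is done."""
    while True:
        req = urllib.request.Request(
            operation_url, headers={"Ocp-Apim-Subscription-Key": API_KEY})
        with urllib.request.urlopen(req) as resp:
            body = json.load(resp)
        if body["status"] == "Succeeded":
            return extract_lines(body)
        time.sleep(delay_seconds)
```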

Now if only there was a way to do all this from Excel. And there is :) Actually, Power Query is exactly designed for this purpose: to create a unified method to import data from any data source, be that a simple text file or a web service. And now, we have a web service.

So all I need to do is create a Power Query query (in its own programming language, called M) that does all these things. It's actually just a couple of lines:

let
    apikey        = "my api key",
    endpoint      = "https://westcentralus.api.cognitive.microsoft.com/vision/v1.0/recognizeText?handwriting=true",
    proxyendpoint = "http://cognitiveproxyservice.azurewebsites.net/api/CognitiveProxy/RecognizeText",
    bytesbody     = File.Contents("D:\temp\stats.jpg"),
    headers       = [#"Ocp-Apim-Subscription-Key" = apikey],
    bytesresp     = Web.Contents(proxyendpoint, [Headers=headers, Content=bytesbody]),
    jsonresp      = Json.Document(bytesresp),
    opLoc         = jsonresp[OperationLocation],
    getResult = (url, apiKeyParam, iteration) =>
      let
        currentHeaders = [#"Ocp-Apim-Subscription-Key" = apiKeyParam, #"CacheHack" = Number.ToText(Number.Random())],
        linesBody      = Web.Contents(url, [Headers=currentHeaders]),
        linesJson      = Json.Document(linesBody),
        resultStatus   = linesJson[status],
        result = if (resultStatus = "Succeeded") then linesJson[recognitionResult][lines]
                 else Function.InvokeAfter(() => @getResult(url, apiKeyParam, iteration + 1), #duration(0, 0, 0, iteration * 5))
      in result,
    linesText = List.Transform(getResult(opLoc, apikey, 1), each _[text]),
    table     = Table.FromList(linesText)
in table

Let's go through the code. M is a functional language so it can be a bit overwhelming if you are not familiar with the style. First, I simply define a couple of "variables". The first one is my API key. To use Cognitive Services, you have to first register and get an API key that you have to include in every request.

The next one is the endpoint where you have to issue the request, and after that comes a "proxy" endpoint. Here's where I cheated a bit. If you remember, when you use Cognitive Services, in step 1 you issue a request and get back the URL you have to query from then on in a header value. The only problem is that Power Query is quite new and still doesn't have an API that can process HTTP response headers. So I created a very simple proxy service that relays your query to Cognitive Services and transforms the response so that the header value is parsed and added to the actual body of the response, which Power Query can then process.

public class CognitiveProxyController : ApiController
{
  [HttpPost]
  public async Task<IHttpActionResult> RecognizeText()
  {
    string uriBase = "https://westcentralus.api.cognitive.microsoft.com/vision/v1.0/recognizeText";
    using (HttpClient client = new HttpClient())
    {
      // Forward the caller's subscription key to Cognitive Services.
      client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key",
             this.Request.Headers
                         .GetValues("Ocp-Apim-Subscription-Key")
                         .Single());
      var requestParameters = "handwriting=true";
      var uri = uriBase + "?" + requestParameters;
      var byteData = await this.Request.Content.ReadAsByteArrayAsync();
      var content = new ByteArrayContent(byteData);
      content.Headers.ContentType = new MediaTypeHeaderValue("application/octet-stream");
      var response = await client.PostAsync(uri, content);
      // Lift the Operation-Location header into the response body so
      // Power Query can read it.
      if (response.IsSuccessStatusCode)
        return Ok(
              new {
                   OperationLocation = response.Headers
                                   .GetValues("Operation-Location")
                                   .FirstOrDefault()
                  });
      var clone = new HttpResponseMessage(response.StatusCode);
      clone.Content = response.Content;
      return ResponseMessage(clone);
    }
  }
}

This is a technical limitation of Power Query that I'm sure will go away in time.

So with that in place and the proxy explained, the next thing is to create the HTTP request. Here I use a hardcoded filename, but Power Query supports parameters, so this could come from a cell of the Excel table. Next come the headers, and finally the request is issued using Web.Contents. The response is then parsed as JSON, and the URL that can be used to query the status of the job is retrieved from it.

Then the getResult recursive function is defined; this is the one that queries for the result. It creates an HTTP request again, and if the response indicates success, the results are extracted from the JSON. If not, the same function is invoked again recursively with linear backoff using Function.InvokeAfter. Finally, the results are transformed into an Excel table.
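The retry pattern inside getResult can be sketched in Python, too (fetch_status is a hypothetical function standing in for the HTTP call to the job URL; the iteration * 5 second delay mirrors the #duration(0,0,0,iteration * 5) in the M query):

```python
import time

def backoff_delay(iteration):
    # Linear backoff: wait iteration * 5 seconds before the next poll,
    # mirroring #duration(0, 0, 0, iteration * 5) in the M query.
    return iteration * 5

def get_result(fetch_status, iteration=1):
    # fetch_status() is assumed to return the parsed JSON of the job URL.
    body = fetch_status()
    if body["status"] == "Succeeded":
        return [line["text"] for line in body["recognitionResult"]["lines"]]
    time.sleep(backoff_delay(iteration))
    return get_result(fetch_status, iteration + 1)
```

Note that, just like the M version, this sketch retries forever; limiting the number of retries is one of the improvements mentioned at the end of the post.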

If you look closely at getResult, you can see that there is a header that I called CacheHack. This is (as the name suggests) a hack. Power Query does some internal caching, and if you issue the same web request more than once, the additional requests are served from this cache. In this case that is obviously bad. It took me some time to figure it out (I even went on Stack Overflow and asked a question about it). My analysis finally led me to this post and the caching issue was resolved. After some time I also got a comment on Stack Overflow that caching can be disabled in a nicer way, as indicated by the docs (which I clearly wasn't reading carefully enough).

So that's that. There's clearly room for improvement both on my side (using a parameter instead of a hardcoded value, limiting the retries, formatting the results into a proper table) and on the Power Query side (handling response headers), but this is close enough for a POC.

Oh, and the results? Well, that depends very much on Cognitive Services, and turns out there's room for improvement in that area as well (but then again, my handwriting isn't the easiest to read even for a human being).

So for this test image that I created:

Here's the result:

As you can see, the time values are pretty good. The identifiers of the machines are a bit off, but that's partially because I wanted to see how it handles actual Hungarian text ("Gép" means "Machine" in Hungarian). I think if the conveyor belts had simple numbers or just random letters as IDs, and they were written a bit more cleanly in ALL CAPS, the results would be even better.

So there it is. A cool solution to shave off operating costs and limit human errors using cool pieces of technology.