Last year I wrote about how you can use Google Apps Script to integrate with import.io. If you are not familiar with import.io the service lets you:
transform any website into a table of data or an API in minutes without even writing any code
As part of my work at ALT we recently needed to extract data from our hosted Open Conference Systems (OCS). OCS has some data export options but none that fitted our exact need. As a hosted solution we don’t have access to the backend so I turned to import.io to liberate our own data <sigh>. OCS uses a basic authentication but the great thing about import.io is you can train it to enter your username and password and extract the data from the pages you need. Getting data behind an authentication layer with the import.io API is a two step process:
Make sure you check out the docs before integrating authenticated sources!
Every time you pass in credentials you will be logged in; pass in credentials once or via a login call and subsequently pass through cookies.
I took a while to get my head around the process because the two links in the support message just take you to the generic API docs. This is a better url to the queryLogin methods. It’s clear that import.io have put a lot of work into the developer experience, but unfortunately I struggled testing the queryLogin method. Using a valid id and model schema for the input just gave an ‘UnexpectedErrorException’. So I then turned to import.io’s own dataset tools. This was another dead end as I was struggling to get it to recognise my OCS login. Peeking under the hood I discovered:
@importio tracking in dev tools network log looks like it sends 2 _login requests, the first is empty, second has cred payload
— Martin Hawksey (@mhawksey) July 1, 2015
Looking for another ‘in’ a quick search came up with this post on Using import.io authenticated data sources with PHP and Go. Given I do a lot of coding in PHP translating to Javascript/Google Apps Script is relatively straight forward. I was still struggling however with the ‘shape’ of the login payload and the $connectorDomain. The breakthrough came remembering that import.io looked like they were dog fooding their own API in their dataset tool.
With this I could see what the $connectorDomain should have been and can now happily go off and liberate our data. Here’s my translation of the PHP example in Google Apps Script also available as a gist:
function getResults() { var connector = {'username':'YOUR_SITE_USERNAME', 'password':'YOUR_SITE_PASSWORD', 'connectorDomain':'YOUR_CONNECTOR_DOMAIN', 'userGuid':'YOUR_USER_GUID', 'connectorGuid':'YOUR_CONNECTOR_GUID', 'apiKey':'YOUR_API_KEY'} var creds = {}; creds[connector.connectorDomain] = { "username": connector.username, "password": connector.password }; var additionalInput = {}; additionalInput[connector.connectorGuid] = {'domainCredentials':creds}; //get cookies var login = query(connector.connectorGuid, false, connector.userGuid, connector.apiKey, additionalInput, false); additionalInput[connector.connectorGuid].cookies = login.cookies; var result = query(connector.connectorGuid, {"webpage/url":"http://ocs.sfu.ca/alt/index.php/conferences/altc2015/director/submissionReview/799/1"}, connector.userGuid, connector.apiKey, additionalInput, false); // do something with results like write to Google Sheet https://developers.google.com/apps-script/guides/sheets#writing_data } // http://blog.import.io/post/using-importio-authenticated-data-sources-with-php-and-go function query(connectorGuid, input, userGuid, apiKey, additionalInput, login) { var url = "https://api.import.io/store/connector/" + connectorGuid + "/_query?_user=" + userGuid + "&_apikey=" + apiKey; var data = {}; if (input) { data["input"] = input; } if (additionalInput) { data["additionalInput"] = additionalInput; } if (login) { data["loginOnly"] = true; } var ch = UrlFetchApp.fetch(url, {'method':'POST', 'payload': JSON.stringify(data)}); var result = ch.getContentText(); return JSON.parse(result); }