By @mhawksey

Open data equals open scrutiny but doesn’t always equal all of the answer


Last week I got frustrated at not being able to find some JISC funded project outputs. This led to a small exploration around JISC’s Programme Information Management System (PIMS), which is used to log all the JISC funded programmes and projects. The system was originally only available to JISC executive staff but was opened up to everyone sometime last year. As well as looking up projects via your browser, some of the data can be accessed via its API.

At the back of my mind was a post by Jonas Eriksson (@joeriks) on Coding my own automated web site monitoring with Google Apps Script and a Spreadsheet. What I wanted to do was pull JISC funded project website addresses from the last 3 years* and automatically test to see if they were still alive.

To do this I first needed a list of project website urls from the last 3 years. Unfortunately the PIMS API doesn’t appear to let you access records based on a date range, so instead I just grabbed the lot via http://misc.jisc.ac.uk/pims/api/dev/project (which returns the data in XML format). I could have dumped this straight into a Google Spreadsheet using the importXML formula, but I find this *very* unreliable so opted to handle it using the Google Apps Script XML Service. [Because of my lack of knowledge of this service I initially didn’t get very far, so I tried processing the response as JSON instead using http://misc.jisc.ac.uk/pims/api/dev/project=json but got into more difficulty because the API returns objects with a duplicate ‘Project’ key, so I reverted to XML.]

So here is my code to get some selected columns of PIMS data for projects that have finished in the last 3 years:
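In outline the script does something like this (a minimal sketch rather than the full code; the element names project, name, end_date and project_website are my assumptions about the PIMS XML, so they would need adjusting to match the actual feed):

function getPimsData() {
  // Grab the full project list from the PIMS API (XML)
  var response = UrlFetchApp.fetch('http://misc.jisc.ac.uk/pims/api/dev/project');
  var root = XmlService.parse(response.getContentText()).getRootElement();
  // Keep projects that finished in the last 3 years
  var cutoff = new Date();
  cutoff.setFullYear(cutoff.getFullYear() - 3);
  var rows = [['Project', 'End date', 'Website']];
  var projects = root.getChildren('project'); // assumed element name
  for (var i = 0; i < projects.length; i++) {
    var p = projects[i];
    // Assumes the end date is in a format Date() can parse
    var endDate = new Date(p.getChildText('end_date'));
    if (endDate >= cutoff) {
      rows.push([p.getChildText('name'), p.getChildText('end_date'),
                 p.getChildText('project_website')]);
    }
  }
  // Dump the selected columns into the active sheet
  SpreadsheetApp.getActiveSheet().getRange(1, 1, rows.length, 3).setValues(rows);
}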

* Paras 27 and 29 of JISC’s general terms and conditions of funding state “27. The institution and its partners must create a web page and web site to explain the project aims and objectives and to disseminate information about project activities and results. …. 29. The lead institution or one of its partners must agree to host the web site on their server for a minimum of 3 years after the end of the project and to assist JISC in archiving it subsequently.”

The data isn’t clean

Any system that gives the user freedom over the data entered will invariably collect some ‘dirty data’. For example, it would have been nice to just iterate across the project website urls, but: a) not all projects have a url entered; b) not all of the projects are projects (some of the entries are placeholders for Programme Management or Evaluation); c) urls are entered with leading whitespace; d) the field may have multiple urls or text notes; or e) the url might just be entered wrong or be an old url.

You can add layers of code to factor some of these out, like trimming whitespace or only processing urls that begin ‘http:’, but at the end of the day there will always be an error factor.
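For instance, a small helper along these lines knocks out the whitespace and multiple-url cases (a sketch; the ‘single url starting with http’ rule is a heuristic of mine, not a guarantee):

function cleanUrl(value) {
  // Trim leading/trailing whitespace
  var url = String(value).replace(/^\s+|\s+$/g, '');
  // Only accept single entries that look like a url; multiple urls and
  // free-text notes are skipped rather than guessed at
  return (/^http/.test(url) && url.indexOf(' ') === -1) ? url : '';
}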

Regardless of this I was keen to push on and find out how many of these urls were pingable using Apps Script. So here is my next bit of code to ping a spreadsheet of urls:
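The essence of it is below (a sketch: urls are assumed to sit in column C, with the response code written to column D):

function pingUrls() {
  var sheet = SpreadsheetApp.getActiveSheet();
  var data = sheet.getDataRange().getValues();
  // Adjust the bounds of this loop to ping batches of urls at a time
  for (var i = 1; i < data.length; i++) {
    var url = data[i][2]; // assumes urls are in column C
    if (!url) continue;
    var code;
    try {
      // muteHttpExceptions records 404s and the like instead of throwing
      code = UrlFetchApp.fetch(url, {muteHttpExceptions: true}).getResponseCode();
    } catch (e) {
      code = 'fetch failed'; // DNS failures and malformed urls still throw
    }
    sheet.getRange(i + 1, 4).setValue(code); // response code into column D
  }
}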

Using this I found that the UrlFetchApp service would hang if there were over 200 urls and not push its results into the sheet, so I had to manually adjust the for loop (line 5) to do batches of urls at a time.

The results for this first pass are in the embedded table below (also in this Google Spreadsheet). It’s notable that almost half of the entries don’t have website urls associated with projects. As mentioned earlier, not all of the entries are externally funded projects; a proportion are internal activities.

As there were also quite a few invalid website url entries (n. 32) I tidied these up manually (a number of ‘tbc’ and other text entries) and ran the check again as a second pass, also summarised in the table. In the second pass I did some manual checking of the project entries returning 404 and 590 errors.

The 590 errors are all associated with project websites hosted on the HEA website as part of the OER programme. An example is http://www.heacademy.ac.uk/projects/detail/OER_IND_Bradford which returns an HEA-themed page with no content. It appears these urls have been entered incorrectly, or the HEA have changed the structure of their site, as the following url does work: http://www.heacademy.ac.uk/projects/detail/oer/OER_IND_Bradford.

The manual checks only identified one or two urls wrongly flagged as missing, suggesting that UrlFetchApp, which automatically follows redirects, is accurate.

PRODing a different data source

Another data source with JISC project data is the CETIS PROD directory. This pulls in a number of sources, including PIMS, and gets some extra TLC from CETIS staff, who curate the data and manually add extra pieces of information. Like PIMS, PROD has an API to get data out, but as far as I could see there was no way to get all the data.

I was originally made aware of PROD by JISC CETIS’s Wilbert Kraan (@wilm), and a few friendly tweets later I ended up with the following query for the TALIS/PROD data store (I should say I know very little about Linked Data/SPARQL, so rather than show my ignorance I’m not even going to mention it):

PREFIX prod: <http://prod.cetis.ac.uk/vocab/>
PREFIX doap: <http://usefulinc.com/ns/doap#>
PREFIX mu: <http://www.jiscmu.ac.uk/schema/muweb/>
PREFIX jisc: <http://www.rkbexplorer.com/ontologies/jisc#>
SELECT DISTINCT *
WHERE {
    ?s a doap:Project .
    ?s doap:name ?project .
    ?s jisc:end-date ?date .
    ?s doap:homepage ?homepage .
}

Just as with the PIMS example it was possible to use Apps Script to fetch the results using this getProdData code. [You’ll see that the fetch query url has been shortened using bit.ly. This is because, as Tony discovered, Google Apps Script doesn’t like lonnnnnngggg queries, but it is happy to follow redirects.]
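In outline it looks something like this (a sketch: the bit.ly address below is a hypothetical stand-in for the real shortened query url, and the field names follow the variables in the SELECT above, assuming JSON output has been requested from the store):

function getProdData() {
  // Shortened stand-in for the full SPARQL query url; Apps Script
  // follows the bit.ly redirect through to the Talis store
  var url = 'http://bit.ly/xxxxxx'; // hypothetical short url
  var response = UrlFetchApp.fetch(url);
  // Standard SPARQL JSON results: a bindings array keyed by variable name
  var bindings = JSON.parse(response.getContentText()).results.bindings;
  var rows = [['Project', 'End date', 'Homepage']];
  for (var i = 0; i < bindings.length; i++) {
    rows.push([bindings[i].project.value,
               bindings[i].date.value,
               bindings[i].homepage.value]);
  }
  SpreadsheetApp.getActiveSheet().getRange(1, 1, rows.length, 3).setValues(rows);
}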

Below is a summary of the results from pulling and pinging the project homepage urls from PROD (also available in this Google Spreadsheet). PROD returns just over half as many projects (n. 430), but all of the homepage urls it does hold are valid without any clean-up. Over 30% of projects are still missing a homepage url, but this doesn’t mean a project has no web presence at all: the PROD data contains other urls, where they exist, for a JISC page, RSS feed, wiki site etc.

Even though PROD gets some extra love and attention, 13% of the recorded projects’ homepage urls hit dead ends (mainly 404 errors). Just as with the PIMS data I had a look at these in a second pass and found that 5 projects actually had an alternative web presence (usually a hosted blog).


Many eyes and many things

What can we take from all of this? A third of JISC funded projects don’t have a project website? One in ten projects with websites aren’t available for 3 years after the project ends? I don’t think it’s as conclusive as that. Those stats rest on the assumption that all JISC funded projects have agreed to the general terms and that the general terms have remained unchanged. What interests me more is how the information can be improved and reused.

In terms of improving the quality of the data, just as PROD adds an extra level of human curation, there is potentially the opportunity to add a wider level of community curation, similar to Wikipedia or IBM’s Many Eyes. The challenge is less technical (in this sheet I’ve added a comment link for each entry, which links to a comment form) and more about finding the right incentive to get contributions.

In terms of reuse I have one simple use case in mind, and I’m sure there are many more. In a couple of hours it was possible to pull this data into a Google Spreadsheet and ping the project website urls. It would only be a tiny step to not only automate this but also trigger an email to someone when a website goes off the radar.
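For example, something like the sketch below would do it (the sheet layout and the alert address are assumptions; hooked up to a time-driven trigger it would run itself):

function checkSitesAndAlert() {
  var sheet = SpreadsheetApp.getActiveSheet();
  var data = sheet.getDataRange().getValues();
  for (var i = 1; i < data.length; i++) {
    var url = data[i][2]; // urls in column C, last status in column D (assumed)
    if (!url) continue;
    var code;
    try {
      code = UrlFetchApp.fetch(url, {muteHttpExceptions: true}).getResponseCode();
    } catch (e) {
      code = 0; // treat fetch failures as down
    }
    // Email when a site that answered 200 last time stops doing so
    if (code !== 200 && data[i][3] == 200) {
      MailApp.sendEmail('someone@example.com', 'Project website off the radar',
          url + ' now returns ' + code);
    }
    sheet.getRange(i + 1, 4).setValue(code); // record the latest status
  }
}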

[Here’s a link to the Google Spreadsheet with all the data I pulled]
