I told you I wasn’t intending to post every day 😉 Yesterday ended up being quite frustrating, with a number of dead ends. It all followed on from Processing a resource feed to find frequency using Google Spreadsheets, where I took Leeds Metropolitan University’s Repository feed from Yahoo Pipes to get activity data into a spreadsheet (BTW Nick Sheppard has documented how the Pipe was made here).
For the next step I wanted to do something similar with the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). You can read more about OAI-PMH here, but basically it’s a method for repositories to share information about the resources they’ve got. For example, you can visit the Social Policy and Social Work (SWAP) Subject Centre: SWAPBox repository and see what they’ve got. They’ve got a nice interface for browsing and searching for resources, which is great for humans but a struggle for machines. Instead the machines have their own interface using the SWAPBox OAI service (even though this service is designed for machines, it’s been beautified with Chris Gutteridge’s OAI to XHTML XSLT). Here you can get access to individual record details or, the thing I was more interested in, a list of record IDs and datestamps.
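To make the "list of record IDs and datestamps" a bit more concrete, here is a rough Python sketch of what a machine does with the OAI interface. The `ListIdentifiers` verb, the `metadataPrefix` argument and the XML namespace come from the OAI-PMH 2.0 spec; the base URL, the sample response and the function names are made up for illustration and are not SWAPBox’s real data.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace used by every OAI-PMH 2.0 response.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hypothetical, cut-down ListIdentifiers response (identifiers are invented).
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>oai:example.org:1</identifier>
      <datestamp>2010-11-02T10:15:00Z</datestamp>
    </header>
    <header>
      <identifier>oai:example.org:2</identifier>
      <datestamp>2010-11-20</datestamp>
    </header>
  </ListIdentifiers>
</OAI-PMH>
"""


def list_identifiers_url(base_url, metadata_prefix="oai_dc"):
    """Build the request URL for the first page of ListIdentifiers."""
    query = urlencode({"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


def parse_headers(xml_text):
    """Return (identifier, datestamp) pairs from a ListIdentifiers response."""
    root = ET.fromstring(xml_text)
    records = []
    for header in root.iterfind(".//oai:header", OAI_NS):
        identifier = header.findtext("oai:identifier", default="", namespaces=OAI_NS)
        datestamp = header.findtext("oai:datestamp", default="", namespaces=OAI_NS)
        records.append((identifier.strip(), datestamp.strip()))
    return records


if __name__ == "__main__":
    print(list_identifiers_url("http://example.org/oai"))  # hypothetical endpoint
    for identifier, datestamp in parse_headers(SAMPLE_RESPONSE):
        print(identifier, datestamp)
```

Each `header` carries only the identifier and datestamp (plus optional set information), which is why `ListIdentifiers` is a much lighter request than pulling full records.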
I had a look around and asked for a desktop OAI harvester that I could export data from. The closest I could get was the suggestion by James Toon to use Google Refine, but OAI results usually have pagination and I couldn’t find a way to automatically get the next page of results. I also tried putting PKP’s Open Harvester System on a USB drive by running it on an instance of MoWeS portable webserver (now there’s an idea: UKOER on a stick, could be a great conference bag goodie?). Unfortunately I discovered there wasn’t a way to easily export harvested records (although I may be able to dump from MySQL).
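The pagination works through a `resumptionToken`: if a response contains a non-empty token, you send it back (on its own, without `metadataPrefix`) to get the next page, and you stop when the token is missing or empty. Below is a minimal Python sketch of that loop. The function names, the `max_pages` safety cap and the 30 second timeout are my own assumptions, and the page fetcher is passed in as a parameter so the loop can be tried without hitting a real repository.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

# Clark-notation prefix for the OAI-PMH 2.0 namespace.
OAI = "{http://www.openarchives.org/OAI/2.0/}"


def fetch_url(url):
    """Download one page of results as bytes (timeout is an arbitrary choice)."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def harvest_pages(base_url, metadata_prefix="oai_dc", fetch=fetch_url, max_pages=1000):
    """Yield the parsed XML root of each ListIdentifiers page in turn."""
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    for _ in range(max_pages):  # cap guards against a repository that never stops
        root = ET.fromstring(fetch(base_url + "?" + urlencode(params)))
        yield root
        token = root.findtext(f".//{OAI}resumptionToken")
        if not token or not token.strip():
            break  # no token, or an empty one, marks the last page
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListIdentifiers", "resumptionToken": token.strip()}
```

Usage would be something like `for page in harvest_pages("http://example.org/oai"): ...` (a hypothetical endpoint), pulling the `header` elements out of each page as it arrives.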
Instead I reverted to Google Apps Script (something I was avoiding, as parsing XML in Apps Script is never fun). Here is a script I wrote to read an OAI service’s ListIdentifiers and write the results to a spreadsheet. Using this with the SWAPBox repository we can get a distribution of submissions (the spreadsheet for this is here).
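For anyone who would rather not use a spreadsheet, the distribution step can be sketched in a few lines of Python (this is not the Apps Script linked above, just an illustration with invented datestamps). It relies on OAI-PMH datestamps being either `YYYY-MM-DD` or `YYYY-MM-DDThh:mm:ssZ`, so the first seven characters give the month. One caveat: the datestamp records when a record was created, modified or deleted, so it is only a proxy for submission date.

```python
from collections import Counter


def records_per_month(datestamps):
    """Count datestamps per 'YYYY-MM' month, returned in chronological order."""
    counts = Counter(stamp[:7] for stamp in datestamps if len(stamp) >= 7)
    return sorted(counts.items())


if __name__ == "__main__":
    # Invented datestamps, standing in for a harvested ListIdentifiers result.
    sample = ["2010-11-02T10:15:00Z", "2010-11-20", "2010-12-01T09:00:00Z"]
    for month, count in records_per_month(sample):
        print(month, count)
```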
[As the Leeds Met Unicycle repository also has an OAI service, I thought I’d collect data from there to compare with the RSS feed, but I was getting timeouts.]
PS Whilst looking for OAI export solutions I came across the Ghent University Library Download/APIs page, which includes: daily/weekly exports of the complete datasets of the Ghent University Academic Bibliography in various formats; every report record accessible as HTML, RDF, Dublin Core, METS, MODS, MPEG-21/DIDL; an author export; department exports; classification type exports; OAI-PMH; and other stuff. All this lovely data, it just makes you want to go and do something with it! It’s a shame to have all these wonderful repositories of open resources but have them hidden away. Here are some tips from the JISC Digital Infrastructure Team on Making OER visible and findable.
PPS Here are more real-world examples of Open Bibliographic Data (H/T @ostephens).