A repost of guest post on OER Research Hub on November 24th, 2014
As part of the OER Research Hub the community has been asked to respond to surveys examining OER awareness and usage. This has been done using the online web service SurveyMonkey. One of the features of SurveyMonkey is to record additional information about who is responding to the survey as well as their responses. This includes the option to store the IP address of the respondent. For those not familiar with IPs these are the web equivalent of postcodes. IP addresses are used to let browsers find where the page the person is looking for is stored. They are also used to identify where the request for the page is coming from. As well as IPs being used to direct traffic to a virtual space on a server, the server itself, even in the cloud, has a physically location somewhere. Consequently IP addresses can be used to give a best guess of geographic location. This is a feature that is widely used. Anyone with access with website stats will usually get a breakdown of access by location. How accurate the location returned is variable. Users can mask their location with various techniques including routing via a proxy server. If you are using an internet service provider to access the web the location will most likely be the location of their data centre. For example, when browsing the internet from San Francisco airport I’m recorded as being located north of the airport.
In the case of the OER Research Hub survey data as well as response to the survey question asking where the respondent was based we can get a estimate of their location from their recorded IP address. So the next question, which this post is focused about, is how can we convert over 6,000 survey responses to geo-located data? There are a number of ways you can do this but the technique I’ll be highlighting is using the open-source data clean-up tool OpenRefine and a web service called freegeoip.net.
If you haven’t already you will need to download a copy of OpenRefine and follow the instructions to get it started. Next you need to import the survey responses from SurveyMonkey which I assume you have already exported in a .csv format. To do this launch OpenRefine and in the browser window that open click ‘Create Project’ tab
In the ‘Get data from’, use the ‘This Computer’ option selecting your survey data file. After you click ‘Next’ which should give you an import preview similar to the one below. You may need to use the settings highlight to adjust how your data is imported. For the Evidence Hub data there was some pre-processing to the data which meant additional rows at the start needed to be ignored. When you are happy click the ‘Create Project’ button on the top right.
You should now have all of your survey responses imported, one per row. To get location data for our IP addresses we are going to use a OpenRefine feature called ‘Add column by fetching URLs’. To use this from the IP_Address column click on the down arrow to the left of the column name and select Edit column > Add column by fetching URLs.
You should now see a window similar to the one below. I’ve highlighted parts of the screen you need to pay attention to. As the menu option name suggests we are going to add a new column so a name is required. For this example I’m using the name ‘geo_data’. Next we have a ‘Throttle delay’ option. This is used to set the length of time between each url OpenRefine fetches. This is useful as some services will stop giving you back data if you make too many requests in a short period of time. For our example we will keep to the default 5000 milliseconds (5 seconds). Next there is the ‘Language’ setting. OpenRefine lets you uses a couple of different programming languages to do different things. You want to make sure you are using ‘Google Refine Expression Language (GREL)’ (it’s called GREL instead of OREL because before being called OpenRefine the code was maintained by Google and the product was called Google Refine). The last setting is the most important, the expression. For OpenRefine to fetch new data it needs a web address to go to. For this project we are going to fetch data from the freegeoip.net service. The service has documentation on the various ways to get data. For this project the data format we want to use is json. To get geographic details for our IP addresses the pattern we need is http://freegeoip.net/json/ip_address e.g. http://freegeoip.net/json/188.8.131.52. To pass our IP address for each row of data our expression is:
Once you have set this up you can hit the ‘OK’ button and go and make a cup of tea … or in fact several cups of tea depending how many rows you have. While you are waiting for OpenRefine to fetch all the data don’t close your browser tab as you won’t know when the job is done. This isn’t an excuse to slack off as you can still use your computer for other things.
When OpenRefine is finished you should have an extra column with data similar to that below:
This data is all a bit raw and to make use of it in other applications it’s better to get individual metadata like latitude, longitude, country_name into separate columns. There are a number of ways in OpenRefine that you could do this. In this example I’m going to use another column option ‘Add column based on this column…’:
With this we get a similar window to add column by url. Again we need to specify a column name and make sure we are using the GREL language. The difference this time is we need a different expression. The data we got back from freegeoip.net was in a JSON data format. To select individual parts we can
parseJson(our cell value) and the
.get(the value we want). So to get country_name we use the expression:
You need to repeat this for all the bits of the geo_data cell values you’d like to extract. If your surveys are still active and you would like to add geographic data to new results a nice feature is to export and reuse the steps used to interact with freegeoip.net and manipulated the data. The OER Research Hub file for doing this is here. To extract or apply the steps there are options under the undo/redo tab:
So once you have all your IPs geo-coded the next question is now what. There are various other tools and services that you can use to put and display geo-data and in my next post I’ll highlight what we’ve done with this data in the oermap.org. If you are looking for a quick trick one of the column menu options is Facet > Scatterplot facet which let you see the geographic density of you data
To get your data out of there are a number of export file formats and lots on online tutorials on how to do this and in my next post I’ll outline some of the things we have done with the OER Research Hub survey data