Yesterday I got a query about search term clustering. It’s not a topic I’ve got a huge amount of experience with, so I had a quick look at the Wikipedia section on free open-source data mining software and applications, which turned up ‘Carrot2: Text and search results clustering framework’. Carrot2 (C2) has a desktop application (Carrot2 Workbench) and its clustering engine is used in a number of other tools, including Apache Solr. You can also have a play with an online version of Carrot2. Out of the box you can use the desktop application to query and cluster a number of existing sources, including Bing and Wikipedia. If you want to play with other data sources you can point C2 at other XML feeds or even XML documents, as long as they are in Carrot2 format. The structure of this is relatively straightforward: all you need is a title, url and snippet. The url appears to be used by one of the clustering algorithms and by part of the application interface to let you navigate to documents, so you could probably fill it with junk if you don’t have a valid link.
To have a quick play with this I thought I’d see what would happen if I passed a Twitter archive for #or2012 into C2, and here’s the result.
Getting the data into C2
There are a number of ways I could have got the data out in C2 XML format, like exporting a range to CSV, converting it to XML and applying an XSLT style sheet, or using the new Apps Script Content Service to generate a custom XML file. Instead, for speed, I decided to use Google Refine to import the spreadsheet straight from Google Docs:
… and then use the Templating Export tool to generate the C2 xml.
For ease, here are the values I used for prefix, row template and suffix if you want to copy and paste.
Prefix
<?xml version="1.0" encoding="UTF-8"?>
<searchresult>
<query>#or2012</query>
Row template
<document id="{{row.index}}">
<title>{{cells["text"].value}}</title>
<url>{{cells["status_url"].value}}</url>
<snippet>
{{cells["text"].value}}
</snippet>
</document>
Row separator
[single carriage return]
Suffix
</searchresult>
If you want to play along here is a copy of the #or2012 archive in C2 xml
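If you would rather script the export than use Refine's templating, the same prefix / row template / suffix structure can be approximated in Python. This is only a sketch, assuming the archive has been saved as CSV with `text` and `status_url` columns (the names used in the row template above). The function name `rows_to_carrot2_xml` and the two sample rows are made up for illustration. One difference from the template: ElementTree escapes characters such as `&` and `<` in tweet text, which a plain text template would pass through unescaped.

```python
import csv
import io
import xml.etree.ElementTree as ET


def rows_to_carrot2_xml(rows, query, text_col="text", url_col="status_url"):
    """Build a Carrot2-style <searchresult> document from dict-like rows."""
    root = ET.Element("searchresult")
    ET.SubElement(root, "query").text = query
    for index, row in enumerate(rows):
        # Mirrors the row template: id, title, url and snippet per document.
        doc = ET.SubElement(root, "document", id=str(index))
        ET.SubElement(doc, "title").text = row[text_col]
        ET.SubElement(doc, "url").text = row[url_col]
        ET.SubElement(doc, "snippet").text = row[text_col]
    # ElementTree escapes &, < and > in text, so tweets cannot break the XML.
    return ET.tostring(root, encoding="unicode")


# Made-up sample standing in for a CSV export of the archive.
sample_csv = io.StringIO(
    "text,status_url\n"
    '"RT @someone: data & repositories at #or2012",http://example.com/status/1\n'
    "Great keynote at #or2012,http://example.com/status/2\n"
)
xml_text = rows_to_carrot2_xml(csv.DictReader(sample_csv), "#or2012")
print(xml_text)
```

To use it on a real export, replace `sample_csv` with an open file handle and write `xml_text` to a `.xml` file that Workbench can be pointed at.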
Processing the data in Carrot2
Open Carrot2 Workbench and, in the search panel, set the source to XML, pick an algorithm (STC gave a decent result) and set the XML resource (which can be the URL of the Dropbox file included above or a local copy). Then scroll the panel down a bit to hit the Process button.
The results
Switching to ‘Visualization’ mode using the button at the top right of Workbench I get this view:
The interface is designed to be used for ‘exploratory analytics’. Clicking on elements like the ‘Clusters’ folder list updates the view in the Aduna Cluster Map and Circles Visualization as well as listing ‘Documents’ (tweets) related to the cluster. Clicking on a tweet from the Documents list or one of the dots in the Circles Visualization opens it in a tab within Workbench (hopefully that keeps Twitter happy for liberating their data ;s).
Useful/meaningful?
This has only been a quick first play, so I’m sure I’m missing loads in terms of tuning the processing. The STC algorithm appears to be very coarse grained, detecting 16 or so clusters. It’s useful to have a cluster of retweets, which could be exported and further analysed. Switching to the Lingo algorithm generates 102 clusters, a number of these being RT + screen name. It would be useful to define some stopwords like ‘or2012’ and ‘RT’ (I’m sure an option must be in there). Part of the reason for publishing the raw data is the hope that someone who actually knows what they are doing can show me what is possible. So over to you: I’ve shown you how to get the data in, help me get something meaningful out 😉
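One workaround for terms like ‘or2012’ and ‘RT’, short of finding the relevant setting in Carrot2, would be to strip them from the tweet text before generating the XML. The sketch below is a preprocessing step only, not Carrot2's own stopword mechanism; the `STOPWORDS` set and the `strip_stopwords` name are hypothetical.

```python
# Hypothetical stopword list: the two terms mentioned in the post.
STOPWORDS = {"or2012", "rt"}


def strip_stopwords(text, stopwords=STOPWORDS):
    """Drop whitespace-separated tokens that match a stopword.

    Matching ignores case, a leading '#' and trailing punctuation, so
    '#OR2012' and 'RT:' are removed while '@or2012' is left alone.
    """
    kept = []
    for token in text.split():
        bare = token.lstrip("#").rstrip(":,.!?").lower()
        if bare not in stopwords:
            kept.append(token)
    return " ".join(kept)


cleaned = strip_stopwords("RT @someone: Great keynote at #or2012")
print(cleaned)
```

Applied to the snippet and title text before export, this should stop the hashtag and the ‘RT’ marker from contributing to cluster labels, although screen names would still be present.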
Nicola Osborne
Martin,
A huge thank you for sharing this – very useful stuff – and an even bigger thank you for using the OR2012 tweets as your example! It’s really fascinating to see those data clusters which do seem to reflect some of the key themes (data most particularly) that arose this year.
We have used TAGSexplorer as one of our tweet archiving methods for the event btw so also a big thank you from us for that! 🙂
– Nicola
Nicola Osborne
OR2012 Host Organising Committee
EDINA Social Media Officer
Martin Hawksey
Hi Nicola – you’re welcome. Carrot2 is a tool I need to revisit and learn how to properly use 🙂
Martin
Nicola Osborne
Likewise – I downloaded the desktop version yesterday and will be having a proper look at Carrot2 when I get time to – it looks like it has some very nice functionality. Have you seen VoyantTools btw (http://voyant-tools.org/) – it’s a quick and dirty approach but the analysis of clustering/proximity within texts brought it to mind.
– Nicola.
Martin Hawksey
Voyant Tools looks like a useful tool. I’ve been playing with the #or2012 archive in it and once you put some stopwords in you’ve got some nice data to play with.
Thanks!
Martin