By @mhawksey

Notes on generating live wordclouds from Yahoo Pipes using D3.js

The JISC OER Rapid Innovation projects are all quickly finding their feet and most are already fully embracing the open innovation model and blogging their progress. Having attended the programme start-up meeting on the 26th March 2012 and speaking to most of the projects there’s rich pickings for me to blog about over the next couple of months.
In our role (JISC CETIS) supporting this programme we’ve already dusted the programme with some of our wizardry. Phil Barker has aggregated all of the registered project RSS feeds into a single stream using Yahoo Pipes and I’ve bundled an OPML file of registered feeds (if you are a Google Reader user you can subscribe directly here) Note: Not all the projects have provided feeds yet. I’ve also started an archive of the #oerri tweets which is looking sparse now but will grow over time.
Something I was interested in trying out was to see if there was a way to dynamically create a word cloud from a RSS feed. Wordle.net does have an option to generate a feed from a blog feed (shown here), but it looks like it’s a static image eg it won’t update as new project blog posts are created.
So I turned my attention to Jason Davies and his Cloud extension to the D3 javascript library.  Jason has a demonstration site which lets you experiment with wordcloud outputs using data from Twitter and wikipedia. Here’s an example for the Twitter search term jisccetis (clicking on a word starts a new search for that term).
OER RI posts straight from Yahoo PipeThere is also an option on Jason’s site to use a ‘custom’ url. This seems to accept a range of sources: html pages, rss feeds and json. You can just use the RSS output from Phil’s pipe to get this. This however looks a bit suspect to me. For example the word ‘rapid’ appears in the cloud but there are just as many occurrences of the word ‘innovation’ in the source text but it doesn’t appear. What I think is happening is the script is picking up the first 250 words and then counting the occurrences of those words. I haven’t had time to test that theory but if anyone else does leave a comment and I’ll update the post.
Instead I tried a workaround using Yahoo Pipes Term Extract. With this Pipe I take Phil’s Pipe as a source and for each blog post extract terms. I can then output this as json and use as a data source for Jason’s cloud generator creating a wordcloud that will update as more posts are published (although I’ve got no way of embedding it yet):


Dynamic cloud of OER-RI Posts using term extract

Visual inspection would suggest that this version is more reliable. There are however some things to remember:

Exit mobile version