Free (and rebuild) the tweets! Export TwapperKeeper archives using Google Refine

Last month I posted Free the tweets! Export TwapperKeeper archives using Google Spreadsheet, which was a response to the announcement that TwapperKeeper would be removing public access to archives on the 6th January. This solution was limited to archives smaller than 15,000 tweets (although minor tweaking could probably get more). Since then Tony Hirst has come up with a couple of other solutions:

One of the limits these solutions have is they only collect the data stored on TwapperKeeper missing lots of other juicy data like in_reply_to, location, retweet_count (here’s what a tweet used to look like, now there is more data). Whilst this data is probably of little interest to most people for people like me it opens the opportunity to do other interesting stuff. So here’s a way you can make a copy of a Twapper Keeper archive and rebuild the data using Google Refine.

  1. You’re going to need a copy of Google Refine and install/run it
  2. Visit the Twapper Keeper page of the archive you want. On the page copy the RSS Permalink into the URL box in Google Refine adding &l=50000 to the end e.g. the ukoer archive is http://twapperkeeper.com/rss.php?type=hashtag&name=ukoer&l=50000 and click Next.
  3. In the preview window that appears switch ‘Parse data as’ to XML files. Scroll the top pane down to hover over the ‘item’ tag and click
    Refined parse xml 
  4. You should now have a preview with the data split in columns. Enter a Project name and click ‘Create Project’
    Refined Columns
  5. From the ‘item – title’ column dropdown select Edit column > Add column based on this column…
  6. In the dialog that opens enter a New column name ‘id_str’ and the Expression smartSplit(value," ")[-1] (this splits the cell and returns the last group of text)
  7. From the new id_str column dropdown select Edit column > Add column by fetching URLs… and enter a name ‘meta’, change throttle delay to 500 and enter the expression "https://api.twitter.com/1/statuses/show.json?id="+value+"&include_entities=true" (that’s with quotes), then click OK.  [What we’ve done is extract a tweet id and then asked refine to fetch details about it from the Twitter API]
  8. Wait a bit (it took about 4 hours to get data from 8,000 tweets)

Once you have the raw data you can use Refine to make new columns using the expression parseJson(value) then navigate the object namespace. You might find it useful to paste a cell value into http://jsonviewer.stack.hu/ to see the structure. So to make a column which extracts the tweets full name you’d use parseJson(value).user.name

So don’t delay ‘free the tweets’

OER Visualisation Project: What I know about #UKOER records on Jorum and OER Phase 1 & 2 [day 18]

Most of the numbers above come from two spreadsheets: CETIS PROD Spreadsheet; and jorumUKOERReconciled – Issue 2.  I’ve mentioned both of these spreasheets before (day 8 | day 16), but you might like to File > Make a copy of these to play with the data yourself and see some of the formulas used. An additional note on the resource view counts. These were collected by fetching the each resource page on Jorum using Google Refine and scraping the data (more details on using Refine to extract data from day 11.

[The additional processes were to extract a Jorum url by using the expression filter(value.split("|"),v,contains(v,"jorum.ac.uk"))[0]) on the identifier column, fetching a new column based on the new url and then extracting a count using toNumber(replace(filter(cells["jorumpage"].value.parseHtml().select("p.ds-paragraph"),v,contains(v.htmlText(),"viewed"))[0].htmlText(),/\D/, ''))]

So I now have a decent amount of data (some of which might be dropped), next to communicate …

OER Visualisation Project: Data Driven Journalism [day 16] #ukoer

By my calculation it’s day 16 of the OER Visualisation Project. Since day 11 and the Jorum UKOER ‘snowflake’ visualisation I’ve been going back to the refined data and trying to validate the set and understand the information it contains better.

One of the things I did was upload the data from Google Refine to Google Spreadsheet (exported Refine as .xls and uploaded it to Docs). Here is a copy of the spreadsheet. Using the UNIQUE and COUNTIF formula it’s very easy to built a summary of the top Jorum UKOER contributors and subject categorisation.

In the original OER funding call paragraph 19 states: “depositing resources funded through this call into JorumOpen will be mandatory” so in theory all 51 Phase 1 and 2 OER projects should in theory have records in Jorum. We can use this assumption to validate the refined dataset.

Using day 8’s CETIS PROD to Google Spreadsheet its easy for me to create a list of Phase 1 and 2 lead institutions (41 in total as some institutions from phase one were refunded). Using this list was able to query the spreadsheet data and produce the following table embedded below which counts Jorum ‘ukoer’ records for each of the institutions:

You can see a number of institutions have zero record counts. These are mainly for the HEA Subject Centre projects which were not detected using the original extraction and reconciliation method, but as also noted, a number of these records are reconciled against other university names. Using this data the original extracted dataset was further refined and an additional 705 ukoer records were reconciled against institution names. A revised issue and summary of ukoer records is available here.

Data Driven Journalism

Most people are probably unfamiliar with the term ‘data driven journalism’ but would have seen some of the products of the process like The Guardian’s Interactive guide to every IED attack (Wikipedia has a useful definition and overview of data-driven journalism).

It’s been useful for me to consider the OER visualisation project like a data journalistic assignment, using Paul Bradshaws The inverted pyramid of data journalism as a basic processes to approach the Jorum data. For example, remembering the ‘context’ in which the Jorum data was collected (mandatory task, which in cases wasn’t always full automated) is a reminder that even after multiple refinements of the data it’s still not 100% complete and in parts may be unrepresentative.

Looking at a table of top Jorum UKOER contributors, for example, Staffordshire University accounts for almost 50% of the deposits almost all going in the HE – Creative Arts and Design subject area, while University College Falmouth have one Jorum entry for their entire ukoer work.     

Top UKOER Depositors Records
Staffordshire University4113
University of Cambridge855
Newcastle University847
Subject Centre for Information and Computer Sciences669
Leeds Metropolitan University383
Top subjectsRecords
HE – Creative Arts and Design4068
HE – Engineering1227
HE – Veterinary Sciences, Agriculture and related subjects874
HE – Mathematical and Computer Sciences767
HE – Physical Sciences454

Using Data Journalism processes should also be helpful when considering how the data is communicated using techniques like interactive slideshows and providing brief narration to the data.

With this in mind it was useful to revisit Geoff McGhee’s Journalism in the Age of Data (embedded below)

A lot more to do in 2012

What I’ve starred this month: December 28, 2011

Here’s some posts which have caught my attention this month:

Automatically generated from my Diigo Starred Items.

OER Visualisation Project: Exploring UKOER/JORUM via OAI with Google Refine and visualising with Gource [day 11]

I should start with the result so that you can see if it’s worth doing this:

The video shows the deposits from institutions and Subject Centres to Jorum tagged ‘ukoer’ from January 2009 to November 2011. In total over 8,000 deposits condensed into 5 minutes (there are more records, but these were the ones that could be reconciled against an institution name).

Here’s the recipe I used to do it, which should be easy to modify for your own and other repositories. As the explanation takes longer than to actually do it I’m going to assume you understand some basic tools and techniques, but you can always leave a comment if something isn’t clear.

Let start by looking at what it is we are trying to achieve. The animation is generated using code from the open source Gource project. Gource uses an input log file to visualise software commits using the format shown below. So for the  Jorum visualisation we need to generate a file with timstamp, creator (in this case the submitters host institution) and title (prefixed by subject classification).

Gource log format

The user andrew adding the file src/main.cpp on Thu, 03 Jun 2010 05:39:55 GMT (1275543595):

1275543595|andrew|A|src/main.cpp

Getting the data – building a source

Building the log file we need details of the records from Jorum. Fortunately Jorum implements the OAI Protocol for Metadata Harvesting, which is designed to allow the easy sharing and access of repository data. I say easy but in reality its easy if you have another repository on a server somewhere that can consume OAI data, but its not easy to find a desktop based solution. After a lot of trial and error I’ve arrived at a solution using a combination of MS Excel and Google Refine (BTW “Google Refine is a power tool for working with messy data, cleaning it up, transforming it from one format into another” – it’s also open source).

I had hoped to do all of this in Google Refine but was struggling with the initial data import, recognising the schema and including records with multiple subject classifications, so we briefly start with MS Excel.

In Excel (I’m using 2010, other versions may work) we want to start a new workbook. In the Data ribbon select ‘From Web’. In the dialog that opens in the Address bar enter http://dspace.jorum.ac.uk/oai/request?verb=ListIdentifiers&metadataPrefix=oai_dc .

Excel data ribbon

Once it’s finished loading (which can take a while) click Import. You’ll now get some dialog boxes warning you about the xml import but you can ignore those. You should now have a sheet of List Identifiers, that is a list of all the record identifiers (ns1:identifier4) and the subject set they are attached to (ns1:setSpec6status3) – you’ll find that there are more columns, mainly blank, which we don’t need.

Excel XML data imported

Next we want add some readable subject classification to the data by changing setSpec ids into text equivalents. This data is also available via Jorum’s OAI service and the raw data can be seen by looking at http://dspace.jorum.ac.uk/oai/request?verb=ListSets.

To get this data into Excel we want to follow a similar process to above in the same spreadsheet getting Data – From Web using http://dspace.jorum.ac.uk/oai/request?verb=ListSets as the address. This gives us a sheet similar to below with setSpec ids and associated category name.

Getting subject classification

Next we want to match this data to the sheet of List Identifiers. To do this we first want to sort the data we just captured on the setSpec column. Now in the original sheet add a new column and enter the following formula in the cell immediately beneath the column name (row 2):

=VLOOKUP([@[ns1:setSpec6]],ListSets!A:B,2,FALSE)

This formula looks up the setSpec6 value, matches it against the data we just got and returns a setName. You can now save this spreadsheet.

Getting more data using Google Refine

So far we’ve got a list of record ids from Jorum and the subject category for that record. We still need to get when the record was created, who by and resource title. To do this we are going to use Google Refine. If you haven’t already here’s how to install Google Refine. Open Google Refine and create a new project from the Excel file we just created. The default setting should work just make sure you select the sheet with 19,000 plus rows.

After the project has been created next we want to get more information for each record identifier. From the ns1:identifier4 column drop-down menu select Edit column > Add column by fetching URLSs:

Google Refine - add data from column In the dialog box that opens use the following settings:

  • New column name – record
  • Throttle delay – 500
  • On error – set to blank
  • Language – GREL
  • Expression – "http://dspace.jorum.ac.uk/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier="+value

Google Refine - Add column by fetching URL

When you hit OK is Google Refine will use the row value to fetch even more data from Jorum and enter it into a cell. This is done using another entry point to OAI services using each identifier to get all the record data (here’s an example response). As this has to process over 19,000 requests it can take some time. If you would prefer not to wait here’s an export of my Refine project with the data already collected.

Google Refine - Returned data

So now we have all the information we need but it’s all in one cell, so we need to do a bit more refining.

Extracting a date

You’ll notice that each record has a couple of dates stored in dc:date. Lets look at extracting the first date we find. Google Refine has a couple of ways to parse a cell and get data out. Initially I tried using Jython but didn’t get very far, but thanks to some help from the Google Refine community found I could use Refine’s GREL language. Here’s how.

From the new ‘record’ column dropdown select Edit column > Add column > Add column based on this column.  In the dialog that opens set the new column name as first date and enter the following GREL expression:

forEach(value.parseHtml().select("dc|date"),v,if(length(v.htmlText())>11,v.htmlText().slice(0,10)+" "+v.htmlText().slice(11,19),"bad format")).sort()[0]

What is happening here is within the cell forEach <dc:date> If the result is length than 11 characters slice the text for the first 10 characters (yyyy-mm-dd) and a space then slice characters 11 to 19 (hh:mm:ss). As the dc:dates are temporarily stored in an array we sort this and get the first ([0]) value, which should be the smallest.

Next we want to turn the date, which is being stored as a string, into a UNIX timestamp (the number of seconds or milliseconds since midnight on January 1, 1970). We need a timestamp as this is the date/time format used by Gource.

To get this we want to add a column based on firstDate. In the Add column based on column firstDate enter the name timestamp and switch the language to Jython (I found this the best for this procedure) and the expression:

import time

return int(time.mktime(time.strptime(value, ‘%Y-%m-%d %H:%M:%S’)))

This takes the cell value and turns it into a Jython time object by matching the date/time pattern used in the firstDate column. As Jython times are stored as UNIX timestamps we can just return the value to the new cell.

Some basic timestamp validation

Google Refine Facet/FilterI obviously didn’t start up Refine drop the expression from above in and get to this point. There was a lot of trial and error, testing assumptions like all the dates are in yyyy-mm-ddTHH:MM:SSZ format, and checking the processed data.  For example, if we want to check we’ve got valid timestamps for all the rows from the timestamp column dropdown menu we can select Facet > Customized facet > Facet by blank. To filter the blank rows we have to click on include in the Facet/Filter menu on the left hand side (we can also conveniently see that 3616 rows are blank).

Initial visual inspection of the results show that the status column contains a lot of records marked deleted. From The status column dropdown we can create an addition Facet > Text Facet. In the Facet/Filter section we can see that there are 3616 occurrences of the text ‘delete’, so we can conclude that blank timestamps are because of deleted records, which we can live with.

Important tip: As we have filtered the data if we do any additional column operations it will only be applied to the filtered rows so before moving on remove these facets by click on the little ‘x’ next to them.

Next lets sort the timestamps to check they are in a reasonable range. Do this by clicking the dropdown on timestamp ad using the sort option, sorting the cells as numbers (check both ascending and descending order). You’ll notice some of the dates are in 2004, I’m happy with these as Jorum has been going for some time now.

Google Refine - Numeric Histogram[By turning on the numeric facet for the timestamp column we also get a little histogram which is handy for filtering rows].

Before moving on make sure timestamp is sorted smallest first

So we now have a timestamp next lets extract the resource title.

Extracting a resource title

This is relatively straight forward as each record has a. So from the record column drop down select Edit column > Add column > Add column based on this column. In the dialog box use GREL, name the new column ‘title’ and use the following expression

value.parseHtml().select("dc|title")[0].htmlText()

[Each record only has one <dc:title> so it’s safe to just return the first title we find]

Reconciling who ‘damn’ made these resources

The headache comes from resource creators filling in information about their submission including information about who made it. This means that there are inconsistencies with how the data is entered, some records using a separated creator for the institution name, others including it with their name, or omitting this data altogether.  For the visualisation I wanted to resolve the resource against an institutional name rather than an individual or department. Here’s how the data was reconciled.

Lets start by extracting all the recordsto let use see what we are dealing with. We can do this by again using Edit column > Add column > Add column based on this column from the ‘record’ column. This time lets call the new column ‘creators’ and use the following GREL expression:

forEach(value.parseHtml().select("dc|creator"),v,v.htmlText()).join(",")

This will forEachget the value and store as a comma separated string.

For the early records you’ll notice that it’s a named person and there is little we can do to reconcile the record against an institution. For the later records you’ll see named people and an institutional affiliation. So lets see if we can extract these institutions into their own column.

From the creators column dropdown add a column based on this one calling it inst_id and using the following GREL expression

if(contains(value.toLowercase(),"university"),filter(value.toLowercase().split(/[-,.;\(\)]|(\s+for+\s)+/),v,contains(v,"university"))[0].trim(),if(contains(value.toLowercase(),"centre"),value.toLowercase(),""))

What this expression is doing is if the value contains the word ‘university’ the string is split into an array using the symbols –,.;() or the word ‘for’ and the array value with ‘university’ is stored, else if the value contains the word centre this value is stored (the OER Programme has projects from Universities and HEA Subject Centres).

Some additional refining via faceted filters and edit cells

Google Refine - Blank facetTo let us refine this data further from the new inst_id column and click the dropdown menu and select Facet > Customized facets > Facet by blank. Click on true so that we are just working with the blank inst_ids.

Scrolling through the records we can see some records the a creator that begins with ‘UKOER,Open Educational Repository in Support of Computer Science’. On the creators column from the dropdown sect ‘Text filter’ and use ‘Open Educational Repository in Support of Computer Science’. With this facet in place we can see there are 669 records. As we are confident these files were submitted as part of the Information and Computer Sciences Subject Centre’s work we can autofill the inst_id column with this data by clicking the dropdown on the inst_id column and selecting Edit cells > Transform. In the expression box enter “Information and Computer Sciences Subject Centre” and click OK.

Google Refine - Cell transformation

Remove the ‘creators’ filter by clicking the small ‘x’ in the top left of the box.

Let add a new text filter to the records column (you should know how to do this by now) with the word ‘university’. This should filter 878 rows or so. To make it easier to see what it is matching press Ctrl+F to bring up you browser Find on page and look for university.

Moving through the data you’ll see things like:

  • 384 rows can have inst_id’s by using the cell transformation filter(cells["record"].value.parseHtml().select("dc|publisher"),v,contains(v.htmlText().toLowercase(),"university"))[0].htmlText()
  • 89 rows include the term “University of Plymouth” in the dc:description, we can filter and fill these using the subject centre method.
  • 81 rows can have university names pulled from dc:subject using filter(cells["record"].value.parseHtml().select("dc|subject"),v,contains(v.htmlText().toLowercase(),"university"))[0].htmlText()

At this point if we just use the blank inst_id facet we’ve got 10,262 true (ie blank inst_id’s) and 9199 false, so a 47% hit rate … not great! But if we add a ‘ukoer’ text filter to the records column this improves to 8433 inst_id’s in 9955 matching rows which is a 84% hit rate. Whilst this isn’t perfect it’s probably the best we can do with this data. Next to turn those institutional id guesses into reusable data.

The real magic reconciling institutional names against CETIS PROD

So far we’ve tried to extract an institutional origin from various parts of the Jorum data and there is a lot of variation in how those ids are represented. For example, the inst_id column might have ‘the university of nottingham’, ‘university of nottingham’ or even ‘nottingham university’. To make further analysis of the data easier we want to match these variations against a common identifier, in the example above the ‘University of Nottingham’.

Google Refine has some very powerful reconciliation tools to help us do it. More information on Google Refine Reconciliation here.

In the inst_id column select Reconcile > Start reconciling.

Google Refine - Reconciliation

Google Refine has existing Freebase databases, which we could use to match institutional names against database ids, but as we are dealing with JISC/HEA projects it makes more sense to try and reconcile the data against the CETIS PROD database (this opens up further analysis down the line).

Kasabi - Reconciliation urlFortunately PROD data is mirrored to Kasabi, which includes a Reconciliation API for use with Google Refine. To use this data you need to register with Kasabi and then subscribe to the PROD data by visiting this page and clicking ‘Subscribe’. Once subscribed if you revisit the previous link and then click on the link to the ‘experimental API explorer’ and copy the url in the blue book including your apikey e.g. http://api.kasabi.com/dataset/jisc-cetis-project-directory/apis/reconciliation/search?apikey=aaaaaaaaaaaaakkkkkkkkkkeeeeeyyy

Back in the Google Refine Reconciliation dialog box click on ‘Add Standard Service …’ and enter the url you just created. Once added click on the new Reconciliation API and select ‘Reconcile against no particular type, then Start Reconciling’.

Google Refine - Using Kasabi data

Google Refine - edit cellOnce complete you should hopefully see from the inst_id judgment facet that the majority of names (all but 131) have been matched to PROD data. Filtering on the ‘none’ you can do mass edits on unmatched inst_ids by clicking the ‘edit’ and ‘Apply to All Identical Cells’. Once you’ve done this you can re-run Reconcile > Start reconciling to get additional matches.

Exporting to Gource using a custom template

Almost there people ;). At the very beginning I mentioned that the visualisation tool Gource has it’s own input log formats, shown below as a refresher:

1275543595|andrew|A|src/main.cpp

Another useful feature of Google Refine is Export Templating, which allows us to control how our data can be written out to a separate file.

In Google Refine make sure you have a text facet on the record column filtering for ‘ukoer’ and inst_id: judgement is on ‘matched’ (this means when we export it just include this data). Now select Export > Templating …. Remove any text in Prefix, Row Separator and Suffix and in Row Template use:

{{cells["timestamp"].value}}|{{cells["inst_id"].recon.match.name}}|A|{{replace(cells["subject"].value," / ","/")}}/{{if(length(cells["title"].value)>20,cells["title"].value.slice(0,20)+"...",cells["title"].value)}}

This will write the timestamp cell value, then the reconciled name for the inst_id, then the subject value (stripping whitespace between slashes) and the resource title stripped down to 20 characters.

Finally, Gource

Google Refine will spit out a .txt file with the formatted data. Before we use it with Gource there is one thing we need to do. Initially I was getting log file format errors in Gource and then discovered it was a .txt file encoding problem. So open your newly created .txt file (which is in UTF-8 format) and File > Save As changing the encoding to ANSI.

Save As ANSI encoding

To test you visualisation download gource and extract the files. In the same directory as your extracted files place a copy of your refined log file. To view what you’ve got open your command line, navigate to your extracted gource location and executing:

gource nameoflogfile.txt

The gource site has more instructions on recording videos.

- THE END -

well almost … Here’s:

The bigger picture

This work was undertaken as part of my OER Visualisation work (day 11) and while it’s useful to have the Jorum OER snowflake visualisation in the bag, having a refined data source opens up more opportunities to explore and present OER activity in other ways. For example, I immediate have a decent sized dataset of OER records with subject classification. I’ve also matched records against PROD data which means I can further reconcile against project names, locations etc.

Yummy data!

OER Visualisation Project: Beginnings of linking data from PROD to Google Spreadsheet and early fruit [day 8] #ukoer

One of my ambitions from Day 1 of the OER Visualisation project was to start linking PROD data in to Google Spreadsheets. Whilst this was primarily designed to help me with the project after speaking to some of the JISC/JISC CETIS people it sounds like it would help them and others.  

Here’s a spreadsheet which is the beginnings of a PROD Datastore (Warning: work in progress). Data is currently being populated from a Talis Datastore using a number of different sparql queries which are outputting csv data via a sparqlproxy (this will be switched to the Kasabi store when a suitable proxy has been sorted). You might find the SPARQL queries used to fetch this data useful for making your own so I’ve compiled them in this commentable document.  

What can we do with the spreadsheet? 

Creating pivot table reports

I used to find pivot table creation quite daunting but they are a great way to filter and analyse large sets of data. The PROD spreadsheet contains 2 example pivot reports one for technology and the other for standards (if you want to enable pivot table options you’ll need to File > Make a copy of the spreadsheet then on the pivot sheets select Data > Pivot Table Report).

The example ‘Technology Pivot’ is summarising the data from phase 1&2 of the OER Programme. You can see there is a number of technologies were recorded (over 100), the top three being YouTube, Flash and Slideshare. This data can be shown graphically using Google Spreadsheets chart tools and embedded as an interactive or static graphic.

The charts in Google Spreadsheets aren’t that exciting but there is a framework for extending these, which I’ll come back to later. Something I was hoping to do was link the data from Google Spreadsheet to IBM’s to more powerful visualisation service Many Eyes. For example below are examples of the technology pivot data as a bubble diagram and comparison Treemap and it would have been nice to automatically generate these. Tony had posted on this a couple of years ago using Many Eyes Wikified and Google Spreadsheets, but alas this part of the service was pulled last year) .

UKOER Technologies Phase 1&2 (Bubble chart) Many EyesUKOER Technologies Phase 1&2 (Treemap Comparison) Many Eyes

Visualising project relationships

One of the great things about the PROD data is there is a lot of relationship data already there. For example if you look at the PROD page for the ENABLE project you can see there are details of the projects that ENABLE builds on or was built on by, related projects and even comments that relate to the individual relationships.

This relationship data can all be extracted from PROD and in this case imported to the Spreadsheet. On the Relates_to sheet I’ve imported details of all the JISC funded project ‘relates_to’ relationships. What can we do with this data? Well at a basic level in column B we have a source id and column F has a target id which makes it suitable for using in a force layout diagram. Fortunately I’ve been playing around with online force layout diagrams for a while and most recently created a Google Spreadsheet Gadget to display this info (this is how you can extend the basic chart selection).

Whilst this gadget still needs to be verified by Google for anyone to see the results we can use the spreadsheet as a datasource for the gadget’s big brother EDGESExplorer. Publishing the spreadhseet to the web, using the built in tools to do this, we can reformat the data in EDGESExplorer to see how all JISC funded projects stored in PROD are related (click on the image below for the interactive version, you can explorer individual nodes by clicking on them). 

JISC Project Relationship

I think this graph provides a useful interface for seeing and exploring the relationship between JISC funded work. To become really useful some additional interfacing is required but there’s code I can reuse from my Guardian Tag Explorer and Twitter Conversation explorer tools, which gives you more project info and a link back to the appropriate PROD page (btw interesting post on ouseful.info on Information Literacy, Graphs, Hierarchies and the Structure of Information Networks).

Summary

So to recap: a Google Spreadsheet is being populated from PROD (live data). Users can create reports and charts within the Spreadsheet environment (live data) or export data to other services like Many Eyes (dead – as in the live link is broken – data). Finally we can publish live data from the Spreadsheet for use in other tools like EDGESExplorer.

My question for you is what data would you like in the Spreadsheet? Summary of projects by institution? Breakdown of projects by partners? Projects by JISC Programme Managers? Let me know what you would like to see ;)

Bye-bye Protovis Twitter Community Visualizer: Hello D3 Twitter Community Visualizer with EDGESExplorer Gadget

The problem with starting my new OER Visualisation job is I’ve got less time to explorer my own little projects. This may be a blessing for some of my readers as it means I can do less waffling. So in brief:

Tony came up with this brilliant solution for plotting twitter hashtag community graphs (it was brilliant because it used Google Social Graph API to get friend/follower relationships, no authentication required). Then I came along and dumped his code into Google Spreadsheet and wrapped the output in an embeddable gadget. Google then launched Google Plus which appears to coincide with a deterioration in friend/follower information in Social Graph (one presumes my social network is better than yours type posturing).

So now here’s a new version of my spreadsheet which plays nice with Twitter and uses API authentication to get friend follower information:

*** D3 Twitter Community Visualizer ***

The big news probably isn’t the spreadsheet (yawn), but how the data,shown below if your RSS reader hasn’t stripped it, is rendered. It was achieved by wrapping a version of my EDGESExplorer tool in a Google Gadget. This means if you’ve got any Google Spreadsheet with two columns of edge data (target/source) you can insert this gadget, display the data and embed elsewhere (if your host allows <script> inserts).

To use the gadget in other Google Spreadsheets select Insert > Gadget > Custom then enter the url http://hosting.gmodules.com/ig/gadgets/file/108150762089462716664/edgeexplorer.xml

One thing to bear in mind in the gadget is unverified. The main impact of this is if you ‘Publish to the web’ or ‘File > Share’ viewers won’t see anything (but as can seen above publish embed works). I’ve tried to get the gadget verified but the send it to us link is broken (I’ve logged it as an issue so hopefully it will be fixed soon).

I have a job to go to in the morning so I’m going to let you ask/tell me it doesn’t work using the comments ;)

Free the tweets! Export TwapperKeeper archives using Google Spreadsheet

Dec 8th, 2011: Transition update

Twapper Keeper’s archiving is now available in HootSuite! As a result, we will be shutting down Twapper Keeper. Existing archives will be kept running until Jan 6, 2012, after which you will not be able to access your archives anymore.

Thanks for using TwapperKeeper – we look forward to seeing you at HootSuite.

Noooo I hear you cry. I know a lot of academic staff use did use the Twitter archiving service TwapperKeeper to archive tweets from events and conferences. Whilst often these archives are usually only used for quick post event summaries I think there is a lot more value to be gained from these resources. They are rich records of ideas and resources, shared within the instant but too often forgotten.

In projects iTitle and TAGSExplorer I’ve been exploring how these archives can be brought back to life so it is a great shame to see that TwapperKeeper will soon be no more. As staff consider new alternatives to keep archives of their events (you might want to consider my Google Spreadsheet solution, it’s free and your to control!) here’s a way to export your existing TwapperKeeper archives.

Inspired by LIBREAS.Library Grab your TwapperKeeper Archive before Shutdown! I’ve developed a Google Spreadsheet to export TwapperKeeper Archives. Here’s how:

  1. Open this Google Spreadsheet and click File > Make a copy (if this is greyed out you need to be logged in to Google first)
  2. Enter the archive name, type and the number of results you want to get in cells B9 to B11
  3. Click  the button ‘Get the archive!’

[Note: this solution works for archives of a 15,000 tweets or less]

If you need more archives make more copies of the spreadsheet.

Please share so that we can free the tweet!

Update: Having read Brian Kelly’s post on the Responding to the Forthcoming Demise of TwapperKeeper it occurred to me that some coordination might be required. So if you use this template could you publish a copy (File > Share > Anyone who has the link can view and File > Publish to the web) of the spreadsheet and leave a note of where it is using this form which is embedded below (responses can be seen here):

Responses:

OER Visualisation Project: Adding a bit of OAI-PMH collecting activity data in Google Spreadsheets [day 5] #ukoer #ooher

I told you I wasn’t intending on posting every day ;) Yesterday ended up being quite frustrating with a number of dead-ends. It all started following on from Processing a resource feed to find frequency using Google Spreadsheets where I took Leeds Metropolitan University’s Repository feed from Yahoo Pipes to get activity data into a spreadsheet (BTW Nick Sheppard has documented how the Pipe was made here).

For the next step I wanted to do something similar with the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). You can read more about OAI-PMH here but basically its a method for repositories to share information about the resources they’ve got. For example, you can visit the Social Policy and Social Work (SWAP) Subject Centre: SWAPBox repository and see what they’ve got. They’ve got a nice interface for browsing and searching for resources which is great for humans but a struggle for machines. Instead the machines have their own interface using the SWAPBox OAI service (even though this service is designed for machines it’s been beautified with Chris Gutteridge’s OAI to XHTML XSLT). Here you can get access to individual record details or the thing I was more interested in a list of record ids and date stamps.

I had a look around/asked for a desktop OAI Harvester that could export data from. The closest I could get was the suggestion by James Toon to use Google Refine but OAI results usually have pagination and I couldn’t find a way to automatically get the next page of results. I also tried putting PKP’s Open Harvester System on a USB drive by running it on an instance of MoWeS portable webserver (now there’s an idea UKOER on a stick, could be a great conference bag goodie?).  Unfortunately I discovered there wasn’t a way to easily export harvested records (although I may be able to dump from MySQL).

Instead I reverted back to Google Apps Script (something I was avoiding as parsing XML in Apps Script is never fun). Here is a script I wrote to read OAI service ListIdentifiers and write to a spreadsheet. Using this with the SWAPBox repository we can get a distribution of submissions (the spreadsheet for this is here)

[As Leeds Met Unicycle repository also has an OAI service I thought I’d collect data from there to compare the RSS feed but I was getting timeouts]

PS Whilst looking for OAI export solutions  I came across Ghent University Library Download/APIs page which includes: daily/weekly exports of the complete datasets of the Ghent University Academic Bibliography in various formats; every report record can be accessed by HTML, RDF, Dublin Core, METS, MODS,  MPEG-21/DIDL; an Author export; department exports; classification type exports; and OAI-PMH; and other stuff.  All this lovely data, it just makes you want and go and do something with it! It’s a shame to have all these wonderful repositories of open resources but hidden away. Here are some tips from the JISC Digital Infrastructure Team on Making OER visible and findable.

PPS Here are more real world examples of Open Bibliographic Data  H/T @ostephens

OER Visualisation Project: Processing a resource feed to find frequency using Google Spreadsheets [day 3] #ukoer #ooher

I’m really not intending on posting each  day … honest (although some people would like that ;), but here’s another quick hack for getting a feed post frequency using Google Spreadsheets.

Looking down Phil Barker’s UKOER pilot phase sources list (and getting over the shock of limited full repository feeds <sigh>) I noticed that Leeds Metropolitan University have a full feed(?) available via Yahoo Pipes <yummy>. The feed contains 341 items and I was interested in the frequency of item submission. My initial thought was to pull the feed into Google Spreadsheet using the ImportFeed function  (e.g. =importFeed("http://pipes.yahoo.com/pipes/pipe.run?_id=286bbb1d8d30f65b54173b3b752fa4d9&_render=rss"; "items"; true;100))  but it was limited to 20 results (looking at the raw feed I saw it had 100 items – I’m sure there must be a switch in Pipes to get more).

imageInstead I switched to getting the Pipe as JSON. JSON input powers a lot of my other Google Spreadsheet toys like TAGS and Google+ network generator so I’ve a growing bank of code snippets I can throw together.

Here’s my ‘Frankenstein’ code to read a JSON source and write to a sheet. Two little tricks in here are to recast the pubDate as a date so that it’s interpreted as such in the spreadsheet and writing the JSON objects to a sheet using normalised column headers to identify object elements (I borrowed this technique from the writing data Apps Script tutorial).

Once the data is in the spreadsheet there are some standard spreadsheety things you can do like using the Frequency function to count occurrences of a dates and generate charts like this one:

or something less conventional like using the data in a visual timeline gadget to produce:

image

View interactive version of Leeds Metropolitan University Unicycle feed (this link lets you file > make a copy of the spreadsheet for your own use)

And as this data was grabbed using Google Apps Script if I had a list of JSON sources for UKOER repositories I could automate the whole process …

About

This blog is authored by Martin Hawksey+

Twitter Friendviz

The visualisation below is the network of people I've mentioned or chatted with on Twitter recently (refreshed daily) and how we are connected. You can use your mouse to pan and scroll.

I made this

The MASHezine (tabloid)

It's back! A tabloid edition of the latest posts in PDF format (complete with QR Codes). Click here to view the MASHezine

Preview powered by:
Bluga.net Webthumb

The MASHebook

You can also download this post as:

Subscribe to monthly email digest of posts

Loading...Loading...


Subscribe to per post email updates

Enter your email address:

Delivered by FeedBurner

Recent tweets

Creative Commons Licence
Unless otherwise stated this work is licensed under a Creative Commons Attribution 2.5 UK: Scotland License