Assessing text analysis tools

Text analysis, and more specifically some of the tools available to conduct it, was the topic of our latest DITA lecture. This post aims to answer the questions:

  • Which tools did I find the most interesting?
  • What did they help me find and how?
  • Why did I think what I found was interesting?

The tools in question were Wordle, Many Eyes and Voyant, and they were tested out using datasets collected from previous sessions on TAGS and Altmetric.


Wordle

This tool, or ‘toy’ as the site defines it, allows you to produce word clouds for free from a body of text. I used a TAGS dataset showing journal articles around the theme of prisoners to try out this tool, and the result was the following:

wordle

This image highlights that the most frequent words within the journal titles are ‘Prisoner’ and ‘Prison’, a fairly obvious and not particularly useful conclusion. It may have been more interesting had I been given the option to amend the stopword list to remove these words and declutter the cloud of terms which were entirely predictable. As such I can’t really see the value in using a word cloud other than as something nice to look at – there is definitely something very aesthetically pleasing about seeing lots of words all jumbled together. Unfortunately, however, no significant semantic content is really being portrayed, and I imagine there only would be if the word cloud had shown that the most often used words varied considerably from what I’d expected.
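The kind of decluttering I had in mind can be sketched in a few lines of Python. The stopword list and sample titles below are invented for illustration; the resulting counts are what a word cloud tool would size the words by:

```python
from collections import Counter

# A basic stopword list, extended with the domain terms I would have
# liked to exclude ("prison", "prisoner" and their plurals).
stopwords = {"the", "a", "of", "in", "and", "on", "for",
             "prison", "prisons", "prisoner", "prisoners"}

# Invented journal titles standing in for the TAGS dataset.
titles = [
    "Prisoner health in UK prisons",
    "Education programmes for prisoners",
    "Mental health of the prison population",
]

words = [w.lower().strip(".,") for title in titles for w in title.split()]
counts = Counter(w for w in words if w not in stopwords)

# The remaining counts are what a word cloud would scale each word by.
print(counts.most_common(3))
```

With the predictable terms filtered out, ‘health’ would dominate the cloud instead.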

Many Eyes

Unfortunately the work currently underway on this site made it very difficult to use to any effect. I also had some trouble making the creation tool understand how to use the body of text. The options for producing infographics from the corpus were ‘ranked from most to least relevant’, but the only tool which was able to produce anything properly was the word cloud, which sat in the middle of the list and was pretty ugly to be honest (although you do have the option to choose from a list of fonts).

many eyes

So, once again I was left with a largely useless word cloud and no desire to experiment further with the tools before the site was in a more usable state.


Voyant

This tool is, as our class notes explain, “the most interesting and promising”, which appears to be the resounding conclusion after reading my DITA colleagues’ own posts on the topic. The dashboard-style page offers more than just the word cloud, using graphs and visual tools to display the quantitative interpretation of the text in a far more interesting way than the other two tools.


Voyant allows you to edit the stopword list that informs the word cloud, which is a lot more practical for the reasons outlined earlier. In the image above I have applied a stopword list which also excludes the words “prison” and “prisoner”, which I think does help the other words stand out a bit more, although to what purpose is still a bit ambiguous. The centre column shows the text in its entirety, and as you engage with the other tools on the dashboard the corpus responds, showing you, for example, where certain words appear, which makes everything feel a lot more engaging and dynamic.

The ‘keywords in context’ tool is interesting in that it gives you a better idea of the semantics of the text by helping you understand the context in which a word appears. For example, I searched for the keyword ‘health’ and could quickly see that the majority of uses of the word were in reference to mental health as opposed to health in general.
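A keywords-in-context view is straightforward to sketch: for each occurrence of the keyword, show a window of surrounding words. This is a minimal version (the sample sentence is invented, not taken from the dataset):

```python
def kwic(text, keyword, window=3):
    """Return each occurrence of keyword with `window` words of context either side."""
    words = text.lower().split()
    lines = []
    for i, w in enumerate(words):
        if w == keyword:
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            lines.append(f"{left} [{w}] {right}")
    return lines

sample = ("Improving mental health provision is vital because "
          "mental health outcomes in prisons lag behind")
for line in kwic(sample, "health"):
    print(line)
```

Even this toy version makes it obvious when a keyword keeps appearing next to the same neighbour, as ‘health’ did with ‘mental’.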

The ‘word trends’ option (the line graph above) allows you to see at what point in the corpus a word is used. This is also an interesting touch, and I imagine it could be useful if you are trying to pull out themes within a piece of text without having to read the whole thing.
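The trend line can be approximated by splitting the corpus into equal segments and counting the keyword in each — roughly what Voyant plots. The segment count and the toy corpus here are illustrative:

```python
def word_trend(text, keyword, segments=5):
    """Count occurrences of keyword in each of `segments` equal slices of the text."""
    words = text.lower().split()
    size = max(1, len(words) // segments)
    return [words[i * size:(i + 1) * size].count(keyword)
            for i in range(segments)]

# A contrived corpus where 'health' clusters at the start and middle.
corpus = "health " * 4 + "prison " * 6 + "health " * 2 + "reform " * 8
print(word_trend(corpus, "health"))  # one count per fifth of the text
```

Plotting that list as a line gives the same at-a-glance sense of where a theme surfaces without reading the whole document.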

My main preoccupation when looking at these tools has been how they can be useful to a librarian, and at this stage I can’t really come up with much of an answer. Voyant is definitely the most interesting to use, but I can’t imagine a scenario in which it would really help me to fulfil an enquiry or aid me in my work, though I hope to find instances where it does.


Altmetric Exploration

Last week DITA introduced us to the burgeoning area of alternative metrics and more specifically, the platform ‘Altmetric’, which offers academics an analytic tool to measure the “social impact” of the work they produce. The idea behind these metrics is to supplement the more traditional means of assessment of counting the number of citations with insight into how much attention each article is receiving online.

During our lab session, we began to experiment with the Altmetric Explorer and to familiarise ourselves with the distinctive ‘Altmetric donut’, which shows the Altmetric ‘score’ and its breakdown of sources where the article has been referenced, whether in tweets, blogs, news sources or any of the other sources in Altmetric’s vast database. Many of my DITA colleagues have written really good introductions to altmetrics which are definitely worth a look for a good grounding in the subject.

Carrying out searches and looking through the data on the site, it’s definitely interesting to see how an article reaches audiences. As an academic, it’s easy to see how this could prove useful in tracking and widening the dissemination of your work. Whether or not that work is well received, or even read, is where the altmetric idea falls short – a limitation overtly recognised by the creators themselves, as it’s not one which is readily solvable. On the site they are quick to highlight that,

“Altmetric measures attention, not quality. People pay attention to papers for all sorts of reasons, not all of them positive.”

The platform does make considerable effort to mitigate this by allocating more points in each Altmetric score to those sources which are of higher ‘quality’ themselves, thereby hinting at the quality of the work itself. It’s not a fool-proof plan, but it is very hard to see what would be.
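The principle is a weighted sum: each mention type contributes differently to the score, with ‘higher quality’ sources counting for more. The weights below are purely illustrative and are not Altmetric’s actual values:

```python
# Hypothetical weights: NOT Altmetric's real values, just an illustration
# of how 'higher quality' sources can contribute more to a score.
WEIGHTS = {"news": 8, "blog": 5, "tweet": 1, "facebook": 0.25}

def attention_score(mentions):
    """Weighted sum of mention counts, keyed by source type."""
    return sum(WEIGHTS.get(source, 0) * count
               for source, count in mentions.items())

print(attention_score({"news": 2, "blog": 3, "tweet": 40}))
```

Under this scheme two news stories outweigh sixteen tweets, which is exactly the kind of quality-weighting the paragraph above describes.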

On beginning to write this post, I took a second look at Altmetric and decided to run a search on alternative metrics themselves, to see if much has been written about them in the academic field and, more particularly, about how they measure ‘impact’.

Using the search toolbar, I selected the keywords “altmetrics AND impact”, which I thought might return a broad range of articles, including any commenting on the effectiveness of altmetrics in measuring impact.

alt metric search

The result was a collection of sixty articles, which I exported as a CSV file and opened in Excel to better view the details of each entry. Notably, glancing through the journals column, the majority of sources are scientific journals, or at least journals relating primarily to health and the sciences, with just a few from information science or humanities-specific publications. This seems in tune with the critique that altmetrics have largely been adopted by the science community more than any other. It’s also quick to see that the majority of attention paid to the articles came as tweets or as items viewed via Mendeley.
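That sort of glancing-through can be automated with a short script over the export. The column names here are guesses at what the CSV might contain, not the actual Altmetric headers, and the rows are invented stand-ins:

```python
import csv
import io
from collections import Counter

# A tiny stand-in for the exported file; column names are assumptions
# about the export format, and the rows are invented for illustration.
export = io.StringIO(
    "Journal,Tweets\n"
    "Journal of Medical Internet Research,1337\n"
    "PLOS ONE,25\n"
    "Journal of Medical Internet Research,12\n"
)

journals = Counter()
total_tweets = 0
for row in csv.DictReader(export):
    journals[row["Journal"]] += 1
    total_tweets += int(row["Tweets"])

print(journals.most_common(1))
print("Total tweets:", total_tweets)
```

Swapping `io.StringIO(...)` for `open("export.csv", newline="")` would run the same tally over a real export, making the journal-by-journal pattern visible without scrolling through Excel.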


The first ‘hit’ on the list of data, fortunately, is the most relevant to my interests, and looking at how its Altmetric score has been calculated is an interesting way to see how others have engaged with it. The number of tweets vastly outnumbers any of the other mentions it gains, with a total of 1337. Looking at the number of related blog posts and Mendeley readers, my first thought is how many of these are coming from the same person. One individual may have tweeted the article, read it on Mendeley and posted a blog referencing it, which instantly reduces the scope for measuring how many individuals have engaged with the material (although of course these activities may have led to others reading the article). Plus, as mentioned above, it also fails to show whether these references were positive or negative. All of this attention may have been a result of readers considering the content to be poorly reasoned or inaccurate.


I decided to take a closer look at these instances of attention to see whether the numbers could be a good indication that people are really engaging with the article, and whether they were positive, negative or neutral. Many of the references are just neutral mentions that the article exists; it’s surprisingly difficult to find material which is actually commenting on and critiquing the work. The best source in this respect appears to be blogging: the select few of the 30 highlighted were more likely to comment on the article and give a perspective on it. In this sense, I think seeing the number of citations made to the article in other academic outputs alongside these ‘alternative’ sources would be useful, because at least a citation shows the article has been read and considered.

It is all the more important then, I think, to remember that altmetrics are a complementary form of measuring impact and should ideally be viewed in combination with other metric tools. Considering the lapse in time between publication and citation of an article, however, the benefit of at least being able to see that the word is out on your work is considerable – but one can only hope and wait that others have picked up the signposts and taken the time to engage.

Interestingly the conclusion of this article is that tweets (or “tweetations”) can correlate positively with citations, though it does stress what I’ve covered here, that “social impact” is very different from knowledge uptake in academia. Even so, the authors remain optimistic for the future of altmetrics.

So on that note, here is a link to the article itself for anyone interested and hopefully an extra blog post mention may take it a small step towards further notice in the wider academic communities…

Can Tweets predict citations?



As a brief aside, while writing this entry I stumbled upon a blog post about librarians and altmetrics, which may be useful to some of you: 4 Things Every Librarian Should Do With Altmetrics

If it’s boring, it’s important

bored statue

Hurrah for reading week!!

I’m sure I share the sentiment with my fellow #citylis students, and here’s hoping we’ll all get our heads down and thinking caps on to help digest the last five weeks of feasting on library science.

Within this time it’s DITA which I have found the most challenging; having come to it with no experience of digital technologies other than that of a user, the concepts have been a bit tricky to get my head around. In spite of this, or maybe because of it, I have also found it the most rewarding. I’m beginning, slowly but surely, to get an ‘in’ on this very strange and alien language of computers, which increasingly surrounds me and is becoming harder to side-step. Ernesto tweeted, after his showing of The Internet’s Own Boy, a quote from Aaron Swartz which had particularly resonated with him:

It’s easy to understand why, and I think his timing couldn’t have been better. After just five weeks of participating in his module, I know I am not the only person on the course who has come to realise how little they knew about a subject which is so fundamentally entwined with our social and cultural identity. It’s almost embarrassing to think that, as somebody who uses the internet every day of her life, I have never felt motivated enough to invest time and energy in beginning to gain a sound understanding of how it works. I think of it in a similar vein to taxes. I don’t really understand how the tax system works; it’s complicated (no doubt intentionally) with acronyms and jargon and formulas for calculations that make my eyes bleed. Plus it’s incredibly dull. So, I’ve largely left it in the hands of the powers that be and trusted that they’re getting it right, which can only be deemed a reckless abandonment of the responsibility I have over my own finances.

The internet is the same. It’s only on beginning to scratch the surface of comprehension that you realise that what you see on a screen is only the tip of the iceberg, and every tweet, Facebook update and shop item clicked does something in the world wide web, like a pebble thrown into water. The consequences, when brought to light, can be pretty eye-watering. Fortunately, there is this motivating factor when being introduced to the building blocks of the web, as we have been during the first half of term, becoming familiar with the likes of HTML, XML and JSON. A motivating factor which I’m finding essential when attempting to grapple with – yes, I hate to say it – boring material. I’ve never been attracted to anything remotely mathematics-based and have always favoured words over numbers or patterns, so this new venture into coding does feel more effortful than most.

Luckily my approach to subjects in which I have had little interest has definitely changed, as I’ve increasingly come to realise that boring = important. If I’m struggling with an idea and being ever more strongly tempted to procrastinate, I feel all the more the seriousness of the task at hand. Ten months of working in a corporate law firm has a lot to do with it, having had, out of necessity, to learn about the financial world and all the mind-numbing details entailed in insolvency, private equity, capital markets and hedge funds, to offer a taster (don’t chew, just hold your nose and swallow). With every brain-melting article I read came a stronger realisation of how shocking it was that I didn’t already know this stuff which dictated how the economy worked and how I was a player in it.

In the same way, understanding digital technology shouldn’t be seen as pursuing an interest, but rather as carrying out a duty of self-education. The importance is too great and the risks of not doing so too high, and as librarians, making people aware of that duty is an important part of our role.

So despite my lack of excitement about databases and mark-up language and the like, I know I won’t waste this reading week. With the help of coffee on tap and a very uncomfortable chair I am quite ready to let myself be bored over the next couple of days, because I know the pay-off will be too important to be missed.