Text analysis was the topic of our latest DITA lecture, and more specifically some of the tools available to conduct it. This post aims to answer the questions:
- Which tools did I find the most interesting?
- What did they help me find and how?
- Why did I think what I found was interesting?
The tools in question were Wordle, Many Eyes and Voyant, and I tested them out using datasets collected during previous sessions on TAGS and Altmetric.
Wordle, or ‘toy’ as the site itself defines it, allows you to produce wordclouds for free from a body of text. I used a TAGS dataset showing journal articles around the theme of prisoners to try out this tool, and the result was the following…
This image highlights that the most numerous words within the journal titles are ‘Prisoner’ and ‘Prison’ – a fairly obvious and not particularly useful conclusion. It may have been more interesting had I been given the option to amend the stopword list to remove these words, decluttering the cloud of terms that were entirely predictable. As such, I can’t really see the value in using a wordcloud other than as something nice to look at; there is definitely something very aesthetically pleasing about seeing lots of words all jumbled together. Unfortunately, however, no significant semantic content is really being portrayed, and I imagine there only would be if the wordcloud had shown that the most oft-used words varied considerably from what I’d expected.
Unfortunately, the work currently underway on the Many Eyes site made it very difficult to use to any effect. I also had a bit of trouble making the creation tool understand how to use the body of text. The options for producing infographics from the corpus were ‘ranked from most to least relevant’, but the only tool which was able to produce anything properly was the wordcloud, which sat in the middle of the list and was pretty ugly to be honest (although you do have the option to choose from a list of fonts).
So, once again I was left with a largely useless wordcloud and no desire to experiment further with the tools before the site was in a more usable state.
Voyant is, as our class notes explain, “the most interesting and promising” of the three, which appears to be the resounding conclusion after reading my DITA colleagues’ own posts on the topic. The dashboard-style page offers more than just the wordcloud, using graphs and visual tools to display the quantitative interpretation of the text in a far more interesting way than the other two tools.
Voyant allows you to edit the stopword list that informs the wordcloud, which is far more practical for the reasons outlined earlier. In the image above I have applied a stopword list which also excludes the words “prison” and “prisoner”; I think this does help the other words stand out a bit more, although to what purpose is still a bit ambiguous. The centre column shows the text in its entirety, and as you engage with the other tools available on the dashboard the corpus is involved too, showing you, for example, where certain words appear, which makes everything feel a lot more engaging and dynamic.
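The effect of an editable stopword list is simple enough to sketch in code. Here is a minimal Python illustration of the idea; the titles and stopwords are invented for the example, not taken from my actual dataset:

```python
from collections import Counter

# Hypothetical journal titles (illustrative only, not my real TAGS data)
titles = [
    "Prisoner health and wellbeing in prison",
    "Mental health services for prisoners",
    "Prison education and prisoner rehabilitation",
]

# A custom stopword list, extended with the over-frequent domain terms
# that would otherwise dominate the cloud
stopwords = {"and", "in", "for", "the", "prison", "prisoner", "prisoners"}

words = [w for title in titles for w in title.lower().split()]
counts = Counter(w for w in words if w not in stopwords)

# These remaining counts are what a stopword-aware wordcloud sizes by
print(counts.most_common(3))
```

With “prison” and “prisoner” filtered out, a word like “health” rises to the top, which is exactly the decluttering effect I was hoping for from Wordle.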
The ‘keywords in context’ tool is interesting in that it gives you a better idea of the semantics of the text by aiding an understanding of the context in which a word appears. For example, I searched for the keyword ‘health’ and was able to see quickly that the majority of uses of the word were in reference to mental health as opposed to health in general.
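Keywords-in-context (KWIC) is a classic concordancing technique, and a bare-bones version is short to write. This Python sketch shows the general idea; the sample text is made up for illustration:

```python
def kwic(text, keyword, window=3):
    """Return each occurrence of keyword with `window` words either side."""
    tokens = text.lower().split()
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append(f"{left} [{tok}] {right}")
    return hits

# Invented sample text, not from my dataset
text = "research on mental health outcomes suggests health policy matters"
for line in kwic(text, "health"):
    print(line)
```

Scanning the left-hand context of each hit is how you would spot, as I did in Voyant, that ‘health’ is usually preceded by ‘mental’.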
The ‘word trends’ option (the line graph above) allows you to see at what point in the corpus a word is used. This is also an interesting touch, and I imagine it could be useful if you are trying to pull out themes within a piece of text without having to read the whole thing.
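Underneath, a trend line like Voyant’s is just a word count taken over equal slices of the corpus. A rough Python sketch of that idea, with an invented miniature corpus:

```python
def word_trend(text, word, segments=4):
    """Count occurrences of `word` in each of `segments` equal slices."""
    tokens = text.lower().split()
    size = max(1, len(tokens) // segments)
    return [
        tokens[i:i + size].count(word)
        for i in range(0, len(tokens), size)
    ]

# Invented toy corpus (illustrative only)
corpus = "prison reform health prison health health reform prison"
print(word_trend(corpus, "health", segments=4))
```

Plotting those per-segment counts as a line is essentially what the ‘word trends’ graph does, and a spike in one segment tells you where in the document a theme concentrates.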
My main preoccupation when looking at these tools has been how they might be useful to a librarian, and at this stage I can’t really come up with much of an answer. Voyant is definitely the most interesting to use, but I can’t imagine a scenario in which it would really help me to fulfil an enquiry or aid me in my work, though I hope to find instances where it does.