Our 8th DITA lecture covered data mining and included our first guest speaker, Ulrich Tiedau, who works on a digital humanities research project on reference cultures that uses data mining to carry out its research. It was a useful and interesting insight into how data mining is being used by academics and librarians for research purposes.
The aims of our lab session were to explore further and compare the Old Bailey Online resource with one of the data mining research projects hosted by Utrecht University, of which Ulrich’s was one.
We’d previously used Old Bailey Online when learning about APIs, so I was already familiar with the way it works, and fortunately it is a user-friendly site. It’s an incredibly interesting one at that, and too much of a temptation for avid procrastinators like myself. To start off I tried a few searches to refamiliarise myself with the search function. It very helpfully has dropdown menus with suggestions for the categories of offence, verdict and sentence. I opted to search by offence, selecting ‘concealment of a birth’, which produced 543 results. In order to export this data I had to recreate the search via the API demonstrator on another part of the site, where the search function is slightly different; for example, it offers the additional option of choosing the gender of the victim. I carried out the same search but also filtered the results to include only guilty verdicts, which gave 365 hits.
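For anyone curious what recreating such a search programmatically might look like, here is a minimal sketch. The endpoint URL and parameter names below are illustrative assumptions for the sake of example, not the Old Bailey API demonstrator’s actual documented interface:

```python
from urllib.parse import urlencode

# Placeholder endpoint: NOT the real Old Bailey API URL.
BASE = "https://www.example.org/obapi/ob"

def build_query(offence, verdict=None, count=10):
    """Build a hypothetical search URL filtering by offence
    category and, optionally, by verdict category."""
    params = {"term0": f"offcat_{offence}", "count": count}
    if verdict:
        params["term1"] = f"vercat_{verdict}"
    return BASE + "?" + urlencode(params)

# e.g. the 'concealment of a birth' search, guilty verdicts only
url = build_query("concealmentOfBirth", verdict="guilty")
```

The point is simply that each dropdown choice in the demonstrator corresponds to a filter term in the query string, which is what makes the search reproducible and exportable.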
The API demonstrator links to the data mining tool Voyant (as well as Zotero), which we explored in the lecture the week before, allowing you to transfer the data easily at the click of a button and making the process far less labour-intensive. It also includes a “More Like This” function that allows you to build new searches based on a Term Frequency – Inverse Document Frequency (TF-IDF) methodology, which further assists in any text analysis you may wish to carry out using the site’s content.
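To make the TF-IDF idea concrete: a term scores highly in a document when it is frequent there but rare across the collection. This is a rough illustrative sketch of the calculation, not the implementation Old Bailey Online or Voyant actually uses:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF scores for each term in each document.

    TF  = term count / total terms in the document
    IDF = log(number of documents / documents containing the term)
    """
    n = len(docs)
    tokenised = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenised:
        for term in set(tokens):
            df[term] += 1
    scores = []
    for tokens in tokenised:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n / df[term])
            for term, count in counts.items()
        })
    return scores

docs = [
    "guilty of concealment of a birth",
    "guilty of theft",
    "not guilty of theft",
]
scores = tf_idf(docs)
# "guilty" appears in every document, so its IDF (and score) is 0;
# rarer terms such as "concealment" score higher.
```

This is why a “More Like This” search built on TF-IDF surfaces documents sharing distinctive vocabulary rather than common words like “guilty”, which appear everywhere in the corpus.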
In the lab session itself, I wasn’t able to use the Voyant function as it was being used by upwards of 30 people at once, but on attempting again at home I was able to successfully transfer the first 100 documents, which is the greatest amount the function allows.
I explored the functionality of Voyant a little more than last time to see what content I could extract that might be useful or meaningful. Unfortunately, some of the searches weren’t possible because there were too many documents; for example, I wasn’t able to view where instances of a particular word occurred across the whole corpus. It is still quite an impressive tool, but I found it hard to judge how useful or effective it was without a specific aim in mind.
In the second part of the lab, we were asked to compare the Old Bailey Online to one of the data-mining research projects from Utrecht University. Having just heard Ulrich speak, I decided to take a look at the website of the research project he had described. The mission statement outlines the project and its purpose:
“The program uses digital humanities tools to analyze how the United States has served as a cultural model for the Netherlands in the long twentieth century.”
The method by which they aim to carry out this analysis is data-mining digitised newspapers to measure long-term trends in national discourses. This would have been near impossible without being able to data-mine information from digitised text; the only alternative would be to read through hard-copy newspapers one by one, picking out terms and trends by hand. Unfortunately, with data-mining still in its infancy, not all of the material required has been digitised, and of that which has been, not all of it has an API available to use, free of charge or otherwise.
Compared to the Old Bailey Online, this project is of a very different nature, primarily because it involves data-mining from numerous external sources and collating that information over a long period in an attempt to measure trends in national discourse. Old Bailey Online has only one source to focus on, the archives of London’s Central Criminal Court, and aims to digitise the data and make it available for text-mining, serving more as a source for research projects than as one itself.