A brief introduction to Web 3.0…

In Week 9, our attention turned towards the future of digital information and, more specifically, towards the Semantic Web or ‘Web 3.0.’ This vision of the way in which information should be made both human and machine readable first appeared a lot longer ago than I could have guessed, with Wikipedia dating the beginning of the movement back to the 1960s. Considering the pace at which technology has advanced and how the exchange of information has progressed with the birth of the internet and the World Wide Web, it seems a little surprising that this idea has not gained a lot more ground.

The main feature of the Semantic Web is that the relationships between different bits of information are recorded as bits of information themselves, thus making them ‘understandable’ to the machine. This would enable information retrieval to be far more accurate. Instead of searching key terms and filtering through the search results of those terms in very different contexts, you would be able to search for the relationships between the search terms, and the search engine would be able to ‘understand’ what that means and begin to filter out those results which don’t meet the specific context searched for. This shift towards a semantically organised Web would obviously involve using mark-up languages in a different way, and we began to try to understand how this vision of the Web might be possible.

As part of the lab session we explored a website called Artists Books Online, “an online repository of facsimiles, metadata and criticism.”


This site adopts a DTD (document type definition) which acts as a template to ensure that the correct information for each artefact is present. This largely follows a format which reflects traditional bibliography, requiring information which describes the book, its author, publisher, place of publication, etc. This style lends itself quite nicely to beginning to visualise how the Semantic Web might work: by looking at the given information for a chosen item and seeing what tags could be added to the data in the site to make the relationships between the information more machine understandable. As Ernesto (via Johanna Drucker) details in our lab notes,

“Linked Data” or “data linkage” on the Web is mainly achieved through a standard known as the Resource Description Framework or RDF…There are a number of different technical ways to express RDF (for example XML), but the basic concept is that things are described through “triples” which take the form of Subject – Predicate – Object sentences.”

Using the information from this site, we were able to create our own ‘triples,’ e.g.

Subject: “Monuments to the Industrial Revolution”

Predicate: was authored by

Object: Charles Agel
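These subject–predicate–object triples can be sketched in code as a tiny triple store. This is a minimal illustration only: the second triple and the `query` helper are my own, not part of the Artists’ Books Online data model.

```python
# A tiny illustrative triple store: each fact is a (subject, predicate, object) tuple.
triples = [
    ("Monuments to the Industrial Revolution", "was authored by", "Charles Agel"),
    # Hypothetical second triple, for illustration only:
    ("Monuments to the Industrial Revolution", "is described in", "Artists' Books Online"),
]

def query(subject=None, predicate=None, obj=None):
    """Return all triples matching the given pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

# Who authored what?
for s, p, o in query(predicate="was authored by"):
    print(f"{s} {p} {o}")
```

In a real RDF system the subjects, predicates and objects would be URIs rather than plain strings, which is what lets machines link statements made in different datasets, but the pattern-matching idea is the same.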


Old Bailey Online Revisited

Our 8th DITA lecture covered data mining and included our first guest speaker, Ulrich Tiedau, who works on a digital humanities research project which uses data mining to study reference cultures. It was a useful and interesting insight into how data mining is being used by academics and librarians for research purposes.

The aims of our lab session were to explore further and compare the Old Bailey Online resource and one of the data-mining research projects hosted by Utrecht University, of which Ulrich’s was one.

We’d previously used Old Bailey Online when learning about APIs, so I was already familiar with the way it works, and fortunately it is a user-friendly site. It’s an incredibly interesting one at that, and too much of a temptation for avid procrastinators like myself. To start off I began trying out a few searches to refamiliarise myself with the search function. It very helpfully has a dropdown menu with suggestions for the subject lines of offence, verdict and sentence. I opted to search by offence, selecting ‘concealment of a birth,’ which produced 543 results. In order to export this data I had to recreate the search via the API demonstrator on another part of the site, where the search function is slightly different; for example, it offers the additional option of being able to choose the gender of the victim. I carried out the same search but also filtered the results to include only guilty verdicts, which returned 365 hits.

old bailey api results

The API demonstrator links to the data-mining tool Voyant (as well as Zotero), which we explored in the lecture the week before, allowing you to transfer the data easily at the click of a button and making the process far less labour-intensive. It also includes a “More Like This” function that allows you to build new searches based on a Term Frequency – Inverse Document Frequency (TF-IDF) methodology, which further assists in any text analysis you may wish to carry out using the site’s content.
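The TF-IDF idea behind a “More Like This” function can be sketched in a few lines: a term matters in a document if it is frequent there but rare across the corpus. This is a back-of-envelope illustration with made-up toy documents, not the site’s actual implementation.

```python
import math

def tf_idf(term, doc, corpus):
    """Score a term's importance in one document relative to a corpus.
    Each document is a list of lower-cased words."""
    tf = doc.count(term) / len(doc)                   # how often the term appears here
    containing = sum(1 for d in corpus if term in d)  # how many documents mention it
    idf = math.log(len(corpus) / (1 + containing))    # rarer terms score higher
    return tf * idf

# Toy 'trial account' corpus for illustration:
corpus = [
    ["guilty", "verdict", "murder"],
    ["guilty", "verdict", "theft"],
    ["birth", "concealment", "guilty"],
]
# 'guilty' appears everywhere so it scores at or below zero;
# 'murder' is distinctive to the first document so it scores higher.
print(tf_idf("murder", corpus[0], corpus))
```

“More Like This” searches then rank other documents by how strongly they share a document’s highest-scoring terms.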

In the lab session itself, I wasn’t able to use the Voyant function as it was being used by upwards of 30 people at once, but on attempting again at home I was able to successfully transfer the first 100 documents, which is the greatest amount the function allows.

voyant old bailey

I explored the functionality of Voyant a little more than last time to see what content I could extract which might be useful or meaningful. Unfortunately, some of the searches weren’t possible because there were too many documents; for example, I wasn’t able to view where instances of a particular word appeared across the whole corpus. It is still quite an impressive tool, however I found it a bit hard to test how useful or effective it was without a specific aim in mind.

In the second part of the lab, we were asked to compare the Old Bailey Online to one of the data-mining research projects from Utrecht University. Having just heard Ulrich speak, I decided to take a look at the website of the research project he had spoken of. The mission statement outlines the project and its purpose:

“The program uses digital humanities tools to analyze how the United States has served as a cultural model for the Netherlands in the long twentieth century.”

The method by which they aim to carry out this analysis is data-mining from newspapers which have been digitised, measuring long-term trends in national discourses. This would have been near impossible without being able to data-mine information from digitised text; the only alternative would be to read through hard-copy newspapers one by one, picking out terms and trends manually. Unfortunately, with data-mining being in its infancy, not all of the material required has been digitised, and of that which has been, not all of it has an API available to use, free of charge or otherwise.

Compared to the Old Bailey Online, this project is of a very different nature, primarily because it involves data-mining from numerous external sources and collating that information over a long period in an attempt to measure trends in national discourse. Old Bailey Online has only one source to focus on, the archives of London’s central criminal court, and aims to digitise the data and make it available for text-mining, serving more as a source for research projects than as one itself.

Assessing text analysis tools

Text analysis was the topic of our latest DITA lecture and more specifically, some tools available to conduct it. This post aims to answer the questions:

  • Which tools did I find the most interesting?
  • What did they help me find and how?
  • Why did I think what I found was interesting?

The tools these questions refer to were Wordle, Many Eyes and Voyant, and they were tested out using datasets collected from previous sessions on TAGS and Altmetric.


Wordle

This tool, or ‘toy’ as the site defines it, allows you to produce wordclouds for free from a body of text. I used a TAGS dataset showing journal articles around the theme of prisoners to try out this tool and the result was the following…

wordle

This image highlights that the most numerous words within the journal titles are ‘Prisoner’ and ‘Prison,’ a fairly obvious and not particularly useful conclusion. It might have been more interesting had I been given the option to amend the stopword list to remove these words and declutter the cloud of terms which were entirely predictable. As such I can’t really see the value in using a wordcloud other than as something nice to look at; there is definitely something very aesthetically pleasing about seeing lots of words all jumbled together. Unfortunately, however, there is no significant semantic content really being portrayed, and I imagine there only would be if the wordcloud had shown that the most-used words varied considerably from what I’d expected.
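The stopword-editing step I was missing is easy to picture in code: before counting word frequencies for the cloud, you simply drop the predictable terms. A minimal sketch, using an invented snippet of titles rather than the real TAGS dataset:

```python
from collections import Counter

# Illustrative journal-title text, not the actual TAGS export:
text = ("prisoner health care in prison systems: mental health of the prisoner "
        "and prison reform")

# Standard stopwords plus the predictable theme words I wanted to remove:
stopwords = {"in", "of", "the", "and", "prisoner", "prison"}

words = [w.strip(":,.").lower() for w in text.split()]
counts = Counter(w for w in words if w not in stopwords)

# What remains is what a decluttered wordcloud would be sized from:
print(counts.most_common(3))
```

With ‘prison’ and ‘prisoner’ gone, less obvious terms such as ‘health’ rise to the top, which is exactly the decluttering Wordle wouldn’t let me do.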

Many Eyes

Unfortunately the work currently being undertaken on this site made it very difficult to use to any effect. I had a bit of trouble making the creation tool understand how to use the body of text as well. The options for producing infographics using the corpus were ‘ranked from most to least relevant’, but the only tool which was able to produce anything properly was the wordcloud, which was in the centre of the list and was pretty ugly to be honest (although you do have the option to choose from a list of fonts).

many eyes

So, once again I was left with a largely useless wordcloud and no desire to experiment more with the tools before the site was in a more useable state.


Voyant

This tool is, as our class notes explain, “the most interesting and promising”, which appears to be the resounding conclusion after reading my DITA colleagues’ own posts on the topic. The dashboard-style page offers more than just the wordcloud, using graphs and visual tools to display the quantitative interpretation of the text in a far more interesting way than the other two tools.


Voyant allows you to edit the stopword list that informs the wordcloud, which is a lot more practical for the reasons outlined earlier. In the image above, I have applied a wordlist which also excludes the words “prison” and “prisoner”, which I think does help the other words stand out a bit more, although to what purpose is still a bit ambiguous. The centre column shows the text in its entirety, and as you engage with the other tools available on the dashboard the corpus responds, showing you, for example, where certain words appear, which makes everything feel a lot more engaging and dynamic.

The ‘keywords in context’ tool is interesting in that it gives you a better idea of the semantics of the text by aiding an understanding of the context in which a word appears. For example, I searched for the keyword ‘health’ and was quickly able to see that the majority of uses of the word were in reference to mental health as opposed to health in general.
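The idea behind a keywords-in-context display is simple enough to sketch: find each occurrence of the keyword and show a few words either side. This is my own toy version with an invented sample sentence, not Voyant’s implementation.

```python
def kwic(text, keyword, window=3):
    """Return each occurrence of keyword with `window` words of context either side."""
    words = text.lower().split()
    hits = []
    for i, w in enumerate(words):
        if w == keyword:
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            hits.append(f"…{left} [{keyword}] {right}…")
    return hits

# Hypothetical sample text for illustration:
sample = "the state of mental health in prisons and the health of inmates"
for line in kwic(sample, "health"):
    print(line)
```

Even in this tiny example, the first concordance line immediately shows ‘health’ qualified by ‘mental’, which is exactly the kind of distinction I was able to spot in the lab.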

The ‘word trends’ option (the line graph above) allows you to see at what point in the corpus a word is used. This is also an interesting touch, and I imagine it could be useful if you are trying to pull out themes within a piece of text without having to read the whole thing.

My main preoccupation when looking at these tools has been: how can they be useful to a librarian? Largely, at this stage, I can’t really come up with much to answer that. Voyant is definitely the most interesting to use, but I can’t imagine a scenario in which it would really help me to fulfil an enquiry or aid me in my work, though I hope to find instances where it does.

Altmetric Exploration

Last week DITA introduced us to the burgeoning area of alternative metrics and more specifically, the platform ‘Altmetric’, which offers academics an analytic tool to measure the “social impact” of the work they produce. The idea behind these metrics is to supplement the more traditional means of assessment of counting the number of citations with insight into how much attention each article is receiving online.

During our lab session, we began to experiment with the Altmetric Explorer and to familiarise ourselves with the distinctive ‘Altmetric donut’, which shows the Altmetric ‘score’ and its breakdown by source wherever the article has been referenced, whether in tweets, blogs, news sources or any of the other sources in Altmetric’s vast database. Many of my DITA colleagues have written really good introductions to altmetrics which are definitely worth a look for a good grounding in the topic.

Carrying out searches and looking through the data on the site, it’s definitely interesting to see how an article reaches audiences. As an academic, it’s easy to see how this could prove useful in tracking and widening the dissemination of your work. Whether or not that work is well received, or even read, is where the altmetric idea fails, which is overtly recognised by the creators themselves, as it’s not a failing which is readily solvable. On the site they are quick to highlight that,

“Altmetric measures attention, not quality. People pay attention to papers for all sorts of reasons, not all of them positive.”

The platform does make a considerable effort to mitigate this by allocating more points in each Altmetric score to those sources which are of higher ‘quality’ themselves, thereby hinting at the quality of the work itself. It’s not a fool-proof plan, but it’s hard to see what would be.
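The weighting idea can be sketched as a simple weighted sum. The weights below are in the spirit of Altmetric’s published default source weights, but the real algorithm is more nuanced (it also considers the reach and credibility of individual accounts), and the mention counts are illustrative, so treat this strictly as a sketch.

```python
# Approximate per-mention weights by source type (illustrative, not authoritative):
weights = {"news": 8, "blog": 5, "tweet": 1, "facebook": 0.25}

def attention_score(mentions):
    """mentions maps a source type to its number of mentions; unknown sources score 0."""
    return sum(weights.get(source, 0) * count for source, count in mentions.items())

# Hypothetical article with heavy Twitter attention and some blog/news coverage:
print(attention_score({"tweet": 1337, "blog": 30, "news": 2}))
```

The effect is that a single news story counts for as much as eight tweets, which is how the score hints at quality of attention rather than just volume.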

On beginning to write this post, I took a second look at Altmetric and decided to run a search on alternative metrics themselves, to see if much has been written about them in the academic field, and more particularly about how they measure ‘impact.’

Using the search toolbar I selected the keywords “altmetrics AND impact”, which I thought would return a broad set of articles, including any commenting on the effectiveness of altmetrics in measuring impact.

alt metric search

The result was a collection of sixty articles, which I exported into a CSV document to better view the details of each entry. Notably, glancing through the journals column, the majority of sources are scientific journals, or at least journals relating primarily to health and the sciences, with just a few from information-science or humanities-specific publications, which seems in tune with the critique that altmetrics have largely been adopted by the science community more than any other. It’s also quick to see that the majority of attention paid to the articles came as tweets or as items viewed via Mendeley.
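That glance down the journals column is the sort of thing a few lines of Python can make systematic once you have the CSV export. The column name below is an assumption for illustration; the header row of an actual Altmetric export may differ, so check and adjust.

```python
import csv
from collections import Counter

def journal_counts(path, column="Journal/Collection Title"):
    """Count how often each journal appears in an exported CSV.
    The default column name is a guess; match it to your export's header row."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        return Counter(row[column] for row in reader if row.get(column))
```

Calling `journal_counts("altmetric_export.csv").most_common(10)` would then list the ten most frequent journals, making the science-heavy skew visible at a glance instead of by eyeballing sixty rows.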


The first ‘hit’ on the list of data, fortunately, is the most relevant to my interests, and looking at how the Altmetric score has been calculated is an interesting way to see how others have engaged with the article. The number of tweets vastly outnumbers any of the other mentions it gains, with a total of 1337. Looking at the number of related blog posts and Mendeley readers, my first thought is how many of these are coming from the same person. One individual may have tweeted the article, read it on Mendeley and posted a blog referencing it, which instantly reduces the scope for measuring how many individuals have engaged with the material (although of course these activities may have led to others reading the article). Plus, as mentioned above, it also fails to show whether these references were positive or negative. All of this attention may have been the result of readers considering the content poorly argued or inaccurate.


I decided to take a closer look at these instances of attention to see whether the numbers could be a good indication that people are really engaging with the article, and whether they were positive, negative or neutral. Many of the references are just neutral mentions that the article exists. It’s surprisingly difficult to find material which is actually commenting on and critiquing the work. The best source in this respect appears to be blogging: the select few of the 30 highlighted were more likely to comment on the article and give a perspective on it. In this sense, I think seeing the number of citations made to the article in other academic outputs alongside these ‘alternative’ sources would be useful, because at least a citation shows the article has been read and considered. It is all the more important then, I think, to remember that altmetrics are a complementary form of measuring impact and should ideally be viewed in combination with other metric tools. Considering the lapse in time between publication and citation of an article, however, being able to at least see that the word is out on your work is incredibly beneficial, but one can only hope and wait that others have picked up the signposts and taken the time to engage.

Interestingly the conclusion of this article is that tweets (or “tweetations”) can correlate positively with citations, though it does stress what I’ve covered here, that “social impact” is very different from knowledge uptake in academia. Even so, the authors remain optimistic for the future of altmetrics.

So on that note, here is a link to the article itself for anyone interested and hopefully an extra blog post mention may take it a small step towards further notice in the wider academic communities…

Can Tweets predict citations?



As a brief aside, while writing this entry I stumbled upon a blog post about librarians and altmetrics, which may be useful to some of you: 4 Things Every Librarian Should Do With Altmetrics

If it’s boring, it’s important

bored statue

Hurrah for reading week!!

I’m sure I share the sentiment with my fellow #citylis students, and here’s hoping we’ll all get our heads down and thinking caps on to help digest the last five weeks of feasting on library science.

Within this time it’s DITA which I have found the most challenging; having come to it with no experience of digital technologies other than as a user, the concepts have been a bit tricky to get my head round. In spite of this, or maybe because of it, I have also found it the most rewarding. I’m beginning, slowly but surely, to get an ‘in’ on this very strange and alien language of computers, which is ever increasingly surrounding me and becoming harder to side-step. Ernesto tweeted, after his showing of The Internet’s Own Boy, a quote from Aaron Swartz which had particularly resonated with him.

It’s easy to understand why, and I think his timing couldn’t have been better. After just five weeks of participating in his module, I know I am not the only person on the course who has come to realise how little they knew about a subject which is so fundamentally entwined with our social and cultural identity. It’s almost embarrassing to think that as somebody who uses the internet every day of her life, I have never felt motivated enough to invest time and energy into beginning to gain a sound understanding of how it works. I think of it in a similar vein to taxes. I don’t really understand how the tax system works; it’s complicated (no doubt intentionally), with acronyms and jargon and formulas for calculations that make my eyes bleed. Plus it’s incredibly dull. So, I’ve largely left it in the hands of the powers that be and trusted that they’re getting it right, which can only be deemed a reckless abandonment of the responsibility I have over my own finances.

The internet is the same. It’s only on beginning to scratch the surface of comprehension that you realise that what you see on a screen is only the tip of the iceberg, and every tweet, Facebook update and shop item clicked does something in the World Wide Web, like a pebble thrown into water. The consequences, when brought to light, can be pretty eye-watering. Fortunately, there is this motivating factor when being introduced to the building blocks of the web, as we have been during the first half of term, becoming familiar with the likes of HTML, XML and JSON. A motivating factor which I’m finding essential when attempting to grapple with, yes I hate to say it, boring material. I’ve never been attracted to anything remotely mathematics-based and have always favoured words over numbers or patterns, so this new venture into coding does feel more effortful than most.

Luckily my approach to subjects which I have had little interest in up until now has definitely changed, as I’ve come to realise increasingly that boring = important. If I’m struggling with an idea and being ever more strongly tempted to procrastinate, I feel all the more the seriousness of the task at hand. Ten months of working in a corporate law firm has a lot to do with it, having had, out of necessity, to learn about the financial world and all the mind-numbing details entailed in insolvency, private equity, capital markets and hedge funds, to offer a taster (don’t chew, just hold your nose and swallow). With every brain-melting article I read came a stronger realisation of how shocking it was that I didn’t already know this stuff, which dictated how the economy worked and how I was a player in it.

In the same way, understanding digital technology shouldn’t be seen as pursuing an interest, but rather as carrying out a duty of self-education. The importance is too great and the risks of not doing so too high, and as librarians, making people aware of that duty is an important part of our role.

So despite my lack of excitement about databases and mark-up language and the like, I know I won’t waste this reading week. With the help of coffee on tap and a very uncomfortable chair I am quite ready to let myself be bored over the next couple of days, because I know the pay-off will be too important to be missed.

Embedding 8tracks

I had to do a little troubleshooting when attempting to embed a playlist from 8tracks by following the instructions given on the shortcode guide to embedding from 8tracks.com. The shortcode link I created isn’t converting to the playlist, but remains an instruction which doesn’t appear to do anything…

[8tracks url="http://8tracks.com/jellyfxsh/your-ego-is-getting-bigger-and-bigger-like-my-hand" height="250" width="300" playops="shuffle"]

For those wanting to embed a playlist and coming across the same problems as I am, you can try the following:

1. Go to 8tracks.com and select your playlist

2. Click on the ‘Share’ button to the right of the audio player which looks like an arrow coming out of a box

3. Select the icon ‘Embed player’ <>

4. This will bring up a pop-up window where you can select ‘WordPress shortcode’, which will produce a code in the bottom of the pop-up

5. Copy and paste the code in to your new blog post.

I’d be interested to know if anybody knows why the original shortcode didn’t work for me. Has anybody else had the same problem?

Me, My Data & I

The Wikipedia entry on information retrieval quotes the character Dr. Seward from Bram Stoker’s Dracula:

“But do you know that, although I have kept the diary [on a phonograph] for months past, it never once struck me how I was going to find any particular part of it in case I wanted to look it up?”

Diary-keeping is an age-old form of recording intimate personal information, and Dr. Seward’s conundrum is one which still remains, though the rise of digital journal-keeping, for example, has begun to offer a solution. We’ve all experienced that feeling of knowing that we’ve written something down which we want to refer to later, but can’t find it again. Digital recording is making it easier to recall information, which can be accessed through search functions, and the rise of the relatively new area of Personal Information Management (PIM) is making this all the more possible. Seemingly, it’s not just the recording of ‘useful’ personal data which is on the up; according to Bigthink we are recording any personal data possible and in the process becoming “datasexuals”:

“The datasexual looks a lot like you and me, but what’s different is their preoccupation with personal data. They are relentlessly digital, they obsessively record everything about their personal lives, and they think that data is sexy.”

This idea (though a bit cringe-worthy) doesn’t seem so far-fetched when I think of the amount of personal information peers and colleagues put on social media: what they ate for breakfast, what they’re reading, what film they just saw, where they’ve been for drinks, what music they’re interested in, and the list goes on. More and more of our personal preferences, choices and habits are going online whether it’s really of any use to us or not. It’s now even possible to record and publish online the contents of your bin and your heart rate in real time.

And of course all this information ends up in databases.

Our latest DITA lecture concerned databases and information retrieval, and introduced many of us to the (largely) hidden world of stored and categorised data behind the websites and search engines we all interact with on a daily basis. The amount of data held in databases which makes these platforms useful is incomprehensible, particularly when you think of the number of connections and associations made between each object within them.

The Web 2.0 era in which we all find ourselves is characterised by an increased ability to interact with the net and to influence and mould the ways in which information is presented, used and made. Becoming more personally involved with the databases that feed the world wide web is practically unavoidable, with social media sites, search engines and online shops recording and constantly updating our personal data. It came as no surprise then that in this lecture I began to wonder how much of the data in these databases concerns information about me, and what form it takes.

I’d like to think that the parts of myself which end up in databases reflect well on me. I’d like to think that if I met the database me I’d be pleased with myself, but the trouble is much of the data about me isn’t consciously selected and recorded by me and of course neither is it controlled by me. I don’t know who can access it or indeed change it for whatever purpose, if they wished to do.

Beginning to think about databases has not only given me a greater basis of understanding of how digital technology works, but it also brings to the surface inevitable questions of what information is in those databases and who (or what) chooses that information and controls how it is used.

A good read which considers these questions, along with how personal data may be collected and used in the future, is Evgeny Morozov‘s ‘To Save Everything, Click Here,’ which I’d recommend to other DITA students interested in reading around the power of the internet and what it may have in store for us.

An Introduction


To Giu{dita} e Oloforne, my blog created to reflect on and contribute to the #citylis module Introduction to Digital Information Technologies & Architectures (DITA). What better way to introduce a discipline than by an attempt to define it – according to Louis Rosenfeld and Peter Morville, as outlined in the “polar bear book” (Information Architecture for the World Wide Web):

Information architecture is…

1. The combination of organization, labelling, and navigation schemes within an information system.

2. The structural design of an information space to facilitate task completion and intuitive access to content.

3. The art and science of structuring and classifying websites and intranets to help people find and manage information.

4. An emerging discipline and community of practice focused on bringing principles of design and architecture to the digital landscape.

Within the four walls of this blog I hope to engage with the content of the module and explore these and other definitions in an attempt to understand and practice the art (or science) of information architecture within my current role as a law librarian and beyond.


As a novice blogger, my decisions on the structure of this blog will no doubt alter over the following weeks through trial and error. With a host of options, bells and whistles at my disposal my main priority has been to use as little as possible to the greatest efficiency. The template of the site was chosen largely out of aesthetic appeal (who doesn’t like a big picture of a fresco), but the components within the site hopefully will show a greater level of thought and consideration.


I’ve selected two tabs to remain static: the first to list posts as they are published, and the second, ‘About’, with some basic information about the author, their position and contact details.


In the sidebar I’ve included the following widgets:

  • Search – basic Google-style search to ease navigation
  • About Me – a brief biography, which allows the viewer to see the author of the blog
  • Twitter feed – listing the 5 most recent tweets, which will largely be relevant to the library world
  • Recent Posts – this will allow the user to view any previous posts in chronological order, which until there is a larger body of posts will be the quickest way of navigating amongst content
  • Blogroll – showing useful links relevant to posts and library science at large

The order of widgets is based largely on my own experience of viewing blogs and how useful I find the information they provide. This order is likely to change to reflect any categories which evolve from the posts I make and to show archived content as it becomes available. As this blog is new, I think the priority is showing clearly who the author is and what news and information from other sites are linked to it, to give a better idea of the intention and approach that will be adopted going forward.

I’ve chosen to list the widgets in a single right-hand sidebar. Primarily this is a result of the site template I have selected, but it also follows a common structure for blogs, no doubt because it allows more content to be viewed on the homepage; in my experience it’s where the eye naturally falls on opening a page, and it doesn’t seem to be off-putting when reading a post or bombard the user with too much intrusive information.

Please watch this space for upcoming posts and feel free to comment on, criticise, advise and share anything.