METADATA FOR GOAL POSTS

When James Gleick wrote “We can see now that information is what our world runs on: the blood and the fuel, the vital principle” he was clearly on to something. Information is the growth commodity of our age, whether in the financial sector in the form of economic analysis and algorithms for the stock markets, or in the huge amounts of metadata created by digitisation projects, which can then be data-mined by scholars to find previously unknown trends. This new appreciation of information is now spreading to all parts of society, even to sport, and to professional football in particular.

A quick Google search for “information jobs” and “football” produces a number of jobs in or connected to Association Football that match the skill set of most information professionals. Here are just three: Football Analyst and Social Media Assistant, targeting football fans online for a betting firm; Knowledge and Insight Analyst for the Football Foundation, which provides facilities for people across the UK to participate in football; and Academy Analyst at Reading Football Club, developing youth team players by filming and analysing practice sessions and matches, and maintaining a database of team and individual statistics.

One company which has tapped into the information needs of professional football clubs in the UK is Prozone. They developed the proprietary Prozone player-tracking software, which uses up to eight cameras around the pitch to track players and monitor their performance – how much they run, touches of the ball, fouls committed, etc. Many top teams use the Prozone technology to monitor the performance of their own players and track potential signings.

Prozone was first championed by Sir Clive Woodward, who used it extensively in preparing the England rugby union team in the run-up to their 2003 Rugby World Cup victory. Woodward believed access to data was essential when preparing a team. He almost sounds like a modern information professional! The far more commercially ruthless world of professional football was quick to see the potential of this new tool, and Prozone broke out of the confines of Rugby Union. Prozone is essentially about collecting data for football: they are effectively data-mining players. The tools they use are the tools used by library and information professionals: XML schemas for sending results to clients as email attachments, databases of players, analysis of the statistics of potential signings for football clubs. In short, the growth of information-based skills in football is a great example of how skills in the LIS sector are needed beyond the walls of the library.
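Purely as an illustrative sketch (Prozone’s actual formats and algorithms are proprietary, so everything here is an assumption), the kind of processing involved might look something like this in Python: take a series of timestamped pitch coordinates for one player, compute the distance covered, and write a small XML summary of the sort that could be emailed to a club.

```python
import math
import xml.etree.ElementTree as ET

# Hypothetical tracking data: (seconds, x metres, y metres) for one player.
# Real systems sample many times per second from multiple camera feeds.
track = [(0, 10.0, 20.0), (1, 12.5, 21.0), (2, 15.0, 23.5), (3, 15.2, 27.0)]

# Distance covered: sum of straight-line distances between consecutive samples.
distance = sum(
    math.hypot(x2 - x1, y2 - y1)
    for (_, x1, y1), (_, x2, y2) in zip(track, track[1:])
)

# A minimal XML summary of the kind that could be emailed to a club.
player = ET.Element("player", name="A. Example")
ET.SubElement(player, "distance_m").text = f"{distance:.1f}"
ET.SubElement(player, "touches").text = "42"  # made-up figure
print(ET.tostring(player, encoding="unicode"))
```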


Some Reflections

Now that the DITA module is coming to an end, it is time for reflection.

 
On a personal level I am probably a bit of a Luddite, so the module has been a challenge, as I have had to get out of my comfort zone. However, this is a good thing. What would be the point of enrolling on any academic course if you were not challenged and already knew the answers to all the questions posed during the course? Thanks to this module I now have a basic understanding of such terms as Tags, Mark-up languages, and the Semantic Web. These were all terms I had heard before, but I had no real understanding of them; they made about as much sense to me as chemical compounds or a foreign language I could not speak. Now, thanks to Ernesto and Ludi, the mist has started to lift. In particular, I can start to see practical applications. This is especially true of data mining and text analysis.

 
The whole blog business was something I was initially quite worried about, but, to my surprise, I have really enjoyed it. Much of what I have written is the meanderings of a novice in this field. However, again, this is a good thing. Writing the blog has given me a really good way of reflecting on and analysing a subject which was totally new to me. Added to this, writing the blog has got me back into the discipline of writing. This was very important as I have been away from Higher Education for 15 years! Yes, I am officially an old codger!

 
I think the artist Pablo Picasso summed up my very positive experience of DITA best:

 
“I am always doing that which I cannot do, in order that I may learn how to do it.”

 
That really is the essence of learning.

Mark-Up Languages and The Semantic Web

After our DITA lecture about the Semantic Web last week, we took part in a lab session designed to help us understand mark-up languages. This hands-on session took the form of looking at Artists Books Online and comparing it with Old Bailey Online.

Artist book again
Artists Books Online is a great resource for artists and art students as it has a large collection of art books by North American artists who are outside the mainstream. The site consists of fully digitised books with extensive bibliographic information and abstracts about each work. It also includes information about the physical condition and any conservation issues relating to each work. So the metadata is very good. For those of a technical mind, there is also an option to see the mark-up language. At first glance, this looks very much like a MARC record. But I guess MARC is just a mark-up language for library catalogues. This use of mark-up language, which can be understood by humans and is also machine readable, is a good example of how the Semantic Web is developing. In particular, RDF (Resource Description Framework) is simple enough to represent almost any fact or entity, yet structured in a way that computers can do a lot with. RDF is commonly written in XML, and the Old Bailey Online also uses XML. The basic premise is that things are described in “triples”: Subject – Predicate – Object. Two examples based on Artists Books Online are given below.

 
Subject – Predicate – Object
Johanna Drucker – wrote – Damaged Spring
Damaged Spring – was published in – 2003
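As a rough sketch of how a triple like this becomes machine readable, here is a short Python example using the rdflib library. The URIs and property names are invented for illustration; Artists Books Online will have its own identifiers and vocabulary.

```python
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/")  # placeholder namespace, not the site's own
g = Graph()

drucker = URIRef("http://example.org/JohannaDrucker")
book = URIRef("http://example.org/DamagedSpring")

# Subject - Predicate - Object
g.add((drucker, EX.wrote, book))
g.add((book, EX.publishedIn, Literal("2003")))

# Serialise the graph as RDF/XML, one of several machine-readable formats.
print(g.serialize(format="xml"))
```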

 
I found Artists Books Online very user friendly. You could search by artist, publication date, or a particular collection. It was all interconnected, so it was, if unintentionally, an example of the Semantic Web in microcosm. The Semantic Web is essentially a web of meaning, and it is mark-up languages that allow both computers and humans to make sense of the Web. We are getting closer to Tim Berners-Lee’s vision of the Web:

“The Web was designed as an information space, with the goal that it should be useful not only for human-human communication, but also that machines would be able to participate and help. One of the major obstacles to this has been the fact that most information on the Web is designed for human consumption, and even if it was derived from a database with well defined meanings (in at least some terms) for its columns, that the structure of the data is not evident to a robot browsing the web. Leaving aside the artificial intelligence problem of training machines to behave like people, the Semantic Web approach instead develops languages for expressing information in a machine processable form.”

From Semantic Web Roadmap by Tim Berners-Lee (September 1998).

http://www.w3.org/DesignIssues/Semantic.html

 

Some more thoughts about Altmetrics

Our DITA exercise from a few weeks ago introduced us to Altmetric.com. This is a database of academic articles and the attention they receive across the web. It is a very quick and user-friendly way to find genuinely academic articles (rather than journalistic articles) that were born digital. The site is run by a private company, so there is a commercial imperative (a subscription is about £50 a month), but for institutions it is a fantastic resource.
After our lab exercise, I decided to do a quick search of altmetric.com concerning a subject close to my heart – Parkinson’s disease. My mother suffers from Parkinson’s, so the subject understandably interests me.
The filter options on Altmetric.com are very good and include various mentions of articles in the media. I searched for the keyword “Parkinson’s”, and then filtered by articles that were mentioned on Twitter, in news outlets, on Facebook and in policy documents. I got over 100 hits. Sadly, many of the articles were abstract only, as the full articles were behind paywalls. The massive drawback of Altmetric.com is that there is no filter option to discard articles behind a paywall.
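You can, however, export search results and do some of the filtering yourself. The Python sketch below is purely hypothetical: it assumes a CSV export with “Title” and “Open Access” columns, which may well not match the real export format, so treat the file name and column names as assumptions.

```python
import csv

# Hypothetical export of an Altmetric.com search; real column names may differ.
with open("altmetric_parkinsons_export.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

# Keep only articles flagged as open access, i.e. not behind a paywall.
open_access = [r for r in rows if r.get("Open Access", "").strip().lower() == "yes"]

for article in open_access:
    print(article["Title"])
```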
The paywall issue is particularly worrying for two reasons. Firstly, much of this research is funded by public money, so to deny access is a bit like an academic equivalent of the American revolutionaries’ struggle against taxation without representation. Secondly, for research and data mining it greatly reduces the corpora available to the researcher. The good news is that in June this year the UK Parliament allowed data mining for non-commercial research, making it clear that this does not infringe copyright law. Sadly, as Brook, Murray-Rust and Oppenheim made clear in their article in D-Lib Magazine, many publishers “…took an extremely long time to decide how to respond, so that in practice the permission was not obtainable.”1
Sadly, it looks as though freedom of information is an ongoing historical struggle in Britain.

 

 

1. M. Brook, P. Murray-Rust, C. Oppenheim, “The Social, Political and Legal Aspects of Text and Data Mining (TDM)”, D-Lib Magazine, November / December 2014. Volume 20, Number 11/12.

Digitization and Data-Mining

Introduction
This week in DITA we looked some more at data-mining and the growing area of digitization of documents. Text analysis, which we did last week, is basically a subset of data-mining. Text analysis is simply using technology to find word frequencies and lexical combinations. Data-mining can be a far broader activity, including even just doing a Google search. Data-mining, with particular reference to the Humanities and Social Sciences, involves the extraction of data from a body of texts in order to answer research questions, sometimes as part of very major research projects.
Lab Exercise
We practiced data-mining using two major digitization projects: Old Bailey Online and one of the digitization projects from the University of Utrecht. The Old Bailey Online site is particularly useful as it has a very user friendly API demonstrator to allow data-mining.
To start with I searched Old Bailey Online using my surname, “Moore”, and used the “Punishments” option to refine the search. I chose “executed” (there were some very terrifying options such as “burning” and “public whipping”). I got 573 hits! I clicked on one which mentioned Elizabeth Moore, who was on trial for stealing a yard of needlework lace. Back in the 18th century theft could be a capital offence. I think some Brits would like to see a return of this legal system! Anyway, I should stay focused. Sadly I could not see the original digitised document as the rights were owned by Harvard University. This highlights one of the hurdles in digitization – the collection may have parts held by various stakeholders. Copyright is always a headache with large-scale digitization projects. I then did the same search on the Old Bailey API, where I found an extra refine option: gender. I changed this to “female” and changed the punishment to “transportation” for the year 1674. This produced just two hits. The API page had a link straight to Voyant, so I just sent one of the documents across (this time I had access to the full facsimile). Please see the word cloud below.

Transportation
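For anyone who wants to script the same query rather than use the web form, something along the following lines should be possible with Python’s requests library. The endpoint URL, parameter names and response shape below are placeholders, not the real Old Bailey API syntax, so check them against the API demonstrator’s documentation before relying on any of it.

```python
import requests

# Placeholder endpoint and parameter names: check the Old Bailey API
# demonstrator's documentation for the real URL and query syntax.
API_URL = "https://www.oldbaileyonline.org/obapi/search"  # assumed, not verified
params = {
    "surname": "Moore",
    "gender": "female",
    "punishment": "transportation",
    "year": 1674,
}

response = requests.get(API_URL, params=params, timeout=30)
response.raise_for_status()

results = response.json()  # assuming the API returns JSON
print(len(results.get("hits", [])), "trials found")
```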
The choice for the Utrecht project was Circulation of Knowledge and Learned Practices in the 17th-Century Dutch Republic. This is a collection of letters written by Dutch scholars, which should provide an insight into the boom in economics and exploration that happened in the Netherlands during the 17th century. The major difference between this and the Old Bailey project was that the Utrecht project seemed to offer only a full-text version, rather than a digital image of the original document. I also could not work out how to export data to Voyant. In the end I just cut and pasted one text into Voyant. The search I conducted used the keyword “maritime”, as, like Britain, the Dutch were a major maritime power at the time. Also, I thought a word with a Latin root might be used in Dutch as well as English. See the screenshot below.

Dutch Data Mining
In very simple terms, I found the Old Bailey Online much more user friendly. Its uses, both for serious social history research and, at the other end of the spectrum, for family history, are massive. That said, I get the impression the Utrecht project is still a work in progress, and it has the potential to become something very special.
Similar Projects
The Old Bailey Online site is far ahead of most projects thanks to its API capabilities. Sadly, the massive British Newspaper Archive (already over 9 million digitised pages of local and regional British and Irish newspapers) does not allow this. This is very sad, as the BNA is a fantastic resource for social, economic, political and local history. House of Commons Parliamentary Papers is another very good project, whereby thousands of Bills, Select Committee Papers and Parliamentary Reports have been digitised. It is very user friendly and reminds me a great deal of the Old Bailey project. This resource can be accessed free of charge from the British Library reading rooms.

TAGS

This post follows on from the DITA session on the 10th November 2014.
In this lab exercise we used an application developed by Martin Hawksey. This was a mash-up using the Twitter search API and a Google API. It allowed us to get the data in text form, which made it much easier to analyse. The exercise involved us extracting all the occurrences of #citylis on Twitter. This was a good (and easy to understand) demonstration of how anyone with fairly basic IT skills can now extract metadata for searching and archiving. This is obviously a massive bonus for researchers and academics.
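To give a feel for what can be done with the resulting archive, here is a small Python sketch that counts hashtag occurrences in an exported spreadsheet of tweets saved as CSV. The file name and the “text” column are assumptions for illustration, not a description of Hawksey’s actual tool.

```python
import csv
import re
from collections import Counter

hashtag = re.compile(r"#\w+")
counts = Counter()

# Assumed export of the tweet archive, with the tweet text in a "text" column.
with open("citylis_tweets.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        counts.update(tag.lower() for tag in hashtag.findall(row["text"]))

print(counts["#citylis"], "occurrences of #citylis")
print(counts.most_common(10))  # the ten most frequent hashtags in the archive
```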
I think this is a good opportunity to mention the history and applications of hashtags*, and how in many ways they are part of a wider history of information studies.
A tag is essentially a non-hierarchical keyword given to information. This is a form of electronic metadata that allows for searching, so that the information can be found again and again. Tagging developed and expanded during the Web 2.0 phase. However, the non-electronic roots of tagging can be traced way back to the earliest libraries, where tags of wood were attached to clay tablets containing proto-writing. The basic principle was the same then as now: it helps us to find information.
A big difference between tags in information systems and physical tags is not just technological; it is also philosophical. Traditional classification systems, or taxonomies, were top-down, with an ordered way of classifying data. Nowadays, thanks to tagging, there are unlimited ways to classify something, and no right or wrong answer. This is something I guess Foucault would approve of. Tagging is not a traditional taxonomy** in the top-down sense, with all the connotations of the classification being culturally biased. Technology has made the collection of metadata far easier and massively increased the scale of collection. Yet, again, metadata is part of a longer history. The Dewey Decimal System, started in 1876, and its library card catalogues were an organised way of collecting and storing metadata about books and periodicals. Census gathering is another example of large-scale metadata, one which can be traced back to the census returns of Ancient Egypt.
There is an obvious problem with tagging that we all need to be aware of (this is where, yet again, the power of technology and the human mind need to work together): meaning. When users can freely choose tags, rather than selecting from a controlled vocabulary***, the resulting data can be very confusing. In particular, there is the problem of definition. For instance, a tag for “Orange” could refer to the fruit, Unionist politics in Northern Ireland, the colour, or the town in Southern France.
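One crude way of taming this, sketched below in Python, is to map free-text tags onto controlled vocabulary terms after the fact and flag anything ambiguous for human review. The mapping table and tags are entirely made up for illustration.

```python
# A made-up mapping from free-text tags to controlled vocabulary terms.
controlled = {
    "orange (fruit)": "Citrus fruits",
    "orange (colour)": "Colour -- orange",
    "orange (politics)": "Orange Order",
    "orange (place)": "Orange (Vaucluse, France)",
}

def normalise(tag: str) -> str:
    """Return the controlled term for a tag, or flag it for human review."""
    return controlled.get(tag.lower().strip(), f"UNMAPPED: {tag}")

for tag in ["Orange (fruit)", "orange (colour)", "Orange"]:
    print(tag, "->", normalise(tag))  # a bare "Orange" stays ambiguous
```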

* A hashtag is metadata marked by the prefix symbol #. The form is used extensively on social media.
** Taxonomy is the practice of classification. It derives from the Greek words taxis, meaning order, and nomos, meaning law or science.
*** A controlled vocabulary, in LIS, is a selected list of words and phrases used to tag units of information to make them easier to retrieve.

Text Analysis


Our text analysis exercise was probably my favourite DITA exercise so far. It really helped me see just how potentially useful this technology is to academics and researchers. Text analysis massively speeds up searching text and finding connections and threads of thought. In this regard, it seems to address the concerns of one of the visionaries of information technology, Vannevar Bush, who wrote: “Professionally our methods of transmitting and reviewing the results of research are generations old and by now are totally inadequate for their purpose. If the aggregate time spent in writing scholarly works and in reading them could be evaluated, the ratio between these amounts of time might well be startling.”1

The exercise initially involved using data from our Altmetrics datasets. I used column D, which was the list of journal titles. This was then put into Voyant Tools for text analysis. This produced a list of how many times the individual words in the text were mentioned (starting with the most common) and also a Cirrus* image to help visualise the proliferation of certain words in the text. I then used the “stop word” function, which allows you to remove words that do not tell you much about the text, such as “the”, “and”, “is”, etc. To do this I chose the “English (Taporware)” option.** See the image below.

Text Analysis
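The same basic frequency count can be reproduced outside Voyant in a few lines of Python, which helped me see that there is no magic involved. This is only a sketch: the file name is assumed (column D pasted into a plain text file) and the stop-word list is a tiny made-up one, nowhere near as long as the Taporware list Voyant uses.

```python
import re
from collections import Counter

# Column D (journal titles) pasted into a plain text file -- an assumed file name.
text = open("journal_titles.txt", encoding="utf-8").read()

# A tiny illustrative stop-word list; Voyant's Taporware list is far longer.
stopwords = {"the", "and", "of", "in", "is", "a", "an", "for", "on", "to"}

words = re.findall(r"[a-z']+", text.lower())
frequencies = Counter(w for w in words if w not in stopwords)

# The most common words -- the same counts that drive Voyant's Cirrus cloud.
for word, count in frequencies.most_common(15):
    print(f"{word:<20}{count}")
```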

Just for a bit of light relief (after all, one of the texts mentioned by Ernesto was called “The Hermeneutics of Screwing Around”2), I decided to complete a text analysis on Voyant of a BBC website article about how beards became fashionable in this country after the Crimean War. Basically, the freezing conditions meant soldiers could not shave, and the British Army relaxed its normal hard line on being clean-shaven. So, all you bearded hipsters, not so cutting-edge after all! I’m sorry; I couldn’t resist being a bit bitter and twisted. See the funny word cloud below.

Big Beard Data

A nice addition to the Voyant website is the accompanying text about data retrieval and text analysis. This includes useful tips to get you started and also an article about Descartes and his 1637 Discourse on Method, in which Descartes talks about the need for solitude when thinking and problem solving; this is contrasted with the modern practice of group research, which still has its moments of solitude. Yet, even when physically alone, a researcher or academic can still be in touch with lots of fellow professionals thanks to our digital technology. However, this data analysis, which looks at frequencies and occurrences of words and phrases, is not meant to replace the hermeneutic method. Rather, it should aid and complement it. It is still essential to interpret a text and try to understand the spirit in which it was written.

*Word cloud displaying the frequency of words appearing in a corpus. Words occurring more frequently appear larger.

**TAPoRware is a set of text analysis tools that enables users to perform text analysis on HTML, XML and plain text files, using documents from the users’ machine or on the web.

Notes

1. Bush, Vannevar. “As We May Think.” The Atlantic, July 1945. Reprinted in Life magazine, September 10, 1945.

2. Ramsay, Stephen. “The Hermeneutics of Screwing Around; or What You Do with a Million Books.” In Pastplay: Teaching and Learning History with Technology, edited by Kevin Kee, 111-20. Ann Arbor: University of Michigan Press, 2014.

Information as Commodity (Using #Tags / Twitter as example)


Our DITA exercise the other week involving extracting data from Twitter – clearly a very useful tool for academic and personal research – got me thinking about large-scale data extraction. In particular, as Twitter charges companies for this kind of service, are we now in an era where data and information are a commodity?

In a way data has been a commodity since the early history of the document. After all, booksellers and printers have been around for hundreds of years. But the ease with which technology can now collect and order huge amounts of data, not to mention access our private lives through social media, has led to a paradigm shift in how information is seen as a commodity.

This development has been noted by a number of scholars. Harlan Cleveland sees changes in our understanding of information, driven by technology, as having consequences reaching beyond libraries and into political economy and even the law. Firstly, Cleveland asks whether market exchange will have to take account of the fact that more and more of our economic activity now consists of what are by nature sharing transactions. Secondly, in law, how should we adapt the concept of property in facts and ideas when the widespread violation of copyright and the shortened life of patent rights have become the unenforceable Prohibition of our time?1 Nevertheless, many corporations clearly feel that big business can make huge amounts of money dealing in information. As Malcolm R. Parks wrote: “…commercial entities such as Facebook, Twitter and Google…these companies either deny or tightly manage data access by researchers, leading to fears of new digital divides and the creation of classes of researchers who are either data rich or data poor.”2

Net Freedom

An interesting development in this area is the threat to net neutrality. Net neutrality is the principle that ISPs (Internet Service Providers) treat all traffic equally; some ISPs would like to end it and charge people and companies for quicker, more reliable internet access. This is fine for multinational companies and well-off people, but the majority of us would be left living in a pay-per-view internet environment. This is essentially information apartheid, and it is a very popular idea with many communication firms. As AT&T bigwig Edward Whitacre said when he compared the web to plumbing, “Anybody who expects to use pipes for free is nuts”3 (if you are a member of CILIP or can get hold of the October issue of Update, check out Phil Bradley’s excellent article about net neutrality).

http://www.cilip.org.uk/cilip/membership/membership-benefits/monthly-magazine-journals-and-ebulletins

https://twitter.com/philbradley

So, technology is great, and can improve our understanding of the world and our lives beyond words, but, as ever, we need to watch out for the barbarians.

 

  1. Harlan Cleveland, “Information as a Resource”, The Futurist, 16th December 1982, pages 35-38.
  2. Malcolm R. Parks, “Big data in communication research: Its contents and discontents”, Journal of Communication, Issue 64, 2014, pages 355-360. Doi:10.1111/jcom.12090
  3. Phil Bradley, “The inside track on net neutrality”, Update, October 2014, pages 28-31.

 

Mashups & Remixes


This week in DITA we focused on APIs, web services and mashups. In very simple terms that even I can understand, APIs (application programming interfaces) are used as platforms so that data can be shared between different web services, allowing people to embed content and create their own digital spaces. This has produced some amazing websites and digital resources. A particularly good example is the British Library online learning resource Sisterhood and After: An Oral History of the Women’s Liberation Movement, which includes text, film and audio recordings.

Thanks to APIs, the tools are there to create amazing multimedia experiences. Just one website can contain Google Maps, live Twitter feeds, footage from YouTube, and so on. However, as with much in the digital world, there needs to be a creative edge combined with a theoretical understanding. Without any sense of design and aesthetics, websites with embedded elements can just look busy and messy (they can give you a headache just by looking at them!). Also, as this technology can be used by anyone without any in-depth IT knowledge, there does seem to be something of a pedagogical gap. We as human beings do need to understand something before we can fully appreciate it.
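To peek under the hood a little: at the code level a mashup is often no more than fetching data from two services and combining the results in one page. The Python sketch below is deliberately generic; the two endpoints, their parameters and the shape of the JSON they return are all placeholders rather than real APIs.

```python
import requests

# Placeholder endpoints standing in for any two web services with JSON APIs.
MAP_API = "https://example.org/maps/api"       # assumed
TWEETS_API = "https://example.org/tweets/api"  # assumed

location = requests.get(MAP_API, params={"q": "City University London"}, timeout=30).json()
tweets = requests.get(TWEETS_API, params={"hashtag": "citylis"}, timeout=30).json()

# The "mashup": combine both responses into one snippet of HTML for embedding.
html = "<div class='mashup'>"
html += f"<p>Location: {location.get('lat')}, {location.get('lon')}</p>"
html += "".join(f"<p>{t.get('text')}</p>" for t in tweets.get("results", []))
html += "</div>"
print(html)
```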

Just as a postscript, I thought I would mention mashups. Mashups are basically two or more sources used to create a single interface or product. Isn’t that just a new form of mixing? This basic idea has been around since the birth of hip-hop, when DJs mixed together two vinyl records to make one track. This first started in New York in the late seventies and early eighties. The best DJs, such as Afrika Bambaataa, turned this into an art form. Just like today’s mashups, technology and art combined to create something very special.

Hip-Hop

A mixer and turntables – technology meets art!

*Please note that my blog now has my Twitter feed and Google Maps (showing the location of City University) embedded.