
Oedipus HyperPo’ed

May 23rd, 2009

I wear the hat of Head of Digital Library Programs at Case Western Reserve University. I am also the Managing Librarian for the Samuel B. and Marian K. Freedman Digital Library, Language Learning, and Multimedia Service center. Recently, the Freedman Center hosted the annual Freedman Fellows Program, which is a venue for getting faculty not only to use multimedia tools, but to think about how those tools can enhance their curriculum and the experiences of their students. Additionally, with a new gift from the Freedman Family, the Freedman Fellows Program is encouraging the use of digital tools for research. Good examples of this kind of work come from the Center for History and New Media at George Mason University; the Maryland Institute for Technology in the Humanities (MITH); and SIMILE at MIT.

Text Overview


Many of the tools offered by these centers let researchers visualize data in new ways. Many forms of data lend themselves to visualization. Obvious examples include GIS or GPS data, and really any form of numeric or date data, but less obvious collections can be visualized, too, including texts: entire texts. Once a document has been digitized using optical character recognition (OCR), it can be marked up fairly quickly following the Text Encoding Initiative (TEI) guidelines. TEI uses a form of XML to mark up documents. XML is a subset of SGML, so TEI markup looks very much like HTML, but it is far more flexible and descriptive. For general markup, at TEI level two or three, you can just add paragraph tags, tables of contents, indexes, and so on; but there is a level five that allows for very descriptive markup, including tears in manuscripts, margin notes, GPS coordinates for place locations, and more. Each word in a text can also be marked, and then tools at SEASR (pronounced “Caesar”) can run text analytics (tokenizing the text) and record not only the instances of each word in a text, but the exact place of each instance. This allows very comprehensive and complex relationships to come to the surface that may not have been ‘visible’ before.
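To make the idea of recording "the exact place of the words" a little more concrete, here is a minimal Python sketch of a positional word index. This is my own illustration, not how SEASR actually works: the function name `positional_index`, the simple tokenizer regex, and the sample sentence are all made up for the example.

```python
import re
from collections import defaultdict


def positional_index(text):
    """Tokenize text and map each word to the token positions where it occurs."""
    # Assumption: a deliberately simple tokenizer that lowercases the text and
    # keeps runs of letters, optionally with one internal apostrophe.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    index = defaultdict(list)
    for position, token in enumerate(tokens):
        index[token].append(position)
    return tokens, dict(index)


# Invented sample sentence, not a quotation from any translation.
sample = "Blind to the truth, the king would not see; the seer, though blind, could see."
tokens, index = positional_index(sample)

print(index["blind"])  # positions (counting from 0) of every "blind"
print(index["see"])    # positions of every "see"
```

Once every word carries its position, questions like "how far apart are these two words?" or "which words cluster in the last third of the text?" become simple arithmetic on the position lists.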

Our ‘keynote’ speaker for the Freedman Fellows Program was Tanya Clement, from the University of Maryland, who talked about various tools for text mining and text analytics that she used in her work on Gertrude Stein’s The Making of Americans with the MONK Project. Part of what she discovered is highly complex patterns of repetition that critics had largely dismissed as nonsense or as attempts at intentional confusion. Examples from other modernists abound, including Joyce’s Ulysses and Finnegans Wake, where language itself is not only stretched to the limits of its ability to express meaning, but new words, concepts, and meanings are created.

Word Frequencies


One tool that is freely available is HyperPo. HyperPo lets you analyze a text quickly to see word frequencies and word occurrences within sentences; you can remove “stop words” (and, the, or, it), and you can even visualize the frequencies. The MONK Workbench lets you run various analytic routines on texts as well (I’m not a statistician, so I can’t speak to them all).
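For readers who want to see what a word-frequency count with stop words removed amounts to, here is a small Python sketch. It is only an approximation of the idea, not HyperPo's implementation: the stop-word list is just the four examples above, and the function name, tokenizer, and sample sentence are my own assumptions.

```python
import re
from collections import Counter

# Assumption: a tiny stop-word list taken from the examples in this post.
# Real tools ship much longer lists.
STOP_WORDS = {"and", "the", "or", "it"}


def word_frequencies(text, stop_words=STOP_WORDS):
    """Count how often each word occurs, ignoring the stop words."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(token for token in tokens if token not in stop_words)


# Invented sample sentence for illustration only.
sample = "The king sees and the seer sees, or so it seems to the king."
freqs = word_frequencies(sample)

# Print the most frequent remaining words with their counts.
for word, count in freqs.most_common(3):
    print(word, count)
```

Note that "sees" and "seems" are counted as different words here; grouping related forms together (stemming or lemmatizing) is a separate step that this sketch leaves out.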

The overall point is that the tools many universities and projects are making available allow for “reading” texts in new ways that can reveal more details about them. I, for instance, am looking at various images in Oedipus. Not only can I find the frequencies at which these images occur, I can also see the context in which they occur, what other words they appear near, and so on. I hope to report on what I find over the next several weeks.
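Seeing a word in its context is usually called a keyword-in-context (KWIC) view, and the basic idea fits in a few lines of Python. This is a rough sketch under my own assumptions, not the code behind HyperPo or MONK: `keyword_in_context`, the `window` parameter, and the sample sentence are hypothetical.

```python
import re


def keyword_in_context(text, keyword, window=3):
    """Return (left words, keyword, right words) for every occurrence of keyword."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = tokens[max(0, i - window):i]    # up to `window` words before
            right = tokens[i + 1:i + 1 + window]   # up to `window` words after
            hits.append((left, token, right))
    return hits


# Invented sample sentence, not a quotation from the play.
sample = "He had eyes but could not see; now without eyes he sees the truth."
hits = keyword_in_context(sample, "eyes", window=2)

for left, word, right in hits:
    print(" ".join(left), f"[{word}]", " ".join(right))
```

Collecting the left and right words across all the hits and counting them gives a first look at what other words an image tends to appear near.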
