Archives for the month of: February, 2013

Technology has had a huge impact on the way we, as researchers, conduct research. The sheer number of tools now at our disposal is incredible, and they allow us to quickly and easily find the information we need to complete our research projects.

The article Quantitative Analysis of Culture Using Millions of Digitized Books takes a look at two of the major advancements in the way we can now conduct our research: digital archives and the Ngram Viewer. According to the article, about twelve percent of all books ever published have now been digitized and are available electronically in some form. This is a huge benefit for researchers of any discipline, as it allows easy access to materials anywhere at any time. Digitizing books also ensures that, regardless of age, books survive in some form, so that the information they contain is preserved as well.

Ngram viewers, such as the Google Ngram Viewer, allow us to search these digitized books for specific words. By plugging keywords into the Ngram Viewer we can quickly see how significant those words were at a particular point in time, which can help narrow down a timeframe to study. For example, I plugged in a few keywords relating to cars: Mercedes, Benz, Ford, Toyota, gasoline, and diesel, to see how frequently these words were used in English from 1800 to 2000. I was quite surprised to find that “Ford” was used frequently all the way back to 1800, although likely because it is also a fairly common surname. “Mercedes,” “Benz,” and “gasoline” had very limited use until the turn of the twentieth century, likely due to the then-recent introduction of the gasoline-powered automobile. The word “diesel” reached peak popularity during the 1930s and 1940s, likely due to the introduction of the first diesel-powered car in the 1930s and the extensive use of diesel-powered Mercedes vehicles by the Nazis. Not surprisingly, “Toyota” shows up relatively little in English until the 1960s and 1970s, when Toyotas began to be sold in North America. The Google Ngram Viewer is a great tool for anyone trying to find out the historical significance of different words.

Mining the Dispatch is very similar to the Google Ngram Viewer, except that the site focuses specifically on texts from Richmond, Virginia during the Civil War. Richmond was the capital of the Confederate states, and the website draws largely from the Richmond Daily Dispatch. The site is a great tool for anyone studying the Civil War, especially the Confederate side, as it lets you quickly see how significant certain words were in the Confederate media during the war. It also helps give a sense of the events of the war and of Confederate war propaganda. The topic graphs posted on the site also show how the war was going for the Confederates: the one for war bonds, for example, shows a dramatic spike towards the end of the war, when the Confederates were running out of resources to continue fighting. The graphs for death notices and casualties can likewise be used to identify major battles or lulls in the fighting.

These are just a few examples of what is out there on the web to support researchers today. We, as researchers, must get to know these tools and use them, as they offer many benefits and save a great deal of time as we delve into the past.

Last week we discussed archival history and textual analysis. We looked at the preservation of books that were centuries old, extraordinary to say the least. The class discussed how such tools would be of use to historians of the future, as the art of preservation creeps further from its humidity-controlled basement roots into the new age of digital preservation. We were fortunate enough to be exposed to both forms of preservation, and it is up to us to weigh the benefits and flaws of each.

Web articles such as Quantitative Analysis of Culture Using Millions of Digitized Books proved to be another useful tool in the gathering of digital data, giving us access to a large corpus of digitized books: 4% of all books ever printed, to be exact. Information like this is available to students well beyond the Faculty of History, yet few students are ever exposed to it. For example, I was not aware that the University of Waterloo had such a distinguished and renowned archive of books dating back hundreds of years, with a vast array of information about their time periods, all at our disposal. Primary sources lay almost literally right under our noses, and we were unaware of them.

Another important tool we discussed in class was Mining the Dispatch, a project created by Robert K. Nelson. It is unique in that it uses words or phrases that are usually found together to build topics. It is similar in spirit to Google’s Ngram Viewer, which likewise looks at large bodies of data, although the Ngram Viewer charts word frequency rather than modelling topics. Here is how Mining the Dispatch describes itself: “It uses as its evidence nearly the full run of the Richmond Daily Dispatch from the eve of Lincoln’s election in November 1860 to the evacuation of the city in April 1865. It uses as its principle methodology topic modeling, a computational, probabilistic technique to uncover categories and discover patterns in and among texts”.

http://dsl.richmond.edu/dispatch/


What was most intriguing to me about last week was our discussion of archives. Reading “Archives in Context and as Context” by Kate Theimer gave me a broader understanding of what “archives” really are and how they should be viewed within the digital humanities. The gist of the article is that the archival community’s definition of “archives” should not be set aside in the digital community, touching on the opinions of other scholars such as Kenneth Price: “Therefore, it is important to note that the formal definition of “archives” used in the archival community cited here recognizes no differences for electronic records, born digital material, or materials presented on the web. Price’s definition, put forward for a digital humanities audience, may be correct in that community of practice, but it should come as no surprise to digital humanists that archivists have concerns about that definition”. This past week has brought to our attention a great deal of knowledge and tools at our disposal, all of which will help shape the world of digital history for future generations. I am curious about what we will discuss in the future regarding programming and the Python language.

Modern technology is changing the way we do research and look at data. A vast amount of information is available with a simple search. The real problem nowadays is not having enough research material, but actually using all the information we can get effectively.

To help combat this problem, new methods of looking at data are being formulated. Programs such as the N-Gram Viewer and Mining the Dispatch are examples of newer tools for looking at information. Both try to sort and present information through techniques called textual analysis and topic modeling.

N-Gram Viewer

The N-Gram Viewer is a textual analysis tool that provides a visual representation of how often a certain word was used in a given time period. The search works by finding how frequently the chosen word appears in the collection of books provided by Google Books. The N-Gram Viewer has a simple, easy-to-use design: to graph the data, you just type in a word and decide on the time period. Using it, a user can see the “history” of a word.

[Figure: Google N-Gram Viewer graph for “donut”]

The example of “donut” shows the general timeline of the word’s use. According to the N-Gram Viewer, “donut” was not in use until the 1860s, then fell out of use until the 1930s, after which it became increasingly popular.
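For anyone who wants to pull the same numbers programmatically, the graph data can be fetched from the JSON endpoint behind the viewer’s search box. The endpoint is unofficial and undocumented, so treat this Python sketch as an assumption-laden illustration rather than a supported API; the corpus label in particular is a guess and may need adjusting.

import json
import urllib.parse
import urllib.request

# Unofficial endpoint behind the Ngram Viewer's search box; it is
# undocumented and liable to change without notice.
params = urllib.parse.urlencode({
    "content": "donut",     # the n-gram to chart
    "year_start": 1800,
    "year_end": 2000,
    "corpus": "en-2019",    # assumed corpus label; adjust if rejected
    "smoothing": 3,         # same rolling average the web graph uses
})
url = "https://books.google.com/ngrams/json?" + params

with urllib.request.urlopen(url) as response:
    data = json.loads(response.read().decode("utf-8"))

# Each result carries a "timeseries" of relative frequencies, one per year.
for series in data:
    print(series["ngram"], series["timeseries"][:5])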

In addition to the general search, the N-Gram Viewer offers more interesting ways to use the graph through its built-in functions. It can find how often sentences start or end with a certain word using the _START_ and _END_ operators, and how often one word depends on another using the => operator. It also supports composition operators (“+”, “-”, “*”, “/”, and “:”) and can report how often a word was used as a noun versus a verb. This is not the full list, just some of the more interesting features offered.
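To make these concrete, here are a few queries one can type directly into the viewer’s search box, all using the syntax described in Google’s own documentation:

_START_ President (sentences that begin with “President”)
cheap=>food (occurrences of “food” modified by “cheap”)
donut_NOUN (“donut” only when tagged as a noun)
color + colour (the two spellings summed into one line)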

[Figure: N-Gram Viewer graph for “cheap=>food”]

This graph shows how => is used to find occurrences of “food” that depend on the word “cheap.” Phrases counted include “cheap food,” “cheap delicious food,” and “cheap Chinese food.”

While the N-Gram Viewer provides an interesting way to look at information, the lack of any way to get more specific information makes the tool feel limited. It tells you nothing about the sources used to make the graph: you cannot find the name of a book or the context of a word. Furthermore, we have no real information about the underlying database, Google Books, which is used to generate the graph. Google Books does not hold all the books in the world, and since the database is incomplete, the graph might be misleading. In the first example, the graph suggests the word donut was not used before the 1860s; however, “donut” appears in English cookbooks as early as 1803. Also, if we use the graph to compare two different words, we have no way of knowing whether the information is accurately represented; Google Books might have a larger collection on certain topics, skewing the graph.

Mining the Dispatch

Mining the Dispatch performs the same function as the N-Gram Viewer, except that its information is limited to the American Civil War years of 1860 to 1865. It provides information on Civil War subjects such as soldiers, slavery, and the economy. One feature Mining the Dispatch offers that the N-Gram Viewer does not, and which I really liked, is Exemplary Articles: this section shows Daily Dispatch articles published during the month selected on the graph and ranks them by relevance to the chosen topic. Another textual analysis tool I really liked, with similar functions, was Voyant Tools. Voyant is a combination of the N-Gram Viewer and Wordle that analyzes a single document. It has a neat feature that provides the context of a word as used in the document. I can understand how this would be a difficult task for the N-Gram Viewer, since there could be millions of sentences across all the books, but a similar feature could be used to show what books are about instead of sentences.
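As an illustration of how such a keyword-in-context feature works at its core, here is a minimal Python sketch (a toy version for a single text, not Voyant’s actual implementation):

import re

def kwic(text, keyword, width=4):
    # Show each occurrence of `keyword` with `width` words of context per side.
    words = re.findall(r"\w+", text.lower())
    for i, word in enumerate(words):
        if word == keyword:
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            print(f"{left} [{word}] {right}")

kwic("The donut shop sold every donut by noon, so the donut case sat empty.", "donut")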

[Figure: Voyant Tools screenshot]

Example of Voyant Tools analyzing a review of The Donut: A Canadian History. The analysis displays the Cirrus, Reader, Word Trends, Keywords in Context, and Words in Document panels.

Conclusion

Although the N-Gram Viewer is an intuitive way to look at information, there is room for improvement. Even just by incorporating features from similar tools, the N-Gram Viewer could improve what it offers its users; while those other tools provide more comprehensive data, their databases are not as big as the N-Gram Viewer’s. These methods of research provide an interesting new way to look at information, but improvements are needed: they can be used to get general information or perspective on a topic, but would be inadequate for serious research. Still, I see potential in textual analysis tools such as the N-Gram Viewer, and refining them might change the way we do research in the future.

Textual analysis through the Ngram Viewer and Mining the Dispatch offers new ways to explore our past by analyzing documents for specific words or combinations of words within databases. This process allows for the incorporation of much more information, at a scale never fathomed before. The Science article exhibited this well with its examples: it would take 80 years just to read the English-language entries from the year 2000 alone at 200 words per minute without rest, and the corpus would reach to the moon and back 10 times over if written out in a straight line.

Such vastness means new methods are required to make the most effective use of it. Mining the Dispatch incorporates topic modeling in an attempt to distinguish topics by the association of words used together. The shortcoming of this tool lies in the lack of critical thinking involved in that association, which led to fugitive slave ads being grouped under the topic of “entertainment and culture” because of the clothing described in the ads. Still, I found the practice innovative, as it attempts to establish connections from mere text rather than from an analysis of context. This method of research supplements the current study of history with sets of ideas that open up new areas to look into.
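Topic modeling itself is easy to experiment with. Below is a minimal Python sketch using scikit-learn’s LDA implementation on an invented toy corpus; Mining the Dispatch used its own tooling on a far larger archive, so this only illustrates the general technique of grouping words that tend to occur together:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Invented stand-ins for newspaper articles; real input would be thousands of texts.
docs = [
    "ranaway negro man reward clothing dark jacket",
    "reward ranaway servant clothing boots reward",
    "cotton prices market flour corn market prices",
    "market flour cotton corn prices wheat",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# Ask for two topics; LDA groups words that tend to occur together.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

terms = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:4]]
    print(f"Topic {topic_idx}: {', '.join(top)}")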

Google’s Ngram Viewer is a tool which plots the frequency of input phrases within a certain period, drawing on the Google Books corpus. It can be used to compare different subjects and to chart changes in occurrence over time. I found this aspect particularly effective, since a lot can be derived from knowing what was included in and excluded from the press. User-friendly features, such as the year and percentage of occurrence following the cursor as it moves around the graph, are simple additions that add to the tool’s effectiveness. Another strength is the vastness of the literature it draws upon, as the database contains 8 different languages. While playing with the Ngram Viewer, I decided to explore how the three main sciences of Biology, Chemistry, and Physics have interacted since the 16th century. The results indicated long gaps in the progression of these studies, with great spurts at certain periods, which would be distinct moments in history with specific reasons for the increase or decrease of occurrence within books. A tool such as the Ngram Viewer is a great starting point for exploring how certain topics relate to one another and how those relations change over time.

[Figure: Ngram Viewer graph of “Biology”, “Chemistry”, and “Physics”]

The Science article further supported the potential of this digitization process by introducing projects which successfully used textual analysis. The project to determine how many words are in the English language and the study of the evolution of grammar are prime examples of the potential that lies in textual analysis. The article also discussed the impact of censorship in Nazi Germany through the frequency of the Jewish artist Marc Chagall’s name in English versus German texts, which exhibited the effects of Nazi censorship: his name all but disappeared from German texts during the years of the Third Reich.

I believe textual analysis should be integrated into the study of history, as the vastness of material in this day and age calls for it, but it should be used alongside traditional research for optimal results. The shortcoming of Mining the Dispatch is the occasional grouping of unrelated topics; the downfall of the Google Ngram Viewer is its lack of specific detail; and both can be addressed with additional research.

Computer Science background, here to help again.

Who would’ve thought?

This one flow chart defined much of my existence from 2008 until 2011.

Make no mistake, I am completely aware that I am a strange man. I started coding in Java for the first time in years last week and couldn’t help but smile the entire time, even when my program failed miserably. I suppose that sort of interest in computer science is what made me so interested in Google’s Ngram Viewer. I am no stranger to analyzing large sets of data and finding the commonalities and other properties of a data set, but the idea of analyzing millions and millions of books makes my loins all warm and tingly.

The article in Science, which tells about some of the amazing things like mapping the English language and finding out when we talk most about a given year, was really eye-opening. I found myself already knee deep in some of the fun things to do with Ngram, but the article made me realize that I had only seen the tip of the iceberg that is the full potential of this engine.

The Mining the Dispatch site was really fun to explore as well. The author of the site has found some really interesting conclusions and trends that I could connect with, having just completed my American History course last term. The true “lesson” of the website, if there is one, is that our field is changing dramatically. The idea of looking at the long-term trends of hundreds, thousands, or millions of anything, let alone books, newspapers or, potentially a new medium of study, video games, was unheard of a decade ago. It is fascinating, exciting, and nerve-wracking all at the same time to realize that we have yet to discover the full technological potential of our field, but I digress.

The peak of New York’s influence in literature is in the same year Sinatra sang “New York, New York” for the first time. Coincidence? I think not.

As I mentioned above, I have only begun to discover the potential of this engine but I found a couple of cool things that I thought were worth mentioning. Considering that the default timeframe to look between was 1800 to 2010, I thought looking at the rise and fall of the greatest cities in the world would be a great start. I mapped the terms, “Paris”, “London”, and “New York” (I had put in more cities but their results didn’t reveal as much as these 3 did). What I found was great. Paris and London each had a large stake in literature throughout this period but New York became this rising star that overtook Paris in about 1865. Paris retook second place to London in 1868 before New York surpassed it permanently in 1882. London was able to maintain its top-dog status until about 1911 and New York never looked back. I tried a number of other cities as well but as far as I can tell, New York is the most influential literary city in the world. Note that this only applies to English literature and for the select number of books that Ngram draws from.

Another term list I decided to look at was Aristotle vs. Plato, just because it’s always fun to pit two philosophers against each other. I searched only those two terms, starting from what I believe is the oldest date of books they have scanned, which is 1500. To be frank, it was awesome.

Seriously, am I the only one that thinks this is amazing?

I have no idea where to start. First, it seems obvious that for a significant chunk of time, probably until the late 1500s, the amount of data that Google has compiled is simply too limited to analyze, but as we move further along in time we see just how much these two philosophers fluctuate back and forth, jockeying for position in the race for prominence in literature. Also, I haven’t taken a class on Early Modern Europe in a long time, but what made the 1650s such a haven for Platonic and Aristotelian literature? Why did Plato dominate Aristotle in the 1660s? There are so many questions that arise from this information that I simply don’t have the answers to. For fear of going on too long I won’t go much deeper into this topic, but seriously: I just plugged in some terms that popped into my head; imagine what some focused research could surface.

The Ngram Viewer (as well as its connection to Google Books) and other tools like it are going to revolutionize the way that historians conduct research and present findings. My toying around with the tool revealed so much in such a short amount of time. It will be very cool to see what findings will be drawn from this tool in the next 10 to 20 years.

The two sites, Mining the Dispatch and the Google Ngram Viewer, seem quite scientific to me, and it is difficult to understand their purpose from what can (or cannot) be read on their main pages. Each contains statistical jargon and lacks a simplified explanation for someone unversed in the required fields. The language and description in Mining the Dispatch are very difficult to understand, and I think this makes entering the site more daunting. The Google Ngram Viewer is much more user-friendly; after playing with the site, I began to understand its purpose.

Mining the Dispatch is a historical site that pulls information from the Richmond Daily Dispatch newspaper. The site focuses on social and political life in Richmond, Virginia, the capital of the Confederacy. The description on the home page is heavily laden with historical and scientific rhetoric. The wording is not simplified for a general audience, so it can be hard to understand for someone outside the topic or discipline; it has a very specific audience in mind (e.g., “It uses as its principle methodology topic modeling, a computational, probabilistic technique to uncover categories and discover patterns in and among texts.”). The site contains a lot of written text, with graphs on every topic, which is a nice visual to have while reading. Clicking the links in the topics section was interesting, though I found their purpose difficult to understand; for example, among the first ten fugitive slave ads on the screen, three were the same $10 reward for a slave named Parthena, and the other seven were identical one-hundred-dollar rewards for a slave named Sam. I also found the graphs difficult to read because they had no labels or titles. Overall I think the links are good if you are researching specific information on Richmond, Virginia, or the information could be used in conjunction with other research.

The Google Ngram Viewer seems more user-friendly. It shows the use of terms over time (in years): you can enter a couple of specific words, and it searches for the frequency of their use in literature and graphs it. It is easy to find the “About” tab at the bottom of the page, which gives a better breakdown and explanation of the graphs; it describes its findings more systematically, which helps a lot in understanding the site. I did hit a section (table charts) on the About page that was completely blacked out; it looked like the text was highlighted, but it didn’t work and I couldn’t view the information. The site also provides a FAQ section that gives useful tips (e.g., that Ngram searches are case-sensitive) for problems you might experience, and it helps in analyzing findings.

The Science article Quantitative Analysis of Culture Using Millions of Digitized Books is a research article that focuses on the use of digital technology to examine historical trends through digitized books. Some of the things that can be learned from this study include the evolution of grammar; trends concerning events, dates, and names; the detection of censorship and suppression; and culturomics. Basically, computers allow humans to look at enormous amounts of data and use analytic software to examine the past and possibly make predictions for the future. Because a computer can process vast amounts of data at extremely high speeds, data can now be analyzed quickly, whereas in the past it would have taken a lifetime, or been impossible. The article also provided a variety of graphs comparing data over time.

The possibilities of each of these examples of technology are promising; however, I found the Science article to be quite complex and difficult to follow. The article was someone else’s work and referred to using digital technology to conduct research, so I did not find it as useful as the more hands-on Google Ngram Viewer. While each of the sites provides statistical analysis and information, I found that, overall, the Google Ngram Viewer lets you do some of the research yourself and is quite easy to use and understand.

Simply put, textual analysis involves quantitatively studying the use of language in a given text, looking at the frequency with which words are used, rather than qualitatively studying the text itself. My initial reaction to this type of analysis was that it was counterintuitive, even counterproductive, for studying history, since history is based almost entirely on qualitatively studying documents and texts. But as the Science magazine article “Quantitative Analysis of Culture Using Millions of Digitized Books“, the Mining the Dispatch project, and the Google Ngram Viewer demonstrate, taking a new and different approach to dealing with text can radically alter how we think about studying the past.
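At its simplest, this kind of quantitative analysis is just counting. For a single text, a few lines of Python suffice (a toy sketch of the idea, not any of these projects’ actual code):

from collections import Counter
import re

text = ("The Great War ended in 1918. The war changed everything; "
        "the war is still studied today.")
words = re.findall(r"[a-z]+", text.lower())

# Relative frequency: occurrences of each word divided by total words.
counts = Counter(words)
total = sum(counts.values())
for word, n in counts.most_common(3):
    print(f"{word}: {n} ({n / total:.1%})")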

The article in Science magazine deals with a project to digitize roughly 5 million books, about 4% of all books ever printed, in order to better understand changes in language, collective memory, culture, and censorship, among other things. By looking at the frequency with which a particular word is used, the researchers were able to learn a great deal about how changes in language reflect changes in culture caused by historical events. One of their examples involves the frequency of the terms “The Great War”, “World War I”, and “World War II”. Looking at how often these terms are used from 1914 onward in the Google Ngram Viewer, a Google project which applies textual analysis tools to the vast number of books digitized in the Google Books database, you can clearly see how usage of “Great War” dropped off significantly after the 1940s, replaced by “World War I”, alongside a significant increase in the use of “World War II”.

There are some problems with this kind of approach to studying history. Simply looking at how often a term is used gives absolutely no context to the reader. It gives no reason for the decline in “Great War”, or the increase in “World War I” or “World War II”. As people studying History, we of course know that the term “World War I” didn’t exist until the early 1940s, because until then there was no need to differentiate between the Great War which began in 1914 and the Great War of 1939.

However, this lack of context can provide very interesting opportunities for historians. One example is the Mining the Dispatch project, which applies textual analysis tools to digitized copies of the Richmond Daily Dispatch newspaper between 1860 and 1865, in an attempt to better understand the city which played such a significant role in the American Civil War. By searching and graphing various terms, they were able to find specific trends in the use of those terms in the newspaper, giving a deeper look into life in the city.

Tools such as the Google Ngram Viewer definitely provide some rather fun and interesting opportunities, both for historians and for people who are generally interested in cultural trends. I personally had quite a bit of fun searching terms related to my own interests and then trying to figure out why each term is graphed the way it is. For example, I indulged the somewhat geekier side of me and searched “Godzilla”, looking specifically at English-language texts. The graph shows little to no use of the term until the early 1980s, steadily increasing until its use spiked in 1993, and sharply declining after 2000 before leveling off in the mid-2000s at a relatively high rate of usage. With a bit of reading, I figured that since the search was limited to English-language (i.e., Western) sources, the term wasn’t popularized until the early 1980s, when some of the Japanese films began to be edited and re-released for a Western audience. The release of Jurassic Park in 1993 accounts for the spike in that year, due in part to the “giant lizard-monster” theme of the movie and talk of an American remake around the same time. Use spiked again in 1998 with the release of that film and then sharply declined afterwards, but it stayed at a steady rate, significantly higher than before 1993, with the character and series having been integrated into Western culture.

Some of the other interesting terms I searched, and then felt compelled to research, include medical terms such as “Amputation” and “Amputate”, which show a high rate of use until the 1790s, when the words’ use declined as the procedure became less necessary, as well as terms such as “Communism” and “Communist”, which show a slow and steady increase, peaking during the American Red Scares and slowly declining after 1990 with the fall of the Berlin Wall and the Soviet Bloc.

Dear class:

As noted in an e-mail to you, we will be meeting in the Dana Porter Library, Lower Level outside the Archives and Special Collections on TUESDAY at our regular time for our “Digitizing Primary Sources” class. Come prepared to take notes and explore some primary documents, and we’ll be discussing the visit on Wednesday (during tutorials) and Thursday. I will have project proposals ready for you after our time in the archives.

For Wednesday, here are some links (no need to visit them before class). The theme of the tutorial will be “Making History Beautiful.” 

Historical Statistics of Canada (for data)

Infogr.am (for visualization)

IBM Many Eyes (for visualization)

We Feel Fine (for fun and inspiration)

See you all later this week.

Books have been one of the primary resources available to historians when studying the past. A book can describe an event or a person, or provide insight into a historical figure’s mind. Since the invention of the printing press, the number of published books has been rising exponentially every century, which makes it harder for historians to sort through all the available information out there. Quantitative Analysis works with a corpus of 5,195,769 books, only about 4% of all books ever published, and notes that just reading the English-language entries from the year 2000 alone would take a person 80 years. With modern technology driving the cost of publishing to a record low, how can current, and more importantly future, historians sort through the mounds of information?

The Google N-Gram tool is one method of sorting through this data. It allows researchers to look for patterns in the way our culture uses words at certain periods in time. It does not necessarily let historians actually read the books, but rather serves as a cultural map for them to use when examining a certain issue, such as slavery.

[Figure: Google N-Gram Viewer graph for “slavery”]

By searching the word “slavery” in the Google N-Gram Viewer, we can see that from the 1500s to about 1740, the word was used periodically but was not popular. Then, around the time of the American Revolution in 1776, the word becomes much more popular, and the trend continues to climb. After 1920 the word starts to decline, but it remains prominent. This simple analysis can indicate to historians which years were important for the anti-slavery movement, judging by the amount of discussion of the topic in a given year. With increasing amounts of information, pattern recognition will help historians narrow their focus to a handful of potential dates relevant to their research. Furthermore, pattern recognition paints a larger picture of our global society as a whole: it shows the way our views, values, and ambitions progress through time.

The limitation of this tool is that it is a largely English-based application, with 361 billion English words, while the next most represented language is French with 45 billion. Considering that much of the planet speaks Mandarin, Cantonese, or Hindi, we are still missing a huge chunk of information. The hope is that, with time, access to information in all languages will increase, not just in English. As mentioned before, the N-Gram Viewer will not show any detail; it simply identifies the pattern of one n-gram or a combination of n-grams.

Another useful tool is Mining the Dispatch, created by Robert K. Nelson of the University of Richmond. It uses the archive of the Daily Dispatch to create, or model, “topics”, which essentially means grouping words or phrases that are likely to appear together in the same document. Like the N-Gram Viewer, topic modeling is useful on a macro scale for looking at larger patterns. Continuing with the topic of slavery, one of the topics available in Mining the Dispatch shows a graphical representation of fugitive slave ads by year. What is particularly great about this tool is that, unlike the N-Gram Viewer, you can examine the actual documents while looking at the chart. A historian could use the N-Gram Viewer to identify a particular time of interest, for example the mid-1800s, and then use Mining the Dispatch to take a closer look at what was happening during those years.
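A rough feel for “words likely to appear together in the same document” can be had with a simple co-occurrence count, sketched below in Python on invented toy documents; topic modeling proper is probabilistic and far more sophisticated, but the intuition is similar:

from collections import Counter
from itertools import combinations

# Invented toy documents, each reduced to its set of distinct words.
docs = [
    {"ranaway", "reward", "clothing", "jacket"},
    {"reward", "clothing", "boots", "ranaway"},
    {"cotton", "market", "prices", "flour"},
]

# Count how many documents each unordered word pair appears in together.
pair_counts = Counter()
for doc in docs:
    for pair in combinations(sorted(doc), 2):
        pair_counts[pair] += 1

for pair, n in pair_counts.most_common(3):
    print(pair, n)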

I think both of these tools are a great start for historians in beginning to solve the problem of information overabundance, but they are by no means the complete solution. The amount of information will never decrease; it will only continue to increase, and historians need to begin adopting tools such as the N-Gram Viewer and Mining the Dispatch as part of their regular tool kit.

For most of human history, gathering information was restricted to what a single person or group could both gather and read. Despite new technologies which made the gathering of information easier, history remained limited by its largely narrative nature. Now, with the advent of computer programs with powerful searching capabilities, this is no longer the case. However, as with all quantitative studies, this new tool comes with a cost, namely that of context. The real question, therefore, is how much raw words can truly tell a historian without personal knowledge of their environment.

In order to evaluate the advantages and disadvantages of a new tool, one must look to those that have already begun to use it, in this case three particular websites which delve into the usefulness of quantitative analysis in the study of literature. The first of these, the Google Ngram Viewer, is a tool which looks for specific words within the Google Books corpus and maps out the frequency of their use over time. Mining the Dispatch, a website chronicling the studies of Robert Nelson, scrutinizes word cluster trends within the editions of the Richmond Daily Dispatch issued during the Civil War in order to analyze changes in social and political life. Finally, Quantitative Analysis is an article published in Science which discusses the value of quantitative analysis of literature and the new methodology of culturomics.

Through the examination of these three websites, several advantages become quite clear. First and perhaps foremost, tools of quantitative analysis allow historians to capitalize on the growing number of available sources. When making judgments based on qualitative study, there is always the risk of missing a crucial contradictory source, but with quantitative analysis this risk is greatly reduced. In addition, quantitative studies, unlike their narrative-based counterparts, are able to map large scale changes, such as those of language and the public consciousness. Finally, the massive scale of the new tools enables them to overcome one of the classic weaknesses of smaller quantitative studies. Population bias, a constant threat when using information gathered by even moderately large surveys, is almost inconceivably unlikely when the corpus being examined is as large as Google Books.

However, this is not to say that quantitative analysis is without flaws. The largest problem with raw data is that it lacks context. Without knowing what sentences, nuances, and ideas surround a particular word in a particular place, it is impossible to make complete judgments on what they mean. Useful conclusions almost always require other sources of knowledge in order to discover why a particular trend occurred, even if quantitative data makes finding the trend easier. In addition, it is impossible to control for exceptions to the regular use of a given rule. Tools like Google Ngram Viewer also tell us nothing about what is being said or thought, only what finds its way into the written record. Finally, quantitative information is equally or even more dependent on the researcher’s interpretation than any other source, since the words have been extracted from the writer’s original intentions and thoughts.

In order to illustrate the advantages and disadvantages of quantitative analysis, I would like to examine two examples. First, let us look at a graph produced by Google Ngram Viewer of references to ‘Sherlock Holmes’.

[Figure: Ngram Viewer graph for “Sherlock Holmes”]

This graph, while illustrative of the role of other mediums on literature when combined with additional research (the spike in the late 1920s was preceded by several movies about Sherlock Holmes), is completely unintelligible without that or similar knowledge. In addition, the graph, while seemingly significant, is incapable of proving the movies caused the spike in writing about Sherlock Holmes, as it is possible the two had a mutual cause. Another useful example is the following Ngram, which maps the usage of various terms that commonly refer to the war that occurred between 1914 and 1918.

[Figure: Ngram Viewer graph of terms referring to the 1914–1918 war]

As Quantitative Analysis discussed, n-grams can be useful in mapping out changes in language, but, as Mining the Dispatch notes in its section on soldiers, they cannot tell us what the words refer to; in this case, the term ‘Great War’ can be seen to have come into use well before 1914.

The frequency words occur can tell us a great deal about what is being discussed, shedding insight on long term developments and taking advantage of the wealth of information now available to historians. However, words out of context are in the end just words. Especially in the study of the past, quantitative data in isolation tells only a partial story, being unable to properly evaluate the way the word is being used and the context it is being used in. Quantitative analysis is a powerful new tool in the historian’s arsenal, but it has a limited capacity without being supported by more traditional methods of research.