According to Google’s new n-gram tool, when researching history, words count.
By analyzing over 500 billion words from 5.2 million books in Chinese, English, French, German, Russian, and Spanish, the n-gram tool allows users to track the usage of words from 1500 AD onwards.
The implications of this tool in terms of historical and cultural research are just beginning to come to light. In the article “Quantitative Analysis of Culture Using Millions of Digitized Books,” Jean-Baptiste Michel and his fellow researchers suggest that Google’s n-gram can be used to track the emergence of diseases, state censorship and the relative “celebrity” of a given person.
There is no doubt that the n-gram is, and will continue to be, an extremely useful tool in historical inquiry. However, there are some limitations that need to be addressed.
First, the Google n-gram is limited with regard to language. Most of the collected works are written in English. Although this is helpful for me (an Anglophone student from Canada), some of the world’s most spoken languages, like Arabic and Hindi, are not even present in the database.
Furthermore, as Jean-Baptiste Michel notes, the Google n-gram tool measures the frequency of words within books, and books alone. Other publications, like newspapers and academic journal articles, are therefore excluded from every search. The impact of this becomes quite clear when you compare an n-gram search on Google with a tool that searches local newspaper text, such as the site Mining the Dispatch. On Mining the Dispatch, users are able to see the relative frequency of fugitive slave ads that made it into the local Richmond newspaper during the Civil War. Because Google’s n-gram has a much larger scope and cannot search newspapers, this kind of historical deduction cannot be made with it.
I think it’s also important to note that language, although an important (and often forgotten) indication of culture, is certainly not the only one. As historians know, geography, religion and class all play a critical role in shaping the thoughts, actions and mindsets of a given people. Language is only one small piece of what makes us who we are.
Indeed, Canada, the United States, and the UK may all be English-speaking nations, but we have very different cultures.
Just to illustrate this point, I decided to gauge the relative frequencies of three major sports: baseball, hockey, and football. From 1900 to 2008, the frequency of hockey was dismal compared to football and baseball. However, this search took into account all English books written during the designated period. I imagine that if I were to search a corpus containing only Canadian books, hockey would be mentioned far more frequently.
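For readers curious about what a “relative frequency” actually measures, here is a minimal sketch in Python. It is not Google’s implementation: the function name `relative_frequency` and the two tiny corpora are hypothetical, and the sentences are invented purely for illustration. The idea is simply to divide the number of times a word appears by the total number of words in the corpus being searched.

```python
import re
from collections import Counter

def relative_frequency(texts, word):
    """Occurrences of `word` divided by the total number of words in `texts`."""
    tokens = []
    for text in texts:
        # Lowercase and split into simple word tokens (an assumed, naive tokenizer)
        tokens.extend(re.findall(r"[a-z']+", text.lower()))
    if not tokens:
        return 0.0
    return Counter(tokens)[word.lower()] / len(tokens)

# Hypothetical mini-corpora: invented sentences, not real book data
all_english = [
    "Baseball is the national pastime",
    "The football match drew a crowd",
    "Football and baseball fill the papers",
]
canadian_only = [
    "Hockey night is a winter ritual",
    "The hockey rink froze early",
]

for sport in ("baseball", "football", "hockey"):
    print(sport,
          relative_frequency(all_english, sport),
          relative_frequency(canadian_only, sport))
```

Even in this toy example, the same word can look rare or common depending entirely on which books make up the corpus.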
But more than that, words themselves are limited.
Think about Twitter. Depending on the words we choose to use in our hashtags, our statuses are more searchable. Similarly, if we tweet about a topic that’s trending, what we say is viewed by a larger audience. But what if we don’t use the right words to categorize what we’re saying? What if we type in an extra “s” or add an apostrophe where it doesn’t belong? But more pertinent than that, what if we say one thing, and mean another?
My earlier sports search is a case in point. In English, the word “football” can mean either soccer or American football. My search did not account for this difference. Every mention of the word “football” was counted, whether the book was actually talking about soccer or American football. And therein lies another problem with Google’s n-gram: the tool gives us no sense of context.
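The ambiguity of “football” can be made concrete with a small keyword-in-context sketch in Python. This is only an illustration under stated assumptions: the helper `keyword_in_context`, its `window` parameter, and the sample sentences are all hypothetical, and nothing here reflects how Google’s tool works. A bare count treats both mentions as the same thing, while the neighbouring words hint at which sport is meant.

```python
import re

def keyword_in_context(text, keyword, window=3):
    """Return (left, right) lists of neighbouring words for each hit of `keyword`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword.lower():
            left = tokens[max(0, i - window):i]    # words just before the hit
            right = tokens[i + 1:i + 1 + window]   # words just after the hit
            hits.append((left, right))
    return hits

# Invented sample text: one American football mention, one soccer mention
sample = (
    "The quarterback threw the football downfield. "
    "In Manchester, football fans cheered the goalkeeper."
)

hits = keyword_in_context(sample, "football")
print(len(hits))  # a raw count sees two identical "football" mentions
for left, right in hits:
    print(" ".join(left), "[football]", " ".join(right))
```

Words like “quarterback” or “goalkeeper” are the kind of context a historian would use to tell the two meanings apart, and a frequency count alone throws that information away.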
And for the historian, context is king.
An old Chinese proverb claims that “If you wish to know the mind of a man, listen to his words.”
After playing around with the Google n-gram and exploring its uses, I think this is extremely accurate. However, words are only one investigative tool in the proverbial historical tool-belt that can be used to understand history and culture.