According to Google’s new n-gram tool, when researching history, words count.
Literally.
By analyzing over 500 billion words from 5.2 million books in Chinese, English, French, German, Russian, and Spanish, the n-gram tool allows users to track the usage of words from 1500 AD onwards.
The implications of this tool in terms of historical and cultural research are just beginning to come to light. In the article “Quantitative Analysis of Culture Using Millions of Digitized Books,” Jean-Baptiste Michel and his fellow researchers suggest that Google’s n-gram can be used to track the emergence of diseases, state censorship and the relative “celebrity” of a given person.
There is no doubt that the n-gram is, and will continue to be, an extremely useful tool in historical inquiry. However, there are some limitations that need to be addressed.
Firstly, the Google n-gram is limited with regard to language. Most of the collected works are written in English. Although this is helpful for me (an Anglophone student from Canada), some of the world’s most spoken languages, like Arabic and Hindi, are not even present in the database.
Furthermore, as Jean-Baptiste Michel notes, the Google n-gram tool measures the frequency of words within books, and books alone. Other publications, like newspapers and academic journal articles, are therefore excluded from every search. The impact of this becomes clear when you compare an n-gram search on Google with a site that mines local newspaper text, like Mining the Dispatch. On Mining the Dispatch, users can see the relative frequency of fugitive slave ads that ran in the Richmond newspaper during the Civil War. Because Google’s n-gram has a much larger scope and cannot search newspapers, this kind of historical deduction cannot be made with it.
I think it’s also important to note that language, although an important (and often forgotten) indication of culture, is certainly not the only one. As historians know, geography, religion and class all play a critical role in shaping the thoughts, actions and mindsets of a given people. Language is only one small piece of what makes us who we are.
Indeed, Canada, the United States, and the UK may all be English-speaking nations, but we have very different cultures.
Just to prove this point, I decided to gauge the relative frequencies of three major sports: baseball, hockey, and football. From 1900 to 2008, the frequency of hockey was dismal compared to football and baseball. However, this search took into account all English books written during the designated period. I imagine that if I were to search a corpus containing only Canadian books, hockey would be mentioned far more frequently.
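To make the idea of relative frequency concrete, here is a minimal sketch in Python. It is not how Google's tool is actually implemented. The three "books" are an invented toy corpus, and the region labels and the `relative_frequency` helper are hypothetical names used only for illustration. It simply shows how the same word can look rare across a broad corpus and common within a narrower one.

```python
import re
from collections import Counter

# Hypothetical toy corpus of (year, region, text) records.
# These are invented sentences, not real n-gram data.
TOY_CORPUS = [
    (1950, "CA", "the hockey game ran late and the hockey crowd stayed"),
    (1950, "US", "baseball and football filled the sports pages"),
    (1950, "UK", "the football match ended in a draw"),
]


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def relative_frequency(word, records):
    """Return the share of all tokens in `records` that equal `word`."""
    tokens = [t for _, _, text in records for t in tokenize(text)]
    if not tokens:
        return 0.0
    return Counter(tokens)[word] / len(tokens)


# Frequency of "hockey" across every record in the toy corpus.
all_books = relative_frequency("hockey", TOY_CORPUS)

# Frequency of "hockey" when only the Canadian record is searched.
canadian_only = relative_frequency(
    "hockey", [r for r in TOY_CORPUS if r[1] == "CA"]
)

print(f"all books: {all_books:.3f}, Canadian only: {canadian_only:.3f}")
```

In this toy data, "hockey" makes up a much larger share of the Canadian text than of the corpus as a whole, which is the effect a Canada-only search would be expected to show.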
But more than that, words themselves are limited.
Think about Twitter. Depending on the words we choose to use in our hashtags, our statuses are more searchable. Similarly, if we tweet about a topic that’s trending, what we say is viewed by a larger audience. But what if we don’t use the right words to categorize what we’re saying? What if we type in an extra “s” or add an apostrophe where it doesn’t belong? But more pertinent than that, what if we say one thing, and mean another?
My earlier search on sports is a case in point. In English, the word “football” can mean either soccer or American football. My search did not account for this discrepancy. Every mention of the word “football” was counted, whether the book was actually talking about soccer or American football. And therein lies another problem with Google’s n-gram: the tool gives us no sense of context.
And for the historian, context is king.
An old Chinese proverb claims that, “If you wish to know the mind of a man, listen to his words.”
After playing around with the Google n-gram, and uncovering its uses, I think this is extremely accurate. However, words are only one investigative tool in the proverbial historical tool-belt that can be used to understand history and culture.
You are so right about the need to include sources other than just books. Also, I really like your last sentence – context is everything! Overall Google has def taken a step in the right direction, but I agree with you that it def needs revision and improvement.
I was going to ask about ‘football’, given that there’s more than one sport by that name (there’s also Australian Rules Football, and on occasion people have dropped the word ‘rugby’ from ‘rugby football’, although that last one is vanishingly rare…)
Language is a fascinating subject, its impact both obvious and subtle, if only because it informs the how you think as much as the what you think. I hope that makes sense…
This is a really cool tool (which I never even knew about). Your analyses of its limitations, though, are spot on, and this is something to keep in mind when using it. Congrats on being FP!
Wow, I didn’t realize that Google n-gram had such limitations. As a History major I used it in presenting a project, but now I’ll know to keep all this in mind the next time I decide to use it.
Thanks for posting your thoughts on this resource.
Wow, that’s quite the tool. Problem with things like that is they are always so literal and don’t account for other factors. Thanks for the info!
I’ve always found it intriguing how linguists are brought in to establish “fraud or genuine” on archeological dig finds that are engraved – many ‘fakes’ were rooted out because the words or syntax of the engraving do not match the time period the piece is dated to – even though the ‘written words’ are of ancient language – –
Great post and so glad you got Freshly Pressed so I could find you in WP land – Congrats!
Wow. What a clever concept for Google to have produced. Thank you for introducing me to this new technology. It is very, very interesting and I think you are right, it will be very useful for many years to come. Thanks for sharing this. You wrote it in a way that anyone without much tech skill (myself) can easily understand the concept behind it. Good write.
I haven’t heard of the n-gram tool. But I went to play around with it, and it is very cool. I also noticed that you can download the data to play with yourself.
I think that not only historians but others should mind what has been said. Even if they don’t agree, I think there should be some respect. Don’t you?
This is pretty interesting. I am glad you were freshly pressed.
Please tell me this doesn’t mean future generations will find it easier to learn about Edward and Bella than Abraham Lincoln… 😦
Dang…
Abraham who?
Kidding…
This is very interesting from a rhetoric perspective. Despite the various conflicts, think about how political leaders can use this tool to their advantage. Rhetoric is all about persuasion, and if you can use key words (trending words), and highlight current problems which may be affecting a large audience you have one a significant victory.
Great post by the way.
lol remind to proofread next time… *won.
Theoretically, anyone could use that. I mean, if you use words that appear to be trending in your blog tags, tweets, or any sort of post then it’s more likely to be shown because more people are searching for those terms. The politician point is a good one though, but I see this being used more from a business promotion standpoint at first. Very interesting tool though, lots of opportunities for use as it gets better and better at differentiating things in context (as it inevitably will).
I’ll second that great post comment!
Thank you everyone for your wonderful thoughts and comments. I can’t tell you how much I appreciate it!
Congratulations on being Freshly Pressed and thanks for sharing your article. I am not very techie but you were able to explain in such a way that it made sense to me and I learned something new. Now I can start up a conversation about Google’s new n-gram tool with my 13 year old (very techie) son and maybe impress him?
😉
Congrats on FP. This is news to me, an avid history buff, and admitted fanatic on ancient history – I sigh. My red flags are screaming in horror. Yes – useful perhaps in certain contexts, yet dangerous if taken as undeniable truth. Written history excludes all but the opinion of the author. It omits people whose history was eradicated, or those without written words such as the Hopi.
I’m glad I stumbled upon this post on Freshly Pressed. As a history undergraduate I found it a very interesting read and I will definitely explore the rest of your blog!
Interesting stuff. As a journalist writing non-fiction, and someone who casually analyzes content as I read it, I also search — often in vain — for women’s voices in histories of…anything. We all know that “history” is written by those in power, by those whose voices are thought to matter most. I wrote a book about American women and guns (I’m a fellow Canadian) and was most struck by how many interesting/brave women I discovered when I focused on their own narratives. If you read David McCullough’s book about the building of the Brooklyn Bridge, he gives barely a paragraph to the historic role of a woman in its completion…
Congrats on being FPed!
[…] Should Historians “Mind” What’s Been Said? (hist291.wordpress.com) […]
Reblogged this on The Narcissistic Anthropologist and commented:
Love this affirming article: a review of Google’s n-gram, which is a tool that supposedly can “quantify culture”. A good piece of the puzzle to help us understand cultural evolution? Sure. But, as I always say, context is everything. And you can’t paint a complete picture without it. It’s why Anthropology and Sociology will never be fields at risk of becoming irrelevant. Rather, they will become increasingly important as we resist the temptation to go deeper and deeper down the virtual rabbit-hole.
A clever post indeed…
DAATTTTAAAAAA… (insert Homer Simpson drool here). Great analysis. Google says it’s randomly sampled modern English books, but how do they select books in other languages? I wonder if they’re also randomly sampled or there’s some rhyme or reason.
Thanks for writing about the new tool. I think it’s a game changer given its accuracy and efficiency. Google making history once again!
Pretty cool.. Thanks for sharing this..
This is very interesting. It is always helpful to have one extra tool to back up your research – or even provide you with a place to begin. It is a shame that the language coverage is so limited. I am sure that will severely inhibit its usefulness.
Very Interesting
Impressed with this new data, as I am infatuated with History.
Reblogged this on Simon Hamer and commented:
Interesting – I hope there will be some sort of accurate filtering against white noise, otherwise our era will be remembered for Bieber and Gaga. Is that really representative of this generation?
Great sentence, “Language is only one small piece of what makes us who we are.” I watched Les Misérables last night. It confirmed music is a crucial outlet to understanding, not to mention beautiful.
Wow this is really cool. If they develop this tool more it will be awesome to see what they find.
Woah
This was a very interesting post. I love ending on context is king.
Reblogged this on pluginforslot and commented:
All I ever wanted to have in my life.
Gonna play with a lot in the future.
I recently played around with ngram to track the change in the use of names given for New Zealand’s 19th century wars. While there were a few pitfalls to watch out for I was pleasantly surprised to see the results mirror the trends that I expected.
It seems to be a pretty exciting time to be a historical researcher and I hope that n-gram will continue to grow as a strong tool, hopefully including newspapers sooner than later.