Text Analysis Experiment: Evaluating word usage with the Google Ngram Viewer

Last week we looked at some different methods of text analysis, and I wrote about the Google Ngram Viewer, a tool that scans Google Books’ collection for every instance of a word and graphs that word’s usage over time as a percentage of all the words in books from each year. This week I am using the Ngram Viewer to conduct a text analysis experiment.
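To make the idea of "percentage of usage over time" concrete, here is a minimal sketch of that calculation in Python. It is not how Google actually builds the Ngram Viewer; the function name `yearly_share` and the tiny corpus are made up for illustration, and the sketch only handles single words (a two-word term like “ice hockey” would need to be counted as a pair).

```python
from collections import Counter

def yearly_share(books, term):
    """Return {year: percent of that year's words that are `term`}.

    `books` is a list of (year, text) pairs. This is a hypothetical
    stand-in for a real corpus, not the Ngram Viewer's own data format.
    """
    hits = Counter()    # occurrences of the term per year
    totals = Counter()  # total word count per year
    for year, text in books:
        words = text.lower().split()
        totals[year] += len(words)
        hits[year] += words.count(term.lower())
    # Skip years with no words to avoid dividing by zero.
    return {y: 100.0 * hits[y] / totals[y] for y in sorted(totals) if totals[y]}

# Made-up toy corpus, for illustration only.
toy_books = [
    (1990, "hockey is played on ice"),
    (1990, "the game of hockey"),
    (1991, "soccer is played on grass"),
]
shares = yearly_share(toy_books, "hockey")
```

Because each year's count is divided by that year's total, a year with more books does not automatically look like a year with more hockey writing.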

I thought it would be interesting to see how often different sports are written about over the years, and to see what events may have caused the changes. For this experiment, I decided to focus on just hockey and soccer, because those are the sports I know the most about and I had an idea of the big events that could push usage into large peaks or valleys. For hockey I had to remember to search “ice hockey”, because I didn’t want any results about field hockey or whatever other kind of hockey there is. For soccer I also searched “football” to cover foreign usage. I searched these keywords in different language collections as well, because different parts of the world would have used these words differently at different times.

What I found was pretty interesting, and for the most part what I expected. I knew that big events like World Cups or Olympics would affect how much these sports were written about, but for other changes I had to do some research to even guess why a sport was being written about more than usual. Obviously this is not the best way to run this experiment: the Ngram Viewer has no way to show where a book was written or published, only the language collection it belongs to, so every conclusion I came to is an assumption based on an event that occurred in a year where usage was notable.

Looking at “ice hockey” in the British English collection (which I am assuming includes Canadian sources), a few things stand out. As a topic in English writing, ice hockey barely registers before the creation of the NHL in 1917, and as expected there is a sharp rise in the usage of the term after the NHL expanded from 6 teams to 12 in 1967. The largest growth in usage occurs between 1995 and 2002. I’m assuming the 1994-95 NHL lockout had some effect on that, as did the introduction of women’s hockey and NHL players into the Olympics in 1998. The peak comes in 2002, where I can only assume that every author in Canada wrote a book about Canada winning the gold medal at the Salt Lake City Olympics and breaking a 50-year drought. There is a plateau around 2005, which is probably related to the cancelled 2004-05 NHL season.

In American English sources there are some other interesting assumptions to be made. For example, there is a big growth in the usage of “ice hockey” starting in the late 1980s, which coincides with the 1988 Wayne Gretzky trade to Los Angeles that many people believe sparked the popularity of hockey in the United States, so maybe this Ngram Viewer experiment can give a bit of evidence for that claim.

For soccer (football) I realized that English sources, especially American ones, wouldn’t be that interesting, so I looked at other languages like French, German, and Spanish to see if there were more interesting stories. In French sources, there is a pretty significant dip in the usage of “football” in 2002. That makes sense: when I did some research about soccer in France, 2002 turned out to be an embarrassing year in which the team went out of the World Cup in the group stage without winning a game or scoring a goal, despite coming into the tournament as the defending champions.

Looking at German sources for “football” and “soccer”, the most obvious change occurs between 2004 and 2006, where usage grows very sharply and then drops off right away after 2006. I am pretty confident in suggesting that this is because Germany was hosting the World Cup in 2006, so in the years leading up to it there was tons of literature written about the event, and afterwards there was nothing left to write about. The graph shows the usage of both “football” and “soccer” declining at the same time and in the same proportion, which is further evidence for this theory. Germany not winning might have also played a role in the sudden decrease.

The stats for “football” and “soccer” in the Spanish language are the most interesting because the reasons for the results are not entirely obvious. The Spanish language is used in many countries, so there is no single reason, but a few assumptions can be made. There is a huge peak in the usage of “football” around 1930 and then a large drop-off through the Second World War. The first World Cup was held in Uruguay in 1930, which could explain the peak, but the dramatic decrease has no one clear explanation. One reason could be the Second World War itself. Another is that Spain was undergoing a civil war in the late 1930s, and the installation of a fascist government may have had an impact on how much football was written about in Spanish literature.
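The kind of reading I did by eye on the graphs (find the peak year, find where usage jumps the most) can also be sketched in a few lines of Python. This is only an illustration under my own assumptions: the helper names `peak_year` and `steepest_rise` are hypothetical, and the numbers in `sample` are placeholders I made up, not values read from the Ngram Viewer.

```python
def peak_year(series):
    """Return the year with the highest usage value."""
    return max(series, key=series.get)

def steepest_rise(series):
    """Return the (year, next_year) pair with the largest increase in usage."""
    years = sorted(series)
    pairs = zip(years, years[1:])  # consecutive years present in the series
    return max(pairs, key=lambda p: series[p[1]] - series[p[0]])

# Placeholder values for illustration only, NOT real Ngram Viewer output.
sample = {2003: 1.0, 2004: 1.2, 2005: 2.0, 2006: 3.5, 2007: 1.5}

peak = peak_year(sample)      # year with the largest value
rise = steepest_rise(sample)  # consecutive years with the biggest jump
```

A script like this only points at the years worth investigating. Working out why usage changed in those years still takes outside research, just as it did in this experiment.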

Overall, this experiment in text analysis using the Google Ngram Viewer provides some very interesting information, but it is much too basic to support any actual historical analysis. The Ngram Viewer is great for getting a simple look at the usage of a term in writing over time, but that simplicity is not enough to build a stable thesis on. It doesn’t show data from every single book ever written, just the ones that Google has access to, and there is no way to specify where you want your results to come from. For this experiment I was assuming a lot of things because of the limited features of the Ngram Viewer, like which countries were providing the sources for each search. For hockey I was assuming that British English sources were mostly Canadian, and for soccer it was hard to even guess which countries were providing the majority of French or Spanish results. The function for searching books by keyword is also very basic and not very helpful in this type of experiment, which meant I was relying on my own research outside of the Ngram Viewer to try to figure out why the usage of the keywords went up or down as it did.
