Google Ngrams and religion

Google Ngrams has been the death of my productivity since Friday morning. Google Ngrams are the new word clouds, but more exciting because the output is a chart-- and, as I discovered when I did a mock scientific analysis of university library graffiti, people take charts seriously. It's a first step towards a potentially insightful analysis, but by itself it's not much more than eye candy.

That said, I've found some great eye candy.

It's important to not forget that the corpus is books. As I was using it, I found myself thinking of it as if it were the entirety of Google's "corpus"-- all scanned books + the entire internet. I tried pirate vs ninja vs zombie (1900-2008); I tried World Wide Web vs the Web vs information superhighway (1990-2008); I tried comparing a variety of web standards and programming languages (1992-2008); on a whim, I tried salad bar vs. wine bar vs. sushi bar (1965-2008), and the less-healthy X-and-Y food combos, like "bacon and eggs" (1950-2008). But I began to wonder if these popular culture and technology topics are optimal for a cross-linguistic corpus of books spanning a couple of centuries. A topic that spans both time and language would be ideal.

I decided on religion.

(If you want to jump to a particular section, you can choose from soul/faith/church, New vs. Old Testament, Heavenly Virtues, Deadly Sins, beyond Christianity, of nomenclature and spelling, or other religious texts).

Soul, faith, church

I checked out the words "soul" (blue), "faith" (red) and "church" (green), from 1800-2008 in English and Russian, and 1900-2008 in Spanish, French, and German. A few conclusions:

  • In English, "church" was used significantly more than "soul" or "faith" in the early 19th century, but by 1860 it was only slightly more common than "faith" and "soul". "Soul" was more common than "faith" from approximately 1890-1940. Since ~2005, "soul" has regained some ground, and they're now very close-- and on the rise.
  • In Russian, "church" dominated "soul" (which, itself, dominated "faith") until the Russian Revolution*. During the Soviet Union, "church" and "soul" were neck-and-neck, until the fall of communism when "church" shot ahead again. Since 2000, "church" has been declining more rapidly than "faith" and "soul", which have largely leveled off.
  • In Spanish, at the start of the 20th century, "soul" shot past "faith". "Soul" began plummeting in the mid-1960's, bottoming out significantly lower than "faith" (but still higher than "church", which has been consistently lower during the whole 20th century) in the late 1980's. As of about 2005, "faith" and "soul" are about even.
  • In French, like in Spanish, "church" has always been lower than "soul" or "faith" (though it's had more ups and downs over the 20th century). "Soul" was more common until about 1970, with peaks around 1925 and 1945. Since 1970, "soul" and "faith" have been going back and forth in popularity.
  • In German, "soul" had two distinct peaks in the early 1920's and mid-1940's. But other than during the first peak of "soul", "church" has always dominated. "Church" and "faith" shared the post-WWII peak with "soul", and after "soul" began to decline in the late 1960's, it's been almost as infrequent as "faith". "Church" had another peak in the mid 1990's, but has been declining since.

Old Testament vs. New Testament

In these corpora from countries with traditions of Christianity, is the Old Testament (blue) or the New Testament (red) referenced more often?

  • English: In the 19th century, the New Testament was referenced much more often than the Old Testament. As the absolute number of references started to decline in the late 19th century, the discrepancy between the number of references to the two Testaments decreased. From about 1950 to the late 1970's, the number of references was almost identical. Since then, the number of New Testament references has been increasing more quickly than the number of Old Testament references.
  • Russian: References to the New Testament have always been more numerous than references to the Old Testament-- though not always by much. When communism fell, the number of New Testament references increased much faster than references to the Old Testament; the absolute number of references has been declining since 2000, and the gap is narrowing.
  • Spanish: Unlike in the English, German, and Russian corpora, where the absolute number of references to the Testaments has risen and fallen at various points, the trend has consistently been upward for both since about 1905. What's more, there has never been much of a discrepancy between the two, except perhaps from around 1975-1990.
  • French: As in Spanish, there's a general upward trend and the two Testaments are fairly close to each other. The New Testament pulled ahead of the Old Testament from 1930-1945 and 1980-1995.
  • German: The absolute numbers of references are more like the English and Russian corpora, with highs and lows throughout the 20th century. However, the New Testament and Old Testament have largely the same ups and downs, with references to the New Testament being significantly higher except towards the end of WWII.

Heavenly virtues

The early 19th century was a good time for talking about virtues. The absolute number of references has been declining steadily since the 1840's, and while it bottomed out around 2000, virtues have been making a comeback since around 2002.

  • Chastity has always been very unpopular.
  • Temperance has been almost as unpopular as chastity, except during the temperance movement in the late 19th century and during Prohibition.
  • Diligence has been about as unpopular as chastity and temperance since the 1940's, although it has been referred to more commonly since 2003-- perhaps due to the phrase "due diligence"?
  • Humility was matching the decline of diligence from the 1840's to 1920's, but has remained largely steady since then, instead of declining to the level of chastity and temperance.
  • Kindness was the most commonly referenced virtue until about 1920. By 1940, it settled firmly into third place. In mid-2006, it overtook charity.
  • Charity always played second fiddle to kindness (even during a spike in the 1870's), but from the 1940's - 1960's it was tied with patience for first place. Since then, it has trailed only slightly behind patience.
  • Patience was solidly #3 for a long time, but by 1940 it was sharing first place with charity, and has pulled ahead slightly since the 1960's.
Seven Heavenly Virtues, 1800-2008

    Deadly sins

    The seven deadly sins have, overall, experienced a pattern similar to that of the heavenly virtues, albeit with a smaller post-2002 comeback.

    • Gluttony has never been referenced much.
    • Sloth was a greater concern from 1800-1830, before declining to a gluttony-like level of unpopularity.
    • Until about 1860, greed was the least referenced deadly sin. Since 1880, it's been #5-- noticeably more referenced than sloth and gluttony.
    • Lust has been a pretty middle-of-the-road sin, staying fairly steady since about 1880.
    • Envy was reliably #3 before converging with wrath around 1940. It pulled ahead around the 1970's, but has experienced a smaller post-2000 bump than wrath.
    • Wrath was #2 until the 1940's, fell behind envy in the 1970's, and is making one of the strongest post-2000 comebacks.
    • Pride is quite literally off the chart. (If you want a chart that includes pride, here it is.) Its comeback started earlier, in 2000, even though phrases that commonly come to mind involving pride (American pride, national pride, ethnic pride) have been steady or declining since then.

    Six of the seven deadly sins, 1800-2008

    Beyond Christianity

    Even in corpora from traditionally Christian countries, there's plenty to be said about religions other than Christianity.

    • Hinduism: From 1800-1840, both Hinduism and Buddhism were referenced very rarely. While Buddhism took off, Hinduism has been increasing slowly but steadily, and has lagged behind the other religions.
    • Buddhism: Interest in Buddhism developed steadily from around 1835 onward, but after the publication of Edwin Arnold's book The Light of Asia (a poetic depiction of the life of the Buddha) in 1879, references to Buddhism surpassed references to Islam until the late 1910's. From the late 1950's until the mid-1980's, Buddhism was referenced more often than Judaism.
    • Judaism: Until 1900, it was generally the most commonly-referenced non-Christian religion. Since then, it's been going back-and-forth with Buddhism. Since 1984, it's been ahead.
    • Islam: Other than a period of 30 years when Buddhism was (at times, only barely) ahead of Islam, it has always been the most referenced major non-Christian religion. Since the 1950's, it's been far ahead of all the others.

    Judaism, Islam, Buddhism, Hinduism

    Of nomenclature and spelling

    Transliteration is a tricky business-- particularly, it seems, from Arabic. There have been two major spelling variations for the holy book of Islam (note: I did try a number of other ones, but their results were fairly negligible):
    Spellings of Koran

    This pales in comparison to the variations in terminology (and spelling) for adherents of that faith; the standard term used today only had the majority of references after 1940:

    Muslim and variants, 1800-1960

    Today's spelling for someone who follows the predominant religious tradition of South Asia only began to dominate in the 1870's; before that, you're more likely to see references to Hindoos:

    Hindu vs Hindoo, 1800-1920

    Karma, Dharma and Nirvana provide a good example of how capitalization can have a huge effect on the general shape of the graph:
    karma, dharma, nirvana, 1870-2008
    Karma, Dharma, Nirvana, 1870-2008

    Most striking here is karma, which has increasingly been used in a more general way, not particularly tied to Buddhist doctrine:
    Karma, 1900-2008
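
The capitalization effect described above comes down to whether "Karma" and "karma" are counted as one term or two. Here is a minimal sketch of that difference on a made-up sample sentence. The function name `count_terms` and the sample text are my own illustrations, and this is not how Google Ngrams itself is implemented:

```python
import re
from collections import Counter


def count_terms(text, terms, case_sensitive=True):
    """Count how often each term appears as a whole word in text.

    Hypothetical helper for illustration only; it treats any run of
    ASCII letters as a word.
    """
    tokens = re.findall(r"[A-Za-z]+", text)
    if not case_sensitive:
        # Fold everything to lowercase so "Karma" and "karma" are merged.
        tokens = [t.lower() for t in tokens]
        terms = [t.lower() for t in terms]
    counts = Counter(tokens)
    return {term: counts[term] for term in terms}


# Made-up sample: one doctrinal (capitalized) use, two casual uses.
sample = (
    "Karma is central to Buddhist doctrine. "
    "Bad parking karma again; my karma is terrible."
)

print(count_terms(sample, ["Karma", "karma"]))
print(count_terms(sample, ["karma"], case_sensitive=False))
```

Counted separately, the capitalized and lowercase forms tell two different stories. Folded together, the casual lowercase uses swamp the doctrinal one, which is roughly what the two charts above suggest.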

    Other religious texts

    Let me conclude with the frequency of references over time for three religious texts that are dwarfed by the Upanishads, let alone the Koran and Bible: the Kama Sutra, the Tao Te Ching, and the Mahayana Sutras:
    Other scriptures, 1900-2008
    All three have generally increased over time, but the periods of particularly sharp increase for the Kama Sutra align with periods of expanding sexual freedom.

    * One other interesting effect of the Russian Revolution: "god" dominated "God" for much of the 20th century. Compare to the English equivalent.


Work-safe xkcd shirt

This summer, after I had amassed a delightful collection of geeky t-shirts, I was promoted to a managerial position where I couldn't really wear them to work. My solution? Design geeky fabric, print it at Spoonflower, and sew it into button-down shirts.

I started with a design based on birchbark letter 206, the 13th century drawing of a young boy named Onfim. It turned out quite well, though the downside was that Slavic conferences are the only environment where people recognize the pattern if you point it out. For my second design, I went through every single xkcd ("a webcomic of romance, sarcasm, math, and language", by Randall Munroe, and my absolute favorite comic), from #812 until #200, and extracted and arranged the figures. It's not for sale in the Spoonflower marketplace, due to the non-commercial provision in the Creative Commons license that xkcd uses, but anyone is welcome to download the pattern and print it for their own non-commercial use.

I've worn the shirt to work a few times already, and the results have been interesting. I sit and have long conversations with people who I know are fans of xkcd, and they don't even notice until I finally point it out. Maybe after three years of looking at graffiti, I overestimate other people's inclination towards noticing the small things around them, but it took me by surprise.

The remaining fabric scraps found a new life as an elephant:
The xkcdphant


Qi Lu on Bing, Facebook and markets: economists needed, linguists need not apply

A couple weeks ago I was fortunate enough to be invited to a talk by Qi Lu at the University of Chicago's Computation Institute. Mr. Lu is the President of Microsoft's Online Services Group, and his talk was on opportunities for collaboration with academics on the Bing search engine. Microsoft had recently announced Facebook integration with Bing, and I was particularly interested in what he had to say about it.

Mr. Lu's talk-- and particularly the Q&A that followed-- provided some interesting answers to the often-contemplated question "What is Microsoft thinking?", and thoroughly convinced me that Bing is not the search engine for me.

On Facebook integration

Mr. Lu was excited about the possibility of using recommendations from friends-- an element of real-world decision-making-- to influence Bing search results. I first heard the news about Bing’s integration with Facebook in a blog post questioning the value of decontextualized "like" data, so I decided to ask him about it.

I figured it was a pretty softball question, given the number of computational linguists in the room and the premise of "collaboration with academics". In the last year or so, services like twendz have emerged, providing automated analysis of whether tweets are positive or negative. Why not develop similar systems to analyze "Like" data? One could look at the follow-up comments on a news article or web page-- if there's content criticizing the article for being biased, or incorrect, or poorly written, then the "Like" shouldn't be valued positively. Similarly, if someone has "Liked" a product or business so they can write critical messages on the product's Wall, those wall posts could be factored into the valuation of the "Like".
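
To make the idea concrete, here is a minimal sketch of how a "Like" might be re-valued using the liker's follow-up comments. Everything in it is an assumption of mine for illustration: the tiny word lists, the function names `comment_sentiment` and `like_value`, and the scoring rule. It is not anything Bing, Facebook, or twendz actually does, and a real system would need a proper sentiment model rather than keyword matching.

```python
import re

# Hypothetical, deliberately tiny word lists (assumptions for illustration).
NEGATIVE_WORDS = {"biased", "incorrect", "wrong", "misleading", "terrible", "boycott", "awful"}
POSITIVE_WORDS = {"great", "love", "excellent", "helpful", "insightful", "agree"}


def comment_sentiment(comment: str) -> int:
    """Return +1, 0, or -1 depending on which word list the comment leans toward."""
    words = re.findall(r"[a-z']+", comment.lower())
    score = sum(w in POSITIVE_WORDS for w in words) - sum(w in NEGATIVE_WORDS for w in words)
    return (score > 0) - (score < 0)


def like_value(follow_up_comments: list[str]) -> float:
    """Value one "Like" in the range [-1.0, 1.0] using the liker's follow-up comments.

    With no comments, or only neutral ones, there is no context to go on,
    so the "Like" keeps its face value of 1.0.
    """
    sentiments = [s for s in map(comment_sentiment, follow_up_comments) if s != 0]
    if not sentiments:
        return 1.0
    # Average of the non-neutral comment sentiments.
    return sum(sentiments) / len(sentiments)


print(like_value([]))
print(like_value(["This article is biased and incorrect."]))
print(like_value(["Great piece, I love it", "so wrong"]))
```

The point of the sketch is only that a "Like" followed by criticism comes out with a negative value instead of counting as an endorsement.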

When I mentioned the ambiguity of "Like"-- given the many motivations for doing it, including gaining access to a product's Wall-- and asked what Microsoft was doing to re-contextualize the data in order to correctly assess its value, Mr. Lu didn't quite seem to understand what I meant. He agreed that "Like"ing an article can just indicate that the person finds it interesting (rather than necessarily liking the contents, particularly in reference to articles about tragedies), and he said that his group is looking into hiring social scientists to follow the emerging trends in people's behavior on-line and why they "Like" the things they do. I re-emphasized the point about "Like" being more than "this is interesting"-- it can be used to enable a form of protest. Mr. Lu replied that people will always find ways to "abuse the system", but he feels that "the good will win out".

It's understandable that from the point of view of the business being criticized, "Like"ing a product in order to protest it amounts to "abusing the system", but I was taken aback to hear the same sentiment from a company that-- one would think-- would be interested in correctly valuing the data being used to influence their search rankings, regardless of the sentiment. Then again, the way Bing apparently does its search rankings to begin with doesn't exactly inspire my confidence.

Bing's search results: brought to you by the masses

A colleague described to Mr. Lu his recent experiment looking for a map of Budapest on Bing. He typed in Budapest, figuring the city's tourism page would come up first, and that would have some appropriate map. The day he searched, all the top results were about Angelina Jolie and Brad Pitt filming a movie in Budapest. As he scrolled, he saw a headline about Brad Pitt accidentally walking into the wrong house and, feeling guilty ("They'll think I found what I was looking for!"), he clicked on it.

Mr. Lu replied that Bing results depend on what people are searching for, stating that searches for "Columbia" right before the space shuttle disaster probably referred to the country, but immediately afterwards people were searching for the latest news, and the system has to adapt nimbly. And since my colleague clicked on something he found interesting, wasn't that, in a sense, a successful search?

Apparently with Bing, a dancing baby-- or whatever else is entertaining the masses at the current moment-- is always the right answer. "We don't rank our results," said Mr. Lu, seemingly with some pride. The Borg Collective of John Q. Public (which, as I look just now, has informed the default Bing drop-down that I'll be looking for 'Megamind movie') takes care of that for you.

How do you prevent your search results from being invaded by vacuous pop culture? Mr. Lu mentioned that Bing is very respectful of your privacy, but later noted that Bing tries to provide the best search results based on everything it knows about you-- which seems to suggest that if your interests and inclinations differ from those of the masses, you have to surrender some of your privacy to get reasonable search results.

The future of mobile search

Mr. Lu was enthusiastic about the strides Microsoft is making in turning data about the world around users into search queries, remarking that entering words into a box isn't the most natural way to search from a mobile device. Instead, why not take a picture of something and let that be the search input? If you take a picture of a business or church*, you could see the associated website, opening hours, menu, etc.

Personally, I despair at a future where such a system is the default input for search queries, and I'm not sure it would even work. If you're looking for a product to buy, it's likely because you don't have one and therefore can't take a picture of it. Maybe the hope is that you would take photos of billboard ads, to facilitate your easy purchase of things that companies want you to buy? If I'm looking up a restaurant to find the hours, chances are I'm not there at the moment to take a picture-- otherwise I could look at the hours posted on the door. And what if you're looking for something more abstract, like the history of touchscreens, and don't just want one to buy? There are already mobile apps that let you purchase an item based on a photograph; what else would Bing offer?

Profits, not products

The intensity of Mr. Lu's focus on profits took me aback. Before the talk started, we all introduced ourselves briefly, and the audience included a number of scholars from the Computation Institute (including linguists), a number of IT staff, two business school students, and someone from a nonprofit. Notably absent were the economists, but why should they attend what was ostensibly a talk about opportunities for collaboration on a piece of technology?
Knowing there were no economists in the room, Mr. Lu persisted in directing his offers for collaboration to these absent economists. He described the University of Chicago as a great school for economic theory-- while arguably true, I can't help but wonder if it wasn't a little insulting to the people who were in attendance. Mr. Lu wants to collaborate with scholars in developing a better model for a market to sell adwords for Bing. He referred to a second of human attention as the most valuable commodity, destined to become even more valuable as "great products designed especially for you!" compete for your attention. It seemed pretty clear that the focus of Bing is on profits-- getting people to see as many ads tailored to their demographic as possible, and developing more effective ways to extract money from advertisers-- and not on getting people the search results that they actually want.

Maybe Mr. Lu should wait until the Milton Friedman Institute opens at UChicago. He might just find the collaborators he's looking for there.

* It's a small thing, but I was struck by how Mr. Lu repeatedly used the example of looking for information about a church. While not as bad as getting the name of your host institution wrong, it's a detail that suggests some degree of unfamiliarity with UChicago. Everyone I've known in the Divinity School has been agnostic, if not atheist.

(As usual, the opinions expressed here are mine alone and not those of the University of Chicago, a place where the world’s leading minds come to be agreed and disagreed with.)

Bing pop can by Jason Walsh (CC BY)



