The Media Industry Prefers Penis

A not-totally serious look at how The Verge, BuzzFeed, and the New York Times use words like “penis” and “vagina” (and “cat” and “dog”)

1. P v. V

I had a stray idea and decided to chart how often web pages that contain the word “penis” appear on media websites like The Verge, BuzzFeed, and the New York Times, in comparison to the number of web pages on those sites that contain the word “vagina.” Here is a highly suspect chart showing the ratio of “penis” pages to “vagina” pages, using the search results from Bing.com (more below).

I discovered that, yes, given a sample consisting of the websites that I could think of in a few minutes, the word “penis” occurs more than the word “vagina.” This is why there’s more blue than pink.

The data in that chart came from computers, but I wouldn’t go citing it all around town. For example, the DrudgeReport.com ratio is so high because there are only a handful of results for those searches — and Drudge doesn’t appear to be indexed by Bing. Businessweek seems to be indexing a “most cited story” that appeared on lots of pages and included the word “penis.” And so forth. The outliers are suspect; things closer to the median less so. The best way to understand this chart is to step several feet back from the screen and squint.

2. My research method

  1. I made a list of words, and those words were “penis” and “vagina.”
  2. I made a list of media companies, like buzzfeed.com, that came to mind (I also pulled some of the Alexa top news sites).
  3. I went to many of the websites of these media companies and searched for the words “penis” and “vagina” using their site search tools.
  4. I became sad and frustrated. No love, no sense of product, no self-respect: These are the defining characteristics of the search engines of large websites 🔍😢. So much money, for naught.
  5. I thought, If their own sites can’t help me count document/word frequencies, I will use Google search. Then I spent twenty minutes entering variations on the same search query (e.g. “site:nydailynews.com penis”) and became bored and disappointed with myself. I thought, I’ll automate this! But Google goes out of its way to keep you from automating search engine queries. I tried: (1) to use Google’s own (deprecated) web-search API; (2) to script things using a command-line tool called curl; (3) to use a “headless” browser called phantomjs; and (4) to use a text-mode browser called elinks. Each approach placed me firmly in bot jail without appeal.
  6. This made me think: Google is building machines to drive our cars, Google traverses our world taking pictures of everything for Google Maps, and Google constantly indexes everything on the web. Google sticks its automated nose up the global skirt, but don’t YOU dare make a few hundred automatic requests to its magical search engine or you’ll find yourself staring into the cool robot eye of a captcha awaiting a visit from bot protective services. That’s what I thought, but nonetheless, it’s their world to automate as they see fit, and their terms of service. (There are ways to gain access to their search results, using various APIs with various approval processes. But it’s not casual, it’s a thing — which means that casual questions go unanswered.)
  7. So again I was sad. Media companies have terrible search engines, and Google won’t truck with even gentle spiders; it will only respond to the tender eyeballs of live, ad-susceptible humans.
  8. Then I realized: There are other search engines! I totally forgot. Does Bing care how I use it? I bet “nope.” After some testing, it seemed that was true. You can hit Bing tons of times and Microsoft is like, our milkshake brings all the bots to the cloud. You don’t even need to add little pauses between search queries; you can just search 500 times in a minute, and the results pop right over, without any slowdown or temporary jail sentences. I was so impressed that I vowed to think of Bing every few months.
  9. Now I was cooking. I wrote a little Perl script to search Bing and since Bing doesn’t care, I added a bunch of other words too, like “dog” and “cat,” and in a minute I had a nice big list of word frequencies, which I imported into a Google spreadsheet that you can view online. Here is the Perl script for reference. I’ve linked to it so that you can find some bug that completely invalidates everything I have written both in this article and in my life.
  10. I exported the data to Excel because Google Spreadsheet charts look like they were made by color-blind eleven-year-olds. Excel charts, on the other hand, look like they were made by drunks who sell timeshares in Tampa — but given an hour or two you can usually bludgeon Excel charts into something two-dimensional, muted, and free of drop-shadow.
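
For readers who want to see the shape of the counting step, here is a minimal sketch in Python. It is not the Perl script mentioned above. It only builds the “site:domain word” queries described in step 5 and turns two page counts into a ratio. The function names, the example domains, the search URL layout, and every number in it are assumptions made up for illustration; no real result counts are included, and nothing here contacts a search engine.

```python
from urllib.parse import urlencode


def build_query(site, word):
    """Return a search string of the form 'site:example.com word'."""
    return f"site:{site} {word}"


def search_url(site, word, base="https://www.bing.com/search"):
    """Build a search URL for the query (the URL layout is an assumption)."""
    return base + "?" + urlencode({"q": build_query(site, word)})


def ratio(count_a, count_b):
    """Ratio of two page counts; None when the second count is zero."""
    if count_b == 0:
        return None
    return count_a / count_b


# Made-up counts for hypothetical sites, for illustration only.
counts = {
    "example-news.com": {"dog": 1200, "cat": 800},
    "example-blog.com": {"dog": 30, "cat": 90},
}

for site in sorted(counts):
    c = counts[site]
    print(site, search_url(site, "dog"), ratio(c["dog"], c["cat"]))
```

A ratio built from a handful of results swings wildly when one count changes by a page or two, which is one reason the outliers in charts like these deserve suspicion.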

3. Further research: “dog” v. “cat”

Next I decided to do the ratios for “dog” and “cat,” and I learned something interesting: Bing thinks that Gawker far prefers dogs to cats. Meanwhile, at the very end of the list, BuzzFeed is obsessed with cats. I was surprised, because I recently read that BuzzFeed has produced some content in partnership with Pepsi involving things coming out of Pharrell’s hat, “including doge.” But no — cats.

4. Concerns and caveats

I would like to address some potential questions and concerns.

Are Bing result counts reliable? You might wonder. Especially given how much they may differ from Google’s result counts, or a site’s own search count? Isn’t this whole thing spurious at best? And my answer is, These numbers are absolutely perfect.

You might pass me in the hall of a conference and ask, Is this collection of sites in any way representative of the media as a whole? Wouldn’t it be better to have included some sort of traffic ranking or some way to understand the aggregate audience? And on my way to deliver an amazing keynote, I’d say, Of course it is representative. And I’d shake your hand.

At night you could look over to see me mid-slumber and ask, But what if word A appears on the same page as word B — as is incredibly likely for words with some semantic similarities like these. This means that many of the counts for A and B refer to the same page. How can you remove that data or otherwise account for it? And I’d say, Doesn’t matter, go back to bed. Night time is no time for data journalism. And then in the morning I’d shake my head and say, “crazy dreams, huh?”

Some time in the future, hours before dawn, you may be walking near a river and look across and see the bright glint of my smile and the occult glow of my eyes, and call to me, “Are these actually meaningful search terms? There’s so much ambiguity and euphemism in the world. It seems as if this data could be interpreted so many ways.” And I would call back, above the sound of the river across the rocks, “No, these terms are perfectly chosen and language is never ambiguous.” Then I would dip my hand into the water, pull it out, wait a second, and dip it back in. “You see,” I’d say, “I just dipped my hand into the same river twice. The words are unchanging, like the water in this river.”

We might meet again, thirty years after we last parted, and in the flood of memories you might ask, “Assuming the data is in any way meaningful, aren’t some of the result counts too low to be meaningful?” And I would put my hands on your shoulders and say, “Every sample size is meaningful.”

In the far future, you might attend my wake. He did important work, you will think. His comparison of sexualized terms on websites changed America. But given that the results from Bing are not always consistent, even from search to search — given that, very briefly, during his analysis, the Bing results for the New Yorker seemed to drastically favor cunnilingus — and yet a further search revealed that, no, the New Yorker was most safely in the camp of fellatio — given all of that, could this entire project have been an amusing folly instead of true science? I will hear your thoughts because my cyber-thought-radio is still plugged in, by an oversight of the mortician, and the nanobots that make up most of my blood will be aroused by your suspicions, animating my body and stimulating my jaw to utter the final phrase of my existence: “All. Data. Perfect.”

5. Further research

As this turned from a five-minute folly into a misspent afternoon I made other charts, including one that compared “fellatio” to “cunnilingus,” which made it clear where lie the priorities of the New York Times (sorry, ladies). But the Bing engine’s results seemed a little specious with more unusual terms, which is why I haven’t included that chart. For example, when I searched for “fellatio” in the New York Times, Bing came up with 6,540 results and immediately suggested: “Also try: Sodomy.” Which shows where its priorities lie. Then, when I searched for “cunnilingus” at the Times (I’m sure many have, they work late), Bing found 58 results — but a spot-check showed that many of the results didn’t contain the term, but instead contained statistically similar terms like “elderly people remain frisky” and “Gonococcal pharyngitis.” I began to worry that all my science was in vain. The problem with data in the wild on the web is that you can never be sure what you’re getting is what you asked for. It could instead be the result of some statistical model designed to keep the page from staying blank.

Thus it turns out, disappointingly, that Weather.com doesn’t really prefer “vagina.” When I went to look, I couldn’t find any actual vaginas on Weather.com using Bing. Just some penises.

But for more common terms Bing does seem to dig up enough legitimate occurrences, and to accord roughly with perception. Looking at the chart below, comparing “man” to “woman,” common-enough terms, you could conclude that Deadspin and YCombinator’s Hacker News are sites about men, and that Elle and Bustle are about women. The data doesn’t quite support exactly that conclusion, of course, but…it’s interesting to see the breakdown.

6. Ratios in black-and-white

To close out this exercise, I made a chart that shows all the sites in alphabetical order, and all their words — and if the ratio is greater than 1.0, then that cell is black. In a way this chart provides a kind of mental thumbprint for the priorities of each site, suitable for your wallet.
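
The black-and-white rule is simple enough to sketch in a few lines of Python. This is an illustration under stated assumptions, not the way the chart was actually produced: the function name `thumbprint`, the site names, and the ratios are all hypothetical. Sites are sorted alphabetically, and a cell is marked “#” (black) only when its ratio is strictly greater than 1.0; a missing or undefined ratio is left white.

```python
def thumbprint(ratios, pairs):
    """Return one text row per site, alphabetically; '#' marks a ratio above 1.0."""
    rows = []
    for site in sorted(ratios):
        cells = ""
        for pair in pairs:
            value = ratios[site].get(pair)
            # A missing or undefined (None) ratio counts as white.
            cells += "#" if value is not None and value > 1.0 else "."
        rows.append(f"{site:<20}{cells}")
    return rows


# Hypothetical ratios, for illustration only.
pairs = ["dog/cat", "man/woman"]
ratios = {
    "example-b.com": {"dog/cat": 0.4, "man/woman": 2.0},
    "example-a.com": {"dog/cat": 1.5, "man/woman": 0.9},
}

for row in thumbprint(ratios, pairs):
    print(row)
```

Note that a ratio of exactly 1.0 stays white under this rule, since the chart blackens only ratios greater than 1.0.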

Who would have assumed that the Washington Post and Vox.com would have so much in common? (Everyone.) Or that BuzzFeed and The Huffington Post, at least as Bing sees it, would be so likely to prefer cats to dogs? Or that The Awl enjoys discussing both cats and penises? (Everyone.) It’s good to know things about your media when you consume so much of it, so I’m glad we did this together.

For more fun along these lines, Kieran Healy did some further analysis and clustering, and wrote up what he learned.

And before this party ends we should make sure to insult the hosts. To that end, Bing thinks Medium.com, this very website, uses the word “men” eight times more than it uses “women,” which is, I suppose, a problem of the -atic variety, and not as surprising as it could be.

Finally, Medium.com is also slightly biased in favor of both dog and penis — but just a little. It’s still a young network and it would take very little to tip the scales in the direction of cats and vaginas. Very little indeed.