It’s been five years since the first edition of Hadoop Weekly went out on January 20, 2013 (see the 245+ issue archives at HadoopWeekly.com). Five years ago, the Hadoop ecosystem didn’t consist of much more than HDFS, MapReduce, Pig, Hive, HBase, and Oozie. Streaming systems were just starting to take off, with Twitter open sourcing Storm in late 2011 (other projects that started around the same time, including S4 and HStreaming, have since become defunct). Spark wouldn’t enter the Apache incubator for another six months.
Now, five years later, Hadoop distros include many more projects (Amazon EMR, for one, supports over 18), Apache Kafka is at the center of many applications, and we’re seeing tons of innovation in the container orchestration space. While Hadoop Weekly has always covered the “Hadoop ecosystem,” that definition has shifted significantly over the past five years. (See my brief analysis below).
As new technologies arrive and Hadoop plays less of a role, I’ve resisted renaming the newsletter for one reason or another. But I finally have a new name and the time to make the change. In early 2018, Hadoop Weekly will become Data Engineering Weekly (dataengweekly.com; website coming soon). Don’t expect drastic changes to the content, but I’ll make the name change gradually over the next few weeks.
Hadoop Weekly now has well over 10,000 subscribers. Growth has been consistent and organic (I haven’t run marketing or advertising campaigns to drive subscriptions). Thanks to all the readers, especially those who send me content, recommend Hadoop Weekly to their friends and coworkers, and otherwise encourage me to keep the newsletter going.
Maintaining a high-quality newsletter is a lot of work, and there have been several times that I nearly gave up. As part of my reflection on whether or not to keep up the newsletter, I’ve analyzed the data that I’ve captured over the years. Through weekly snapshots of technical topics, it’s easy to see some high-level trends.
Caveat: Hadoop Weekly very much expresses my editorial bias, and thus the following analysis also reflects that bias. What follows is not a scientific analysis, but hopefully you’ll still find it interesting!
Some brief analysis of the above trend data:
- YARN started off strong, but its coverage has been trending downwards since 2014.
- Since late 2013, Spark has been growing like mad; that changed in late 2017, when Apache Kafka became more widely written about.
- Coverage of Hive, HDFS, and MapReduce has been slowly trailing off over the past five years.
- Apache Drill had a burst in 2015–2016, but it’s dropped in coverage in 2017.
- Apache Flink came onto the scene in 2015 (before that, it had a different name) and has maintained steady coverage since then.
Here’s a look at the first appearances of various technologies (and more!) in Hadoop Weekly over the past five years.
- First post ever (that’s not a dead link): http://grepalex.com/2013/01/17/awk-with-hadoop-streaming/
- First mention of Kafka: Hadoop Weekly #4, February 10, 2013: LinkedIn Eng blog on intra-cluster replication in 0.8
- First post on ORCFile: Hadoop Weekly #6, February 24, 2013: The Stinger Initiative: Making Apache Hive 100 Times Faster
- First mention of Spark: Hadoop Weekly #8, March 10, 2013 for the 0.7.0 release (link no longer works)
- First post on Parquet: Hadoop Weekly #9, March 17, 2013: Introducing Parquet: Efficient Columnar Storage for Apache Hadoop
- First post on Microsoft Azure: Hadoop Weekly #10, March 24, 2013 for the HDInsight public preview (link no longer works)
- First mention of Apache Drill: Hadoop Weekly #10, March 24, 2013 in reference to MapR’s funding round
- First mention of HDFS erasure encodings (one of the features of the recently announced Hadoop 3.0): Hadoop Weekly #21, June 9, 2013 (link no longer works)
- First post on Presto: Hadoop Weekly #21, June 9, 2013: Facebook unveils Presto engine for querying 250 PB data warehouse
- First event in Australia: Hadoop Weekly #22, June 16, 2013
- First mention of Stratosphere (what is now Apache Flink): Hadoop Weekly #54, January 26, 2014: Stratosphere 0.4 Released
- First event in Africa: Hadoop Weekly #55, February 2, 2014
- First mention of Docker: Hadoop Weekly #61, March 16, 2014 for the (now defunct) Ferry project. (link no longer works)
- First mention of Google Cloud Dataflow (what is now Apache Beam): Hadoop Weekly #77, July 6, 2014. Google’s MapReduce Divorce Does Not Mean End of Hadoop is Near
- First mention of Kubernetes: Hadoop Weekly #90, October 5, 2014: Openshift, Kubernetes, Docker, and Apache Hadoop
- First Medium post: Hadoop Weekly #92, October 19, 2014: Is Apache Flink Europe’s Wild Card into the Big Data Race?
- First mention of Kudu: Hadoop Weekly #139, September 27, 2015: Cloudera is building a new open-source storage engine called Kudu, sources say
- First post on microservices: Hadoop Weekly #170, May 15, 2016: Spring Cloud Stream: The New Event-driven Microservice Framework
I’m often asked how much time Hadoop Weekly takes, as well as other details about operating it (and what it costs). So if you’re curious, here’s a peek into those details.
I use Mailchimp for sending emails and AWS to host hadoopweekly.com. During the first ~13 months of Hadoop Weekly, I qualified for Mailchimp’s free tier. Since then, prices have steadily increased from $27/month to $72/month. In just under four years, the total cost of sending email is around $2,750. Hadoopweekly.com is built with Jekyll and hosted on Amazon S3. In early 2016, I also added CloudFront to the mix to enable https. It’s a bit difficult to estimate exactly, but it costs around $1/month to host Hadoop Weekly (the majority of that cost is the $0.50 for a Route 53 hosted zone). So over five years, the cost is well under $50.
Other costs include:
- PO Box for compliance with CAN-SPAM (has gone from $76/year to $90/year): $412 (76+80+80+86+90)
- DNS: $55 ($11/year)
So in total, over five years, my out-of-pocket cost has been around $3,300. Of course, the real cost is my time: I spend 4+ hours per week curating content.
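For the curious, the totals above can be checked with a quick back-of-the-envelope tally. The figures come straight from the post; the labels and the $50 hosting figure (stated as an upper bound) are my grouping:

```python
# Back-of-the-envelope tally of the five-year costs quoted above.
costs = {
    "Mailchimp (email delivery)": 2750,            # ~4 years of paid tiers
    "AWS hosting (S3/CloudFront/Route 53)": 50,    # stated upper bound
    "PO Box (CAN-SPAM compliance)": 76 + 80 + 80 + 86 + 90,  # $412
    "DNS registration": 11 * 5,                    # $11/year for 5 years
}

for item, dollars in costs.items():
    print(f"{item}: ${dollars}")

total = sum(costs.values())
print(f"Total: ${total}")  # $3,267, i.e. "around $3,300"
```

The exact sum is $3,267, which rounds to the "around $3,300" quoted above.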
As you can see, the cost of operating a weekly newsletter is non-trivial. Starting in 2018, I’m going to be looking into including sponsored content (something that several other weekly newsletters do) to help offset some of my costs. If you’re interested in advertising a job, a webinar, or another type of sponsored post, please get in touch by mailing email@example.com.
A look ahead
2018 should be an exciting year for data engineering. I’m excited to cover that news, and I hope you’ll stay along for the ride (and help spread the word). As always, I can be reached on Twitter and via email (firstname.lastname@example.org) if you have any news or posts to share. Until the transition is complete, signup still lives at hadoopweekly.com.