Why I made the move from Oracle to MongoDB

Britton LaRoche
Published in The Startup
18 min read · Sep 8, 2020

Everything changes. Time is a measure of change. I’d go so far as to say that change is the most powerful force in the universe. Even truth is subject to change. What is true today is not necessarily true tomorrow. Throughout recorded history, we can see the power of change.

Technological innovations brought about by changes in thinking often create their own empires. I understand the power of change and when I look at history, I realize I can’t fight changes that bring technological advancements, I have to embrace the change. Yet, oddly enough, it is part of human nature to like things just the way they are. Sometimes I fight the changes even though I know it is folly.

The technology industry is constantly changing, reinventing itself to enable new capabilities. Compare technology to another industry like health care and it feels like comparing dog years to a human life span. Because I work in the tech industry, I must constantly learn new technologies just to keep up with all the changes.

Three years ago, I was happily employed as a Principal Sales Consultant at Oracle Corporation. I had several years of experience with Java and Oracle and was content in my role. In my humble opinion, Oracle was (and still is) probably the greatest relational database in the world.

Today I am employed as a Senior Solutions Architect at MongoDB. This article explains why I made the switch from Oracle to MongoDB. More importantly, I hope to give you some insight as to the changes going on right now inside the tech industry when it comes to databases.

Over the past couple of decades, I have seen many iterations of various programming platforms rise and fall in popularity based on major technological paradigm shifts.

In spite of all this change, there is one central truth: data has value.

Many technologies developed to access and display the data in a database have come and gone. Yet the paradigm of the database, a place to store and retrieve data, has changed relatively little, until recently.

Over the past 40 years we witnessed the migration from mainframe to client-server technologies, then to distributed processing on the web, and in the process saw all the programming paradigms shift. Then the concept of the web began to evolve into the concept of the cloud.

Communication between systems is not only critical; today it is central to the success of running any sort of business. Moving data from one system to another requires having a way to describe the data in a commonly understood format between the systems.

Extensible Markup Language (XML) and JavaScript Object Notation (JSON) are different ways of representing data as it is transferred from one system to another. Originally the favored method of communicating between systems was Web Services using something like SOAP (Simple Object Access Protocol). Together with a Web Service Definition Language (WSDL) file and an XML message, I could do just about anything.

Today, Representational State Transfer (REST) and JSON are pretty much the favored standard. JSON is a bit easier on the eyes and considered more human-readable. The two examples below show the exact same customer data in XML and JSON.

A Sample Customer Record in XML format:

<?xml version="1.0" encoding="UTF-8"?>
<root>
<address>
<element>
<city>New York</city>
<postalCode>10021</postalCode>
<state>NY</state>
<streetAddress>21 2nd Street</streetAddress>
<type>Billing</type>
</element>
<element>
<city>New York</city>
<postalCode>10021</postalCode>
<state>NY</state>
<streetAddress>42 4th Ave. Apt 2B</streetAddress>
<type>Shipping</type>
</element>
</address>
<age>25</age>
<customerId>100123</customerId>
<firstName>John</firstName>
<gender>Male</gender>
<lastName>Smith</lastName>
<phoneNumber>
<element>
<number>212 555-1234</number>
<type>home</type>
</element>
<element>
<number>646 555-4567</number>
<type>fax</type>
</element>
</phoneNumber>
</root>

A Sample Customer Record in JSON Format:

{
"customerId":"100123",
"firstName": "John",
"lastName": "Smith",
"age": 25,
"gender": "Male",
"address":
[
{
"type":"Billing",
"streetAddress": "21 2nd Street",
"city": "New York",
"state": "NY",
"postalCode": "10021"
},
{
"type":"Shipping",
"streetAddress": "42 4th Ave. Apt 2B",
"city": "New York",
"state": "NY",
"postalCode": "10021"
}
],
"phoneNumber":
[
{
"type": "home",
"number": "212 555-1234"
},
{
"type": "fax",
"number": "646 555-4567"
}
]
}

The XML file weighs in at 853 characters, while the JSON is a lighter 732 characters; the roughly 17% increase in size is due to all the XML tags. It’s pretty obvious why JSON is favored over XML: it is both smaller and easier to read.

In 2006 Amazon Web Services was officially launched offering Simple Storage Service (S3) and Elastic Compute Cloud (EC2). The notion of network-based computing dates to the 1960s, but many believe the first use of “cloud computing” in its modern context occurred on August 9, 2006, when then Google CEO Eric Schmidt introduced the term to an industry conference. Some refer to SalesForce.com and its offering in 1999 as the first offering of cloud computing. In my opinion, the launch of Amazon Web Services in 2006 marks the birth of today’s cloud computing.

Fourteen years later, the infrastructure, platforms, and software hosting just about all the applications we interact with are either running in the cloud or communicating with systems running in the cloud. In the tech industry, we are embracing the cloud because we have to; the speed and scale the cloud offers for application development cannot be ignored.

REST itself is now facing a new challenger. GraphQL is an open-source data query and manipulation language for APIs, and a runtime for fulfilling queries with existing data. GraphQL was developed internally by Facebook in 2012 before being publicly released in 2015. It is gaining serious traction in 2020.

GraphQL has an advantage over REST. A REST API has a rigid contract: the result returned is predefined when the endpoint is written. GraphQL allows the developer to shape the response and return only the fields the client needs. Additionally, GraphQL provides a framework for “Mutation,” or changing data through inserts, updates, and deletes. It saves all the time necessary to develop a REST API for every possibility, as the sketch below shows. I can easily predict that GraphQL will be the dominant standard in the future. It is important to note that GraphQL still uses JSON as the default response.
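To make the contrast concrete, here is a minimal sketch in Python (using the requests library) of a GraphQL call against a hypothetical /graphql endpoint whose schema is modeled on the sample customer document above. The endpoint URL and field names are assumptions for illustration, not a real API:

import requests

# Hypothetical GraphQL endpoint; the schema below is assumed for illustration.
GRAPHQL_URL = "https://api.example.com/graphql"

# Ask only for the fields we need: no over-fetching, no extra REST endpoint.
query = """
query CustomerSummary($id: ID!) {
  customer(customerId: $id) {
    firstName
    lastName
    address(type: "Shipping") {
      city
      postalCode
    }
  }
}
"""

response = requests.post(
    GRAPHQL_URL,
    json={"query": query, "variables": {"id": "100123"}},
    timeout=10,
)
response.raise_for_status()

# GraphQL responses are JSON, with the shaped result under the "data" key.
print(response.json()["data"]["customer"])

The response contains exactly the fields the query asked for and nothing else, which is the whole appeal.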

Up until 2017, the one constant through all this change was the database itself. Of all the technical knowledge I accumulated, the one thing that served me well and stood out with little need for change was my knowledge of relational database design.

My knowledge of Oracle and Structured Query Language (SQL) kept me employed with a healthy salary through all of the technical paradigm changes. While the rest of technology evolved and changed, the relational paradigm remained relatively static from the time Larry Ellison co-founded the company that became Oracle on June 16, 1977 to commercialize E.F. Codd’s relational model.

Over the last 43 years, the relational database has not changed much. Data is sticky. It has value. Knowing how to get data, store it, and manipulate it is a way to guarantee stable, well-paid employment in the tech industry.

Larry Ellison Describes Exadata Cloud at Customer on Oracle.com

Oracle and the relational database concept have remained relatively unchanged for a very, very long time by technology standards.

Today, when I use a modern Web Service or REST API, I literally have to travel back in time 40 years, conceptually speaking, to store the data in a relational database. When I store data sent in an XML or JSON message, I have to break the message down into separate tables every time I put the data from the message into a relational database.

Then I have to reassemble the object from different tables every time I want to send data to another system. The mismatch between the JSON document and the relational database is called “object-relational impedance mismatch.” It’s a war fought every day by developers who have to break the JSON document into pieces that go into different tables. That is a lot of work, and I made a good living doing exactly that kind of work.

Our Sample Customer JSON Document in a Relational Database:

Relational databases have always had a central weakness. They break the world down into separate tables based on the idea of normalization. This is done to eliminate repeating values in the rows of a table. I stop repeating values by splitting the attributes that cause repetition off into their own tables, creating a primary key in one table and a foreign key in the other to join the two together. For example, if we put all the data in one table, the Customer_ID 100123, age 25, gender “Male,” and the name “John Smith” all have to be repeated for every phone number and address combination. Repeating values eat up a lot of disk space. Forty years ago, when the relational database came into being, disk space was incredibly expensive, so I need to reduce the repeating values to save disk space in a relational system.

To keep from repeating these values, I break the data down into three tables: a customer table, a phone table, and an address table. Then I link the tables off of only one repeating value, the Customer_ID. That is all well and good for storage, but not so well and good for the poor developer who has to write all the code to take the JSON document apart and stick it in the right tables, as the sketch below illustrates.
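Here is a minimal sketch of that plumbing in Python, using the standard-library sqlite3 module as a stand-in for any relational database. The table and column names are my own illustration, not a schema from any particular system; the point is simply that one JSON document becomes inserts into three tables, and the joins run in reverse to get it back out:

import sqlite3

# A stand-in relational schema: one customer document becomes three tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    customer_id TEXT PRIMARY KEY,
    first_name  TEXT,
    last_name   TEXT,
    age         INTEGER,
    gender      TEXT
);
CREATE TABLE address (
    customer_id TEXT REFERENCES customer(customer_id),
    type        TEXT,
    street      TEXT,
    city        TEXT,
    state       TEXT,
    postal_code TEXT
);
CREATE TABLE phone (
    customer_id TEXT REFERENCES customer(customer_id),
    type        TEXT,
    number      TEXT
);
""")

# The sample customer document from earlier in the article.
doc = {
    "customerId": "100123", "firstName": "John", "lastName": "Smith",
    "age": 25, "gender": "Male",
    "address": [
        {"type": "Billing", "streetAddress": "21 2nd Street", "city": "New York",
         "state": "NY", "postalCode": "10021"},
        {"type": "Shipping", "streetAddress": "42 4th Ave. Apt 2B", "city": "New York",
         "state": "NY", "postalCode": "10021"},
    ],
    "phoneNumber": [
        {"type": "home", "number": "212 555-1234"},
        {"type": "fax", "number": "646 555-4567"},
    ],
}

# Take the document apart: the parent row first, then one row per array element.
conn.execute(
    "INSERT INTO customer VALUES (?, ?, ?, ?, ?)",
    (doc["customerId"], doc["firstName"], doc["lastName"], doc["age"], doc["gender"]),
)
conn.executemany(
    "INSERT INTO address VALUES (?, ?, ?, ?, ?, ?)",
    [(doc["customerId"], a["type"], a["streetAddress"], a["city"], a["state"], a["postalCode"])
     for a in doc["address"]],
)
conn.executemany(
    "INSERT INTO phone VALUES (?, ?, ?)",
    [(doc["customerId"], p["type"], p["number"]) for p in doc["phoneNumber"]],
)
conn.commit()

# To send the customer to another system, the joins run in reverse.
rows = conn.execute("""
    SELECT c.first_name, c.last_name, a.type, a.city, p.number
    FROM customer c
    JOIN address a ON a.customer_id = c.customer_id
    JOIN phone   p ON p.customer_id = c.customer_id
    WHERE c.customer_id = '100123'
""").fetchall()
print(rows)

Notice that reassembling one customer already takes a three-way join, and the join result repeats the customer fields for every address and phone combination, which is exactly what normalization was trying to avoid.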

The biggest computational problem comes when I want to report against large volumes of that data. I need to join the tables back together. Those joins work fine on a relatively small data set. With millions of rows of data, it works great. With hundreds of millions of rows, relational will still work if I tune the database and optimize my SQL query.

Think of customers and orders… or call data records (CDRs) at a telecom provider. Each order or call data record is a row in a table. I can easily reach billions of rows in a short time period. Think about what the size of Amazon's orders must look like on a daily basis. Now think of what that looks like for a week, a month… a year. Imagine billions, even trillions of rows of data that have to be joined across many tables. Relational database systems really struggle with this volume of data.

Relational databases never did scale very well

Here is a somewhat hidden secret in the tech industry: relational databases never did scale very well. The reason is simple; relational joins are extremely expensive when it comes to computational resources. Try running a Structured Query Language (SQL) statement that joins several tables against tens of terabytes of data. It doesn’t work on a single server in 3rd normal form. It never did.

To handle reporting against extremely large volumes of relational data, a concept was formed called “data warehousing.” To get data at that scale (hundreds of billions of rows) to work, I have to “denormalize” the data. A “star schema” is a very flat, denormalized schema that joins from a dimension like time or region to a “fact” table that has everything in it. The fact table is nice and flat; it is one huge table.

Data warehousing works because it denormalizes the data and eliminates joins. The only downside is that I have to write complex Extract, Transform, and Load (ETL) processes to move data from the Online Transaction Processing (OLTP) database to the data warehouse. The ETL is responsible for denormalizing the data out of 3rd normal form in the OLTP database and getting rid of the need to join multiple tables together. It takes the customer data in the three tables and adds it to the order data and product data in one giant fact table, repeating values if necessary; a sketch of that flattening follows below. The key here is that I only have to join from one or two dimensions to the fact table. The data warehouse works because I eliminated the majority of joins between tables.
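Continuing the sqlite3 stand-in from the earlier sketch, here is a hedged illustration of that ETL flattening step. The orders, order_items, and product tables and their rows are made up so the example runs end to end; the point is that the fact table repeats customer fields on every row so that reports never have to join back to the OLTP tables:

import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse database

# Minimal stub OLTP tables so the sketch runs; the real source tables and
# their contents are assumptions for illustration only.
conn.executescript("""
CREATE TABLE customer (customer_id TEXT, first_name TEXT, last_name TEXT);
CREATE TABLE address  (customer_id TEXT, type TEXT, city TEXT, state TEXT);
CREATE TABLE orders   (order_id INTEGER, order_date TEXT, customer_id TEXT);
CREATE TABLE order_items (order_id INTEGER, product_id INTEGER, quantity INTEGER);
CREATE TABLE product  (product_id INTEGER, product_name TEXT, list_price REAL);

INSERT INTO customer VALUES ('100123', 'John', 'Smith');
INSERT INTO address  VALUES ('100123', 'Shipping', 'New York', 'NY');
INSERT INTO orders   VALUES (1, '2020-09-01', '100123');
INSERT INTO order_items VALUES (1, 7, 2);
INSERT INTO product  VALUES (7, 'widget', 9.99);

-- The ETL step: flatten customers, orders, and products into one wide fact
-- table, repeating customer fields on every row (disk space traded for speed).
CREATE TABLE fact_orders AS
SELECT
    o.order_id,
    o.order_date,
    c.customer_id,
    c.first_name,
    c.last_name,
    a.city,
    a.state,
    p.product_name,
    oi.quantity,
    oi.quantity * p.list_price AS line_total
FROM orders o
JOIN customer    c  ON c.customer_id = o.customer_id
JOIN address     a  ON a.customer_id = c.customer_id AND a.type = 'Shipping'
JOIN order_items oi ON oi.order_id   = o.order_id
JOIN product     p  ON p.product_id  = oi.product_id;
""")

# Reporting now touches one big flat table plus, at most, a dimension or two.
for row in conn.execute("""
    SELECT state, SUM(line_total)
    FROM fact_orders
    WHERE order_date >= '2020-01-01'
    GROUP BY state
"""):
    print(row)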

The Data Warehouse paradigm works, I keep the operational OLTP database small and nimble and archive data through ETL to the data warehouse. It is expensive and time-consuming, but up until now, it has worked.

But it’s not just the increase in the volume of data that has changed over time; it’s the nature of the data we need to process. Now we have unstructured log data, sentiment analytics, thumbs up, thumbs down, emojis, memes, and graphical representations of data.

The problem goes beyond the impedance mismatch or the unstructured nature of the new data; it is also the velocity of change we need to handle. New fields are added frequently by agile development shops. IoT devices will be sending new data streams. All of this has to be modeled in 3rd normal form before I can insert a single piece of data in a relational system. Now the database itself meets the most powerful force in the universe: change. The underlying technology of the relational database has to change to meet the demands of new technology.

For now, let's just go back to the change in data volume over time. Look at the sheer size of the data growth. What are we up against? Why is the relational database paradigm reaching its end?

IDC estimates that 175 ZettaBytes will be generated annually by 2025. The figure comes from the report Data Age 2025, sponsored by Seagate Technology. The growth of this data will be the result of the incorporation of intelligent agents that use machine learning and other forms of artificial intelligence to analyze the growing amount of data generated by the digital things in our lives.

What’s a ZettaByte? It’s 10²¹ bytes, or a trillion gigabytes. It’s worth seeing the number of bytes in a ZettaByte written out: a 1 followed by 21 zeros.

1 ZettaByte = 1,000,000,000,000,000,000,000 bytes.

Today in 2020, we produce about 50 ZettaBytes annually. That number will more than triple in the next 5 years. The pain that relational systems feel today will be much worse in the near future. Just look at the graph and see that 10 years ago we produced less than 5 ZettaBytes. The pain was recognizable then, and relational systems had to introduce clustering technology and sharding just to stay relevant.

Something has to change to eliminate all the joins that a relational database has to do just to function. Something that can support JSON natively so we don't have to take it apart and reassemble it every time we need to communicate with a different system. Something that can handle trillions of data points that constantly have new data fields added.

Enter MongoDB….

The year was 2008 and the winds of change for the database began to blow. A young developer working with JSON was frustrated by relational design. His name was Eliot Horowitz. Every project he worked on was stymied by the upfront data modeling requirements and lack of scalability inherent in relational databases. So in 2008, he decided to fix the problem. He wanted to create a highly available “Humongous” database that could scale on commodity hardware and would store JSON documents directly. Mongo is short for humongous.

I will let you hear him speak in his own words about MongoDB. Take 5 minutes, if you have it, and watch the video below. It is well worth your time.

From day one, MongoDB was scalable. It is built on the principle of horizontal sharding: I can scale the database by partitioning data across shards that run on separate commodity servers. It is also highly available, because each shard replicates its data to a minimum of two secondary nodes. So MongoDB, by default, is both highly available and scalable.

MongoDB is a document database. I simply store the JSON document I receive through REST directly in the database. Every field in the document can be indexed so it's super fast. It also eliminates the need for joins within a JSON object. It overcomes the impedance mismatch between relational and a JSON object because it stores the JSON object as it is.

Developers love it because it speeds up development. They don’t have to model everything in 3rd normal form anymore. I can do joins across documents if I so desire, but the JSON object itself is stored intact in BSON format. BSON is a binary version of JSON that allows MongoDB to enforce specific data types for each field. So again, it’s super fast: I can read and write multiple JSON documents for each input/output operation. A short example of working with documents this way follows below.
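For comparison with the relational sketch earlier, here is roughly what the same customer looks like through the PyMongo driver. The connection string and the database and collection names are placeholders, and the document is abbreviated from the sample above:

from pymongo import MongoClient

# Placeholder connection string; an Atlas SRV URI would work the same way.
client = MongoClient("mongodb://localhost:27017")
customers = client["shop"]["customers"]

# Abbreviated version of the sample customer document shown earlier.
customer = {
    "customerId": "100123",
    "firstName": "John",
    "lastName": "Smith",
    "age": 25,
    "address": [
        {"type": "Shipping", "streetAddress": "42 4th Ave. Apt 2B",
         "city": "New York", "state": "NY", "postalCode": "10021"},
    ],
    "phoneNumber": [{"type": "home", "number": "212 555-1234"}],
}

# The whole document is stored as-is (as BSON); no decomposition, no reassembly.
customers.insert_one(customer)

# Any field can be indexed, including fields inside the embedded arrays.
customers.create_index("customerId")
customers.create_index("address.postalCode")

# Read it back in one round trip, already shaped the way the application wants it.
doc = customers.find_one({"customerId": "100123"}, {"_id": 0})
print(doc["firstName"], doc["address"][0]["city"])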

MongoDB writes documents to disk in a way that stores the data in the document together, unlike a relational database that has to write data in different sectors on the disk depending on where the relational tables are stored. This is a HUGE deal, it leads to an incredible leap forward in performance.

The document ingest rate can exceed the available input/output operations per second (IOPS). A couple of years ago, I was doing a test with an online testing service that wanted to insert at least 100,000 documents per second. They built the MongoDB cluster to handle 100,000 IOPS. During the test, they only saw the cluster go to 20,000 IOPS, yet they were inserting 100,000 documents every second. Confused, we looked at the logs and confirmed that MongoDB was writing about five documents for every input operation used. The test harness generated 100,000 documents per second but only utilized 20% of the capacity available. Because they built the MongoDB cluster for 100,000 IOPS, they could easily get 500,000 documents per second without having to scale any further.
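The exact ratio of documents to physical I/O depends on document size and the storage engine, but the client-side pattern that feeds this kind of throughput is plain bulk writing. A minimal sketch with PyMongo, using made-up test-response documents:

import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
responses = client["testing"]["responses"]

# Made-up documents standing in for the testing service's answer events.
batch = [
    {"testId": "T-42", "question": q, "answer": q % 4, "ts": time.time()}
    for q in range(10_000)
]

# One insert_many call ships the whole batch; unordered inserts let the server
# spread the writes, so document throughput is not tied one-to-one to disk
# operations, as the test above showed.
result = responses.insert_many(batch, ordered=False)
print(len(result.inserted_ids), "documents inserted")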

On June 28, 2016, MongoDB announced Atlas: Database as a Service for MongoDB.

Having a humongous highly available, scalable, JSON document database is awesome on its own. But having it run on all three of the major cloud providers is something else entirely. Within minutes I can deploy a MongoDB cluster in any cloud provider (AWS, GCP or Azure) in just about any region in the world. Wow!

When I worked at Oracle, the feeling about MongoDB could be defined in one word, and that word was “fear”

MongoDB was a document database that ran on commodity hardware. It had an AGPL open-source offering and a paid enterprise version, and it was powerful. At Oracle, MongoDB was alien to us, and yet we could not deny that it was a force to be reckoned with, because so many of our customers were adopting it and using it on new projects with great success. In 2016, the paid version was also offered in the cloud as a database as a service. In short order, the MongoDB Atlas DBaaS offering was on all three of the major cloud providers (AWS, Azure, and GCP) while Oracle was struggling just to run DBaaS in its own cloud.

In March of 2017, Oracle released 12c R2 (after launching it first in the Oracle cloud) to be used on-premises. To my knowledge, it was the first R2 release that contained a new feature. Normally the R2 release at Oracle is just bug fixing of the R1 release. What new feature had to be released in R2? It was sharding. The reason sharding was released in R2 was that the Oracle customer base was begging for it.

The latest example of Oracle reacting to MongoDB was made on August 13th, 2020 with the announcement of “Oracle Autonomous JSON Database.” Touted as a full-featured database for developers working with JSON data, it is advertised to have “The JSON Features of MongoDB and More…” I will write another article addressing the limitations of Oracle’s attempt at JSON, but suffice it to say it does reinforce my arguments. In my opinion, Oracle is scared of MongoDB. At a minimum, the release of the Oracle JSON database is a nod to the fact that developers want to store their data in JSON format.

Back in 2016 at Oracle, we were struggling with our own cloud offering as we made great attempts to pivot to the cloud. I remember an all-hands meeting for sales with Oracle’s then-president Thomas Kurian. “You cannot win the cloud war without infrastructure. You simply cannot,” Thomas said with conviction.

If you want to run the Oracle Autonomous database or Exadata in the cloud you have a couple of options, your datacenter or Oracle’s. The highest level of executive leadership at Oracle does not want to run Oracle software in someone else's cloud.

This is worth repeating. MongoDB runs as Database as a Service (DBaaS) on AWS, GCP, and Azure. If you have an application running on these cloud providers, the traffic between the app and the database is routed within the cloud provider’s datacenter. MongoDB Atlas supports VPC Peering in GCP and AWS and VNet Peering in Azure. Again, Wow!

Oracle does not have a DBaaS offering on any of the major cloud providers. If you want to run applications on any of the major cloud providers and you want to use Oracle Database as a Service, the traffic will have to go from your application in AWS, Azure, or GCP to the Oracle cloud. In most cases, this means routing traffic over the public internet. Who wants the additional latency, or to route traffic over the public internet? Another option is to create and pay for a direct connection between Oracle’s datacenter and another cloud provider. Why pay extra if you don’t have to?

In September of 2018, Thomas Kurian left Oracle. In November he joined Google as CEO of Google Cloud. No one knows for sure why. My guess is it was frustration over Oracle not investing in infrastructure as the other cloud providers did. That was one problem; the other was not running Oracle DBaaS in the other major cloud providers. Ultimately, I believe, this is not a competitive situation Oracle can win.

I’ll have more articles on the things MongoDB has accomplished since 2017, but in the summer of 2017, I could see the writing on the wall. I’d like to point out one more thing that happened in 2016. The Man AHL use case is linked below.

The last straw: MongoDB can outperform Oracle Exadata

The test case I reviewed showed that MongoDB was able to run 250 million “ticks” per second for Man AHL. Imagine all the global stock trades that occur and all the related data that goes with it. Man AHL was wrestling with all this data. Let us just flip these ticks to IOPS.

MongoDB can insert multiple documents per input operation and effectively exceed the IOPS, but relational databases cannot. The relational database has to break down the JSON document message and store it in different locations on disk. As far as I know, the best you can get is a one-to-one mapping. The test results I saw back in 2016 would translate to a minimum of 250 million IOPS for a relational database. I wanted to see what it would take for Oracle to compete.

Oracle’s premier database offering is an engineered system called Exadata. It uses Oracle’s Real Application Clusters (RAC) technology with the Intelligent Database protocol (iDB) to communicate with storage cells. Each storage cell is its own server. The storage cells have flash cache and storage indexes that track the high and low values of every megabyte of data on disk. The Exadata X8 was released in April 2019. A single full-rack Exadata Database Machine X8-2, with 8 database servers and 14 High Capacity storage servers, can achieve up to 350 GB per second of analytic scan bandwidth from SQL, and up to 4.8 million random 8K read I/O operations per second (IOPS) from SQL. It’s a killer, and it has some of the best performance benchmarks I have ever seen for a relational database. In fact, although it’s not advertised, I have heard that a well-tuned Exadata can hit the 12 million IOPS mark.

Let's go with 12 million IOPS on the Man AHL use case. We need 250 million IOPS. How many Exadata machines do we need to match the MongoDB cluster running on commodity hardware? Well, it looks like we need about 20 of them. But if we want disaster recovery outside of a single datacenter we need another 20 standby Exadata machines in another datacenter so we can failover. The cost of a fully-loaded Exadata machine is millions of dollars. So we need 40 multi-million dollar Exadata machines to compete with one single MongoDB cluster. The Man AHL use case blew my mind.

In the summer of 2017, I was actively interviewing at MongoDB.

When an old technology dies, it doesn’t go out with a bang; it slowly fades away. Today we still have mainframes and COBOL. We will have relational databases for a long time to come. But in my humble opinion, the era of the relational database as the first and best option to solve a data problem has come to an end. It has come to an end because of the flexibility, sheer volume, incredible ingest rate, and query velocity a modern database requires.

When I recognized the relational gig was up at Oracle, I made the change and joined MongoDB at the end of July 2017. MongoDB went public on October 19, 2017. I am currently vesting the stock options I received before the IPO. Today less than three years later, the stock has gone up in value about 10 times its IPO price. Can I call it or what?

I don’t regret leaving Oracle to join MongoDB at all. Being in the industry gave me great insight into what is going on and why. I was able to get stock before anyone investing in the public market could. I joined MongoDB because it was obvious to me that MongoDB was going to succeed greatly.

It is good to be able to see why change happens, to embrace it, and to ride the wave. I hope that this article gives some clarity to see the changes that are underway. If you have massive amounts of data to deal with and need to modernize your tech stack, take a serious look at MongoDB.


Britton LaRoche

Staff Solution Engineer at Confluent, Former Principal Solution Architect at MongoDB, Former Principal Solution Architect at Oracle