A One Love World

Software can change the world. It already does.

How we build it and what we can build define the dreams our children can live. The world possible for our children.

But there is much work to do.

A Simple Scalable Internet Architecture

Piece by piece we built this old Internet. Part by part. Good parts. Good ideas for the times. Hundreds of programmers and hundreds of projects. Funny-sounding techie terms like Apache and NGINX and… MySQL and MongoDB and Cassandra and… Memcached and Redis and… Squid and Varnish and… Vagrant and Docker and…

So now we can build huge monster systems by pulling the latest choice parts out of the magic bag o’ open-source; gluing them together. Languages like JavaScript or Python or Ruby or Rust… with a bit of Bash or Cython and C++ and Delta minus minus… thrown in for good measure and to keep things complicated and fast… enough.

Trouble is, when the combinatorial complexity of the parts passes a certain threshold, one can no longer piece together more parts to make better systems. The temptation of the well-known path leads towards increasing complexity, and at some new threshold of usage, poor performance and scalability issues return again and again...

In an actual global-scale system built for tens of millions of users, each of the parts might be tens or hundreds of machines, with complicated algorithms and connection topologies between these machines.

This is why large systems are so hard to get right. They are complicated beasts more herded than controlled or managed or understood.

Complexity

Consider a simplified diagram of a typical high-scalability architecture for a site that includes both web and mobile / phone clients:

The above is considered a decent way of assembling the pre-existing parts, the good parts, mentioned in the opening paragraphs. Which parts are best for any given use and company changes over time, as new products arrive and others gain improvements that make them more suitable for a high-scaling system.

So what’s wrong with this? It works. Lots of companies do something similar. It seems to represent “best practice” for this sort of problem. Why propose something different?

In three words: exponentially increasing complexity.

To understand where this complexity comes from, let’s look at two of the most problematic cases for scaling, what I’ll call the big-data problem and the data-stream problem.

The big-data problem is one of how to manage the data for millions of users when that data is larger than will easily fit into a cost-effective, large, data-focused machine instance. This is a problem that all highly scalable systems face at some point. More users means more data. More data eventually fills up the available space. Moore’s Law helps move the line where this limit lies, but only to a point, as the constraints and bottlenecks move around the system. More data requires bigger pipes between servers, faster CPUs to handle the higher bandwidths, and larger and more complex underlying data storage systems.

The data-stream problem is one of how to manage updates to dynamic data streams. This is a problem that all social-media applications face when streams are built from individual subscriptions, via following or subscribing to data sources such as Twitter or Pinterest or Instagram accounts or Facebook friends. Users with millions of followers drive changes to the system that must be propagated widely. These changes update data in many places in a system with many caches. Queries to retrieve the data become more complex, and performance degrades when the system architecture keeps too much data in one location.

Big Data and Sharding

Sharding is the current best solution for the big-data problem. Sharding refers to splitting up the data in a large system so that only parts of the data (called shards) exist on any given database server (or cluster). For example, in a smaller system one might have all the data for all users on any given machine. In a scalable system, that data will be sharded according to some attributes that correlate closely with the way the data is accessed. For example, one might use a shard key corresponding to a country or region to place data in databases that correspond to particular regions so that users from the same area will have their data stored on the same machines as other users in that same region.

Another common application is to shard data based on tags or other attributes that correspond to the way the data is accessed. One might, for example, store all the data for posts tagged with #BlackLivesMatter together on the same database system. Search queries for the tag #BlackLivesMatter will be directed to the particular shard that contains that data. This might be done using range-based shards on the tags themselves, i.e. tags beginning with A-E might be on the first shard, F-M on the second, and N-Z on the third. The system would direct queries to each shard based on the letter range for the beginning of the tag itself.

To access the data in a sharded system, queries are routed to the correct shard based on the tags. This is typically done by a router component that knows where in the system each shard is stored and the range boundaries for each shard: A-E on shard #1, F-M on shard #2, N-Z on shard #3, in the above example. (For an example with more detail, see the specifics of how MongoDB handles sharding.)
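To make the routing concrete, here is a minimal sketch of such a range-based router in Python, assuming the three-shard A-E / F-M / N-Z split above; the shard hostnames are made up for illustration:

```python
# A minimal sketch of a range-based shard router (not any particular
# product's API). Shard hostnames are hypothetical.

SHARD_RANGES = [
    ("a", "e", "shard-1.example.internal"),
    ("f", "m", "shard-2.example.internal"),
    ("n", "z", "shard-3.example.internal"),
]

def shard_for_tag(tag: str) -> str:
    """Return the host that stores data for the given tag."""
    first = tag.lstrip("#")[0].lower()
    for low, high, host in SHARD_RANGES:
        if low <= first <= high:
            return host
    raise ValueError(f"no shard covers tag {tag!r}")

# Queries for #BlackLivesMatter go to the A-E shard.
print(shard_for_tag("#BlackLivesMatter"))  # shard-1.example.internal
```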

Data Streams and Fanout

The second hard problem for scaling is the data-stream problem. For example, a user makes a post, which must propagate to many of the users of the system. Here is what that might look like at a high level:

A stream server software component accepts the post from the user over an HTTP POST, or perhaps a REST API call made from a web or mobile app. The server determines which stream-cache servers have users subscribed to the feeds which contain that post, and then forwards the post on to those stream-cache servers. In a larger system, this might be hundreds or more servers.

The above diagram resembles a hand-held fan so the process is referred to as fanout. In this example, the post fans out to only the stream caches A, B and C, and not to cache D.
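Here is a minimal sketch, in Python, of what that fanout step might look like; the cache names and the in-memory subscriber map are hypothetical stand-ins for real servers and real subscription data:

```python
# A minimal sketch of fanout: push a new post only to the stream caches
# that hold its subscribers. The cache names and subscriber map are
# hypothetical stand-ins for real servers and real subscription data.

from collections import defaultdict

# author id -> stream-cache servers that hold that author's subscribers
SUBSCRIBER_CACHES = defaultdict(set)
SUBSCRIBER_CACHES["user_42"] = {"cache-A", "cache-B", "cache-C"}  # not cache-D

def push_to_cache(cache: str, post: dict) -> None:
    # In a real system this would be a network call to the cache server.
    print(f"pushing post {post['id']} to {cache}")

def fan_out(post: dict) -> None:
    """Forward a post only to the caches serving its subscribers."""
    for cache in SUBSCRIBER_CACHES[post["author"]]:
        push_to_cache(cache, post)

fan_out({"id": 1, "author": "user_42", "text": "hello"})
```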

There is no current “best solution” for this problem, since relatively few applications have large-scale data streams. Further, the solutions that these relatively few companies have developed are custom, and the best solution changes on a regular basis as these companies improve their architectures to adapt to ever-changing needs, competitive forces, and the hard realities of rapid growth of anything. For example, here is a detailed explanation of the Twitter architecture (as of 2013).

The data-stream problem can be handled in two main ways: first, using a data distribution system that propagates individual posts through the system directly to the servers which handle the data for the sets of users subscribed to those posts; and second, using queries which pull the data from a more generalized data store, building the set of posts with more traditional database-type queries (whether SQL or NoSQL) run against sharded systems like MongoDB.

Generally, fanout is better for realtime feeds and frequent users, while database-type queries are better for users that have not logged into the system recently. This is because doing a lot of work for users that don’t access the data wastes more resources than just doing more work every once in a while. Why waste memory and cycles sending data over limited and expensive bandwidth connections when there would be less work and less spent to reload their data the next time they log in? At the same time, why saddle your busiest and most important customers with the downsides of a system designed for the casual customer, since it is their use which defines your peak performance problems, and it is their happiness which defines your highest futures possible?
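As a sketch of that hybrid idea, assuming a one-week activity window and using plain in-memory objects as stand-ins for the stream caches and the sharded post store:

```python
# A minimal sketch of the hybrid approach: fan out (push) to recently
# active users, rebuild the feed with a query (pull) for dormant users.
# The window, objects, and stores are hypothetical in-memory stand-ins.

import time
from dataclasses import dataclass, field

ACTIVE_WINDOW = 7 * 24 * 3600  # treat users seen in the last week as "active"

@dataclass
class User:
    id: str
    last_seen: float
    stream_cache: list = field(default_factory=list)  # stand-in for a stream cache

posts_store: list = []  # stand-in for the sharded post database

def is_active(user: User) -> bool:
    return time.time() - user.last_seen < ACTIVE_WINDOW

def deliver(post: dict, followers: list) -> None:
    posts_store.append(post)                 # always persist the post
    for user in followers:
        if is_active(user):
            user.stream_cache.append(post)   # fanout on write for active users

def feed_for(user: User, followees: set) -> list:
    if is_active(user):
        return user.stream_cache             # cheap cache read
    # fanout on read: rebuild the feed from the store for dormant users
    return [p for p in posts_store if p["author"] in followees]

ada = User("ada", last_seen=time.time())                   # active this week
zed = User("zed", last_seen=time.time() - 30 * 24 * 3600)  # dormant
deliver({"author": "musa", "text": "hello"}, [ada, zed])
print(feed_for(ada, {"musa"}))  # served from her stream cache (push)
print(feed_for(zed, {"musa"}))  # rebuilt by querying the store (pull)
```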

Combinatorial Complexity Explodes

So what’s the problem with the Internet and the web that builds on it? Why propose anything new or different?

Consider again the diagram from the beginning of the article. This time, I’ve expanded the MongoDB node to include the sharding example above so you can see how the complexity introduced by sharding affects the system. I’ve grayed out the other system components so you can see where the sharding components add complexity:

Once you introduce the idea of sharding to the system, then you must account for the complexities introduced in the other parts of the system. In the beginning, this might be relatively simple, as you only need to shard one piece. Later, and sooner than you think if you are successful with your product, you may find that you need to do something analogous to sharding with the other system components. The images might be stored in one place initially, but as accesses rise, these too might need to be spread across many machines. There is no way to propagate the sharding idea throughout the system in an automatic way. You must write custom code.

Likewise with caches of frequent pages (using special-purpose caching programs like Memcached, for example). As the number of users rises, cache efficiency drops, because the likelihood of any given user’s page being cached drops inversely with the total number of users. So you need to split up your caches and build some code to route to the right cache. Some more code to rebuild caches when a machine crashes or hardware fails. Some more code to split up the search machines and route to the divided searches. Modern dynamic pages only make this problem worse by driving up query rates and making caching itself less efficient, since every user’s page will be constructed from parts that are largely unique to that user.
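One common way to “split up your caches and route to the right cache” is consistent hashing, so that losing or adding a cache machine only remaps a fraction of the keys. A minimal sketch, with made-up server names:

```python
# A minimal sketch of consistent hashing over a ring of cache servers,
# so a key always maps to the same server and adding or losing a server
# only remaps a fraction of the keys. Server names are hypothetical.

import bisect
import hashlib

class CacheRing:
    def __init__(self, servers, replicas=100):
        self._ring = []  # sorted list of (hash, server) points on the ring
        for server in servers:
            for i in range(replicas):
                point = (self._hash(f"{server}:{i}"), server)
                bisect.insort(self._ring, point)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def server_for(self, key: str) -> str:
        """Return the first server clockwise from the key's position."""
        idx = bisect.bisect(self._ring, (self._hash(key), "")) % len(self._ring)
        return self._ring[idx][1]

ring = CacheRing(["memcached-1", "memcached-2", "memcached-3"])
print(ring.server_for("user:12345:homepage"))  # always the same server for this key
```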

It’s not the same as the old days when the Yahoo home page was the same for everyone. All those queries for each user and access drive up the system requirements and complexity in nonlinear ways.

Add some more code to handle your PostgreSQL databases, which might only contain a few critical tables; perhaps you want to put those on MongoDB too, but then you have to build another system that shards based on username instead of starting tag letter. More code for this… more code for that… pretty soon no one is even capable of understanding the entire running system, and disaster strikes.

Metadata to the Rescue — The Lesson of Oracle

The underlying problem is that the systems which scale must encode the semantics of their particular problems in custom code rather than metadata. The solution of sharding or splitting up your data across many machines or data stream fanout needs to be built custom in too many places. The backbone of redundant systems across many data centers must be implemented in custom code. The solution for handling hardware outages and software crashes must be implemented in custom code… And this doesn’t even account for the additional complexities required to implement big-data artificial intelligence analysis at scale.

All the custom code and glue connecting disparate pieces makes the system more fragile and harder to test, as custom code introduces many more paths through the system. These additional paths introduce more failure points and vulnerabilities that sometimes show up as system-wide or widespread outages of critical parts.

The current situation for global-scale web systems is analogous to the way that databases were used in the days before SQL. Database access routines were hard-coded into the systems which used the data in what is now referred to as imperative programming. You would tell the system exactly what to do and it would exactly do, and only exactly do what you said, even when what you said wasn’t what you really wanted; or might have asked for, had you known what was possible, and what the real issues were.

Custom code is a straitjacket if it is not built to be flexible.

SQL and declarative programming provided that flexibility using metadata. The metadata describing a system now supplied sufficient information so that the database system itself could use abstract algorithms to generalize the data retrieval problem across different usages of the data. After the first larger commercial SQL systems like Oracle, users of database systems could describe their data, then put that data into the system, and Oracle would figure out how to access it. You’d declare what you wanted, not how to get it, and the magic Oracle behind the scenes would give you the answer.
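As a tiny illustration of the declarative idea, here is a sketch using Python’s built-in sqlite3 module; the table, the index, and the rows are hypothetical, and the point is only that you describe the data and state what you want, while the engine decides how to get it:

```python
# A tiny illustration of declarative access: describe the data (metadata),
# ask for what you want, and let the engine decide how to get it. Uses
# Python's built-in sqlite3 purely for illustration; table and rows are
# hypothetical.

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, region TEXT, name TEXT)")
db.execute("CREATE INDEX idx_region ON users (region)")  # metadata the planner can exploit
db.executemany("INSERT INTO users (region, name) VALUES (?, ?)",
               [("west-africa", "Ada"), ("east-africa", "Zane")])

# Declarative: we say *what* we want; the query planner picks index vs. scan.
rows = db.execute("SELECT name FROM users WHERE region = ?", ("west-africa",)).fetchall()
print(rows)  # [('Ada',)]
```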

Larry Ellison is a very wealthy man because Oracle solved a huge problem for large-scale IT organizations and made it much easier to build systems that actually worked reliably. The abstraction of tables, and the semantic processing enabled by having metadata relating table columns to data types, indices and query access plans, allowed large commercial users to stop worrying about how to get the data so they could concentrate only on the data needs of their applications.

Consider Efficiency

What’s the ideal scenario from an efficiency perspective?

First, let’s look at what our current systems do. With so many different components, a typical login and access of the home page for a site might involve these paths:

The web or mobile app first: 1) reaches the AppServer cluster through the Load Balancer, then 2) connects to a Django process, then 3) the Django code checks the cache of users stored in Memcached to see if the user is already logged in or has cached user-state information and returns it if so, and if not 4) the Django code hits the PostgreSQL database to get the user data.
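A minimal sketch of steps 3) and 4), with plain dictionaries standing in for Memcached and PostgreSQL (all names are hypothetical):

```python
# A minimal sketch of steps 3) and 4): check the cache for user state and
# fall back to the database on a miss. Plain dicts stand in for Memcached
# and PostgreSQL; all names are hypothetical.

memcached = {}  # stand-in for the Memcached user-state cache
postgres = {"user_42": {"name": "Ada", "logged_in": True}}  # stand-in for PostgreSQL

def get_user_state(user_id: str) -> dict:
    state = memcached.get(user_id)
    if state is not None:
        return state               # 3) cache hit: no database round trip
    state = postgres[user_id]      # 4) cache miss: hit the database
    memcached[user_id] = state     # populate the cache for next time
    return state

print(get_user_state("user_42"))   # first call misses, later calls hit the cache
```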

That’s just for a typical login at scale. Displaying the home page is even more complex and involves even more paths:

The web or mobile app first: 1) reaches the AppServer cluster through the Load Balancer, then 2) connects to a Django process, then 3) the Django code checks the cache of users stored in Redis to see if the user’s page is already cached and, if so, returns it, and if not 4) the Django code hits the MongoDB database to get the data required to display the page; once the page is loaded, the graphics files it references must 5) connect through the Load Balancer to the ImageServer Cluster, then 6) check the Memcached cache of frequently-used images, and if not found 7) hit the NFS Storage system to retrieve the image. It must do this for every single image referenced by the home page.

Note that these are relatively simple cases. For many real-world sites and apps, the number of different components which participate in a single page display can be many times greater.

Optimum Efficiency

What is the optimum scenario from an efficiency perspective? Consider the ideal machine topology for connections and networking. Optimally, the web or mobile app connects directly to a single machine which contains the full stack of components (which I’m calling a full-stack monolith) and all the data necessary to build the home page for a subset of users so the results can be returned directly to the calling app without any components having to connect to other machines over the network.

Most servers are i/o limited, so eliminating network i/o will increase efficiency and system throughput. Here’s what this might look like using machines that contain typical current components (a bit later in this article we’ll look at how to optimize those components themselves):

If one distributes the data for each user according to identical sharding and fanout algorithms, then it is possible to make sure that all the data needed for a given user is contained on at least a few monolithic systems, each containing all the parts necessary to return the results of requests to the web or mobile app without having to make any network requests.
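A minimal sketch of that idea: one shared shard function that deterministically maps a user to the same small set of monoliths for every kind of data. The hostnames and replica count are made up:

```python
# A minimal sketch of one shared shard function: the same user always maps
# to the same small set of full-stack monoliths for posts, caches, and
# images alike. Hostnames and replica count are hypothetical.

import hashlib

MONOLITHS = ["mono-a", "mono-b", "mono-c", "mono-d", "mono-e", "mono-f"]
REPLICAS = 3  # each user's data lives on this many monoliths

def monoliths_for(user_id: str) -> list:
    """Deterministically pick the same replica set for every kind of data."""
    h = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    start = h % len(MONOLITHS)
    return [MONOLITHS[(start + i) % len(MONOLITHS)] for i in range(REPLICAS)]

# Posts, caches, and images for user_42 all live on the same three machines,
# so any one of them can answer without a network hop to another system.
print(monoliths_for("user_42"))
```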

The app might get the address of this machine using a query to the custom load balancer the first time and remember that address in local storage, so the endpoint monolith could be contacted directly the next time. The load balancer could return the list of all the machines which contain that user’s data.
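And a sketch of the client-side half, with a dictionary standing in for the app’s local storage and a stub standing in for the load-balancer call; all names are hypothetical:

```python
# A sketch of the client side: ask the load balancer once, remember the
# answer in local storage, and contact the monolith directly afterwards.
# The storage dict and the load-balancer stub are hypothetical.

import json

local_storage = {}  # stand-in for browser localStorage / mobile app storage

def query_load_balancer(user_id: str) -> list:
    # Stand-in for the one-time HTTP call that returns the user's monoliths.
    return ["mono-d", "mono-e", "mono-f"]

def endpoints_for(user_id: str) -> list:
    cached = local_storage.get("monolith_endpoints")
    if cached:
        return json.loads(cached)              # no load-balancer hop needed
    endpoints = query_load_balancer(user_id)   # first time only
    local_storage["monolith_endpoints"] = json.dumps(endpoints)
    return endpoints

print(endpoints_for("user_42"))  # queries the load balancer
print(endpoints_for("user_42"))  # served from local storage
```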

Now any page or API request will be able to do all the processing on that single monolith machine without having to do any network requests to other systems. This will reduce network latencies and greatly reduce i/o as a bottleneck.

For redundancy and backup purposes, you would replicate any given user’s data across several different machines each of which could serve that user’s data whenever asked:

In this example, a change is made by a user connected to full-stack monolith D (Mono D); this change, in turn, is propagated to Mono E and Mono F, which hold the redundant copies of the change made first on Mono D. This limits the network communications to only those necessary to propagate changes to the redundant nodes.
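A minimal sketch of that replication step, with made-up monolith names and a print standing in for the network call to each replica:

```python
# A minimal sketch of replication: the primary monolith applies the change,
# then forwards it only to that user's replicas. Monolith names are
# hypothetical, and a print stands in for the network call.

REPLICA_SETS = {"user_42": ["mono-d", "mono-e", "mono-f"]}  # mono-d is primary here

def apply_change(host: str, change: dict) -> None:
    print(f"{host} applied change {change['id']}")  # network call in a real system

def write(user_id: str, change: dict) -> None:
    primary, *replicas = REPLICA_SETS[user_id]
    apply_change(primary, change)      # handled locally on Mono D
    for replica in replicas:           # the only network traffic: replication
        apply_change(replica, change)

write("user_42", {"id": 7, "field": "bio", "value": "One Love"})
```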

Now let’s look at our “optimum” monolith. If we examine the parts, we can look at the abstract roles those parts play. Here’s a view of the example monolith that looks at the roles each part plays on an abstract basis:

So in our example:

Django / Python — a web and REST API application server

Redis — caches frequently used data

PostgreSQL and MongoDB — indexes and storage to speed up access to data

NFS Storage — persistent storage of images and data

Since these parts are now running on the same machine, there’s less need for them to be separate components or to use network protocols to connect them. In fact, there can be considerable performance improvements gained by simplifying the parts down to their bare essentials and removing or simplifying the interface layers between some of the components where possible.

There are many, many different optimizations that might be done here, depending on the components one considers, since all the components now reside on a single machine. In general, it will be considerations of threading or process management that drive the ways the components are separated across processes.

Fractal Search

The one remaining problem that this architecture doesn’t yet address is search. While having only the data for a subset of users on each of the monos is fine for serving responses which only require that data, it won’t work for search in many instances. If you are trying to find someone in a database of users that spans the globe, you can’t search a database which contains only a subset of the users; you need to search the whole globe’s database.

The simple solution to this problem is to extend the sharding to implement fractal hierarchies of arbitrary form, which can be used to automatically build fractal decompositions of global data. Some pictures and a couple of specific examples will make this clearer. In the following example, sharding for the monos is done geographically, so the Lagos mono would contain the data for users in Nigeria and most probably all of West Africa, Nairobi would contain East Africa, and Cape Town would contain South Africa and nearby countries and cities.

Note, however, the addition of monos that contain supersets of the typical mono. The purpose of these special nodes is to support the larger datasets required to make global or regional-level queries. The basic idea is that certain tables and columns (or document types and fields) would have their changes propagated to aggregating search nodes, which would contain sufficient aggregate data to be able to serve those requests directly from the database itself. Only data that one might want to search for globally would require this special handling. So most user data wouldn’t be searchable in this way; only the public data that one might consider for search would be propagated to these aggregating nodes.

This general approach works for any type of search. If the search should be limited to a particular subset of data relevant to a given user, then it can be directed to the mono handling the data for that user. In the following example, we see how a user home-hosted in the Lagos, Nigeria shard would also potentially want to search in high-level contexts like: Africa or Global (or perhaps even across other geographies like Europe or Asia).

Normal queries would be run against that mono directly. However, search queries for all of Africa or all Global data need to be directed against one of the higher-level monos which contain that data.
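A minimal sketch of that routing decision, assuming a hypothetical hierarchy where Lagos and Nairobi roll up into an Africa-level aggregate, which rolls up into a global aggregate:

```python
# A minimal sketch of routing a search to the right level of a fractal
# hierarchy: the user's home mono for normal queries, an aggregate mono
# for Africa-wide or global searches. Hierarchy and hosts are hypothetical.

SEARCH_HIERARCHY = {
    "lagos":   {"parent": "africa", "host": "mono-lagos"},
    "nairobi": {"parent": "africa", "host": "mono-nairobi"},
    "africa":  {"parent": "global", "host": "mono-africa-agg"},
    "global":  {"parent": None,     "host": "mono-global-agg"},
}

def search_host(home_shard: str, scope: str) -> str:
    """Walk up from the user's home shard until we reach the requested scope."""
    node = home_shard
    while node is not None:
        if node == scope:
            return SEARCH_HIERARCHY[node]["host"]
        node = SEARCH_HIERARCHY[node]["parent"]
    raise ValueError(f"{scope!r} is not above {home_shard!r} in the hierarchy")

print(search_host("lagos", "lagos"))   # mono-lagos: normal, local query
print(search_host("lagos", "africa"))  # mono-africa-agg: Africa-wide search
print(search_host("lagos", "global"))  # mono-global-agg: global search
```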

Now these fractal hierarchies don’t have to be geographical. They can be other types of decompositions: one could shard against domains in particular fields. Searches might be across all of genomics at the global level, just a particular species at another level, and just a small segment of a genome at yet another. Any hierarchy or hierarchical subgraph of any network can be sharded in this way.

Distributed and Private

Now here’s the best part…

Most of these nodes don’t even have to reside in a data center. In fact, for efficiency, speed, and cost optimization taken to the next level, what you really want is edge servers: sitting right in the local telco’s exchange office, your own bedroom, or the place of your gaming buddy Zed, the guy the whole crew goes to for all things tech and hardware, who happens to have a gigabit pipe straight to the best Internet hubs in the tri-state and wider metro area. Everything is one short hop from your phone or laptop no matter where you are and where you go.

So instead of sending queries to the one true protector of our Equifaxic irony, the bastion of consumer protection, your data stays in your hood with your gal holding the keys. You get fast response and privacy protection because the most efficient way to do it happens to also be the most protective and most secure. Since the data stays in our servers… we get to see only the ads we want, and they have to pay us to see them… not some other jerks who see us as eyeballs to sell and opinions to propa-grand-eyes (or propagandize, if you prefer a bit less license poetic and comment parenthetic).

We can build this. Starting today. We don’t even have to invent new science to make it work at global scale.

We need only apply the lessons learned from past failures and the subsequent hard-earned successes and solutions which followed that failure.

It won’t be easy. It won’t happen overnight. But it will change the world.

A One Love Future

Once we have a global-scale software network that runs on metadata designed to support the parts efficiently, we can build the tools for a global conversation… and for a global transition… and even emancipation. The problems we see right now on the global stage are not unsolvable, they’re just impossible in this configuration, on yesterday’s structures with last millennium’s leaders running the pre-B.C. gameplan of Pharaohs and Kings and Conquest.

What if 1 million of us got together to organize and deploy global aid to help those suffering from earthquakes and hurricanes and floods and other disasters? They’re called natural, but they’re hard to bear when they happen to you and yours. Can’t we show our compassion better than the giant bureaucracies of last century’s philanthropy?

If 75.001% of the world signed up to an agreement to stop their country from waging wars, would governments listen? What if we organized on a massive scale? Could they not listen?

Can’t we put the minds of our engineers and scientists to better use helping people than building ever-better killing machines? What do we really want? Are there better ideas for how to get those things for more of us, more often, and with less fear and anxiety?

Can’t we do better than the centuries-old ideas of a few old men deciding the fate of the world with our children as their pawns? How hard would it be to move the levers that build on profit from pain towards joy and fun and meaning? If we all worked together?

If we all talked together?

Do we really need my-money-is-more-important-than-your-time advertising anymore? Don’t we really just want cool products, good food, nice places to live, and awesome friends to hang out there with? Where does advertising benefit us in this?

Do we really benefit by having some of our best story tellers telling stories that make us do things that are not what we really want, because it works? For them anyway…

Fact is, cool products get made by artists and diehard fans and careful clever students of beauty and design. Good food is grown by those who care about their work, who love their land, and who have the tools and know-how to work with our mother Earth the way she is. Nice places to live get built by kraftwerkers and builders who love their work and have pride and integrity, who have lived well enough themselves to know what better looks like, and who have the tools and opportunities to do things right.

Art.

It turns out… there is more than enough money around these days to make all these ideas not only possible, but likely, inevitable even. The time is right and the alternatives are not only bleak… they just plain suck.

But it’s not going to come to that… we have the power of code.

Software, more than power, more than money, more than politics and more even than religion these days… runs this world. And beautiful software, well-built, solid and with screaming performance is possible for all of us right now. We don’t have to spend more. We don’t have to wait long. We just have to decide…

Yeah, I want to build that… and yeah, sign me up, how can I help?