Skip to end of metadata
Go to start of metadata

UPDATE 2017-1-10: highlights from this blogpost was published in ODBMS. See

1. Introduction

Twenty years ago I entered the database field. Since I left David DeWitt's Microsoft Lab in 2012 I have been working in Mike Stonebraker's big data startup Over the years I have gained a lot of insights in database systems and distributed computing. Recently invited me to give a two-hour mini lecture on big data. In preparation, I studied (or re-studied) a large collection of big data materials. This blogpost reviews the landscape of big data platform as I see it, by the beginning of 2017. By "platform" I mean the layer in the big data stack that is two layers below analytics. By "review" I mean to present existing knowledge, rather than new research.

The blogpost could be valuable to you in the following ways.

  • Provides information on the up-to-date big picture of big data platform.
  • Offers key lessons I learned and some opinions.
  • Includes links to carefully selected materials for further reading.

2. Big Data Overview

Why big data?

Data volume is large and grows very fast. Today the world has 10 billion TB of data. Furthermore, 90% of the data was produced in the past two years. Each data 2.5 EB of new data is gathered, which is equivalent to the total storage of 150 million iPhones.

(Mikal Khoso. “How Much Data is Produced Every Day?”

Naturally, the number of big data jobs grew, too. The percentage of job postings that contain "big data" grew by 10X in the past 5 years.


There are 20 or so big data roles. (“Top 20 Big Data jobs and their responsibilities”. People often ask the question "I want to find a big data job; how shall I start?" Knowing the various roles may provide some idea.


  • Be familiar with big data roles and responsibilities.

What is big data?

According to Wikipedia: "Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate to deal with them."

But the term "big data" has been used at least to mean three different things. One meaning is "big datasets" or a lot of data. For instance, one may say "this year our users uploaded 10X more videos; we have big data now.” The second meaning is "big data analytics". For instance, a hedge fund may "use big data to predict stock trends". The third meaning is the whole "big data stack".

In Gartner's original proposal, big data had three V's: Volume, Variety, and Velocity. Later at least two more V's were added: Value and Veracity. (Jason Williamson. “The 4 V’s of Big Data”.

Big Volume means the amount of data is big. Note that whether data volume is big or small can be a relative concept. A Facebook engineer told me that for them, data with volume less than 10 PB is considered small. I told him "your small data is my big data". In general, data is big for one organization if it exceeds the processing capabilities using the existing infrastructure and platform.

Big Variety means the range of data types, formats, and sources is big. In theory data scientists are provided well-formed data and delightedly work with sexy machine learning and analytics algorithms. In practice majority of their time are spent grumpily to extract/transform/load data coming from different sources, having different types or formats, and containing plenty of errors or inconsistencies. For four decades we RDBMS people have been telling the world (such as scientists) that they were wrong for not putting their data in database systems. We had very limited success. Today only a tiny fraction of the world's data (probably less than 10%) is in databases. In my opinion a key reason is that existing database systems support small variety. Basically they only work if the data model is relational, and they require non-trivial ETL. The landscape has been disrupted first by the NoSQL movement and then more violently by Hadoop. I do not believe Hadoop (or even Spark) is the final answer (for one thing, using Java to develop platform may not be the best idea). But whatever solution that is settled on, it has to support big variety.


  • An important feature missing in RDBMS is to support big variety.

Big Velocity means the speed of data in and out is fast. This is particularly useful for streaming applications such as real-time finance. Related to data movement are federated databases. See for instance MIT's current Polystore project. Mike Stonebraker mentioned to me that "every five years there was a federated database but they all failed." A key reason was that plumbing data movement between systems is hard. Especially when data volume is big, you want to try keeping data where they are. A federated database by design will move data a lot between systems – suck the query result from one system and inject to another.


  • Move query to data, not data to query.

Value means business value. Many technical people tend to pay a lot more attention to technology than to business value. But any book on building a startup company will make the point that building a company is much more than doing technology. When making decisions on technology, always have in mind the purpose to increase business value.


  • Do big data for increasing business value.
  • Don’t do technology purely for technology purpose.
  • Read a book on how to build a startup company.

Veracity means correctness. If you as a data scientist suggests a billion-dollar action based on analyzing some data, be prepared to answer the question "where did you get the data from?" While a data lake provides you big variety (i.e. you can dump to it data with arbitrary types and formats), don't abuse it. Establish and follow policies that ensure the quality of data that enters the lake, and maintain enough metadata such as who modified it when and how. Philip Russom suggested to use data lakes not data swamps. The two share in common that both are capable of storing big datasets with big variety. I vision a data lake as a cleanly organized storage room with racks, boxes, and labels where everything is where it belongs; and a data swamp as a room covered with feet-high stuff and garbage all mingled together.


  • Use data lakes, not data swamps.
  • Read Russom’s “Best Practices for Data lake Management”.

(Philip Russom. “Best Practices for Data Lake Management”.

Is big data for you?

It depends.

A lot of organizations are jumping onto the "big data train", e.g. allocating a lot of money migrating to a Hadoop cluster. At least some of them are doing so not because they have done analysis which showed they needed it, but because they heard a lot of others are jumping on it. Remember the quote that big data is like teenager sex?

Turing Award laureate Mike Stonebraker said that "big data marketing hype has included a constantly evolving message, and IT pros keep falling for it." (Robert Gates. “Stonebraker: IT pros fall for big data hype”.

Frank McSherry said "if you are going to use a big data system for yourself, see if it is faster than your laptop." (Frank McSherry. “Scalability! But at what COST?”


  • Use reason, not emotion.
  • Choose big data only if you need.


Big data market cap

The market cap of big data is in the order of some hundred billion dollars. For a lot of companies, it is not possible to accurately tell what percentage of its market cap is for big data. The graph below shows the overall market caps of selected companies.

( 1/3/2017)

The top three cloud service providers are —Amazon, Microsoft, and Google.

—Apple has been using the cloud services from all these three, but recently started its own “Pie” platform project.

Oracle recently entered the cloud world, too. When —Larry Ellison announced Oracle's cloud offering in OpenWorld 2016, he said something like “Amazon’s lead is over”. Take it with a grain of salt, of course.

Samsung has been using Google platform, but its purchase of California-based cloud-computing company Joyant on July 15, 2016 indicated that it is serious about having its own big data platform.

It is worth mentioning two Chinese companies both having a market cap larger than that of Oracle and Samsung: Tencent (creator of WeChat, the instant messaging app which is used by virtually every single person in the world who is capable of typing Chinese into a smart phone) and Alibaba (the Chinese version of Amazon). One interesting observation is that while Databricks Spark won the 2014 world record of Gray Sort Benchmark (sorting 100TB data as fast as possible; see, Alibaba FuxiSort won the 2015 record (377 seconds using 3,377 nodes) and TecentSort won the 2016 record (134 seconds using 512 nodes).

The market caps of some other well-known companies are as follows:

  • Hadoop:
    • HortonWorks has a $0.5B market cap.
    • —Cloudera has been intentionally delaying IPO but was estimated to worth $5B.
    • —Databricks was estimated to worth $0.3B.
  • —Baidu ($58B).
  • HP ($39B HPE + $25B HPQ). So HP bought Vertica, sold it, and now is split into two HPs.
  • —Yahoo ($37B)
  • —Dell+EMC ($34B). So Dell which was private for a long time has merged with EMC and went public.
  • —Ebay ($33B)
  • —LinkedIn ($27B)
  • —Twitter  ($12B). Not sure why exactly but Twitter's stock price kept going down while Facebook's price kept going up.
  • —Teradata ($4B)

The market is constantly changing. keeping an eye on it may help developing the correct "sense". Making architectural decisions in an isolated environment may risk reaching local-optimal solutions.


  • Be aware of the market.

3. Big Data Stack

I'd consider the big data stack as having four layers, using terminologies from cloud computing.

As an example, a traditional BI scenario may be that users interact with a "thin client" web browser. At the server side business logic is implemented through a web application, which issues SQL queries to a database server to access data.

As another example, analytics may be performed by writing Python code to process data, using packages such as scikit-learn or numpy, on Jupyter Notebook.

Some sample research areas may be:

Note that terminologies may differ. For instance, a very good article about Google's architecture uses "infrastructure" and "platform" for exactly the opposite meanings compared with the above graphs. (See

Again the focus of this blogpost is on the platform layer.

4. Big Data History

Mike Stonebraker said "what goes around comes around." David DeWitt said "everything has prior art." It is wise to stand on the shoulders of giants rather than to shut yourself in a cave and reinvent the wheels. This is easier to say than done because there are so many materials out there that no one can study them all. How to select the best resources to study? Bookmark a small list of good "meta" resources that do the selection for you. Among my list of such "meta" resources, I'd strongly recommend the following two.

  • The "red book", i.e. "Readings in Database Systems" edited by Peter Bailis, Joe Hellerstein, and Mike Stonebraker ( A big contribution of the editors, in addition to selecting the best resources for you and me, is to write a super-interesting introduction for every chapter. Arguably the introductions may be even more useful to many people than the referenced articles. Among all articles referenced by the red book, if I have to pick one to suggest, I'd say "Architecture of a Database System" by Joe Hellerstein, Mike Stonebraker, and James Hamilton. I myself read it multiple times.
  • The all-time-favorite posts on HighScalability ( There are a lot of good articles there. I personally found the articles with titles containing "architecture" more interesting. 


Below I highlight the big-data history by listing some high-impact milestones, most of which corresponding to articles cited in the red book. For each article, among all authors I pick one to mention for brevity.

  • —1969: relational model (Edgar F. Codd). The relational model is very simple – rows and columns. The impact is, however, very high. Most database systems nowadays use this model. Edgar Codd won the 1981 Turing Award.
  • —1976: System R (Jim Gray). It's been four decades; RDBMS systems are still largely using the System R model. An important legacy is transaction management. Jim Gray won the 1998 Turing Award.
  • —1986: Postgres (Mike Stonebraker). In his Turing Award 2014 speech, Mike Stonebraker talked about the similarities between building Postgres and riding a bicycle across America. One important legacy of Postgres is to support extensibility through abstract data types. Another legacy is a generation of highly trained DBMS implementers. Among others, Cloudera's founder Mike Olston was a key developer on Postgres.
  • —1990: Gamma (David DeWitt). Gamma popularized the shared-nothing architecture, which is used by essentially all data warehouse systems today. Also, like the Postgres project, the Gamma project also trained a generation of DBMS implementers. For instance I was told that half of the engineers on the Microsoft SQL Server team were from Wisconsin.
  • —2004: MapReduce (Jeff Dean). Google is considered the "king of scalability". MapReduce, with all its drawbacks, certainly disrupted the DBMS landscape, hard. The key legacy from MapReduce is that it revealed the importance of flexibility. As Peter Bailis put it in chapter 5 of the red book, there are schema flexibility (not only relational), interface flexibility (not only SQL), and architectural flexibility (open architecture with replaceable components).
  • —2005: One size does not fit all (Mike Stonebraker). Motivated by that famous quote, entrepreneurs have founded a series of startup companies. To name a few: Vertica worked on distributed column store; StreamBase worked on streaming data; Paradigm4 worked on array-modeled distributed computational databases; VoltDB worked on in-memory databases; and Tamr worked on indexing and unifying many databases at the schema level.
  • —2006: Hadoop (Doug Cutting). Many people think of "big data" as Hadoop (which is an open source version of Google's MapReduce). Today the Hadoop ecosystem has evolved to be a >$10 billion market, and continues to grow fast.
  • —2011: Spark (Matei Zaharia). Among technologies in the Hadoop ecosystem, Spark is one of the most widely adopted. Cloudera for instance is using Spark to replace MapReduce.
  • —2016: AlphaGo (David Silver). An interesting consequence of the big data development is the re-birth of artificial intelligence, which was hot in the 1980s but stayed cool mid-1990s to 2010. One representative milestone was that in 2016 Google's AlphaGo completely beat top human players for the Go board game. A Go board has 361 spots so the overall search space is 3361 which is impossibly large – equivalent to taking the number of atoms in the observable universe and multiplying that by 10100. (As a comparison, a chess board has 64 spots.) Before 2016, most people including me firmly believed that computers could not beat human in this game as long as we lived. It looks to me that AI and deep learning will only get hotter.


  • Don’t reinvent the wheels.
  • Read the editors’ introduction for “the red book”.
  • Read "Architecture of a Database System".
  • Study all-time-favorite posts on HighScalability.
  • Be familiar with deep learning.

(Bailis, Hellerstein, Stonebraker. “Readings in Database Systems”, 5th Ed.

Stonebraker's loop

Back in 2004 when Mike Stonebraker started brainstorming with professors the Vertica idea, I was on his team. I was an Assistant Professor then who had never dived deep into a big system. One reason why I quit from the project was that I thought the idea was too simple – "storing data by column instead of by row; where is the intellectual beauty?" Only until much later have I truly realized the beauty of the KISS idiom. Here is an outline of Mike Stonebraker's activity loop as I observed:

while (true) {
    1. Talk with the users to find their pain;
    2. Brainstorm with professors;
    3. Recruit students to build a prototype;
    4. Draw a quadrant;
    5. Co-found a VC-backed startup;
    6. Play banjo; write papers; give talks; receive awards;

People who know Mike might smile at step 4 because he seems to have a special liking of quadrants. For example, the quadrant he drew for SciDB was something like: R is good for complex analytics but only for small data; Hadoop is good for big data but only for embarrassingly parallel problems; SciDB is good for doing complex analytics on big data.


  • Don’t invent problems; talk to the users.
  • Keep it simple.

5. Google Platform

Google cluster at the beginning

(Abhijeet Desai. "Google Cluster Architecture".

Google search: basic ideas

At the beginning, Google had a single product – the search – with an extremely simple interface. In fact at the beginning there was not even a button to click; you type the ENTER key to search. But Google certainly did the search product really well.


  • Do one thing REALLY WELL first.


A web crawler starts with a URL pool, and iteratively downloads the HTML files, parses them, and adds additional URLs to the pool.


—Global ordering

The crawled web pages are ordered through some version of the famous PageRank algorithm. The ideas are: (a) A page that is referenced by a large number of pages is important; and (b) A page that is referenced by an important page is important. Implementation wise this could be a repeated matrix-vector multiplication, where the matrix corresponds to the links and the vector stores the ranks.



The web pages could be indexed using an inverted index.

Given the following collection of URLs and the words in each web page:




pokemon, magic


platform, service, PaaS


big, data, platform


data, scientist


data, service

—The corresponding inverted index may be:






C, D, E






C, B






B, E

For each word, the list of URLs is ordered by page rank.

The inverted index enables efficient retrieval of URLs containing the keywords that are searched on. —Hypertext matching is performed between the keywords and each retrieved web page, to decide its local order. The global and local orders of a web page are combined to decide the order a web page should appear on the client's web browser.

As a side note for web developers: —60% of Google search traffic came from mobile, and —Google demotes mobile-unfriendly web pages. So it may be a good idea to be mobile friendly.


  • Make your web applications mobile friendly.

Google platform challenges

—The basic ideas above sound good and clearly point to an implementation. What is the problem? Data is too big. Some numbers:

  • —130 trillion web pages.
  • —100 PB index(stacking 2 TB drives up: 0.8 mile)
  • ——3 billion searches per day (or 35,000 per second)

What does 100 PB mean? Imagine you drove on a highway for a minute (0.8 mile), and dropped a 2 TB hard drive for every single inch your car passed by. That is how much 100 PB is. Of course in the early days Google's data was much smaller than that; but still, Google had to build its own platform because no DBMS then could handle.


Google data centers

Google's Belgium center looks like follows.

(Malte Schwarzkopf. "What does it take to make Google work at scale".

Google kept to themselves the information about their data centers. Below are what people estimated:—

  • about 2 million machines
  • —about 40 data centers
  • —machines are organized in containers each having 1,160 machines, or 30 racks of 40 machines each

(James Pearn, “How many servers does Google have?”

(“Learn How Google Works: in Gory Detail”.

Google platform

While some other companies view platform as an expense, —Google views platform as a competitive advantage, and thinks of itself as a systems engineering company. Todd Hoff pointed out that an underappreciated advantage of investing in platform is to enable junior developers to quickly create robust applications. (Todd Hoff. “Google Architecture”. If this is true for Google, this is probably true for you and me, too.



  • Treat platform seriously.

Google platform supports its products.

Within the platform layer, there is also architecture as some components are built on top of some others.

(Malte Schwarzkopf. "What does it take to make Google work at scale".

For instance, GFS provides service to BigTable.

In GFS, data are chunkified to 64 MB each, and the data chunks are replicated to different chunk servers. Data chunks do not go through the master nodes. When clients need data, they ask the masters which chunk server to talk to, and then retrieve the chunks from there.

BigTable is essentially a 3D array representation of relational tables, where the dimensions are rowID, columnID, and timestamp. BigTable processes tablets, which are stored as chunks in GFS.

(“Google Big Table Architecture”.

6. Scale up

Scalability is the capability of a system to handle a growing amount of work.

One typically adds resources (such as the number of CPU cores or the number of machines) to handle more work. As more resources are added, an ideal system scales linearly. If a system continues to use the initial resource while ignoring the newly added, it does not scale at all. In practice, the behavior of a system is somewhere in between: as you throw more and more resources in, the system scales initially but stops scaling at some point because some bottleneck is hit. Adding more resources could even lead to worse performance, e.g. due to too much communication overhead.

—Scale up means to use more CPU cores in the same machine. It is also called vertical scaling or parallel computing. On the other hand, scale out means to use more machines. —Note that scale-up technologies are not limited to a single-machine configuration, because in a multiple-machine cluster, every machine can implement single-machine technologies in addition to cluster-specific technologies. This section discusses scale up, and the next section discusses scale out.

CPU clock speed flattened since 2003.

(Bob Warfield. “A Picture of the Multicore Crisis”.

On the other hand, a different form of Moore's Law continued to apply: the number of CPU cores continued to double every 18 to 24 months. You could buy a 200-core computer today for instance.

What do you do with 200 cores? —Obviously you want to maximize throughput. So keeping one core busy while having 199 cores idle is out of the question. Using too many threads may also be problematic due to high context switch cost and the fact that if one thread sits in the READY queue, all threads that are waiting for it cannot make progress either.

There are at least four options. Option A uses single-threaded systems, while Options B1, B2, and B3 use multi-threaded systems. Personally I believe option B2 is the best.

—Option A: divide the machine and use a single-threaded system

—This option is used by VoltDB and is called NewSQL by Mike —Stonebraker.

The advantage is that implementation is simple and the system may be efficient because there is no lock or latch.

The disadvantages are: —having high inter-process communication (especially if there are too many cores), and —being vulnerable to uneven load.

—Option B1: use ad hoc threads

This is the most familiar method to programmers who write parallel programs. Your program creates and manages multiple threads so as to make multiple cores busy. Synchronization primitives may be used if needed.

For example, you could write code like the following to sort an array.

t1 = new thread(QuickSort, first_half);
t2 = new thread(QuickSort, second_half);
t3 = new thread(Merge, first_half, second_half)

In practice, a parallel sort algorithm creates more than two sort jobs and more than one merge job.

While simpler than B2 and B3, this option may have the following drawbacks:

  • Thread creation and management overhead. Your program creates threads dynamically and needs to handle management overhead such as dealing with deadlocks.
  • —May either under-utilize the cores (if you create too few threads), or over-utilize (e.g. when there are too many concurrent instances of the sort program each independently creating a lot of threads).

Option —B2: thread pool + async programing

Microsoft SQL Server's storage manager is completely async. In general, web servers, database servers, operating systems all tend to be asynchronous.

—This option pre-creates a thread pool (when the system starts) with a chosen number of long running threads (e.g. equal to the number of cores), each of which iteratively picking the next job in a shared task queue to execute.

Your program creates tasks (instead of threads), and pushes them into the thread pool's task queue. Your tasks will be picked up by some thread asynchronously. The AddTask() service provided by the thread pool typically allows you to provide a callback function, which will be called after the task is completed.

To do merge-sort using a thread pool TP, your code may look like:

done1 = false; done2 = false;
TP.AddTask(QuickSort, first_half,
        if (done2) TP.AddTask(Merge, ......)
TP.AddTask(QuickSort, second_half,
        if (done1) TP.AddTask(Merge, ......)

To implement a good thread pool is challenging. There are many issues to consider. Say your system starts with the same number of threads as the number of CPU cores. What if some —threads are blocked on resources? Without care, some cores may be idle. What about deadlocks when all threads are blocked on each other? Probably you want to support multiple-priority tasks? What happens when a user-submitted task throws an exception? (Some features of a thread pool are discussed in

The good news is that —a non-trivial server-side system typically has implemented a thread pool already. Even if you need to implement one yourself, it will be an "—implement once, use many times" thing.

—The advantages of this option are that it tends to have a "just-right" number of running threads, to keep all cores busy without overloading them; and it hides the thread creation and management overhead.

The disadvantage is that the async programming model can be quite non-intuitive to many programmers.

—Option —B3: lock-free data structures

—Yet another option of using many threads is to use lock-free data structures. Microsoft SQL Server's in-memory OLTP (since the 2014 release) was largely built using this option.

There are three principles in implementing lock-free data structures.

Principle 1: —Use atomic CAS.

Say you want to insert n after left in a linked list.


—Normally you would use a lock:

—n->next = left->next
—left->next = n

The lock-free solution may be:

while (true) {
    —localNext = left->next
    —n->next = localNext
    if (—CAS(left->next, localNext, n)) break;

Here CAS is atomic compare-and-swap which is typically translated to a hardware instruction. In the above example, left->next is swapped to n only if it compares successfully with localNext. This guards against concurrent updates. For instance, if another thread has already inserted a node in between, the CAS will fail.

Principle 2: —Borrow bits to store states between atomic steps.

As an example, consider the case to delete a node.

To delete node n, you want to swing pointer A to point to B. To guard against concurrent updates, the swap should happen only if A is still pointing to n AND B is still pointing to the next node. This is a double-CAS. Unfortunately machines nowadays do not provide a double-CAS hardware instruction.

You could implement delete using CAS by breaking the operation in two atomic steps: mark pointer B for deletion (change an unused bit from 0 to 1); then swing pointer A.

—Principle 3: Use cooperative threads.

—In the above delete case, if the delete thread successfully marked B but then hung, all other threads arriving at B will be stalled. To guarantee overall system progress, when another thread encounters a pointer that is marked for deletion, it should help completing the delete operation and move on.

When I was working on Microsoft SQL Server's in-memory OLTP (code named Hekaton), I did some performance evaluations. From what I could remember, when the number of threads is small, lock-free data structures tend to have comparable performance as their lock-based counterparts. The advantage of this option is that when the number of concurrent threads gets very large, lock-free data structures are much faster than the lock-based counterpart. I observed something like the following graph.

—The main disadvantage of lock-free data structures using CAS is that they tend to be extremely complicated to implement.

I believe if hardware vendors provide a double-CAS hardware instruction someday, there will be hope for a much wider adoption of lock-free structures in real systems. It will tremendously reduce the complexity of implementation compared with coding using CAS.

Note that a more general-purpose —STM (software transactional memory) is available today and it allows us to treat some arbitrary code as an atomic unit. You could for instance use STM to simulate double-CAS, but it will not be a single hardware instruction. So to me STM is somewhat neither here nor there. If you want coding simplicity, STM is not as good as lock-based solutions. If you want performance, it is not as good as atomic CAS or double-CAS. I believe double-CAS can be a good tradeoff for people who desire lock-free data structures.

7. Scale out

"—Scale out" is to use multiple machines, and is also called —horizontal scaling or —distributed computing.

Challenges in distributed systems

Designing a distributed system is challenging. Below are some unique challenges (in addition to those shared with scale-up systems – such as thread pool and async programming).

  • Networking. As different machines communicate with each other through message passing, a crucial component of a distribute system is the network manager.
  • Algorithm design. In a distributed system, algorithms could be drastically different from those in a single-machine version (including multi-threaded algorithms). For instance, a distributed algorithm may utilize networking primitives such as broadcast() and may need to synchronize with other machines.
  • Load balancing. Each machine only has a small fraction of the data. The time to service a request may be determined by the slowest machine in the cluster. So balancing load and computing can be crucial. One idea some systems use is to execute each task redundantly using two machines and whichever finishes first wins.
  • Consensus. Making a joint decision can be surprisingly hard in a distributed system. For instance, while one machine may decide to commit a user's transaction, another machine may encounter some error and decides to rollback the same transaction.
  • Fault tolerance. When your cluster is sufficiently large, failures are a norm, and a distributed system must be prepared to deal with them. One conventional wisdom is to avoid "single point of failure": one machine going down should not take down the whole cluster nor lead to loss of data. Errors may take place at many different levels, not just losing a hard drive or a machine. For instance, "split brain" refers to the phenomenon that all machines are up but a group of machines could not reach another group (probably due to a temporarily network link failure or mere slowness), and each group may think the other group has failed.
  • Heterogeneity. A distributed system should support diversity in software and hardware. Even if a cluster started with the same software and hardware on all machines, you may need to perform rolling upgrades later that replaces or adds some machines with either newer software, or more up-to-date hardware, or both.
  • Elasticity. Support adding more machines and replacing existing machines without bringing the cluster down.
  • Cluster monitoring and management. A large cluster is like a monster. Sometimes it behaves, and sometimes it does not. Say your user submitted a job that normally takes one hour to run, and your cluster decides to keep running the job for more than 20 hours. You want to be able to get in there and find out what's going on.

While essentially all scale-out systems inherited the shared-nothing architecture of Gamma, the scale-out world appears to be divided into two.

  • tightly-coupled cluster is one that supports communications between any pair of machines. Examples include —scientific computing, HPC, MPI, and SciDB.
  • loosely-coupled cluster is one that forbids such point to point communications. Examples include MapReduce, Hadoop and TensorFlow.

A random peek at tight-cluster techniques: broadcast()

Needless to say, there are many tight-cluster techniques. Here I'll pick a tiny example to give some idea.

Let your cluster be composed of n machines 0 through n-1. Suppose machine 0 has a message to be sent to all other machines.

naive broadcast algorithm may be as follows. Note that all machines execute the same algorithm.

if (I am machine 0) {
    for (i=1; i<n; ++i) {
} else {

The problem with this naive approach is that the sender is a performance bottleneck. The time complexity is O(n).

The binomial broadcast algorithm is much more efficient.

A binomial tree with order k, denoted as Bk, comprises of a root node and sub-trees B0, ..., Bk-1.

For instance, the following graph shows a binomial tree with order 3. 

The number of nodes in Bk is 2k. The children of node i are i+20, i+21, ..., but not to exceed n-1. The width and height of the tree are both O(k) i.e. logarithmic to the number of nodes.

The binomial broadcast algorithm is as follows.

if (I am not machine 0) receive(parent);
for (each child in decreasing order) {

The time complexity is O(log n).

A random peek at loose-cluster techniques: using MapReduce to do matrix multiplication

Matrix multiplication is widely used in many areas. For instance, it may be used to calculate page ranks for a search engine.

If the matrices are so big that chunks in each matrix are distributed to many machines, multiplication can be tricky. Imagine in the above example every cell (as a one-cell chunk) is stored in a separate machine. Each chunk needs to be multiplied against multiple others. E.g. a needs to be multiplied again both i and j. Tight-cluster algorithms would replicate the chunks of one matrix, conceptually to all machines, but in practice to \(\mathrm{\sqrt{n}}\)machines by leveraging on "block-cyclic distribution".

Interestingly, the seemingly non-embarrassingly-parallel problem can be solved using MapReduce.

The map() part maps each pair of (column i of matrix one, row i of matrix two) to some machine, and multiply them. For instance:

Similarly, additional map() results are:

The reduce() part is to add all the map() results together.

8. My Opinions

Scale out: tight cluster or loose cluster?

The observation is that today loose clusters dominate the market. Mike —Stonebraker said MapReduce is only good for embarrassingly-parallel problems. He further cited that for Jeremy Kepner (MIT Lincoln Labs), only 5% of the problems are embarrassingly parallel. I'd say if we look at the whole market instead of one lab (although a big and important lab), more than 80% of the big-data problems are embarrassingly parallel.

—My opinions are:

  • —Tight and loose clusters will co-exist, although loose clusters will continue to dominate.
  • Very large clusters (say more than 1000 machines) tend to be loose, and small to medium clusters (say less than 100 machines) tend to be tight.
  • —The two will embrace each other. For instance, a tight-cluster system SciDB recently added some MapReduce-like feature (called streaming by Paradigm4), and a loose-cluster system Spark uses message passing libraries (first Akka and now Netty).

Analytics: R or Python?

—Both are widely used in analytics.

R is —more mature, has better graphing support, and has been integrated into SQL Server and Oracle.

—Python is more general purpose. In the past Python lacked commercial support, and the fact that Python 3 when released broke backward compatibility hurt the feelings of many. But the situation is better and Python is catching up quickly.

—My opinion is:

  • R and Python will co-exist, although Python will surpass and dominate.

Platform development: C/C++ or Java/Scala?

My opinion is that C++ is obviously the way to go. Ask an engineer in the platform team of Google, Facebook, or Microsoft, and you probably will hear the same opinion.

I recognize that Hadoop was implemented in Java but I believe it was a mistake that will be corrected eventually. Otherwise why did Cloudera implement Impala in C++?

Building a company: open source or closed source?

—Until 20 years ago, all serious software was closed source. But in the past 20 years we witnessed a huge open-source software movement. —Now everyone is using everyone else’s code. For instance: —Amazon Athena uses Facebook Presto; —Facebook uses Oracle MySQL and provides RocksDB which itself derived from Google LevelDB; and several companies commercialized Hadoop. The fact that the big guys give away their good code for free provides a lot of opportunities, but at the same time could make life more difficult for companies with a shallow pocket.

—My opinion is that in today's market one must be familiar with the open source model and find ways to join it: embrace, leverage, and contribute. It is no longer time to be one against the world.

9. Concluding Remarks

I'll conclude the blogpost with the biggest lessons I learned from my five years experience working in a big-data startup.

There are at least three APIs for a distributed system: client API (how others can use your big-data system as a black box), inter-component API (how different components in your system service each other), and extension API (how people including internal customers can extend your system, e.g. via loading shared libraries at run time). All APIs must be carefully designed, published, and maintained. Be an expert on the Pimpl idiom.


  • Strive for minimal and clean APIs.

Any distributed system is by nature very complicated. Do yourselves a favor by not adding any additional, unnecessary complexities. In design meetings when discussing two alternative options, pay attention to the complexity of them.


  • When in doubt, choose the simpler option.

Especially at the early phase of a startup, it may be necessary to write prototype code. That's fine and encouraged. The question is: what to do with the prototype code upon productization? It may seem wise and fast to shovel prototype code in and fix problems later on. Don't do that, period.


  • Throw away prototype code.

Wherever you are in the big data stack, there are open-source projects that you can leverage. The difficulty lies in the fact that it can be hard to find them. In some cases, what you should bring in are some components from a project, not the whole thing. Such deep integration is quite possible nowadays (thanks to and can be a competitive advantage. If in your company no one is paying attention to the open-source world, something could be seriously wrong. On the other hand, try to componentize your own product (again with clean APIs) not only for easier internal development and maintenance, but also for making it easier for  others to leverage on you.


  • Leverage, and be leveraged.

Thank you for reading. I hope at least some parts of my blogpost are valuable to you.

  • No labels
Write a comment…