In this blog post I present the lambda architecture for Big Data. This architecture is about integrating historical Big Data with “live” streaming Big Data. Afterwards, I explain the concept of a large data lake within your enterprise or among enterprises in a B2B scenario. This data lake, based on the lambda architecture, can replace a service-oriented architecture (SOA), because it is easier to implement and manage for large data volumes in a variety of formats. Hence, a plethora of use cases arises. Finally, I discuss how this architecture can be implemented using various open source technologies from the Hadoop ecosystem.
The Lambda Architecture
Big Data has become an increasingly popular topic over the last few years. Big Data is about processing large volumes of data in a variety of formats, taking into account both live streaming and historical data. One large computing cluster is used to store and process all the data of one or more companies.
Internet companies, such as Google, Yahoo or Facebook, are driven by new business models for which existing technology was not suitable. This led to the development of new technologies known under the common umbrella of NoSQL. Furthermore, there has been a need to integrate them in a flexible, standardized architecture to enable Big Data. The lambda architecture is such an architecture; the term was coined recently by Nathan Marz and James Warren.
It has the following key features:
- Standardized fault-tolerant distributed file system that spans the whole cluster – this file system is the base of the data lake that I will explain later.
- A batch processing layer for processing large amounts of historical data stored in the computing cluster
- A serving layer for providing fast access to results of batch processed data
- A real-time processing layer (or “speed layer”) for “live” processing of data streams, such as sensor data or stock market data
- A long-term storage layer optimized for extremely cheap storage of data that is rarely used (e.g. data kept for legal reasons). You usually do not find this layer in other articles describing the lambda architecture, but I think it is an important feature to highlight. It holds very old data (several years old or more) that you do not need in your day-to-day business; you can store it on very cheap hardware with a lot of disk space but much less computing power and memory capacity.
These features are not new and have partly been addressed by architectures known from other domains, such as Business Intelligence, Complex Event Processing, Data Warehouse or Master Data Management. However, the lambda architecture addresses them in the context of huge data volumes and a diversity of data formats (polyglot persistence), and it integrates them all in one architecture.
The term “lambda” stems from the following function used for doing analytics in the context of Big Data:
query = λ(all data) = λ (live streaming data) * λ (historical data)
Basically, it says that every analytics function λ that combines live streaming data and historical data can be computed on systems implementing the lambda architecture. I will discuss the implications of this for the implementation of the architecture later.
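To make the formula a bit more tangible, here is a minimal sketch in Python. It assumes a simple page-view count as the analytics function, so the combination operator (the “*” above) becomes a plain addition. The function names and the sample events are made up for illustration and do not come from any particular framework.

```python
from collections import Counter


def batch_view(historical_events):
    """Batch layer: precompute page-view counts over all historical data."""
    return Counter(event["page"] for event in historical_events)


def realtime_view(live_events):
    """Speed layer: count only the events that arrived after the last batch run."""
    return Counter(event["page"] for event in live_events)


def query(page, batch, realtime):
    """Answer a query by merging the batch view and the real-time view."""
    # Counter returns 0 for unknown keys, so pages seen in only one view work too.
    return batch[page] + realtime[page]


# Hypothetical sample data
historical = [{"page": "/home"}, {"page": "/home"}, {"page": "/about"}]
live = [{"page": "/home"}, {"page": "/pricing"}]

batch = batch_view(historical)
speed = realtime_view(live)
print(query("/home", batch, speed))
```

For other analytics functions the merge step looks different (for an average, for example, you would merge sums and counts separately), which is why the query level needs some care.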
The lambda architecture is illustrated in the following figure.
The lambda architecture provides data scientists with the means and tools to analyze any data occurring in the company. Tools can easily be plugged into the architecture without requiring major implementation efforts later.
One of the most interesting aspects of the lambda architecture is that you have a cluster of nearly unlimited storage and memory capacity. You can even have an in-memory database with a memory capacity on the terabyte to petabyte scale distributed over the whole cluster. Popular open source frameworks, such as Hadoop, allow you to use commodity hardware, so that deploying such an architecture can be relatively cheap. They also have fault tolerance built in, so that developers do not need to deal with it themselves.
With such a large cluster you can create a big data lake in your company (see next figure). Basically all your data ends up in this cluster, all applications (including the ones in the cloud) can share it via simple file system access mechanisms, and you can use the computing power of the whole cluster for analysis. You also save a lot of money, because you avoid many redundant ETL processes, each of which would have to be made fault-tolerant and interact with different systems. Modern Big Data architectures take care of this for you.
Finally, exchanging data becomes much easier than in a service-oriented architecture (SOA), where you need to design interfaces and implement services – here every application simply accesses the distributed file system in the cluster.
Implementing a Lambda Architecture
There are several things to consider when you implement the lambda architecture. Firstly, you can choose from a variety of components to implement it. For instance, on the open source side Apache Hadoop and Apache Spark are very popular and are used by many companies, including many popular Internet companies, such as Facebook or Google. You can also use other open source components, such as Apache Cassandra as a distributed database and Twitter Storm for stream processing. Additionally, you can use commercial tools, such as SAP HANA Cloud Platform. Finally, you can put your lambda architecture completely on-premise, completely in the cloud (see my example with Amazon Elastic MapReduce, which partly implements a lambda architecture) or use some kind of hybrid model. In the following I will describe an implementation using Apache Hadoop and additional tools that can integrate with Apache Hadoop.
You can use the following components for implementing the lambda architecture.
- Standardized fault-tolerant distributed file system: Hadoop Distributed File System (HDFS). You can also use other distributed file systems. The choice of the file system is transparent to the applications, i.e. they won’t need to use different APIs for different file systems. Most of the time you will be fine with HDFS, but cloud providers such as Amazon, for example, may implement their own file system that fits their infrastructure.
- Batch processing layer: Here you can use Hadoop YARN, which is responsible for distributing Big Data analytics jobs, such as MapReduce jobs. YARN even allows you to “containerize” your jobs, i.e. define CPU, memory and network limitations across the big data cluster for a specific job. This allows you to do proper capacity management – one of the most important aspects of a lambda architecture. If you need in-memory batch processing, then you should check out Apache Spark. If you want more generic job control, e.g. because you have other distributed applications around your cluster that are not based on the MapReduce paradigm, you can use Apache Mesos.
- Serving layer: The serving layer provides fast access and advanced query mechanisms for the results of batch jobs. Here you can use typical Big Data databases and data warehouses, such as Apache HBase or Apache Shark (for in-memory access). You will probably have multiple different technologies here according to the polyglot persistence NoSQL paradigm. They offer typical interfaces, such as JDBC or ODBC, to integrate with any application.
- Real-time processing layer: Although Hadoop can process streaming data, most of the time you will choose a software component supporting complex event processing of live streaming data across your cluster, such as Apache Spark Streaming or Twitter Storm.
- The long term persistence layer is mostly a hardware choice: Here you need a lot of cheap hard disk space, e.g. by not using SSD flash drives, and little computing / memory power. It is usually a separate cluster connected to the other cluster and it leverages the fault tolerance features of HDFS, such as automated replication of data to several nodes and re-replication in case of node failures.
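The replication and re-replication behaviour mentioned for the file system can be illustrated with a toy simulation. This is only a sketch of the idea and not how HDFS is implemented: the real NameNode uses rack-aware placement and heartbeats, which are ignored here, and all node and function names are hypothetical. The replication factor of 3 matches the HDFS default.

```python
import random

REPLICATION_FACTOR = 3  # HDFS default; configurable per cluster


def place_block(block_id, nodes, placement, rng):
    """Store a block on REPLICATION_FACTOR distinct, randomly chosen nodes."""
    placement[block_id] = set(rng.sample(sorted(nodes), REPLICATION_FACTOR))


def handle_node_failure(failed, nodes, placement, rng):
    """Remove a failed node and copy its blocks to other live nodes."""
    nodes.discard(failed)
    for holders in placement.values():
        if failed in holders:
            holders.discard(failed)
            # Only nodes that do not already hold a replica are candidates.
            candidates = sorted(nodes - holders)
            holders.add(rng.choice(candidates))


rng = random.Random(0)  # fixed seed so the toy run is repeatable
nodes = {"n1", "n2", "n3", "n4", "n5"}
placement = {}
for block_id in range(4):
    place_block(block_id, nodes, placement, rng)

handle_node_failure("n1", nodes, placement, rng)
for block_id, holders in sorted(placement.items()):
    print(block_id, sorted(holders))
```

After the simulated failure every block is again stored on three live nodes, which is the property the long-term storage cluster relies on.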
Furthermore, you can have a lot of other software components that automatically build on the aforementioned core technologies, such as Apache Hive or Apache Shark, a Data Warehouse for Hadoop, or Apache Oozie, which is a workflow tool for complex ETL processes distributed over your data lake.
As mentioned before, there is a wide variety of alternatives that you can use to implement the lambda architecture. The standardized fault-tolerant distributed file system is most of the time the base for everything and you can also gradually evolve your architecture and implement it using different components.
I briefly mentioned before that capacity management is an important part of the lambda architecture. You need to define how big data jobs are programmed and tested as well as how they get into the cluster. I expect that in the future not only programmers, but also business people, such as data scientists, will need to load big data jobs into your cluster. This means you will need to (1) properly define your delivery pipeline, (2) implement and enforce proper capacity management and (3) have bullet-proof dependency management for different software versions in your cluster.
Luckily, by using Apache YARN or Apache Mesos together with cluster monitoring software, such as Ganglia, you can do proper capacity management.
Recently, more tools that use advanced virtualization features of the Linux kernel (cgroups), such as Docker, have emerged, making capacity management even easier and more flexible. These technologies also have built-in dependency management to avoid a library/versioning hell. Google developed an open source scheduling system for them, called Kubernetes.
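The core idea behind this kind of capacity management is that every job declares the resources it needs and the cluster only admits it if they are available. The following sketch shows this in a strongly simplified form. It is not the YARN, Mesos or Kubernetes API; the class names and numbers are assumptions for illustration, and real schedulers typically queue a job instead of rejecting it.

```python
from dataclasses import dataclass


@dataclass
class Resources:
    vcores: int
    memory_mb: int


class Cluster:
    """Toy admission control: admit a job only if its request fits."""

    def __init__(self, total):
        self.free = Resources(total.vcores, total.memory_mb)
        self.running = {}

    def submit(self, job_id, request):
        """Reserve resources for a job; return False if they are not available."""
        if request.vcores > self.free.vcores or request.memory_mb > self.free.memory_mb:
            return False
        self.free.vcores -= request.vcores
        self.free.memory_mb -= request.memory_mb
        self.running[job_id] = request
        return True

    def finish(self, job_id):
        """Release the resources of a finished job."""
        request = self.running.pop(job_id)
        self.free.vcores += request.vcores
        self.free.memory_mb += request.memory_mb


cluster = Cluster(Resources(vcores=16, memory_mb=65536))
accepted_etl = cluster.submit("etl", Resources(vcores=8, memory_mb=32768))
accepted_ml = cluster.submit("ml", Resources(vcores=12, memory_mb=16384))  # too many vcores right now
cluster.finish("etl")
accepted_retry = cluster.submit("ml", Resources(vcores=12, memory_mb=16384))
print(accepted_etl, accepted_ml, accepted_retry)
```

Because a job can never take more than it declared, one badly written job from a business user cannot starve all other jobs in the cluster.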
Combining Stream-Processing and Batch Data
One core goal of the lambda architecture is to integrate live streaming and batch processing. In fact, most of the recent articles on lambda architecture are just about providing both as software components. However, you will also need to integrate this on the query level, because complex event processing queries are a little bit different from batch processing queries.
Spark Streaming demonstrates how you can join historical data with stream processed data at the same time.
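The pattern behind such a query is that the stream is cut into small batches and each of them is joined with a historical data set by key. The following plain Python sketch mimics this pattern without using Spark itself. The customer data, the field names and the “three times the average” rule are made up for illustration.

```python
def join_micro_batch(micro_batch, historical_profiles):
    """Enrich each live event with the historical record for the same key."""
    enriched = []
    for event in micro_batch:
        profile = historical_profiles.get(event["customer_id"])
        if profile is None:
            continue  # inner join: drop events without a historical record
        enriched.append({**event, **profile})
    return enriched


# Result of a (hypothetical) batch job: average order value per customer
historical_profiles = {
    "c1": {"avg_order": 40.0},
    "c2": {"avg_order": 500.0},
}

# One micro-batch of live orders taken from the stream
micro_batch = [
    {"customer_id": "c1", "amount": 200.0},
    {"customer_id": "c2", "amount": 450.0},
    {"customer_id": "c9", "amount": 10.0},
]

joined = join_micro_batch(micro_batch, historical_profiles)
# Flag live orders that are far above the customer's historical average.
unusual = [row for row in joined if row["amount"] > 3 * row["avg_order"]]
print(unusual)
```

In a real cluster the historical data set would be distributed as well, so that the join itself runs in parallel on all nodes.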
Hardware considerations for a lambda architecture have been discussed only briefly, if at all, in most publications. Hardware planning is important for your cluster – we have seen this already with the long-term storage. Furthermore, if you have a few very old machines in your big data cluster, then this will affect all jobs running on your cluster. You will need to have proper monitoring tools and rules deployed to automatically identify these kinds of bottlenecks.
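Such a monitoring rule can be quite simple. The sketch below flags nodes whose typical task duration is clearly above that of the rest of the cluster. The input format, the node names and the threshold of 1.5 are assumptions for illustration; in practice you would feed the rule with metrics from your monitoring software.

```python
from statistics import median


def find_slow_nodes(task_seconds_by_node, threshold=1.5):
    """Return nodes whose median task time exceeds threshold x the cluster median."""
    node_medians = {
        node: median(durations)
        for node, durations in task_seconds_by_node.items()
        if durations  # skip nodes without any measurements
    }
    cluster_median = median(node_medians.values())
    return sorted(
        node for node, value in node_medians.items()
        if value > threshold * cluster_median
    )


# Hypothetical task durations in seconds for comparable tasks
measurements = {
    "node-a": [10, 12, 11],
    "node-b": [11, 13, 12],
    "node-c": [30, 28, 35],  # an old machine
    "node-d": [12, 10, 11],
}
slow_nodes = find_slow_nodes(measurements)
print(slow_nodes)
```

Medians are used instead of averages so that a single unusually long task does not immediately mark a node as slow.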
Once you have implemented the lambda architecture you will need to teach everybody to use it. You will need to plan the migration of data storage for analytics from the individual systems to your data lake, i.e. your big data cluster. Keep in mind that the lambda architecture is about analytics. Although it is possible to include transactional systems in it (e.g. a MySQL Cluster), you will probably still use standard transactional databases for your individual ERP systems, CRM systems etc., from which you extract the data and put it into the cluster for analytics.
However, there are also other tools for doing distributed transactions, such as CloudTPS or, even more advanced, the Bitcoin transaction system. They may replace individual transactional databases in the future.
More and more companies are embarking on the journey of a standardized Big Data architecture each year. Most of them use open source technologies to gradually migrate towards one big data lake as it has been described here.