A few years after the Lambda Architecture was first described, several new architectures for Big Data have emerged. I will present them and illustrate their use case scenarios. These are IT architectures, but towards the end of this blog post I will describe the corresponding Enterprise Architecture artefacts, which are sometimes referred to as the Zeta Architecture.
Lambda Architecture
I have blogged before about the Lambda Architecture. Basically, this architecture consists of three layers:
Batch Layer: This layer executes long-running batch processes to perform analyses on large amounts of historical data. The scope is data from several hours to weeks or even years. Here, usually Hadoop MapReduce, Hive, Pig, Spark or Flink are used together with orchestration tools, such as Oozie or Falcon.
Speed Layer/Stream Processing Layer: This layer executes (small/”mini”) batch processes on data within a time window (e.g. 1 minute) to perform analyses on large amounts of current data. The scope is data from several seconds up to several hours. Here one may use, for example, Flink, Spark or Storm (a minimal sketch of the batch and speed layers follows after this list).
Serving Layer: This layer combines the results from the batch and stream processing layers to enable fast interactive analyses by users. It usually leverages relational databases, but also NoSQL databases, such as graph databases (e.g. TitanDB or Neo4j), document databases (e.g. MongoDB, CouchDB), column databases (e.g. HBase), key-value stores (e.g. Redis) or search technologies (e.g. Solr). For certain use cases, NoSQL databases provide more suitable and better-performing data structures, such as graphs/trees, hash maps or inverted indexes (a small sketch of such a serving-layer lookup follows below).
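To make the first two layers more concrete, here is a minimal sketch using PySpark and Spark Structured Streaming. All paths, the Kafka broker and the topic name are illustrative assumptions, not part of any concrete deployment; the Kafka source additionally requires the spark-sql-kafka package.

```python
# A minimal sketch of the batch and speed layers of the Lambda Architecture.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lambda-sketch").getOrCreate()

# Batch layer: a long-running analysis over historical data on the
# distributed file system (hypothetical path and schema).
historical = spark.read.json("hdfs:///data/events/")
batch_view = historical.groupBy("page").count()  # e.g. total views per page
batch_view.write.mode("overwrite").parquet("hdfs:///views/batch/")

# Speed layer: the same aggregation over a 1-minute window of live events
# (hypothetical broker and topic).
live = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "page-views")
        .load())
windowed = (live.selectExpr("CAST(value AS STRING) AS page", "timestamp")
            .groupBy(F.window("timestamp", "1 minute"), "page")
            .count())
query = (windowed.writeStream
         .outputMode("update")
         .format("console")
         .start())
query.awaitTermination()
```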
In addition, I proposed a long-term storage layer to provide even cheaper storage for data that is rarely accessed, but may eventually still be needed. All layers are supported by a distributed file system, such as HDFS, to store and retrieve data. A core concept is that computation is brought to the data (cf. here). On the analysis side, usually standard machine learning algorithms, but also online machine learning algorithms, are used.
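Coming back to the serving layer: as a small illustration of how it can combine the batch and speed results, here is a minimal sketch using Redis hash maps via redis-py. The key names and the split into a precomputed batch view and a real-time delta are assumptions for illustration.

```python
# A minimal sketch of a serving-layer lookup that merges a precomputed
# batch view with the speed layer's recent counts.
import redis

r = redis.Redis(host="localhost", port=6379)

def record_batch_view(page: str, count: int) -> None:
    """Written periodically by the batch layer."""
    r.hset("views:batch", page, count)

def record_realtime_increment(page: str) -> None:
    """Incremented by the speed layer for events since the last batch run."""
    r.hincrby("views:realtime", page, 1)

def total_views(page: str) -> int:
    """Interactive query: combine both views for an up-to-date answer."""
    batch = int(r.hget("views:batch", page) or 0)
    realtime = int(r.hget("views:realtime", page) or 0)
    return batch + realtime
```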
As you can see, the Lambda Architecture can be realized using many different software components and combinations thereof.
While the Lambda Architecture is a viable approach to tackling Big Data challenges, other architectures have emerged that focus only on certain aspects, such as data stream processing, or that integrate it with cloud concepts.
Kappa Architecture
The Kappa Architecture focuses solely on data stream processing, i.e. “real-time” processing of “live” discrete events. Examples are events emitted by devices from the Internet of Things (IoT), social networks, log files or transaction processing systems. The original motivation was that the Lambda Architecture is too complex if you only need to do event processing.
The following assumptions exist for this architecture:
You have a distributed ordered event log persisted to a distributed file system, where stream processing platforms can pick up the events
Stream processing platforms can (re-)request events from the event log at any position. This is needed in case of failures or upgrades to the stream processing platform.
The event log is potentially large (several terabytes of data per hour)
Mostly online machine learning algorithms are applied, because new data is delivered constantly and is more relevant than the old, already-processed data
Technically, the Kappa Architecture can be realized using Apache Kafka to manage the data streams, i.e. to provide the distributed ordered event log. Apache Samza enables Kafka to store the event log on HDFS for fault tolerance and scalability. Examples of stream processing platforms are Apache Flink, Apache Spark Streaming and Apache Storm. The serving layer can in principle use the same technologies as I described for the serving layer of the Lambda Architecture.
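As a small illustration of the replay assumption above, the following sketch uses the kafka-python client to rewind a consumer to an arbitrary offset after a failure or upgrade of the stream processing platform. Broker address, topic and offset are hypothetical.

```python
# A minimal sketch of re-requesting events from any position in the event log.
from kafka import KafkaConsumer, TopicPartition

def process(value: bytes) -> None:
    """Placeholder for the actual stream processing logic."""
    print(value)

consumer = KafkaConsumer(bootstrap_servers="broker:9092",
                         enable_auto_commit=False)
partition = TopicPartition("iot-events", 0)   # hypothetical topic, partition 0
consumer.assign([partition])

# After a failure or an upgrade, rewind to a known-good position in the
# ordered event log and reprocess from there.
consumer.seek(partition, 12345)               # illustrative offset
for message in consumer:
    process(message.value)
```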
There are some pitfalls with the Kappa Architecture that you need to be aware of:
End-to-end ordering of events: While technologies such as Kafka can provide the events in an ordered fashion, they rely on the source system to actually deliver the events in order. For instance, I had a case where a system in normal operations sent its events in order, but when communication errors occurred, it stored the events it could not send and retransmitted them at some later point. Meanwhile, once communication was re-established, it was already sending new events, so old and new events interleaved. The source system had to be adapted to handle these situations correctly. Alternatively, you can only ensure a partial ordering using vector clocks or something similar implemented at the event log or stream processing level.
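A lightweight way to at least detect such situations is to attach per-source sequence numbers to events and check them in the stream processing layer. The following is a minimal sketch of that idea; the field names are assumptions, and a full solution would rather use vector clocks or a bounded reordering buffer.

```python
# A minimal sketch of detecting out-of-order delivery per source system.
from collections import defaultdict

last_seen = defaultdict(lambda: -1)  # highest sequence number seen per source

def check_order(event: dict) -> bool:
    """Return True if the event arrived in order for its source."""
    source, seq = event["source"], event["seq"]
    if seq <= last_seen[source]:
        # Late or duplicate event, e.g. a delayed retransmission after a
        # communication error; route it to a reconciliation path instead.
        return False
    last_seen[source] = seq
    return True
```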
Delivery paradigms for how the events are delivered to (or fetched by) the stream processing platform:
At least once: The event is guaranteed to be delivered at least once, but it might be delivered twice or more due to processing errors or communication/operation errors within Kafka. For instance, the stream processing platform might crash before it can mark events as processed, although it has already processed them. This might have undesired side effects, e.g. the same event “user A liked website W” is counted several times.
At most once: The event will be delivered at most once (this is the default Kafka behavior). However, it might also get lost and not be delivered. This could have undesired side effects, e.g. the event “user A liked website W” is not taken into account.
Once and only once: The event is guaranteed to be delivered once and only once. This means it will not get lost or be delivered twice or more times. However, this is not simply a combination of the above scenarios. Technically, you need to make sure in a multi-threaded, distributed environment that an event is processed exactly once. This means the same event needs to (1) be processed by only one sequential process in the stream processing platform and (2) all other processes related to the event need to be made aware that one of them is already processing it. Both features can be implemented using distributed system techniques, such as semaphores or monitors. They can be realized using distributed cache systems, such as Ignite, Redis or, to a limited extent, ZooKeeper. Another simple possibility would be a relational database, but this would quickly stop scaling with large volumes.
Needless to say, the source system must also make sure that it delivers the same event once and only once to the ordered event log.
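One common way to approximate once-and-only-once processing on top of at-least-once delivery is to make the processing itself idempotent, e.g. by claiming each event ID exactly once in a distributed cache such as Redis. The following minimal sketch illustrates the idea; the key schema and TTL are assumptions, and a production solution would additionally have to make the claim and the processing atomic.

```python
# A minimal sketch of idempotent event processing using Redis SET ... NX
# as a distributed "semaphore" per event ID.
import redis

r = redis.Redis(host="localhost", port=6379)

def handle(event: dict) -> None:
    """Placeholder for the actual business logic, e.g. counting a 'like'."""
    print("processing", event)

def process_once(event_id: str, event: dict) -> None:
    # SET with nx=True succeeds only for the first worker that claims this
    # event ID, so concurrent workers cannot process the same event twice.
    if not r.set(f"processed:{event_id}", 1, nx=True, ex=86400):
        return  # another worker already handled (or is handling) this event
    # Note: if we crash here, the event is claimed but unprocessed; a real
    # solution must make claiming and processing atomic.
    handle(event)
```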
Online machine learning algorithms constantly change the underlying model to adapt it to new data. This model is used by other applications to make predictions (e.g. predicting when a machine has to go into maintenance). This also means that in case of failure we may temporarily have an outdated or simply wrong model (e.g. in the case of at-least-once or at-most-once delivery). Hence, the applications need to incorporate some business logic to handle this (e.g. do not register a machine twice for maintenance, or avoid permanently registering/unregistering a machine for maintenance).
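To illustrate what “constantly changing the underlying model” means, here is a deliberately tiny sketch of an online learning update: a linear model adjusted by stochastic gradient descent with every new observation, without retraining over historical data. The features and learning rate are purely illustrative.

```python
# A minimal sketch of an online machine learning update (linear model, SGD).

weights = [0.0, 0.0]   # the model, shared with predicting applications
LEARNING_RATE = 0.01

def update(features, target):
    """Adapt the model to one new observation as it arrives in the stream."""
    prediction = sum(w * x for w, x in zip(weights, features))
    error = prediction - target
    for i, x in enumerate(features):
        weights[i] -= LEARNING_RATE * error * x

def predict(features):
    """Used e.g. to predict when a machine needs to go into maintenance."""
    return sum(w * x for w, x in zip(weights, features))
```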
Although technologies such as Kafka can help you with this, implementing such a solution requires a lot of thought as well as experienced developers and architects. The batch-processing layer of the Lambda Architecture can mitigate the aforementioned pitfalls to some extent, but it can also be affected by them.
Last but not least, although the Kappa Architecture seems intriguing due to its simplicity compared to the Lambda Architecture, not everything can be conceptualized as events. For example, company balance sheets, end-of-month reports, quarterly publications etc. should not be forced into an event representation.
Microservice Architecture for Big Data
The Microservice Architecture did not originate in the Big Data movement, but is slowly being picked up by it. It is not a precisely defined style, but several common aspects exist. Basically, it is driven by the following technological advancements:
- The implementation of applications as services instead of big monoliths
- The emergence of software containers to deploy each of those services in isolation from each other. Isolation means that they are put into virtual environments sharing the same operating system (i.e. they are NOT in different virtual machines), and that they are connected to each other via virtualized networks and virtualized storage. These containers leverage the available resources much better than virtual machines.
- Additionally, the emergence of repositories for software containers, such as the Docker registry, to quickly version, deploy and upgrade dependent containers and to test upgraded containers.
- The deployment of container operating systems and orchestration platforms, such as CoreOS, Kubernetes or Apache Mesos, to efficiently manage software containers, manage their resources, schedule them onto physical hosts and dynamically scale applications according to needs.
- The development of object stores, such as OpenStack Swift, Amazon S3 or Google Cloud Storage. These object stores are needed to store data beyond the lifecycle of a software container in a highly dynamic cloud or scaling on-premise environment.
- The DevOps paradigm, especially the implementation of continuous integration and delivery processes with automated testing and static code analysis to improve software quality. This also includes being able to deliver individual services into production quickly, at any time and independently of each other.
An example of the Microservice Architecture is AWS Lambda (not to be confused with the Lambda Architecture) and the related services provided by Amazon AWS.
Nevertheless, the Microservice Architecture poses some challenges for Big Data architectures:
- What should a service be: For instance, Apache Spark or Apache Flink form a cluster to run your application. Should each application get a dedicated cluster of software containers, or should you provide one shared cluster of software containers? The first option, a dedicated cluster per application, can make sense given the different scaling and performance needs of each application.
- The usage of object stores: Object stores are needed as large-scale, dynamically scalable storage shared among containers. However, there are currently some issues, such as performance and consistency models (“eventual consistency”). Here, the paradigm of “Bring Computation to Data” (cf. here) is violated. Nevertheless, this can be mitigated either by using HDFS as a temporary file system on the containers and fetching the data beforehand from the object store (see the sketch after this list), or by using an in-memory caching solution, such as Apache Ignite or, to some extent, Apache Spark or Apache Flink.
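The first mitigation can be as simple as staging the input data from the object store into the container’s local scratch space (or HDFS) before the processing engine starts, so that computation then runs close to the data. The following sketch uses boto3 against an S3-compatible store; bucket, prefix and paths are illustrative assumptions.

```python
# A minimal sketch of prefetching input data from an object store into a
# container-local staging directory before processing.
import os
import boto3

s3 = boto3.client("s3")

def prefetch(bucket: str, prefix: str, local_dir: str) -> None:
    os.makedirs(local_dir, exist_ok=True)
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            target = os.path.join(local_dir, os.path.basename(obj["Key"]))
            s3.download_file(bucket, obj["Key"], target)

# Fetch once, then point the processing engine (e.g. Spark) at the local copy.
prefetch("analytics-input", "events/2016-01/", "/tmp/staging/")
```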
I see that in these environments the role of software-defined networking (SDN) will become crucial, not only in cloud data centers but also in on-premise data centers. SDN (which should NOT be confused with virtualized networks) enables centrally controlled, intelligent routing of network flows, as needed in the dynamically scaling platforms required by the Microservice Architecture. The old decentralized definition of the network, e.g. in the form of decentralized routing, simply does not scale here to enable optimal performance.
Enterprise Architecture for Big Data
I have presented here several architectures for Big Data that have emerged recently. Although they are based on technologies that are already several years old, I observe that many organizations are overwhelmed by these new technologies and struggle to adopt and fully leverage them. There are several reasons for this.
One tool to manage this could be proper Enterprise Architecture Management. While there are many benefits of Enterprise Architecture Management, I want to highlight the benefit of managed evolution. This paradigm enables the alignment of business and IT, although both constantly change, independently of (and dependently on) each other, with goals that are not necessarily aligned, as illustrated in the following picture.
As you can see from the picture, the two constantly diverge, and Enterprise Architecture Management is needed to unite them again.
However, reaching a managed evolution of the Enterprise Architecture usually requires many years as well as commitment to it from both business and IT. Enterprise Architecture for Big Data is a relatively new concept, which is still subject to change. Nevertheless, some common concepts can be identified. Some people also refer to Enterprise Architecture for Big Data as the Zeta Architecture. It encompasses not only Big Data processing but, in the context of the Microservice Architecture, also web servers providing the user interface for analytics (e.g. Apache Zeppelin) and further management workflows, such as backup or configuration, deployed in the form of containers.
This Enterprise Architecture for Big Data describes integrated patterns for Big Data and Microservices so that you can consistently document and implement your Lambda, Kappa or Microservice architecture, or a mixture of them. Examples of artefacts of such an enterprise architecture are (cf. also here):
Global Resource Management to manage the physical and virtualized resources as well as scaling them (e.g. Apache Mesos and Software Defined Networking)
Container Management to deploy and isolate containers (e.g. Apache Mesos)
Execution engines to manage different processing engines, such as Hadoop MapReduce, Apache Spark, Apache Flink or Oozie
Storage Management to provide Object Storage (e.g. OpenStack Swift), Cache Storage (e.g. Ignite HDFS Cache), Distributed Filesystem (e.g. HDFS) and Distributed Ordered Event Log (e.g. Kafka)
Solution architectures for one or more services that address one or more business problems. They should be kept separate from the enterprise architecture, which focuses more on the strategic global picture. A solution architecture can articulate a Lambda, Kappa or Microservice architecture, or a mixture of them.
Enterprise applications describe a set of services (including user interfaces)/containers to solve a business problem, including appropriate patterns for Polyglot Persistence that provide the right data structure, such as graph, columnar or hash map, to enable fast, interactive analytics for users based on SQL and NoSQL databases (see above)
Continuous Delivery, which describes how Enterprise applications are delivered to production while ensuring their quality (e.g. Jenkins, SonarQube, Gradle, Puppet etc.).