Big Data is the topic of the coming years. Even today large Internet companies store exabytes of data and their revenue model is based on selling products as well as services around this data. Consequently, they need to process data using advanced statistical methods, such as machine learning. Hence, they need to think about how to do this efficiently. Currently, especially in-memory is hyped to address this issue. However, this is only one aspect. A fundamentally more important aspect is where the data is processed in a distributed multi-node data environment.
A brief history on software architectures
In the beginning of software development, many applications have been single monolithic applications. They have been deployed on a single computer. This lead to several problems, such as that developers could hardly reuse code of monolithic applications and the approach did not scale very well since it was limited to a single computer. The first problem has been addressed by introducing different layers into the architecture. The resulting architectures are usually based on three layers (see next figure): data layer, service layer and presentation layer. The data layer handles any functionality for managing data, such as querying or storing it. The service layer implements business logic, e.g. it implements business process. The presentation layer allows the user to interact with the implemented business processes, e.g. entering of new customer data. The layers communicate with each other using well-defined interfaces implemented today in REST, OData, SOAP, Websockets or HTTP/2.0.
With the emergence of the Internet, these layers had to be put physically on different machines to provide larger scalability. However, they have never been designed with this in mind. The network layer has only limited transport bandwidth and capacity. Indeed, for very large data it can be faster to store it on a large drive and transport it by truck to its destination than doing it by the network.
Additionally, during development scalability of data computation is of less interest, because in the Internet world it is often not known how many people will have access to an application and this may change over time. Hence, you need to be able to scale dynamically up an down. I observe that more and more of the development efforts in this area have moved to operations, who need to implement monitors, load-balancer and other technology to scale applications. This is also the reason why DevOps is a popular and emerging paradigm for developing and operating Internet-scale web applications, such as Netflix.
Towards New Software Architectures: Bring Computation to Data
The multiple layer approach does make sense and you could it even split it into more layers (“services”), but you have to evaluate carefully complexity and reusability of your service design. More important, you will have to think about new interfaces, because if components are located on different machines or different memory instances, your application will spend a lot of time for moving data between them. For instance, the application logic on the application server may request all customer transactions from the database and then correlate them to write the results back into the database. This requires a lot of data to be transferred from the database to the application server and potentially costs a lot of performance. Finally, it does not scale at all.
This problem first emerged when companies introduced the first Online Analytical Processing (OLAP) engines as part of business intelligence solutions for understanding their business. Database queries proved as too simple and would require to transfer first a lot of data to the application server. Hence, the Structured Query Language (SQL) for databases was extended to cope with these new requirements (e.g. the CUBE operator). Moreover, you can define your own custom functions (e.g. SQL Stored procedures), but they have to be implemented very vendor specific. For instance, distributed databases based on Apache Hadoop support custom functions. However, you can integrate sometimes other programming languages, such as Java. While stored procedures are already an improvement in terms of security (protection against SQL injection attacks), they have the problem that it is very difficult to write sophisticated programs to handle modern Big Data applications. For instance, many applications require machine learning, statistical correlation or other statistical methods. It is difficult to write them as stored procedures and to maintain support for different vendors. Furthermore, it leads again to monolithic applications. Finally, they are not dynamic – the application cannot decide to do any new computation on the fly without reimplementing it in the database layer (e.g. implement a new machine learning algorithm). Hence, I suggest another way to address this issue.
A Standard for Bringing Computation to Data?
As mentioned, we want to support modern Big Data applications by providing suitable language support for machine learning and statistical methods on top of any database system (e.g. MySQL, Hadoop, Hbase or IBM DB2). The next figure illustrates the new approach. The communication between the presentation and service layer works as usual. However, the services do not call functions on the data layer, but send any data-intensive computation they want to perform as an R script to the data layer, which executes it and only sends back the result.
I observed that the programming language R for statistical computing has been recently integrated in various data environments, such as transactional databases, Apache Hadoop clusters or in-memory databases, such as SAP HANA. Hence, I think R could be a suitable language for describing computation that operates on data. Additionally, R has already a lot of built-in packages for machine learning or statistical data processing. Finally, depending on the openness of the underlying data environment, you can integrate R tightly into it, so you may not have to do extensive in-memory transfers.
The advantage of the approach are:
- business logic stays in the service level and does not move to the data layer
- You can easily add new services without modifying the data layer – so you avoid a tight coupling, which makes it easier to change the data layer or to introduce new functionality
- You can mine R scripts generated by services to determine which computation the user is likely to do next to start executing it before the user requests it.
- Caching and distribution of data processing can be based on a more sophisticated analysis of the R scripts using the R Profiler Rprof
- R is already known by many business analysts or social scientists/psychologists
However, you will need to have some functionality for governing the execution of the R scripts in the data layer. This includes decisions on when to schedule computation or creating new computing/data nodes (e.g. real-time vs batch). This will require a company-wide enterprise architecture approach where you need to define which data should be real-time and which data should be batch-processed. Furthermore, you need to take into account security and separation of concerns.
In this context, Apache Hadoop might be an interesting solution from the technology perspective.
What is next
The aforementioned approach is only the beginning. By using this solution, you can think about true inter-cloud deployments of your application. Finally, you can enable inter-organizational data-processing business processes.