Spending Time on Quality in Your Big Data Open Source Projects

Open source libraries are nowadays a critical part of the economy. They are used in commercial and non-commercial applications directly or indirectly affecting virtually any human being. Ensuring quality should be at the heart of each open source project. Verifying that an open source project ensures quality is mandatory for each stakeholder of this project, especially its users. I describe in this post how software quality can be measured and ensured when integrating them in your application. Hence, we focus only on evaluating the technical quality of the library, because it suitability for business purposes has to be evaluated within the regular quality assurance process of your business.

The motivation for this post is that other open source projects will employ also transparent software quality measures so that eventually all software becomes better. Ideally, this information can be pulled by application developers to have a quality overview on their software, but also its dependencies. The key message is that you need to deploy solutions to efficiently measure software quality progress of your applications including dependencies and make the information transparently available in a dashboard (see screenshot) for everyone in near realtime.

Although I write here about open source software, you should push your commercial software providers to at least provide the same level of quality assurance and in particular transparency. This also includes the external commercial and open source libraries they depend on.

Software quality is “The totality of functionality and features of a software product that bears on its ability to satisfy stated or implied needs” (cf. ISTQB Glossary). This definition is based on the ISO9216 standard (succeeded by ISO/IEC 25000), which describes various quality classes, such as functionality, reliability, usability, efficiency, maintainability and portability.

I chose two open source libraries in which I am deeply involved to demonstrate measures to increase their quality:

  • HadoopCryptoLedger: A library for processing cryptoledgers, such as the Bitcoin Blockchain, on Big Data platforms, such as Hadoop, Hive and Spark
  • HadoopOffice: A library for processing office documents, such as Excel, on Big Data platforms, such as Hadoop and Spark

Before you think about software quality, one important task is to think about what functionality and which business process that is supported by the application brings the maximum value to the customer. Depending on the customer needs, the same application can bring different value to different customers. When looking at the supported business process you also must consider that not all potential execution variants of it should be evaluated, but only the most valuable for the business.

At the same time, you should evaluate the risk that the application fails and how it impacts the business process. If the risk is low or you have equally mitigation measures (e.g. insurance) then you can put there less emphasis compared to a business process where the risk is high and no mitigation measures exist. Additionally, I also recommend to look at value of functionality. If a functionality provides a lot of value then you should test it much more. Sometimes this change of view from risk to value uncovers more important areas of quality of a software product.

Steps for measuring quality

One of the core requirements for increasing software quality is to measure quality for third party libraries. It implies that you have mapped and automated your software development and deployment process (this is also sometimes referred to as DevOps). This can be done in a few steps:

  1. Have a defined repeatable build process and execute them in an automated fashion via a continuous integration tool.
  2. Version your build tool
  3. Software repository not only for source code
  4. Integrate automated unit and integration tests into the build process.
  5. Download dependencies, such as third party libraries, from an artifact management server (repository) that manage also their versions
  6. Evaluate quality of test cases
  7. Code documentation
  8. Code Analytics: Integrate static code analysis into the build process.
  9. Security and license analysis of third party libraries
  10. Store build and tested libraries in a versioned manner into an artifact management server (repository)
  11. Define a Security policy (security.md)
  12. Define a contribution policy (contributing.md)
  13. Ensure proper amount of logging
  14. Transparency about the project activity

Of course, these are only technical means for measuring quality. Out of scope here, but equally important is to define and version requirements/use cases/business process models, system tests, acceptance tests, performance tests, reliability tests and other non-functional tests. Additionally you should take into account automated deployment and deployment testing, which is out of scope of this article.

Evaluate quality

Given the aforementioned points, we describe now how they have been realized for the hadoopoffice and hadoopcryptoledger. Of course, they can be realized also with different tools or methods, but they provide you some indicators on what to look for.

1. Have a defined repeatable automated continuous integration process

We use for the hadoopoffice and hadoopcryptoledger library the Continuous Integration Platform Travis CI, which is offered as a free service for open source projects in the cloud. The aforementioned projects include in the source code repository root folder a file named .travis.yml (example), which describes the build process that is executed by travis. Everytime new code is committed the library is build again including the examples and automated unit and integration tests take place.

Travis CI provides a container infrastructure that allows executing the build process in so called software containers, which can integrate flexible different types of software without the overhead of virtual machines. Container can be pulled from repositories and flexible composed together (e.g. different Java Development Kits, operating systems etc.). Additionally, one wastes the costs of virtual machines being idle and capacity not used. Furthermore, you can track previous execution of the build process (see example here) to measure its stability.

Gradle is used for nearly every step in the process to build the application, generate docuemntation, execute unit and integration tests and collect reports.

Shell scripts are used to upload the HTML reports and documentation to a branch in Github (gh-pages) so that it can be easily found in search engines by the users and viewed in the browser.

2. Version your build tool

As said, our main build tool is Gradle. Gradle is based on the popular Java Virtual Machine scripting language Groovy and thus it scales flexible to your build process as well as make build scripts more readable in comparison to XML based tools, which are inflexible and a nightmare to manage for larger projects. Furthermore, plugins can be easily implemented using Groovy.

Modern build tools are continuously evolving. Hence, one want to test new versions of the build tool or upgrade to new versions rather smoothly. Gradle allows this using Gradle Wrapper – you simply define the desired version and it downloads it (if not already done) to be executed for your current build process. You do not need any manual/error prone installation.

3. Software Version Repository not only for source code

One key important ingredient of the software development process is the software version repository. Most of the m now are based on GIT. It is extremely flexible and scale from single person projects to projects with thousands of developers, such as the Linux kernel. Furthermore, it allows you to define your own source code governance process. For example, the source code of open source libraries that I introduced in the beginning are hosted on a GIT server in the cloud: Github. Alternatives are Gitlab or Bitbucket.org, but also many more. These cloud solutions also allow you having private repositories fully backed up for a small fee.

As I said, these repositories are not only for your software source code, but also your build scripts and continuous integration scripts. For example, for the aforementioned software libraries it stores the Travis scripts and the build files as well as test data. Large test data files (e.g. full blockchains) should not be stored in the version management system, but in a dedicated cloud object storage, such as Amazon S3 or Azure Storage. Here you have also the possibility to store them on ultra-low cost storage which might take longer to access the files, but this can be alright for a daily CI process execution.

4. Integrate automated unit and integration tests into the build process.

Another important aspect of software delivery is the software quality management process. As a minimum you should have not only automated unit and integration tests, but also a daily reporting on the coverage.

The reporting for the aforementioned library is done via Sonarqube.com and Codacy. Both not only report results on static code analysis, but also on unit/integration test coverage. Two have been chosen, because Codacy supports Scala (a programing language used in part of the aforementioned libraries).

More interesting, you find on both platforms also result for popular open source libraries. Here you can check for the open source libraries that you use in your product how good they are and how they improve quality over time.

Of course there are many more cloud platforms for code coverage and quality analysis (e.g. SourceClear).

Another important method for testing beyond unit testing for developers is integration testing. The reason is that complex platforms for Big Data, such as Hadoop, Flink or Spark, have emerged over the past years. These tests are necessary, because unit tests cannot reflect properly the complexity of these platforms and system tests are not focusing on technical aspects, such as code coverage. Furthermore, the execution of integration tests is very close to the final execution on platforms, but requires less resources and virtually no installation effort (an example is Spring Boot for web applications). However, still unit tests are mandatory, because they are very suitable to verify quickly that calculations with many different inputs are done correctly automatically.

Nevertheless, these integration tests require additional libraries. For example, for the aforementioned library, which are supporting all major Big Data platforms, the Hadoop integration tests suites are used as a base. They provide small miniclusters that can be even executed on our CI infrastructure based on containers with very little configuration and no installation efforts. Of course we use also the testing features of the platforms on top of this, such as the one for Spark or Flink.

5. Download dependencies, such as third party libraries, from an artifact management server (repository) that manage also their versions

It should be a no-brainer, but for all of your third party libraries that you include in your application you should manage them an artifact management server. Popular choices are Nexus or Artifactory. Both offer also cloud solutions for open source and private libraries (e.g. Maven Central or Bintray). The public cloud solutions should be analysed with care, for example the NPM.js gate security affair for Javascript packages or the PyPi issue for Python packages, where attackers replaced popular packages or uploaded packages with similar names to popular packages to trick developers in using malicious libraries in their application code. Having a local artifact management server quickly indicates you which of your developed applications are exposed and you can initiate mitigation measures (e.g. directly communciate to the project which library is affected).

These repositories are important, because you do not need to manage access to third party libraries in your software project. You can simply define in your build tool the third party library as well as the version and it will simply download it during the build process. This allows you simply changing and testing new versions.

Furthermore, having an artifact management server allows you to restrict access to third party libraries, e.g. ones with known security issues or problematic licenses. This means you can easily make your organisations more safe and you can detect where in your software products are libraries with issues.

Especially with the recent trend of self-service by end users in the Big Data world it is crucial that you track which libraries they are using to avoid vulnerabilities and poor quality.

6. Evaluate quality of test cases

As I wrote before, quality of test cases is crucial. One key important aspect of quality is coverage, ie to which extend your tests cover the source code of your software. Luckily, this can easily be measured with tools, such as Jacoco. They give you important information on statement and condition coverage.

There is no silver bullet how much coverage you need. Some famous software engineers say you should have at least 100% statement coverage, because it does not make sense that someone uses/tests a piece of software if not even the developers themselves have tested it. 100% might be difficult especially in cases of rapidly changing software.

In the end this is also a risk/value estimation, but having high statement coverage is NOT costly and saves you later a lot of costs with bugs/generally better maintainable software. It often also uncovers aspects that someone has not thought about before and require clarification, which should happen as early as possible.

For the aforementioned libraries we made the decision that we should have at least 80% statement coverage with the aim to have 100% coverage. Condition coverage is another topic. There are some guidelines, such as DO-178B/DO-178C for avionics or ISO 26262 for automotive that describe a certain type of condition coverage for critical software.

For the aforementioned libraries, there was no such guideline or standard that could be applied. However, both libraries are supporting Big Data applications and these applications, depending on their criticality will need to be tested accordingly.

Finally, another measure for quality of test cases is mutation testing. This type of evaluation basically modifies the software under test to see if the test cases still cover a mutated software. This is very useful in case a method changes frequently, is used in a lot of parts of the application and thus is an important piece. I found it rather useful for the utility functions of the HadoopCryptoledger library that parse binary blockchain data to get some meaningful information of it (cf. based on pitest). In this way you can cover more of the software with higher quality test cases.

7. Code documentation

Documentation of code is important. Of course, ideally the code is self-documenting, but with comments one can easier navigate in code and gets the key concepts of the underlying software. Everybody who has some more in-depth development exposure to popular big data platforms appreciates the documentation (but also of course the source code) shipped with it.

Additionally, one can automatically generate a manual (e.g. by using Javadoc or Scaladoc) that is very useful for developers to leverage the functionality of your library (e.g. find an example of the HadoopCryptoLedger library here). For the aforementioned libraries the documentation is automatically generated and stored in the version management branch “gh-pages”. From there Github pages takes the documentation and makes them available as web pages. Then, it can be found by search engines and viewed in the browser by the users.

The latter point is very important. Usually the problem is not that the developers do not write documentation, but publishing the documentation in time. The documentation for the aforementioned library is published every build in the most recent version. Usually in company it is not the case, it is a manual process that is infrequently or not done at all. However, solutions exist here, for example Confluence and the Docs plugin or the Jenkins Javadoc plugin, but also many others exist.

Furthermore, one should look beyond standard programing language documentation. Nowadays you can integrate UML diagrams (example) and markdown language (example) in your source code comments, which means that your technical specification can be updated every build eg at least once a day!

Finally, tools such as Sonarqube or Openhub can provide you information on the percentage of documentation in comparison to other software.

8. Code Analytics: Integrate static code analysis into the build process.

Besides testing, static code analysis can illustrate areas of improvements for your code. From my experience, if you start right away with it then it is a useful tool with very few false positive.

Even if you have a project that has grown over 10 years and did not start with code analysis, it can be a useful tool, because it shows also for newly added or modified code the quality and this one you can improve right away. Of course code that has never been touched and seems to work already since many years should be carefully evaluated if it makes sense to adapt to recommendation by a static code analysis tool.

Static code analysis is very useful in many areas, such as security or maintainability. It can show you code duplication, which should be avoided as much as possible. However, some code may always have some duplications. For example, for the aforementioned libraries two Hadoop APIs are supported that do essentially the same thing: mapred and mapreduce. Nevertheless, some Big Data tools, such as Hive, Flink or Spark use one API, but not the other one. Since, they are very similar some duplications exist and have to maintained in any case. Nevertheless, in most of the cases duplications should be avoided and remvoed.

One interesting aspect is that popular static code analysis tools in the cloud, such as Codacy, SourceClear or Sonarqube shows you also the quality or even better the evolution of quality of many popular open source projects, so that you can easily evaluate if you want to include them or not.

9. Security & License analysis of third party libraries

Many open source libraries use other open source libraries. The aforementioned libraries are no exception. Since they provide services to Big Data applications, they need to integrate with the libraries of popular Big Data platforms.

Here, it is crucial for the developers and security managers to get an overview of the libraries the application dependent on including the transitive dependencies (dependencies of dependencies). Then, ideally the dependencies and their versions are identified as well as matched to security databases, such as the NIST National Vulnerability Database (NVD).

For the aforementioned libraries the OWASP dependency check is used. After each build the dependencies are checked and problematic ones are highlighted to the developer. The dependency check is free, but it identifies dependencies and matches them to vulnerability based on the library name. This matching is not perfect and requires that you use a build tool with dependency management, such as Gradle, where there are clear names and versions for libaries. Furthermore, the NVD entries are not always of good quality, which makes matching even more challenging.

Commercial tools, such as Jfrog Xray or Nexus IQ have their own databases with enriched security vulnerability information based on various sources (NVD, Github etc.). Matching is done based on hashes and not names. This makes it much more difficult to include a library with wrong version information (or even with manipulated content) in your application.

Additionally, they check the libraries for open source licenses that may have certain legal implications on your organization. Licensing is a critical issue and you should take care that all source code files have a license header (this is the case for the libraries introduced in the beginning).

Finally, they can indicate good versions of a library (e.g. not snapshots, popular by other users, few known security vulnerabilities).

Since the aforementioned libraries introduced in the beginning are just used by Big Data applications, only the free OWASP dependency check is used. Additionally, of course the key important libraries are kept in sync with proper versions. Nevertheless, the libraries itself do not contain other libraries. When a Big Data application is developed with these libraries the developer has to choose the dependent libraries for the Big Data platform.

The development process of Big Data application that use these libraries can employ a more sophisticated process based on their needs or risk exposure.

10. Store build and tested libraries in a versioned manner into an artifact management server (repository)

Since you retrieve already all your libraries in the right version for the artifact management server, you should also make your own libraries or application modules available there.

Our libraries introduced in the beginning are stored on Maven central in a versioned fashion where the developer can pull them in the right version for their platform.

You can also think for your Big Data application to make it available on an on-premise or in the cloud artifact management server (private or publicly). This is very useful for large projects where an application is developed by several software teams and you want to, for example, test integration of new versions of a module of an application and easily switch back if they do not work.

Furthermore, you can use it as part of your continuous deployment process to provide the right versions of the application module for automated deployment up to the production environment.

11. Define a Security policy (security.md)

A security policy (example) is a shared understanding between developers and users on security of the software. It should be stored in the root folder of the source code so that it can be easily found/identified.

As a minimum it defines the security reporting process of issues to the developers. For instance, it should be not public – in the beginning – as issues that everyone can see. The reason is that developers and users need to have some time to upgrade to the new software. After some grace period after fixing the issue the issue of course should be published (depending on how you define it in your security policy).

Depending on your software you have different security policies – there is no unique template. However certain elements can be applied to many different types of software, such as security reporting process, usage of credentials, encryption/signing algorithms, ports used by software or anything that might affect security.

Although many open source software has nowadays such a policy it is by far not common or easy to find. Furthermore, there is clearly a lack of standardization and awareness of security policies. Finally, what is completely missing is automatically combining different security policies of the libraries your application depends on to make a more sophisticated security analysis.

12. Define a contribution policy (contributing.md)

Open source libraries live from voluntary contributions of people from all over the world. However, one needs to define a proper process that defines minimum quality expectations of contributions and description on how people can contribute. This is defined in a policy that is stored in the root folder of the source code (example).

This also avoids conflicts if the rules of contribution are clear. Source code repositories, such as Github, show the contribution policy when creating issues or when creating pull requests.

13. Ensure proper amount of logging

Logging is necessary for various purposes. One important purpose is log analytics:

  • Troubleshooting of issues
  • Security analysis
  • Forensics
  • Improving user interactions
  • Proactive proposal for users what to do next based on other users behavior

Finding the right amount of logging is crucial. Too much logging decreases system performance, too little logging gives not the right information. Different configurable log levels supports tuning of this effort, however, usually the standard log levels (error, warn, info, debug etc.) are used and they do not distinguish the use case mentioned before. Even worse, there is usually no standard across an enterprise. Thus they do not take into account the integration of development and operation. For instance, if there is a warning then operation should know if the warning needs to be taken care of immediately or if it has time to prioritize their efforts.

This also implies that there is a need to have standards for logging in open source libraries, which do not exist. As a first step they would need to document their logging approach to get this discussion in the right direction.

Finally, although certain programing languages such as Java with log4j have rather sophisticated logging support this is not necessarily the case for other programing languages, especially high level languages, such as R or Pig. Here you need to define standards and libraries across your enterprise.

14. Transparency about the project activity

This is the last point (in this blog), but also one of the more important ones. Currently, organizations do not collect any information of the open source libraries they are using. This is very risky, because some open source projects may show less and less activity and are abandoned eventually. If an organizations puts a high valuable business process on them then this can cause a serious issue. In fact many applications (even commercial ones) consists nowadays to a large extent of open source libraries.

Additionally, you may want to know how much effort is put into an open source library, how frequently it changes, the degree of documentation etc. Of course, all these should be shown in context with other libraries.

OpenHub is an example for such a platform that exactly does this (find here an example based on one of the aforementioned libraries). Here you can see the activity and the community of a project based on data it pulls out of the source code repository Github. It provides cost estimates (what would it cost to develop the software yourself) and security information.

Several popular open source projects have been scanned by OpenHub and you can track the activity over years to see if it is valuable to include a library or not.


Most of the aspects presented here are not new. They should been part of your organization since several years. Here, we put some emphasis on how open source developers can increase quality and its transparency using free cloud services. This may lead to better open source projects. Furthermore, open source projects have been always the driver for better custom development projects in company by employing best practices in the area of continuous integration and deployment. Hence, the points mentioned here should also serve as a checklist to improve your organizations software development quality.

Software quality increases the transparency on your software product which is the key of successful software development. This itself is a significant benefit, because if you do not know the quality then you do not have quality.

There are a lot of tools supporting this and even if you have to develop unit/integration tests for 100% coverage this usually does not take more than 5-10% of your developers time. The benefits are, however, enormous. Not only that the developer and users better understand the product, they are also more motivated to see the coverage and keep a high standard of quality in the product. If you software provider tells you that unit/integration tests are not worth the effort or that they do not have time for it – get rid of your software provider – there is no excuse. They must make quality information available in real time: This is the standard.

Given the increase of automation and documentation of software products, I expect a significant increase of productivity of developers. Furthermore, given the converge of Big Data platforms, automated assistants (bots) will guide users to develop themselves more easily and quicker high quality analysis depending on their data. Hence, they will be more empowered and less dependent on external developers – they will become developers themselves with the guidance of artificial intelligence doing most of their work to focus more on valuable activities.

Nevertheless, this depends also on their discipline and the employment of scientific methods in their analysis and execution of business processes.


Schreibe einen Kommentar

Deine E-Mail-Adresse wird nicht veröffentlicht. Erforderliche Felder sind mit * markiert