HadoopOffice has been available for more than a year (first commit: 16.10.2016). It currently supports Excel formats, based on the Apache POI parsers and writers. In the meantime, a lot of functionality has been added, such as:
- Support for .xlsx and .xls formats – reading and writing
- Encryption/Decryption Support
- Support for Hadoop mapred.* and mapreduce.* APIs
- Support for Spark 1.x (via mapreduce.*) and Spark 2.x (via data source APIs)
- Low-footprint mode that uses less CPU and memory when parsing and writing Excel documents
- Template support – add complex diagrams and other functionality to your Excel documents without coding
In 2018 and the coming years, we want to go beyond this functionality:
- Add further security functionality: signing and signature verification for new Excel files (the XML-based format, via XML signature), and storing credentials for encryption, decryption and signing in keystores
- Apache Hive Support
- Apache Flink Support
- Add support for reading/writing Access databases, based on the Jackcess library, including encryption/decryption support
- Add support for dBase formats
- Develop a new spreadsheet format suitable for the Big Data world: There is currently a significant gap here. There are formats optimized for data exchange, such as Apache Avro, and for large-scale analytics queries, such as Apache ORC or Apache Parquet. These formats have proven very suitable in the Big Data world. However, they store only data, not formulas. This means that whenever even simple calculations are needed, they have to be implemented in dedicated ETL/batch processes that vary between clusters or software instances. This makes it hard to exchange data, to determine how data was calculated, to compare calculations or to flexibly recalculate data, all of which are key advantages of spreadsheet formats such as Excel. However, Excel is not designed for Big Data processing. Hence, the goal is to find a spreadsheet format that is suitable for Big Data processing and as flexible as Excel/LibreOffice Calc. Finally, a streaming spreadsheet format should be supported.
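
To make the gap described in the last item more concrete, here is a minimal Python sketch. It is not a proposal for the actual format; the structure of `sheet`, the tiny formula vocabulary in `OPS` and the helper names `evaluate` and `recalculate` are all assumptions made up for this illustration. It only shows the idea that formulas can travel together with the data, so that a receiver can see how a value was derived and recompute it.

```python
import json
import operator

# Hypothetical, minimal formula vocabulary used only in this sketch.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def evaluate(formula, row):
    """Evaluate {"op": ..., "args": [...]} against one row.

    String arguments are treated as column names, numbers as constants.
    """
    args = [row[a] if isinstance(a, str) else a for a in formula["args"]]
    result = args[0]
    for value in args[1:]:
        result = OPS[formula["op"]](result, value)
    return result


def recalculate(sheet):
    """Return copies of the rows with every formula column recomputed."""
    out = []
    for row in sheet["rows"]:
        computed = dict(row)
        for column, formula in sheet["formulas"].items():
            computed[column] = evaluate(formula, computed)
        out.append(computed)
    return out


# Data and formulas are stored side by side (assumed layout).
sheet = {
    "formulas": {"total": {"op": "mul", "args": ["price", "quantity"]}},
    "rows": [{"price": 2.5, "quantity": 4}, {"price": 10, "quantity": 3}],
}

# The formulas survive serialization, unlike in a data-only export.
restored = json.loads(json.dumps(sheet))
results = recalculate(restored)
```

If a price in `restored["rows"]` is changed, calling `recalculate(restored)` again yields the updated totals without a separate ETL job. The sketch ignores everything a real format would need, such as dependencies between formulas, typing, columnar storage and streaming.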
HadoopOffice aims to support legacy office formats (Excel, Access, etc.) in a secure manner on Big Data platforms, while also paving the way for a new spreadsheet format suitable for the Big Data world.