HadoopOffice – A Vision for the coming Years
2018-01-02 --- Jörn Franke
HadoopOffice has been available for more than a year
(first commit: 16.10.2016). It currently supports Excel
formats based on the Apache POI parsers/writers. Since then, a lot of
functionality has been added, such as:
- Support for .xlsx and .xls formats – reading and writing
- Encryption/Decryption Support
- Support for Hadoop mapred.* and mapreduce.* APIs
- Support for Spark 1.x (via mapreduce.*) and Spark 2.x (via data
source APIs)
- Low footprint mode to use less CPU and memory resources to parse and
write Excel documents
- Template support: add complex diagrams and other functionality to
your Excel documents without coding
In 2018 and the coming years, we want to go beyond this
functionality:
- Add further security functionality: signing and signature
verification for new Excel files (in XML format, via XML signature), and
storing credentials for encryption, decryption and signing in keystores
- Apache Hive support
- Apache Flink support
- Add support for reading/writing Access databases based on the Jackcess
library, including encryption/decryption support
- Add support for dBase formats
- Develop a new spreadsheet format suitable for the Big Data world:
There is currently a significant gap in the Big Data world. There are
formats optimized for data exchange, such as Apache Avro, and for
large-scale analytics queries, such as Apache ORC or Apache Parquet.
These formats have proven very suitable in the Big Data world. However,
they only store data, not formulas. This means that every time even
simple calculations are needed, they have to be done in dedicated
ETL/batch processes that vary across clusters or software instances.
This makes it difficult to exchange data, to determine how data was
calculated, to compare calculations or to flexibly recalculate data,
which is one of the key advantages of spreadsheet formats such as Excel.
However, Excel is not designed for Big Data processing. Hence, the goal
is to find a spreadsheet format that is suitable for Big Data processing
and as flexible as Excel/LibreOffice Calc. Finally, a streaming
spreadsheet format should be supported.
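To illustrate the gap described in the last item, here is a minimal sketch in Python of what "storing formulas together with data" could mean. It is not part of HadoopOffice and not a proposal for the actual format. The cell model, the formula syntax (a single binary operation between two cell references) and the function name `evaluate` are all assumptions made for this illustration.

```python
from typing import Dict, FrozenSet, Union

# Hypothetical toy model: a cell holds either a number or a formula string
# such as "=A1*A2". Only one binary operation between two cell references
# (or a plain reference such as "=A1") is supported in this sketch.
Cell = Union[float, str]

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}


def evaluate(sheet: Dict[str, Cell], ref: str,
             seen: FrozenSet[str] = frozenset()) -> float:
    """Return the numeric value of cell `ref`, resolving formulas recursively."""
    if ref in seen:
        raise ValueError(f"circular reference at {ref}")
    cell = sheet[ref]
    if not isinstance(cell, str):
        return float(cell)  # plain data cell
    expr = cell.lstrip("=")
    for symbol, op in OPS.items():
        if symbol in expr:
            left, right = expr.split(symbol, 1)
            return op(evaluate(sheet, left.strip(), seen | {ref}),
                      evaluate(sheet, right.strip(), seen | {ref}))
    # A formula that is just a reference to another cell.
    return evaluate(sheet, expr.strip(), seen | {ref})


# The calculation travels with the data instead of living in a separate job.
sheet: Dict[str, Cell] = {
    "A1": 100.0,       # net amount
    "A2": 0.25,        # rate
    "A3": "=A1*A2",    # derived value, stored as a formula
    "A4": "=A1+A3",    # total
}
total_before = evaluate(sheet, "A4")   # 125.0

sheet["A2"] = 0.5                      # change an input ...
total_after = evaluate(sheet, "A4")    # ... and recalculate: 150.0
```

Because the formulas are stored in the sheet itself, a reader can see how `A4` was derived and recalculate it after changing an input, without a separate ETL job. A real Big Data format would additionally need to address partitioning, columnar storage and parallel evaluation, none of which this sketch attempts.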
HadoopOffice aims to support legacy office formats (Excel, Access,
etc.) in a secure manner on Big Data platforms, while also paving the way
for a new spreadsheet format suitable for the Big Data world.