Templates, low footprint mode, improved integration with Spark
for the HadoopOffice library for reading/writing Excel files on Big Data
platforms
2017-08-01 --- Jörn Franke
Although it may look like only a small improvement, version
1.0.4 of the HadoopOffice library has a lot of new features for
reading/writing Excel files:
- Templates, so you can define complex documents with
diagrams or other features in MS Excel and fill them with data or formulas
from your Big Data platform in Hadoop, Spark & Co
- Low footprint mode – this mode leverages the Apache
POI event and streaming APIs. It significantly reduces CPU and memory
consumption at the expense of certain features (e.g. evaluation of
formulas, which is only supported in standard mode). This mode supports
reading old MS Excel (.xls)/new MS Excel (.xlsx) and writing new MS
Excel (.xlsx) documents
- New features in the Spark 2 datasource:
  - Inferring the DataFrame schema, consisting of simple Spark SQL
  DataTypes (Boolean, Date, Byte, Short, Integer, Long, Decimal, String),
  from the data in the Excel file
  - Improved writing of a DataFrame based on a schema with simple Spark
  SQL DataTypes
  - Interpreting the first row of an Excel file as column names for the
  DataFrame for reading (“header”)
  - Writing column names of a DataFrame as the first row of an Excel
  file (“header”)
  - Support for Spark 2.0.1, 2.1, 2.2
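To make the header and schema inference features more concrete, here is a minimal, self-contained Python sketch of the general idea: take the first row as column names and pick, per column, the narrowest simple type that fits all values. It is not the HadoopOffice implementation. The function names (cell_type, merge_types, infer_schema), the ISO date format, the fallback column names (c0, c1, ...) and the widening rules are all assumptions made for illustration; only the type names come from the list above.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Integer types from narrowest to widest, with their signed ranges.
INT_TYPES = [
    ("Byte", -2**7, 2**7 - 1),
    ("Short", -2**15, 2**15 - 1),
    ("Integer", -2**31, 2**31 - 1),
    ("Long", -2**63, 2**63 - 1),
]
# Numeric types ordered so that a later entry can hold any earlier one.
NUMERIC_ORDER = ["Byte", "Short", "Integer", "Long", "Decimal"]


def cell_type(text):
    """Return the narrowest simple type name that can hold one cell value."""
    value = text.strip()
    if value.lower() in ("true", "false"):
        return "Boolean"
    try:
        number = int(value)
    except ValueError:
        pass
    else:
        for name, low, high in INT_TYPES:
            if low <= number <= high:
                return name
        return "Decimal"  # integer too large for Long
    try:
        if Decimal(value).is_finite():
            return "Decimal"
    except InvalidOperation:
        pass
    try:
        datetime.strptime(value, "%Y-%m-%d")  # assumed date format
        return "Date"
    except ValueError:
        return "String"


def merge_types(a, b):
    """Widen two type names to one that can hold values of both."""
    if a == b:
        return a
    if a in NUMERIC_ORDER and b in NUMERIC_ORDER:
        return max(a, b, key=NUMERIC_ORDER.index)
    return "String"  # assumed fallback for mixed columns


def infer_schema(rows, header=True):
    """Infer (column name, type name) pairs from rows of cell strings."""
    if header:
        names, data = rows[0], rows[1:]
    else:
        # Hypothetical default names when there is no header row.
        names = ["c%d" % i for i in range(len(rows[0]))]
        data = rows
    schema = []
    for index, name in enumerate(names):
        column_type = None
        for row in data:
            if index >= len(row) or row[index].strip() == "":
                continue  # empty cells do not influence the type
            found = cell_type(row[index])
            column_type = found if column_type is None else merge_types(column_type, found)
        schema.append((name, column_type or "String"))
    return schema


rows = [
    ["id", "price", "active", "created", "note"],
    ["1", "9.99", "true", "2017-08-01", "first"],
    ["300", "12", "false", "2017-08-02", "42"],
]
print(infer_schema(rows))
```

In this sketch the id column widens from Byte to Short because 300 does not fit into a Byte, and the note column falls back to String because it mixes text and numbers. A real datasource works on typed Excel cells and a configured locale rather than plain strings, so treat this only as an illustration of the principle.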
Of course, the existing features remain usable, such as metadata
reading/writing, encryption/decryption, linked workbooks, support for
Hadoop MapReduce, support for Spark 2 datasources and support for Spark
1.
What is next?
- Support for Apache Flink for reading/writing Excel files
- Support for Apache Hive (Hive SerDe) for reading/writing Excel
files
- Support for digitally signing/verifying signature(s) of Excel
files
- Support for reading MS Access files
- … many more