Reading/Writing Excel documents with the HadoopOffice library on
Hadoop and Spark – First release
2017-01-08 --- Jörn Franke
Reading/Writing office documents, such as Excel, has always been
challenging on Big Data platforms. Although many libraries exist for
reading/writing office documents, they have never really been integrated
into Hadoop or Spark, which leads to a lot of development effort.
There are several use cases for using office documents jointly with
Big Data technologies:
- Enabling the full customer-centric data science lifecycle: Within
your Big Data platform you crunch numbers for complex models. However,
you have to make them accessible to your customers. Let us assume you
work in the insurance industry. Your Big Data platform calculates
various customer-focused models for insurance products. Your
sales staff receives the models in Excel format. They can now explore
the different parameters together with the customers, e.g. retirement
age, individual risks etc. They may also come up with a different
proposal that is more suitable for your customer, and you want to feed it
back into your Big Data platform to see if it is feasible.
- You still have a lot of data in Excel files related to your
computation. These may be code lists or manually collected data, or
your existing systems may simply support this format.
Hence, the HadoopOffice library was created and the first
version has just been released!
It features:
- A Hadoop FileFormat for reading/writing Excel files using the Apache POI library,
so that nearly all Hadoop ecosystem components can read/write them
- Support for Excel files in .xls or .xlsx format, encrypted or
unencrypted, with linked workbooks, formulas, comments etc.; files can
also be filtered based on metadata
- Support for both the mapred.* and mapreduce.* APIs
- A Spark2 datasource for reading/writing Excel files enabling
comfortable integration of the HadoopOffice library into Spark2. It is
available on Spark-packages.
- Examples
Of course, further releases are planned:
- Support for signing Excel documents and verifying their
signatures
- Going beyond Excel with further office formats, such as ODF
Calc
- A Hive SerDe for querying and writing Excel documents directly in
Hive
- Further examples including one for Apache Flink