In this blog post I present a sneak preview of the hadoopoffice library, which will enable you to process Office files, such as MS Excel, using the Hadoop ecosystem, including Hive and Spark.
It currently contains only an ExcelInputFormat, which is based on Apache POI.
Additionally, it contains an example that demonstrates how an Excel input file on HDFS can be converted into a simple CSV file on HDFS.
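To give a feel for the CSV side of such a conversion, here is a minimal sketch in plain Java. It is not the code of the example shipped with the library: the class name `CsvRowSketch` and its methods are hypothetical, and it assumes the cell values of one spreadsheet row have already been extracted as strings (in the library that is the job of the ExcelInputFormat and Apache POI, which are not shown here). It only illustrates how a row can be turned into one CSV line with the usual quoting rules.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of hadoopoffice: turns one row of
// already-extracted cell values into a single CSV line.
public class CsvRowSketch {

    // Quote a field only if it contains a comma, a double quote, or a line break;
    // embedded double quotes are doubled (RFC 4180 style).
    public static String escape(String field) {
        if (field == null) {
            return ""; // assumption: an empty cell becomes an empty field
        }
        boolean needsQuoting = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuoting) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Join the escaped cells of one row with commas.
    public static String toCsvLine(List<String> cells) {
        return cells.stream()
                .map(CsvRowSketch::escape)
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        List<String> row = Arrays.asList("Alice", "1,5", null, "said \"hi\"");
        // prints: Alice,"1,5",,"said ""hi"""
        System.out.println(toCsvLine(row));
    }
}
```

In a real Hadoop job, logic like this would typically sit in the map step, with one call per row read from the input format, and the resulting lines written out as text.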
Finally, you may want to look at this wiki page, which explains how to improve performance when processing many small files, such as Office documents, on Hadoop.
Of course, this is only the beginning. The following things are planned for the near future:
- Support for other office formats as input: ODF Spreadsheets, ODF Database, MS Access, dBase, MS Word, …
- Support for other office formats as output
- A Hive SerDe to query office documents in Hive using SQL
- An official release on Maven Central
- An example for Apache Spark