Zukunft-Innovation-Technik (ZuInnoTe) – Digitalize your business

Sneak Preview – HadoopOffice: Processing Office documents using the Hadoop Ecosystem – The example of Excel files

Oct. 22, 2016

in analytics, big data, hive, office, tech

This post presents a sneak preview of the hadoopoffice library, which will let you process Office files, such as MS Excel, using the Hadoop ecosystem, including Hive and Spark.
It currently contains only an ExcelInputFormat, which is based on Apache POI.

Additionally, it contains an example that demonstrates how an Excel input file on HDFS can be converted into a simple CSV file on HDFS.
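
To make the conversion step more concrete, here is a minimal sketch of the CSV-writing side only. It is not the example shipped with hadoopoffice: it uses no Hadoop or Apache POI classes, and it assumes each spreadsheet row has already been read into an array of cell strings. The class name RowToCsv and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: turns one spreadsheet row (already read as strings) into a CSV line. */
public class RowToCsv {

    /** Quotes a cell only when it contains a separator, a quote, or a line break. */
    static String escape(String cell) {
        if (cell == null) {
            return ""; // an empty spreadsheet cell becomes an empty CSV field
        }
        boolean needsQuotes = cell.contains(",") || cell.contains("\"")
                || cell.contains("\n") || cell.contains("\r");
        if (!needsQuotes) {
            return cell;
        }
        // Double any embedded quotes, then wrap the whole field in quotes
        return "\"" + cell.replace("\"", "\"\"") + "\"";
    }

    /** Joins the escaped cells of one row with commas. */
    static String toCsvLine(String[] row) {
        List<String> fields = new ArrayList<>();
        for (String cell : row) {
            fields.add(escape(cell));
        }
        return String.join(",", fields);
    }

    public static void main(String[] args) {
        String[] row = {"Berlin", "1,5", null, "say \"hi\""};
        System.out.println(toCsvLine(row));
    }
}
```

In a real job, the rows would come from the input format and the resulting lines would be written back to HDFS; the quoting rule shown here is the part that is easy to get wrong when cells contain commas or quotes.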

Finally, you may want to look at the wiki page that explains how to improve performance when processing many small files, such as Office documents, on Hadoop.
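
The wiki page itself is not reproduced here. As a general illustration of why small files matter, one common approach is to group many small files into fewer, larger units of work, which is the idea behind Hadoop's CombineFileInputFormat. The sketch below shows only that grouping logic in plain Java; it does not use the Hadoop API, and the class name SmallFileGrouper, the file names, and the sizes are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration: pack small files into groups so each task gets more work. */
public class SmallFileGrouper {

    /** Greedily packs files, in iteration order, into groups of at most maxBytes each. */
    static List<List<String>> group(Map<String, Long> fileSizes, long maxBytes) {
        List<List<String>> groups = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long currentBytes = 0;
        for (Map.Entry<String, Long> e : fileSizes.entrySet()) {
            long size = e.getValue();
            // Start a new group when adding this file would exceed the limit
            if (!current.isEmpty() && currentBytes + size > maxBytes) {
                groups.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            // A file larger than maxBytes ends up alone in its own group
            current.add(e.getKey());
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        Map<String, Long> files = new LinkedHashMap<>();
        files.put("a.xlsx", 40L);
        files.put("b.xlsx", 50L);
        files.put("c.xlsx", 30L);
        files.put("d.xlsx", 200L);
        System.out.println(group(files, 100L));
    }
}
```

With one task per group instead of one task per file, the fixed startup cost of a task is spread over more data, which is usually where the gain for small Office documents comes from.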

Of course, this is only the beginning. The following is planned for the near future:

Support for other office formats as input: ODF Spreadsheets, ODF Database, MS Access, dBase, MS Word…
Support for other office formats as output
A HiveSerde to query office documents in Hive using SQL
An official release on Maven Central
An example for Apache Spark

csv excel hadoop hadoopoffice spark
