Reading/Writing Excel documents with the HadoopOffice library on Hadoop and Spark – First release

Jan. 8, 2017

Reading/writing office documents, such as Excel, has always been challenging on Big Data platforms. Although many libraries exist for reading/writing office documents, they have never really been integrated into Hadoop or Spark, which leads to considerable development effort.

There are several use cases for using office documents jointly with Big Data technologies:

- Enabling the full customer-centric data science lifecycle: Within your Big Data platform you crunch numbers for complex models. However, you have to make them accessible to your customers. Let us assume you work in the insurance industry. Your Big Data platform calculates various models focused on your customer for insurance products. Your sales staff receives the models in Excel format. They can now explore different parameters together with the customers, e.g. retirement age, individual risks etc. They may also come up with a different proposal that is more suitable for your customer, and you want to feed it back into your Big Data platform to see whether it is feasible.
- You still have a lot of data in Excel files related to your computation. These may be code lists, manually collected data, or data from existing systems that simply only support this format.

Hence, the HadoopOffice library was created and the first version has just been released!

It features:

- A Hadoop FileFormat for reading/writing Excel files using the Apache POI library, so that nearly all Hadoop ecosystem components can read/write them
  - Excel files can be in .xls or .xlsx format, encrypted or unencrypted, and may contain linked workbooks, formulas, comments etc.; they can also be filtered based on metadata
  - Both the mapred.* and the mapreduce.* API are supported
- A Spark2 datasource for reading/writing Excel files, enabling convenient integration of the HadoopOffice library into Spark2. It is available on Spark-packages.
- Examples
  - Reading Excel documents using MapReduce: Converting Excel to CSV
  - Writing Excel documents using MapReduce: Converting CSV to Excel
  - Reading Excel documents using the Spark2 datasource API: Displaying the number of rows in the Excel document as well as the content
  - Writing Excel documents using the Spark2 datasource API: Example for creating a dataframe with Excel formulas and comments and writing it to a .xlsx file
  - An example for reading Excel files using Spark 1.x without the Spark2 datasource API

Of course, further releases are planned:

- Support for signing Excel documents and verifying their signatures
- Going beyond Excel with further office formats, such as ODF Calc
- A Hive Serde for querying and writing Excel documents directly in Hive
- Further examples, including one for Apache Flink

Tags: datasource, excel, hadoop, hadoopoffice, read, spark, spark2, write, xls, xlsx

Comments

2 responses to "Reading/Writing Excel documents with the HadoopOffice library on Hadoop and Spark – First release"

vipin

March 7, 2017

Hi,
I have implemented this feature in a Java program:
pom.xml:

<dependency>
  <groupId>com.github.zuinnote</groupId>
  <artifactId>spark-hadoopoffice-ds_2.11</artifactId>
  <version>1.0.1</version>
</dependency>

Java code:

// imports needed by this snippet
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

String file = "/home/empower/WorkingData/Project/Spark Work/spark_file_dir/input/file";
String source = "org.zuinnote.spark.office.excel";
String key = "read.locale.bcp47";
String value = "de";
Dataset<Row> rowds = sparkSession.sqlContext().read().format(source).option(key, value).load(file);

Error: 17/03/07 12:55:13 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NoSuchMethodError: scala.runtime.IntRef.create(I)Lscala/runtime/IntRef;
at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$buildReader$2$$anonfun$apply$3.apply(DefaultSource.scala:154)
at org.zuinnote.spark.office.excel.DefaultSource$$anonfun$buildReader$2$$anonfun$apply$3.apply(DefaultSource.scala:149)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:93)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

A little help would be much appreciated.

Spark version: 2.0.0
Java version: 1.8

Thanks

1. jornfranke
  
  March 7, 2017
  
  Hi, this error does not seem to be related to HadoopOffice, but to your application. You seem to compile for Scala 2.11, while your cluster (or one of your dependencies) uses Scala 2.10. The examples in the HadoopOffice library contain a build.sbt that compiles properly for both Scala 2.10 and Scala 2.11 (both versions are supported). You can find more information here: http://stackoverflow.com/questions/27925375/nosuchmethoderror-when-declaring-a-variable Please let me know whether this helped. If not, do not hesitate to create an issue on GitHub. Thanks! All the best
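
One way to check which Scala runtime is actually on the classpath is to look for the method named in the stack trace. The following is a minimal sketch and not part of HadoopOffice: the class name ScalaBinaryCheck and its helper methods are hypothetical, and it rests on the assumption that scala.runtime.IntRef only gained its static create(int) method with Scala 2.11. It uses plain Java reflection, so it also runs when no Scala library is on the classpath at all.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

// Hypothetical diagnostic helper, not part of HadoopOffice or Spark.
public class ScalaBinaryCheck {

    /** Returns true if the class can be loaded and has a public static method with this signature. */
    public static boolean hasStaticMethod(String className, String methodName, Class<?>... parameterTypes) {
        try {
            Class<?> cls = Class.forName(className);
            Method method = cls.getMethod(methodName, parameterTypes);
            return Modifier.isStatic(method.getModifiers());
        } catch (ClassNotFoundException | NoSuchMethodException | LinkageError e) {
            // Class or method is not available on the current classpath.
            return false;
        }
    }

    /** Describes the Scala runtime visible to the current class loader. */
    public static String describeScalaRuntime() {
        try {
            Class.forName("scala.runtime.IntRef");
        } catch (ClassNotFoundException | LinkageError e) {
            return "scala-library is not on the classpath";
        }
        // Assumption: IntRef.create(int) is present from Scala 2.11 onwards and absent in 2.10.
        return hasStaticMethod("scala.runtime.IntRef", "create", int.class)
                ? "IntRef.create(int) found: Scala 2.11 or later"
                : "IntRef.create(int) missing: probably Scala 2.10";
    }

    public static void main(String[] args) {
        System.out.println(describeScalaRuntime());
    }
}
```

Running this with the same classpath as the Spark job (for example by calling describeScalaRuntime() from within the job) should indicate whether a 2.10 or a 2.11 runtime is being picked up.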
  
