Big Data Analytics on Excel files using Hadoop/Hive/Flink/Spark

Today we have released HadoopOffice v1.1.0 with major enhancements: Based on the latest Apache POI 3.17 Apache Hive: Query Excel files and write tables to Excel files using the Hive Serde Apache Flink support for Flink Table API and Flink DataSource/DataSink Signing and verification of signatures of Excel files Example to use the HadoopOffice library… Big Data Analytics on Excel files using Hadoop/Hive/Flink/Spark weiterlesen

HadoopOffice – A Vision for the coming Years

HadoopOffice is already since more than a year available (first commit: 16.10.2016). Currently it supports Excel formats based on the Apache POI parsers/writers. Meanwhile a lot of functionality has been added, such as: Support for .xlsx and .xls formats – reading and writing Encryption/Decryption Support Support for Hadoop mapred.* and mapreduce.* APIs Support for Spark… HadoopOffice – A Vision for the coming Years weiterlesen

Mapred vs MapReduce – The API question of Hadoop and impact on the Ecosystem

I will describe in this blog post the difference between the mapred.* and mapreduce.* API in Hadoop with respect to the custom InputFormats and OutputFormats. Additionally I will write on the impact of having both APIs on the Hadoop Ecosystem and related Big Data platforms, such as Apache Flink, Apache Hive and Apache Spark. Finally,… Mapred vs MapReduce – The API question of Hadoop and impact on the Ecosystem weiterlesen

Templates, low footprint mode, improved integration with Spark for the HadoopOffice library for reading/writing Excel files on Big data platforms

Although it seems to be that it was only a small improvement, version 1.0.4 of the HadoopOffice library has a lot of new features for reading/writing Excel files: Templates, so you can define complex documents with diagrams or other features in MSExcel and fill it with data or formulas from your Big Data platform in… Templates, low footprint mode, improved integration with Spark for the HadoopOffice library for reading/writing Excel files on Big data platforms weiterlesen

Reading/Writing Excel documents with the HadoopOffice library on Hadoop and Spark – First release

Reading/Writing office documents, such as Excel, has been always challenging on Big data platforms. Although many libraries exist for reading/writing office documents, they have never been really integrated in Hadoop or Spark and thus lead to a lot of development efforts. There are several use cases for using office documents jointly with Big data technologies:… Reading/Writing Excel documents with the HadoopOffice library on Hadoop and Spark – First release weiterlesen

Sneak Preview – HadoopOffice: Processing Office documents using the Hadoop Ecosystem – The example of Excel files

I present in this blog post the sneak preview of the hadoopoffice library that will enable you to process Office files, such as MS Excel, using the Hadoop Ecosystem including Hive/Spark. It currently contains only an ExcelInputFormat, which is based on Apache POI. Additionally, it contains an example that demonstrates how an Excel input file… Sneak Preview – HadoopOffice: Processing Office documents using the Hadoop Ecosystem – The example of Excel files weiterlesen