Tag: hadoop
-
Spark+Scala+GraphX: Analyzing the Bitcoin Transaction Graph
The hadoopcryptoledger library now provides an example of how you can generate a Bitcoin Transaction Graph using the Big Data graph analysis technologies Spark+Scala+GraphX. Basically, it demonstrates how to read the Bitcoin Blockchain from HDFS and transform it into a graph with Bitcoin addresses as vertices and the transactions between them as edges. The example returns the 5…
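Under the assumption that (sender address, receiver address) pairs have already been extracted from the blockchain (in the real example this is done with the hadoopcryptoledger Hadoop input format), a minimal Spark+Scala+GraphX sketch of such a transaction graph could look like this; the address values and the top-5 criterion (in-degree) are purely illustrative:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.graphx.{Edge, Graph}

    object BitcoinGraphSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("BitcoinTransactionGraph"))

        // Illustrative input: (sender address, receiver address) pairs; in the real
        // example these would come from the Bitcoin Blockchain read via the
        // hadoopcryptoledger Hadoop input format.
        val addressPairs = sc.parallelize(Seq(
          ("addrA", "addrB"), ("addrA", "addrC"), ("addrB", "addrC")))

        // GraphX requires numeric (Long) vertex ids, so every address gets one.
        val vertices = addressPairs.flatMap { case (src, dst) => Seq(src, dst) }
          .distinct()
          .zipWithIndex()                              // RDD[(address, vertexId)]
        val idByAddress = vertices.collectAsMap()      // fine for a sketch; use a join at scale

        // One edge per transaction between two addresses.
        val edges = addressPairs.map { case (src, dst) =>
          Edge(idByAddress(src), idByAddress(dst), 1L)
        }

        val graph = Graph(vertices.map(_.swap), edges)

        // The 5 addresses receiving the most transactions (highest in-degree).
        graph.inDegrees
          .join(graph.vertices)
          .map { case (_, (inDegree, address)) => (address, inDegree) }
          .top(5)(Ordering.by[(String, Int), Int](_._2))
          .foreach(println)

        sc.stop()
      }
    }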
-
Hive & Bitcoin: Analytics on Blockchain data with SQL
You can now analyze the Bitcoin Blockchain using Hive and the hadoopcryptoledger library with the new HiveSerde plugin. Basically, you can link any data that you have loaded into Hive with Bitcoin Blockchain data. For example, you can link Blockchain data with important events in history to determine what causes Bitcoin exchange rates to increase or…
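As a sketch of that idea (not taken from the post), the following Scala snippet queries Hive over JDBC and joins a table assumed to be defined with the hadoopcryptoledger HiveSerde (here called bitcoin_transactions) with an ordinary Hive table of historic events; all table and column names are hypothetical:

    import java.sql.DriverManager

    object HiveBlockchainJoinSketch {
      def main(args: Array[String]): Unit = {
        // Standard HiveServer2 JDBC connection; adjust host, port and credentials.
        Class.forName("org.apache.hive.jdbc.HiveDriver")
        val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
        val stmt = conn.createStatement()

        // Hypothetical tables: bitcoin_transactions (assumed to be backed by the HiveSerde)
        // and historic_events (a normal Hive table with event_name and event_date columns).
        val rs = stmt.executeQuery(
          """SELECT e.event_name, COUNT(*) AS tx_count
            |FROM bitcoin_transactions t
            |JOIN historic_events e ON to_date(t.block_time) = e.event_date
            |GROUP BY e.event_name""".stripMargin)

        while (rs.next()) {
          println(s"${rs.getString("event_name")}: ${rs.getLong("tx_count")}")
        }
        rs.close(); stmt.close(); conn.close()
      }
    }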
-
Using Apache Spark to Analyze the Bitcoin Blockchain
The hadoopcryptoledger library now provides a simple example of how you can analyze the Bitcoin Blockchain with Apache Spark. Previously, I described how you can use Hadoop MR or any other Hadoop ecosystem-compatible application to analyze it. Basically, the example leverages the HadoopRDD API to read the Hadoop file format of the hadoopcryptoledger library. Afterwards, you can…
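A minimal sketch of that approach might look as follows; the hadoopcryptoledger package, class and accessor names (BitcoinBlockFileInputFormat, BitcoinBlock, getTransactions()) are written from memory as assumptions and should be verified against the library, while the Spark calls themselves are standard:

    import org.apache.hadoop.io.BytesWritable
    import org.apache.spark.{SparkConf, SparkContext}
    // Assumed hadoopcryptoledger classes -- verify package and names in the library.
    import org.zuinnote.hadoop.bitcoin.format.{BitcoinBlock, BitcoinBlockFileInputFormat}

    object SparkBitcoinTransactionCounter {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("BitcoinTransactionCounter"))

        // Read raw blockchain files from HDFS via the library's Hadoop file format.
        val blocks = sc.hadoopFile(
          "hdfs:///user/bitcoin/input",
          classOf[BitcoinBlockFileInputFormat],
          classOf[BytesWritable],
          classOf[BitcoinBlock])

        // Sum up the number of transactions over all blocks
        // (getTransactions() is assumed to return the block's transaction list).
        val totalTransactions = blocks
          .map { case (_, block) => block.getTransactions().size().toLong }
          .reduce(_ + _)

        println(s"Total number of transactions: $totalTransactions")
        sc.stop()
      }
    }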
-
Analyzing the Bitcoin Blockchain using the Hadoop Ecosystem – A first Approach
Bitcoin and other cryptocurrencies have drawn a lot of attention from companies, public organizations and individuals. While many use cases exist, there is still a long road ahead to make them part of everybody’s life. The recently released first version of the open source hadoopcryptoledger library is a first attempt to make this happen. It…
-
Batch-processing & Interactive Analytics for Big Data – the Role of in-Memory
In this blog post I will discuss various aspects of in-memory technologies and describe how different Big Data technologies fit into this context. In particular, I will focus on the difference between in-memory batch analytics and interactive in-memory analytics. Additionally, I will illustrate when in-memory technology is really beneficial. In-memory technology leverages the fast main memory…
-
Hive Optimizations with Indexes, Bloom-Filters and Statistics
This blog post describes how Storage Indexes, Bitmap Indexes, Compact Indexes, Aggregate Indexes, Covering Indexes/Materialized Views, Bloom-Filters and statistics can increase performance with Apache Hive to enable a real-time data warehouse. Furthermore, I will address how index paradigms change due to big data volumes. Generally, it is recommended to rely less on traditional indexes and to focus instead on storage indexes…
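To make the storage-index and Bloom-filter part concrete, here is a small illustrative snippet (table and column names are made up) that creates an ORC table with the standard ORC index and Bloom-filter table properties and computes column statistics, issued over Hive JDBC:

    import java.sql.DriverManager

    object HiveOrcIndexSketch {
      def main(args: Array[String]): Unit = {
        Class.forName("org.apache.hive.jdbc.HiveDriver")
        val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
        val stmt = conn.createStatement()

        // ORC keeps min/max storage indexes per stripe; Bloom filters are added per column.
        stmt.execute(
          """CREATE TABLE sales_orc (customer_id BIGINT, amount DOUBLE, sale_date DATE)
            |STORED AS ORC
            |TBLPROPERTIES (
            |  'orc.create.index' = 'true',
            |  'orc.bloom.filter.columns' = 'customer_id',
            |  'orc.bloom.filter.fpp' = '0.05')""".stripMargin)

        // Column statistics let the optimizer choose better scan and join strategies.
        stmt.execute("ANALYZE TABLE sales_orc COMPUTE STATISTICS FOR COLUMNS")

        stmt.close(); conn.close()
      }
    }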
-
Big Data Lab in the Cloud with Hadoop+Spark+R+Python
This is an update of the second Big Data Lab for the cloud. Similar to previous versions, this document describes how you can create a Big Data Lab in the cloud on Amazon EMR. Besides some major upgrades to the newest Amazon Hadoop AMI (3.6.0), Spark (1.3.0) and R, it now also includes the possibility…
-
Update: Next Generation Big Data Lab V2 in the Cloud
Recently, I presented the first version of the Big Data Lab in the cloud. I have now extended this version, keeping most of the features of the previous version while upgrading important software components. It still runs on Amazon EMR, but with the newest Amazon AMI (including Amazon Linux). It now features…
-
Example projects for using various NoSQL and Big Data technologies
Recently, I published several example Java projects on github.com for using various NoSQL technologies: cassandra-tutorial (Apache Cassandra, a column-oriented database), mongodb-tutorial (MongoDB, a document database), neo4j-tutorial (Neo4j, a graph database), redis-tutorial (Redis, a key/value store) and solr-tutorial (Apache SolrCloud, a search technology). Other example Java projects aim at standardized big data processing platforms:…
-
The Lambda Architecture for Big Data in your Enterprise
In this blog post, I will present the Lambda architecture for Big Data. This architecture is about integrating historical Big Data with “live” streaming Big Data. Afterwards, the concept of a large data lake within your enterprise, or amongst enterprises in a B2B scenario, is explained. This data lake – based on the lambda architecture…
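As a rough illustration of the batch/speed/serving split (Spark is only one possible technology choice here, and all paths, ports and keys are placeholders), a sketch could look like this:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object LambdaSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("LambdaSketch"))
        val ssc = new StreamingContext(sc, Seconds(60))

        // Batch layer: counts per key, periodically recomputed over all historical data.
        val batchView = sc.textFile("hdfs:///data/events")
          .map(line => (line.split(",")(0), 1L))
          .reduceByKey(_ + _)
          .cache()

        // Speed layer: counts for events of the current micro-batch (a real speed layer
        // would keep incremental state between the batch recomputations).
        val liveEvents = ssc.socketTextStream("localhost", 9999)
        liveEvents.map(line => (line.split(",")(0), 1L))
          .reduceByKey(_ + _)
          .foreachRDD { speedView =>
            // Serving layer: merge the precomputed batch view with the live view at query time.
            batchView.fullOuterJoin(speedView)
              .mapValues { case (b, s) => b.getOrElse(0L) + s.getOrElse(0L) }
              .take(10)
              .foreach(println)
          }

        ssc.start()
        ssc.awaitTermination()
      }
    }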