Of Apache Spark, Hadoop, Vagrant, VirtualBox and IBM Bluemix Services – Part 6 – Apache Spark and/vs Apache Hadoop?

At the beginning of this article series we introduced the core concepts of Hadoop and Spark in a nutshell. Both, Apache Spark and Apache Hadoop are frameworks for efficient processing of large data on computer clusters. The question arises how they differ or relate to each other.

Hereof it seems the opinions are divided. In discussions some people opine that Apache Spark is edge Hadoop away, because Spark is among others for some use cases more efficient and distinctly faster. To get a clearer understanding of this we compare the both frameworks in the following section.

Continue reading

Of Apache Spark, Hadoop, Vagrant, VirtualBox and IBM Bluemix Services – Part 5 – Spark applications in PIA project

The main reason for choosing Spark was a second project which we developed for the course “Programming Intelligent Applications”. For this project we wanted to implement a framework which is able to monitor important events (e.g. terror, natural disasters) on the world through Twitter.

To separate important tweets from others we use Latent Dirichlet Allocation (LDA) which is an algorithm for topic modelling. LDA is able to extract distinct topics through relations between words. An example for such a relation would be that the word “terror” is likely to be used with the word “attack”. This information is enough to generate separate topics for a bunch of documents, in our case tweets. Therefore, the need to annotate training data to learn word distributions for topics is not necessary and LDA is an algorithm which can learn completely unsupervised. For most machine learning applications the performance is getting better with more training data, and this also applies for our application.

Continue reading

Of Apache Spark, Hadoop, Vagrant, VirtualBox and IBM Bluemix Services – Part 4 – Big Data Engineering

Our objective in this project was to build an environment that could be practical. So we set up a virtual Hadoop test cluster with virtual machines. Our production environment was a Hadoop Cluster in the IBM Bluemix cloud which we could use for free with our student accounts. We developed and tested the logic of our Spark applications with a small amount of data local or in the virtual test cluster. After that we run the applications in the cloud with perhaps higher amount of data. The figure below shows our clusters and their capabilities.

Continue reading

Of Apache Spark, Hadoop, Vagrant, VirtualBox and IBM Bluemix Services – Part 3 – What is Apache Spark?

Apache Spark is a framework for fast processing of large data on computer clusters. Spark applications can be written in Scala, Java, Python or R and can be executed in the cloud or on Hadoop (YARN) or Mesos cluster managers. It is also possible to run Spark applications standalone, that means locally on a computer. Possible data sources for Spark applications are e.g. the Hadoop Distributed File System (HDFS), HBase (Hadoop distributed NoSQL Database), Amazon S3 or Apache Cassandra. [1]

Continue reading

Of Apache Spark, Hadoop, Vagrant, VirtualBox and IBM Bluemix Services – Part 2 – Apache Hadoop Ecosystem

In our project we primarily implemented Spark applications, but we used components of Apache Hadoop like the Hadoop distributed file system or the cluster manager Hadoop YARN. For our discussion in the last part of this blog article it is moreover necessary to understand Hadoop MapReduce for comparison to Apache Spark. Because of this we first give a short Overview of the Apache Hadoop Ecosystem and in the next part we introduce Apache Spark and parts of our development.

Continue reading