Of Apache Spark, Hadoop, Vagrant, VirtualBox and IBM Bluemix Services – Part 1 – Introduction

As part of the lecture “System Engineering and Management” in the winter semester 2016/17, we ran a project with Apache Spark and the Apache Hadoop Ecosystem. In this article series, we first introduce Apache Spark and the Apache Hadoop Ecosystem. We then give an overview of our development environment, for which we set up a virtual Hadoop cluster with Vagrant and VirtualBox. For our production environment, we used an IBM Bluemix service called “BigInsights”. Finally, we discuss current issues related to Apache Spark and Hadoop and how the two technologies relate to each other.

The primary reason for considering these technologies was another project, in the lecture “Programming Intelligent Applications”, which we abbreviate as PIA below. In that project, we developed an application that aims to detect and localize emerging sources of danger around the world with the aid of tweets.

Hence, our project objectives can be split into the following three parts:

  • Development:
    • learn the concepts of Apache Spark and develop a bunch of small applications
    • develop applications which support our project in PIA
  • Big Data Engineering:
    • set up and install a virtual Hadoop cluster on virtual machines
    • manage a Hadoop cluster on IBM Bluemix
  • Research topics:
    • investigate and discuss issues related to Apache Spark and Hadoop

In the next parts of this series, we start with the development part and give a brief introduction to the core concepts of the Hadoop Ecosystem and Apache Spark.

Part 2 – Apache Hadoop Ecosystem