At the beginning of this article series we introduced the core concepts of Hadoop and Spark in a nutshell. Both Apache Spark and Apache Hadoop are frameworks for efficiently processing large amounts of data on computer clusters. This raises the question of how they differ from or relate to each other.
Opinions on this seem to be divided. In discussions, some people argue that Apache Spark is edging Hadoop out, because Spark is, among other things, more efficient and markedly faster for some use cases. To get a clearer understanding, we compare the two frameworks in the following section.
The main reason for choosing Spark was a second project that we developed for the course “Programming Intelligent Applications”. For this project we wanted to implement a framework that can monitor important events (e.g. terror, natural disasters) around the world through Twitter.
To separate important tweets from the rest we use Latent Dirichlet Allocation (LDA), an algorithm for topic modelling. LDA extracts distinct topics from the relations between words. An example of such a relation is that the word “terror” is likely to be used together with the word “attack”. This information is enough to generate separate topics for a collection of documents, in our case tweets. There is therefore no need to annotate training data in order to learn the word distributions of topics: LDA learns completely unsupervised. As with most machine learning applications, performance improves with more training data, and this also applies to our application.
Our objective in this project was to build an environment that could be used in practice. So we set up a virtual Hadoop test cluster with virtual machines. Our production environment was a Hadoop cluster in the IBM Bluemix cloud, which we could use for free with our student accounts. We developed and tested the logic of our Spark applications with a small amount of data, either locally or in the virtual test cluster. After that we ran the applications in the cloud, possibly with larger amounts of data. The figure below shows our clusters and their capabilities.
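The kind of word relation described above can be illustrated with a simple co-occurrence count. The following is a minimal sketch in plain Python; it is not LDA itself and not the code used in the project, and the sample tweets and the function name `cooccurrence_counts` are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count how often each unordered pair of words shares a document."""
    counts = Counter()
    for text in documents:
        # Use a sorted set so each pair is counted once per document
        # and always appears in the same (alphabetical) order.
        words = sorted(set(text.lower().split()))
        counts.update(combinations(words, 2))
    return counts


# Hypothetical sample tweets, made up for this illustration.
tweets = [
    "terror attack in the city",
    "police respond to terror attack",
    "earthquake shakes the city",
]

pairs = cooccurrence_counts(tweets)
print(pairs[("attack", "terror")])      # 2
print(pairs[("earthquake", "terror")])  # 0
```

Words that frequently appear together, such as “attack” and “terror” here, are the signal a topic model can use to group them into one topic without any labels.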
This is the last part in our series of blog posts concerning the development of an Alexa Skill. If you missed the previous parts you can catch up by reading part 1 here, part 2 here and part 3 here.
Every student group that has worked on a software project can retell the following situation: you’re one week away from the deadline, and every team member has spent the last weeks working on their part of the project. So far, everything looks great: every module works on its own, the GUI is designed and implemented, the database is modeled and set up, client and server are both running smoothly. All that’s left is combining all the bits and pieces to see everything in action together. Easy, right? Fast forward another five days, it’s the weekend before the final presentation: the air is thick with panic, with everyone furiously debugging their code and solving merge conflicts left and right while trying to get the project to some kind of working state that will at least survive the demo. Things that were already working in isolation are now broken, and quite a few features that were an inch away from completion will never make it into the presentation. So what has gone wrong? And what have we done to prevent the same from happening with our Alexa Skill?
Apache Spark is a framework for fast processing of large amounts of data on computer clusters. Spark applications can be written in Scala, Java, Python or R and can be executed in the cloud or on Hadoop (YARN) or Mesos cluster managers. It is also possible to run Spark applications without an external cluster manager, for example locally on a single computer. Possible data sources for Spark applications include the Hadoop Distributed File System (HDFS), HBase (Hadoop's distributed NoSQL database), Amazon S3 and Apache Cassandra.
In our project we primarily implemented Spark applications, but we also used components of Apache Hadoop such as the Hadoop Distributed File System (HDFS) and the cluster manager Hadoop YARN. For the discussion in the last part of this blog article it is also necessary to understand Hadoop MapReduce, so that it can be compared to Apache Spark. We therefore first give a short overview of the Apache Hadoop ecosystem, and in the next part we introduce Apache Spark and parts of our development.
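As a rough orientation, the MapReduce programming model can be sketched in a few lines of plain Python. This is only a single-process illustration of the map, shuffle and reduce phases using the classic word count example; it does not use Hadoop's API, and the function names and sample input are assumptions made for this sketch.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.lower().split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


# Hypothetical input; on a cluster these lines would come from HDFS.
lines = ["spark and hadoop", "hadoop mapreduce", "spark on hadoop"]

mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle_phase(mapped)
word_counts = dict(reduce_phase(k, v) for k, v in grouped.items())
print(word_counts["hadoop"])  # 3
print(word_counts["spark"])   # 2
```

On a real cluster the map and reduce steps run in parallel on many nodes, and the framework takes care of the shuffle between them.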
As part of the lecture “System Engineering and Management” in the winter semester 2016/17, we ran a project with Apache Spark and the Apache Hadoop ecosystem.
Test-driven Development of an Alexa Skill with Node.js
This is the third part in a series of blog posts in which we describe the process of developing an Amazon Alexa Skill while focusing on new technologies like serverless computing and enforcing clean code conventions. For our project we decided to use continuous integration and delivery. For that to work as it should, and to keep avoidable bugs from reaching the user, we relied on test-driven development for our code.
In this blog entry we take a look at Travis CI, Jenkins, GitLab CI and Buildbot and evaluate their benefits and downsides when building a content-heavy project (e.g. a game) with them.
Welcome to the final part of our microservices series. If you’ve missed a previous post you can read it here:
IV) Continuous Integration
V) Lessons Learned
Respect for Stumbling Blocks
Hopefully you have enjoyed our blog posts and have learned a lot. We answered the following questions in our last four posts:
- How to build a microservices architecture?
- How to use the advantages of caching with microservices?
- How to secure microservices and handle authentication between them?
- How to set up a seamless Continuous Integration workflow for microservices combining Jenkins, Git and Docker?