Related articles: ►Take Me Home – Project Overview ►CI/CD infrastructure: Choosing and setting up a server with Jenkins as Docker image ►Android SDK and emulator in Docker for testing ►Automated Unit- and GUI-Testing for Android in Jenkins
Setting up the testing environment and workflow
- Jenkins CI Docker Container
- MongoDB Docker Container
- Production-MongoDB on mlab.com
- NodeJS Web-Application, hosted on heroku.com
Regarding our database tests we wanted to achieve two things: first of all, we wanted to test any functions of our web application using mongoose which change persistent data. Secondly, we needed database tests to test if eventual model-changes are compatible with our data in the production database.
In relational database management systems one defines constraints in the database, which are tested by the database tests. Since we are using MongoDB though, we don’t really have any constraint in the database. Instead we can define any needed constraints with mongoose right in the application. Therefore, cloning the production database for testing is only required when migrating data (or when having constraints defined) but non-essential for the consistency of the as-is backend.
So alternatively when testing a MongoDB we could also use a shell-script or a “before”-function, that gets called once before starting the tests, to define our database testing environment, like creating the same collections we have in the production database and put in test data for our test cases.
You might have already heard of IBM’s artificial intelligence “Watson”, which beat two former champions of the american television game show “Jeopardy!” back in 2011. What you probably don’t know is that today lots of predefined Watson services are publicy available on IBM’s cloud platform “Bluemix”. These services cover different aspects of AI-backed applications like Visual Recognition, Language Translation or Text to Speech. This post glances on Natural Language Understanding, which is about analyzing text by extracting different kinds of information, and shows how this can be achieved by using Bluemix services.
At the beginning of this article series we introduced the core concepts of Hadoop and Spark in a nutshell. Both, Apache Spark and Apache Hadoop are frameworks for efficient processing of large data on computer clusters. The question arises how they differ or relate to each other.
Hereof it seems the opinions are divided. In discussions some people opine that Apache Spark is edge Hadoop away, because Spark is among others for some use cases more efficient and distinctly faster. To get a clearer understanding of this we compare the both frameworks in the following section.
The main reason for choosing Spark was a second project which we developed for the course “Programming Intelligent Applications”. For this project we wanted to implement a framework which is able to monitor important events (e.g. terror, natural disasters) on the world through Twitter.
To separate important tweets from others we use Latent Dirichlet Allocation (LDA) which is an algorithm for topic modelling. LDA is able to extract distinct topics through relations between words. An example for such a relation would be that the word “terror” is likely to be used with the word “attack”. This information is enough to generate separate topics for a bunch of documents, in our case tweets. Therefore, the need to annotate training data to learn word distributions for topics is not necessary and LDA is an algorithm which can learn completely unsupervised. For most machine learning applications the performance is getting better with more training data, and this also applies for our application.
Our objective in this project was to build an environment that could be practical. So we set up a virtual Hadoop test cluster with virtual machines. Our production environment was a Hadoop Cluster in the IBM Bluemix cloud which we could use for free with our student accounts. We developed and tested the logic of our Spark applications with a small amount of data local or in the virtual test cluster. After that we run the applications in the cloud with perhaps higher amount of data. The figure below shows our clusters and their capabilities.
This is the last part in our series of blog posts concerning the development of an Alexa Skill. If you missed the previous parts you can catch up by reading part 1 here, part 2 here and part 3 here.
Every student group that has worked on a software project can retell the following situation: you’re one week ahead of the deadline, every team member has spent the last weeks working on their part of the project. So far, everything looks great – every module works on its own, the GUI is designed and implemented, the database is modeled and set up, client and server are both running smoothly. All that’s left is combining all the bits and pieces to see everything in action together. Easy, right? Fast forward another five days, it’s the weekend before the final presentation: the air is thick with panic with everyone furiously debugging their code, solving merge conflicts left and right while trying to get the project to some kind of working state that will at least survive the demo. Things that were already working in isolation are now broken and quite a bunch of features that were an inch close to completion will never make it into the presentation. So what has gone wrong? And what have we done to prevent the same from happening with our Alexa Skill?
Apache Spark is a framework for fast processing of large data on computer clusters. Spark applications can be written in Scala, Java, Python or R and can be executed in the cloud or on Hadoop (YARN) or Mesos cluster managers. It is also possible to run Spark applications standalone, that means locally on a computer. Possible data sources for Spark applications are e.g. the Hadoop Distributed File System (HDFS), HBase (Hadoop distributed NoSQL Database), Amazon S3 or Apache Cassandra. 
In our project we primarily implemented Spark applications, but we used components of Apache Hadoop like the Hadoop distributed file system or the cluster manager Hadoop YARN. For our discussion in the last part of this blog article it is moreover necessary to understand Hadoop MapReduce for comparison to Apache Spark. Because of this we first give a short Overview of the Apache Hadoop Ecosystem and in the next part we introduce Apache Spark and parts of our development.
As part of the lecture “System Engineering and Management” in the winter semester 2016/17, we run a project with Apache Spark and the Apache Hadoop Ecosystem. Continue reading