{"id":2157,"date":"2017-03-09T11:39:56","date_gmt":"2017-03-09T10:39:56","guid":{"rendered":"https:\/\/blog.mi.hdm-stuttgart.de\/?p=2157"},"modified":"2023-06-07T15:27:54","modified_gmt":"2023-06-07T13:27:54","slug":"of-apache-spark-hadoop-vagrant-virtualbox-and-ibm-bluemix-services-part-4-big-data-engineering","status":"publish","type":"post","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2017\/03\/09\/of-apache-spark-hadoop-vagrant-virtualbox-and-ibm-bluemix-services-part-4-big-data-engineering\/","title":{"rendered":"Of Apache Spark, Hadoop, Vagrant, VirtualBox and IBM Bluemix Services &#8211; Part 4 &#8211; Big Data Engineering"},"content":{"rendered":"<p>Our objective in this project was to build an environment that could be practical. So we set up a virtual Hadoop test cluster with virtual machines. Our production environment was a Hadoop Cluster in the IBM Bluemix cloud which we could use for free with our student accounts. We developed and tested the logic of our Spark applications with a small amount of data local or in the virtual test cluster. After that we run the applications in the cloud with perhaps higher amount of data. The figure below shows our clusters and their capabilities.<\/p>\n<p><!--more--><\/p>\n<p><a href=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2017\/03\/dev-env-spark.png\"><img loading=\"lazy\" decoding=\"async\" data-attachment-id=\"2180\" data-permalink=\"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2017\/03\/09\/of-apache-spark-hadoop-vagrant-virtualbox-and-ibm-bluemix-services-part-4-big-data-engineering\/dev-env-spark\/\" data-orig-file=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2017\/03\/dev-env-spark.png\" data-orig-size=\"1116,744\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"dev-env-spark\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2017\/03\/dev-env-spark-1024x683.png\" class=\"alignnone size-medium_large wp-image-2180\" src=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2017\/03\/dev-env-spark-768x512.png\" alt=\"\" width=\"656\" height=\"437\" srcset=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2017\/03\/dev-env-spark-768x512.png 768w, https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2017\/03\/dev-env-spark-300x200.png 300w, https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2017\/03\/dev-env-spark-1024x683.png 1024w, https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2017\/03\/dev-env-spark.png 1116w\" sizes=\"auto, (max-width: 656px) 100vw, 656px\" \/><\/a><\/p>\n<h3>Virtual Hadoop Cluster<\/h3>\n<p>For setting up the virtual test cluster with virtual machines (VirtualBox) and Vagrant we oriented towards this tutorial: <a href=\"https:\/\/cwiki.apache.org\/confluence\/display\/AMBARI\/Quick+Start+Guide\">https:\/\/cwiki.apache.org\/confluence\/display\/AMBARI\/Quick+Start+Guide<\/a> .<\/p>\n<p>Vagrant is an application for managing virtual machines. It is a wrapper between virtualization software (e.g. VirtualBox) and software configuration management software like puppet. [10]<\/p>\n<p>We used Vagrant to define in a central place (Vagrant File) the properties like amount of cpus and memory of the virtual machines. Furthermore we configured port forwarding to SSH and port 8080. So, we was able to connect to the virtual machines via SSH from our host operating system. We set up the cluster with Apache Ambari which is a tool for managing and installing Hadoop clusters. Ambari provides a Web UI which we could access from our host operating system via port 8080. Through this Web UI we installed the virtual Hadoop cluster with all the needed components and projects like HDFS, MapReduce, YARN, Apache Spark etc. The approach described above sound really easy, but most of the time it was really annoying and it took as a long time until the cluster run stable with the virtual machines. Many projects had to be installed and minimal network problems caused the failure of the whole installation. Furthermore, there was memory and disk space issues. Managing and configuring a Hadoop cluster isn&#8217;t that easy. The virtual cluster consisted of three virtual machines.<\/p>\n<p>At the beginning, we tried to install the Hadoop projects manually. There we realized that between Hadoop related projects and Spark are many dependency issues. Even between Spark components. Because of this a tool like Apache Ambari is very useful, because it considers this dependency issues. The version of Apache Ambari we used was one of the newest and it installed Apache Spark in Version 1.6.2. But the newest version of Apache Spark was Version 2.1.0. This is certainly referable to this dependency issues.<\/p>\n<h3>IBM Bluemix Big Insights<\/h3>\n<p>From the IBM Bluemix cloud we choose a service called BigInsights. This service offered us a pre-installed hadoop cluster with all the projects we needed e.g. HDFS, Apache Spark. We also could manage the cluster with a Apache Ambari Web UI and connect to the cluster via SSH like in our test environment. The version of Ambari on Bluemix was the same as the version of our test environment. With our student accounts we could use a cluster consisting of four powerful computers with 24-48 GB RAM, 12 VCPUs and nearly 1 TB disk space for free.<\/p>\n<p><a href=\"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2017\/03\/09\/of-apache-spark-hadoop-vagrant-virtualbox-and-ibm-bluemix-services-part-5-spark-applications-in-pia-project\/\">Part 5 &#8211; Spark Applications in PIA project<\/a><\/p>\n<h5>References<\/h5>\n<h6>10 <a href=\"https:\/\/de.wikipedia.org\/wiki\/Vagrant_(Software)\">https:\/\/de.wikipedia.org\/wiki\/Vagrant_(Software)<\/a><\/h6>\n","protected":false},"excerpt":{"rendered":"<p>Our objective in this project was to build an environment that could be practical. So we set up a virtual Hadoop test cluster with virtual machines. Our production environment was a Hadoop Cluster in the IBM Bluemix cloud which we could use for free with our student accounts. We developed and tested the logic of [&hellip;]<\/p>\n","protected":false},"author":49,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[22,651,2],"tags":[],"ppma_author":[721],"class_list":["post-2157","post","type-post","status-publish","format-standard","hentry","category-student-projects","category-system-designs","category-system-engineering"],"aioseo_notices":[],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":2143,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2017\/03\/08\/of-apache-spark-hadoop-vagrant-virtualbox-and-ibm-bluemix-services\/","url_meta":{"origin":2157,"position":0},"title":"Of Apache Spark, Hadoop, Vagrant, VirtualBox and IBM Bluemix Services &#8211; Part 1 &#8211; Introduction","author":"bh051, cz022, ds168","date":"8. March 2017","format":false,"excerpt":"As part of the lecture \u201cSystem Engineering and Management\u201d in the winter semester 2016\/17, we run a project with Apache Spark and the Apache Hadoop Ecosystem. In this article series firstly we want to introduce Apache Spark and the Apache Hadoop Ecosystem. Furthermore we want to give an overview of\u2026","rel":"","context":"In &quot;Student Projects&quot;","block_context":{"text":"Student Projects","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/student-projects\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":2151,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2017\/03\/08\/of-apache-spark-hadoop-vagrant-virtualbox-and-ibm-bluemix-services-part-2-apache-hadoop-ecosystem\/","url_meta":{"origin":2157,"position":1},"title":"Of Apache Spark, Hadoop, Vagrant, VirtualBox and IBM Bluemix Services &#8211; Part 2 &#8211; Apache Hadoop Ecosystem","author":"bh051, cz022, ds168","date":"8. March 2017","format":false,"excerpt":"In our project we primarily implemented Spark applications, but we used components of Apache Hadoop like the Hadoop distributed file system or the cluster manager Hadoop YARN. For our discussion in the last part of this blog article it is moreover necessary to understand Hadoop MapReduce for comparison to Apache\u2026","rel":"","context":"In &quot;Student Projects&quot;","block_context":{"text":"Student Projects","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/student-projects\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":2165,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2017\/03\/09\/of-apache-spark-hadoop-vagrant-virtualbox-and-ibm-bluemix-services-part-6-apache-spark-andvs-apache-hadoop\/","url_meta":{"origin":2157,"position":2},"title":"Of Apache Spark, Hadoop, Vagrant, VirtualBox and IBM Bluemix Services &#8211; Part 6 &#8211; Apache Spark and\/vs Apache Hadoop?","author":"bh051, cz022, ds168","date":"9. March 2017","format":false,"excerpt":"At the beginning of this article series we introduced the core concepts of Hadoop and Spark in a nutshell. Both, Apache Spark and Apache Hadoop are frameworks for efficient processing of large data on computer clusters. The question arises how they differ or relate to each other. Hereof it seems\u2026","rel":"","context":"In &quot;Student Projects&quot;","block_context":{"text":"Student Projects","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/student-projects\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":2153,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2017\/03\/08\/of-apache-spark-hadoop-vagrant-virtualbox-and-ibm-bluemix-services-part-3-what-is-apache-spark\/","url_meta":{"origin":2157,"position":3},"title":"Of Apache Spark, Hadoop, Vagrant, VirtualBox and IBM Bluemix Services &#8211; Part 3 &#8211; What is Apache Spark?","author":"bh051, cz022, ds168","date":"8. March 2017","format":false,"excerpt":"Apache Spark is a framework for fast processing of large data on computer clusters. Spark applications can be written in Scala, Java, Python or R and can be executed in the cloud or on Hadoop (YARN) or Mesos cluster managers. It is also possible to run Spark applications standalone, that\u2026","rel":"","context":"In &quot;Student Projects&quot;","block_context":{"text":"Student Projects","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/student-projects\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2017\/03\/spark-overview-768x195.png?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2017\/03\/spark-overview-768x195.png?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2017\/03\/spark-overview-768x195.png?resize=525%2C300&ssl=1 1.5x"},"classes":[]},{"id":2161,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2017\/03\/09\/of-apache-spark-hadoop-vagrant-virtualbox-and-ibm-bluemix-services-part-5-spark-applications-in-pia-project\/","url_meta":{"origin":2157,"position":4},"title":"Of Apache Spark, Hadoop, Vagrant, VirtualBox and IBM Bluemix Services &#8211; Part 5 &#8211; Spark applications in PIA project","author":"bh051, cz022, ds168","date":"9. March 2017","format":false,"excerpt":"The main reason for choosing Spark was a second project which we developed for the course \u201cProgramming Intelligent Applications\u201d. For this project we wanted to implement a framework which is able to monitor important events (e.g. terror, natural disasters) on the world through Twitter. To separate important tweets from others\u2026","rel":"","context":"In &quot;Student Projects&quot;","block_context":{"text":"Student Projects","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/student-projects\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":926,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2016\/07\/26\/socialcloud-configure-all-the-things-part-6\/","url_meta":{"origin":2157,"position":5},"title":"SocialCloud &#8211; Configure all the things! &#8211; Part 6","author":"ew033","date":"26. July 2016","format":false,"excerpt":"One of the requirements of the system is that organizations should be able to set up and deploy the HumHub system on their own. For this purpose, we have designed the Configtool. To meet this requirement we have to use different tools, procedures and interfaces from Bluemix. In the following\u2026","rel":"","context":"In &quot;Allgemein&quot;","block_context":{"text":"Allgemein","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/allgemein\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2016\/07\/socialCloud.jpg?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2016\/07\/socialCloud.jpg?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2016\/07\/socialCloud.jpg?resize=525%2C300&ssl=1 1.5x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2016\/07\/socialCloud.jpg?resize=700%2C400&ssl=1 2x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2016\/07\/socialCloud.jpg?resize=1050%2C600&ssl=1 3x"},"classes":[]}],"jetpack_sharing_enabled":true,"authors":[{"term_id":721,"user_id":49,"is_guest":0,"slug":"bh051","display_name":"bh051, cz022, ds168","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/6e0cfeb23e37b530d4d35d4e46d3e6f39969124f52f6474b4cf0f23b6ff524ac?s=96&d=mm&r=g","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":""}],"_links":{"self":[{"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/posts\/2157","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/users\/49"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/comments?post=2157"}],"version-history":[{"count":13,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/posts\/2157\/revisions"}],"predecessor-version":[{"id":2245,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/posts\/2157\/revisions\/2245"}],"wp:attachment":[{"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/media?parent=2157"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/categories?post=2157"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/tags?post=2157"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/ppma_author?post=2157"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}