{"id":2143,"date":"2017-03-08T18:29:50","date_gmt":"2017-03-08T17:29:50","guid":{"rendered":"https:\/\/blog.mi.hdm-stuttgart.de\/?p=2143"},"modified":"2023-06-07T15:27:15","modified_gmt":"2023-06-07T13:27:15","slug":"of-apache-spark-hadoop-vagrant-virtualbox-and-ibm-bluemix-services","status":"publish","type":"post","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2017\/03\/08\/of-apache-spark-hadoop-vagrant-virtualbox-and-ibm-bluemix-services\/","title":{"rendered":"Of Apache Spark, Hadoop, Vagrant, VirtualBox and IBM Bluemix Services &#8211; Part 1 &#8211; Introduction"},"content":{"rendered":"<p>As part of the lecture \u201cSystem Engineering and Management\u201d in the winter semester 2016\/17, we ran a project with Apache Spark and the Apache Hadoop Ecosystem. <!--more-->In this article series, we first introduce Apache Spark and the Apache Hadoop Ecosystem. We then give an overview of our development environment, for which we set up a virtual Hadoop cluster with Vagrant and VirtualBox. For our production environment we used an IBM Bluemix service called \u201cBigInsights\u201d. Finally, we discuss current issues related to Apache Spark and Hadoop and how the two relate to each other.<\/p>\n<p>The primary reason for considering these technologies was another project in the lecture \u201cProgramming Intelligent Applications\u201d. In the following, we abbreviate this lecture as PIA. 
In that project we developed an application that uses tweets to detect and localize emerging sources of danger around the world.<\/p>\n<p>Hence, our project objectives can be split into the following three parts:<\/p>\n<ul>\n<li><b>Development:<\/b>\n<ul>\n<li>learn the concepts of Apache Spark and develop a number of small applications<\/li>\n<li>develop applications which support our project in PIA<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<ul>\n<li><b>Big Data Engineering:<\/b>\n<ul>\n<li>set up and install a virtual Hadoop cluster on virtual machines<\/li>\n<li>manage a Hadoop cluster on IBM Bluemix<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<ul>\n<li><b>Research topics:<\/b>\n<ul>\n<li>investigate and discuss issues related to Apache Spark and Hadoop<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>In the next parts of this series we start with the development part and give a brief introduction to the core concepts of the Hadoop Ecosystem and Apache Spark.<\/p>\n<p><a href=\"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2017\/03\/08\/of-apache-spark-hadoop-vagrant-virtualbox-and-ibm-bluemix-services-part-2-apache-hadoop-ecosystem\/\">Part 2 &#8211; Apache Hadoop Ecosystem<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>As part of the lecture \u201cSystem Engineering and Management\u201d in the winter semester 2016\/17, we ran a project with Apache Spark and the Apache Hadoop 
Ecosystem.<\/p>\n","protected":false},"author":49,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[22,651,2],"tags":[],"ppma_author":[721],"class_list":["post-2143","post","type-post","status-publish","format-standard","hentry","category-student-projects","category-system-designs","category-system-engineering"],"aioseo_notices":[],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":2157,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2017\/03\/09\/of-apache-spark-hadoop-vagrant-virtualbox-and-ibm-bluemix-services-part-4-big-data-engineering\/","url_meta":{"origin":2143,"position":0},"title":"Of Apache Spark, Hadoop, Vagrant, VirtualBox and IBM Bluemix Services &#8211; Part 4 &#8211; Big Data Engineering","author":"bh051, cz022, ds168","date":"9. March 2017","format":false,"excerpt":"Our objective in this project was to build an environment that could be practical. So we set up a virtual Hadoop test cluster with virtual machines. Our production environment was a Hadoop Cluster in the IBM Bluemix cloud which we could use for free with our student accounts. 
We developed\u2026","rel":"","context":"In &quot;Student Projects&quot;","block_context":{"text":"Student Projects","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/student-projects\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2017\/03\/dev-env-spark-768x512.png?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2017\/03\/dev-env-spark-768x512.png?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2017\/03\/dev-env-spark-768x512.png?resize=525%2C300&ssl=1 1.5x"},"classes":[]},{"id":2151,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2017\/03\/08\/of-apache-spark-hadoop-vagrant-virtualbox-and-ibm-bluemix-services-part-2-apache-hadoop-ecosystem\/","url_meta":{"origin":2143,"position":1},"title":"Of Apache Spark, Hadoop, Vagrant, VirtualBox and IBM Bluemix Services &#8211; Part 2 &#8211; Apache Hadoop Ecosystem","author":"bh051, cz022, ds168","date":"8. March 2017","format":false,"excerpt":"In our project we primarily implemented Spark applications, but we used components of Apache Hadoop like the Hadoop distributed file system or the cluster manager Hadoop YARN. 
For our discussion in the last part of this blog article it is moreover necessary to understand Hadoop MapReduce for comparison to Apache\u2026","rel":"","context":"In &quot;Student Projects&quot;","block_context":{"text":"Student Projects","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/student-projects\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":2165,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2017\/03\/09\/of-apache-spark-hadoop-vagrant-virtualbox-and-ibm-bluemix-services-part-6-apache-spark-andvs-apache-hadoop\/","url_meta":{"origin":2143,"position":2},"title":"Of Apache Spark, Hadoop, Vagrant, VirtualBox and IBM Bluemix Services &#8211; Part 6 &#8211; Apache Spark and\/vs Apache Hadoop?","author":"bh051, cz022, ds168","date":"9. March 2017","format":false,"excerpt":"At the beginning of this article series we introduced the core concepts of Hadoop and Spark in a nutshell. Both, Apache Spark and Apache Hadoop are frameworks for efficient processing of large data on computer clusters. The question arises how they differ or relate to each other. Hereof it seems\u2026","rel":"","context":"In &quot;Student Projects&quot;","block_context":{"text":"Student Projects","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/student-projects\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":2153,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2017\/03\/08\/of-apache-spark-hadoop-vagrant-virtualbox-and-ibm-bluemix-services-part-3-what-is-apache-spark\/","url_meta":{"origin":2143,"position":3},"title":"Of Apache Spark, Hadoop, Vagrant, VirtualBox and IBM Bluemix Services &#8211; Part 3 &#8211; What is Apache Spark?","author":"bh051, cz022, ds168","date":"8. March 2017","format":false,"excerpt":"Apache Spark is a framework for fast processing of large data on computer clusters. 
Spark applications can be written in Scala, Java, Python or R and can be executed in the cloud or on Hadoop (YARN) or Mesos cluster managers. It is also possible to run Spark applications standalone, that\u2026","rel":"","context":"In &quot;Student Projects&quot;","block_context":{"text":"Student Projects","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/student-projects\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2017\/03\/spark-overview-768x195.png?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2017\/03\/spark-overview-768x195.png?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2017\/03\/spark-overview-768x195.png?resize=525%2C300&ssl=1 1.5x"},"classes":[]},{"id":2161,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2017\/03\/09\/of-apache-spark-hadoop-vagrant-virtualbox-and-ibm-bluemix-services-part-5-spark-applications-in-pia-project\/","url_meta":{"origin":2143,"position":4},"title":"Of Apache Spark, Hadoop, Vagrant, VirtualBox and IBM Bluemix Services &#8211; Part 5 &#8211; Spark applications in PIA project","author":"bh051, cz022, ds168","date":"9. March 2017","format":false,"excerpt":"The main reason for choosing Spark was a second project which we developed for the course \u201cProgramming Intelligent Applications\u201d. For this project we wanted to implement a framework which is able to monitor important events (e.g. terror, natural disasters) on the world through Twitter. 
To separate important tweets from others\u2026","rel":"","context":"In &quot;Student Projects&quot;","block_context":{"text":"Student Projects","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/student-projects\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":10318,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2020\/04\/13\/open-source-batch-and-stream-processing-realtime-analysis-of-big-data\/","url_meta":{"origin":2143,"position":5},"title":"Open Source Batch and Stream Processing: Realtime Analysis of Big Data","author":"Marcel Stolin","date":"13. April 2020","format":false,"excerpt":"Abstract Since the beginning of Big Data, batch processing was the most popular choice for processing large amounts of generated data. These existing processing technologies are not suitable to process the large amount of data we face today. Research works developed a variety of technologies that focus on stream processing.\u2026","rel":"","context":"In &quot;Allgemein&quot;","block_context":{"text":"Allgemein","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/allgemein\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/08\/mapreduce.jpg?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/08\/mapreduce.jpg?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/08\/mapreduce.jpg?resize=525%2C300&ssl=1 1.5x"},"classes":[]}],"jetpack_sharing_enabled":true,"authors":[{"term_id":721,"user_id":49,"is_guest":0,"slug":"bh051","display_name":"bh051, cz022, 
ds168","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/6e0cfeb23e37b530d4d35d4e46d3e6f39969124f52f6474b4cf0f23b6ff524ac?s=96&d=mm&r=g","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":""}],"_links":{"self":[{"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/posts\/2143","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/users\/49"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/comments?post=2143"}],"version-history":[{"count":11,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/posts\/2143\/revisions"}],"predecessor-version":[{"id":2213,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/posts\/2143\/revisions\/2213"}],"wp:attachment":[{"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/media?parent=2143"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/categories?post=2143"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/tags?post=2143"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/ppma_author?post=2143"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}