{"id":2151,"date":"2017-03-08T18:46:21","date_gmt":"2017-03-08T17:46:21","guid":{"rendered":"https:\/\/blog.mi.hdm-stuttgart.de\/?p=2151"},"modified":"2023-08-06T21:53:34","modified_gmt":"2023-08-06T19:53:34","slug":"of-apache-spark-hadoop-vagrant-virtualbox-and-ibm-bluemix-services-part-2-apache-hadoop-ecosystem","status":"publish","type":"post","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2017\/03\/08\/of-apache-spark-hadoop-vagrant-virtualbox-and-ibm-bluemix-services-part-2-apache-hadoop-ecosystem\/","title":{"rendered":"Of Apache Spark, Hadoop, Vagrant, VirtualBox and IBM Bluemix Services &#8211; Part 2 &#8211; Apache Hadoop Ecosystem"},"content":{"rendered":"<p>In our project we primarily implemented Spark applications, but we also used components of Apache Hadoop such as the Hadoop distributed file system and the cluster manager Hadoop YARN. For the discussion in the last part of this blog article series it is moreover necessary to understand Hadoop MapReduce for a comparison with Apache Spark. Therefore, we first give a short overview of the Apache Hadoop ecosystem, and in the next part we introduce Apache Spark and parts of our development.<\/p>\n<p><!--more--><\/p>\n<h3>Overview of the Apache Hadoop Ecosystem<\/h3>\n<p>Apache Hadoop is a framework for the distributed processing of large data sets across large computer clusters consisting of thousands of machines. 
The Hadoop framework consists of the following three core modules:<\/p>\n<ul>\n<li>Hadoop Distributed File System (HDFS)<\/li>\n<li>Hadoop YARN<\/li>\n<li>Hadoop MapReduce<\/li>\n<\/ul>\n<p>In the following, we introduce these modules in a nutshell.<\/p>\n<h4>Hadoop Distributed File System (HDFS)<\/h4>\n<p>The Hadoop distributed file system is based on the Google File System (GFS) and is equivalent to it in its most important properties [4].<\/p>\n<p style=\"text-align: left;\">The Google File System consists of one master node (called name node in HDFS) and any number of chunk servers (called data nodes in HDFS). The master node manages the metadata of the file system, such as the directory structure and the locations of files [5]. The actual files are stored on the chunk servers [5] (data nodes). The figure below shows how the Google File System works. An app or client asks the master node for a file. The master then returns the address of a chunk server from which the file can be requested. Afterwards, the app requests the file directly from the appropriate chunk server.<br \/>\n<img decoding=\"async\" src=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/08\/1024px-GoogleFileSystemGFS.svg.png\"><br \/>\nImage: <a href=\"https:\/\/en.wikipedia.org\/wiki\/File:GoogleFileSystemGFS.svg\">https:\/\/en.wikipedia.org\/wiki\/File:GoogleFileSystemGFS.svg<\/a><\/p>\n<p>The figure also illustrates that files are replicated redundantly across multiple chunk servers [5]. In case a machine fails, it is thus guaranteed that the file is still available on another server.<\/p>\n<h4>Hadoop YARN (Yet Another Resource Negotiator)<\/h4>\n<p>YARN is the cluster manager of the Hadoop framework and handles resource and job management [1]. 
It also enables other (external) projects like Apache Spark to use the Hadoop distributed file system [2].<\/p>\n<h4>Hadoop MapReduce<\/h4>\n<p>In this section we describe the Google MapReduce algorithm in a nutshell. As the name of the algorithm suggests, there are two tasks: the map task and the reduce task. A programmer of a MapReduce job implements a map and a reduce task.<\/p>\n<p>First, the input data is read, usually from HDFS, and split into partitions. Then multiple map processes are started across the computer cluster. Each map task processes one partition of the input data. All partitions are processed in parallel [3].<\/p>\n<p>The results of the map tasks are key-value pairs [3], which are usually written back to HDFS as intermediate data. These key-value pairs are then sorted across the computers in the cluster, so that all pairs with the same key are located in the same partition [3]. Now the reduce phase begins, in which multiple reduce processes are started across the cluster [3]. Again, each reduce task processes one partition of the intermediate data [3]. The results of the reduce tasks are the final outcome of the MapReduce job.<\/p>\n<h4>Further Hadoop-related projects<\/h4>\n<p>Besides these core modules, there are several projects related to Apache Hadoop. One example is HBase, an open-source implementation of Google's Bigtable [6]. HBase is a scalable, distributed NoSQL database [6].<\/p>\n<p>Another project is Apache Hive, which was developed by Facebook. Hive adds data warehouse functionality to Hadoop applications. With Hive it is possible to use SQL statements to query a database that is optimized for analytical workloads and aggregated from usually heterogeneous data sources [8].<\/p>\n<p>In our projects we used SparkSQL, which is &#8211; carefully expressed &#8211; somewhat similar to Hive. It also allows the aggregation of structured data from possibly heterogeneous data sources. 
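The map, shuffle and reduce phases of MapReduce described above can be sketched with a small single-machine word-count example in plain Python. This is our own simplified illustration, not actual Hadoop code: in a real cluster the map and reduce tasks run as distributed processes and the shuffle moves data between machines, while here all three phases are plain functions.

```python
from collections import defaultdict

# Map task: emit a (word, 1) pair for every word in one partition of the input.
def map_task(partition):
    return [(word, 1) for line in partition for word in line.split()]

# Shuffle/sort phase: group all pairs with the same key into one partition.
def shuffle(mapped_pairs):
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

# Reduce task: aggregate all values belonging to one key.
def reduce_task(key, values):
    return (key, sum(values))

# Two input partitions, processed by (conceptually parallel) map tasks.
partitions = [["hello hadoop"], ["hello spark"]]
mapped = [pair for p in partitions for pair in map_task(p)]
result = dict(reduce_task(k, v) for k, v in shuffle(mapped).items())
print(result)  # {'hello': 2, 'hadoop': 1, 'spark': 1}
```

In Hadoop the shuffle step is provided by the framework itself; the programmer only supplies the map and reduce functions, analogous to `map_task` and `reduce_task` here.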
A list of Hadoop-related projects can be found on the Hadoop website http:\/\/hadoop.apache.org\/.<\/p>\n<p><a href=\"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2017\/03\/08\/of-apache-spark-hadoop-vagrant-virtualbox-and-ibm-bluemix-services-part-3-what-is-apache-spark\/\">Part 3 &#8211; What is Apache Spark<\/a><\/p>\n<h5>References Hadoop<\/h5>\n<h6>1 <a href=\"http:\/\/hadoop.apache.org\/\">http:\/\/hadoop.apache.org\/<\/a><\/h6>\n<h6>2 <a href=\"https:\/\/hortonworks.com\/apache\/yarn\/\">https:\/\/hortonworks.com\/apache\/yarn\/<\/a><\/h6>\n<h6>3 <a href=\"https:\/\/de.wikipedia.org\/wiki\/MapReduce\">https:\/\/de.wikipedia.org\/wiki\/MapReduce<\/a><\/h6>\n<h6>4 <a href=\"https:\/\/www.heise.de\/developer\/artikel\/Hadoop-Distributed-File-System-964808.html\">https:\/\/www.heise.de\/developer\/artikel\/Hadoop-Distributed-File-System-964808.html<\/a><\/h6>\n<h6>5 <a href=\"https:\/\/de.wikipedia.org\/wiki\/Google_File_System\">https:\/\/de.wikipedia.org\/wiki\/Google_File_System<\/a><\/h6>\n<h6>6 <a href=\"https:\/\/en.wikipedia.org\/wiki\/Apache_HBase\">https:\/\/en.wikipedia.org\/wiki\/Apache_HBase<\/a><\/h6>\n<h6>7 <a href=\"https:\/\/hortonworks.com\/apache\/hbase\/\">https:\/\/hortonworks.com\/apache\/hbase\/<\/a><\/h6>\n<h6>8 <a href=\"https:\/\/en.wikipedia.org\/wiki\/Apache_Hive\">https:\/\/en.wikipedia.org\/wiki\/Apache_Hive<\/a><\/h6>\n","protected":false},"excerpt":{"rendered":"<p>In our project we primarily implemented Spark applications, but we used components of Apache Hadoop like the Hadoop distributed file system or the cluster manager Hadoop YARN. For our discussion in the last part of this blog article it is moreover necessary to understand Hadoop MapReduce for comparison to Apache Spark. 
Because of this we [&hellip;]<\/p>\n","protected":false},"author":49,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[22,651,2],"tags":[],"ppma_author":[721],"class_list":["post-2151","post","type-post","status-publish","format-standard","hentry","category-student-projects","category-system-designs","category-system-engineering"],"aioseo_notices":[],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":2143,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2017\/03\/08\/of-apache-spark-hadoop-vagrant-virtualbox-and-ibm-bluemix-services\/","url_meta":{"origin":2151,"position":0},"title":"Of Apache Spark, Hadoop, Vagrant, VirtualBox and IBM Bluemix Services &#8211; Part 1 &#8211; Introduction","author":"bh051, cz022, ds168","date":"8. March 2017","format":false,"excerpt":"As part of the lecture \u201cSystem Engineering and Management\u201d in the winter semester 2016\/17, we run a project with Apache Spark and the Apache Hadoop Ecosystem. In this article series firstly we want to introduce Apache Spark and the Apache Hadoop Ecosystem. Furthermore we want to give an overview of\u2026","rel":"","context":"In &quot;Student Projects&quot;","block_context":{"text":"Student Projects","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/student-projects\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":2165,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2017\/03\/09\/of-apache-spark-hadoop-vagrant-virtualbox-and-ibm-bluemix-services-part-6-apache-spark-andvs-apache-hadoop\/","url_meta":{"origin":2151,"position":1},"title":"Of Apache Spark, Hadoop, Vagrant, VirtualBox and IBM Bluemix Services &#8211; Part 6 &#8211; Apache Spark and\/vs Apache Hadoop?","author":"bh051, cz022, ds168","date":"9. 
March 2017","format":false,"excerpt":"At the beginning of this article series we introduced the core concepts of Hadoop and Spark in a nutshell. Both, Apache Spark and Apache Hadoop are frameworks for efficient processing of large data on computer clusters. The question arises how they differ or relate to each other. Hereof it seems\u2026","rel":"","context":"In &quot;Student Projects&quot;","block_context":{"text":"Student Projects","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/student-projects\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":2157,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2017\/03\/09\/of-apache-spark-hadoop-vagrant-virtualbox-and-ibm-bluemix-services-part-4-big-data-engineering\/","url_meta":{"origin":2151,"position":2},"title":"Of Apache Spark, Hadoop, Vagrant, VirtualBox and IBM Bluemix Services &#8211; Part 4 &#8211; Big Data Engineering","author":"bh051, cz022, ds168","date":"9. March 2017","format":false,"excerpt":"Our objective in this project was to build an environment that could be practical. So we set up a virtual Hadoop test cluster with virtual machines. Our production environment was a Hadoop Cluster in the IBM Bluemix cloud which we could use for free with our student accounts. 
We developed\u2026","rel":"","context":"In &quot;Student Projects&quot;","block_context":{"text":"Student Projects","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/student-projects\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2017\/03\/dev-env-spark-768x512.png?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2017\/03\/dev-env-spark-768x512.png?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2017\/03\/dev-env-spark-768x512.png?resize=525%2C300&ssl=1 1.5x"},"classes":[]},{"id":2153,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2017\/03\/08\/of-apache-spark-hadoop-vagrant-virtualbox-and-ibm-bluemix-services-part-3-what-is-apache-spark\/","url_meta":{"origin":2151,"position":3},"title":"Of Apache Spark, Hadoop, Vagrant, VirtualBox and IBM Bluemix Services &#8211; Part 3 &#8211; What is Apache Spark?","author":"bh051, cz022, ds168","date":"8. March 2017","format":false,"excerpt":"Apache Spark is a framework for fast processing of large data on computer clusters. Spark applications can be written in Scala, Java, Python or R and can be executed in the cloud or on Hadoop (YARN) or Mesos cluster managers. 
It is also possible to run Spark applications standalone, that\u2026","rel":"","context":"In &quot;Student Projects&quot;","block_context":{"text":"Student Projects","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/student-projects\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2017\/03\/spark-overview-768x195.png?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2017\/03\/spark-overview-768x195.png?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2017\/03\/spark-overview-768x195.png?resize=525%2C300&ssl=1 1.5x"},"classes":[]},{"id":2161,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2017\/03\/09\/of-apache-spark-hadoop-vagrant-virtualbox-and-ibm-bluemix-services-part-5-spark-applications-in-pia-project\/","url_meta":{"origin":2151,"position":4},"title":"Of Apache Spark, Hadoop, Vagrant, VirtualBox and IBM Bluemix Services &#8211; Part 5 &#8211; Spark applications in PIA project","author":"bh051, cz022, ds168","date":"9. March 2017","format":false,"excerpt":"The main reason for choosing Spark was a second project which we developed for the course \u201cProgramming Intelligent Applications\u201d. For this project we wanted to implement a framework which is able to monitor important events (e.g. terror, natural disasters) on the world through Twitter. 
To separate important tweets from others\u2026","rel":"","context":"In &quot;Student Projects&quot;","block_context":{"text":"Student Projects","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/student-projects\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":10289,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2020\/03\/09\/distributed-stream-processing-frameworks-what-they-are-and-how-they-perform\/","url_meta":{"origin":2151,"position":5},"title":"Distributed stream processing frameworks &#8211; what they are and how they perform","author":"Alexander Merker","date":"9. March 2020","format":false,"excerpt":"An overview on stream processing, common frameworks as well as some insights on performance based on benchmarking data","rel":"","context":"In &quot;Allgemein&quot;","block_context":{"text":"Allgemein","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/allgemein\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/08\/storm_arch.png?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/08\/storm_arch.png?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/08\/storm_arch.png?resize=525%2C300&ssl=1 1.5x"},"classes":[]}],"jetpack_sharing_enabled":true,"authors":[{"term_id":721,"user_id":49,"is_guest":0,"slug":"bh051","display_name":"bh051, cz022, 
ds168","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/6e0cfeb23e37b530d4d35d4e46d3e6f39969124f52f6474b4cf0f23b6ff524ac?s=96&d=mm&r=g","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":""}],"_links":{"self":[{"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/posts\/2151","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/users\/49"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/comments?post=2151"}],"version-history":[{"count":16,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/posts\/2151\/revisions"}],"predecessor-version":[{"id":25530,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/posts\/2151\/revisions\/25530"}],"wp:attachment":[{"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/media?parent=2151"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/categories?post=2151"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/tags?post=2151"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/ppma_author?post=2151"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}