{"id":2165,"date":"2017-03-09T11:51:26","date_gmt":"2017-03-09T10:51:26","guid":{"rendered":"https:\/\/blog.mi.hdm-stuttgart.de\/?p=2165"},"modified":"2023-06-07T15:28:11","modified_gmt":"2023-06-07T13:28:11","slug":"of-apache-spark-hadoop-vagrant-virtualbox-and-ibm-bluemix-services-part-6-apache-spark-andvs-apache-hadoop","status":"publish","type":"post","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2017\/03\/09\/of-apache-spark-hadoop-vagrant-virtualbox-and-ibm-bluemix-services-part-6-apache-spark-andvs-apache-hadoop\/","title":{"rendered":"Of Apache Spark, Hadoop, Vagrant, VirtualBox and IBM Bluemix Services &#8211; Part 6 &#8211; Apache Spark and\/vs Apache Hadoop?"},"content":{"rendered":"<p>At the beginning of this article series we introduced the core concepts of Hadoop and Spark in a nutshell. Both Apache Spark and Apache Hadoop are frameworks for the efficient processing of large data sets on computer clusters. This raises the question of how they differ and how they relate to each other.<\/p>\n<p>Opinions on this seem to be divided. In discussions, some people argue that Apache Spark is edging Hadoop out because, among other things, Spark is more efficient and distinctly faster for some use cases. To get a clearer picture, we compare the two frameworks in the following section.<\/p>\n<p><!--more--><\/p>\n<p>One of the biggest differences between the two frameworks is that Apache Spark can keep intermediate results in memory, which is also its default configuration, whereas Hadoop\u2019s MapReduce always writes all intermediate results to disk (usually to HDFS). This explains the sometimes considerable performance differences. It also means that a Spark cluster needs a large amount of RAM, which is very expensive [11]. On the other hand, Spark reduces the cost per unit of computation [11]. 
In times of IaaS and payment models where resources can be rented by the minute, it is quite likely that Spark reduces the overall cost of computation.<br \/>\nBut what happens if the data to be processed is so large that the intermediate results no longer fit in memory? For this case, Apache Spark applications can be configured to store intermediate results on disk. Even with this configuration, Apache Spark is ten times faster than Hadoop for some use cases [12]. This means Apache Spark can also be used for batch processing. The main difference is that Apache Hadoop is designed as a batch processing framework, whereas Apache Spark is designed as a micro-batch processing framework and is therefore also usable for near real-time tasks. Hadoop is efficient for tasks where real-time processing is not necessary. From this point of view, we think it is a bit unfair to compare the two frameworks purely on performance. In our opinion, Apache Spark is likely the better solution for many use cases where Hadoop was formerly used, especially for machine learning and near real-time tasks. Spark provides a simple programming API in four different languages and has powerful modules that are easy to use. In Spark you can easily create multi-stage pipelines. If you want to build multi-stage pipelines with MapReduce, you have to chain multiple MapReduce jobs. A MapReduce job consists of just two tasks, the map task and the reduce task, which the programmer has to implement. But we think it is precisely this simplicity that allows Hadoop clusters to scale almost without limit. In the respective phase, the map or reduce processes are started across the cluster and execute their tasks. We cannot see scaling problems at any point. Just add commodity computers to the cluster and you have more memory, CPUs and disk space. 
We have our doubts whether Spark applications can scale as readily, because the processing stages in Spark do not seem to be as clearly separated as in Hadoop. At the time of writing, however, there is a Spark cluster with 8000 nodes [13]. For many use cases this should be enough. But Twitter, for example, processes 500 PB of data on multiple Hadoop clusters, the biggest of which consists of over 10k nodes [14]. For such use cases we can imagine that Hadoop is the better solution, and so we believe Apache Hadoop still has its place. Furthermore, Hadoop\u2019s cluster manager YARN has been developed further in a way that allows external projects like Spark to use HDFS. Why would the Hadoop project open its doors to promising projects like Spark if they could make Hadoop obsolete? In conclusion, we would say that Apache Spark and Apache Hadoop complement each other rather than displace each other.<\/p>\n<h5>References<\/h5>\n<h6>11 <a href=\"https:\/\/acadgild.com\/blog\/hadoop-vs-spark-best-big-data-frameworks\/\">https:\/\/acadgild.com\/blog\/hadoop-vs-spark-best-big-data-frameworks\/<\/a><\/h6>\n<h6>12 <a href=\"http:\/\/spark.apache.org\/\">http:\/\/spark.apache.org\/<\/a><\/h6>\n<h6>13 <a href=\"http:\/\/spark.apache.org\/faq.html\">http:\/\/spark.apache.org\/faq.html<\/a><\/h6>\n<h6>14 <a href=\"https:\/\/blog.twitter.com\/2017\/the-infrastructure-behind-twitter-scale\">https:\/\/blog.twitter.com\/2017\/the-infrastructure-behind-twitter-scale<\/a><\/h6>\n","protected":false},"excerpt":{"rendered":"<p>At the beginning of this article series we introduced the core concepts of Hadoop and Spark in a nutshell. Both Apache Spark and Apache Hadoop are frameworks for the efficient processing of large data sets on computer clusters. This raises the question of how they differ and how they relate to each other. Opinions on this seem to be divided. 
In [&hellip;]<\/p>\n","protected":false},"author":49,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[22,651,2],"tags":[],"ppma_author":[721],"class_list":["post-2165","post","type-post","status-publish","format-standard","hentry","category-student-projects","category-system-designs","category-system-engineering"],"aioseo_notices":[],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":2143,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2017\/03\/08\/of-apache-spark-hadoop-vagrant-virtualbox-and-ibm-bluemix-services\/","url_meta":{"origin":2165,"position":0},"title":"Of Apache Spark, Hadoop, Vagrant, VirtualBox and IBM Bluemix Services &#8211; Part 1 &#8211; Introduction","author":"bh051, cz022, ds168","date":"8. March 2017","format":false,"excerpt":"As part of the lecture \u201cSystem Engineering and Management\u201d in the winter semester 2016\/17, we run a project with Apache Spark and the Apache Hadoop Ecosystem. In this article series firstly we want to introduce Apache Spark and the Apache Hadoop Ecosystem. Furthermore we want to give an overview of\u2026","rel":"","context":"In &quot;Student Projects&quot;","block_context":{"text":"Student Projects","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/student-projects\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":2151,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2017\/03\/08\/of-apache-spark-hadoop-vagrant-virtualbox-and-ibm-bluemix-services-part-2-apache-hadoop-ecosystem\/","url_meta":{"origin":2165,"position":1},"title":"Of Apache Spark, Hadoop, Vagrant, VirtualBox and IBM Bluemix Services &#8211; Part 2 &#8211; Apache Hadoop Ecosystem","author":"bh051, cz022, ds168","date":"8. 
March 2017","format":false,"excerpt":"In our project we primarily implemented Spark applications, but we used components of Apache Hadoop like the Hadoop distributed file system or the cluster manager Hadoop YARN. For our discussion in the last part of this blog article it is moreover necessary to understand Hadoop MapReduce for comparison to Apache\u2026","rel":"","context":"In &quot;Student Projects&quot;","block_context":{"text":"Student Projects","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/student-projects\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":2157,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2017\/03\/09\/of-apache-spark-hadoop-vagrant-virtualbox-and-ibm-bluemix-services-part-4-big-data-engineering\/","url_meta":{"origin":2165,"position":2},"title":"Of Apache Spark, Hadoop, Vagrant, VirtualBox and IBM Bluemix Services &#8211; Part 4 &#8211; Big Data Engineering","author":"bh051, cz022, ds168","date":"9. March 2017","format":false,"excerpt":"Our objective in this project was to build an environment that could be practical. So we set up a virtual Hadoop test cluster with virtual machines. Our production environment was a Hadoop Cluster in the IBM Bluemix cloud which we could use for free with our student accounts. 
We developed\u2026","rel":"","context":"In &quot;Student Projects&quot;","block_context":{"text":"Student Projects","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/student-projects\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2017\/03\/dev-env-spark-768x512.png?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2017\/03\/dev-env-spark-768x512.png?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2017\/03\/dev-env-spark-768x512.png?resize=525%2C300&ssl=1 1.5x"},"classes":[]},{"id":2153,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2017\/03\/08\/of-apache-spark-hadoop-vagrant-virtualbox-and-ibm-bluemix-services-part-3-what-is-apache-spark\/","url_meta":{"origin":2165,"position":3},"title":"Of Apache Spark, Hadoop, Vagrant, VirtualBox and IBM Bluemix Services &#8211; Part 3 &#8211; What is Apache Spark?","author":"bh051, cz022, ds168","date":"8. March 2017","format":false,"excerpt":"Apache Spark is a framework for fast processing of large data on computer clusters. Spark applications can be written in Scala, Java, Python or R and can be executed in the cloud or on Hadoop (YARN) or Mesos cluster managers. 
It is also possible to run Spark applications standalone, that\u2026","rel":"","context":"In &quot;Student Projects&quot;","block_context":{"text":"Student Projects","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/student-projects\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2017\/03\/spark-overview-768x195.png?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2017\/03\/spark-overview-768x195.png?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2017\/03\/spark-overview-768x195.png?resize=525%2C300&ssl=1 1.5x"},"classes":[]},{"id":2161,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2017\/03\/09\/of-apache-spark-hadoop-vagrant-virtualbox-and-ibm-bluemix-services-part-5-spark-applications-in-pia-project\/","url_meta":{"origin":2165,"position":4},"title":"Of Apache Spark, Hadoop, Vagrant, VirtualBox and IBM Bluemix Services &#8211; Part 5 &#8211; Spark applications in PIA project","author":"bh051, cz022, ds168","date":"9. March 2017","format":false,"excerpt":"The main reason for choosing Spark was a second project which we developed for the course \u201cProgramming Intelligent Applications\u201d. For this project we wanted to implement a framework which is able to monitor important events (e.g. terror, natural disasters) on the world through Twitter. 
To separate important tweets from others\u2026","rel":"","context":"In &quot;Student Projects&quot;","block_context":{"text":"Student Projects","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/student-projects\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":10289,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2020\/03\/09\/distributed-stream-processing-frameworks-what-they-are-and-how-they-perform\/","url_meta":{"origin":2165,"position":5},"title":"Distributed stream processing frameworks &#8211; what they are and how they perform","author":"Alexander Merker","date":"9. March 2020","format":false,"excerpt":"An overview on stream processing, common frameworks as well as some insights on performance based on benchmarking data","rel":"","context":"In &quot;Allgemein&quot;","block_context":{"text":"Allgemein","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/allgemein\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/08\/storm_arch.png?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/08\/storm_arch.png?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/08\/storm_arch.png?resize=525%2C300&ssl=1 1.5x"},"classes":[]}],"jetpack_sharing_enabled":true,"authors":[{"term_id":721,"user_id":49,"is_guest":0,"slug":"bh051","display_name":"bh051, cz022, 
ds168","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/6e0cfeb23e37b530d4d35d4e46d3e6f39969124f52f6474b4cf0f23b6ff524ac?s=96&d=mm&r=g","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":""}],"_links":{"self":[{"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/posts\/2165","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/users\/49"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/comments?post=2165"}],"version-history":[{"count":16,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/posts\/2165\/revisions"}],"predecessor-version":[{"id":2243,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/posts\/2165\/revisions\/2243"}],"wp:attachment":[{"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/media?parent=2165"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/categories?post=2165"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/tags?post=2165"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/ppma_author?post=2165"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}