At the beginning of this article series we introduced the core concepts of Hadoop and Spark in a nutshell. Both Apache Spark and Apache Hadoop are frameworks for efficiently processing large amounts of data on computer clusters. The question arises how they differ from and relate to each other.
Opinions on this seem to be divided. In discussions, some people argue that Apache Spark is pushing Hadoop aside because, among other things, Spark is considerably faster and more efficient for some use cases. To get a clearer picture, we compare the two frameworks in the following section.
One of the biggest differences between the two frameworks is that Apache Spark can keep intermediate results in memory, which is also the default configuration, whereas Hadoop's MapReduce always writes all intermediate results to disk (usually to HDFS). This explains the sometimes considerable performance differences. It also means, however, that a Spark cluster needs a large amount of RAM, which is expensive. On the other hand, the faster processing reduces the cost per unit of computation. In times of IaaS and payment models where resources can be rented by the minute, it is therefore quite likely that Spark lowers the overall cost of a computation.
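The effect of this difference can be sketched in plain Python (this is deliberately not Spark or Hadoop code, just an illustration): a two-stage computation either hands its intermediate result to the next stage in memory, or, MapReduce-style, writes it to disk and reads it back. A temporary file stands in for HDFS here.

```python
import json
import tempfile

def stage_square(records):
    """First stage: square every value."""
    return [r * r for r in records]

def stage_sum(records):
    """Second stage: sum the values."""
    return sum(records)

def run_in_memory(data):
    """Spark-style: the intermediate result stays in RAM between stages."""
    intermediate = stage_square(data)  # never leaves memory
    return stage_sum(intermediate)

def run_via_disk(data):
    """MapReduce-style: the first stage writes its output to disk
    (a temp file standing in for HDFS); the next stage reads it back."""
    with tempfile.NamedTemporaryFile("w+", suffix=".json") as f:
        json.dump(stage_square(data), f)  # stage 1 result goes to "disk"
        f.flush()
        f.seek(0)
        intermediate = json.load(f)       # stage 2 reads it back in
    return stage_sum(intermediate)

data = [1, 2, 3, 4]
assert run_in_memory(data) == run_via_disk(data) == 30  # 1 + 4 + 9 + 16
```

Both variants compute the same result; the disk round trip only adds serialization and I/O cost per stage, which is exactly the cost that grows with the number of stages in a pipeline.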
But what happens if the data to be processed is so big that the intermediate results no longer fit in memory? For this case, Apache Spark applications can be configured to store intermediate results on disk. Even with this configuration, Apache Spark is for some use cases up to ten times faster than Hadoop. This means Apache Spark can also be used for batch processing. The main difference is that Apache Hadoop is designed as a batch processing framework, whereas Apache Spark is designed as a micro-batch processing framework and is therefore also usable for near real-time tasks. Hadoop is efficient for tasks where real-time processing is not necessary. From this point of view, we think it is a bit unfair to compare the two frameworks on performance alone.

In our opinion, Apache Spark may well be the better solution for many use cases where Hadoop was formerly used, especially for machine learning and near real-time tasks. Spark provides a simple programming API in four different languages and powerful modules that are easy to use. In Spark you can easily create multi-stage pipelines; if you want to realize multi-stage pipelines with MapReduce, you have to concatenate multiple MapReduce jobs. A MapReduce job consists simply of two tasks, the map task and the reduce task, which the programmer has to implement.

We believe, however, that it is exactly this simplicity that allows Hadoop clusters to scale almost without limit. In the respective phase, the map or reduce processes are started across the cluster and execute their tasks. We cannot see any point at which scaling problems would arise: just add commodity computers to the cluster and you have more memory, CPUs and disk space. We have our doubts whether Spark applications can scale in the same unhesitating way, because the processing stages in Spark do not seem as clearly separated as in Hadoop. According to the current state, however, there exists a Spark cluster with 8,000 nodes. So for many use cases this should be enough.
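To make the programming model concrete, here is a small plain-Python sketch (again, not actual Hadoop code) of the MapReduce structure described above: a job consists only of a map task and a reduce task implemented by the programmer, and a multi-stage pipeline is realized by concatenating jobs, i.e. feeding the output of job 1 into job 2. The driver function and both jobs are illustrative assumptions, not a real API.

```python
from collections import defaultdict

def run_mapreduce_job(records, map_task, reduce_task):
    """Minimal job driver: apply the map task to every record, group the
    emitted (key, value) pairs by key (the shuffle), then apply the
    reduce task once per key."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_task(record):
            groups[key].append(value)
    return [reduce_task(key, values) for key, values in groups.items()]

# Job 1: the classic word count -- the two tasks the programmer writes.
def wc_map(line):
    for word in line.split():
        yield word.lower(), 1

def wc_reduce(word, counts):
    return word, sum(counts)

# Job 2: count how many distinct words occur with each frequency.
# A second job is needed because a single map/reduce pair cannot
# re-group by the result of the first reduce -- this is the job
# concatenation mentioned in the text.
def freq_map(word_count_pair):
    word, count = word_count_pair
    yield count, 1

def freq_reduce(count, ones):
    return count, sum(ones)

lines = ["spark and hadoop", "hadoop and spark and yarn"]
word_counts = run_mapreduce_job(lines, wc_map, wc_reduce)
histogram = run_mapreduce_job(word_counts, freq_map, freq_reduce)
# "and" appears 3 times; "spark" and "hadoop" twice; "yarn" once.
assert dict(word_counts)["and"] == 3
assert dict(histogram) == {3: 1, 2: 2, 1: 1}
```

In Spark, by contrast, the same pipeline would typically be written as one chain of transformations on a single data set, without materializing the first job's output as a separate data set in between.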
Twitter, for example, processes 500 PB of data on multiple Hadoop clusters, the biggest of which consists of over 10,000 nodes. For such use cases we could imagine that Hadoop is the better solution, and so we believe Apache Hadoop still has its justification. Furthermore, Hadoop's cluster manager YARN has been developed further in a way that allows external projects like Spark to use HDFS. Why would the Hadoop project open its doors to projects like Spark if those projects could make Hadoop obsolete? In conclusion, we would say that Apache Spark and Apache Hadoop complement each other rather than displace each other.