{"id":10289,"date":"2020-03-09T11:28:56","date_gmt":"2020-03-09T10:28:56","guid":{"rendered":"https:\/\/blog.mi.hdm-stuttgart.de\/?p=10289"},"modified":"2023-08-06T21:44:19","modified_gmt":"2023-08-06T19:44:19","slug":"distributed-stream-processing-frameworks-what-they-are-and-how-they-perform","status":"publish","type":"post","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2020\/03\/09\/distributed-stream-processing-frameworks-what-they-are-and-how-they-perform\/","title":{"rendered":"Distributed stream processing frameworks &#8211; what they are and how they perform"},"content":{"rendered":"\n<p>This blog post provides an overview of stream processing and its capabilities in large-scale environments. It starts with an introduction to stream processing, then explains how stream processing works and shows different areas of application as well as some common stream processing frameworks. Finally, it compares the performance of several common frameworks based on benchmarking data.<\/p>\n\n\n<p><!--more--><\/p>\n\n\n\n<p>Let&#8217;s begin with an introduction to stream processing.<\/p>\n\n\n\n<p>Distributed stream processing engines have been gaining popularity in recent years. Stream processing is a technology that queries continuous streams of data in real time and performs operations on the received data. It also goes by the names event processing, Complex Event Processing, real-time analytics or stream analytics. It processes data as soon as it arrives in the system and can be used to quickly detect conditions in the received data. As an example, imagine a temperature sensor that continuously sends the temperature level. Once a certain temperature level is reached, the system can trigger an alert based on the received data. [1]<\/p>\n\n\n\n<p>At a more technical level, stream processing engines process data in a pipeline-like structure. 
Data gets processed in terms of a Directed Acyclic Graph (see below). The processing can chain various functions together but never go back to an earlier point in the graph. Depending on the processed data steps in the chain can be skipped.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" data-attachment-id=\"10290\" data-permalink=\"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2020\/03\/09\/distributed-stream-processing-frameworks-what-they-are-and-how-they-perform\/dag\/\" data-orig-file=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/DAG.png\" data-orig-size=\"605,559\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"DAG\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/DAG.png\" src=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/DAG.png\" alt=\"Directed Acyclic Graph\" class=\"wp-image-10290\" width=\"303\" height=\"280\" srcset=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/DAG.png 605w, https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/DAG-300x277.png 300w\" sizes=\"auto, (max-width: 303px) 100vw, 303px\" \/><figcaption class=\"wp-element-caption\">Directed Acyclic Graph. Source: [5]<\/figcaption><\/figure>\n\n\n\n<p>\nThe\navailable frameworks approach this differently. Some frameworks let\nthe developer define the graph explicitly, thus the coding is at a\nmuch lower level. 
In these frameworks like Apache Storm or Apache\nSamza the developer has full control over the code but it is possible\nto write inefficient code.<\/p>\n\n\n\n<p>In\nother frameworks like Apache Flink or Apache Spark the developer can\nsimply chain functions together and the framework constructs the\ngraph. Thus the code is shaped in a very functional style as shown in\nthe example code snippet below. The code snippet shows a simple\napplication that counts words from an incoming stream for five\nseconds. [5]<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"786\" height=\"609\" data-attachment-id=\"10292\" data-permalink=\"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2020\/03\/09\/distributed-stream-processing-frameworks-what-they-are-and-how-they-perform\/wordcountexample\/\" data-orig-file=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/wordcountexample.png\" data-orig-size=\"786,609\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"wordcountexample\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/wordcountexample.png\" src=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/wordcountexample.png\" alt=\"Apache Flink word count example code\" class=\"wp-image-10292\" srcset=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/wordcountexample.png 786w, https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/wordcountexample-300x232.png 300w, 
https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/wordcountexample-768x595.png 768w\" sizes=\"auto, (max-width: 786px) 100vw, 786px\" \/><figcaption class=\"wp-element-caption\">Apache Flink word count example code. Source: [4]<\/figcaption><\/figure>\n\n\n\n<p>In contrast, there is the &#8220;classic&#8221; approach of batch processing. Processing happens on blocks of data that have been collected and stored over a period of time. Depending on the size of the application, this can be a huge amount of data with possibly millions of records. [1]<br><\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><br><strong>So why use stream processing instead of just processing the data in batches? And when?<\/strong><\/h2>\n\n\n\n<p>Some data that has to be processed inside an application naturally arrives as a never-ending stream. For example, healthcare sensors, traffic sensors and almost all IoT devices produce events continuously. This is time-series data, for which the time of arrival or the time at which the event occurred is important. Stream processing frameworks naturally fit this model of time-series data. Detecting patterns and anomalies within data streams becomes easier. Additionally, you can inspect multiple data streams at once. [1]<\/p>\n\n\n\n<p>Processing constantly arriving streaming data in batches would require stopping data collection at some point, storing the data and then processing it. You would then have to worry about aggregation across multiple batches. So using a framework that fits this model makes sense. [1]<\/p>\n\n\n\n<p>Now let&#8217;s look at some example use cases where streaming can be used beneficially:<br><\/p>\n\n\n\n<h2 class=\"wp-block-heading\"> <br><strong>Event-driven applications<\/strong><\/h2>\n\n\n\n<p>An event-driven application receives events from one or more sources and performs operations on these events. 
Instead of writing data to a transactional database the data and state of the application are kept on the local system. By keeping the data on the local system the performance (latency and throughput) of the application is improved and possible network failure will be avoided. The application writes checkpoints periodically for fault-tolerance but this can happen asynchronously and does not impact performance. Examples for event-driven applications can be fraud detection, anomaly detection or rule-based alerting. [2]<\/p>\n\n\n\n<p>The following image, taken from Apache Flink, illustrates how the architecture of an event-driven application is structured. (Apache Flink will be described later in this blog post)<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1372\" height=\"687\" data-attachment-id=\"10293\" data-permalink=\"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2020\/03\/09\/distributed-stream-processing-frameworks-what-they-are-and-how-they-perform\/usecases-eventdrivenapps\/\" data-orig-file=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/usecases-eventdrivenapps.png\" data-orig-size=\"1372,687\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"usecases-eventdrivenapps\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/usecases-eventdrivenapps-1024x513.png\" src=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/08\/usecases-eventdrivenapps.png\" alt=\"Architecture of an 
event-driven application.\" class=\"wp-image-10293\"\/><figcaption class=\"wp-element-caption\">Architecture of an event-driven application. Source: [2]<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\"> <br><strong>Data Analytics Applications<\/strong><\/h2>\n\n\n\n<p>A data analytics application extracts insights from raw data. While this can also be done with batch processing, it is a good use case for a streaming application. The advantage of a stream processing engine is that it works on real-time data and continuously produces results. The results can be persisted to external storage and\/or shown in a live report in real time. As a result, the latency between an event occurring and its appearance in the results is reduced. Examples of data analytics applications are quality monitoring of telecommunication networks and analysis of product updates. [2]<\/p>\n\n\n\n<p>The following image shows two architectures of data analytics applications using Apache Flink.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"755\" height=\"194\" data-attachment-id=\"10294\" data-permalink=\"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2020\/03\/09\/distributed-stream-processing-frameworks-what-they-are-and-how-they-perform\/usecases-dataanalyticsapps\/\" data-orig-file=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/usecases-dataanalyticsapps.png\" data-orig-size=\"755,194\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"usecases-dataanalyticsapps\" data-image-description=\"\" 
data-image-caption=\"\" data-large-file=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/usecases-dataanalyticsapps.png\" src=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/usecases-dataanalyticsapps.png\" alt=\"\" class=\"wp-image-10294\" srcset=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/usecases-dataanalyticsapps.png 755w, https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/usecases-dataanalyticsapps-300x77.png 300w\" sizes=\"auto, (max-width: 755px) 100vw, 755px\" \/><figcaption class=\"wp-element-caption\">Batch and streaming analytics architectures in comparison. Source: [2]<\/figcaption><\/figure>\n\n\n\n<p>Now that we&#8217;ve seen an introduction to stream processing as well as some use cases, let&#8217;s look at which stream processing frameworks exist.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><br><strong>An overview of available frameworks<\/strong><\/h2>\n\n\n\n<p>Large-scale data processing started out with batch processing engines such as Apache Hadoop and later shifted towards stream processing. By now, the following popular frameworks have implementations for stream processing:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Apache Spark<\/li>\n\n\n\n<li>Apache Storm<\/li>\n\n\n\n<li>Apache Flink<\/li>\n\n\n\n<li>Apache Samza<\/li>\n\n\n\n<li>Apache Kafka (Kafka Streams)<\/li>\n\n\n\n<li>Apache Apex<\/li>\n<\/ul>\n\n\n\n<p>Stream processing is also available as a managed service, for example Amazon Kinesis.<\/p>\n\n\n\n<p>The following sections briefly introduce some of the most popular of these frameworks.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"> <br><strong>Apache Flink<\/strong><\/h2>\n\n\n\n<p>Apache Flink is an open-source distributed stream processing engine. 
It can do stateful computations over bounded and unbounded streams. Unbounded streams are streaming data that arrives endlessly as it is generated. The order of arrival is important to reason about completeness and must be preserved. Bounded streams are comparable to <em>batch processing<\/em>. They have a defined start and end. The order of arrival is negligible because the finite data can always be sorted.<\/p>\n\n\n\n<p>That way, Flink can process both streaming data and batch data. [9]<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"639\" height=\"169\" data-attachment-id=\"10295\" data-permalink=\"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2020\/03\/09\/distributed-stream-processing-frameworks-what-they-are-and-how-they-perform\/boundedunbounded\/\" data-orig-file=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/BoundedUnbounded.png\" data-orig-size=\"639,169\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"BoundedUnbounded\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/BoundedUnbounded.png\" src=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/BoundedUnbounded.png\" alt=\"Bounded and unbounded streams processable by Apache Flink.\" class=\"wp-image-10295\" srcset=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/BoundedUnbounded.png 639w, 
https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/BoundedUnbounded-300x79.png 300w\" sizes=\"auto, (max-width: 639px) 100vw, 639px\" \/><figcaption class=\"wp-element-caption\">Bounded and unbounded streams processable by Apache Flink. Source: [9]<\/figcaption><\/figure>\n\n\n\n<p>In Flink, the workload is parallelized into multiple tasks that are distributed and run concurrently. It integrates natively with cluster managers like Kubernetes and can also be set up as a standalone cluster. [9]<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><br><strong>Apache Storm<\/strong><\/h2>\n\n\n\n<p>Apache Storm introduces itself as an open-source realtime computation system. It can process unbounded streams of data. [10]<\/p>\n\n\n\n<p>Apache Storm has three abstraction types: spouts, bolts and topologies, as can be seen in the architecture diagram below. Spouts are the sources of streaming data for further processing. They typically read from a queueing broker like Kafka but can also generate their own streams. Bolts process an input stream and produce any number of output streams. 
A topology unites it all and represents\na network of spouts and bolts with their connections.[11]<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1052\" height=\"512\" data-attachment-id=\"10296\" data-permalink=\"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2020\/03\/09\/distributed-stream-processing-frameworks-what-they-are-and-how-they-perform\/storm_arch\/\" data-orig-file=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/storm_arch.png\" data-orig-size=\"1052,512\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"storm_arch\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/storm_arch-1024x498.png\" src=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/08\/storm_arch.png\" alt=\"Apache Storm architecture\" class=\"wp-image-10296\"\/><figcaption class=\"wp-element-caption\">Apache Storm typical dataflow architecture. Source: [11]<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\"><br>Apache Spark<\/h2>\n\n\n\n<p>Apache Spark has a streaming component named Spark Streaming. It can consume streams of data by sources like Apache Kafka. It enables scalable, fault-tolerant processing of data in micro-batches. That means Spark divides the streaming data into small batch packages and then writes the results to an output stream of batches. You can then make use of the powerful other components of Spark like its machine learning library and apply it to the result data. 
[12]<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"647\" height=\"134\" data-attachment-id=\"10297\" data-permalink=\"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2020\/03\/09\/distributed-stream-processing-frameworks-what-they-are-and-how-they-perform\/sparkarch\/\" data-orig-file=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/SparkArch.png\" data-orig-size=\"647,134\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"SparkArch\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/SparkArch.png\" src=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/SparkArch.png\" alt=\"\" class=\"wp-image-10297\" srcset=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/SparkArch.png 647w, https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/SparkArch-300x62.png 300w\" sizes=\"auto, (max-width: 647px) 100vw, 647px\" \/><figcaption class=\"wp-element-caption\">Apache Spark architecture. Source: [12]<\/figcaption><\/figure>\n\n\n\n<p>Also take a look at this <a href=\"https:\/\/blog.scottlogic.com\/2018\/07\/06\/comparing-streaming-frameworks-pt1.html\">blog article<\/a>, which compares the frameworks from a coding perspective. It shows how a simple <em>hello-world-like<\/em> task is done in different stream processing frameworks.<\/p>\n\n\n\n<p>So far, we&#8217;ve got an overview of several frameworks and use cases. 
Next up is an overview of the criteria that matter when designing a stream processing engine.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><br><strong>The important factors of a distributed stream processing engine<\/strong><\/h2>\n\n\n\n<p>For a system that has to handle and process a continuous flow of incoming data, any downtime can be fatal. If the system is unavailable or has performance issues, data might be lost completely. That is because, unlike in a batch processing system, data is not stored before processing but is processed immediately and stored afterwards. Therefore, a number of factors have to be considered when analyzing a stream processing engine. [6][7]<\/p>\n\n\n\n<p>When thinking of stream processing in a distributed environment, the following factors come to mind.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Delivery guarantees: They define how reliably each incoming record is processed, even in the event of a system, network or application failure. There are different types of delivery guarantees, namely:\n<ul class=\"wp-block-list\">\n<li> At-most-once: a record may be lost but is never processed twice<\/li>\n\n\n\n<li> At-least-once: a record is never lost but may be processed more than once<\/li>\n\n\n\n<li> Exactly-once: every record affects the result exactly once<br>While exactly-once is the desired state, it is really hard to achieve in a distributed system. Stronger delivery guarantees come with performance tradeoffs. <\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li> Performance: Latency and throughput. Latency should be as low as possible and throughput as high as possible. <\/li>\n\n\n\n<li> Management of state: Storing state information enables operations that span multiple events or streams, such as joins, transformations and aggregations. However, additional computational power is needed to keep the state updated and stored. <\/li>\n\n\n\n<li> Fault-tolerance: In case of a system or component failure the system should be able to recover. 
Ideally, it should resume processing from the point at which it failed. This can be achieved by periodically saving checkpoints, as we have seen in the event-driven application example before. This also helps ensure that all data is processed and no records get lost. <\/li>\n\n\n\n<li> Scalability: The system should be able to deal with varying workloads, arrival frequencies and sizes of incoming data. <\/li>\n\n\n\n<li> Windowing operations: They extract a subset of an infinite data stream (e.g. by grouping events based on a time window) so that this subset can be viewed and processed separately. The window can be defined by event creation time or processing time.<\/li>\n\n\n\n<li> Maturity of the framework: How long the framework has been in use and how big the community is. [6][7] <\/li>\n<\/ul>\n\n\n\n<p>As a last step, let us take a detailed look at the capabilities of existing frameworks and get some insights into how they scale under a heavy workload.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><br><strong>A benchmarking comparison<\/strong><\/h2>\n\n\n\n<p>This section uses benchmarking data from two different sources to compare the performance of the most popular frameworks.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The first source is a benchmark by Adobe for their Adobe Experience Platform. The Adobe Experience Platform handles more than 200k events per second. Adobe evaluated existing frameworks for stream processing to support the growing need for real-time processing and growing amounts of data. They evaluated the performance and reliability of the frameworks Storm, Flink, Samza and Spark. [3]<\/li>\n\n\n\n<li>The second source is a benchmark published by TU Berlin. It measures the performance of windowed operations (the basic operations of data analytics) of the frameworks Apache Flink, Apache Storm and Apache Spark. 
The TU Berlin team used setups with multiple nodes to measure scaling and performance on different hardware setups. [8]\n<\/li>\n<\/ul>\n\n\n\n<p>The two benchmarks mainly measure latency and throughput of stream processing frameworks. The Adobe benchmark also compares qualitative criteria, while the TU Berlin benchmark factors in skewed data and fluctuating workloads. For more information on the benchmarks of these two teams, please head to their articles as listed in the references of this blog post.<\/p>\n\n\n\n<p>The Adobe benchmark consists of multiple &#8216;load&#8217; tests with one million events for every framework, followed by a three-day reliability test. The following table shows the performance measurements of the Adobe benchmark, with data from [3]:<\/p>\n\n\n\n<figure class=\"wp-block-table is-style-stripes\"><table><thead><tr><th><\/th><th>Throughput<\/th><th> 99th percentile process latency* <\/th><th> Reliability <\/th><\/tr><\/thead><tbody><tr><td>Storm<\/td><td>50-1500 events per second<\/td><td>&lt; 30 ms<\/td><td>No crashes\/failures<\/td><\/tr><tr><td>Flink<\/td><td>600-1800 events per second<\/td><td>&lt; 10 ms<\/td><td>No crashes\/failures<\/td><\/tr><tr><td>Spark<\/td><td>500-1700 events per second<\/td><td>6-7 seconds<\/td><td>Crashes on every run<\/td><\/tr><tr><td>Samza<\/td><td>185-815 events per second<\/td><td>54 ms<\/td><td>No crashes, error log raised<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"has-small-font-size\">*A 99th percentile latency of x ms means that 99 out of 100 events are processed within x ms; only 1 in 100 takes longer.<\/p>\n\n\n\n<p>The TU Berlin benchmark consists of multiple experiments for windowed joins and windowed aggregations on streams. The tables shown below illustrate the performance measurements for throughput and latency. 
For further information please refer to <a href=\"http:\/\/www.redaktion.tu-berlin.de\/fileadmin\/fg131\/Publikation\/Papers\/Stream_Benchmarks_ICDE18-CRC.pdf\">this paper<\/a>. [8]<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"286\" height=\"93\" data-attachment-id=\"10298\" data-permalink=\"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2020\/03\/09\/distributed-stream-processing-frameworks-what-they-are-and-how-they-perform\/berlin_bm1\/\" data-orig-file=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/berlin_bm1.png\" data-orig-size=\"286,93\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"berlin_bm1\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/berlin_bm1.png\" src=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/berlin_bm1.png\" alt=\"\" class=\"wp-image-10298\"\/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"874\" height=\"140\" data-attachment-id=\"10299\" data-permalink=\"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2020\/03\/09\/distributed-stream-processing-frameworks-what-they-are-and-how-they-perform\/berlin_bm2\/\" data-orig-file=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/berlin_bm2.png\" data-orig-size=\"874,140\" data-comments-opened=\"1\" 
data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"berlin_bm2\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/berlin_bm2.png\" src=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/berlin_bm2.png\" alt=\"\" class=\"wp-image-10299\" srcset=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/berlin_bm2.png 874w, https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/berlin_bm2-300x48.png 300w, https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/berlin_bm2-768x123.png 768w\" sizes=\"auto, (max-width: 874px) 100vw, 874px\" \/><figcaption class=\"wp-element-caption\">TU Berlin benchmarking data. Source: [8]<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\"><br><strong>Results of the benchmarks<\/strong><\/h2>\n\n\n\n<p>The benchmark conducted by Adobe came to the following conclusion: Apache Flink is the best framework for their large-scale event processing needs. Apache Flink performed best in terms of latency and throughput, proved to be the most reliable and did not crash at all during the conducted tests. Flink handled backpressure very well compared to the other tested frameworks. 
Furthermore, Flink has a good community which will be useful in development and maintenance of their application.[3]<\/p>\n\n\n\n<p>The following table shows their results for qualitative and quantitative benchmarking.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"706\" height=\"250\" data-attachment-id=\"10300\" data-permalink=\"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2020\/03\/09\/distributed-stream-processing-frameworks-what-they-are-and-how-they-perform\/adobe_bm\/\" data-orig-file=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/adobe_bm.png\" data-orig-size=\"706,250\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"adobe_bm\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/adobe_bm.png\" src=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/adobe_bm.png\" alt=\"\" class=\"wp-image-10300\" srcset=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/adobe_bm.png 706w, https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/adobe_bm-300x106.png 300w\" sizes=\"auto, (max-width: 706px) 100vw, 706px\" \/><figcaption class=\"wp-element-caption\">Adobe benchmarking results. Source: [3]<\/figcaption><\/figure>\n\n\n\n<p>The TU Berlin benchmarking shows that each tested framework has a different set of usecases that they excel in. 
Overall, Apache Flink has the best throughput and latency for different setups, while Apache Spark manages skewed data and bound latency better than its competitors. Both Spark and Flink are very robust to fluctuating data. [8]<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><br>Conclusion<\/h2>\n\n\n\n<p>In this article we have got an overview of how stream processing works. We have seen some example use cases where using a stream processing engine is beneficial. Stream processing is a great tool to handle unbounded streams of data that have to be processed with minimal latency, ideally in real time. Many applications with IoT devices or sensor technology can benefit from this approach. However, not all types of applications are a good fit for stream processing. We got a brief introduction to several popular stream processing frameworks as well as some measurements of their performance. Each framework has its own strengths, either in performance or in the additional libraries it provides, e.g. for machine learning. 
<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><br>References<\/h2>\n\n\n\n<p>[1]:  <a href=\"https:\/\/medium.com\/stream-processing\/what-is-stream-processing-1eadfca11b97\">https:\/\/medium.com\/stream-processing\/what-is-stream-processing-1eadfca11b97<\/a><br>[2]:  <a href=\"https:\/\/flink.apache.org\/usecases.html\">https:\/\/flink.apache.org\/usecases.html<\/a><br>[3]:  <a href=\"https:\/\/medium.com\/adobetech\/evaluating-streaming-frameworks-for-large-scale-event-streaming-7209938373c8\">https:\/\/medium.com\/adobetech\/evaluating-streaming-frameworks-for-large-scale-event-streaming-7209938373c8<\/a><br>[4]:  <a href=\"https:\/\/ci.apache.org\/projects\/flink\/flink-docs-release-1.10\/dev\/datastream_api.html\">https:\/\/ci.apache.org\/projects\/flink\/flink-docs-release-1.10\/dev\/datastream_api.html<\/a><br>[5]:  <a href=\"https:\/\/blog.scottlogic.com\/2018\/07\/06\/comparing-streaming-frameworks-pt1.html\">https:\/\/blog.scottlogic.com\/2018\/07\/06\/comparing-streaming-frameworks-pt1.html<\/a><br>[6]:  <a href=\"https:\/\/medium.com\/@chandanbaranwal\/spark-streaming-vs-flink-vs-storm-vs-kafka-streams-vs-samza-choose-your-stream-processing-91ea3f04675b\">https:\/\/medium.com\/@chandanbaranwal\/spark-streaming-vs-flink-vs-storm-vs-kafka-streams-vs-samza-choose-your-stream-processing-91ea3f04675b<\/a><br>[7]:  <a href=\"https:\/\/developer.ibm.com\/code\/2018\/03\/21\/stream-processing-brief-overview\/\">https:\/\/developer.ibm.com\/code\/2018\/03\/21\/stream-processing-brief-overview\/<\/a><br>[8]:  <a href=\"http:\/\/www.redaktion.tu-berlin.de\/fileadmin\/fg131\/Publikation\/Papers\/Stream_Benchmarks_ICDE18-CRC.pdf\">http:\/\/www.redaktion.tu-berlin.de\/fileadmin\/fg131\/Publikation\/Papers\/Stream_Benchmarks_ICDE18-CRC.pdf<\/a><br>[9]:  <a href=\"https:\/\/flink.apache.org\/flink-architecture.html\">https:\/\/flink.apache.org\/flink-architecture.html<\/a><br>[10]:  <a 
href=\"https:\/\/storm.apache.org\/index.html\">https:\/\/storm.apache.org\/index.html<\/a><br>[11]:  <a href=\"https:\/\/storm.apache.org\/about\/simple-api.html\">https:\/\/storm.apache.org\/about\/simple-api.html<\/a><br>[12]:  <a href=\"https:\/\/spark.apache.org\/docs\/latest\/streaming-programming-guide.html\">https:\/\/spark.apache.org\/docs\/latest\/streaming-programming-guide.html<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>An overview on stream processing, common frameworks as well as some insights on performance based on benchmarking data<\/p>\n","protected":false},"author":963,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[1,223,662],"tags":[350,38,213],"ppma_author":[809],"class_list":["post-10289","post","type-post","status-publish","format-standard","hentry","category-allgemein","category-ultra-large-scale-systems","category-web-performance","tag-distributed-stream-processing","tag-distributed-systems","tag-stream-processing"],"aioseo_notices":[],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":10318,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2020\/04\/13\/open-source-batch-and-stream-processing-realtime-analysis-of-big-data\/","url_meta":{"origin":10289,"position":0},"title":"Open Source Batch and Stream Processing: Realtime Analysis of Big Data","author":"Marcel Stolin","date":"13. April 2020","format":false,"excerpt":"Abstract Since the beginning of Big Data, batch processing was the most popular choice for processing large amounts of generated data. These existing processing technologies are not suitable to process the large amount of data we face today. 
Research works developed a variety of technologies that focus on stream processing.\u2026","rel":"","context":"In &quot;Allgemein&quot;","block_context":{"text":"Allgemein","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/allgemein\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/08\/mapreduce.jpg?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/08\/mapreduce.jpg?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2023\/08\/mapreduce.jpg?resize=525%2C300&ssl=1 1.5x"},"classes":[]},{"id":3114,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2017\/09\/01\/sport-data-stream-processing-on-ibm-bluemix-real-time-stream-processing-basics\/","url_meta":{"origin":10289,"position":1},"title":"Sport data stream processing on IBM Bluemix:  Real Time Stream Processing Basics","author":"nk065@hdm-stuttgart.de","date":"1. September 2017","format":false,"excerpt":"New data is created every second. Just on Google the humans preform 40,000 search queries every second. By 2020 Forbes estimate 1.7 megabytes of new information will be created every second for every human on our planet. 
However, it is about collecting and exchanging data, which then can be used\u2026","rel":"","context":"In &quot;Allgemein&quot;","block_context":{"text":"Allgemein","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/allgemein\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2017\/09\/Real-Time-Stream-Processing-Basics_6.png?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2017\/09\/Real-Time-Stream-Processing-Basics_6.png?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2017\/09\/Real-Time-Stream-Processing-Basics_6.png?resize=525%2C300&ssl=1 1.5x"},"classes":[]},{"id":12906,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2021\/03\/11\/event-driven-architectures\/","url_meta":{"origin":10289,"position":2},"title":"Event-driven Architectures","author":"Max Merz","date":"11. March 2021","format":false,"excerpt":"Next to the powerful Request \/ Response architecture exists another architecture, the event-driven one. How this architecture works, where the differences to Request \/ Response systems are and how transactions can be realized will be part of this article. 
Events There are three types of messages that can be used\u2026","rel":"","context":"In &quot;Allgemein&quot;","block_context":{"text":"Allgemein","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/allgemein\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2021\/03\/stateful-stream-processing.png?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2021\/03\/stateful-stream-processing.png?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2021\/03\/stateful-stream-processing.png?resize=525%2C300&ssl=1 1.5x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2021\/03\/stateful-stream-processing.png?resize=700%2C400&ssl=1 2x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2021\/03\/stateful-stream-processing.png?resize=1050%2C600&ssl=1 3x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2021\/03\/stateful-stream-processing.png?resize=1400%2C800&ssl=1 4x"},"classes":[]},{"id":2153,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2017\/03\/08\/of-apache-spark-hadoop-vagrant-virtualbox-and-ibm-bluemix-services-part-3-what-is-apache-spark\/","url_meta":{"origin":10289,"position":3},"title":"Of Apache Spark, Hadoop, Vagrant, VirtualBox and IBM Bluemix Services &#8211; Part 3 &#8211; What is Apache Spark?","author":"bh051, cz022, ds168","date":"8. March 2017","format":false,"excerpt":"Apache Spark is a framework for fast processing of large data on computer clusters. Spark applications can be written in Scala, Java, Python or R and can be executed in the cloud or on Hadoop (YARN) or Mesos cluster managers. 
It is also possible to run Spark applications standalone, that\u2026","rel":"","context":"In &quot;Student Projects&quot;","block_context":{"text":"Student Projects","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/student-projects\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2017\/03\/spark-overview-768x195.png?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2017\/03\/spark-overview-768x195.png?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2017\/03\/spark-overview-768x195.png?resize=525%2C300&ssl=1 1.5x"},"classes":[]},{"id":5120,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2019\/02\/09\/observability-where-do-we-go-from-here\/","url_meta":{"origin":10289,"position":4},"title":"Observability?! \u2013 Where do we go from here?","author":"Alexander Wallrabenstein","date":"9. February 2019","format":false,"excerpt":"The last two years in software development and operations have been characterized by the emerging idea of \u201cobservability\u201d. The need for a novel concept guiding the efforts to control our systems arose from the accelerating paradigm changes driven by the need to scale and cloud native technologies. In contrast, the\u2026","rel":"","context":"In &quot;Allgemein&quot;","block_context":{"text":"Allgemein","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/allgemein\/"},"img":{"alt_text":"MEME: I always, always test my code. 
The I test it again in production.","src":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2019\/02\/meme-1.jpg?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2019\/02\/meme-1.jpg?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2019\/02\/meme-1.jpg?resize=525%2C300&ssl=1 1.5x"},"classes":[]},{"id":2165,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2017\/03\/09\/of-apache-spark-hadoop-vagrant-virtualbox-and-ibm-bluemix-services-part-6-apache-spark-andvs-apache-hadoop\/","url_meta":{"origin":10289,"position":5},"title":"Of Apache Spark, Hadoop, Vagrant, VirtualBox and IBM Bluemix Services &#8211; Part 6 &#8211; Apache Spark and\/vs Apache Hadoop?","author":"bh051, cz022, ds168","date":"9. March 2017","format":false,"excerpt":"At the beginning of this article series we introduced the core concepts of Hadoop and Spark in a nutshell. Both, Apache Spark and Apache Hadoop are frameworks for efficient processing of large data on computer clusters. The question arises how they differ or relate to each other. 
Hereof it seems\u2026","rel":"","context":"In &quot;Student Projects&quot;","block_context":{"text":"Student Projects","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/student-projects\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"jetpack_sharing_enabled":true,"authors":[{"term_id":809,"user_id":963,"is_guest":0,"slug":"am206","display_name":"Alexander Merker","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/924a5f7d2ada1019cd2bf59768132fdff7500b1f3726055a5ef6fde41cf30e95?s=96&d=mm&r=g","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":""}],"_links":{"self":[{"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/posts\/10289","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/users\/963"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/comments?post=10289"}],"version-history":[{"count":6,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/posts\/10289\/revisions"}],"predecessor-version":[{"id":25418,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/posts\/10289\/revisions\/25418"}],"wp:attachment":[{"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/media?parent=10289"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/categories?post=10289"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/tags?post=10289"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/ppma_author?post=10289"}],"curies":[{"name":"wp","h
ref":"https:\/\/api.w.org\/{rel}","templated":true}]}}