Open Source Batch and Stream Processing: Realtime Analysis of Big Data

Abstract

Since the beginning of Big Data, batch processing was the most popular choice for processing large amounts of generated data. These existing processing technologies are not suitable to process the large amount of data we face today. Research works developed a variety of technologies that focus on stream processing. Stream processing technologies bring significant performance improvements and new opportunities to handle Big Data. In this paper, we discuss the differences of batch and stream processing and we explore existing batch and stream processing technologies. We also explain the new possibilities that stream processing make possible.

1 Introduction

A huge amount of information is generated everyday by social media, e-mails, sensors, instruments and enterprise applications, to mention a few resources. This amount of data brings a lot of challenges according to volume, velocity and variety. In the past two years, 90% of all data was created and the amount of data will double every two years. This data comes in a variety of formats and types, each of it requires a different way to process the generated data [9].

Batch processing was the most popular choice to process Big Data. The most notable batch processing framework is MapReduce [7]. MapReduce was first implemented and developed by Google. It was used for large-scale graph processing, text processing, machine learning and statistical machine translation. MapReduce can process large amounts of data but is only designed for batch processing. Today’s demands rely on real-time processing of Big Data that will finish in seconds [19]. For this demand, various stream-processing technologies have been developed. In this paper we will focus on Apache Spark Streaming [22] and Apache Flink [6], which are the most famous tools for stream processing [12].

In this work we will explain the concepts of batch processing and stream processing in detail while introducing the most popular frameworks. After that we introduce new opportunities that stream processing provides to face today’s issues where a response is needed in seconds.

2 Related Work

Big Data analysis is an active area of research but comparisons of Big Data analysis concepts are difficult to find. Most research papers focus on comparing stream processing frameworks on performance. In this work we will focus on open source technologies. There are widely used proprietary solutions like Google Millwheel [1], IBM InfoSphere Streams [3] and Microsoft Azure Stream Analytics [8] we won’t discuss in this paper.

Lopez, Lobato and Duarte describe and compare the streaming platforms Apache Flink, Apache Spark Streaming and Apache Storm [15]. The work focuses on processing performance and behaviour when a worker node fails. The results of each platform are analysed and compared.

Shahrivari compares the concepts between batch processing and stream processing [19]. In detail, the work compares the performance of MapReduce and ff Apache Spark Streaming with different experiments.

Unlike the mentioned papers, we will focus on the difference between batch processing and stream processing and discuss the new opportunities of stream processing instead of comparing performance measurements.

3 Batch Processing

Batch jobs run in the background without any interaction from an operator. In Theory a batch job gets executed in a specific time window between the end of a workday and the start of the next workday to process millions of records which will take hours to execute. This time window will increase with availability requirements. Batch processing is still used today in organisations and financial institutions [10].

Individual batch jobs are usually organized into calendar periods. Common batch schedules are daily, weekly, and monthly batches. Weekly and monthly batch schedules are mostly used for technical tasks like backups, integrity checks or disk defragmentation. Functional tasks should be executed on a daily schedule. Typically jobs on a daily basis are, data processing and transferring. Organizing the batch schedule can save effort in the development cycle. To categorize a job, a simple rule of thumb is to determine if it has to do a functional or a technical task. To reduce batch execution time, performing jobs in parallel is a key factor [10].

There are two different architectures of how batch jobs should be executed: As scripts or as services. The major differences are logging and control. Batch jobs that will run as a service usually report their status through log files and can be controlled over a control panel which is provided by the system. Batch jobs that are triggered over the command line, report their progress through streams and an appropriate exit code. The batch scheduler will terminate the job if it’s necessary [10].

3.1 MapReduce

MapReduce is a programming model that enables processing and generating large amounts of data. The model defines two methods: map and reduce.

Figure 1: Pseudo code of counting the number of occurrences of each word in a large selection of documents.

The map function takes a key/value pair as input to generate an intermediate set of key/value pairs. The reduce function takes an intermediate key and intermediate values associated to that key, as input and returns a set of key/value pairs [7]. Figure 1 illustrates the MapReduce programming model of a real-world use case.

4 Stream Processing

Stream processing refers to real-time processing of continuous data [14]. A stream processing system consists of a queue, a stream processor and real time views [16].

In a system without a queue, the stream processor has to process each event directly. This approach cannot guarantee that each event gets processed correctly. If the stream processor dies, there is no way to detect the error. A cluster would be overwhelmed by the incoming amount of data it has to process. A persistent queue helps to address these issues. Writing events to a persistent queue before processing the data will buffer the events and it allows the stream processor to retry an event when it fails [16]. An example for a modern queue system is Apache Kafka [13].

The stream-processor processes incoming events in the queue and then updates the real time views. There are two models of stream-processing that have emerged in the recent years: Record-at-a-time and micro-batched [16].

Record-at-a-time stream processing The record-at-a-time processing model processes tuples independently of each other, updates the internal state and sends out new records in response. This leads to inconsistency, when different nodes process different data that arrive at different times. The model handles recovery through replication which requires twice the amount of hardware. This is not optimal for large clusters [22]. To be scalable with high throughput, the systems run in parallel across the cluster [16].

Micro-batch stream processing The micro-batch stream processing approach processes the tuples as discrete batches. A batch is processed in a strong order until completion before moving on to the next batch. To know if a batch has been processed before, each batch has its own unique identifier that always stays the same on every replay [16].

4.1 Apache Spark Streaming

Apache Spark Streaming is an extension to the Apache Spark cluster computing engine. It was developed to overcome the challenges of the record-at-a-time processing model. Spark Streaming provides a stream programming model for large clusters called discretized streams (D-Streams). In D-Streams, streaming computation will be treated as a series of deterministic batch computations on small time intervals [22].

To generate an input dataset for an interval, the received data during that interval is stored reliably across the cluster. To generate new datasets as a response, after each interval the datasets are processed via deterministic parallel operations. To avoid replication by using lineage, the new datasets will be stored in resilient distributed datasets (RDDs) [21]. A D-Stream allows users to manipulate grouped RDDs through various operations [22].

Figure 2: Each RDD contains data from a certain time interval [24].

D-Stream provides consistency, fault recovery and integration with batch systems to bring batch processing models to stream processing. Apache Spark Streaming lets users mix together streaming, batch and interactive queries to build integrated systems [22].

4.2 Apache Flink

Apache Flink is a stream-processing framework and an Apache top-level project. The core of Apache Flink is a distributed streaming data-flow engine which is optimized to perform batch and stream analytics [6]. The distributed streaming data-flow engine executes programs called dataflow graphs which can consume and produce data [4].

Dataflow graphs consists of stateful operators and data streams. The stateful operators implement logic of producing or consuming data. Data streams distribute the data between all operators. On execution Dataflow graphs parallelize operators into one or more instances called subtasks and split streams into one or more stream partitions [6].

Figure 3: Code showing the Apache Flink dataflow programming model [23].

Apache Flink is a high-throughput, low-latency streaming engine and optimized for batch execution using a query optimizer [6]. Dataflow graphs are optimized to be executed in a cluster or cloud environment [20].

5 Stream Processing Opportunities

Batch processing is still needed for legacy implementations and data analysis where no efficient algorithms are known [6]. Nevertheless, stream processing offers new opportunities to face issues where the result is needed in seconds instead of hours or days.

5.1 Machine Learning

Machine learning for Big Data is dominated by online machine learning algorithms. In streaming there is a need for scalable learning algorithms that are adaptive and inherently open-ended [11]. This makes online machine learning optimal for stream processing where the algorithm has to adapt new patterns in the data dynamically.

Apache Flink Apache Flink brings together batch processing and stream processing. This makes Apache Flink very suitable for machine learning [11]. Apache Flink provides the machine learning library FlinkML. FlinkML supports the PMML standard for online predictions [5].

Apache Spark Apache Spark provides a distributed machine learning library called MLlib. MLlib provides distributed implementations of learning algorithms that can serve (but not limited to) linear models, naive Bayes, classification and clustering. MLlib can be integrated with other high-level libraries, for example Apache Spark Streaming. Apache Spark Streaming enables the development of online learning algorithms with MLlib on realtime data streams [17].

Detecting cases of fraud is an ongoing area of research. A study from 2016 estimated, that credit card fraud is responsible for over 20 billion dollars in loss worldwide [18]. It is important to detect credit card fraud immediately after a financial transaction has been made. Today, credit card fraud can be detected with supervised or unsupervised machine learning models [2]. For an instant detection, online machine learning on realtime data stream serves the needed technology to face this issue.

6 Conclusion

This paper explains the two data analysis concepts batch processing and stream processing. Since realtime analysis is needed to face the issues of today’s demands, batch processing is still being used for legacy implementations and data analysis where no efficient algorithms are known. Stream processing offers new opportunities to handle big data and response with an immediate result to the user.

References

[1] Tyler Akidau, Alex Balikov, Kaya Bekiroglu, Slava Chernyak, Josh Haberman, Reuven Lax, Sam McVeety, Daniel Mills, Paul Nordstrom, and Sam Whittle. Millwheel: Fault-tolerant stream processing at internet scale. In Very Large Data Bases, pages 734–746, 2013.

[2] Bart Baesens, Veronique Van Vlasselaer, and Wouter Verbeke. Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques: A Guide to Data Science for Fraud Detection. Wiley Publishing, 1st edition, 2015.

[3] Chuck Ballard, Kevin Foster, Andy Frenkiel, Bugra Gedik, Michael P. Koranda, Deepak Senthil, Nathanand Rajan, Roger Rea, Mike Spicer, Brian Williams, and Vitali N. Zoubov. Ibm infosphere streams: Assembling continuous insight in the information revolution. IBM Redbooks publication, 2011.

[4] Ilaria Bartolini and Marco Patella. Comparing performances of big data stream processing platforms with ram 3 s (extended abstract).

[5] Andr´as Bencz´ur, Levente Kocsis, and R´obert P´alovics. Online machine learning in big data streams. 02 2018.

[6] Paris Carbone, Asterios Katsifodimos, † Kth, Sics Sweden, Stephan Ewen, Volker Markl, Seif Haridi, and Kostas Tzoumas. Apache flink TM : Stream and batch processing in a single engine. IEEE Data Engineering Bulletin, 38, 01 2015.

[7] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: A flexible data processing tool. Commun. ACM, 53, 01 2010.

[8] Charles Feddersen. Real-time event processing with microsoft azure stream analytics. Jan 2015.

[9] Mugdha Ghotkar and Priyanka Rokde. Big data: How it is generated and its importance.

[10] Dave Ingram. Design – Build – Run: Applied Practices and Principles for Production-Ready Software Development. Wrox, 2009.

[11] W. Jamil, N-C. Duong, W. Wang, C. Mansouri, S. Mohamad, and A. Bouchachia. Scalable online learning for flink: Solma library. In Proceedings of the 12th European Conference on Software Architecture: Companion Proceedings, ECSA ’18, New York, NY, USA, 2018. Association for Computing Machinery.

[12] J. Karimov, T. Rabl, A. Katsifodimos, R. Samarev, H. Heiskanen, and V. Markl. Benchmarking distributed stream data processing systems. In 2018 IEEE 34th International Conference on Data Engineering (ICDE), pages 1507–1518, April 2018.

[13] Jay Kreps. Kafka : a distributed messaging system for log processing. 2011.

[14] Anuj Kumar. Architecting Data-Intensive Applications. Packt Publishing, 2018.

[15] M. A. Lopez, A. G. P. Lobato, and O. C. M. B. Duarte. A performance comparison of open-source stream processing platforms. In 2016 IEEE Global Communications Conference (GLOBECOM), pages 1–6, Dec 2016.

[16] Nathan Marz and James Warren. Big Data: Principles and best practices of scalable realtime data systems. Manning Publications, 2015.

[17] Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, and et al. Mllib: Machine learning in apache spark. J. Mach. Learn. Res., 17(1):1235–1241, January 2016.

[18] David Robertson. The nilson report, issue 1096. Oct 2016.

[19] Saeed Shahrivari. Beyond batch processing: Towards real-time and streaming big data. Computers, 3, 03 2014.

[20] Daniel Warneke and Odej Kao. Nephele: Efficient parallel data processing in the cloud. In Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers, MTAGS ’09, New York, NY, USA, 2009. Association for Computing Machinery.

[21] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauly, Michael J. Franklin, Scott Shenker, and Ion Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Presented as part of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), pages 15–28, San Jose, CA, 2012. USENIX.

[22] Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, and Ion Stoffi ica. Discretized streams: An e cient and fault-tolerant model for stream processing on large clusters. In Proceedings of the 4th USENIX Conference on Hot Topics in Cloud Ccomputing, HotCloud’12, pages 10–10, Berkeley, CA, USA, 2012. USENIX Association.

[23] Dataflow Programming Model. https://ci.apache.org/projects/flink/flink-docs-release-1.2/concepts/programming-model.html

[24] Discretized Streams (DStreams). https://spark.apache.org/docs/latest/streaming-programming-guide.html#discretized-streams-dstreams

Distributed stream processing frameworks – what they are and how they perform

This blog aims to provide an overview about the topic of stream processing and its capabilites in a large scale environment. The post starts with an introduction to stream processing. After that, it explains how stream processing works and shows different areas of application as well as some common stream processing frameworks. Finally, this article will provide a performance comparison of several common frameworks based on benchmarking data.

Continue reading

Apache Kafka – The one Stream Processor to rule them all?

If there is one statement that can be made about the current developments in the realm of distributed systems, it would probably be how most developers are turning away from a centralised, monolithic architecture and move towards a microservice architecture. This type of architecture proved itself as much more flexible and robust for the modern world where more and more software is offered as a cloud-based solution. By splitting up systems into smaller parts, they can be updated more easily and crashed services can be recovered faster. These services can be containerized with Docker, so quickly putting up and pulling down parts of the infrastructure became very easy. On most occasions it is simply less work to trash a running software instance and recreate it, than logging into the instance and trying to fix what is broken.

Continue reading

Federated Learning

The world is enriched daily with the latest and most sophisticated achievements of Artificial Intelligence (AI). But one challenge that all new technologies need to take seriously is training time. With deep neural networks and the computing power available today, it is finally possible to perform the most complex analyses without need of pre-processing and feature selection. This makes it possible to apply new models to numerous applications. The only thing needed is tons of training data, a good model and a lot of patience during training.

Continue reading

End-to-end Monitoring of Modern Cloud Applications


Introduction

During the last semester and as part of my Master’s thesis, I worked at an automotive company on the development of a vehicle connectivity platform. Within my team I was assigned the task of monitoring, which turned out to be a lot more interesting but at the same time way more complex than I expected. In this blog post, I would like to present an introduction to system health monitoring and describe the challenges that one faces when monitoring a cloud-based and highly distributed IoT platform. Following that I would like to share the monitoring concept that was the result of my investigation.

Monitoring

When developing an application, monitoring usually is not the first topic to come up to the developers or managers. During my time in university, whenever we had a fixed date to present results as part of a group’s project, me and my colleagues usually shifted our focus towards feature development rather than testing and monitoring. The need for monitoring however, becomes extremely important as soon as you have to operate a software system. Especially when maintaining distributed systems which cannot be debugged as traditional, monolithic programs the need for enhanced monitoring grows. The biggest difficulty with monitoring, is that it has to be designed specifically for the application that is needs to be monitored. There is no one-size-fits-all solution for monitoring. Within the scope of monitoring, there are a lot of different use cases to focus on. When talking about operating a software platform, our focus relies on system health monitoring.

System health monitoring

Monitoring has a different meaning depending on the domain it is being employed in. Efforts when monitoring software systems can put special interest into topics concerning security, application performance, compliance, feature usage and others. In this post I would like to focus on system health monitoring, following the goal of watching software services’ availability and being able to detect unhealthy states that can lead to service outages.

Even when reduced to health monitoring, one implementation of monitoring can differ strongly from another, depending on how different the systems, the processes, the tools and the users and stakeholders of that monitoring information are defined. The basis for monitoring is constantly changing due to the introduction of new technologies and practices. Therefore, a state of full maturity for monitoring tools can hardly ever be reached.

Monitoring Tools

Monitoring tools were created and adapted to fit for the domain of the systems that had to be monitored. Monitoring requirements are substantially different when monitoring clusters of homogeneous hardware running the same operating systems to those when managing the IT-infrastructure of a middle to large-size company with heterogeneous hardware (e.g. databases, routers, application servers, etc.).

If you are building a new software application or have been assigned the task of maintaining one, here are a few example tools that focus on different aspects of monitoring that could be interesting to try out.

  • Collectd – Push based host metric collection: Collectd is able to collect data from dynamically started and stopped instances. It works by periodically sending host metrics from preconfigured hosts to a central repository. By being a push-based system, it works also for short-lived processes. [2]
  • StatsD – Application level metric collection: StatsD was designed as a simple tool that collects traces and sends them via UDP to a central collector, where they can be forwarded to a collection and visualization tool like Graphite to be used to render graphs that are displayed in dashboards. [3]
  • The Elastic-Stack – Log management and analysis: When applications are hosted inside containers running on separate hosts, it is not only difficult to extract a containers low-level metrics, but also to extract and collect application-level metrics and logs. The standard open-source solution for this problem is the Elastic-Stack.
  • Riemann – Monitoring based on Complex Event Processing: Riemann differentiates itself from other monitoring tools in that it introduces event stream processing techniques to monitoring. Riemann works by processing events that are pushed by many distinct sources and, thanks to its push-based architecture, provides high scalability.
  • New Relic – Monitoring solution as SaaS: New Relic is one of the more recent SaaS solutions that have emerged in the past years. The feature of MaaS (Monitoring-as-a-service) is that the infrastructure and services needed to implement monitoring tools is abstracted away from operations teams. MaaS systems like New Relic can be configured to watch the infrastructure running services and receive monitoring data from the monitored objects either via data pushes directly or by installing agents that run inside hosts. Apart from collecting and processing logs, New Relic provides a browser based interface to access data and create customized dashboards as well as features like auto-detecting anomalies with the use of machine learning.
  • Zipkin – Distributed tracing: Zipkin, among other distributed tracing solutions, focuses on tracking requests on a distributed system in order to inspect the system for errors or performance bottlenecks and to enable troubleshooting in architectures where multiple components are being employed.
    Zipkin supports the OpenTracing API. OpenTracing is a project aimed to introduce a vendor-neutral tracing specification. Its objective is to standardize the tracing semantics to allow for better distributed tracing capabilities across frameworks and programming languages. Trace collectors can implement the OpenTracing API and can thereby be employed in large scale distributed systems, in which system subcomponents are written by different teams in different languages.

If your project if focused more on infrastructure components, it would be interesting to take a look at collectd. For a smaller web application, I would recommend to build in StatsD and focus on tracking relevant business KPIs from inside the relevant functions within the code. For any software project I would recommend to employ a log management system like the Elastic-Stack or an alternative. Since its an open source solution, there are no licence costs attached and it can notably reduce troubleshooting time when debugging your application, specially if your application relies on a microservice architecture. For start-ups with smaller cloud projects going to production, I would recommend trying out a Monitoring-as-a-service solution to analyze your application usage. Especially for smaller businesses going to the cloud, finding out how your customers are using your application is crucial. Discovering which functions and to what extent resources are being utilized can enable operations teams to optimize the cloud resource allocation while providing the business with important feedback about feature reception.
If you are building a complex distributed architecture involving many different components, it might make sense to evaluate more versatile tools like Riemann and take a look at OpenTracing.

Monitoring Strategy

Unarguably, the employment of the right tools or solutions to monitor a distributed system is crucial for the successful operation of it. However, effective monitoring does not only focus on tooling, but rather incorporates concepts and practices derived from the experience of system operations in regard to monitoring, and to some extent, to incident management. If your task is to monitor a software system, the best way to start is to define a monitoring strategy. When doing so, it should consider the following concepts:

  • Symptoms and causes distinction – The distinction of symptoms and causes is especially important when generating notifications for issues and defining thresholds for alerting. In the context of monitoring, symptoms are misbehavior from a user perspective. Symptoms can be slow rendering web pages or corrupt user data. Causes in contrast are the roots of symptoms. Causes for slow rendering web pages can be unhealthy network devices or high CPU load on the application servers.
  • Black-box and white-box monitoring – Black-box monitoring is about testing a system functionality and checking if the results are valid. If results are not what was expected, an alert is triggered. White-box monitoring is about getting information about a component’s state and alerting on unhealthy values. While white-box monitoring can help the operations personnel find causes, black-box monitoring is used to detect symptoms.
  • Proactive and reactive monitoring – In relation to black-box and white-box monitoring are the concepts of proactive and reactive monitoring. Whereas reactive monitoring is concerned about collecting, analyzing and alerting on data that can reflect an outage of the system, reactive monitoring emphasizes on detecting critical situations within the monitored object that can lead to a service performance degradation or complete unavailability. While the results from black-box monitoring can be useful to detect problems affecting users, white-box monitoring techniques provide more visibility into a system’s internal health state. With careful analysis of trends on internal metrics, a prediction of imminent failures can be made, which can either trigger a system recovery mechanism, an auto-scale of system resources or alert a system operator.
  • Alerting – Alerting is a crucial part of monitoring. Alerting however, does not necessarily involve paging a human being. The latter should mostly remain the exception and only be employed when services are already unavailable or failure is imminent without human action. For events that require attention, but do not involve failing systems, alerts can simply be stored as incidents in a ticketing system, to be reviewed by the operations team when no other situation requires their attention. In case of pages, they should be triggered as sparsely as possible, since constantly responding to pages can be frustrating and time consuming for operations employees. When defining thresholds to be notified on, new approaches propose to aggregate metrics and generate alerts on trends, instead of watching simple low-level metrics, e.g. watching for the rate in which available space on a database is filling up, instead of its current free space at a certain point in time. This concept relates to focusing on symptoms rather than causes when alerting.

When implementing the monitoring strategy for your project, it’s best to start with black-box monitoring, covering the areas that affect your users the most (such as being able to log in to your system). When defining alerts, focus should rely on symptoms rather than causes, since causes might or might not lead to symptoms. Also, if your are defining alerts, always focus on the right choice of recipients. Different groups of project employees might be interested on different types of alerts or different granularity. Alert the business about decreasing page views and not about cluster certificates that will expire soon.

Real-life scenario: monitoring a cloud-based vehicle connectivity platform

The connectivity platform I worked on provides infrastructure and core services to develop connectivity services for commercial vehicles. These services enable users to open the vehicles’ doors remotely via smartphone or displaying the position and current state of a vehicle on a map. The platform provided a scalable backend which handled the load of incoming vehicle signals and provided it as JSON data from APIs. It also included frontend components which fetched data from the provided APIs and displayed it in a smartphone in the custom app or in a browser as a web application. The platform was built on top of Microsoft Azure and was based on the IoT reference architecture proposed by Microsoft in the Azure documentation. It was built on a highly distributed microservice architecture that made extensive usage of Azure’s products, both SaaS and PaaS. Its core was made of an Azure Service Fabric cluster running all microservices that implemented the business logic of the connectivity services. The communication between microservices was established either synchronously via HTTP or asynchronously using Azure’s enterprise messaging component Azure Service Bus. In order to implement most use cases the platform’s components also had to communicate with various company-internal systems, as well as third party software services that implemented specific functionality, such as user management.

Monitoring challenges

In order to reach a basic level of monitoring in the platform, at least each component of the platform, which was not being fully managed by Microsoft or a third-party service provider (e.g. a SaaS- component), had to be monitored by the operations team. As stated above, special care was put in the choice of components that demanded the lowest level of administration effort of the aforementioned team. That led to an increased usage of PaaS-components instead of Virtual Machines or other infrastructure alternatives.

Even though the infrastructure of PaaS-components does not need to be managed by the cloud consumer, there are some aspects when using PaaS-components that can generate errors in runtime and must, for this reason, be constantly monitored by the operations team. Some examples are:

A queue in the Service Bus, filling up rapidly because a microservice is not keeping up with the number of incoming messages.

An Azure Data Factory Pipeline that stopped running because a wrong configuration change removed or changed a database’s credentials and the pipeline is not able to fetch new data.

The former scenario can be easily detected, since Azure provides the resource owner with many metrics regarding the high-level usage of each component. In this case, a counter of unread messages is available in the Azure Portal or via Azure Metrics. Since the counter is available in the portal, widgets displaying its data can be pinned to a dashboard and alerts can be created for a given threshold on this metric.

In the data pipeline scenario, an outage can be detected by setting alerts on the activity log, which is the only place where occurring errors within Data Factories are reported. However, an aspect adding risk to the latter scenario is that detecting that a pipeline stopped running is especially difficult when testing the system from the user’s perspective, since it is only reflected by outdated or incorrect data, instead of directly throwing an error.

In this sense, the operations team of the platform should always be notified in the case of a stopped pipeline because user complains might be misleading. Also, the advantage of using PaaS components and having the underlying infrastructure abstracted away from the user comes with the disadvantage that all available logs and metrics, from which monitoring events can be derived, are defined by the component’s provider and cannot be easily expanded, customized or complemented with own information.

The monitoring strategy

When defining the monitoring strategy for the platform from a technological point of view, the first decision was to make extensive use of the SaaS monitoring solutions already provided by Microsoft to monitor applications and infrastructure hosted inside of Azure, before introducing external proprietary solutions or deploying open-source monitoring tools to watch the platform. The tools employed were Azure Monitor, Azure Log Analytics and Azure Application Insights. Implementing white-box monitoring was simpler due to the seamless integration of tool provided by Microsoft to monitor Applications on Azure. Implementing black-box monitoring to be able to tell that the most important functions of the platform were available was way trickier. The next section explains the concept developed as part of my thesis.

End-to-end black-box monitoring

Why monitoring single components is not enough

The problem statement above only describes the challenges of monitoring each component on its own. The services offered by the platform however, are provided only as a result of the correct functioning of a specific set of these subcomponents together. In many cases, the platform’s subcomponents receive messages, process them and then forward them via a synchronous or asynchronous calls to the next subcomponents, forming complex event chains. These event chains can also include third party systems operated within the same organization or third parties, which handle specific technical areas of the connected vehicle logic. From an operations’ perspective, the most valuable information is to know whether a specific, user-facing service is running at the moment or if it is facing technical disruptions. In this case, not only knowing if a service is running in the sense of a passing an integration test is important, but rather having further insights into the system (including its complex event chains) and which part of it is failing to process the requested operation.

This observability can be partially provided by monitoring the platform’s subcomponents on their own, but not entirely. Here is one example:

In case a Service Bus queue filles up, the cause of messages missing in the frontend can be detected by alerting on a threshold on the involved Service Bus queue’s length.

However, here is an example of a case in which finding out the root cause of a service disruption might become less obvious:

A configuration change from a deployment altered the expected payload of a microservice, and incoming asynchronous messages are being ignored by that microservice for being badly formatted.

The error would only be visible by examining the logs generated by the service. If the message drop is not being actively logged by that specific microservice, there is no way for the operations team to find out on which part of the system the messages are being lost. This
observability problem is increased by the fact that PaaS components do not generate logs for every message being processed. That means the last microservice which created logs including the message’s ID is the last point of known correct functioning of the event chain.

This case might not seem probable, but most of the disruptions happening in a production environment come from unintentional changes to the configuration made by an update. Also, as the architecture becomes more and more complex and further components are added to support the main services of the chain, the task of troubleshooting the platform grows in complexity. An end-to-end black-box monitoring system, which monitors the core services of the platform, would substantially reduce effort when searching for misbehaving components since it would alert the operations team as soon as one of the core services is not available. The alert would as well include information about the exact subcomponent responsible for the outage.

What we need is Event Chain Monitoring

The core of the proposed solution is based on event correlation. In this case, an own definition of events and their aggregation was used. The core ideas are inspired by the Complex Event Processing paradigm but its full complexity would go beyond the scope of this blog post. The use of events creates an abstraction level between incoming signals and relevant information for monitoring. Sources of events can be either HTTP requests triggering a component or logs being added to the log repository of the platform or any kind of process that can be converted to an event. To map data to an event, it has to contain information that can be derived to at least the following event metadata:

  1. Event source
  2. Event timestamp
  3. Tracking ID
  4. Event type

The event source is used to map the event to the component which originally generated the event. The timestamp contains the time in UTC at which the event was generated or processed. The tracking ID is used to correlate events that belong to the same action. Using a tracking ID, all events originated by the same operation can be grouped together. The event name states the type of event that was triggered, such as data from vehicle received at gateway or message processed by microservice A.

When triggered, all events have to be routed to central repository for analysis and aggregation. From this central repository, single events can be analyzed for patterns that indicate system failures.
Defining the model for a fully functioning event chain is then simple. It is calculated with a query as the aggregation of any number of events and a syntax stating the order and time frames in which the events need to reach the central repository. The exact count of events and their order depends on the level of granularity to which the event chain is going to be monitored.
A visualization of the proposed solution is presented in the following diagram using a sample event chain:

After the event chain is modelled in the monitoring web application, the flow of events can be tracked. First, signals have to be generated. Since generating real signals would involve driving around in a real car connected to the backend, a vehicle has to be simulated, thus the first event would come from a vehicle simulator. After vehicle data events start being pushed to the platform, the different components of the system will start processing the incoming data and interacting with other subcomponents. For each subcomponent, the telemetry capabilities have to be defined. If it supports generating an event for every data signal and sending it to the central event repository or if the event hast to be generated out of the logs or other information sources that are filled by this subcomponent. Depending on the type of components and the level of administration needed, there is more or less information available. If crucial information is missing from a component, a helper or observer unit (here the “observer” Azure Functions) has to be added to the chain, to provide the missing information.

After events are matched, their results can produce other, higher-level events. These high-level events can be consumed by a monitoring dashboard, or a ticketing system. If consumed by a custom dashboard, monitoring information can then be presented by visually showing a graph of the subcomponents of the system. Depending on the patterns matched by predefined queries, different parts of the event chain, that get recognized as failing, can be highlighted in the dashboard.
In addition to displaying a graph of the event chain, the operations team can subscribe to changes to the use-cases health state. If a disruption of a core service is registered, a text message can be sent to the person on-call. Less crucial incidents can be tracked in a ticketing system for further investigation by the operations team. A further advantage of having the event chains displayed visually to the operations team is that even without deep knowledge of the platform, the operations team is able to reason about symptoms visible by users of the platform and track back many incidents that might have the same origin. That is, the operations team would get direct visibility into the state of the subcomponents of the platform and, in the case of an outage, the team gets direct information about the outage’s repercussion on the operability of the system.

Conclusion

The implementation of modern software architecture patterns like microservices combined with the use cloud services beyond IaaS facilitates a faster development of software services. Despite the reduced effort to manage infrastructure when deploying applications to the cloud, being able to monitor the resulting highly distributed systems is becoming an increasingly complex task. On top of that, customers of modern online applications expect short response times and high availability, which raises the need for accurate monitoring systems. In this post I provided a short introduction to monitoring and an overview of current monitoring tools you can test in your projects. However, I also described how a well-thought monitoring strategy is more important than tools in order to increase your system’s availability. In the second half of this post I provided a real-life example scenario and depicted the challenges related to monitoring a distributed IoT-platform. I then presented the concept that was developed in order to reach a better state of observability into the platform.

References

Some of the ideas summarised in this blog post were originally presented in the following articles or books. They are also excellent reads in case the topic of monitoring got your interest.

  • B. Beyer, C. Jones, J. Petoff, and N. R. Murphy, Site Reliability Engineering: How Google Runs Production Systems. “ O’Reilly Media, Inc.,” 2016.
  • M. Julian, “Practical Monitoring: Effective Strategies for the Real World,” 2017.
  • J. Turnbull, The art of monitoring. James Turnbull, 2014.
  • I. Malpass, “Measure Anything, Measure Everything,” 2011. link.
  • J. S. Ward and A. Barker, “Observing the clouds: a survey and taxonomy of cloud
    monitoring,” J. Cloud Comput., vol. 3, no. 1, p. 24, 2014.
  • B. H. Sigelman et al., “Dapper, a Large-Scale Distributed Systems Tracing Infrastructure,” 2010.

The Elixir Programming Language

elixir

An introduction to the language, its main concepts and its potential.

The number of security incidents has been on the rise for years, and the growth of the Internet of Things is unlikely to improve the situation. Successful attacks on all kinds of interconnected smart devices, from car locks over home security systems to highly automated factories, have already been demonstrated and carried out.

An aspect contributing to the bad state of security is the use of programming languages which are highly prone to vulnerabilities, namely C/C++. With the increased interconnection of systems which could pose a danger to life if disrupted, for example vehicle controls or critical infrastructure, there is a need for programming languages which provide a higher level of security „by default“.

Continue reading