Monitoring with the Elastic Stack

Netflix, Stack Overflow, and LinkedIn are just a few of the companies actively using the Elastic Stack. Use cases include performance and security monitoring as well as log analytics. [1] The Elastic Stack consists of several products, the most popular being Elasticsearch, Kibana, Logstash, and Beats. This article introduces the different products and explains the architectural concepts behind Elasticsearch that make it the scalable and highly available search engine it is today.

Continue reading

Corona Classrooms: A Look at the Educational Platform Chaos in Germany

In the state election in Baden-Württemberg, corona and education are top issues. No wonder: last year, schools and universities across Germany, as in many places worldwide, had to switch to online teaching at short notice. As a result, a multitude of software tools for so-called distance learning saw an extreme increase in usage in this country. Yet hardly a school week goes by without reports of outages and other problems with online teaching. So it makes sense to take a closer look at the tangled field of educational platforms in Germany. Which tools do schools use here? Who operates the systems? Where are the biggest problems, and how could numbers help us with scaling considerations?

Continue reading

Black Swans in IT Systems and the Outages of 2020

For over a year, the corona pandemic has dominated daily news coverage and imposed restrictions of many kinds. Contacts are reduced to a minimum, restaurants and large parts of the retail sector are closed, and leisure and cultural events are cancelled. People spend more time than ever within their own four walls and use IT systems to work from home, study on learning platforms, or maintain social contacts through video conferencing systems. The cloud-based IT systems required for this are experiencing loads that were previously hard to imagine. This additional demand, and the scaling of the systems it requires, led to outages in 2020 from which even the big players of cloud computing were not spared. A Microsoft Azure disruption in March was followed in November by the failure of several AWS services of cloud market leader Amazon, before numerous Google services such as YouTube or Google Drive were unreachable for several hours in December. Possible causes of such outages were already addressed in 2018 in Laura Nolan's USENIX conference talk “What Breaks Our Systems: A Taxonomy of Black Swans” and categorized into six patterns [2]. The following blog post presents these categories, the causes of black swans in IT systems (page 1), classifies several outages from the past year into them (page 2), and points out possible prevention measures (page 3).

Continue reading

KISS, DRY ‘n SOLID — Yet another Kubernetes System built with Ansible and observed with Metrics Server on arm64

This blog post shows how a plain Kubernetes cluster is automatically created and configured on three arm64 devices using the orchestration tool Ansible. The main focus is on Ansible; other components that set up and configure the cluster are Docker, Kubernetes, Helm, NGINX, Metrics Server, and Kubernetes Dashboard. Individual steps are covered in varying depth; the whole procedure follows three principles:

Continue reading

How to Scale Jitsi Meet

In today's world, video conferencing is becoming more and more important – be it for learning, business events, or social interaction in general. Most people use one of the big players like Zoom or Microsoft Teams, both of which have their share of privacy issues. However, there is an alternative approach: self-hosting open-source software like Jitsi Meet. In this article, we are going to explore the different scaling options for deploying anything from a single Jitsi server to a sharded Kubernetes cluster.

Continue reading

Finding the right strategy for your way to the cloud

– Abstract –

This article is about finding the right strategy if you want to empower your business with cloud computing. It is written for those of you who are still new to the topic and want a first glimpse through the dense cloud mystery. After reading, you should have gained an orientation on how to start your first cloud project and a basic knowledge of the industry's most important buzzwords.

“The essence of strategy is choosing what not to do”
– Prof. Dr. Michael E. Porter, Harvard Business School,
co-founder of modern strategic management


It’s stormy outside

It has been over a decade since Amazon stirred up the market with its web services, and cloud computing is now more mainstream than ever. What for many people started with file-sharing and backup services like Dropbox or company-internal NAS systems quickly transformed into a trending global topic. Why should you keep upgrading local hardware and storage to stay competitive if you can have the same or even greater performance with shared resources, reduced costs, and accessibility from everywhere?

Some understood that earlier than others and can already harvest the fruits of their work – and some, hopefully not you, should seriously think about protecting their house, because the storm is not over yet! With the introduction of cloud gaming services like GeForce Now and the launch of projects by the major players in the market, it is becoming clear that cloud computing is not just a hype. Quite the reverse: anything seems possible, and any upcoming business transformation should consider the opportunities of cloud computing. And who knows, maybe one day you will even sell your computing power as part of a worldwide cloud, as introduced by crypto-currency miners or solar tech companies?

Faster, Greater, Higher! The sky’s the limit!

Now that enough time has passed for the first CTOs and CIOs to look back on their early experiences with moving to cloud services, we can wrap up some key facts – in case you are still looking for a few good arguments to convince you and your team to move to the cloud.

Scalability with minor effort

  • Overwhelming traffic is not a show stopper anymore
  • Auto-scaling infrastructure is auto money for your pocket!
  • Reduce the multi-region infrastructure challenge

Security

  • Cloud storage carries fewer risks than lost devices!
  • Stop worrying about underlying infrastructure security & maintenance
  • Let the disaster come! The well-proven disaster recovery systems of major
    cloud providers are probably better than yours!

Speed & Performance

  • Increase your server-side performance on-demand as you go
  • Handle request peaks without costly overprovisioning your hardware

Reduced Costs = Greater Profit

  • Pay-as-you-go is better than buy-and-deprecate!
  • Financial advantages through transformation from CAPEX to OPEX
  • Makes your financial model a lot easier to plan and calculate
  • High chance to reduce your Time to Market 
  • Emerging markets can perhaps make you even fly higher

Agility & User Experience

  • Keep your team focused on their key competences
  • Increase your team's performance with the latest collaboration tools
  • Satisfy your team by giving them a state-of-the-art feeling
  • Allow more freedom regarding physical presence

Ecological Footprint

  • Less e-waste from deprecated hardware
  • Savings on power supply and emissions

And if you are doing it right, you gain a lot of FLEXIBILITY, which is more important nowadays than ever! Our technological environment changes so fast that you may be outdated before you finish your project – OK, probably not that fast – but you should be prepared! If your cloud provider can't handle your SLA or becomes outdated itself, you may move on to a new provider. But be aware: cloud providers want to keep you as a customer, for sure! How that may end badly for you is discussed in the next section.

Pitfalls you better leave for others

If you are convinced by the advantages of the cloud now, here are some pain points where others have stumbled – keep them in mind and don't make the same mistakes again!

Readiness of Organization
One of the greatest mistakes you can make while dreaming of your cloud project is to underestimate the know-how that is necessary to do the job right. Many teams stumbled over the fact that their ideas and plans were great but they forgot about the people who finally have to do the job. Your change management team also has to be aware that going to the cloud can have a negative impact on other teams, so other teams in your company may start fighting against the change. Imagine, for example, that your customers now have direct access to your order and payment system – will your sales team like that? Your change can also affect your hierarchical structure, and long-cherished employees may lose their major value for the team. So it is quite likely that not everybody in your organization will like your project – be aware of that!

Readiness of Application
It doesn't matter whether you are developing a new application or transforming an existing one: you have to keep in mind the drawbacks that can come along with cloud services. Too many technical leaders made the mistake of thinking a simple lift-and-shift strategy would work out, and they had to learn the hard way that some refactoring is always necessary. This can start with underestimating your cloud provider's limits, continue with disregarding latency that destroys your user experience, or involve dependency libraries that are restricted from use in the cloud (rare, but reported). So you had better draw up a new concept first and think from the very beginning about which components can be reused and in which way. Think about your build steps, your deployment, your test environment, your architecture, your current infrastructure, previous concurrency issues, and the clusters you want to use – convince yourself first, not later!

Data Privacy & Compliance
Especially with the new data privacy regulations that took effect in 2018, many companies were thrown into disarray and faced big question marks over how to continue with their services. You can count yourself lucky to start your cloud project after this disrupting wave, but you shouldn't assume that the current data privacy agreements will last forever; this was rather just the beginning of regulations that had been missed since the beginning of our digital lifestyle. Data privacy for your customers and data protection imposed by your company compliance will harden your way to the cloud – that's for sure!

Remember your Business Idea!
Whether you start a new project or lift some existing functionality to the cloud: if it creates advantages, any motivated employee will generate more and more ideas on how to improve and scale the impact of your cloud movement. If you impress your environment, you can be fairly sure that it won't take long until the first ideas from other teams or team members land on your desk! So keep your primary business goals in mind and finish them first before you start working on other issues. If you are a technical leader, or the one who pushes the cloud movement forward, take the advice to reflect frequently and check whether the tasks you have created actually add value to your business. Too many become starry-eyed about cloud transformations and elevate them to a holy grail, which they certainly are not! So think twice about whether the current steps are really necessary and profitable for you and your team!

Vendor Lock-In
It's too sad to be true, but in reality it's a fact: we have two strongly diverging mindsets in our industry. There are those who believe everything should be free and open sourced, and the others, who try to catch you in their cage and will never let you go. Be aware that it is common among cloud providers to try to trap you with special offers, extra support, personal training sessions, and individual implementation opportunities, just to keep you as a customer as long as possible. If you start to implement services based on proprietary interfaces, it will be harder for you to relocate your codebase one day. And if you build up special knowledge customized to one provider's services, you may wake up one day and realize that this knowledge is now worthless. In other words: if their ship goes down, you'll go down too! So take care to avoid a vendor lock-in as much as you can!

Cloud Provider Evaluation
Especially those who are new to this topic may stumble across a thing called an SLA (Service Level Agreement), which defines the contract you sign with your cloud provider. If you don't know exactly what it contains and where the borders of your cooperation are, you may run into trouble sooner or later. It is good advice to get somebody on your team who has already gathered some experience in the cloud environment and can protect you from repeating the mistakes of others. It's not about missing some contract details; it's about finding the right values and parameters that fit your application and, ultimately, your business case. As long as your project works fine, you will probably rarely get in touch with your SLA. But once your customers or your boss are breathing down your neck – because of an unexpected downtime, drastically rising costs due to a misconfiguration, a security gap, or even a disaster with a never-ending recovery time – you will understand why knowing your SLA details is important. Take it as good advice, especially at the beginning of the cooperation: measurement and control are better than relying on trust. So the job is to monitor and measure as much as you can until you get a good feeling for what is going on and how much responsibility you can take for it. Expect the unexpected!

Your guide through the jungle

Now that you are familiar with the greatest advantages and the biggest show stoppers, we can start to create a cloud strategy that fits your needs. Get your weapons locked and loaded and follow me through the jungle!

What do you want to create?

Remember your business goals, and remember how others have lost the trail – you don't want the same to happen to you, right? So define your project, and especially your project scope, precisely. What do you want to create? What do you want to offer to your customer? Who is your customer and what do they really need? Think about your use cases and match them accordingly. To make it a little more confusing, maybe you have heard about XaaS or *aaS, one of the most blown-up cloud buzzwords ever! What once started with Software as a Service (SaaS), which basically means software delivered via cloud services, later transformed into a term that is used for almost any kind of service: Firewall as a Service, Payment as a Service, Analytics as a Service, and even non-technological ones. But the idea behind the service-oriented model is great! So if you want to market your later product right, think in terms of services! Remember how you changed your business model from CAPEX to OPEX – your future customers want to have that too! They don't want to buy a product; they have a goal, and they are looking for the right service to reach their target as fast and as well as they can. That holds for any kind of customer. So you are serving them your best to let them reach their best!
The other way around, you have to think in the same terms for your own goals. But this is a matter of your skills, your compliance, and the amount of responsibility you want to share. Netflix, for example, thought they could fly high and even higher, so they built up their own infrastructure and data centers – until they had to realize that they could never do this job any better than Amazon does. From that point on, they focused on what they really want to do: delivering awesome series! So think about the services you want to deliver and what kind of services would help you, short term and long term. Envision your project's growth stages from time to time: where is your starting point, where do you finally want to get, and which cloud provider has all the abilities you need for your scale?

If you are willing to create services that are mainly for corporate users, think of a private cloud with outsourced hosting facilities, which is technically nothing other than a restricted public cloud for your private usage. If your compliance or your business case doesn't force you to keep your data in-house, guard it with virtual private network (VPN) access and you are ready to go. If you later need to open some services for public usage, you can still transform it into a hybrid cloud by removing the corresponding restrictions or routing traffic through a public cloud.

Checklist

So as not to leave you unarmed, we have created a checklist that you may consult at the start of your cloud project. Criticism is always welcome – if you have noticed some missing points or made different experiences throughout the years, let us know and write it in the comments down below!

  1. What's my business goal? Which requirements does it entail?
  2. How can I accelerate my time to market? Is the market ready for my concept?
  3. How ready is my application or existing architecture?
  4. Is there a need for refactoring? Does it account for potential cloud
     drawbacks like latency, downtime, noisy neighbours, migration, licenses, and so on?
  5. How ready are my organization and team? Who can do the job?
  6. Will it change my business model? Does it disrupt my team?
  7. Which restrictions exist? Compliance? Security? Data privacy policies?
  8. Where do I need help in terms of services? Long term? Short term?
  9. Where do I want to grow? How much responsibility can I share?
  10. Does the SLA match my requirements? What about disaster recovery?
  11. Do the project requirements match my budget? Can it be done in the given time?

And as a final word: don't over-engineer and don't over-manage it all! Keep it simple and straightforward and learn from your mistakes. Never lean back; be proactive, expect the unexpected, and create your plan B – then you're prepared for whatever it takes 😀

References

Cloud Technology Partners, a Hewlett Packard Enterprise Company. "Five things every CEO should know before going to the cloud." July 21, 2016. Retrieved August 08, 2020 from https://www.cloudtp.com/doppler/5-things-every-ceo-know-going-cloud/

Jeremy Cook, Cloud Academy. "Cloud Migration Risks & Benefits." September 2019. Retrieved August 09, 2020 from https://cloudacademy.com/blog/cloud-migration-benefits-risks/

Sharad Acharya, Ace Cloud Hosting. "Things to look out while choosing a cloud service provider." April 05, 2019. Retrieved August 09, 2020 from https://www.acecloudhosting.com/blog/choosing-cloud-service-provider/

Eze Castle Integration. "6 Common cloud mistakes and how to avoid them." January 21, 2020. Retrieved August 09, 2020 from https://www.eci.com/blog/15706-6-common-cloud-mistakes-and-how-to-avoid-them.html

Stefan Luber, Cloud-Computing Insider. "Was ist eine private Cloud?" August 16, 2017. Retrieved August 12, 2020 from https://www.cloudcomputing-insider.de/was-ist-eine-private-cloud-a-631415/

Open Source Batch and Stream Processing: Realtime Analysis of Big Data

Abstract

Since the beginning of Big Data, batch processing has been the most popular choice for processing large amounts of generated data. However, these existing processing technologies are not suitable for the demands posed by the amounts of data we face today. Research has produced a variety of technologies that focus on stream processing. Stream processing technologies bring significant performance improvements and new opportunities for handling Big Data. In this paper, we discuss the differences between batch and stream processing, and we explore existing batch and stream processing technologies. We also explain the new possibilities that stream processing opens up.

1 Introduction

A huge amount of information is generated every day by social media, e-mails, sensors, instruments, and enterprise applications, to mention a few sources. This amount of data brings a lot of challenges in terms of volume, velocity, and variety. 90% of all data was created in the past two years, and the amount of data will double every two years. This data comes in a variety of formats and types, each of which requires a different way of processing [9].

Batch processing was long the most popular choice for processing Big Data. The most notable batch processing framework is MapReduce [7]. MapReduce was first implemented and developed by Google. It was used for large-scale graph processing, text processing, machine learning, and statistical machine translation. MapReduce can process large amounts of data but is designed only for batch processing. Today's demands call for real-time processing of Big Data that finishes in seconds [19]. To meet this demand, various stream processing technologies have been developed. In this paper we will focus on Apache Spark Streaming [22] and Apache Flink [6], which are among the most popular tools for stream processing [12].

In this work we will explain the concepts of batch processing and stream processing in detail while introducing the most popular frameworks. After that we introduce new opportunities that stream processing provides to face today’s issues where a response is needed in seconds.

2 Related Work

Big Data analysis is an active area of research, but comparisons of Big Data analysis concepts are difficult to find. Most research papers focus on comparing stream processing frameworks with respect to performance. In this work we will focus on open source technologies. There are widely used proprietary solutions such as Google MillWheel [1], IBM InfoSphere Streams [3], and Microsoft Azure Stream Analytics [8], which we won't discuss in this paper.

Lopez, Lobato and Duarte describe and compare the streaming platforms Apache Flink, Apache Spark Streaming and Apache Storm [15]. The work focuses on processing performance and behaviour when a worker node fails. The results of each platform are analysed and compared.

Shahrivari compares the concepts of batch processing and stream processing [19]. In detail, the work compares the performance of MapReduce and Apache Spark Streaming in different experiments.

Unlike the mentioned papers, we will focus on the difference between batch processing and stream processing and discuss the new opportunities of stream processing instead of comparing performance measurements.

3 Batch Processing

Batch jobs run in the background without any interaction from an operator. In theory, a batch job is executed in a specific time window between the end of one workday and the start of the next, processing millions of records in runs that take hours to execute. This time window shrinks as availability requirements increase. Batch processing is still used today in organisations and financial institutions [10].

Individual batch jobs are usually organized into calendar periods. Common batch schedules are daily, weekly, and monthly batches. Weekly and monthly batch schedules are mostly used for technical tasks like backups, integrity checks, or disk defragmentation. Functional tasks should be executed on a daily schedule; typical daily jobs are data processing and data transfer. Organizing the batch schedule can save effort in the development cycle. To categorize a job, a simple rule of thumb is to determine whether it performs a functional or a technical task. To reduce batch execution time, performing jobs in parallel is a key factor [10].

There are two different architectures for executing batch jobs: as scripts or as services. The major differences are logging and control. Batch jobs that run as a service usually report their status through log files and can be controlled via a control panel provided by the system. Batch jobs that are triggered from the command line report their progress through streams and an appropriate exit code, and the batch scheduler will terminate the job if necessary [10]. A minimal sketch of the script variant follows below.
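The following Python sketch illustrates the script variant just described: progress is reported through the standard streams, and the exit code tells the scheduler whether the run succeeded. The file name and the record-processing step are illustrative assumptions, not taken from [10].

```python
import sys

def process_records(lines):
    """Placeholder transformation; a real job would transform or transfer each record."""
    for i, line in enumerate(lines, start=1):
        _ = line.strip()                      # ... process the record here ...
        if i % 100_000 == 0:
            print(f"processed {i} records", file=sys.stderr)  # progress via stream

if __name__ == "__main__":
    try:
        with open("input.csv") as records:    # input file name is an assumption
            process_records(records)
    except Exception as exc:
        print(f"batch job failed: {exc}", file=sys.stderr)
        sys.exit(1)                           # non-zero exit code: scheduler can react
    sys.exit(0)                               # zero exit code signals success
```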

3.1 MapReduce

MapReduce is a programming model that enables processing and generating large amounts of data. The model defines two methods: map and reduce.

Figure 1: Pseudo code of counting the number of occurrences of each word in a large selection of documents.

The map function takes a key/value pair as input and generates an intermediate set of key/value pairs. The reduce function takes an intermediate key and the intermediate values associated with that key as input and returns a set of key/value pairs [7]. Figure 1 illustrates the MapReduce programming model for a real-world use case; an executable sketch of the same idea follows below.
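To make the word-count example from Figure 1 concrete, here is a minimal, framework-free Python sketch of the two-phase model. It illustrates the programming model only, not Google's distributed implementation; the function names are our own.

```python
from collections import defaultdict

def map_fn(document: str):
    """Map: emit an intermediate (word, 1) pair for every word in the document."""
    for word in document.split():
        yield word, 1

def reduce_fn(word: str, counts):
    """Reduce: sum all intermediate counts for one word."""
    return word, sum(counts)

def run_mapreduce(documents):
    intermediate = defaultdict(list)
    for doc in documents:                         # map phase
        for word, count in map_fn(doc):
            intermediate[word].append(count)      # group values by intermediate key
    return dict(reduce_fn(w, c) for w, c in intermediate.items())  # reduce phase

print(run_mapreduce(["to be or not to be"]))      # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```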

4 Stream Processing

Stream processing refers to real-time processing of continuous data [14]. A stream processing system consists of a queue, a stream processor, and real-time views [16].

In a system without a queue, the stream processor has to process each event directly. This approach cannot guarantee that each event gets processed correctly: if the stream processor dies, there is no way to detect the error, and a cluster would be overwhelmed by the incoming amount of data it has to process. A persistent queue helps to address these issues. Writing events to a persistent queue before processing buffers the events and allows the stream processor to retry an event when processing fails [16]. An example of a modern queue system is Apache Kafka [13]; a small sketch of this buffering pattern follows below.
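As a hedged illustration of the queue-before-processor pattern, the following Python sketch uses the kafka-python client. The broker address, topic, consumer group, and process() step are assumptions for the example, not prescribed by [13] or [16].

```python
from kafka import KafkaConsumer, KafkaProducer

# Producer side: persist the event in the queue before any processing happens.
producer = KafkaProducer(bootstrap_servers="localhost:9092")  # broker is an assumption
producer.send("events", b'{"user": 42, "action": "click"}')   # topic name is an assumption
producer.flush()

# Stream-processor side: read buffered events and commit offsets only after
# successful processing, so a failed event can be retried.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    group_id="demo-group",          # consumer group name is an assumption
    enable_auto_commit=False,
    auto_offset_reset="earliest",
)

def process(raw_event: bytes) -> None:
    """Placeholder for the actual stream-processing logic."""
    print(raw_event)

for message in consumer:
    process(message.value)   # if this raises, the offset stays uncommitted -> retry
    consumer.commit()
```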

The stream processor processes incoming events in the queue and then updates the real-time views. Two models of stream processing have emerged in recent years: record-at-a-time and micro-batched [16].

Record-at-a-time stream processing The record-at-a-time processing model processes tuples independently of each other, updates the internal state, and sends out new records in response. This leads to inconsistency when different nodes process different data that arrive at different times. The model handles recovery through replication, which requires twice the amount of hardware; this is not optimal for large clusters [22]. To be scalable with high throughput, such systems run in parallel across the cluster [16].

Micro-batch stream processing The micro-batch stream processing approach processes tuples as discrete batches. A batch is processed in strict order to completion before moving on to the next batch. To know whether a batch has been processed before, each batch has its own unique identifier that stays the same on every replay [16].

4.1 Apache Spark Streaming

Apache Spark Streaming is an extension of the Apache Spark cluster computing engine. It was developed to overcome the challenges of the record-at-a-time processing model. Spark Streaming provides a stream programming model for large clusters called discretized streams (D-Streams). In D-Streams, a streaming computation is treated as a series of deterministic batch computations on small time intervals [22].

To generate an input dataset for an interval, the data received during that interval is stored reliably across the cluster. After each interval, the datasets are processed via deterministic parallel operations to generate new datasets in response. The new datasets are stored in resilient distributed datasets (RDDs) [21], which avoid replication by using lineage. A D-Stream allows users to manipulate the grouped RDDs through various operations [22].

Figure 2: Each RDD contains data from a certain time interval [24].

D-Streams provide consistency, fault recovery, and integration with batch systems, bringing batch processing models to stream processing. Apache Spark Streaming lets users mix streaming, batch, and interactive queries to build integrated systems [22]. A minimal word-count sketch follows below.
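For illustration, here is a minimal D-Stream word count in PySpark. The socket source on localhost:9999 and the one-second batch interval are assumptions for the sketch, not part of the D-Stream model itself.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "DStreamWordCount")
ssc = StreamingContext(sc, batchDuration=1)        # 1-second micro-batch intervals

lines = ssc.socketTextStream("localhost", 9999)    # assumed text source
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))   # one RDD per interval (cf. Figure 2)
counts.pprint()

ssc.start()
ssc.awaitTermination()
```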

4.2 Apache Flink

Apache Flink is a stream-processing framework and an Apache top-level project. The core of Apache Flink is a distributed streaming dataflow engine that is optimized to perform both batch and stream analytics [6]. This engine executes programs called dataflow graphs, which can consume and produce data [4].

Dataflow graphs consist of stateful operators and data streams. The stateful operators implement the logic of producing or consuming data, while data streams distribute the data between the operators. On execution, dataflow graphs parallelize operators into one or more instances called subtasks and split streams into one or more stream partitions [6].

Figure 3: Code showing the Apache Flink dataflow programming model [23].

Apache Flink is a high-throughput, low-latency streaming engine that is also optimized for batch execution using a query optimizer [6]. Dataflow graphs are optimized for execution in a cluster or cloud environment [20]. A small PyFlink sketch of the dataflow model follows below.
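As a rough illustration of the dataflow programming model from Figure 3, the following sketch uses the PyFlink DataStream API. The in-memory collection source stands in for a real stream source, and the API details reflect recent PyFlink releases; both are assumptions, since [6] and [23] describe the model in Java/Scala terms.

```python
from pyflink.common import Types
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Source operator: an in-memory collection stands in for a real stream source.
words = env.from_collection(["to", "be", "or", "not", "to", "be"],
                            type_info=Types.STRING())

# Stateful transformation operators connected by data streams.
counts = (words
          .map(lambda w: (w, 1),
               output_type=Types.TUPLE([Types.STRING(), Types.INT()]))
          .key_by(lambda pair: pair[0])
          .reduce(lambda a, b: (a[0], a[1] + b[1])))

counts.print()                        # sink operator
env.execute("dataflow_model_sketch")  # builds and runs the dataflow graph
```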

5 Stream Processing Opportunities

Batch processing is still needed for legacy implementations and data analysis where no efficient algorithms are known [6]. Nevertheless, stream processing offers new opportunities to face issues where the result is needed in seconds instead of hours or days.

5.1 Machine Learning

Machine learning on Big Data is dominated by online machine learning algorithms. In streaming, there is a need for scalable learning algorithms that are adaptive and inherently open-ended [11]. This makes online machine learning optimal for stream processing, where the algorithm has to adapt to new patterns in the data dynamically.

Apache Flink Apache Flink brings together batch processing and stream processing, which makes it very suitable for machine learning [11]. Apache Flink provides the machine learning library FlinkML. FlinkML supports the PMML standard for online predictions [5].

Apache Spark Apache Spark provides a distributed machine learning library called MLlib. MLlib provides distributed implementations of learning algorithms covering (but not limited to) linear models, naive Bayes, classification, and clustering. MLlib can be integrated with other high-level libraries, for example Apache Spark Streaming. Apache Spark Streaming enables the development of online learning algorithms with MLlib on real-time data streams [17]; a short sketch of this combination follows below.
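As a hedged sketch of online learning on a stream, the following example combines Spark Streaming with MLlib's StreamingLinearRegressionWithSGD. The socket source, the CSV format, and the feature count are assumptions for illustration.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint, StreamingLinearRegressionWithSGD

sc = SparkContext("local[2]", "OnlineRegression")
ssc = StreamingContext(sc, 1)                      # 1-second micro-batches

def parse(line: str) -> LabeledPoint:
    """Assumed input format: label,feature1,feature2,feature3"""
    label, *features = map(float, line.split(","))
    return LabeledPoint(label, Vectors.dense(features))

train_stream = ssc.socketTextStream("localhost", 9999).map(parse)

model = StreamingLinearRegressionWithSGD()
model.setInitialWeights(Vectors.zeros(3))          # 3 features is an assumption
model.trainOn(train_stream)                        # weights update with each micro-batch

ssc.start()
ssc.awaitTermination()
```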

Detecting cases of fraud is an ongoing area of research. A study from 2016 estimated that credit card fraud is responsible for over 20 billion dollars in losses worldwide [18]. It is important to detect credit card fraud immediately after a financial transaction has been made. Today, credit card fraud can be detected with supervised or unsupervised machine learning models [2]. For instant detection, online machine learning on real-time data streams provides the needed technology to face this issue.

6 Conclusion

This paper explained the two data analysis concepts of batch processing and stream processing. While real-time analysis is needed to meet today's demands, batch processing is still used for legacy implementations and for data analyses where no efficient streaming algorithms are known. Stream processing offers new opportunities to handle Big Data and to respond to the user with an immediate result.

References

[1] Tyler Akidau, Alex Balikov, Kaya Bekiroglu, Slava Chernyak, Josh Haberman, Reuven Lax, Sam McVeety, Daniel Mills, Paul Nordstrom, and Sam Whittle. MillWheel: Fault-tolerant stream processing at internet scale. In Very Large Data Bases, pages 734–746, 2013.

[2] Bart Baesens, Veronique Van Vlasselaer, and Wouter Verbeke. Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques: A Guide to Data Science for Fraud Detection. Wiley Publishing, 1st edition, 2015.

[3] Chuck Ballard, Kevin Foster, Andy Frenkiel, Bugra Gedik, Michael P. Koranda, Deepak Senthil Nathan, Roger Rea, Mike Spicer, Brian Williams, and Vitali N. Zoubov. IBM InfoSphere Streams: Assembling continuous insight in the information revolution. IBM Redbooks publication, 2011.

[4] Ilaria Bartolini and Marco Patella. Comparing performances of big data stream processing platforms with RAM3S (extended abstract).

[5] András Benczúr, Levente Kocsis, and Róbert Pálovics. Online machine learning in big data streams. February 2018.

[6] Paris Carbone, Asterios Katsifodimos, Stephan Ewen, Volker Markl, Seif Haridi, and Kostas Tzoumas. Apache Flink™: Stream and batch processing in a single engine. IEEE Data Engineering Bulletin, 38, January 2015.

[7] Jeffrey Dean and Sanjay Ghemawat. MapReduce: A flexible data processing tool. Commun. ACM, 53, January 2010.

[8] Charles Feddersen. Real-time event processing with microsoft azure stream analytics. Jan 2015.

[9] Mugdha Ghotkar and Priyanka Rokde. Big data: How it is generated and its importance.

[10] Dave Ingram. Design – Build – Run: Applied Practices and Principles for Production-Ready Software Development. Wrox, 2009.

[11] W. Jamil, N-C. Duong, W. Wang, C. Mansouri, S. Mohamad, and A. Bouchachia. Scalable online learning for flink: Solma library. In Proceedings of the 12th European Conference on Software Architecture: Companion Proceedings, ECSA ’18, New York, NY, USA, 2018. Association for Computing Machinery.

[12] J. Karimov, T. Rabl, A. Katsifodimos, R. Samarev, H. Heiskanen, and V. Markl. Benchmarking distributed stream data processing systems. In 2018 IEEE 34th International Conference on Data Engineering (ICDE), pages 1507–1518, April 2018.

[13] Jay Kreps. Kafka: A distributed messaging system for log processing. 2011.

[14] Anuj Kumar. Architecting Data-Intensive Applications. Packt Publishing, 2018.

[15] M. A. Lopez, A. G. P. Lobato, and O. C. M. B. Duarte. A performance comparison of open-source stream processing platforms. In 2016 IEEE Global Communications Conference (GLOBECOM), pages 1–6, Dec 2016.

[16] Nathan Marz and James Warren. Big Data: Principles and best practices of scalable realtime data systems. Manning Publications, 2015.

[17] Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, and et al. Mllib: Machine learning in apache spark. J. Mach. Learn. Res., 17(1):1235–1241, January 2016.

[18] David Robertson. The nilson report, issue 1096. Oct 2016.

[19] Saeed Shahrivari. Beyond batch processing: Towards real-time and streaming big data. Computers, 3, 03 2014.

[20] Daniel Warneke and Odej Kao. Nephele: Efficient parallel data processing in the cloud. In Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers, MTAGS ’09, New York, NY, USA, 2009. Association for Computing Machinery.

[21] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauly, Michael J. Franklin, Scott Shenker, and Ion Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Presented as part of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), pages 15–28, San Jose, CA, 2012. USENIX.

[22] Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, and Ion Stoica. Discretized streams: An efficient and fault-tolerant model for stream processing on large clusters. In Proceedings of the 4th USENIX Conference on Hot Topics in Cloud Computing, HotCloud'12, pages 10–10, Berkeley, CA, USA, 2012. USENIX Association.

[23] Dataflow Programming Model. https://ci.apache.org/projects/flink/flink-docs-release-1.2/concepts/programming-model.html

[24] Discretized Streams (DStreams). https://spark.apache.org/docs/latest/streaming-programming-guide.html#discretized-streams-dstreams

Distributed stream processing frameworks – what they are and how they perform

This blog aims to provide an overview of the topic of stream processing and its capabilities in a large-scale environment. The post starts with an introduction to stream processing. After that, it explains how stream processing works and shows different areas of application as well as some common stream processing frameworks. Finally, this article provides a performance comparison of several common frameworks based on benchmarking data.

Continue reading

Isolation and Consistency in Databases

by Samuel Hack and Sebastian Wachter.

Most people assume that the data coming from a database is correct. For most applications this is true, but when databases are used in systems that are pushed to their limits, this is no longer always the case. What happens if, during a query of a value, exactly this value is changed by another user? Which entry was first when two users create different values at the same time? Especially in systems where one database server is not enough, it is very important to choose the right rules for isolation and consistency; if this is not done, it can lead to data loss or, in the worst case, to incorrect values in the application.

Most database systems promise virtually nothing by default. Therefore it is important to learn which isolation and consistency levels the database promises and then adjust them to fit the application.

In this paper we will give an overview of what isolation and consistency levels are, what different levels are available in databases, and what errors can occur in databases. Afterwards we will give a short introduction to the CAP theorem and discuss the problems it poses for distributed systems.

Continue reading