Finding the right strategy for your way to the cloud

– Abstract –

This article is about finding the right strategy if you want to empower your business with cloud computing. It’s written for those of you who are still new to the topic and want a first glimpse through the dense cloud mystery. After reading it, you should have an idea of how to start your first cloud project and a basic knowledge of the industry’s most important buzzwords.

“The essence of strategy is choosing what not to do”
– Prof. Dr. Michael E. Porter – Harvard Business School
Co-founder of the field of strategic management


It’s stormy outside

It’s been over a decade since Amazon stirred up the market with its web services, and cloud computing is now more mainstream than ever. What for many people started with file-sharing and backup services like Dropbox or company-internal NAS systems quickly turned into a trending global topic. Why keep upgrading local hardware and storage to stay competitive if you can have the same or even greater performance with shared resources, reduced costs and accessibility from everywhere?

Some understood that earlier than others and can already harvest the fruits of their work – and some, hopefully not you, should seriously think about protecting their house, because the storm is not over yet! With the introduction of cloud gaming services like GeForce Now and the launch of projects by the major players in the market, it is becoming clear that cloud computing is not just a hype. Quite the reverse: anything seems possible, and any upcoming business transformation should consider the opportunities of cloud computing for its needs. And who knows, maybe one day you will even sell your own computing power as part of a worldwide cloud, as crypto-currency miners or solar tech companies have already demonstrated?

Faster, Greater, Higher! The sky’s the limit!

Now that enough time has passed for the first CTOs and CIOs to look back on their early experiences with moving to cloud services, we can wrap up some key facts, in case you are still looking for a few good arguments to convince you and your team to move to the cloud.

Scalability with minor effort

  • Overwhelming traffic is not a show stopper anymore
  • Auto scaling infrastructure is auto money for your pocket!
  • Reduce the multi-region infrastructure challenge

Security

  • Cloud storage carries fewer risks than lost devices!
  • Stop worrying about underlying infrastructure security & maintenance
  • Disaster can strike! The well-proven disaster recovery systems of the big
    cloud providers are probably better than yours!

Speed & Performance

  • Increase your server-side performance on-demand as you go
  • Handle request peaks without costly overprovisioning your hardware

Reduced Costs = Greater Profit

  • Pay-as-you-go is better than buy-and-deprecate!
  • Financial advantages through transformation from CAPEX to OPEX
  • Makes your financial model a lot easier to plan and calculate
  • High chance to reduce your Time to Market 
  • Emerging markets may let you fly even higher

Agility & User Experience

  • Keep your team from drifting away from their core competences
  • Increase your team’s performance with the latest collaboration tools
  • Satisfy your team by giving them a state-of-the-art feeling
  • Allow more freedom in terms of physical presence

Ecological Footprint

  • Less e-waste from deprecated hardware
  • Savings on power supply and emissions

And if you’re doing it right, you gain a lot of FLEXIBILITY, which is nowadays more important than ever! Our technological environment changes so fast that you may be outdated before you finish your project – ok, probably not that fast – but you should be prepared! If your cloud provider can’t fulfil your SLA or becomes outdated itself, you can move on to a new provider. But be aware: cloud providers want to keep you as a customer, for sure! How that may end up badly for you is discussed in the next section.

Pitfalls you better leave for others

So if you’re now convinced by the advantages of the cloud, we’ll reveal some pain points where others have stumbled. Keep them in mind and don’t make the same mistakes again!

Readiness of Organization
One of the greatest mistakes you can make while dreaming of your cloud project is to underestimate the know-how that is necessary to do the job right. Many teams stumbled over the fact that their ideas and plans were great, but they forgot about the people who finally have to do the job. Your change management team also has to be aware that moving to the cloud can have a negative impact on other teams, so other teams in your company may start fighting against the change. Imagine, for example, that your customers now have direct access to your order and payment system: will your sales team like that? The change can also affect your hierarchical structure, and long-cherished employees may lose their major value for the team. So it is quite likely that not everybody in your organization will like your project. Be aware of that!

Readiness of Application
It doesn’t matter whether you’re developing a new application or transforming an existing one, you have to keep in mind the drawbacks that can come along with cloud services. Too many technical leaders assumed that a simple lift-and-shift strategy would work out and had to learn the hard way that some refactoring is always necessary. This can start with underestimating your cloud provider’s thresholds, continue with disregarding the additional latency that destroys your user experience, and end with dependency libraries that are restricted from being used in the cloud (rare, but reported). So you had better draw up a new concept first and think from the very beginning about which components can be reused and in which way. Think about your build steps, your deployment, your test environment, your architecture, your current infrastructure, previous concurrency issues and the clusters you want to use – convince yourself first, not later!

Data Privacy & Compliance
Especially with the introduction of the new data privacy regulations in 2018, many companies were caught off guard and were left with big question marks over how to continue with their services. So you can count yourself lucky to start your cloud project after this disruptive wave, but you shouldn’t assume that the current data privacy agreements will last forever; this was rather just the beginning of regulations that had been missed since the beginning of our digital lifestyle. Data privacy for your customers and data protection imposed by your company compliance will harden your way to the cloud – that’s for sure!

Remember your Business Idea!
It doesn’t matter whether you start a new project or lift some existing functionality to the cloud: if it creates advantages, any motivated employee will come up with further and further ideas on how to improve and scale the impact of your cloud move. So if you impress your environment, you can be fairly sure that it won’t take long until the first ideas from other teams or team members land on your desk! Keep your primary business goals in mind and finish them first before you start working on other issues. If you’re a technical leader, or you’re the one who pushes the cloud move forward, take the advice to reflect frequently and check whether the tasks you’ve created actually add value to your business. Too many become starry-eyed about cloud transformations and elevate them to a holy grail, which they certainly are not! So think twice about whether the current steps are really necessary and profitable for you and your team!

Vendor Lock-In
It’s too sad to be true, but in reality it’s a fact: we have two strongly diverging mindsets in our industry. There are those who believe everything should be free and open source, and the others, who try to catch you in their cage and never let you go. So you have to be aware that it is common among cloud providers to try to bind you with special offers, extra support, personal training sessions and individual implementation opportunities, just to keep you as a customer as long as possible. If you start to implement services based on proprietary interfaces, it will be harder to relocate your codebase one day. Or if you build up special knowledge customized to their services, you may wake up one day and realize that this knowledge is now worthless. In other words, if their ship goes down, you go down too! So be aware and avoid a vendor lock-in as much as you can!

Cloud Provider Evaluation
Especially those who are new to the topic may stumble across a thing called an SLA (Service Level Agreement), which defines the contract you sign with your cloud provider. If you don’t know exactly what it contains and where the limits of your cooperation are, you may run into trouble sooner or later. So it is good advice to get somebody on your team who has already collected some experience in the cloud environment and can protect you from making the same mistakes that others did. It’s not about missing some contract detail; it’s about finding the right values and parameters that fit your application and, finally, your business case. As long as your project works fine, you’ll probably not look at your SLA that much. But once your customers or your boss are breathing down your neck because of an unexpected downtime, drastically rising costs due to a misconfiguration, a security breach, or even a disaster with a never-ending recovery time, you will understand why knowing your SLA details is important. Take it as good advice, especially at the beginning of the cooperation: measurement and control are better than relying on trust. So the job is to monitor and measure as much as you can until you get a good feeling for what’s going on and how much responsibility you can take for it. Expect the unexpected!

Your guide through the jungle

Now that you are familiar with the greatest advantages and the biggest show stoppers, we can start to create a cloud strategy that fits your needs. Get your weapons locked and loaded and follow me through the jungle!

What do you want to create?

Remember your business goals and remember how others lost the trail; you don’t want the same to happen to you, right? So define your project, and especially your project scope, precisely. What do you want to create? What exactly do you want to offer your customer? Who is your customer and what do they really need? Think about your use cases and match them accordingly. To make it a little more confusing, maybe you’ve heard about XaaS or *aaS, one of the most blown-up cloud buzzwords ever! What once started with Software as a Service (SaaS), which basically means software delivered through cloud services, later turned into a term that is used for almost any kind of service: Firewall as a Service, Payment as a Service, Analytics as a Service and even non-technological ones. But the idea behind the service-oriented model is great! So if you want to market your later product right, think in terms of services! Remember how you changed your business model from CAPEX to OPEX; your future customers want that too! They don’t want to buy a product; they have a goal and are looking for the right service to reach their target as fast and as well as they can. And that is true for any kind of customer. So you are serving them your best to let them reach their best!
The other way around, you have to think in the same terms for your own goals. But this is a matter of your skills, your compliance and the amount of responsibility you want to share. Netflix, for example, thought they could fly higher and higher, so they built up their own infrastructure and data centers, until they had to realize that they could never do this job better than Amazon does. From that point on they focused on what they really want to do: delivering awesome series! So think about the services you want to deliver and what kind of services would help you, short term and long term. Picture your project’s growth stages from time to time: where is your starting point, where do you finally want to get to, and which cloud provider has all the abilities you need for your scale?

If you’re planning to create services that are mainly for corporate users, think of a private cloud with outsourced hosting facilities, which is technically nothing other than a restricted public cloud for your private use. If your compliance or your business case doesn’t force you to keep your data in-house, guard it with virtual private network (VPN) access and you are ready to go. If you later need to open some services for public usage, you can still turn it into a hybrid cloud by removing the corresponding restrictions or routing them through a public cloud.

Checklist

So as not to leave you unarmed, we’ve created a checklist that you may consult at the start of your cloud project. Criticism is always welcome: if you notice missing points or have made different experiences over the years, let us know in the comments below!

  1. What’s my business goal? Which requirements does it entail?
  2. How can I accelerate my time to market? Is the market ready for my concept?
  3. How ready is my application or existing architecture?
  4. Is there a need for refactoring? Does it consider potential cloud drawbacks like latency, downtime, noisy neighbours, migration, licenses and so on?
  5. How ready are my organization and team? Who can do the job?
  6. Will it change my business model? Does it disrupt my team?
  7. Which restrictions exist? Compliance? Security? Data privacy policies?
  8. Where do I need help in terms of services? Long term? Short term?
  9. Where do I want to grow? How much responsibility can I share?
  10. Does the SLA match my requirements? How about disaster recovery?
  11. Do the project requirements match my budget? Can it be done in the given time?

And as a final word: don’t over-engineer and don’t over-manage it all! Keep it simple and straightforward and learn from your mistakes. Never sit back; be proactive, expect the unexpected and create your plan B – then you’re prepared for whatever it takes 😀

References

Author: Cloud Technology Partners, a Hewlett Packard Enterprise Company
Publication Date: July 21, 2016
Title: Five things every CEO should know before going to the cloud
Retrieved August 08, 2020 from:
https://www.cloudtp.com/doppler/5-things-every-ceo-know-going-cloud/

Author: Jeremy Cook, Cloud Academy
Publication Date: September 2019
Title: Cloud Migration Risks & Benefits
Retrieved August 09, 2020 from:
https://cloudacademy.com/blog/cloud-migration-benefits-risks/

Author: Sharad Acharya, Ace Cloud Hosting
Publication Date: April 05, 2019
Title: Things to look out while choosing a cloud service provider
Retrieved August 09, 2020 from:
https://www.acecloudhosting.com/blog/choosing-cloud-service-provider/

Author: Eze Castle Integration
Publication Date: January 21, 2020
Title: 6 Common cloud mistakes and how to avoid them
Retrieved August 09, 2020 from:
https://www.eci.com/blog/15706-6-common-cloud-mistakes-and-how-to-avoid-them.html

Author: Stefan Luber, Cloud-Computing Insider
Publication Date: August 16, 2017
Title: Was ist eine private Cloud?
Retrieved August 12, 2020 from:
https://www.cloudcomputing-insider.de/was-ist-eine-private-cloud-a-631415/

Open Source Batch and Stream Processing: Realtime Analysis of Big Data

Abstract

Since the beginning of Big Data, batch processing has been the most popular choice for processing large amounts of generated data. However, these existing processing technologies are not suitable for the large amounts of data we face today. Research has developed a variety of technologies that focus on stream processing. Stream processing technologies bring significant performance improvements and new opportunities for handling Big Data. In this paper, we discuss the differences between batch and stream processing and explore existing batch and stream processing technologies. We also explain the new possibilities that stream processing makes possible.

1 Introduction

A huge amount of information is generated every day by social media, e-mails, sensors, instruments and enterprise applications, to mention just a few sources. This amount of data brings a lot of challenges with regard to volume, velocity and variety. Ninety percent of all data was created in the past two years, and the amount of data will double every two years. This data comes in a variety of formats and types, each of which requires a different way of processing [9].

Batch processing has been the most popular choice for processing Big Data. The most notable batch processing framework is MapReduce [7]. MapReduce was first implemented and developed by Google. It was used for large-scale graph processing, text processing, machine learning and statistical machine translation. MapReduce can process large amounts of data but is designed only for batch processing. Today’s demands call for real-time processing of Big Data that finishes within seconds [19]. For this demand, various stream processing technologies have been developed. In this paper we will focus on Apache Spark Streaming [22] and Apache Flink [6], which are the most popular tools for stream processing [12].

In this work we will explain the concepts of batch processing and stream processing in detail while introducing the most popular frameworks. After that we introduce new opportunities that stream processing provides to face today’s issues where a response is needed in seconds.

2 Related Work

Big Data analysis is an active area of research, but comparisons of Big Data analysis concepts are difficult to find. Most research papers focus on comparing stream processing frameworks by performance. In this work we focus on open source technologies. There are widely used proprietary solutions such as Google MillWheel [1], IBM InfoSphere Streams [3] and Microsoft Azure Stream Analytics [8] which we will not discuss in this paper.

Lopez, Lobato and Duarte describe and compare the streaming platforms Apache Flink, Apache Spark Streaming and Apache Storm [15]. The work focuses on processing performance and behaviour when a worker node fails. The results of each platform are analysed and compared.

Shahrivari compares the concepts of batch processing and stream processing [19]. In detail, the work compares the performance of MapReduce and Apache Spark Streaming with different experiments.

Unlike the mentioned papers, we will focus on the difference between batch processing and stream processing and discuss the new opportunities of stream processing instead of comparing performance measurements.

3 Batch Processing

Batch jobs run in the background without any interaction from an operator. In theory, a batch job is executed in a specific time window between the end of a workday and the start of the next workday to process millions of records, which can take hours. This time window shrinks as availability requirements increase. Batch processing is still used today in organisations and financial institutions [10].

Individual batch jobs are usually organized into calendar periods. Common batch schedules are daily, weekly and monthly batches. Weekly and monthly batch schedules are mostly used for technical tasks like backups, integrity checks or disk defragmentation, while functional tasks should be executed on a daily schedule. Typical daily jobs are data processing and data transfers. Organizing the batch schedule can save effort in the development cycle. To categorize a job, a simple rule of thumb is to determine whether it performs a functional or a technical task. To reduce batch execution time, running jobs in parallel is a key factor [10].

There are two different architectures for how batch jobs are executed: as scripts or as services. The major differences are logging and control. Batch jobs that run as a service usually report their status through log files and can be controlled via a control panel provided by the system. Batch jobs that are triggered from the command line report their progress through streams and an appropriate exit code, and the batch scheduler terminates the job if necessary [10].

3.1 MapReduce

MapReduce is a programming model that enables processing and generating large amounts of data. The model defines two methods: map and reduce.

Figure 1: Pseudo code of counting the number of occurrences of each word in a large selection of documents.

The map function takes a key/value pair as input and generates an intermediate set of key/value pairs. The reduce function takes an intermediate key and the intermediate values associated with that key as input and returns a set of key/value pairs [7]. Figure 1 illustrates the MapReduce programming model for a real-world use case.
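
To make the programming model concrete, the following minimal Python sketch emulates the word-count example of Figure 1 entirely in memory. It is not tied to any particular MapReduce implementation, and the example documents are made up for illustration.

from collections import defaultdict

def map_phase(doc_id, text):
    # Map: emit an intermediate (word, 1) pair for every word in a document.
    for word in text.split():
        yield word, 1

def reduce_phase(word, counts):
    # Reduce: sum all intermediate counts grouped under the same word.
    return word, sum(counts)

def mapreduce_wordcount(documents):
    grouped = defaultdict(list)          # shuffle: group intermediate pairs by key
    for doc_id, text in documents.items():
        for word, count in map_phase(doc_id, text):
            grouped[word].append(count)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())

docs = {"d1": "big data stream processing", "d2": "big data batch processing"}
print(mapreduce_wordcount(docs))
# {'big': 2, 'data': 2, 'stream': 1, 'processing': 2, 'batch': 1}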

4 Stream Processing

Stream processing refers to the real-time processing of continuous data [14]. A stream processing system consists of a queue, a stream processor and real-time views [16].

In a system without a queue, the stream processor has to process each event directly. This approach cannot guarantee that each event is processed correctly: if the stream processor dies, there is no way to detect the error, and a cluster would be overwhelmed by the amount of incoming data it has to process. A persistent queue helps address these issues. Writing events to a persistent queue before processing buffers the events and allows the stream processor to retry an event when it fails [16]. An example of a modern queue system is Apache Kafka [13].
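
As an illustration of the queueing step, the following hedged sketch writes events to a Kafka topic with the third-party kafka-python client; the broker address, topic name and event payloads are placeholders and not taken from the paper.

import json
from kafka import KafkaProducer  # third-party package: kafka-python

# Connect to a (hypothetical) local broker and serialize events as JSON.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

# Events are appended to the topic's persistent log, so the stream processor
# can re-read and retry them if a processing attempt fails.
for event in ({"user": 1, "action": "click"}, {"user": 2, "action": "buy"}):
    producer.send("events", event)

producer.flush()  # block until all buffered events have been written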

The stream processor processes incoming events from the queue and then updates the real-time views. Two models of stream processing have emerged in recent years: record-at-a-time and micro-batched [16].

Record-at-a-time stream processing The record-at-a-time processing model processes tuples independently of each other, updates the internal state and sends out new records in response. This leads to inconsistency when different nodes process data that arrives at different times. The model handles recovery through replication, which requires twice the amount of hardware; this is not optimal for large clusters [22]. To be scalable with high throughput, such systems run in parallel across the cluster [16].

Micro-batch stream processing The micro-batch stream processing approach processes tuples as discrete batches. A batch is processed in strict order to completion before moving on to the next batch. To know whether a batch has been processed before, each batch has its own unique identifier that stays the same on every replay [16].

4.1 Apache Spark Streaming

Apache Spark Streaming is an extension of the Apache Spark cluster computing engine. It was developed to overcome the challenges of the record-at-a-time processing model. Spark Streaming provides a stream programming model for large clusters called discretized streams (D-Streams). In D-Streams, streaming computation is treated as a series of deterministic batch computations on small time intervals [22].

To generate an input dataset for an interval, the data received during that interval is stored reliably across the cluster. After each interval, the datasets are processed via deterministic parallel operations to generate new datasets in response. The new datasets are stored in resilient distributed datasets (RDDs) [21], which avoid replication by using lineage. A D-Stream allows users to manipulate the grouped RDDs through various operations [22].

Figure 2: Each RDD contains data from a certain time interval [24].

D-Streams provide consistency, fault recovery and integration with batch systems, bringing batch processing models to stream processing. Apache Spark Streaming lets users mix streaming, batch and interactive queries to build integrated systems [22].
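
A minimal D-Stream example in PySpark, assuming a local Spark installation and a text source on TCP port 9999 (both placeholders), could look as follows: each one-second micro-batch of lines is turned into an RDD and transformed with the usual word-count operations.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "DStreamWordCount")
ssc = StreamingContext(sc, 1)            # one-second micro-batches

# Every batch interval yields an RDD with the lines received in that interval.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.pprint()                          # print the counts of each micro-batch
ssc.start()
ssc.awaitTermination()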

4.2 Apache Flink

Apache Flink is a stream processing framework and an Apache top-level project. The core of Apache Flink is a distributed streaming dataflow engine which is optimized to perform both batch and stream analytics [6]. The engine executes programs called dataflow graphs, which can consume and produce data [4].

Dataflow graphs consist of stateful operators and data streams. The stateful operators implement the logic of producing or consuming data. The data streams distribute the data between the operators. On execution, dataflow graphs parallelize operators into one or more instances called subtasks and split streams into one or more stream partitions [6].

Figure 3: Code showing the Apache Flink dataflow programming model [23].

Apache Flink is a high-throughput, low-latency streaming engine that is also optimized for batch execution using a query optimizer [6]. Dataflow graphs are optimized to be executed in a cluster or cloud environment [20].
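
For comparison, a hedged PyFlink sketch of the dataflow programming model might look as follows. The API shown follows the PyFlink DataStream interface of recent Flink releases and is an assumption rather than code from the paper; exact method signatures and type hints may vary between versions.

from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# A bounded example source; in practice this would be a Kafka or socket source.
lines = env.from_collection(
    ["big data stream", "big data batch"], type_info=Types.STRING())

# Operators form the dataflow graph: flat_map -> key_by -> reduce.
counts = (lines
          .flat_map(lambda line: [(word, 1) for word in line.split()],
                    output_type=Types.TUPLE([Types.STRING(), Types.INT()]))
          .key_by(lambda pair: pair[0])
          .reduce(lambda a, b: (a[0], a[1] + b[1])))

counts.print()
env.execute("wordcount")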

5 Stream Processing Opportunities

Batch processing is still needed for legacy implementations and data analysis where no efficient algorithms are known [6]. Nevertheless, stream processing offers new opportunities to face issues where the result is needed in seconds instead of hours or days.

5.1 Machine Learning

Machine learning on Big Data is dominated by online machine learning algorithms. In streaming there is a need for scalable learning algorithms that are adaptive and inherently open-ended [11]. This makes online machine learning optimal for stream processing, where the algorithm has to adapt to new patterns in the data dynamically.

Apache Flink Apache Flink brings together batch processing and stream processing. This makes Apache Flink very suitable for machine learning [11]. Apache Flink provides the machine learning library FlinkML. FlinkML supports the PMML standard for online predictions [5].

Apache Spark Apache Spark provides a distributed machine learning library called MLlib. MLlib provides distributed implementations of learning algorithms that cover (but are not limited to) linear models, naive Bayes, classification and clustering. MLlib can be integrated with other high-level libraries, for example Apache Spark Streaming, which enables the development of online learning algorithms with MLlib on real-time data streams [17].
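
As a hedged illustration of such an online learning setup, the sketch below trains a streaming k-means model from pyspark.mllib on a DStream of feature vectors; the socket source, feature dimensionality and parameters are placeholders and not taken from the paper.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.clustering import StreamingKMeans
from pyspark.mllib.linalg import Vectors

sc = SparkContext("local[2]", "OnlineClustering")
ssc = StreamingContext(sc, 5)

# A stream of comma-separated numeric features, e.g. transaction attributes.
features = (ssc.socketTextStream("localhost", 9999)
               .map(lambda line: Vectors.dense([float(x) for x in line.split(",")])))

# The model updates its cluster centers incrementally as new batches arrive.
model = StreamingKMeans(k=2, decayFactor=0.8).setRandomCenters(3, 1.0, 42)
model.trainOn(features)
model.predictOn(features).pprint()       # emit a cluster index per incoming vector

ssc.start()
ssc.awaitTermination()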

Detecting cases of fraud is an ongoing area of research. A study from 2016 estimated that credit card fraud is responsible for over 20 billion dollars in losses worldwide [18]. It is important to detect credit card fraud immediately after a financial transaction has been made. Today, credit card fraud can be detected with supervised or unsupervised machine learning models [2]. For instant detection, online machine learning on real-time data streams provides the technology needed to address this issue.

6 Conclusion

This paper explains the two data analysis concepts of batch processing and stream processing. While real-time analysis is needed to meet today’s demands, batch processing is still used for legacy implementations and for data analysis where no efficient streaming algorithms are known. Stream processing offers new opportunities to handle Big Data and respond to the user with an immediate result.

References

[1] Tyler Akidau, Alex Balikov, Kaya Bekiroglu, Slava Chernyak, Josh Haberman, Reuven Lax, Sam McVeety, Daniel Mills, Paul Nordstrom, and Sam Whittle. Millwheel: Fault-tolerant stream processing at internet scale. In Very Large Data Bases, pages 734–746, 2013.

[2] Bart Baesens, Veronique Van Vlasselaer, and Wouter Verbeke. Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques: A Guide to Data Science for Fraud Detection. Wiley Publishing, 1st edition, 2015.

[3] Chuck Ballard, Kevin Foster, Andy Frenkiel, Bugra Gedik, Michael P. Koranda, Deepak Senthil, Nathanand Rajan, Roger Rea, Mike Spicer, Brian Williams, and Vitali N. Zoubov. Ibm infosphere streams: Assembling continuous insight in the information revolution. IBM Redbooks publication, 2011.

[4] Ilaria Bartolini and Marco Patella. Comparing performances of big data stream processing platforms with RAM3S (extended abstract).

[5] András Benczúr, Levente Kocsis, and Róbert Pálovics. Online machine learning in big data streams. February 2018.

[6] Paris Carbone, Asterios Katsifodimos, Stephan Ewen, Volker Markl, Seif Haridi, and Kostas Tzoumas. Apache Flink™: Stream and batch processing in a single engine. IEEE Data Engineering Bulletin, 38, 2015.

[7] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: A flexible data processing tool. Commun. ACM, 53, 01 2010.

[8] Charles Feddersen. Real-time event processing with microsoft azure stream analytics. Jan 2015.

[9] Mugdha Ghotkar and Priyanka Rokde. Big data: How it is generated and its importance.

[10] Dave Ingram. Design – Build – Run: Applied Practices and Principles for Production-Ready Software Development. Wrox, 2009.

[11] W. Jamil, N-C. Duong, W. Wang, C. Mansouri, S. Mohamad, and A. Bouchachia. Scalable online learning for flink: Solma library. In Proceedings of the 12th European Conference on Software Architecture: Companion Proceedings, ECSA ’18, New York, NY, USA, 2018. Association for Computing Machinery.

[12] J. Karimov, T. Rabl, A. Katsifodimos, R. Samarev, H. Heiskanen, and V. Markl. Benchmarking distributed stream data processing systems. In 2018 IEEE 34th International Conference on Data Engineering (ICDE), pages 1507–1518, April 2018.

[13] Jay Kreps. Kafka : a distributed messaging system for log processing. 2011.

[14] Anuj Kumar. Architecting Data-Intensive Applications. Packt Publishing, 2018.

[15] M. A. Lopez, A. G. P. Lobato, and O. C. M. B. Duarte. A performance comparison of open-source stream processing platforms. In 2016 IEEE Global Communications Conference (GLOBECOM), pages 1–6, Dec 2016.

[16] Nathan Marz and James Warren. Big Data: Principles and best practices of scalable realtime data systems. Manning Publications, 2015.

[17] Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, and et al. Mllib: Machine learning in apache spark. J. Mach. Learn. Res., 17(1):1235–1241, January 2016.

[18] David Robertson. The nilson report, issue 1096. Oct 2016.

[19] Saeed Shahrivari. Beyond batch processing: Towards real-time and streaming big data. Computers, 3, 03 2014.

[20] Daniel Warneke and Odej Kao. Nephele: Efficient parallel data processing in the cloud. In Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers, MTAGS ’09, New York, NY, USA, 2009. Association for Computing Machinery.

[21] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauly, Michael J. Franklin, Scott Shenker, and Ion Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Presented as part of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), pages 15–28, San Jose, CA, 2012. USENIX.

[22] Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, and Ion Stoica. Discretized streams: An efficient and fault-tolerant model for stream processing on large clusters. In Proceedings of the 4th USENIX Conference on Hot Topics in Cloud Computing, HotCloud’12, pages 10–10, Berkeley, CA, USA, 2012. USENIX Association.

[23] Dataflow Programming Model. https://ci.apache.org/projects/flink/flink-docs-release-1.2/concepts/programming-model.html

[24] Discretized Streams (DStreams). https://spark.apache.org/docs/latest/streaming-programming-guide.html#discretized-streams-dstreams

Distributed stream processing frameworks – what they are and how they perform

This blog post aims to provide an overview of the topic of stream processing and its capabilities in a large-scale environment. The post starts with an introduction to stream processing. After that, it explains how stream processing works and shows different areas of application as well as some common stream processing frameworks. Finally, the article provides a performance comparison of several common frameworks based on benchmarking data.


Isolation and Consistency in Databases

by Samuel Hack and Sebastian Wachter.

Most people assume that the data coming from a database is correct. For most applications this is true, but when a database is pushed to its limits, this is no longer always the case. What happens if a value is changed by another user exactly while it is being queried? Which entry comes first when two users create different values at the same time? Especially in systems where one database server is not enough, it is very important to choose the right rules for isolation and consistency; otherwise this can lead to data loss or, in the worst case, to incorrect values in the application.

Most database systems promise virtually nothing by default. It is therefore important to learn which isolation and consistency levels the database promises and then adjust them to fit the application.

In this paper we give an overview of what isolation and consistency levels are, which levels are available in databases, and what errors can occur in databases. Afterwards we give a short introduction to the CAP theorem and discuss the problems it creates for distributed systems.


How internet giants deliver their data to the world

In the course of attending the lecture “Ultra Large Scale Systems” I was intrigued by the subject of traffic load balancing in ultra-large-scale systems. Out of this large topic I decided to look at traffic distribution at the frontend in detail and gave a presentation about it as part of the lecture. As this subject proved difficult to comprehend once all related factors are considered, multiple questions remained open. To elaborate on these questions, I decided to write this blog post and provide a more detailed view of the topic for those interested. I will not discuss traffic load balancing inside an internal infrastructure; corresponding literature can be found at the end of this article. Despite concentrating only on the frontend part of the equation, an in-depth look into the workings of traffic load balancing will be provided.


Large Scale Deployment for Deep Learning Models with TensorFlow Serving


Introduction

“How do you turn a trained model into a product, that will bring value to your enterprise?”

In recent years, serving has become a hot topic in machine learning. With the ongoing success of deep neural networks, there is a growing demand for solutions that address the increasing complexity of inference at scale. This article will explore some of the challenges of serving machine learning models in production. After a brief overview of existing solutions, it will take a closer look at Google’s TensorFlow-Serving system and investigate its capabilities. Note: Even though they may be closely related, this article will not deal with the training aspect of machine learning, only inference.

Inference and Serving

Before diving in, it is important to differentiate between the training and inference phases, because they have completely different requirements.

  • Training is extremely compute-intensive. The goal here is to maximize the number of compute operations in a given time. Latency is not of concern.
  • Inference costs only a fraction of the computing power that training does. However, it should be fast. When you query the model, you want the answer immediately. Inference must be optimized for latency and throughput.

There are two ways to deploy a model for inference, and which one to use largely depends on the use case. First, you can push the entire model to client devices and have them do inference there. Lots of ML features are already baked into our mobile devices this way. This works well for some applications, e.g. Face ID or activity detection on phones, but falls flat for many other large-scale industrial applications. You probably won’t have latency problems, but you are limited to the client’s compute power and local information. On the other hand, you can serve the model yourself. This is suitable for industrial-scale applications, such as recommender systems, fraud detection schemes, intelligent intrusion detection systems and so forth. Serving allows for much larger models, direct integration into your own systems, and the direct control and insights that come with it.

Serving Machine Learning at Scale

Of course, it’s never that easy. In most “real-world” scenarios, there isn’t really such a thing as a “finished ML model”. Consider the “Cross-industry standard process for data mining”:

Fig. 1 – Back to the basics: the six phases of CRISP-DM. Source

It might be ancient, but it describes a key concept for successful data mining/machine learning: It is a continuous process [12]. Deployment is part of this process, which means: You will replace your productive models, and you will do it a lot! This will happen for a number of reasons:

  • Data freshness: The ML model is trained on historical data. This data can go stale quickly, because new patterns constantly appear in the real world. Model performance will deteriorate, and you must replace the model with one that was trained on more recent data, before performance drops too low.
  • Model revision: With time, retraining the model might just not be enough to keep the performance up. At this point you need to revise the model architecture itself, perhaps even start from scratch.
  • Experiments: Perhaps you want to try another approach to a problem. For that reason you want to load a temporary, new model, but not discontinue your current one.
  • Rollbacks: Something went wrong, and you need to revert to a previous version.

Version control and lifecycle management aren’t exactly new ideas. However, here they come with a caveat: since artificial neural networks are essentially “clunky, massive databases” [5], loading and unloading them can have an impact on performance. For reference, some of the most impactful deep models of recent years are a few hundred megabytes in size (AlexNet: 240 MB, VGG-19: 574 MB, ResNet-200: 519 MB). But as model performance tends to scale with depth, model size can easily grow to multiple gigabytes. That might not be much in terms of “Big Data”, but it is still capable of causing ugly latency spikes when implemented poorly. Besides ML performance metrics, the primary concerns are latency and throughput. Thus, the serving solution should be able to [4]:

  • quickly replace a loaded model with another,
  • have multiple models loaded at the same time, in the same process,
  • cope with differences in model size and computational complexity,
  • avoid latency spikes when new models are loaded into RAM,
  • if possible, be optimized for GPUs and TPUs to accelerate inference,
  • scale out inference horizontally, depending on demand.

Serving Before “Model Servers”

Until some three years ago, you basically had to build your ML serving solution yourself. A popular approach was (and still is) to use Flask or some other framework to serve requests against the model, some WSGI server to handle multiple requests at once, and to put it all behind a low-footprint web server like Nginx.

Fig. 2 – An exemplary serving solution architecture. Source: [8]

However, while initially simple, these solutions are not meant to perform at “ultra large” scale on their own. They have difficulty benefiting from hardware acceleration and can become complex fast. If you needed scale, you had to create your own solution, like Facebook’s “FBLearnerPredictor” or Uber’s “Michelangelo”. Within Google, initially simple solutions would often evolve into sophisticated, complex pieces of software, that scaled but couldn’t be repurposed elsewhere [1].
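
For reference, a hand-rolled serving endpoint of the kind described above might look like the hedged sketch below: Flask handles the HTTP layer, a WSGI server such as gunicorn would handle concurrent requests, and Nginx would sit in front as a reverse proxy. The model file and its predict() interface are assumptions (a pickled scikit-learn style model), not part of the original article.

import pickle
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical pre-trained model, loaded once at startup.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]      # e.g. [[5.1, 3.5, 1.4, 0.2]]
    prediction = model.predict(features).tolist()  # delegate inference to the model
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    # Development server only; production setups run gunicorn/uWSGI behind Nginx.
    app.run(host="0.0.0.0", port=5000)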

The Rise of Model Servers

Recent years have seen the creation of various model serving systems, “model servers”, for general machine learning purposes. They take inspiration from the design principles of web application servers and interface through standard web APIs (REST/RPC), while hiding most of their complexity. Along with simpler deployment and customization, model servers also offer machine-learning-specific optimizations, e.g. support for Nvidia GPUs or Google TPUs. Most model servers have some degree of interoperability with other machine learning platforms, especially the more popular ones. That said, you may still restrict your options, depending on your choice of platform.

Fig. 3 – Exemplary model server for an image recognition task. Source: [10]

A selection of popular model serving and inference solutions includes:

  • TensorFlow Serving (Google)
  • TensorRT (Nvidia)
  • Model Server for Apache MXNet (Amazon)
  • Clipper
  • MLflow
  • DeepDetect
  • Skymind Intelligence Layer for Deeplearning4j

TensorFlow-Serving

By far the most battle-tested model serving system out there is Google’s own TensorFlow-Serving. It is used in Google’s internal model hosting service TFS², as part of their TFX general purpose machine learning platform [2]. It drives services from the Google PlayStore’s recommender system to Google’s own, fully hosted “Cloud Machine Learning Engine”. TensorFlow-Serving natively uses gRPC, but it also supports RESTful APIs. The software can be downloaded as a binary, Docker image or as a C++ library.

Architecture

The core of TensorFlow-Serving is made up of four elements: servables, loaders, sources and managers. The central element in TensorFlow-Serving is the servable [3]. This is where your ML model lives. Servables are the objects that TensorFlow-Serving uses for inference. For example, one servable could correspond to one version of your model. Servables can be simple or complicated, anything from lookup tables to multi-gigabyte deep neural networks. The lifecycle of a servable is managed by a loader, which is responsible for loading the servable into RAM and unloading it again. Sources provide the file system where saved models are stored. They also provide the list of specific servables that should be loaded and used in production, the aspired versions. Managers are the broadest class. Their job is to handle the full life cycle of servables, i.e. loading, serving and unloading the aspired versions. They try to fulfill the requests from sources with respect to the specified version policy.

Fig. 4 – TensorFlow-Serving architecture overview. Source: [3]

When a servable is elevated to an aspired version, its source creates a loader object for it. This object only contains metadata at first, not the complete (potentially large) servable. The manager listens for calls from loaders that inform it of new aspired versions. According to its version policy, the manager then executes the requested actions, such as loading the aspired version and unloading the previous one. Loading a servable can be temporarily blocked if resources are not available yet. Unloading a servable can be postponed while there are still active requests to it. Finally, clients interface with the TensorFlow-Serving core through the manager. Both requests and responses are JSON objects.

Simple Serving Example

Getting started with a minimal setup is as simple as pulling the tensorflow/serving Docker image and pointing it at the saved model file [16]. Here I’m using a version of ResNet v2, a deep CNN for image recognition that has been pretrained on the ImageNet dataset. The image below is encoded in Base64 and sent to the manager as a JSON object.

Fig. 5 – Some random image to predict. Source

The prediction output of this model consists of the estimated probabilities for each of the 1000 classes in the ImageNet dataset, and the index of the most likely class.

Fig. 6 – Model output, class 771 corresponds to “running_shoe”.
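
For reference, the client side of the request described above could look like the following hedged sketch, assuming the TensorFlow-Serving container exposes its default REST port 8501 and the model was loaded under the name "resnet"; the file name, model name and output fields depend on the exported model and are assumptions here.

import base64
import json
import requests

with open("shoe.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# TensorFlow-Serving's RESTful predict API expects a JSON list of instances.
payload = {"instances": [{"b64": image_b64}]}
response = requests.post(
    "http://localhost:8501/v1/models/resnet:predict",
    data=json.dumps(payload))

prediction = response.json()["predictions"][0]
print(prediction["classes"])   # index of the most likely ImageNet class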

Performance

Implementing and hosting a multi-model serving solution for an industrial-scale web application with millions of users, just for benchmarks, is slightly out of scope for now. However, Google provides some numbers that should give an idea of what you can expect TensorFlow-Serving to do for you.

Latency

A strong point of TensorFlow-Serving is multi-tenancy, i.e. serving multiple models in the same process concurrently. The key problem with this is avoiding cross-model interference, i.e. one model’s performance characteristics affecting those of another. This is especially challenging while models are being loaded into RAM. Google’s solution is to provide a separate thread pool for model loading. They report that even under heavy load, while constantly switching between models, the 99th-percentile inference request latency stayed in the range of ~75 to ~150 milliseconds in their own TFX benchmarks [2].

Throughput

Google claims that the serving system on its own can handle around 100,000 requests per second per core on a 16-vCPU Intel Xeon E5 2.6 GHz machine [1]. That, however, ignores API overhead and model complexity, which may significantly impact throughput. To accelerate inference on large models with GPUs or TPUs, requests can be batched together and processed jointly; whether this affects request latency is not disclosed. Since late February 2019 (TensorFlow-Serving v1.13), TensorFlow-Serving can work directly in conjunction with TensorRT [14], Nvidia’s high-performance deep learning inference platform, which claims a 40x increase in throughput compared to CPU-only methods [15].

Usage and Adoption

In their paper on TFX (TensorFlow Extended), Google presents their own machine learning platform, which many of its services use [2]. TFX’s serving component, TFS², uses TensorFlow-Serving. As of November 2017, TensorFlow-Serving was handling tens of millions of inferences per second across more than 1,100 of Google’s own projects [13]. One of the first deployments of TFX is the recommender system for Google Play, which has millions of apps and over a billion active users (over two billion if you count devices). Furthermore, TensorFlow-Serving is also used by companies like IBM, SAP and Cloudera in their respective multi-purpose machine learning and database platforms [2].

Limitations

Today’s machine learning applications are very much capable of smashing all practical limits: DeepMind’s AlphaGo required 1920 CPUs and 280 GPUs running concurrently in real time for inference, for a single “client” [6]. That example might be excessive, but the power of deep ML models does scale with their size and compute complexity. Deep learning models today can become so large that they don’t fit on a single server node anymore (Google claims that they can already serve models up to one terabyte in size in production, using a technique called model sharding [13]). Sometimes it is worth investing the extra compute power, and sometimes you just need to squeeze that extra 0.1 percent of accuracy out of your model, but often there are diminishing returns. To wrap it up, there may be a trade-off between the power of your model and latency, throughput and runtime cost.

Conclusion

When you serve ML models, your return on investment is largely determined by two factors: How easily you can scale out inference and how fast you can adapt your model to change. Model servers like TensorFlow-Serving address the lifecycle of machine learning models, without making the process disruptive in a productive environment. A good serving solution can reduce both runtime and implementation costs by a significant margin. While building a productive machine learning system at scale has to integrate a myriad different steps from data preparation to training, validation and testing, a scalable serving solution is the key to making it economically viable.

References and Further Reading

  1. Olston, C., Fiedel, N., Gorovoy, K., Harmsen, J., Lao, L., Li, F., Rajashekhar, V., Ramesh, S., and Soyke, J. (2017). TensorFlow-Serving: Flexible, high-performance ML serving. CoRR, abs/1712.06139
  2. Baylor, D., Breck, E., Cheng, H.-T., Fiedel, N., Foo, C. Y., Haque, Z., Haykal, S., Ispir, M., Jain, V., Koc, L., Koo, C. Y., Lew, L., Mewald, C., Modi, A. N., Polyzotis, N., Ramesh, S., Roy, S., Whang, S. E., Wicke, M., Wilkiewicz, J., Zhang, X., and Zinkevich, M. (2017). Tfx: A tensorflow-based production-scale machine learning platform. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’17, pages 1387–1395, New York, NY, USA. ACM.
  3. TensorFlow-Serving documentation – https://www.tensorflow.org/tfx/guide/serving (accessed 11.03.2019)
  4. Serving Models in Production with TensorFlow Serving (TensorFlow Dev Summit 2017) – https://www.youtube.com/watch?v=q_IkJcPyNl0 (accessed 11.03.2019)
  5. Difference Inference vs. Training – https://blogs.nvidia.com/blog/2016/08/22/difference-deep-learning-training-inference-ai/ (accessed 11.03.2019)
  6. Challenges of ML Deployment – https://www.youtube.com/watch?v=JKxIiSfWtjI (accessed 11.03.2019)
  7. Lessons Learned from ML deployment – https://www.youtube.com/watch?v=-UYyyeYJAoQ (accessed 11.03.2019)
  8. https://hackernoon.com/a-guide-to-scaling-machine-learning-models-in-production-aa8831163846 (accessed 11.03.2019)
  9. https://medium.com/@maheshkkumar/a-guide-to-deploying-machine-deep-learning-model-s-in-production-e497fd4b734a (accessed 11.03.2019)
  10. https://medium.com/@vikati/the-rise-of-the-model-servers-9395522b6c58 (accessed 11.03.2019)
  11. https://blog.algorithmia.com/deploying-deep-learning-cloud-services/ (accessed 11.03.2019)
  12. https://the-modeling-agency.com/crisp-dm.pdf (accessed 11.03.2019)
  13. https://ai.googleblog.com/2017/11/latest-innovations-in-tensorflow-serving.html (accessed 12.03.2019)
  14. https://developer.nvidia.com/tensorrt (accessed 12.03.2019)
  15. https://medium.com/tensorflow/optimizing-tensorflow-serving-performance-with-nvidia-tensorrt-6d8a2347869a (accessed 12.03.2019)
  16. https://medium.com/tensorflow/serving-ml-quickly-with-tensorflow-serving-and-docker-7df7094aa008 (accessed 12.03.2019)
  17. https://www.slideshare.net/shunyaueta/tfx-a-tensor-flowbased-productionscale-machine-learning-platform (accessed 12.03.2019)

The Renaissance of column stores

While attending the lecture ‘Ultra Large Scale Systems’ I was introduced to the quite intriguing topic of high-performance data storage systems. One subject that caught my special attention was column-oriented database management systems (column stores), about which I decided to give a presentation. Being quite lengthy and intricate, the presentation left my colleagues more baffled than informed. So I decided to write a blog post to recapitulate the topic for all those who were left with unanswered questions that day, and for everyone else who might be interested in such matters. I believe this article, even though it depicts a quite technical and specialized topic, is nevertheless of general interest because it shows how a system can be optimized for performance by emphasizing inherent design characteristics.

Introduction

So what are column stores and what do we need them for?

This may be the most eminent question that crosses the mind of people who hear the term ‘column stores’ for the first time. Well, let me tell you what they aren’t: a euphonic buzzword which, once uttered, will capture the attention of every IT geek in close vicinity. They are, rather, a mature technology that has been around since the early 70s and has gone through constant architectural refinements, which allowed it to establish a foothold in the field of data storage systems used for large-scale systems and big data management [1, 2]. Nevertheless, because of their quite specific area of application, column stores still cover a rather opaque field of technical innovation. This article therefore tries to provide a brief overview of the subject by giving insights into the architecture, design concepts and current technical advancements concerning column stores.

Most modern database management systems (DBMSs) rely on the N-ary Storage Model (NSM). Here, records are stored contiguously starting from the beginning of each disk page, while an offset table at the end of the page points to the start of each tuple (record). Thus, within each page the tuples are stored in sequence until the maximum page length of the storage system has been reached and a new page has to be created (figure 1). Database systems centered on this model show good access times when executing queries that either insert or modify single tuples or that result in a projection of a limited number of complete tuples [3]. The major drawback of this model, however, is its poor cache performance, because it often burdens the cache with unnecessary attributes [4]. In contrast, column stores follow an entirely different concept called the Decomposition Storage Model (DSM), where tables are vertically fragmented and each attribute is stored in a separate column (figure 1). The different attribute values of each tuple can then be reassembled by correlating their absolute positions within each page. Another approach is to use binary relations based on an artificial key (surrogate) that allows reconnecting the different attributes to generate a partial or complete reconstruction of the initial tuple [5]. The performance advantage of this model shows when executing queries that require operations on entire columns. Those include aggregation queries (where only subsets of the entire data are required) or scan operations. Furthermore, since the data composition of each column is very homogeneous with little entropy, much better compression ratios can be reached [6]. This becomes even more pronounced as increasingly large datasets have to be processed.

Figure 1: Row-based vs. column-based storage models

Diving into the internals

Trying to decide between a row-oriented and a column-oriented storage system will inevitably result in pondering the pros and cons of the above-mentioned architectures. It is clear that both systems have their strengths and weaknesses, and the choice, as so often, entirely depends on the problem to be solved. This can be elaborated by taking a closer look at an example. Let’s take the table depicted in figure 1 and imagine a row-oriented database system storing information about company employees in a similar manner. In that case, queries resulting in key lookups and the extraction of single but complete employee records would be executed with high performance by the system. This could be, for example, the search for the record of a specific employee by ‘ID’ or ‘Name’. This could be improved further by putting indexes on high-cardinality columns like ‘ID’, which would speed up the search even more. Thus, an application or service operating on the database and issuing requests of that kind would definitely benefit from the advantages provided by a row-oriented storage system.

But what about a request to determine the average age of all male employees stored in the database? This kind of analytical query could, in the worst case, result in a complete scan of the entire table and would generate a completely different strain on the system [7]. It could be mitigated by the use of composite indexes, which, however, are only feasible when the table contains a small number of columns. The latter is not the case in many big data storage systems, where hundreds of columns per table are the norm. Working with composite indexes here will sooner or later produce an immense processing overhead, which in the long run consumes substantial system resources [8]. This ultimately means that for systems containing tables with sizes in the range of several hundred gigabytes, many analytical queries could potentially initiate sequential scans of the entire dataset. For this scenario, column stores represent the better option, because query execution would, by design, be limited to only those attributes required for the final projection. This spares computation resources by avoiding scans over large amounts of irrelevant data and, as a result, leads to overall better performance of the system. Consequently, the right choice of database system should be guided by the demands posed by the services operating on it, because they ultimately define the predominant query structure processed by the system.

Query processing models

Figure 2: Row-scanner implementation

To understand the subjects in the sections that follow, some principal design aspects of how row and column stores execute queries have to be outlined. When comparing the query execution implementations of the two systems, fundamental differences become obvious. Building on the previous example, let’s observe the execution of the simple SQL query “SELECT Name, Profession FROM Employees WHERE Age > 30”. The expected result is a list of the names and professions of all employees older than 30 years. The request leads to low-level database operations in which scanning processes on the corresponding table gather the necessary datasets. At the center of every query execution are scanners that apply predicates to tuples, generate projections and provide their parent operators with the corresponding output data.

In the case of the row-scanner implementation (figure 2), the execution process is quite straightforward. The data is fetched from the storage layer in the form of record batches on which the scanner performs filter operations. The data is then forwarded to the parent operator, which aggregates it to assemble the final projection [9]. In the case of the column store, the process looks considerably different. As illustrated in figure 3, operations are executed on single columns rather than entire tables. The initial scan is performed on the single column containing the data to which the predicate has to be applied. In the first step, values are filtered by predicate evaluation, leaving only a subset of the original dataset. However, instead of returning the values directly, only a list of their corresponding column positions is returned. In the steps that follow, these positions are correlated with the columns containing the requested attributes to extract the corresponding values, which are then aggregated to assemble the projection [9]. The example already indicates why analytical queries may perform better on column stores than on row stores: instead of scanning the entire table to generate an output consisting of only a small subset of all attributes of the extracted records, operations are limited to that subset of attributes from the beginning, which significantly reduces the operational overhead.

Figure 3: Column-scanner implementation
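
The following minimal Python sketch (simple in-memory lists instead of real pages and operators) mimics the two scanner strategies for the example query and shows how the column scanner works with a position list:

# Sketch of the two scanner strategies for:
#   SELECT Name, Profession FROM Employees WHERE Age > 30

employees_rows = [
    ("Smith", 25, "Clerk"), ("Jones", 38, "Accountant"), ("Brown", 42, "Manager"),
]
employees_cols = {
    "Name":       ["Smith", "Jones", "Brown"],
    "Age":        [25, 38, 42],
    "Profession": ["Clerk", "Accountant", "Manager"],
}

def row_scanner(rows):
    # Fetch whole records, apply the predicate, project afterwards.
    return [(name, prof) for (name, age, prof) in rows if age > 30]

def column_scanner(cols):
    # Step 1: scan only the 'Age' column and return matching positions.
    positions = [i for i, age in enumerate(cols["Age"]) if age > 30]
    # Step 2: use the position list to pick values from the projected columns.
    return [(cols["Name"][i], cols["Profession"][i]) for i in positions]

assert row_scanner(employees_rows) == column_scanner(employees_cols)
print(column_scanner(employees_cols))  # [('Jones', 'Accountant'), ('Brown', 'Manager')]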

Bound to be optimized

Simply storing data in the form of columns will not by itself bring the improvements that can be expected from column stores. In fact, unoptimized column stores are outperformed by row stores in most scenarios. Consequently, a number of optimization techniques have been adopted over the past years, yielding significant performance enhancements. They have allowed column stores to be successfully utilized in areas where large datasets have to be handled, such as data warehousing, data mining or data analytics [4]. The following section therefore gives a brief overview of a selection of optimization techniques that have been integrated into many modern column store systems.

Compression

Given the characteristics of columnar data, compression is the most obvious approach to reduce disk space usage. Values from the same column tend to fall into the same domain and therefore display low information entropy and high value locality [10]. These qualities allow compressing one column at a time and even permit different compression algorithms for individual columns. In addition, if values are sorted within a column, which is common in column store systems, that column becomes remarkably compressible [6]. Another technique is ‘frequency partitioning’, where a column is reorganized in such a way that each page of the column shows as little information entropy as possible. To accomplish this, certain column stores reorganize columns based on the frequency of their values and, for example, store frequent values together on the same page [11]. The gains of such methods are apparent: investigations suggest that while row stores reach average compression ratios of about 1:3, column stores usually achieve ratios of 1:10 or better. Finally, in addition to lowering disk space usage, compression also helps to improve query performance. If data is compressed, less time is spent on I/O during query execution because of reduced seek times, increased buffer hit rates and less time for transferring data from disk into memory and from there to the CPU [6, 10].

Dictionary Encoding

This form of compression works well on columns composed of a small number of very frequent values. For each distinct value appearing in a column, an entry is created in a dictionary table. The values in the column are then represented by integers referencing the positions in this table. Dictionary encoding can be applied not only to single columns but also to entire blocks [12]. Another advantage is that it allows working with columns of fixed length if the system keeps all codes at the same width, which can further increase data processing speeds in systems that rely on vectorized query execution.
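
A minimal sketch of the idea, assuming a simple in-memory column:

# Sketch: dictionary-encode a low-cardinality column.
def dict_encode(column):
    dictionary = {}
    codes = []
    for value in column:
        code = dictionary.setdefault(value, len(dictionary))
        codes.append(code)  # fixed-width integer codes instead of strings
    return dictionary, codes

def dict_decode(dictionary, codes):
    reverse = {code: value for value, code in dictionary.items()}
    return [reverse[c] for c in codes]

profession = ["Clerk", "Accountant", "Clerk", "Manager", "Clerk"]
dictionary, codes = dict_encode(profession)
print(dictionary)  # {'Clerk': 0, 'Accountant': 1, 'Manager': 2}
print(codes)       # [0, 1, 0, 2, 0]
assert dict_decode(dictionary, codes) == profession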

Figure 4: Examples of ‘Dictionary’ encoding and ‘Run-Length’ encoding

Run-Length Encoding

This encoding is well suited to compress columns containing repeating sequences of the same value by reducing them to a compact singular representation. The column entries are replaced by triples describing the original value, its start position and its frequency (run length). Hence, when a column starts with 20 consecutive entries of the value ‘male’, these entries can be reduced to the triple (‘male’, 1, 20). This form of compression works especially well on sorted columns or columns with reasonably sized runs of the same value [10].
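
A possible sketch of such an encoder, using 1-based positions as in the example above:

# Sketch: run-length encode a (preferably sorted) column into
# (value, start_position, run_length) triples.
def rle_encode(column):
    triples = []
    for pos, value in enumerate(column, start=1):
        if triples and triples[-1][0] == value:
            v, start, length = triples[-1]
            triples[-1] = (v, start, length + 1)  # extend the current run
        else:
            triples.append((value, pos, 1))       # start a new run
    return triples

gender = ["male"] * 20 + ["female"] * 5
print(rle_encode(gender))  # [('male', 1, 20), ('female', 21, 5)]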

Figure 5: Examples of ‘Bit-Vector’ encoding, ‘Differential’ encoding and ‘Frame of Reference’ encoding

Bit-Vector Encoding

In this type of encoding, a bit-string with the same length as the column itself is generated for each unique value in the column. The string contains only binary entries: a ‘1’ if the associated value occurs at the corresponding position in the column, a ‘0’ otherwise. ‘Bit-Vector’ encoding is frequently used when columns have a limited number of unique values. In addition, the bit-vectors themselves can be compressed further, which makes the encoding usable even on columns containing a larger number of unique values [10].
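
A simple sketch of the encoding, generating one bit-vector per unique value:

# Sketch: one bit-string per unique value, '1' marks the positions
# at which that value occurs in the column.
def bitvector_encode(column):
    return {
        value: [1 if v == value else 0 for v in column]
        for value in set(column)
    }

gender = ["male", "female", "male", "male", "female"]
vectors = bitvector_encode(gender)
print(vectors["male"])    # [1, 0, 1, 1, 0]
print(vectors["female"])  # [0, 1, 0, 0, 1]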

Differential Encoding and Frame of Reference Encoding

‘Differential’ encoding expresses values as offsets from their predecessor, stored with a fixed number of bits. A value sequence beginning with ‘10, 8, 6, 12’, for example, can be represented as ‘10, -2, -2, 6’. Since the bit-size of the offsets is fixed and cannot be changed once established, special escape codes have to be used for values whose offset cannot be represented with the specified bit-size. The encoding performs well on columns containing sequences of increasing or decreasing values that exhibit value locality, such as inverted lists, timestamps, object IDs or sorted numeric columns. A variation of the concept is ‘Frame of Reference’ encoding, which works very similarly. The main difference is that the offsets do not refer to the direct predecessor but to a reference value within the set. The previous sequence ‘10, 8, 6, 12’ would, for example, be represented as ‘10, -2, -4, 2’ with ‘10’ being the reference value [13].
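
Both ideas can be sketched in a few lines; note that the fixed bit-width and the escape codes of real implementations are omitted here for brevity:

# Sketch: differential encoding (offset from the previous value) and
# frame-of-reference encoding (offsets from a single reference value).
def differential_encode(values):
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def frame_of_reference_encode(values):
    reference = values[0]  # reference value kept as the first entry
    return [reference] + [v - reference for v in values[1:]]

seq = [10, 8, 6, 12]
print(differential_encode(seq))        # [10, -2, -2, 6]
print(frame_of_reference_encode(seq))  # [10, -2, -4, 2]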

Operations on compressed data

Figure 6: Layout of a compression block

Performance gains through compression can be maximized when operators are able to act directly on compressed values without prior decompression [14]. This can be achieved by introducing buffers that hold column data in compressed form and provide an API which allows query operators to work directly on the compressed values. Consequently, a component wrapping an intermediate representation for compressed data, termed a ‘compression block’, is added to the query executor (figure 6). The methods provided by the API can be utilized by query operators to access compressed data directly without having to decompress and iterate through it [6, 10]. The illustration in figure 7 exemplifies the design by showing what a query execution on compressed data can look like. The query determines the number of male and female employees who work as accountants. In the first step, a filter operation is performed on the sorted and ‘Run-Length’ encoded ‘Profession’ column by calling the corresponding API method, which returns the index positions of the values passing the predicate condition (Profession = ‘Accountant’). These positions can then be used to delimit the corresponding region within the ‘Bit-Vector’ encoded ‘Gender’ column. Finally, the number of males and females working as accountants can be calculated using an API method that sums up all occurrences of the value ‘1’ within that interval. None of these operations required any decompression of the scanned data.

Figure 7: Example depicting query operations on compressed data
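
The following sketch imitates this behaviour with hypothetical API functions (rle_positions, bitvector_count); it is not the API of any specific system, but it shows that the counts can be computed without decompressing either column:

# Sorted 'Profession' column, run-length encoded as (value, start, run_length)
# triples with 1-based positions, as in the running example.
profession_rle = [("Accountant", 1, 4), ("Clerk", 5, 3), ("Manager", 8, 2)]

# 'Gender' column, bit-vector encoded: one bit per row, '1' marks occurrence.
gender_bits = {
    "male":   [1, 0, 1, 1, 0, 1, 0, 0, 1],
    "female": [0, 1, 0, 0, 1, 0, 1, 1, 0],
}

def rle_positions(rle_column, predicate_value):
    # Return the (start, end) interval of rows matching the predicate
    # without decompressing the column.
    for value, start, length in rle_column:
        if value == predicate_value:
            return start, start + length - 1
    return None

def bitvector_count(bits, start, end):
    # Count set bits inside the interval (1-based, inclusive).
    return sum(bits[start - 1:end])

start, end = rle_positions(profession_rle, "Accountant")
print(bitvector_count(gender_bits["male"], start, end))    # male accountants
print(bitvector_count(gender_bits["female"], start, end))  # female accountants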

Vectorized execution

Figure 8: Tuple-at-a-time vs. vectorized execution

Most traditional implementations of the query execution layer are based on the ‘iterator’ or ‘tuple-at-a-time’ model, where individual tuples are moved from one operator to another through the query plan tree [10]. Each operator normally provides a next() method which outputs a single tuple that can be used as input by a caller operator further up the execution tree. The advantage of this approach is that the materialization of intermediate results is minimal. An alternative is ‘vectorized execution’, where, in contrast to the ‘tuple-at-a-time’ model, each operator returns a vector of N tuples instead of a single tuple (figure 8). This approach offers several advantages [6], as sketched in the example following the list:

  1. Reduction of interpretation overhead by limiting the number of function calls made by the query interpreter.
  2. Improved cache locality, since the vector size can be adjusted to the CPU cache.
  3. Better profiling, because operators execute all expression evaluation work in a vectorized fashion (i.e. array-at-a-time), keeping the overhead for measuring individual operations low.
  4. Exploitation of the columnar format by reading larger batches of N tuples at a time, allowing array iteration with good loop pipelining, so that operations can be executed repeatedly within one function call.
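
The following sketch contrasts the two models with hypothetical operator classes; the vector size would in practice be tuned to the CPU cache:

# Sketch contrasting 'tuple-at-a-time' with vectorized execution.
# Operator interfaces and names are illustrative, not a concrete engine's API.

class TupleScan:
    def __init__(self, rows):
        self.rows = iter(rows)
    def next(self):
        # Returns a single tuple per call -> one function call per row.
        return next(self.rows, None)

class VectorScan:
    def __init__(self, rows, vector_size=1024):
        self.rows = rows
        self.pos = 0
        self.vector_size = vector_size  # tuned to fit the CPU cache
    def next(self):
        # Returns a batch ('vector') of up to N tuples per call.
        if self.pos >= len(self.rows):
            return None
        batch = self.rows[self.pos:self.pos + self.vector_size]
        self.pos += self.vector_size
        return batch

rows = [(i, i % 2) for i in range(10_000)]

row = TupleScan(rows).next()  # one call per tuple -> ~10,000 calls for a full scan

scan = VectorScan(rows, vector_size=1024)
total = 0
while (batch := scan.next()) is not None:
    # Predicate evaluated array-at-a-time inside a single call.
    total += sum(1 for _, flag in batch if flag == 1)
print(total)  # 5000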

Early vs. late materialization

One fundamental problem when designing the execution plan of queries for column stores is to determine when the projection of columns should occur. In a column store, the information of a logical entity or object is distributed over several column pages on the storage medium. As a result, during the execution of most queries, several attributes of a single entity have to be accessed. In many cases, however, database outputs are expected to be entity-based and not column-based. Consequently, during every execution the information scattered over multiple columns has to be reassembled at some point to form ‘rows’ of information about the entity. This joining of tuples is a process very common for column stores and has been coined ‘materialization’ [15]. There are principally two design concepts that address the problem of column projection: ‘Early Materialization’ and ‘Late Materialization’. During query execution, most naive column stores first select the columns of relevance, construct tuples from their component attributes, and then execute standard row store operations on the resulting rows. This approach of constructing tuples early in the query plan, called ‘Early Materialization’, will in many cases already result in better performance for analytical queries compared to typical row stores, but much of the potential of column stores is still left untouched.

Modern column stores have adopted the concept of ‘Late Materialization’, where operations are performed on single columns for as long as possible and the projection of columns occurs late in the query plan. From this arises the necessity to work with intermediate ‘position lists’ to join operations that have been conducted on individual columns. These lists can be represented as simple arrays of positions, as bit-strings or as sets of position ranges. The position representations can then be intersected to extract the values of interest, which finally flow into the projection [15]. The concept of ‘Late Materialization’ offers several advantages that result in significant performance boosts, attributable to the following characteristics:

  1. Given certain selection and aggregation operations, the materialization of some tuples can be skipped entirely.
  2. Decompression of data for the reconstruction of tuples can be avoided, which allows continuous operation on compressed data in memory.
  3. Cache performance is improved when operating directly on column data, because attributes irrelevant to a given operation are not loaded.
  4. CPU resources are used efficiently through operations on highly compressible position representations which, given their structure, are well suited for CPU processing.

Figure 9: Analytical query to be executed using late materialization

In the following, a typical query execution using late materialization, as implemented in modern column store systems, will be described. The query consists of a simple SQL statement that determines the number of all female employees over 30 years of age, grouped by profession (figure 9). In this example, the intermediate lists are expressed as bit-vectors representing the positions of the values that passed the predicates, on which bit-wise ‘AND’ operations can be executed.

The query execution illustrated in figure 10 is a select-project operation in which essentially two columns of a table (Employees) are filtered, followed by a sum aggregation that generates the projection. In the first step, the two predicates are applied to the ‘Age’ and ‘Gender’ columns, which results in two bit-vectors representing only those values that passed the predicates. These are then intersected by a bit-wise ‘AND’ operation, and the resulting bit-vector is used to extract the corresponding values from the ‘Profession’ column. In the last step, the results are aggregated to assemble the projection by grouping and summing the values from the previous operation.

Figure 10: Execution of query using late materialization
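
Expressed as a minimal Python sketch with illustrative data, the plan from figure 10 boils down to three steps:

# Sketch of the late-materialization plan from figures 9 and 10:
# count female employees older than 30, grouped by profession.
from collections import Counter

age        = [25, 38, 42, 31, 29, 45]
gender     = ["male", "female", "male", "female", "female", "female"]
profession = ["Clerk", "Accountant", "Manager", "Accountant", "Clerk", "Manager"]

# Step 1: evaluate each predicate on its own column -> position bit-vectors.
bits_age    = [1 if a > 30 else 0 for a in age]
bits_gender = [1 if g == "female" else 0 for g in gender]

# Step 2: intersect the bit-vectors with a bit-wise AND.
bits = [a & g for a, g in zip(bits_age, bits_gender)]

# Step 3: materialize only the matching positions of the 'Profession' column
# and aggregate them for the final projection.
result = Counter(p for p, b in zip(profession, bits) if b)
print(result)  # Counter({'Accountant': 2, 'Manager': 1})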

Virtual IDs

A possible way to structure columns within a column store is to associate individual columns with an identifier such as a numeric primary key. Adding an identifier in such an explicit fashion, however, unavoidably introduces redundancy and increases the amount of data to be stored on disk. To avoid this, modern database systems substitute explicit ID columns with virtual identifiers that correspond to the position (offset) of a tuple in the column [5]. The design can be further enhanced by implementing columns as fixed-width dense arrays. This allows storing the attributes of an individual record at the same position across all columns of a table. In combination with offsets, this design significantly improves the localization of individual records. A value at the i-th position of a table EMP, for example, can be located and accessed by simply calculating ‘startOf(EMP)+i*width(EMP)’.
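
A small sketch using Python’s array module illustrates the idea of addressing fixed-width columns through offsets (the column names are hypothetical):

# Sketch: fixed-width columns addressed by virtual IDs (offsets) instead of an
# explicit ID column. With array.array each value occupies a fixed number of
# bytes, so the i-th value sits at startOf(column) + i * width(column).
from array import array

emp_age    = array("i", [25, 38, 42, 31])        # fixed-width signed ints
emp_salary = array("i", [2400, 5100, 6900, 3300])

i = 2  # virtual ID: the position doubles as the record identifier
print(emp_age[i], emp_salary[i])                 # attributes of the same record

# Byte-level view of the offset calculation:
width = emp_age.itemsize
print(emp_age.tobytes()[i * width:(i + 1) * width])  # raw bytes of emp_age[i]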

Database cracking

Sorted columns significantly improve the performance of column stores, including the realization of high compression ratios. Common approaches, however, require a complete sorting of columns in advance, which demands idle time and workload knowledge. More dynamic approaches aim to perform such tasks incrementally by combining them with query execution: the physical data store is continuously changed with every executed query. Sorting is thus done adaptively, in a continuous manner, and is limited to the accessed sections of a column [16]. Each query performs a partial reorganization of all processed columns, making subsequent access faster. This process is called ‘database cracking’ and allows the database system to be used immediately once data is available. It is an interesting approach to adaptive indexing, because with every range-selection query the data is reorganized and compartmentalized using the provided predicates as pivots. In that manner, optimal performance is achieved incrementally without the prior need to analyze the expected workload, tune the system and create indexes. Figure 11 shows an example in which two consecutive queries search and ‘crack’ a column (‘Age’). The first query subdivides the column into three pieces, while the second query refines the partitioning further. The final result is a partially sorted column comprised of five value ranges. The structure of the column data thus reflects the query structure and constitutes an adaptation to the data requirements of the applications (services) accessing the database.

Figure 11: Adaptive indexing of column using database cracking
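
The following deliberately simplified sketch conveys the idea: each range query repartitions the column around its predicate values. A real implementation would only reorganize the pieces actually touched by the query and would maintain a cracker index over the piece boundaries, both of which are omitted here:

# Sketch of cracking a single 'Age' column: each range query partitions the
# column around its predicate values (pivots), so the column gets
# incrementally closer to being sorted.
def crack(column, low, high):
    # Partition 'column' in place into (< low | low..high | > high)
    # and return the middle piece, i.e. the query result.
    smaller = [v for v in column if v < low]
    middle  = [v for v in column if low <= v <= high]
    larger  = [v for v in column if v > high]
    column[:] = smaller + middle + larger
    return middle

age = [42, 25, 61, 30, 55, 19, 37, 48]

print(crack(age, 30, 45))  # first query: Age BETWEEN 30 AND 45
print(age)                 # column now split into three value ranges

print(crack(age, 50, 60))  # second query refines the partitioning further
print(age)                 # column is now partially sorted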

Conclusion

The column store concept, albeit nearly half a century old, has received considerable attention during the last decade. This is due to the fact that column stores excel at analytical-style processing of large datasets [9], which has made them the storage system of choice especially for applications with OLAP-like workloads that rely on very complex queries often involving complete datasets. Investigations have shown, however, that substantial improvements of the basic concept are necessary to truly reap the benefits such an architecture can provide. Therefore, several optimizations have been introduced over the past years, some of which have been discussed here: compression, API-based compression buffers, vectorized processing, late materialization, virtual IDs and database cracking. All of them aim to significantly improve processing times by tackling performance issues from different angles. With such optimizations in place, column stores outperform row stores by an order of magnitude on analytical workloads [8]. This also indicates that the architectural design will remain relevant in the years to come, given the ever-growing number of large-scale, data-intensive applications with high workloads, including scientific data management, business intelligence, data warehousing, data mining and decision support systems.

Finally, there are still directions for future development worth investigating, for example hybrid systems that are partially column-oriented. These could be realized as architectures that store columns grouped by access frequency or that adapt to access patterns, switching between column-oriented and row-oriented table structures when needed. Another issue to be addressed is the loading time of column stores, which still compares poorly with that of row stores, especially if many views have to be materialized. Studies on possible new algorithms that could alleviate this problem while substantially improving read performance would surely be an interesting field for future investigation.

References

[1] S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton and T. Vassilakis, Dremel: Interactive Analysis of Web-scale Datasets, Proceedings of the VLDB Endowment, VLDB Endowment, 2010, Vol. 3(1-2), pp. 330-339.

[2] D. J. Abadi, P. A. Boncz and S. Harizopoulos, Column-oriented database systems, Proceedings of the VLDB Endowment, VLDB Endowment, 2009, Vol. 2(2), pp. 1664-1665.

[3] D. Bößwetter, Spaltenorientierte Datenbanken, Informatik-Spektrum, Springer, 2010, Vol. 33(1), pp. 61-65.

[4] A. El-Helw, K. A. Ross, B. Bhattacharjee, C. A. Lang and G. A. Mihaila, Column-oriented Query Processing for Row Stores, Proceedings of the ACM 14th International Workshop on Data Warehousing and OLAP, ACM, 2011, pp. 67-74.

[5] S. Idreos, F. Groffen, N. Nes, S. Manegold, K. S. Mullender and M. L. Kersten, MonetDB: Two Decades of Research in Column-oriented Database, IEEE Data Engineering Bulletin, 2012, Vol. 35(1), pp. 40-45.

[6] D. J. Abadi, Query execution in column-oriented database systems, Massachusetts Institute of Technology, 2008.

[7] D. J. Abadi, Column Stores for Wide and Sparse Data, CIDR, 2007, pp. 292-297.

[8] D. J. Abadi, S. R. Madden and N. Hachem, Column-stores vs. row-stores: how different are they really?, Proceedings of the 2008 ACM SIGMOD international conference on Management of data, 2008, pp. 967-980.

[9] S. Harizopoulos, V. Liang, D. J. Abadi and S. Madden, Performance tradeoffs in read-optimized databases, Proceedings of the 32nd international conference on very large data bases, 2006, pp. 487-498.

[10] D. Abadi, S. Madden and M. Ferreira, Integrating compression and execution in column-oriented database systems, Proceedings of the 2006 ACM SIGMOD international conference on management of data, 2006, pp. 671-682.

[11] V. Raman, G. Swart, L. Qiao, F. Reiss, V. Dialani, D. Kossmann, I. Narang and R. Sidle, Constant-Time Query Processing, IEEE 24th International Conference on Data Engineering (ICDE ’08), IEEE Computer Society, 2008, pp. 60-69.

[12] P. Raichand, A short survey of data compression techniques for column oriented databases, Journal of Global Research in Computer Science, 2013, Vol. 4(7), pp. 43-46.

[13] J. Goldstein, R. Ramakrishnan and U. Shaft, Compressing relations and indexes, Proceedings 14th International Conference on Data Engineering, IEEE, 1998, pp. 370-379.

[14] O. Polychroniou and K. A. Ross, Efficient lightweight compression alongside fast scans, Proceedings of the 11th International Workshop on Data Management on New Hardware, 2015, pp. 9.

[15] D. J. Abadi, D. S. Myers, D. J. DeWitt and S. R. Madden, Materialization strategies in a column-oriented DBMS, IEEE 23rd International Conference on Data Engineering, 2007, pp. 466-475.

[16] S. Idreos, M. L. Kersten and S. Manegold, Self-organizing Tuple Reconstruction in Column-stores, Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, ACM, 2009, pp. 297-308.