Peer2Peer Multiplayer Real-time Strategy Game “Admiral: WW2”

Admiral: WW2

1 Intro

Gaming is fun. Strategy games are fun. Multiplayer is fun. That’s the idea behind this project.

In the past I developed some games with the Unity engine – mainly 2D strategy games – and so I thought it was time for an awesome 3D multiplayer game, or rather a prototype.

The focus of this blog post is not on the game development though, but rather on the multiplayer part with the help of Cloud Computing.

But where to start? There are many ways to implement a multiplayer game. I chose a quite simple, yet most of the time very effective and cheap approach: Peer-to-Peer (P2P).

But first, let us dive into the gameplay of Admiral: WW2 (working title).

2 Game Demo

2.1 Gameplay

Admiral: WW2 is basically like the classic board game “Battleships”. You’ve got a fleet and the enemy player has got a fleet. Destroy the enemy’s fleet before your own fleet is sunk. The big difference is that Admiral: WW2 is a real-time strategy game. So the gameplay is more like a real-life simulation where you as the admiral can command your ships via direct orders:

  • Set speed of a ship (stop, slow ahead, full ahead, …)
  • Set course of a ship
  • Set the target of the ship (select a ship in the enemy fleet)

Currently there is only one ship class (the German cruiser Admiral Hipper), so the tactical options are limited. Other classes like battleships, destroyers or even aircraft carriers would greatly improve replayability; on the other hand they would need many other game mechanics to be implemented first.

Ships have multiple damage zones:

  • Hull (decreases the ship’s hitpoints or triggers a water ingress [water level of the ship increases and reduces the hitpoints based on the amount of water in the hull])
  • Turrets (disables the gun turrets)
  • Rudder (rudder cannot change direction anymore)
  • Engine/Propeller (ship cannot accelerate anymore)

If a ship loses all hitpoints, it sinks and is no longer controllable.

2.2 The Lobby Menu

Before entering the gameplay action the player needs to connect to another player to play against. This is done via the lobby menu.

Here is the place where games are hosted and all available matches are listed.

On the right hand side is the host panel. To create a game the host must enter a unique name and a port. If the IP & Port combination of the host already exists, hosting is blocked.

After entering valid information, the public IP of the host is obtained via an external service (e.g. icanhazip.com). Then the match is registered on a server and the host waits for incoming connections from other players.

On the left hand side there is the join panel. The player must enter a port before viewing the match list. After clicking “Join”, a Peer-to-Peer connection to the host is established. Currently the game only supports two players, so after both peers (host and player) are connected the game will launch.

More on the connection process later.

3 Multiplayer Communication with Peer2Peer

3.1 Peer-to-Peer

P2P allows a direct connection between the peers – in this case the game host and the joining player – using UDP packets.

No dedicated server handling all the game traffic is needed in between, which reduces hosting costs immensely.

Because most peers are behind a NAT and therefore connection requests between peers are blocked, one can make use of the NAT-Traversal method Hole-Punching.

3.1.1 P2P Connection with Hole-Punching

Given two peers A and B, a direct connection between A and B is possible if:

  • A knows the public IP of B
  • A knows the UDP port B will open
  • B knows the public IP of A
  • B knows the UDP port A will open
  • A and B initiate the connection simultaneously

This works without port-forwarding, because each peer keeps its port open just as if it had contacted a simple web server and was waiting for the response.
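
The game itself implements this in C# inside Unity; purely as an illustration, here is a minimal TypeScript sketch (using Node's dgram module) of what both peers do simultaneously once they know each other's public endpoint. The keep-alive interval and message contents are assumptions.

```typescript
import dgram from "dgram";

// Minimal hole-punching sketch: both peers run this at the same time with the
// other side's public IP and port (exchanged beforehand via the Rendezvous-Server).
function holePunch(localPort: number, remoteIp: string, remotePort: number): dgram.Socket {
  const socket = dgram.createSocket("udp4");
  socket.bind(localPort);

  // Outgoing packets open (and keep open) the NAT mapping towards the remote peer.
  const keepAlive = setInterval(() => {
    socket.send(Buffer.from("punch"), remotePort, remoteIp);
  }, 500);

  // A packet arriving from the remote peer means the punch-through worked.
  socket.on("message", (_msg, rinfo) => {
    clearInterval(keepAlive);
    console.log(`P2P connection established with ${rinfo.address}:${rinfo.port}`);
  });

  return socket;
}
```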

To exchange the public IPs and ports of the peers, a Rendezvous-Server that is not behind a NAT is required.

3.1.2 Rendezvous-Server

The Rendezvous-Server needs to be hosted on the public internet, i.e. not behind a NAT. Both peers can then send simple web requests to it, just as if the users were browsing the internet.

If peer A tells the server he wants to host a game, the server saves the public IP and port of A.

If B now decides to join A’s game the server informs B of the IP and port of A.

A is informed of B’s public IP and port as well.

After this process A and B can now hole-punch through their NATs and establish a P2P connection to each other.

A Rendezvous-Server can be very cheap, because the workload is quite small.

But there are some cases where Hole-Punching does not succeed (“…we find that about 82% of the NATs tested support hole punching for UDP…”, https://bford.info/pub/net/p2pnat/).

In those cases a Relay-Server is needed.

3.1.3 Relay-Server

The Relay-Server is only used as a backup in case P2P fails. It has to be hosted in the public internet, so behind no NAT.

Its only task is to transfer all game data from one origin peer to all other peers. So the game data just takes a little detour to the Relay-Server before continuing its usual way to the peers.

This comes at a price though. Since all of the game traffic now travels through this server, the workload can be quite high depending on the amount of information the game needs to exchange. Naturally the ping or RTT (Round Trip Time: the time it takes for a packet to travel from peer to peer and back) increases, resulting in lag. And finally, multiple Relay-Servers would be required, one in each region (Europe, America, Asia, …); otherwise players far away from the Relay-Server suffer heavy lag. All of this leads to high demands on the server hardware. To be clear: a proper Relay-Server architecture can be expensive in time and money.

Because of that, I ignored this worst case in this project and focused on the default Peer-to-Peer mechanism.

3.1.4 Peer2Peer Conclusion

The big advantage of this method: it is mostly serverless, so the operating costs of the multiplayer are very low. Because of that, P2P is a very viable multiplayer solution for small projects and indie games. The only thing that is needed is a cheap Rendezvous-Server (of course only as long as no Relay-Server is used). P2P also does not require port-forwarding, which can be a difficult and/or time-consuming task depending on the player's knowledge.

But there are disadvantages:

  • A home network bandwidth may not be enough to host larger games with much traffic; a server hosted at a server farm has much more bandwidth
  • The game stops if a P2P host leaves the game
  • No server authority
    • every player has a slightly different game state that needs to be synchronized often; a dedicated server has only one state and distributes it to the players; players only send inputs to the server
    • anti-cheat has to be performed by every peer and not just the server
    • random is handled better if only the server generates random values, otherwise seeds have to be used
    • game states may need to be interpolated between peers, which is not the case if only the server owns the game state

A dedicated server would solve these disadvantages but in return the hardware requirements are much higher making this approach more expensive. Also multiple servers would be needed in all regions of the world to reduce ping/RTT.

3.2 Game Connection Process

After starting the game the player sees the multiplayer games lobby. As described previously the player can host or join a game from the list.

3.2.1 Hosting a game

The host needs to input a unique game name and the port he will open for the connection. When the host button is clicked the following procedure is triggered:

  1. Obtain public IP address
    • Originally this should be handled by the Rendezvous-Server, because it is hosted behind no NAT and can see the public IP of requests, but limitations of the chosen hosting service prevented this approach (more on that later)
    • Instead I used a web request to a free service like icanhazip.com, with bot.whatismyipaddress.com as a backup in case the first service is down; these websites respond with plain text containing the ipv6 or ipv4 address of the client/request
  2. The Rendezvous-Server is notified of the new multiplayer game entry and saves the game with public IP and port, both sent to the server by the host
    • The host sends a GET-Request to the (web) server containing all the information needed: /registermpgame?name=GameOne&hostIP=1.1.1.1&hostPort=4141
    • On success the game is registered and a token is returned to the host; the token is needed for further actions affecting the created multiplayer game
  3. The host now waits for incoming connections from other players/peers
    • The host sends another GET-Request to the Rendezvous-Server /listenforjoin?token=XYZ123
      • This is a long-polling request (a websocket alternative): the connection is held open by the server until a player joins the multiplayer game (a server-side sketch of this follows after this list)
      • If that is the case the GET-Request is resolved with the public IP and port of the joined player, so that hole-punching is possible
      • If no player joins before the timeout is reached (I’ve set the timeout to 15 seconds), the request is resolved with HTTP status code 204 No Content and an empty body
      • In that case the GET-Request has to be sent again and again until a player joins
  4. On player join both peers init a connection and punch through NAT
  5. If successful the game starts
  6. (Otherwise a Relay-Server is needed; explained previously)
  7. Finally the host closes the game entry (on game start or cancellation) with another GET-Request: /startorremovempgame?token=XYZ123
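
To make the server side of this flow more concrete, here is a heavily simplified sketch of the /registermpgame and /listenforjoin endpoints in TypeScript with ExpressJS (the stack actually used for the Rendezvous-Server). The data structures, token generation and polling interval are my own assumptions, not the project's real implementation.

```typescript
import express from "express";

const app = express();

// Assumed in-memory bookkeeping: token -> registered game.
interface MpGame {
  name: string;
  hostIP: string;
  hostPort: number;
  joiner?: { ip: string; port: number }; // set once a player joins
}
const games = new Map<string, MpGame>();

app.get("/registermpgame", (req, res) => {
  const { name, hostIP, hostPort } = req.query as Record<string, string>;
  const token = Math.random().toString(36).slice(2); // simplistic token for illustration
  games.set(token, { name, hostIP, hostPort: Number(hostPort) });
  res.json({ token });
});

// Long-polling: keep the request open until a player joins or ~15 seconds pass.
app.get("/listenforjoin", (req, res) => {
  const game = games.get(String(req.query.token));
  if (!game) {
    res.sendStatus(404);
    return;
  }
  const start = Date.now();
  const poll = setInterval(() => {
    if (game.joiner) {
      clearInterval(poll);
      res.json(game.joiner); // public IP and port of the joined player
    } else if (Date.now() - start > 15000) {
      clearInterval(poll);
      res.sendStatus(204);   // no content: the host simply asks again
    }
  }, 250);
});

// (app.listen is shown in the Heroku section further below.)
```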

3.2.2 Joining a game

The player first needs to input a valid port. After that he is presented with a list of multiplayer games, retrieved from the Rendezvous-Server via a GET-Request to the endpoint /mpgameslist. This returns a JSON list of game data objects containing the following information:

  • name: multiplayer game name
  • hostIP: public IP of the host
  • hostPort: port the host will open for the connection

If the player clicks “Join” on a specific game list item the following process handles the connection with the host:

  1. Obtain public IP address
    • Originally this should be handled by the Rendezvous-Server, because it is hosted behind no NAT and can see the public IP of requests, but limitations of the chosen hosting service prevented this approach (more on that later)
    • Instead I used a web request to a free service like icanhazip.com, with bot.whatismyipaddress.com as a backup in case the first service is down; these websites respond with plain text containing the ipv6 or ipv4 address of the client/request
  2. Inform the Rendezvous-Server of the join
    • Send a GET-Request with all the information needed: /joinmpgame?name=GameOne&ownIP=2.2.2.2&hostPort=2222 (a server-side sketch of this endpoint follows after this list)
    • If the host is currently listening (via the long-polling request), it is now informed of the join by the server
    • The server resolves the player’s request with the public IP and port of the host
    • Now the player and the host try to establish a P2P connection with hole-punching
    • If successful the game starts
    • (Otherwise a Relay-Server is needed; explained previously)
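
Continuing the Express sketch from the hosting section, the join side of the Rendezvous-Server could look roughly like this (again an illustration with assumed data structures; the query parameter names follow the example URLs above):

```typescript
// Return the list of registered games (name, host IP and port only).
app.get("/mpgameslist", (_req, res) => {
  const list = [...games.values()].map(({ name, hostIP, hostPort }) => ({ name, hostIP, hostPort }));
  res.json(list);
});

// A player joins: store their public endpoint so the host's pending
// /listenforjoin request resolves, and answer with the host's endpoint.
app.get("/joinmpgame", (req, res) => {
  // "hostPort" here carries the joiner's own port, mirroring the example URL above.
  const { name, ownIP, hostPort } = req.query as Record<string, string>;
  const entry = [...games.values()].find(g => g.name === name);
  if (!entry) {
    res.sendStatus(404);
    return;
  }
  entry.joiner = { ip: ownIP, port: Number(hostPort) };
  res.json({ hostIP: entry.hostIP, hostPort: entry.hostPort });
});
```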

3.3 Game Synchronization

Real-time synchronization of game states is a big challenge. Unlike in turn-based games, the game does not wait until all information is received from the other players. The game always goes on, ideally with a minimal amount of lag.

Of course the whole game state could be serialized and sent to all players, but this would have to happen very frequently and the packets would be very large, resulting in a very high bandwidth demand.

Another approach is to only send user inputs/orders, which yields far less network traffic. I used this lightweight idea: when a player issues an order, the order is immediately transmitted to the other player, where it is executed as well.

The following game events are synchronized:

  • GameStart: After the game scene is loaded, the game is paused and the peer sends this message to the other player periodically until it receives the same message from the other peer; then the game is started
  • RandomSeed: Per game a “random seed master” (the host) periodically generates a random seed and distributes that seed to the other player; this seed is then used for all random calculations
  • All 3 ship orders (an example message shape follows after this list):
    • ShipCourse
    • ShipSpeed
    • ShipTarget
  • GameSync: All of the previous messages still led to diverging game states, so a complete game serialization and synchronization is scheduled to happen every 30 seconds
    • Projectile positions, rotations, velocities are synched
    • The whole ship state is synched
    • Both game states (the received one and the own one) are interpolated, because I don’t use an authoritative server model and so both game states are “valid”
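
To make the "send orders, not state" idea more concrete, here is a rough sketch of what such an order message and its handling could look like. The field names are my own; the game itself serializes its messages in C# inside Unity.

```typescript
// Illustrative order message; both peers apply the same order to their local state.
interface ShipCourseOrder {
  type: "ShipCourse";
  shipId: string; // network-wide unique ID (see the section on IDs below)
  course: number; // desired course in degrees, e.g. 180
}

// Issuing an order locally and receiving it remotely run through the same handler,
// so both game states change in the same way.
function applyOrder(
  order: ShipCourseOrder,
  resolveShip: (id: string) => { setCourse(course: number): void }
): void {
  resolveShip(order.shipId).setCourse(order.course);
}
```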

The following game events should have a positive impact on game sync, but are not implemented yet:

  • ProjectileFire: Syncs projectiles being fired
  • Waves: Because the waves have a small impact on where projectiles are fired from and where they hit the ship, the waves should be in sync as well

3.3.1 IDs

In game development you mostly work with references. So for example a ship has a reference to another ship as the firing target. In code this has the benefit of easy access to the target ship’s properties, fields and methods.

The problem is that these references do not work over the network. Every machine has different references, even though they may represent the same ship. So if we want to transfer the order “Ship1 course 180”, we cannot use the local reference value of Ship1.

Ship1 needs a unique ID that is exactly the same on all machines. Now we can send “ShipWithID1234 course 180” and every machine knows which ship to address.

In code this is a bit more tedious, because the received ID has to be resolved to the appropriate ship reference.

The most difficult part is finding unique IDs for all gameobjects.

Ships can easily obtain an ID from the host on game start. Projectiles are a bit more tricky, because they are spawned later on. I solved this by counting the shots fired by a gun turret and combining the gun turret’s ID with the shot number, which yields a guaranteed unique ID, provided the gun turret ID is unique. Gun turret IDs are composed in the same way: ship ID + gun turret location (sternA, sternB, bowA, bowB, …).
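
As a small illustration of this hierarchical scheme (the game builds these IDs in C#; the naming below is mine):

```typescript
// Turret IDs combine the ship ID with the turret location,
// projectile IDs combine the turret ID with a running shot counter.
type TurretLocation = "sternA" | "sternB" | "bowA" | "bowB";

function turretId(shipId: string, location: TurretLocation): string {
  return `${shipId}:${location}`;
}

function projectileId(turret: string, shotNumber: number): string {
  return `${turret}:${shotNumber}`; // unique as long as the turret ID is unique
}

// Every machine resolves incoming IDs back to its own local object references,
// e.g. via a map from ID to game object.
const shipsById = new Map<string, object>();
```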

Of course with an authoritative server this gets easier as only the server generates IDs and distributes them to all clients.

3.3.2 Lockstep

Additionally there is an interesting and promising approach to discretize the continuous game time, called Lockstep. It is used in prominent real-time strategy games like Age of Empires (https://www.gamasutra.com/view/feature/131503/1500_archers_on_a_288_network_.php). The basic idea is to split up time into small chunks, for example 200 ms intervals. In each time frame every player can perform exactly one action, which gets transferred to all other players. Of course this action can also be “no action”. The action is then executed in the next interval, almost simultaneously for all players. This way the real-time game is transformed into a turn-based game. It is important to adjust the interval based on the connection speeds between the players, so that no player lags behind. The small order input delay usually goes unnoticed by the players if the interval is small enough.
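
A minimal sketch of the core lockstep loop, under the assumption of fixed-length turns and commands being scheduled two turns ahead (the game does not implement this; the structure below is purely illustrative):

```typescript
// Commands issued during a turn are scheduled DELAY_TURNS ahead so every peer
// has time to receive them; a turn is only simulated once the commands of all
// players for that turn have arrived.
const DELAY_TURNS = 2;

type Command = string;                            // placeholder for real order data
const scheduled = new Map<number, Command[][]>(); // turn -> commands per player index
let currentTurn = 0;
let localBuffer: Command[] = [];                  // commands collected this turn

function storeCommands(turn: number, player: number, cmds: Command[]): void {
  const slot = scheduled.get(turn) ?? [];
  slot[player] = cmds;
  scheduled.set(turn, slot);
}

// Called by a timer every interval (e.g. every 200 ms).
function endOfTurn(
  sendToPeers: (turn: number, cmds: Command[]) => void,
  simulate: (cmds: Command[][]) => void,
  playerCount: number,
  localPlayer: number
): void {
  // Simulate the current turn only when every player's commands are known.
  if (currentTurn >= DELAY_TURNS) {
    const cmds = scheduled.get(currentTurn);
    if (!cmds || cmds.filter(Boolean).length < playerCount) {
      return; // a peer is late: stall instead of letting the game states diverge
    }
    simulate(cmds);
    scheduled.delete(currentTurn);
  }

  // Ship the locally collected commands for a future turn (possibly an empty list).
  sendToPeers(currentTurn + DELAY_TURNS, localBuffer);
  storeCommands(currentTurn + DELAY_TURNS, localPlayer, localBuffer);
  localBuffer = [];
  currentTurn++;
}
```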

An important requirement is that the game is deterministic and that orders issued by players have the same outcome on all machines. Sure, there are ways to handle random game actions, but because Admiral: WW2 uses randomness for many important calculations and my development time frame was limited, I unfortunately did not implement this technique.

4 Rendezvous-Server Hosting

There are almost unlimited hosting options on the internet. Usually the selection shrinks once a specific programming language is picked. But because I used NodeJS with Typescript, which transpiles to plain Javascript, there were still plenty of hosting options. Had I decided to write the server in C# and therefore run a .NET Core application – like the game itself, which is written in C# (Unity uses C#, or Javascript for some exotic programmers) – many hosting providers would have dropped out.

4.1 Alternatives

Of course there is the option of renting your own dedicated server: very expensive for a simple Rendezvous-Server and maintenance-heavy, but powerful and flexible (.NET would be no problem).

There’s the option of a managed server: little maintenance but very, very expensive.

Then there are VPS (Virtual Private Servers): servers whose hardware is shared among many customers, which makes them cheaper.

Then there are the big players like AWS, Google Cloud Platform, IBM Cloud and Microsoft Azure: they can get very expensive, but in return they offer vast opportunities and flexibility; it is easy to scale and monitor your whole infrastructure and a load-balancer can increase availability and efficiency of your server(s); on the other hand the learning-curve is steeper and setting up a project needs more time.

4.2 Heroku

Heroku is a cloud-based Platform-as-a-Service (PaaS) offering hosting for many common programming languages like Javascript/NodeJS (which I used), Python and Ruby. It does not offer as many possibilities as AWS and co., but it is way simpler to learn and set up.

It also has a completely free plan, which grants over 500 hours of uptime per month. This is not enough to run the whole month with 30 * 24 = 720 hours, but the application sleeps after 1 hour of inactivity and automatically wakes up again when needed. This is perfectly fine for a Rendezvous-Server, because it is not used all the time. The wake-up time is not that bad either (around 4-8 seconds).

Of course Heroku offers scaling so that the performance is massively increased and the app will never sleep, but this comes with a price tag.

In a paid plan Heroku also has a solid monitoring page with events, up- and downtimes, traffic and so on.

Server logs are easily accessible as well.

For setup you just need to create a “Procfile” in your project folder that defines what to execute after the build is completed: web: npm run start will run the npm script called start as a web service. The application is then publicly reachable on your-app-name.herokuapp.com. The NodeJS web server can then listen on the port that is provided by Heroku in the environment variable process.env.PORT.
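
Putting the pieces together, a minimal entry point that Heroku can run via web: npm run start might look like this (the health-check route and local fallback port are assumptions):

```typescript
import express from "express";

const app = express();
app.get("/", (_req, res) => res.send("Rendezvous-Server up")); // simple health check

// Heroku injects the port to listen on through the PORT environment variable.
const port = Number(process.env.PORT) || 3000; // 3000 only as a local fallback
app.listen(port, () => console.log(`Listening on port ${port}`));
```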

Deployment is automated: just push to your GitHub master branch (or the branch you specified in Heroku); after that a GitHub webhook triggers the build of your app on Heroku.

But during development I discovered a big disadvantage: Heroku does not support ipv6.

This is a problem, because I wanted to use the Rendezvous-Server as a STUN-Server as well, which can determine and save the public IPs of client requests. But if a client like me only has Dual-Stack Lite (a unique ipv6 address, while the ipv4 address is shared among multiple customers), Peer2Peer is not possible with the shared ipv4.

As a workaround the clients obtain their public ipv4 or ipv6 address via a GET-Request to icanhazip.com or, as a backup, bot.whatismyipaddress.com. These websites return a plain text body containing the public IP. After that the peers send their public IP to the Rendezvous-Server as explained previously.
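
In the game this lookup happens in C#; the same fallback logic, sketched in TypeScript, looks roughly like this:

```typescript
// Try icanhazip.com first, fall back to bot.whatismyipaddress.com.
// Both services answer with the caller's public IP as plain text.
async function getPublicIp(): Promise<string> {
  const services = ["https://icanhazip.com", "https://bot.whatismyipaddress.com"];
  for (const url of services) {
    try {
      const response = await fetch(url);
      if (response.ok) {
        return (await response.text()).trim(); // ipv4 or ipv6, depending on the connection
      }
    } catch {
      // service unreachable: try the next one
    }
  }
  throw new Error("Could not determine the public IP");
}
```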

5 Architecture Overview

Typescript is usually a very good choice for larger projects, simply because of the type safety and error checking at development time. This prevents hunting for bugs caused by simple typos, as often happens in plain Javascript.

To realize the web server I used the very popular ExpressJS, which does not need any introduction and should be well-known by this time.

6 Conclusion

Real-time multiplayer games are tricky. The game states quickly diverge and much effort has to be spent to counteract this. Game time differences and lag drastically compound the problem. But methods such as Lockstep can help to synchronize the time across multiple players.

While developing, try to keep the game as deterministic as possible, so that player actions yield the same result on every machine. Random is usually problematic, but can be handled via a dedicated game server or seeds.

Peer-to-Peer is a simple and great solution for smaller indie multiplayer games, but comes with some disadvantages. For larger projects dedicated/authoritative servers are favourable.

Heroku offers a fast and simple setup for hosting cloud applications and the free plan is great for smaller projects. If demand increases scaling is no problem and the deployment is automated. But be aware of the missing ipv6 support of Heroku.

All in all: Gaming is fun. Strategy games are fun. Multiplayer is fun – for the player and an exciting challenge for developers.

Finding the right strategy for your way to the cloud

– Abstract –

This article is about finding the right strategy if you want to empower your business with cloud computing. It’s written for those of you who are still new to this topic and want a first look through the dense cloud mystery. After reading you should have gained an orientation on how to start your first cloud project and a basic knowledge of the industry’s most important buzzwords.

“The essence of strategy is choosing what not to do”
– Prof. Dr. Michael E. Porter – Harvard Business School
Co-founder of the field of strategic management


It’s stormy outside

It’s been over a decade since Amazon stirred up the market with its web services, and cloud computing is now more mainstream than ever. What started for many people with file-sharing and backup services like Dropbox or company-internal NAS systems quickly transformed into a trending global topic. Why should you keep upgrading local hardware and storage to stay competitive if you can have the same or even greater performance with shared resources, reduced costs and accessibility from everywhere?

Some understood that earlier than others and can already harvest the fruits of their work – and some, hopefully not you, should seriously think about protecting their house, because the storm is not over yet! With the introduction of cloud gaming services like GeForce Now and the start of projects from the major players in the market, it is becoming clear that cloud computing is not just hype – quite the reverse: anything seems to be possible, and any upcoming business transformation should consider the opportunities of cloud computing for its needs. And who knows, maybe one day you can even sell your computing power as part of a worldwide cloud, as introduced by crypto-currency miners or solar tech companies?

Faster, Greater, Higher! The sky’s the limit!

Now that enough time has passed for the first CTOs and CIOs to look back on their early experiences with moving to cloud services, we can wrap up some key facts in case you’re still looking for a few good arguments to convince you and your team to move to the cloud.

Scalability with minor effort

  • Overwhelming traffic is not a show stopper anymore
  • Auto scaling infrastructure is auto money for your pocket!
  • Reduce the multi region infrastructure challenge

Security

  • Cloud storage carries fewer risks than lost devices!
  • Stop worrying about underlying infrastructure security & maintenance
  • Disaster may strike! The well-proven disaster recovery systems of the big cloud providers are probably better than yours!

Speed & Performance

  • Increase your server-side performance on-demand as you go
  • Handle request peaks without costly overprovisioning your hardware

Reduced Costs = Greater Profit

  • Pay-as-you-go is better than buy-and-deprecate!
  • Financial advantages through transformation from CAPEX to OPEX
  • Makes your financial model a lot easier to plan and calculate
  • High chance to reduce your Time to Market 
  • Emerging markets can perhaps make you even fly higher

Agility & User Experience

  • Keep your team from having to stray from their key competences
  • Increase your team performance with the latest collaboration tools
  • Satisfy your team by giving them a state-of-the-art feeling
  • Give more freedom in terms of physical presence (work from anywhere)

Ecological Footprint

  • Less eWaste for deprecated hardware
  • Savings on power supply and emissions

And if you’re doing it right, you earn a lot of FLEXIBILITY, which is nowadays more important than ever! Our technological environment changes so fast that you may be outdated before you finish your project – ok, probably not that fast – but you should be prepared! If your cloud provider can’t fulfil your SLA or becomes outdated itself, you may move on to a new provider. But be aware: cloud providers want to keep you as a customer, for sure! How that may end up badly for you is discussed in the next section.

Pitfalls you better leave for others

So if you’re now convinced by the advantages of the cloud, here are some pain points where others have stumbled. Keep them in mind and don’t make these mistakes again!

Readiness of Organization
One of the greatest mistakes you can make while dreaming of your cloud project is to underestimate the know-how that is necessary to do the job right. Many teams stumbled over the fact that their ideas and plans were great, but they forgot about the people who finally have to do the job. Your change management team also has to be aware that going to the cloud can have a negative impact on other teams, so other teams in your company may start fighting against that change. Imagine, for example, that your customers now have direct access to your order and payment system – will your sales team like that? Your change can also impact your hierarchical structure, and long-serving employees may lose their major value for the team. So it’s quite likely that not everybody in your organization will like your project. Be aware of that!

Readiness of Application
It doesn’t matter whether you’re developing a new application or transforming an existing one, you have to keep the drawbacks in mind that can come along with cloud services. Too many technical leaders made the mistake of thinking a simple lift-and-shift strategy would work out, and they had to learn the hard way that some refactoring is always necessary. This can start with underestimating the limits of your cloud provider, continue with disregarding additional latency that destroys your user experience, and end with dependencies that are restricted in the cloud (rare, but reported). So you’d better make a new concept first and think from the very beginning about which components can be reused in which way. Think about your build steps, your deployment, your test environment, your architecture, current infrastructure, previous concurrency issues and the clusters you want to use – convince yourself first and not later!

Data Privacy & Compliance
Especially with the release of the new data privacy regulations in 2018 (GDPR), many companies were thrown into disarray and reported big question marks on how to continue with their services. So you can consider yourself lucky to start your cloud project after this disruptive wave, but you shouldn’t think that the current data privacy agreements will last forever; this was rather just the beginning of regulations that had been missed since the beginning of our digital lifestyle. Data privacy for your customers and data protection imposed by your company compliance will harden your way to the cloud – that’s for sure!

Remember your Business Idea!
It doesn’t matter whether you start a new project or lift some existing functionality to the cloud: if it creates advantages, motivated employees will come up with more and more ideas on how to improve and scale the impact of your cloud movement. So if you impress your environment, you can be fairly sure that it won’t take long until the first ideas from other teams or team members land on your desk! Keep your primary business goals in mind and finish them first before you start working on other issues. If you’re a technical leader or the one who pushes the cloud movement forward, take the advice to reflect frequently and check whether the tasks that you’ve created actually have value for your business. Too many become starry-eyed about cloud transformations and elevate them to a holy grail – which they certainly are not! So think twice about whether the current steps are really necessary and profitable for you and your team!

Vendor Lock-In
It’s too sad to be true, but in reality it’s a fact. We have two strongly diverging mindsets in our industry: there are those who believe everything should be free and open source, and others who try to catch you in their cage and won’t ever let you go. So you have to be aware that it’s common among cloud providers to try to trap you with special offers, extra support, personal training sessions and individual implementation opportunities, just to keep you as a customer as long as possible. If you start to implement services based on proprietary interfaces, it will be harder for you to relocate your codebase one day. Or if you build up special knowledge customized to their services, you may wake up one day and realize that this knowledge is now worthless. In other words: if their ship goes down, you’ll go down too! So avoid vendor lock-in as much as you can!

Cloud Provider Evaluation
Especially those who are new to this topic may stumble across a thing called an SLA (Service Level Agreement), which defines the contract you sign with your cloud provider. If you don’t know exactly what it contains and where the borders of your cooperation are, you may run into trouble sooner or later. So it’s good advice to get somebody on your team who has already collected some experience in the cloud environment and can protect you from making the same mistakes that others did. It’s not about missing some contract details; it’s more about finding the right values and parameters that fit your application and, finally, your business case. As long as your project works fine you’ll probably never get in touch with your SLA that much. But once your customers or your boss are breathing down your neck because of an unexpected downtime, drastically rising costs due to a misconfiguration, a security gap, or even a disaster with a never-ending recovery time, you will understand why knowing your SLA details is important. Take it as good advice, especially at the beginning of your cooperation, that measurement and control are better than relying on trust. So the job is to monitor and measure as much as you can until you get a good feeling of what’s going on and how much responsibility you can take for it. Expect the unexpected!

Your guide through the jungle

So as far as you are familiar with the greatest advantages and biggest show stoppers now, we can start to create a cloud strategy that fits to your needs. Get your weapons locked and loaded and follow me through the jungle!

What do you want to create?

Remember your business goals, and remember how others have lost the trail – you don’t want the same to happen to you, right? So define your project, and especially your project scope, precisely. What do you want to create? What do you want to offer to your customer? Who is your customer and what do they really need? Think about your use cases and match them accordingly. To make it a little bit more confusing, maybe you’ve heard about XaaS or *aaS, one of the most blown-up cloud buzzwords ever! What once started with Software as a Service (SaaS), which basically means software delivered by cloud services, later transformed into a term that is used for almost any kind of service: Firewall as a Service, Payment as a Service, Analytics as a Service and even non-technological ones. But the idea behind the service-oriented model is great! So if you want to market your later product right, think in terms of services! Remember how you changed your business model from CAPEX to OPEX – your future customers want that too! They don’t want to buy a product; they have a goal and they are looking for the right service to reach their target as quickly and as well as they can. And that’s a fact for any kind of customer. So you are serving them your best to let them reach their best!
The other way around, you have to think in the same terms for your own goals. But this is a matter of your skills, your compliance and the amount of responsibility you want to share. Netflix, for example, thought they could fly high and even higher, so they built up their own infrastructure and data centers, until they had to realize that they could never do this job better than Amazon does. From that point on they focused on what they really want to do: delivering awesome series! So think about the services you want to deliver and what kind of services would be a help for you, short term and long term. Picture your project’s growth stages from time to time: where is your starting point, where do you finally want to get, and find a match with a cloud provider that has all the abilities you need for your scale.

If you’re willing to create services that are mainly for corporate users, think of a private cloud with outsourced hosting facilities, which is technically nothing other than a restricted public cloud for your private usage. If your compliance or your business case doesn’t force you to keep your data in-house, guard it with virtual private network (VPN) access and you are ready to go. If you later need to open some services for public usage, you can still transform it into a hybrid cloud by removing the corresponding restrictions or routing it through a public cloud.

Checklist

To not leave you unarmed, we’ve created a checklist that you may consider at the start of your cloud project. Criticism is always welcome: if you’ve noticed some missing points or made different experiences throughout the years, let us know and write it in the comments down below!

  1. What’s my business goal? Which requirements does it take?
  2. How can I accelerate my time to market? Is the market ready for my concept?
  3. How ready is my application or existing architecture?
  4. Is there a need for refactoring? Does it also consider potential cloud drawbacks like latency, downtime, noisy neighbours, migration, licenses and so on?
  5. How ready is my organization and team? Who can do the job?
  6. Will it change my business model? Does it disrupt my team?
  7. Which restrictions do exist? Compliance? Security? Data privacy policies?
  8. Where do I need help in terms of services? Long term? Short term?
  9. Where do I want to grow? How much responsibility can I share?
  10. Does the SLA match my requirements? How about Disaster Recovery?
  11. Do the project requirements match my budget? Can it be done at the given time?

And as a final word: Don’t over-engineer and don’t over-manage it all! Keep it simple and straightforward and learn from your mistakes. Never lay back; be proactive, expect the unexpected and create your plan B – then you’re prepared for whatever it takes 😀


Cloudbased Image Transformation

Introduction

As part of the lecture “Software Development for Cloud Computing”, we had to come up with an idea for a cloud-related project we’d like to work on. I had just heard about Artistic Style Transfer using Deep Neural Networks in our “Artificial Intelligence” lecture, which inspired me to choose image transformation as my project. However, having no idea about the cloud environment at that time, I didn’t know where to start and what was possible. A few lectures in, I had heard about Infrastructure as a Service (IaaS), Platform as a Service (PaaS) and Function as a Service (FaaS). Out of those three I liked the idea of FaaS the most: simply upload your code and it works. Hence, I went with Cloud Functions in IBM’s cloud environment. Before I present my project, I’d like to explain what Cloud Functions are and how they work.

What are Cloud Functions?

Choose one of the supported programming languages. Write your code. Upload it. And it works. Serverless computing – that’s the theory behind Cloud Functions. You don’t need to bother with infrastructure. You don’t need to bother with load balancers. You don’t need to bother with Kubernetes. And you definitely do not have to wake up at 3 am and race to work because your servers are on fire. All you do is write the code; your cloud provider manages the rest. The cloud provider of my choice was IBM.

Why IBM Cloud Functions?

Unlike Google and Amazon, IBM offers FREE student accounts, with no need to deposit any kind of payment option upon creation either. Since I had no experience using any cloud environment, I didn’t want to risk accidentally accumulating a big bill. Our instructor was also very familiar with the IBM Cloud, so in case I needed support I could always ask him as well.

What do IBM Cloud Functions offer?

IBM offers a Command Line Interface (CLI), a nice user interface on their cloud website (accessible using the web browser of your choice) and very detailed documentation. You can check, and if you feel like it, write or edit your code using the UI as well. The only requirement for your function is that it has to take a JSON object as input and return a JSON object as well. You can test the function directly inside the UI: simply change the input, declare an example JSON object you want to run it with, then invoke your function. Whether the call failed or succeeded, the activation ID, the response time, the results and the logs (if enabled) are then displayed directly. You can also add default input parameters or change your function’s memory limit and timeout on the fly. Each instance of your function will then use the updated values.
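
The project’s function itself is written in Python; purely to illustrate the JSON-in/JSON-out contract, here is what a minimal action could look like, sketched in TypeScript for the Node.js runtime (the parameter and result fields are made up):

```typescript
// A Cloud Functions (Apache OpenWhisk) action receives a JSON object as its
// single parameter and must return a JSON object (or a Promise of one).
export function main(params: { name?: string }): { greeting: string } {
  const name = params.name ?? "world";
  return { greeting: `Hello, ${name}!` };
}
```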

Another nice feature of IBM Cloud Functions are triggers. You can connect your function with different services and, once they trigger your function, it will be executed – whether someone pushed new code to your GitHub repository or updated your Cloudant database (IBM’s database service). Once invoked by such a trigger, your function executes.

You can also create a chain of Cloud Functions. The output of function 1 will then be the input of function 2.

IBM Cloud Functions use the Apache OpenWhisk service, which packs your code into a Docker container in order to run it. However, if you have more than one source file, or dependencies you need, you can pack everything into a Docker image or, in some cases, like Python or Ruby, into a zip file. To do that in Python, you create a virtual environment using virtualenv, then zip the virtualenv folder together with your Python files. The resulting zip files and Docker images can only be uploaded using the CLI.

You can also enable your function as a web action, which allows it to handle HTTP events. Since the link automatically provided by enabling a function as a web action ends in .json, you might want to create an API definition instead. This can be done with just a few clicks, and you can even import an OpenAPI definition in YAML or JSON format. Binding an API to a function is as simple as defining a base path for your API, giving it a name and creating an operation. For example: API name: Test, base path: /hello; for the operation we define the path /world, select our action and set the response content type to application/json. Now, whenever we call <domain>/hello/world, we call our Cloud Function through our REST API. Using the built-in API explorer we can test it directly, and if someone volunteers to test the API for us, we can also share the API portal link with them. Adding a custom domain is also easily done by selecting the domain name, the certificate manager service and the certificate in the custom domain settings.

Finally, my Project

Architecture of the Image Transformation Service

The idea was:

A user interacts with my GitHub Page, selects a filter, adds an Image, tunes some parameters, then clicks confirm. The result: They receive the transformed image.

The GitHub Page has been written with HTML, CSS and JavaScript. It sends a POST request to the API I defined, which is bound to my Cloud Function, written in Python. The function receives information about the chosen filter, the set parameters and a link to the image (for the moment, only JPEG and PNG are allowed). It then processes the image and returns the created PNG base64 encoded. The base64-encoded data is then embedded in the HTML site, and the user can save the image.

The function currently has three options:

You can transform an image into a greyscale representation.

Left: Original Image by Johannes Plenio, Right: black and white version

You can upscale an image by a factor of two, three or four

Left: Original Image by Johannes Plenio, Right: Upscaled by a factor of 2

and you can transform an image into a Cartoon representation.

Left: Original Image by Helena Lopes, Right: Cartoon version

Cartoon images are characterized by clear edges and homogeneous colors. The cartoon filter first creates a grayscale image and median-blurs it, then detects the edges using an adaptive threshold, which currently still has a predefined window size and threshold. It then median-filters the colored image and performs a bitwise AND operation between every RGBA color channel of the median-filtered color image and the found edges.

Dis-/ Advantage using (IBM) Cloud Functions

Serverless infrastructure was fun to work with. No need to manually set up a server, secure it, etc. Everything is done for you; all you need is your code, which scales over 10,000+ parallel instances without issues. Function calls themselves don’t cost much either. IBM’s base rate is currently $0.000017 per second of execution, per GB of memory allocated. 10,000,000 executions per month with 512 MB action memory and an average execution time of 1,000 ms only cost $78.20 per month, including the 400,000 GB-s free tier. Another good feature was being able to upload zip packages and Docker images.

Although those can only be uploaded using the CLI, which, as a Windows user, is a bit of a hassle. But one day I’ll finally set up the second boot image on my desktop PC. One day. Then there will be no need for my VM anymore.

The current code size limit for IBM Cloud Functions is 48 MB. While this seems plenty, any modules you used to write your code that are not included by default in IBM’s runtime need to be packed with your source code. OpenCV was the module I used before switching over to Pillow and numpy, since OpenCV offers a bilateral filter, which would have been a better option than a median filter for the color image creation of the cartoon filter. Sadly it is 125 MB large – still 45 MB packed – which, given the real limit of 36 MB after factoring in the base64 encoding of the binary files, was sadly still too much. Neither would the 550 MB VGG16 model fit, which I initially wanted to use for an artistic style transfer neural network as a possible filter option. I didn’t like the in- and output being limited to JSON either. Initially, before using the GitHub Page, the idea was to have a second Cloud Function return the website; this was sadly not possible. The limited selection of predefined runtimes and modules is also more of a negative point. One could always pack their code with modules into a Docker image/zip, but being able to just upload a requirements.txt and have the cloud automatically download those modules would have been way more convenient. My current solution returns a base64-encoded image. Currently, if someone tries to upscale a large image and the result exceeds 5 MB, it returns an error, saying „The action produced a response that exceeded the allowed length: –size in bytes– > 5242880 bytes.“

What’s the Issue?

Due to the missing Cross-Origin Resource Sharing (CORS) headers, this does not currently work out of the box with GitHub Pages. CORS is a mechanism that allows web applications to request resources from a different origin than their own. A workaround my instructor suggested was creating a simple node.js server, which adds the missing CORS headers. At first this resulted in just GET requests being logged in the Cloud API summary, which it responded to with a 500 Internal Server Error. I read up on it, found out the headers need to be set by the server, and tried to troubleshoot this for… what felt like ages: adding headers to the ajax jQuery call, enabling cross-origin on it, trying to work around it by setting the dataType to jsonp, even uploading the Cloud Function and API again, and creating a test function and binding it to the API (which worked, by the way – both as POST and GET, no CORS errors whatsoever… until I replaced the code). I’m still pretty happy it works with this little workaround now, thank you again for the suggestion!
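
For reference, such a small forwarding server could look roughly like this sketch. The target URL and the /transform route are placeholders, not the project’s real endpoints:

```typescript
import express from "express";

// Placeholder for the real API gateway endpoint bound to the Cloud Function.
const TARGET_URL = "https://<api-gateway-domain>/hello/world";

const app = express();
app.use(express.json());

// Add the CORS headers the browser expects and answer preflight requests.
app.use((req, res, next) => {
  res.setHeader("Access-Control-Allow-Origin", "*");
  res.setHeader("Access-Control-Allow-Headers", "Content-Type");
  if (req.method === "OPTIONS") {
    res.sendStatus(204);
    return;
  }
  next();
});

// Forward the browser's request body to the Cloud Function API and relay the answer.
app.post("/transform", async (req, res) => {
  const upstream = await fetch(TARGET_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(req.body),
  });
  res.status(upstream.status).json(await upstream.json());
});

app.listen(8080);
```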

Other than that, I spent more time than I’m willing to admit trying to find out why I couldn’t upload my previous OpenCV code solution. Rewriting my function as a result was also a rather interesting experience.

Future Improvements?

I could give the user more options for the cartoon filter. The adaptive threshold has a threshold limit, which could easily be exposed to the user. An option to change the window size could also be added, maybe in steps?

I could always add new filters as well. I like the resulting image of edge detection using a Sobel operator. I thought about adding one of those.

Finding a way to host the website with a provider that adds the CORS headers, allowing interested people to try a live demo and play around with it, would be an option as well.

What I’d really like to see is the artistic style transfer uploaded. I might be able to create it using IBM Watson, then add it as a sequence to my service. I dropped this idea previously because I had no time left to spare trying to get it to work.

Another option would be allowing users to upload files, instead of just providing links. Similar to this, I can also include a storage bucket, linked to my function in which the transformed image is saved. It then returns the link. This would solve the max 5 MB response size issue as well.

Conclusion

Cloud Functions are really versatile, there’s a lot one can do with them. I enjoyed working with them and will definitely make use of them in future projects. The difference in execution time between my CPU and the CPUs in the Cloud Environment was already noticeable for the little code I had. Also being able to just call the function from wherever is pretty neat. I could create a cross-platform application, which saves, deletes and accesses data in an IBM Cloudant database using Cloud Functions.

Having no idea about Cloud Environments in general a semester ago, I can say I learned a lot and it definitely opened an interesting, yet very complex world I would like to learn more about in the future.

And at last, all Code used is provided in my GitHub repository. If you are interested, feel free to drop by and check it out. Instructions on how to set everything up are included.

End-to-end Monitoring of Modern Cloud Applications


Introduction

During the last semester and as part of my Master’s thesis, I worked at an automotive company on the development of a vehicle connectivity platform. Within my team I was assigned the task of monitoring, which turned out to be a lot more interesting but at the same time way more complex than I expected. In this blog post, I would like to present an introduction to system health monitoring and describe the challenges that one faces when monitoring a cloud-based and highly distributed IoT platform. Following that I would like to share the monitoring concept that was the result of my investigation.

Monitoring

When developing an application, monitoring usually is not the first topic to come up among developers or managers. During my time at university, whenever we had a fixed date to present results as part of a group project, my colleagues and I usually shifted our focus towards feature development rather than testing and monitoring. The need for monitoring, however, becomes extremely important as soon as you have to operate a software system. Especially when maintaining distributed systems, which cannot be debugged like traditional, monolithic programs, the need for enhanced monitoring grows. The biggest difficulty with monitoring is that it has to be designed specifically for the application that needs to be monitored. There is no one-size-fits-all solution for monitoring. Within the scope of monitoring, there are a lot of different use cases to focus on. When talking about operating a software platform, our focus lies on system health monitoring.

System health monitoring

Monitoring has a different meaning depending on the domain it is being employed in. Efforts when monitoring software systems can put special interest into topics concerning security, application performance, compliance, feature usage and others. In this post I would like to focus on system health monitoring, following the goal of watching software services’ availability and being able to detect unhealthy states that can lead to service outages.

Even when reduced to health monitoring, one implementation of monitoring can differ strongly from another, depending on how different the systems, the processes, the tools and the users and stakeholders of that monitoring information are defined. The basis for monitoring is constantly changing due to the introduction of new technologies and practices. Therefore, a state of full maturity for monitoring tools can hardly ever be reached.

Monitoring Tools

Monitoring tools were created and adapted to fit the domain of the systems that had to be monitored. Monitoring requirements are substantially different when monitoring clusters of homogeneous hardware running the same operating system than when managing the IT infrastructure of a mid- to large-size company with heterogeneous hardware (e.g. databases, routers, application servers, etc.).

If you are building a new software application or have been assigned the task of maintaining one, here are a few example tools that focus on different aspects of monitoring that could be interesting to try out.

  • Collectd – Push-based host metric collection: Collectd is able to collect data from dynamically started and stopped instances. It works by periodically sending host metrics from preconfigured hosts to a central repository. Being a push-based system, it also works for short-lived processes. [2]
  • StatsD – Application-level metric collection: StatsD was designed as a simple tool that collects application metrics and sends them via UDP to a central collector, where they can be forwarded to a collection and visualization tool like Graphite and used to render graphs that are displayed in dashboards (a sketch of the wire format follows after this list). [3]
  • The Elastic-Stack – Log management and analysis: When applications are hosted inside containers running on separate hosts, it is not only difficult to extract a container’s low-level metrics, but also to extract and collect application-level metrics and logs. The standard open-source solution for this problem is the Elastic-Stack.
  • Riemann – Monitoring based on Complex Event Processing: Riemann differentiates itself from other monitoring tools in that it introduces event stream processing techniques to monitoring. Riemann works by processing events that are pushed by many distinct sources and, thanks to its push-based architecture, provides high scalability.
  • New Relic – Monitoring solution as SaaS: New Relic is one of the more recent SaaS solutions that have emerged in the past years. The point of MaaS (Monitoring-as-a-Service) is that the infrastructure and services needed to implement monitoring tools are abstracted away from operations teams. MaaS systems like New Relic can be configured to watch the infrastructure running your services and receive monitoring data from the monitored objects, either via direct data pushes or by installing agents that run inside the hosts. Apart from collecting and processing logs, New Relic provides a browser-based interface to access data and create customized dashboards, as well as features like auto-detecting anomalies with the use of machine learning.
  • Zipkin – Distributed tracing: Zipkin, among other distributed tracing solutions, focuses on tracking requests on a distributed system in order to inspect the system for errors or performance bottlenecks and to enable troubleshooting in architectures where multiple components are being employed.
    Zipkin supports the OpenTracing API. OpenTracing is a project aimed to introduce a vendor-neutral tracing specification. Its objective is to standardize the tracing semantics to allow for better distributed tracing capabilities across frameworks and programming languages. Trace collectors can implement the OpenTracing API and can thereby be employed in large scale distributed systems, in which system subcomponents are written by different teams in different languages.
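
As an impression of how lightweight the StatsD approach is, here is a sketch of its plain-text wire format, sent as UDP datagrams to the StatsD daemon (host, port and metric names below are assumptions):

```typescript
import dgram from "dgram";

const statsd = dgram.createSocket("udp4");

// StatsD metrics are datagrams of the form "<metric>:<value>|<type>",
// e.g. "c" for counters and "ms" for timers; the daemon listens on UDP 8125 by default.
function increment(metric: string, host = "localhost", port = 8125): void {
  statsd.send(Buffer.from(`${metric}:1|c`), port, host);
}

function timing(metric: string, milliseconds: number, host = "localhost", port = 8125): void {
  statsd.send(Buffer.from(`${metric}:${milliseconds}|ms`), port, host);
}

// e.g. inside a request handler:
increment("checkout.started");
timing("checkout.duration", 42);
```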

If your project is focused more on infrastructure components, it would be interesting to take a look at Collectd. For a smaller web application, I would recommend building in StatsD and focusing on tracking relevant business KPIs from inside the relevant functions within the code. For any software project I would recommend employing a log management system like the Elastic-Stack or an alternative. Since it’s an open-source solution, there are no license costs attached, and it can notably reduce troubleshooting time when debugging your application, especially if your application relies on a microservice architecture. For start-ups with smaller cloud projects going to production, I would recommend trying out a Monitoring-as-a-Service solution to analyze your application usage. Especially for smaller businesses going to the cloud, finding out how your customers are using your application is crucial. Discovering which functions and to what extent resources are being utilized can enable operations teams to optimize the cloud resource allocation while providing the business with important feedback about feature reception.
If you are building a complex distributed architecture involving many different components, it might make sense to evaluate more versatile tools like Riemann and take a look at OpenTracing.

Monitoring Strategy

Unarguably, employing the right tools or solutions to monitor a distributed system is crucial for operating it successfully. However, effective monitoring does not only focus on tooling, but rather incorporates concepts and practices derived from the experience of system operations with regard to monitoring and, to some extent, incident management. If your task is to monitor a software system, the best way to start is to define a monitoring strategy. When doing so, you should consider the following concepts:

  • Symptoms and causes distinction – The distinction of symptoms and causes is especially important when generating notifications for issues and defining thresholds for alerting. In the context of monitoring, symptoms are misbehavior from a user perspective. Symptoms can be slow rendering web pages or corrupt user data. Causes in contrast are the roots of symptoms. Causes for slow rendering web pages can be unhealthy network devices or high CPU load on the application servers.
  • Black-box and white-box monitoring – Black-box monitoring is about testing a system functionality and checking if the results are valid. If results are not what was expected, an alert is triggered. White-box monitoring is about getting information about a component’s state and alerting on unhealthy values. While white-box monitoring can help the operations personnel find causes, black-box monitoring is used to detect symptoms.
  • Proactive and reactive monitoring – Related to black-box and white-box monitoring are the concepts of proactive and reactive monitoring. Whereas reactive monitoring is concerned with collecting, analyzing and alerting on data that reflects an outage of the system, proactive monitoring emphasizes detecting critical situations within the monitored object that can lead to service performance degradation or complete unavailability. While the results from black-box monitoring are useful to detect problems affecting users, white-box monitoring techniques provide more visibility into a system's internal health state. With careful analysis of trends in internal metrics, imminent failures can be predicted, which can either trigger a system recovery mechanism, an auto-scaling of system resources or an alert to a system operator.
  • Alerting – Alerting is a crucial part of monitoring. Alerting, however, does not necessarily involve paging a human being. The latter should mostly remain the exception and only be employed when services are already unavailable or failure is imminent without human action. For events that require attention but do not involve failing systems, alerts can simply be stored as incidents in a ticketing system, to be reviewed by the operations team when no other situation requires their attention. Pages should be triggered as sparsely as possible, since constantly responding to pages is frustrating and time-consuming for operations employees. When defining thresholds to be notified on, newer approaches propose aggregating metrics and alerting on trends instead of watching simple low-level metrics, e.g. watching the rate at which the available space on a database is filling up instead of its current free space at a certain point in time; a small sketch following this list illustrates the idea. This concept relates to focusing on symptoms rather than causes when alerting.
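
To make the trend-based approach concrete, here is a minimal, hedged sketch in Python. The metric readings and the four-hour threshold are illustrative; in practice the samples would come from your metrics store and the alert would go to your paging or ticketing channel instead of stdout.

```python
import time

def estimate_hours_until_full(samples):
    """samples: list of (timestamp_in_seconds, free_bytes) readings, oldest first."""
    (t0, free0), (t1, free1) = samples[0], samples[-1]
    fill_rate = (free0 - free1) / (t1 - t0)   # bytes consumed per second
    if fill_rate <= 0:
        return float("inf")                   # disk is not filling up
    return free1 / fill_rate / 3600           # hours until free space reaches zero

# Example readings: 60 GB consumed within the last hour, 40 GB still free.
samples = [(time.time() - 3600, 100e9), (time.time(), 40e9)]

if estimate_hours_until_full(samples) < 4:
    print("ALERT: database disk is predicted to be full within 4 hours")
```

Note that the alert fires on the trend (time until full), not on the raw free-space value, which fits the symptom-oriented alerting described above.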

When implementing the monitoring strategy for your project, it's best to start with black-box monitoring, covering the areas that affect your users the most (such as being able to log in to your system). When defining alerts, the focus should be on symptoms rather than causes, since causes might or might not lead to symptoms. Also, if you are defining alerts, always pay attention to the right choice of recipients. Different groups of project employees might be interested in different types of alerts or a different granularity. Alert the business about decreasing page views, not about cluster certificates that will expire soon.
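
Such a black-box check can be as small as a synthetic probe that exercises the login flow from the outside. A hedged sketch, assuming the Python requests package; the URL and the dedicated probe account are placeholders.

```python
import requests

LOGIN_URL = "https://example.com/api/login"                  # hypothetical endpoint
PROBE_CREDENTIALS = {"username": "synthetic-probe", "password": "<probe-password>"}

def login_probe():
    """Return True if the user-facing login flow currently works."""
    try:
        response = requests.post(LOGIN_URL, json=PROBE_CREDENTIALS, timeout=5)
        return response.status_code == 200
    except requests.RequestException:
        return False

if not login_probe():
    # In a real setup this would page or open a ticket rather than print.
    print("ALERT: login is failing from the user's perspective")
```

Run periodically from a scheduler outside the monitored system, such a probe detects the symptom regardless of which internal cause is behind it.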

Real-life scenario: monitoring a cloud-based vehicle connectivity platform

The connectivity platform I worked on provides infrastructure and core services to develop connectivity services for commercial vehicles. These services enable users, for example, to open a vehicle's doors remotely via smartphone or to display the position and current state of a vehicle on a map. The platform provided a scalable backend which handled the load of incoming vehicle signals and exposed the data as JSON via APIs. It also included frontend components which fetched data from these APIs and displayed it in a custom smartphone app or in the browser as a web application. The platform was built on top of Microsoft Azure and was based on the IoT reference architecture proposed by Microsoft in the Azure documentation. It was built as a highly distributed microservice architecture that made extensive use of Azure's products, both SaaS and PaaS. Its core was an Azure Service Fabric cluster running all microservices that implemented the business logic of the connectivity services. The communication between microservices was established either synchronously via HTTP or asynchronously using Azure's enterprise messaging component, Azure Service Bus. In order to implement most use cases, the platform's components also had to communicate with various company-internal systems as well as third-party software services that implemented specific functionality, such as user management.

Monitoring challenges

In order to reach a basic level of monitoring in the platform, at least every component that was not fully managed by Microsoft or a third-party service provider (e.g. a SaaS component) had to be monitored by the operations team. As stated above, special care was put into choosing components that demanded the lowest administration effort from that team. This led to an increased use of PaaS components instead of virtual machines or other infrastructure alternatives.

Even though the infrastructure of PaaS components does not need to be managed by the cloud consumer, there are aspects of using PaaS components that can generate errors at runtime and must therefore be constantly monitored by the operations team. Some examples are:

A queue in the Service Bus filling up rapidly because a microservice is not keeping up with the number of incoming messages.

An Azure Data Factory pipeline that stopped running because a faulty configuration change removed or altered a database's credentials, so the pipeline can no longer fetch new data.

The former scenario can be easily detected, since Azure provides the resource owner with many metrics regarding the high-level usage of each component. In this case, a counter of unread messages is available in the Azure Portal or via Azure Metrics. Since the counter is available in the portal, widgets displaying its data can be pinned to a dashboard and alerts can be created for a given threshold on this metric.
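
For completeness, the same counter can also be read programmatically. Here is a hedged sketch assuming the azure-servicebus Python SDK (v7+); the connection string, queue name and threshold are placeholders.

```python
from azure.servicebus.management import ServiceBusAdministrationClient

CONNECTION_STRING = "<service-bus-connection-string>"   # placeholder
QUEUE_NAME = "vehicle-signals"                          # illustrative queue name
THRESHOLD = 10_000                                      # example alert threshold

admin = ServiceBusAdministrationClient.from_connection_string(CONNECTION_STRING)
runtime = admin.get_queue_runtime_properties(QUEUE_NAME)

if runtime.active_message_count > THRESHOLD:
    # In practice this condition would be an Azure Monitor alert rather than a script.
    print(f"ALERT: {QUEUE_NAME} holds {runtime.active_message_count} unread messages")
```

In most cases the built-in Azure Monitor alert on the queue-length metric is preferable; a script like this is mainly useful for custom dashboards or ad-hoc checks.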

In the data pipeline scenario, an outage can be detected by setting alerts on the activity log, which is the only place where errors occurring within Data Factories are reported. An aspect adding risk to this scenario is that a stopped pipeline is especially difficult to detect when testing the system from the user's perspective, since it only manifests as outdated or incorrect data instead of a directly thrown error.

For this reason, the operations team of the platform should always be notified in the case of a stopped pipeline, because user complaints might be misleading. Also, the advantage of using PaaS components and having the underlying infrastructure abstracted away comes with the disadvantage that all available logs and metrics, from which monitoring events can be derived, are defined by the component's provider and cannot easily be expanded, customized or complemented with the team's own information.

The monitoring strategy

When defining the monitoring strategy for the platform from a technological point of view, the first decision was to make extensive use of the SaaS monitoring solutions already provided by Microsoft to monitor applications and infrastructure hosted inside Azure, before introducing external proprietary solutions or deploying open-source monitoring tools to watch the platform. The tools employed were Azure Monitor, Azure Log Analytics and Azure Application Insights. Implementing white-box monitoring was simple thanks to the seamless integration of the tools Microsoft provides to monitor applications on Azure. Implementing black-box monitoring to be able to tell that the most important functions of the platform were available was way trickier. The next section explains the concept developed as part of my thesis.

End-to-end black-box monitoring

Why monitoring single components is not enough

The problem statement above only describes the challenges of monitoring each component on its own. The services offered by the platform, however, are provided only as a result of the correct functioning of a specific set of these subcomponents together. In many cases, the platform's subcomponents receive messages, process them and then forward them via synchronous or asynchronous calls to the next subcomponents, forming complex event chains. These event chains can also include systems operated within the same organization or by third parties, which handle specific technical areas of the connected-vehicle logic. From an operations perspective, the most valuable information is whether a specific, user-facing service is running at the moment or facing technical disruptions. It is not only important to know that a service is running in the sense of a passing integration test, but also to have further insight into the system (including its complex event chains) and into which part of it is failing to process the requested operation.

This observability can be partially provided by monitoring the platform’s subcomponents on their own, but not entirely. Here is one example:

In case a Service Bus queue fills up, the cause of messages missing in the frontend can be detected by alerting on a threshold on the involved queue's length.

However, here is an example of a case in which finding out the root cause of a service disruption might become less obvious:

A configuration change from a deployment altered the expected payload of a microservice, and incoming asynchronous messages are being ignored by that microservice for being badly formatted.

The error would only be visible by examining the logs generated by the service. If the message drop is not actively logged by that specific microservice, there is no way for the operations team to find out in which part of the system the messages are being lost. This observability problem is amplified by the fact that PaaS components do not generate logs for every message being processed. That means the last microservice that logged the message's ID is the last known point of correct functioning in the event chain.

This case might not seem probable, but most of the disruptions happening in a production environment come from unintentional configuration changes introduced by an update. Also, as the architecture becomes more complex and further components are added to support the main services of the chain, troubleshooting the platform grows in complexity. An end-to-end black-box monitoring system which monitors the core services of the platform would substantially reduce the effort of searching for misbehaving components, since it would alert the operations team as soon as one of the core services is not available. The alert would also include information about the exact subcomponent responsible for the outage.

What we need is Event Chain Monitoring

The core of the proposed solution is based on event correlation. In this case, a custom definition of events and their aggregation was used. The core ideas are inspired by the Complex Event Processing paradigm, but covering it in full would go beyond the scope of this blog post. The use of events creates an abstraction level between incoming signals and information relevant for monitoring. Sources of events can be HTTP requests triggering a component, logs being added to the platform's log repository, or any other kind of process that can be converted into an event. To map data to an event, the data has to contain information from which at least the following event metadata can be derived:

  1. Event source
  2. Event timestamp
  3. Tracking ID
  4. Event type

The event source is used to map the event to the component which originally generated it. The timestamp contains the time in UTC at which the event was generated or processed. The tracking ID is used to correlate events that belong to the same action; using a tracking ID, all events originating from the same operation can be grouped together. The event type states the kind of event that was triggered, such as "data from vehicle received at gateway" or "message processed by microservice A".
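
Expressed as a data structure, such an event could look like the following hedged Python sketch; the field values are purely illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class MonitoringEvent:
    source: str          # component that originally generated the event
    timestamp: datetime  # UTC time at which the event was generated or processed
    tracking_id: str     # correlates all events belonging to the same operation
    event_type: str      # e.g. "data from vehicle received at gateway"

event = MonitoringEvent(
    source="vehicle-gateway",
    timestamp=datetime.now(timezone.utc),
    tracking_id="3f9c2a",  # hypothetical ID propagated along the chain
    event_type="data from vehicle received at gateway",
)
```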

When triggered, all events have to be routed to a central repository for analysis and aggregation. From this central repository, single events can be analyzed for patterns that indicate system failures.
Defining the model of a fully functioning event chain is then simple: it is expressed as a query that aggregates any number of events, together with a syntax stating the order and time frames in which the events need to reach the central repository. The exact number of events and their order depend on the level of granularity at which the event chain is to be monitored.
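
A minimal, hedged sketch of such a correlation check in Python: group events by tracking ID and verify that the expected event types arrive in order within a time window. The expected chain, the time frame and the field names follow the event model sketched above and are illustrative only.

```python
from collections import defaultdict
from datetime import timedelta

EXPECTED_CHAIN = ["received at gateway", "processed by microservice A", "stored in database"]
MAX_CHAIN_DURATION = timedelta(minutes=5)

def chain_is_healthy(events):
    """events: dicts with 'timestamp' (datetime) and 'event_type' (str), one tracking ID."""
    ordered = sorted(events, key=lambda e: e["timestamp"])
    observed_types = [e["event_type"] for e in ordered]
    in_order = observed_types[:len(EXPECTED_CHAIN)] == EXPECTED_CHAIN
    in_time = ordered[-1]["timestamp"] - ordered[0]["timestamp"] <= MAX_CHAIN_DURATION
    return in_order and in_time

def failing_chains(all_events):
    """Return the tracking IDs whose event chain does not match the expected model."""
    by_id = defaultdict(list)
    for event in all_events:
        by_id[event["tracking_id"]].append(event)
    return [tid for tid, events in by_id.items() if not chain_is_healthy(events)]
```

A production implementation would typically express the same model as a query against the central event repository rather than in application code.
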
A visualization of the proposed solution is presented in the following diagram using a sample event chain:

After the event chain is modelled in the monitoring web application, the flow of events can be tracked. First, signals have to be generated. Since generating real signals would involve driving around in a real car connected to the backend, a vehicle has to be simulated; the first event therefore comes from a vehicle simulator. After vehicle data events start being pushed to the platform, the different components of the system start processing the incoming data and interacting with other subcomponents. For each subcomponent, the telemetry capabilities have to be defined: whether it supports generating an event for every data signal and sending it to the central event repository, or whether the event has to be generated out of the logs or other information sources filled by this subcomponent. Depending on the type of component and the level of administration needed, more or less information is available. If crucial information is missing from a component, a helper or observer unit (here the "observer" Azure Functions) has to be added to the chain to provide the missing information.

After events are matched, their results can produce other, higher-level events. These high-level events can be consumed by a monitoring dashboard or a ticketing system. If consumed by a custom dashboard, monitoring information can be presented by visually showing a graph of the subcomponents of the system. Depending on the patterns matched by predefined queries, the parts of the event chain that are recognized as failing can be highlighted in the dashboard.
In addition to displaying a graph of the event chain, the operations team can subscribe to changes of a use case's health state. If a disruption of a core service is registered, a text message can be sent to the person on call. Less crucial incidents can be tracked in a ticketing system for further investigation by the operations team. A further advantage of having the event chains displayed visually is that even without deep knowledge of the platform, the operations team is able to reason about symptoms visible to users and trace many incidents back to a common origin. That is, the operations team gets direct visibility into the state of the platform's subcomponents and, in the case of an outage, direct information about its repercussions on the operability of the system.
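
A hedged sketch of consuming these higher-level events: core-service disruptions page the person on call, while less crucial incidents become tickets. The send_sms and create_ticket functions are hypothetical stand-ins for an SMS gateway and a ticketing system API.

```python
def send_sms(number, text):
    print(f"SMS to {number}: {text}")               # placeholder for a real SMS gateway

def create_ticket(title, details):
    print(f"Ticket created: {title} ({details})")   # placeholder for a ticketing API

def handle_high_level_event(event):
    """event: dict with 'use_case', 'severity' ('core-outage' or 'degraded'), 'details'."""
    if event["severity"] == "core-outage":
        send_sms("<on-call number>", f"Core service '{event['use_case']}' is unavailable")
    else:
        create_ticket(f"Degradation in {event['use_case']}", event["details"])

handle_high_level_event({"use_case": "remote door unlock",
                         "severity": "core-outage",
                         "details": "no events from microservice A for 10 minutes"})
```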

Conclusion

The implementation of modern software architecture patterns like microservices, combined with the use of cloud services beyond IaaS, facilitates faster development of software services. Despite the reduced effort to manage infrastructure when deploying applications to the cloud, monitoring the resulting highly distributed systems is becoming an increasingly complex task. On top of that, customers of modern online applications expect short response times and high availability, which raises the need for accurate monitoring systems. In this post I provided a short introduction to monitoring and an overview of current monitoring tools you can test in your projects. I also described how a well-thought-out monitoring strategy matters more than the tooling when it comes to increasing your system's availability. In the second half of this post I presented a real-life example scenario and depicted the challenges related to monitoring a distributed IoT platform. I then presented the concept that was developed in order to reach a better state of observability of the platform.

References

Some of the ideas summarised in this blog post were originally presented in the following articles or books. They are also excellent reads in case the topic of monitoring got your interest.

  • B. Beyer, C. Jones, J. Petoff, and N. R. Murphy, Site Reliability Engineering: How Google Runs Production Systems. O'Reilly Media, Inc., 2016.
  • M. Julian, "Practical Monitoring: Effective Strategies for the Real World," 2017.
  • J. Turnbull, The Art of Monitoring. James Turnbull, 2014.
  • I. Malpass, "Measure Anything, Measure Everything," 2011.
  • J. S. Ward and A. Barker, "Observing the clouds: a survey and taxonomy of cloud monitoring," J. Cloud Comput., vol. 3, no. 1, p. 24, 2014.
  • B. H. Sigelman et al., "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure," 2010.

Cloud security tools and recommendations for DevOps in 2018

Introduction

Over the last five years, the use of cloud computing services has increased rapidly in German companies. According to a 2018 statistic from Bitkom Research, the acceptance of cloud computing services keeps growing.

Cloud computing brings many advantages for a business. For example, expenses for internal infrastructure and its administration can be saved. Resource scaling is easier and can be done whenever the appropriate capital is available. In general, the level of innovation within companies also increases, which can support new technologies and business models. Cloud computing can also improve data security and data protection, because the big providers must comply with standards and obtain certification under the EU General Data Protection Regulation (EU-DSGVO), which came into force in May 2018. There are even specialized cloud providers that offer such services and operate based on the regulations of the TCDP (Trusted Cloud Data Protection) certification.
Continue reading

Building a fully scalable architecture with AWS

What I learned while building StateOfVeganism

Final setup for the finished project (created with Cloudcraft)

By now, we all know that news and media shape our views on the topics they discuss. Of course, this differs from person to person. Some might be influenced a little more than others, but some opinion is always communicated.

Considering this, it would be really interesting to see how the sentiment communicated towards a specific topic or person in the media develops over time.

Continue reading