{"id":26160,"date":"2024-02-29T16:22:11","date_gmt":"2024-02-29T15:22:11","guid":{"rendered":"https:\/\/blog.mi.hdm-stuttgart.de\/?p=26160"},"modified":"2024-02-29T18:04:51","modified_gmt":"2024-02-29T17:04:51","slug":"why-system-monitoring-is-important-and-how-we-approached-it","status":"publish","type":"post","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2024\/02\/29\/why-system-monitoring-is-important-and-how-we-approached-it\/","title":{"rendered":"Why system monitoring is important and how we approached it"},"content":{"rendered":"\n<h3 class=\"wp-block-heading\"><strong>Introduction<\/strong><\/h3>\n\n\n\n<p>Imagine building a service that aims to generate as much user traffic as possible to be as profitable as possible. The infrastructure of such a service usually includes some kind of backend, a server and various frameworks. One day, something is not working as it should and you can&#8217;t seem to find out why. You spend a lot of time testing and finally, after a couple of hours, you find the solution &#8211; but you have already lost a lot of profit.<\/p>\n\n\n\n<p>This loss of time and profit could easily have been prevented by a simple practice: monitoring.<\/p>\n\n\n\n<p>Monitoring can be described as the process of collecting and displaying data about your system, for example in the form of metrics. Metrics are measurements of resource usage or behavior within systems, ranging from low-level operating system data like CPU usage to higher-level application-specific information like request rates. [1]<\/p>\n\n\n\n<p><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Why should you consider monitoring your system?<\/strong><\/h3>\n\n\n\n<p>Monitoring helps you analyze long-term trends, such as the development of daily active users. It alerts you when something is broken and needs to be fixed. 
You can also build a dashboard that describes your system, including the four golden signals, the foundational building blocks of an effective monitoring strategy.<\/p>\n\n\n\n<p>According to Google&#8217;s Site Reliability Engineering book, the four golden signals are:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>latency: the time it takes to respond to a request<\/li>\n\n\n\n<li>traffic: the demand that users place on your service<\/li>\n\n\n\n<li>errors: the rate of requests that fail<\/li>\n\n\n\n<li>saturation: how full your service is, for example the amount of memory in use at a given moment [2]<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>How did we approach monitoring in our project?<\/strong><\/h3>\n\n\n\n<p>There are several tools that can be used for system monitoring. We chose Prometheus because it is an open-source monitoring and alerting system that is easy to integrate into different project architectures. It scrapes metrics, that is, it periodically pulls them from configured targets, and stores them as time-series data. 
[3]<\/p>\n\n\n\n<p>For context: for our student project, we developed a simple hosting provider that lets users start Docker containers on a remote server using a simple CLI tool.<\/p>\n\n\n\n<p>We added a simple monitoring setup, as shown in the picture below:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><a href=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2024\/02\/monitoring.png\"><img loading=\"lazy\" decoding=\"async\" data-attachment-id=\"26163\" data-permalink=\"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2024\/02\/29\/why-system-monitoring-is-important-and-how-we-approached-it\/monitoring-2\/\" data-orig-file=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2024\/02\/monitoring.png\" data-orig-size=\"927,541\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"monitoring\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2024\/02\/monitoring.png\" src=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2024\/02\/monitoring.png\" alt=\"\" class=\"wp-image-26163\" width=\"695\" height=\"406\" srcset=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2024\/02\/monitoring.png 927w, https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2024\/02\/monitoring-300x175.png 300w, https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2024\/02\/monitoring-768x448.png 768w\" sizes=\"auto, (max-width: 695px) 100vw, 695px\" \/><\/a><figcaption class=\"wp-element-caption\"><em>The monitoring approach in our student 
project<\/em><\/figcaption><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p>Our project includes a host server, a Terraform service and a Go-based backend. To let Prometheus pull metrics by scraping, we added a metrics endpoint to our services with the help of go-gin-prometheus. go-gin-prometheus was the best fit for us because our backend is already based on Gin, which made it easy to integrate an HTTP metrics endpoint. Alternatively, you could use the official Prometheus Go client library. We added a Prometheus instance to our docker-compose file, pointing to a prometheus.yml file that contains our Prometheus configuration, such as the scrape interval and our scrape target.<\/p>\n\n\n\n<p>To display our metrics, we used Grafana to create a dashboard. At the beginning, we added a remote_write configuration to our prometheus.yml file. This configuration includes a remote_write endpoint URL that sends the metrics to Grafana Cloud, making it possible for us to access the metrics in an online dashboard and configure it in the Grafana web application. In the end, however, we decided to run our own Grafana instance in the docker-compose file, as is standard for every service in our project. This gives us a reproducible environment and lets us define our dashboard in a JSON file that we push to our Grafana instance.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Conclusion<\/strong><\/h3>\n\n\n\n<p>Monitoring is a very important aspect of system engineering. It helps keep a system healthy by providing us with the necessary metrics and alerting us when issues occur. 
As every system is different, there is no standard way of implementing it, so you have to make sure to integrate the services, configurations and metrics that are most suitable for your use case.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h4 class=\"wp-block-heading\">Further reading<\/h4>\n\n\n\n<p>If you want to know more about the other aspect of monitoring, observing log data with Grafana Loki as displayed in the picture above, check out this blog post: <a href=\"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2024\/02\/29\/combining-zerolog-loki\/\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2024\/02\/29\/combining-zerolog-loki\/<\/a><\/p>\n\n\n\n<p>Other aspects of our project include user authentication with <a href=\"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2024\/02\/29\/using-keycloak-as-iam-for-our-hosting-provider-service\/\" title=\"keycloak\">Keycloak<\/a> and starting Docker containers with <a href=\"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2024\/02\/29\/terraform-x-go-challenges-when-interacting-with-terraform-through-go\/\" title=\" terraform\">Terraform<\/a>.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">References<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>[1] J. Ellingwood, \u201cAn Introduction to Metrics, Monitoring, and Alerting\u201d, DigitalOcean, December 5, 2017. https:\/\/www.digitalocean.com\/community\/tutorials\/an-introduction-to-metrics-monitoring-and-alerting<\/li>\n\n\n\n<li>[2] Google, \u201cGoogle &#8211; Site Reliability Engineering\u201d, 2017. https:\/\/sre.google\/sre-book\/monitoring-distributed-systems\/<\/li>\n\n\n\n<li>[3] Prometheus, \u201cGetting Started with Prometheus | Prometheus\u201d. 
<a href=\"https:\/\/prometheus.io\/docs\/tutorials\/getting_started\/\">https:\/\/prometheus.io\/docs\/tutorials\/getting_started\/<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Introduction Imagine building a service that aims to generate as much user traffic as possible to be as profitable as possible. The infrastructure of your service usually includes some kind of backend, a server and other frameworks. One day, something is not working as it should and you can&#8217;t seem to find out why. You [&hellip;]<\/p>\n","protected":false},"author":1191,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[1,22,2],"tags":[417,167,418],"ppma_author":[1013],"class_list":["post-26160","post","type-post","status-publish","format-standard","hentry","category-allgemein","category-student-projects","category-system-engineering","tag-grafana","tag-monitoring","tag-prometheus"],"aioseo_notices":[],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":3767,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2018\/07\/30\/end-user-monitoring-establish-a-basis-to-understand-operate-and-improve-software-systems\/","url_meta":{"origin":26160,"position":0},"title":"End user monitoring \u2013 Establish a basis to understand, operate and improve software systems","author":"Alexander Wallrabenstein","date":"30. July 2018","format":false,"excerpt":"End user monitoring is crucial for operating and managing software systems safely and effectively. Beyond operations, monitoring constitutes a basic requirement to improve services based on facts instead of instincts. Thus, monitoring plays an important role in the lifecycle of every application. 
But implementing an effective monitoring solution is challenging\u2026","rel":"","context":"In &quot;Allgemein&quot;","block_context":{"text":"Allgemein","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/allgemein\/"},"img":{"alt_text":"Typical monitoring stack","src":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2018\/07\/fig_03_typicalMonitoringStack.png?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2018\/07\/fig_03_typicalMonitoringStack.png?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2018\/07\/fig_03_typicalMonitoringStack.png?resize=525%2C300&ssl=1 1.5x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2018\/07\/fig_03_typicalMonitoringStack.png?resize=700%2C400&ssl=1 2x"},"classes":[]},{"id":5104,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2019\/02\/26\/end-to-end-monitoring-of-modern-cloud-applications\/","url_meta":{"origin":26160,"position":1},"title":"End-to-end Monitoring of Modern Cloud Applications","author":"je052","date":"26. February 2019","format":false,"excerpt":"During the last semester and as part of my Master's thesis, I worked at an automotive company on the development of a vehicle connectivity platform. 
Within my team I was assigned the task of monitoring, which turned out to be a lot more interesting but at the same time way\u2026","rel":"","context":"In &quot;Allgemein&quot;","block_context":{"text":"Allgemein","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/allgemein\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2019\/02\/diagram-1024x540.png?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2019\/02\/diagram-1024x540.png?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2019\/02\/diagram-1024x540.png?resize=525%2C300&ssl=1 1.5x"},"classes":[]},{"id":12550,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2021\/02\/27\/migrating-from-heroku-to-hetzner-achieving-scalability-with-docker-kubernetes-and-rancher\/","url_meta":{"origin":26160,"position":2},"title":"Migrating from Heroku to Hetzner: Achieving Scalability with Docker, Kubernetes and Rancher","author":"Mario Koch","date":"27. February 2021","format":false,"excerpt":"Dockerizing an existing application and deploying it in a Kubernetes Cluster via Rancher to achieve better scalability and cost minimization. 
Load Testing with Artillery, Monitoring with Prometheus & Grafana and GitHub Actions for CI\/CD were used in the process.","rel":"","context":"In &quot;Allgemein&quot;","block_context":{"text":"Allgemein","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/allgemein\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2021\/02\/vidar-nordli-mathisen-y8TMoCzw87E-unsplash-1-scaled.jpg?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2021\/02\/vidar-nordli-mathisen-y8TMoCzw87E-unsplash-1-scaled.jpg?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2021\/02\/vidar-nordli-mathisen-y8TMoCzw87E-unsplash-1-scaled.jpg?resize=525%2C300&ssl=1 1.5x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2021\/02\/vidar-nordli-mathisen-y8TMoCzw87E-unsplash-1-scaled.jpg?resize=700%2C400&ssl=1 2x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2021\/02\/vidar-nordli-mathisen-y8TMoCzw87E-unsplash-1-scaled.jpg?resize=1050%2C600&ssl=1 3x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2021\/02\/vidar-nordli-mathisen-y8TMoCzw87E-unsplash-1-scaled.jpg?resize=1400%2C800&ssl=1 4x"},"classes":[]},{"id":24490,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2023\/03\/06\/how-to-scale-an-iot-platform\/","url_meta":{"origin":26160,"position":3},"title":"How to scale an IoT-Platform","author":"Simon Janik","date":"6. 
March 2023","format":false,"excerpt":"Written by Marvin Blessing, Michael Partes, Frederik Omlor, Nikolai Thees, Jan Tille, Simon Janik, Daniel Heinemann - for System Engineering And Management IntroductionArchitectureScaling & Load TestEstimations and predicted bottlenecksMonitoringLoad TestScalingLessons LearnedMonitoring & Error searchDevelopment ProcessIdentifying and improving bottlenecks Introduction The aim of the project was to develop a system with\u2026","rel":"","context":"In &quot;Allgemein&quot;","block_context":{"text":"Allgemein","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/allgemein\/"},"img":{"alt_text":"","src":"https:\/\/lh6.googleusercontent.com\/Gsjx91zLxbPn2Uer_-kXHZ68xbA2YmKIrqreIyWNPF9MwowChC5IHi1wx6G6Ctj2MKTRA1n-uZHlwfjxk-dYkjGrzGY10KbnOLN1UKQVbMNaO-RIvm3c7cBFVZQEy5lqi33i_F5TEbln0X7C3CZfL4k","width":350,"height":200},"classes":[]},{"id":10190,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2020\/03\/01\/autoscaling-of-docker-containers-in-google-kubernetes-engine\/","url_meta":{"origin":26160,"position":4},"title":"Autoscaling of Docker Containers  in Google Kubernetes Engine","author":"de032","date":"1. March 2020","format":false,"excerpt":"In this blog post we are taking a look at scaling possibilities within Kubernetes in a cloud environment. 
We are going to present and discuss various options that all have the same target: increase the availability of a service.","rel":"","context":"In &quot;Allgemein&quot;","block_context":{"text":"Allgemein","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/allgemein\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/1052ebad-d01f-4803-bde6-e943c4598ef9.jpeg?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/1052ebad-d01f-4803-bde6-e943c4598ef9.jpeg?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/1052ebad-d01f-4803-bde6-e943c4598ef9.jpeg?resize=525%2C300&ssl=1 1.5x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/1052ebad-d01f-4803-bde6-e943c4598ef9.jpeg?resize=700%2C400&ssl=1 2x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/1052ebad-d01f-4803-bde6-e943c4598ef9.jpeg?resize=1050%2C600&ssl=1 3x, https:\/\/i0.wp.com\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2020\/03\/1052ebad-d01f-4803-bde6-e943c4598ef9.jpeg?resize=1400%2C800&ssl=1 4x"},"classes":[]},{"id":9663,"url":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2020\/02\/24\/how-to-increase-robustness-of-a-large-scale-system-by-testing\/","url_meta":{"origin":26160,"position":5},"title":"How to increase robustness of a large scale system by testing","author":"Johannes Mauthe","date":"24. February 2020","format":false,"excerpt":"When a distributed software system grows bigger and bigger, one will end up with a big amount of various components which all need to scale independently. 
In order to achieve these components working smooth together, it is necessary to figure out at which time a component needs to be scaled,\u2026","rel":"","context":"In &quot;Allgemein&quot;","block_context":{"text":"Allgemein","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/category\/allgemein\/"},"img":{"alt_text":"","src":"https:\/\/lh3.googleusercontent.com\/8h_z-5W6olzeJeyXw7NwIHYdRJs3FyHcLk-NSsfw_eWM-2oCE1FnZFBxC3qw2IqdnSal43O8bc5uMGFaBvbKLZjhRu4Q2nlitp7AbAeNTc3BOFW2u_6xtpR3jIEvNLDPpsrmL8c9","width":350,"height":200},"classes":[]}],"jetpack_sharing_enabled":true,"authors":[{"term_id":1013,"user_id":1191,"is_guest":0,"slug":"michelle_becher","display_name":"Michelle Becher","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/03433e8d2f3169ccfd90def0a7d412b4ea2594d4d14b2e38348c761979e04f0b?s=96&d=mm&r=g","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":""}],"_links":{"self":[{"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/posts\/26160","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/users\/1191"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/comments?post=26160"}],"version-history":[{"count":5,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/posts\/26160\/revisions"}],"predecessor-version":[{"id":26206,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/posts\/26160\/revisions\/26206"}],"wp:attachment":[{"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/media?parent=26160"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/categories?post=26160"},{"t
axonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/tags?post=26160"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/wp-json\/wp\/v2\/ppma_author?post=26160"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}