Written by Verena Barth, Marcel Heisler, Florian Rupp, & Tim Tenckhoff
Architecture
Besides earlier technological considerations, we primarily wanted to change the architecture of the initial project. Despite the software patterns and architectural concepts taught in the Web Application Architecture lecture, the first approach was built as a monolith. In effect, the whole logic lived in a single software layer, which offered poor conditions for scalability and extensibility. For this reason, we decided to decouple the logical components of our software.
This decoupling resulted in a separation of frontend and web backend, and in the relocation of smaller logical parts of the software into microservices: a database service contains the logic to process all incoming requests to store or access data, and a crawler service controls the periodic queries to the VVS API that request the delay information. By implementing these microservices we broke the functionality down into the smallest reasonable units, in accordance with the principles of service-oriented architecture (SOA). As a consequence, subcomponents of our code became easier to scale, maintain, or exchange. Additionally, we separated components that could otherwise slow each other down (e.g. when subcomponents require database access at the same time). The final architecture is shown in figure 1.
Message Broker
So we decided to build a distributed system based on service-oriented architecture (SOA). When building monolithic software, nobody has to waste a thought on how to let parts of the software talk to each other: the source code to be called is directly available. In a distributed system, things change. As the name already tells you, the services are separated at least by software boundaries and often also by hardware infrastructure, so any communication between them must go over the wire. To approach this problem we used a message broker. This piece of third-party software can be imagined as a post office: all it does is accept messages and route them, according to the information on the envelope, to interested consumers. To gain experience with message brokers, we decided to use RabbitMQ. Some advantages of that choice are that it works out of the box, has very good documentation, and provides interface libraries with great tutorials for most programming languages. Other message brokers you might have heard of are Apache Kafka or ActiveMQ.
To keep receiving messages decoupled from delivering them, RabbitMQ uses the concept of exchanges. Each message is sent to a specific exchange configured on the broker, with a routing key on the envelope. The exchange then maps the message to one or more queues based on that routing key. This way a single published message can end up in multiple queues, provided you configure things properly ;-). For this purpose RabbitMQ offers several types of exchanges, for example the fanout exchange, which does exactly that: it broadcasts the message to all bound queues. If you prefer one-to-one communication, I'd recommend the direct exchange.
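A minimal sketch of how an exchange and a queue can be declared and used with the RabbitMQ Java client (the host, exchange, queue, and routing-key names are illustrative assumptions, not necessarily the exact ones from our project):

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import java.nio.charset.StandardCharsets;

public class FanoutExample {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // broker address is an assumption

        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {
            // Declare a fanout exchange: every bound queue receives a copy of each message.
            channel.exchangeDeclare("crawler", "fanout");
            // Declare a queue and bind it to the exchange (the routing key is ignored by fanout).
            channel.queueDeclare("crawler-data-queue", true, false, false, null);
            channel.queueBind("crawler-data-queue", "crawler", "crawler-data");

            // Publish one message; the broker routes it to all queues bound to the exchange.
            String payload = "{\"station\":\"Stadtmitte\",\"delay\":3}";
            channel.basicPublish("crawler", "crawler-data", null,
                    payload.getBytes(StandardCharsets.UTF_8));
        }
    }
}
```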
The larger your system grows, the more complex the communication architecture becomes. Fortunately, we only had three services that needed a network connection: the web backend service, the crawler service, and the database service (see also the diagram below).
There are two message exchange paths: the crawler must save data into the database, and the web backend must be able to query this data from the database. Because a database query is usually triggered by a user interaction, the two paths must be separated to avoid unnecessary traffic that might slow down the communication. So each of them gets its own exchange configured on the broker. To save the data it has fetched, the crawler sends it to the crawler exchange with the routing key crawler-data. Through its configuration the message broker knows that the message must be routed to the crawler-data-queue, to which the database service listens. Here we use an exchange of the type fanout, which brings the advantage of being able to save the data in multiple databases at once.
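On the receiving side, the database service simply subscribes to that queue. A rough sketch with the RabbitMQ Java client (again with illustrative names; the actual persistence code is left out):

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;
import java.nio.charset.StandardCharsets;

public class DatabaseServiceConsumer {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // broker address is an assumption

        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();
        channel.queueDeclare("crawler-data-queue", true, false, false, null);

        // Called for every message the crawler exchange routes into this queue.
        DeliverCallback onMessage = (consumerTag, delivery) -> {
            String body = new String(delivery.getBody(), StandardCharsets.UTF_8);
            // Here the service would parse the stop event and write it to the database.
            System.out.println("Received stop event: " + body);
        };
        channel.basicConsume("crawler-data-queue", true, onMessage, consumerTag -> { });
    }
}
```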
The main difference in the communication between the web backend and the database is that the web backend makes a query. This means it also expects an answer containing data, whereas the crawler only sends data. The pattern of a remote procedure call (RPC) is known to solve this issue: whenever the web backend sends a query, exactly one such call goes to the dedicated database exchange. Things happen the same way as explained before, except that the web backend now waits synchronously for the answer from the database.
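Since the web backend is a Spring Boot application, such an RPC round trip can be expressed, for example, with Spring AMQP's RabbitTemplate. This is only a sketch: the exchange name, routing key, and message format are assumptions for illustration, not our exact implementation.

```java
import org.springframework.amqp.rabbit.core.RabbitTemplate;
import org.springframework.stereotype.Service;

@Service
public class DelayQueryService {

    private final RabbitTemplate rabbitTemplate;

    public DelayQueryService(RabbitTemplate rabbitTemplate) {
        this.rabbitTemplate = rabbitTemplate;
    }

    public String queryDelays(String stationQuery) {
        // convertSendAndReceive publishes the query to the database exchange and
        // blocks until the database service replies (or the configured timeout hits).
        Object reply = rabbitTemplate.convertSendAndReceive(
                "database", "database-query", stationQuery);
        return reply == null ? null : reply.toString();
    }
}
```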
In our example, things were still clear and easy to manage. But think of a system running ten times as many services, with multiple instances of each: you quickly lose track. At this point you need to think about the maximum utilisation of your system, which is exactly what queuing theory focuses on. By specifying the service disciplines and a few parameters, you can calculate the traffic produced inside your system.
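Just as an illustration of what such a calculation looks like (we did not actually model our system this way), the simplest queuing model, the M/M/1 queue, estimates the utilisation and the average time a message spends in the system from the arrival rate and the service rate:

```latex
% M/M/1 queue: Poisson arrivals with rate \lambda, exponential service with rate \mu
\rho = \frac{\lambda}{\mu}, \qquad
L = \frac{\rho}{1 - \rho}, \qquad
W = \frac{1}{\mu - \lambda}
% Example with assumed numbers: \lambda = 8 msg/s, \mu = 10 msg/s
% gives \rho = 0.8, on average L = 4 messages in the system and W = 0.5 s per message
```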
Time Series Database
The general requirements of our project, which should allow the distributed software to store data periodically and to answer time-based queries on the collected data, led to considerations about the most suitable database system for our use case. Our initial observations showed that the required data is, on the one hand, tied to a specific point in time, containing a train station and the related list of trains and their delays, and on the other hand needs to be accessible through queries that use a time range (e.g. the last week) as search parameter. As time was the key indicator in both cases, we started to question whether the commonly used concept of a relational database such as SQL would sufficiently fit our needs, and quickly discovered the concept of time series databases. As the name indicates, this concept identifies points in a given data series by the factor 'time'. It can thus be compared to an SQL database table where the primary key is pre-set by the system and is always the time.
Across the different suppliers of time series databases we found InfluxDB, which promises to store large volumes of time series data and to perform real-time analyses and queries on them. Apparently, this was exactly what we needed to store and access the required information. Additionally, we found a well-documented tutorial, which facilitated the integration of InfluxDB into our Spring Boot project.
The database itself could be installed and configured via the command line for local development. On our production server, we set up a Docker container with a pre-configured environment in which the InfluxDB could run. A terminal interface (accessed by entering 'influx') allowed manual maintenance of the existing tables and their content. InfluxDB provides a query syntax that is similar to the already familiar SQL. Therefore we didn't need to adapt the existing data structure in our Spring Boot project when switching between SQL and InfluxDB during development.
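To give a rough idea of what such a time-range query can look like from Java, here is a sketch using the influxdb-java client library; the connection settings, database name, and measurement name are placeholders for this example, not our production configuration.

```java
import org.influxdb.InfluxDB;
import org.influxdb.InfluxDBFactory;
import org.influxdb.dto.Query;
import org.influxdb.dto.QueryResult;

public class DelayQueryExample {
    public static void main(String[] args) {
        // Connection parameters are placeholders.
        InfluxDB influxDB = InfluxDBFactory.connect("http://localhost:8086", "admin", "admin");

        // InfluxQL looks much like SQL; here we ask for all stop events of the last 7 days.
        Query query = new Query("SELECT * FROM stop_events WHERE time > now() - 7d", "vvs_delays");
        QueryResult result = influxDB.query(query);
        result.getResults().forEach(r -> System.out.println(r.getSeries()));

        influxDB.close();
    }
}
```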
One difficulty to overcome was the storage of multiple trains and connections with the same time. As our service crawls data about all trains that are expected to arrive at a specific station within the next minutes, the output to store is a list of entries that all correlate with the timestamp of crawling. Thus every entry of this list theoretically had the same timestamp, which cannot be saved in a time series database where the unique primary key is the time. To solve this, we manually added 100 milliseconds to the timestamp of every further entry and called this list of trains and their related delays a "Stop Event". As a result, our database could possibly contain duplicate entries for trains that stay delayed over a longer period spanning multiple timestamps. To filter out these duplicates, we implemented functionality that iterates over all entries of a query result, ensuring that only the last and valid delay of a train influences the statistics in our web view.
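A simplified sketch of both workarounds is shown below, assuming the influxdb-java Point builder; the class and field names (e.g. TrainDelay) are made up for this example and are not the data classes from our project.

```java
import org.influxdb.dto.Point;

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;

public class StopEventMapper {

    /** Illustrative value object; the real project uses its own data classes. */
    public static class TrainDelay {
        public final String trainId;
        public final int delayMinutes;
        public TrainDelay(String trainId, int delayMinutes) {
            this.trainId = trainId;
            this.delayMinutes = delayMinutes;
        }
    }

    /**
     * Writing: every train of one crawled "Stop Event" shares the crawl timestamp,
     * so each further entry is shifted by 100 ms to keep the time-based primary key unique.
     */
    public static List<Point> toPoints(String station, long crawlTimeMillis, List<TrainDelay> trains) {
        List<Point> points = new ArrayList<>();
        long offset = 0;
        for (TrainDelay train : trains) {
            points.add(Point.measurement("stop_events")
                    .time(crawlTimeMillis + offset, TimeUnit.MILLISECONDS)
                    .tag("station", station)
                    .tag("train", train.trainId)
                    .addField("delay", train.delayMinutes)
                    .build());
            offset += 100;
        }
        return points;
    }

    /**
     * Reading: a train that stays delayed shows up at several crawl timestamps,
     * so only the most recent delay per train is kept before computing statistics.
     */
    public static Map<String, Integer> latestDelayPerTrain(List<TrainDelay> entriesOrderedByTime) {
        Map<String, Integer> latest = new LinkedHashMap<>();
        for (TrainDelay entry : entriesOrderedByTime) {
            latest.put(entry.trainId, entry.delayMinutes); // later entries overwrite earlier ones
        }
        return latest;
    }
}
```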