- Marcus Schreiter – ms479
- Nathanael Paiement – np035
- Lena Haide – lh075
- Simon Haag – sh260
Everyone knows the problem of keeping track of expenses. Many applications offer an overview of all expenses, but entering every item individually can be quite time-consuming. To address this problem, we developed SWAI, ‘A Scanner with A.I.’.
The goals of the project were:
- Build a working application
- Learn how to build serverless modules with Python
- Learn how to set up CI/CD systems
- Learn how to host a single page app in the cloud
- Interact with Google Cloud Functions
- Interact with the Google Vision API
- Interact with the Google Photos API
Why Google Cloud Functions and not Amazon or IBM?
The decision to go with the Google Cloud was based on the fact that Google provides a ready-to-use model for text recognition in images. The most important reason is that Google has the largest training set of images of any provider we know of, so the model behind the Vision API should perform very well. In addition, up to 1000 images per month can be processed free of charge, which was an acceptable limit for our tests, and the model can easily be tried out online at https://cloud.google.com/vision/. We also decided to stick with the Google Cloud for our serverless modules, to keep the management and communication overhead between services to a minimum. Finally, Google provides helpful documentation with plenty of examples of what a request to the Vision API might look like.
Because the Google Cloud provides SDKs for several programming languages, we had to decide which language to use for our project. Our decision fell on Python, even though part of our team had little to no experience with it. That was not a big problem, since we were motivated to learn a new programming language anyway. Another bonus was Google’s very well documented Python API and the tutorials it provides for setting up and using the Google Cloud SDK. Finally, setting up a development environment that is consistent between developers was as easy as it could get: pip plus a simple requirements.txt containing the project’s dependencies.
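Such a requirements.txt might look like the following (the package list is illustrative of what a project like this needs, not our exact file):

```
# requirements.txt – shared dependency list,
# installed with `pip install -r requirements.txt`
google-cloud-vision
pytest
```

Every developer runs the same install command, so everyone ends up with the same set of libraries.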
Why IBM Cloud for deploying the frontend
With the plan to create the frontend as a single page application, the requirements for an ideal host were very simple: static files had to be available via a single base URL, without us having to own a domain ourselves. The IBM Cloud not only offered a simple system for hosting static files accessible via a single base URL, it also offered a practical build pipeline that made it easy to configure a CI/CD system. More details about that setup can be found in a later section. Google’s cloud, on the other hand, did not offer such a useful toolchain, and its Buckets solution for hosting static files required owning a web domain for the single page application to work properly.
As with any service-based application, having a flexible, performant and secure system architecture is very important for the success of that service. A flexible architecture is advantageous because it enables faster development and testing of new features, as well as easier maintenance of existing features. A performant system doesn’t just lead to better user experience, but it also means you can scale better and handle more users with the same infrastructure. Security should never be ignored, but it’s even more critical to have a good security model when creating a service that needs to handle personal user data.
The following figure shows a basic overview of the components making up the service and how a client can interact with it:
The application first requests receipt images from the Google Photos API. The images can then be sent to the cloud function, which interacts with Google’s Vision API.
Designing our system this way resulted in a good level of flexibility: the public endpoint abstracts away the complexities of the chosen image recognition service, keeps the option open to connect more services behind it if needed, and keeps the communication with clients as simple as possible. The clients are also free to choose where they get the source images from, whether from the user’s device or a third-party service. As long as the client can send a public URL to an image or the raw image data itself, it will work with the public endpoint.
Since the parsing of a single image is a self-contained process that neither holds state nor produces side effects, the system can scale very well. With the public endpoint being based on Google Cloud Functions, each request works independently of the others, and there is no need to manage any hardware resources, since Google takes care of allocating enough CPU and RAM for each request. This design fulfills factors VI (processes), VIII (concurrency), and IX (disposability) of The Twelve-Factor App guidelines for building better software delivered as a service.
Regarding security, we have reduced risks by only having the image and the parsed data transferred between the client and the service as well as only storing the team’s access token to the Google Vision API and some basic request logs in the Google Cloud. The token to access the user’s Google Photos library never touches our systems and only stays between Google and the user. That token only gets saved locally in the user’s browser.
Google Vision AI
The Vision API provides REST and RPC APIs for accessing pre-trained computer vision models for image labeling, object and face detection, optical character recognition, and more (for a full feature list see https://cloud.google.com/vision/docs/features-list). For our project, we were especially interested in text detection.
The optical character recognition provided by the Vision API takes an image (either in the form of a URL or binary data) and, if successful, returns a JSON which contains the extracted text.
The returned JSON is hierarchically structured: Pages->Blocks->Paragraphs->Words->Symbols. These structures also contain additional information such as confidence scores and positions in the image. The actual request to the Vision API can be done within a few lines of code.
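A minimal sketch of such a request, and of walking the returned hierarchy, might look like this (the helper names are our own; the Vision call assumes a configured service-account credential):

```python
def detect_text(image_url):
    """Ask the Vision API for the text in an image reachable via URL.
    Assumes GOOGLE_APPLICATION_CREDENTIALS points to a service-account key."""
    from google.cloud import vision  # deferred so the pure helper below has no dependency
    client = vision.ImageAnnotatorClient()
    image = vision.Image()
    image.source.image_uri = image_url
    return client.document_text_detection(image=image).full_text_annotation

def collect_words(annotation):
    """Walk the Pages -> Blocks -> Paragraphs -> Words -> Symbols hierarchy
    of a JSON-serialized annotation and reassemble the words."""
    words = []
    for page in annotation.get("pages", []):
        for block in page.get("blocks", []):
            for paragraph in block.get("paragraphs", []):
                for word in paragraph.get("words", []):
                    words.append("".join(s["text"] for s in word.get("symbols", [])))
    return words
```

`collect_words` is deliberately pure: it only needs the serialized JSON, which makes the traversal easy to unit-test without hitting the API.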
In our case, we added support for multiple formats of incoming request data, so we could handle images passed as URLs, binary data, or JSON. After retrieving a valid image from the request, we could process it with the ImageAnnotatorClient provided by the Vision API and, if successful, work with the returned JSON.
Processing the return values
After we got the requests to the Vision API up and running, we had to find a way to extract the desired data from the returned results.
Our first step was to find a keyword on the receipt that can be associated with the total value of the purchase, for example ‘Summe’, ‘Total’ or ‘Gesamtbetrag’.
Because the results of the OCR were not always consistent, we could not rely on the hierarchical structure of the JSON. Sometimes the keyword for the sum and the actual value ended up in different blocks or paragraphs, so we could not create an association between the keyword and the value that way. Luckily, the JSON also contains information about the position of each word in the image. With a little help from analytic geometry and the assumption that the keyword and the total value always lie on approximately the same horizontal line, we were able to extract the total value from the receipt.
First, we searched for a matching keyword (red). Then we used the bounding vertices of the keyword (green) to create a line. We then iterated through the word objects in the JSON and checked whether their bottom-left vertex lies within a certain threshold of the line. The threshold is derived from the height of the keyword’s bounding box, so the whole calculation is independent of the image resolution. If a word lies inside the threshold, we check whether its text matches the format of a number with two decimal places. If so, we have found our total value!
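The geometric matching described above can be sketched as follows; the input format (word text plus four corner vertices) and the helper name are assumptions for illustration, not our exact implementation:

```python
import re

AMOUNT_RE = re.compile(r"^\d+[.,]\d{2}$")  # a number with exactly two decimal places

def find_total(words, keywords=("SUMME", "TOTAL", "GESAMTBETRAG")):
    """words: list of (text, vertices) pairs, where vertices are the four
    corners of the word's bounding box as (x, y) tuples in the order
    top-left, top-right, bottom-right, bottom-left (upright image assumed)."""
    for text, verts in words:
        if text.upper() not in keywords:
            continue
        (x1, y1), (x2, y2) = verts[3], verts[2]   # line through the keyword's bottom edge
        height = abs(verts[3][1] - verts[0][1])   # box height -> resolution-independent threshold
        for other_text, other_verts in words:
            if not AMOUNT_RE.match(other_text):
                continue
            bx, by = other_verts[3]               # bottom-left vertex of the candidate word
            # y-coordinate of the keyword's baseline at the candidate's x-position
            line_y = y1 + (y2 - y1) * (bx - x1) / (x2 - x1) if x2 != x1 else y1
            if abs(by - line_y) <= height:
                return float(other_text.replace(",", "."))
    return None
```

Using the box height as the threshold is what keeps the check independent of the scan resolution: a receipt photographed at twice the resolution has boxes twice as tall, and the tolerance scales with them.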
Google Cloud Function
The Google Cloud Function serves as the public endpoint described above in the chapter “Service architecture”. We chose an HTTP-triggered function because it works well with the output of the Google Photos library: we receive ready-made photo links, which can simply be passed along in an HTTP request. Since a Google Photos link is not always available, and to support applications that do not work with the Google Photos API, the image can also be transferred in binary form. For the evaluation of the request, it is important to set the ‘Content-Type’ in the request header.
When an application calls our Google Cloud Function the below-described process is triggered:
1. First, the Cross-Origin Resource Sharing (CORS) headers are set to allow applications to access our public endpoint.
2. The next step is to create the Vision client. With the evaluated request data, a request can be made to the Vision API. An error occurs if invalid data is supplied or the wrong ‘Content-Type’ is set.
3. If the response of the Vision Client is successful it can be processed further. This is done by searching within the returned JSON for keywords (using a keyword list) and for dates (using regular expressions).
4. If one of the keywords is found, then the line object is created. This object and the calculated variance are used to search for the corresponding number again with the help of regular expressions.
5. Finally, the extracted results are returned as JSON. If the ‘Content-Type’ was ‘application/json’, an additional ID is set inside the JSON data, which allows the application to match responses to requests in case of simultaneous requests.
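Steps 1 and 2 can be illustrated with a couple of small helpers (a sketch; the exact content types and field names our endpoint accepts are simplified here):

```python
# Headers added to every response so that browser clients may call the endpoint (step 1).
CORS_HEADERS = {
    "Access-Control-Allow-Origin": "*",
    "Access-Control-Allow-Methods": "POST",
    "Access-Control-Allow-Headers": "Content-Type",
}

def parse_request(content_type, body):
    """Map the request's Content-Type to a Vision image source (step 2).
    Returns (kind, payload, request_id); kind is "uri" or "content"."""
    if content_type.startswith("application/json"):
        return "uri", body["url"], body.get("id")  # JSON carries an image URL plus optional ID
    if content_type.startswith("text/plain"):
        return "uri", body, None                   # plain text: a bare image URL
    if content_type.startswith("image/"):
        return "content", body, None               # raw binary image data
    raise ValueError("unsupported Content-Type: " + content_type)
```

Keeping this dispatch pure (no Flask objects, no network) makes it trivial to unit-test before the function is deployed.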
At the beginning of the project, every new change was tested by running our unit tests manually. Once all tests passed, the changes were deployed to the Google Cloud using the Google Cloud SDK on the developer’s machine. If everything worked well, the new version of the code was pushed to GitLab. Since these steps were quite time-consuming, we decided to implement Continuous Integration and Continuous Deployment (CI/CD) with GitLab, automating the steps above to save time and improve quality.
After the required settings had been configured on GitLab to enable the CI/CD feature, a .gitlab-ci.yml file had to be provided; the exact steps of the CI/CD pipeline are defined in this file. From then on, every new commit in our GitLab repository triggers the pipeline. The pipeline defines two jobs: the first runs the unit tests, and the second, which only starts once the tests have passed, deploys the Cloud Function to the Google Cloud.
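The pipeline definition might look roughly like this (job names, Docker images and the function name are illustrative, not our exact configuration):

```yaml
stages:
  - test
  - deploy

unit_tests:
  stage: test
  image: python:3.7
  script:
    - pip install -r requirements.txt
    - python -m pytest

deploy_function:
  stage: deploy            # only runs if unit_tests succeeded
  image: google/cloud-sdk:slim
  only:
    - master
  script:
    - echo "$GCP_SERVICE_KEY" > key.json  # service-account key stored as a CI/CD variable
    - gcloud auth activate-service-account --key-file key.json
    - gcloud functions deploy parse_receipt --runtime python37 --trigger-http
```

Because the jobs sit in consecutive stages, GitLab enforces the ordering for us: a failing test job stops the deploy job from ever starting.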
As already mentioned, the frontend of this project is a single page web application written with React. In order to work, it interacts with three different web services, namely the Google Identity Platform, the Google Photos API and our bespoke receipt endpoint.
The application logic is a three-step process. First, the user authenticates with their Google credentials against the Identity Platform, which, on success, returns an OAuth API token. In the second step, we use this token to fetch the user’s pictures of receipts from their Google Photos library. These photos are displayed in the application, and the user has the opportunity to deselect any pictures they do not want to evaluate or that were mistakenly tagged as receipts by Google Photos. Finally, the user can send all selected pictures to the receipt endpoint, which returns the desired data for every image.
To communicate and work with Google services, the first step is always to register the project. This registration is free of charge and done in minutes. After registering our project, the next step was to obtain an application token to get Google authentication working. In the authentication process, this token is necessary to authenticate not only the user but also the application itself. As an additional security measure, one has to specify the URLs from which requests will come; in our case http://localhost:3000/ for development and http://swai.eu-de.mybluemix.net/ as the production URL.
The requested OAuth scope declares that we want read-only access to the user’s photo library. This declaration later leads to the familiar warning that the application wants to access your photos, which the user has to accept in order to continue.
Google Photos API
When the user has agreed to let our application access their pictures, the request to the Google Identity Platform returns an access token and the user’s profile data. From this data, we extract the name and the profile picture and display them in the header of our application. With the access token, we then call the Google Photos API. The API offers a handful of filter options, including date ranges and content categories, which are automatically generated image labels. One of these labels is Receipts, which is the one we use. Moreover, we specify the media type Photo and include archived media with ‘includeArchivedMedia’: true.
The call returns an array of image objects. Among other things, an image object contains the picture’s metadata and a picture URL. With this data, we can display the images together with the date they were taken.
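Although the frontend is written in JavaScript, the search request can be sketched in Python for consistency with the rest of this report (the helper name is ours; the filter fields follow the Photos Library API):

```python
PHOTOS_SEARCH_URL = "https://photoslibrary.googleapis.com/v1/mediaItems:search"

def build_search_request(access_token, page_size=50):
    """Assemble body and headers for a mediaItems:search call that asks
    for photos labeled as receipts, including archived media."""
    body = {
        "pageSize": page_size,
        "filters": {
            "contentFilter": {"includedContentCategories": ["RECEIPTS"]},
            "mediaTypeFilter": {"mediaTypes": ["PHOTO"]},
            "includeArchivedMedia": True,
        },
    }
    headers = {"Authorization": "Bearer " + access_token}
    return body, headers
```

The body would then be POSTed to PHOTOS_SEARCH_URL with the headers attached; the access token from the sign-in step is all the authorization the call needs.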
Now, after deselecting all unwanted pictures, the user presses the sum-up button. This leads to one call per image to our cloud function, executed in parallel, in which we send the id and the image URL. Since the cloud function handles each request independently, the evaluation of all images only takes as long as the longest single evaluation. From the cloud function, we receive the extracted sum, the date shown on the receipt, and the id we sent with the request, so we can assign the received data back to the corresponding image. We then show that data in our application and sum up all the results. In our table listing of the image data, the sum field is an editable input field, so if no sum or a wrong sum was detected, one can manually enter the correct value.
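This fan-out pattern, where the total wall time roughly equals the slowest single call, can be sketched like this (a Python stand-in for the parallel requests the JavaScript frontend would fire):

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_all(image_urls, evaluate):
    """Run one evaluation per image in parallel and keep the input order,
    so each result can be matched back to its image."""
    with ThreadPoolExecutor(max_workers=len(image_urls) or 1) as pool:
        return list(pool.map(evaluate, image_urls))
```

`evaluate` would be the function that calls our cloud endpoint for one URL; `pool.map` preserves order, which is why the results can be assigned back to the corresponding images.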
Deployment and Delivery
The deployment process consists of three stages. Once new code is pushed to a specific branch of the GitLab repository, GitLab triggers a webhook which notifies an IBM Cloud pipeline that a new build can be made. The build step is required because the raw source code cannot be served as-is; it first needs to be bundled with Webpack, which happens inside a Node.js Docker container created by the pipeline. Once the build finishes successfully, the final stage of the pipeline takes over: the actual deployment and delivery. It moves the newly built files to a publicly accessible bucket (http://swai.eu-de.mybluemix.net/) configured for serving static files on IBM’s Cloud Foundry stack.
A screenshot of the application, which was developed within the scope of this project, can be viewed below:
This project has been a great introduction for the entire team into the world of cloud services and workflows. With the cloud, we had easy access to readily available AI models for parsing receipts, so we could concentrate our energy on understanding them. We also did not have to worry about any server infrastructure, neither for the backend service nor for the frontend application; scaling up or down with demand was all taken care of. The cloud also greatly optimized the development cycle with the ability to automate testing, build and delivery processes.
What about the future of the project?
For the team, the project was first of all a prototype to get more familiar with the cloud, and we are all satisfied with what we achieved and what we learned along the way. If we were to continue the project, these are the areas we would focus on. On the client side, we would like to offer applications for more platforms, such as native mobile applications for Android and iOS, and offer more ways to import photos: from other photo clouds, directly from the device camera, or uploaded from the file system. We have also thought about offering more ways to analyze users’ purchasing behavior, as well as different possibilities to export those statistics. On the backend, we plan to optimize the parsing algorithm, making it work with a larger variety of receipts in different languages, as well as extracting more data from the Vision API. Some improvements could also be made to the testing and deployment process, for example by applying a Blue-Green deployment workflow.