
Scaling an AI Transcription Model as a Service

ns144, Hannes Koksch

Ton-Texter is a Software-as-a-Service solution that transcribes audio and video files and delivers transcripts in different file formats, with filmmakers as our target audience. For transcription we use models based on OpenAI’s Whisper, and we additionally perform speaker diarization with Pyannote, which lets us deliver state-of-the-art transcription quality. In this blog post, we will explore how we improved its scalability to handle high demand.

Our existing implementation is based on a Next.js frontend with a dashboard where users upload their files and download the corresponding transcript files. The underlying transcription infrastructure runs on AWS, is defined as infrastructure as code with Terraform, and is deployed via GitHub Actions. The central component of our transcription pipeline is an EC2 machine, a cloud compute instance, with a capable GPU, which our AI models require. The main challenge of our initial implementation was to find “serverless”-like GPU power that scales to zero to avoid permanent running costs. Existing solutions like Runpod.io did not meet this criterion. Instead, we built our own solution: an EC2 instance that is launched whenever there is demand and stopped once all tasks are processed. We initialized the EC2 instance with a so-called user data script, a script that is executed when the instance is first created, to install all required packages and to pull our Python transcription application from its GitHub repository.

In general, this solution worked fine, but it reached its limits under high demand.

Improving Scalability: Our Road Map

Throughout this project our goal was to increase the scalability and performance of our transcription pipeline. Instead of using a single machine, we opted for an AWS Auto Scaling group, described in detail below, that scales the number of machines depending on demand. We visualized the transcription flow in the following figure:

After a user uploads an audio file, a database entry is created and the Lambda function responsible for scaling the Auto Scaling group is called. Depending on the demand, the number of machines in the Auto Scaling group is adapted. Each machine calls the Next.js endpoint to fetch new transcription tasks. If no further tasks are in the queue, the machine calls another Lambda that shuts down that specific machine and decreases the desired number of machines in the Auto Scaling group accordingly. The following figure shows the most important components of our updated infrastructure:

With the introduction of the Auto Scaling group, we intended to improve the performance of our transcription pipeline significantly. To test our application’s performance, we created 37 audio files with random lengths and a total length of 10 hours. We set ourselves the ambitious goal of processing these 10 hours of audio in just 10 minutes.

To verify and compare the performance of our new architecture against the old one, we implemented centralized logging and set up a load-testing tool before introducing the new architecture, so that we would have comparable metrics for both.

To get centralized logs for all our components we introduced AWS CloudWatch, and to create an automated load test of our application we used Apache JMeter.

Throughout this blog we will further elaborate on the technical decisions we made and the challenges we encountered that led to the final architecture and implementation.

Centralized Logging and Metrics in AWS CloudWatch

Ton-Texter consists of many different components. To get an overview of what’s happening, it is important to have one central place for logs and all important metrics. Therefore, we use AWS CloudWatch, as it offers options to query and filter logs, customize dashboards, integrate custom metrics, and ingest log streams from different sources with ease. Solutions like Grafana offer greater customization options, but for our use case the capabilities of CloudWatch were sufficient.

At the time, only the logs of our Lambda functions for starting and stopping machines were visible in CloudWatch. We had to find solutions to add the logs of our Next.js functions, our EC2 instances, and the Python application running on them to CloudWatch. Additionally, we wanted metrics for the CPU and GPU usage of our EC2 machines to make sure they are used effectively by our transcription application.

ProcessID to Keep Track of a Process Across All Components

It is not only a challenge to collect all logs in one place, but also to relate them to each other. To be able to track one transcription process across the entire infrastructure, we introduced a process id that is part of all logs related to that process. This way, we can query our logs with the process id and get all logs relevant to a specific process.

Our process id consists of the already existing user id and transcript id. Since both of these are unique, the process id is guaranteed to be unique as well.

The user id and transcript id were already available in the EC2 Python application and the Next.js backend functions because they are part of the transcript object, so we only needed to pass this information to the Lambda function that starts instances.

Collecting Logs and Metrics with the CloudWatch Agent

The AWS CloudWatch agent comes preinstalled with most AWS AMIs. We use it to collect logs and metrics from our EC2 machines. We create a CloudWatch agent configuration file in the user data script that is executed at the initial startup of the EC2 machine.

Pass EC2 Logs to CloudWatch

The most important part of our transcription pipeline is running on EC2 machines. On our machines, we have three separate log files:

  • init.log: Log file of the user data script that is executed on the first launch of the machine to install packages, pull our Python application from GitHub, and set up the autostart script that launches the application when the machine is rebooted.
  • startup.log: Log file of the autostart script that pulls or updates the application and launches it.
  • Python-Transcription-Application.log: Log file of the Python transcription application itself.

The AWS CloudWatch agent can track changes in local log files and pass these changes on to CloudWatch logs. In CloudWatch, logs can be summarized in log groups. We decided to create separate log groups for the three individual files / log streams. The following image shows the log groups in CloudWatch:

Before we collected these logs in CloudWatch, we had to connect to the machine via SSH to understand what happened during initialization, startup, or the execution of our Python application. Having the logs in CloudWatch simplified our debugging process significantly.
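For reference, the logs section of such an agent configuration could look roughly like the following. Our user data script is a shell script that writes a JSON file; for consistency with the other snippets in this post we express it as a small Python program here, and the file paths and log group names are placeholders rather than our exact values:

```python
import json

# Illustrative sketch of a CloudWatch agent config that ships three local log
# files to separate CloudWatch log groups; paths and group names are placeholders.
agent_config = {
    "logs": {
        "logs_collected": {
            "files": {
                "collect_list": [
                    {
                        "file_path": "/var/log/init.log",
                        "log_group_name": "ec2-init",
                        "log_stream_name": "{instance_id}",
                    },
                    {
                        "file_path": "/var/log/startup.log",
                        "log_group_name": "ec2-startup",
                        "log_stream_name": "{instance_id}",
                    },
                    {
                        "file_path": "/opt/app/Python-Transcription-Application.log",
                        "log_group_name": "transcription-app",
                        "log_stream_name": "{instance_id}",
                    },
                ]
            }
        }
    }
}

# The agent reads its configuration from a JSON file, e.g. the common path
# /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json.
with open("amazon-cloudwatch-agent.json", "w") as f:
    json.dump(agent_config, f, indent=2)
```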

The Introduction of Custom Metrics to CloudWatch

GPU usage is not part of the standard AWS metrics for EC2 machines, but it is the most important metric for us: it is crucial to see whether the machine learning models of our application are actually running on the GPU, as they would be far slower on the CPU due to a PyTorch misconfiguration. Additionally, we want to track how efficiently we are using the hardware of our EC2 instances. Because the runtime of a machine can be just a minute or two, we set the metrics_collection_interval to 10 seconds to get a detailed picture of GPU usage in the different stages of our transcription process.
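The CloudWatch agent can collect NVIDIA GPU metrics through its nvidia_gpu plugin, provided the NVIDIA driver is installed on the instance. A sketch of the corresponding metrics section of the agent configuration, with the 10-second interval mentioned above (the selection of measurements is illustrative):

```python
# Sketch of the metrics section of the CloudWatch agent config; the nvidia_gpu
# plugin reads GPU statistics from the NVIDIA driver on the instance.
metrics_config = {
    "metrics": {
        "metrics_collected": {
            "nvidia_gpu": {
                # Collect every 10 seconds to resolve the short-lived stages
                # of a transcription run.
                "metrics_collection_interval": 10,
                "measurement": [
                    "utilization_gpu",     # GPU core utilization in percent
                    "utilization_memory",  # GPU memory controller utilization
                    "memory_used",
                    "memory_total",
                ],
            }
        }
    }
}
```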

The following graph shows both the GPU utilization and the GPU memory utilization across the transcription of two large audio files:

We can see that the application uses 100 percent of the GPU while performing speaker diarization and transcription. The graph also clearly shows the difference between the first step, speaker diarization with Pyannote, which uses 100 percent of the GPU’s memory, and the second step, transcription with Whisper, which uses around 50 percent of the memory. This confirmed that our application takes full advantage of the g4dn.xlarge instance and its NVIDIA T4 GPU.

How We Retrieve Useful Information from Our Logs

With Logs Insights, which is integrated in AWS CloudWatch, we can query the logs of selected log groups. As mentioned at the beginning of the article, all of our logs contain the process id, the combination of the user id and transcript id.

For our load tests we use a specific test user. To get the logs related to a test run, we can query for all log messages containing the test user’s id. The following snippet shows the query we used in Logs Insights:
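As a runnable illustration, not necessarily our exact query, the same filter can also be executed from Python with boto3; the log group name and test user id are placeholders:

```python
import time
import boto3

logs = boto3.client("logs")

TEST_USER_ID = "test-user-123"  # placeholder id of the load-test user

# Query all messages of the test user, newest first; the Logs Insights query
# string is the same one we would use interactively in the CloudWatch console.
query = logs.start_query(
    logGroupNames=["transcription-app"],  # placeholder log group
    startTime=int(time.time()) - 3600,    # last hour
    endTime=int(time.time()),
    queryString=(
        "fields @timestamp, @message "
        f"| filter @message like /{TEST_USER_ID}/ "
        "| sort @timestamp desc"
    ),
)

# Poll until the query has finished, then print the matching messages.
while (result := logs.get_query_results(queryId=query["queryId"]))["status"] in ("Scheduled", "Running"):
    time.sleep(1)
for row in result["results"]:
    print({field["field"]: field["value"] for field in row})
```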

In addition to that, Logs Insights automatically detects patterns within the log messages of the query result. By filtering the results based on these patterns, we can get an overview of the different steps in our process, for example when the individual transcriptions are done:

Collect Additional Metrics from Next.js

To provide insight into the queue of transcripts waiting to be transcribed, we added the queue length and the total audio duration of all files in the queue as additional numeric metrics that are sent from Next.js to CloudWatch each time a new audio file is uploaded or processed. So this time we didn’t log a message, just two numbers and their metric names.
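Our Next.js backend publishes these values with the AWS SDK for JavaScript; sketched in Python for consistency with the other snippets, the equivalent call looks roughly like this (namespace and metric names are illustrative):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def report_queue_metrics(queue_length: int, queued_audio_seconds: float) -> None:
    """Publish the current queue state as two custom CloudWatch metrics."""
    cloudwatch.put_metric_data(
        Namespace="TonTexter/Transcription",  # placeholder namespace
        MetricData=[
            {"MetricName": "QueueLength", "Value": queue_length, "Unit": "Count"},
            {"MetricName": "QueuedAudioDuration", "Value": queued_audio_seconds, "Unit": "Seconds"},
        ],
    )

# Called whenever a file is uploaded or a transcription finishes, e.g.:
# report_queue_metrics(queue_length=3, queued_audio_seconds=2700.0)
```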

Custom Dashboard

All of the collected and logged metrics come together in a custom CloudWatch dashboard we created. CloudWatch offers a variety of ways to visualize data in different graphical, numerical, or mathematical representations, of which we mainly used the line chart. A combination of these gave us an overview tailored to our project, with queue info, Auto Scaling group and Lambda info, and GPU and memory utilization all at a glance. It is a good place to keep an eye on during a normal business day to make sure everything is running as smoothly as it should.

How We Scale – Auto Scaling Group and Scaling Algorithm

The now often-mentioned AWS Auto Scaling group can scale horizontally by starting and stopping duplicates of a machine. Getting started was fairly easy: all you have to do is provide a launch template (which we will get to in a moment) on which the created machines are based, as well as a minimum and maximum group size.

Setup Auto Scaling Group

As with all of our AWS infrastructure, we used Terraform to define the Auto Scaling group.

For this project, we chose to allow scaling between 0 and 10 machines. The actual number of machines active in the Auto Scaling group can then follow scaling methods such as dynamic on-demand, scheduled, or manual scaling. A typical dynamic on-demand scaling target might be CPU utilization, which should remain at 50 percent, for example. If this target can’t be met, the Auto Scaling group spins up another machine.

However, since the important resource in our case is the GPU during transcription, and it is always fully utilized by the AI transcription model, as described above, such metric-based scaling was not an option for us. Therefore, we chose the manual scaling approach.

Setup Start and Stop Lambda

The basis for manual scaling is the desired capacity of the Auto Scaling group, which can be manipulated using Lambdas, for example. As mentioned earlier, we used Lambdas to start and stop machines. This allowed us to take advantage of the machine management of the Auto Scaling group while maintaining control over the scaling logic.

With a call to the stop Lambda, a machine that did not receive a new transcription task simply initiates its own shutdown, and the desired capacity is reduced by 1. The start Lambda, which is responsible for scaling up, has slightly more complicated logic.

When invoked, the start Lambda calculates a new desired capacity from the number of audio files in the queue and their total duration, the metrics described earlier. Specifically, the new desired capacity is the minimum of the total duration divided by 15 minutes and the number of files. If there is only one file, multiple workers wouldn’t make sense, since we aren’t splitting long audio files into multiple parts. The 15-minute threshold was chosen because of the real-time factor of the transcription: with the models we chose, it takes about a minute to transcribe 15 minutes of audio, and it also takes about a minute to spin up another machine.
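Condensed into code, the core of the start Lambda could look like the following sketch; the group name, function names, and the exact clamping are illustrative:

```python
import math
import boto3

autoscaling = boto3.client("autoscaling")

ASG_NAME = "transcription-asg"   # placeholder Auto Scaling group name
MAX_MACHINES = 10                # upper bound of our Auto Scaling group
MINUTES_PER_MACHINE = 15         # ~1 min of transcription per 15 min of audio

def scale_up(queue_length: int, queued_audio_minutes: float) -> int:
    """Compute and apply the desired capacity for the current queue state."""
    if queue_length == 0:
        return 0  # nothing to do; scaling down is handled by the stop Lambda
    # One machine per 15 minutes of queued audio, but never more machines
    # than files, since a single file is not split across workers.
    desired = min(math.ceil(queued_audio_minutes / MINUTES_PER_MACHINE), queue_length)
    desired = min(desired, MAX_MACHINES)
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=ASG_NAME,
        DesiredCapacity=desired,
        HonorCooldown=False,
    )
    return desired
```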

This resulted in our scaling algorithm, which can be better understood using this graphic:

Problem with Shrinking Capacity

One strange issue we ran into was that the stop Lambda, when called, would cause another machine to shut down due to the reduction in desired capacity. Although we explicitly used a function of the AWS SDK that should shut down a specific machine and reduce the desired capacity by one, lowering the capacity did not act on the machine we originally wanted to shut down, but instead triggered the shutdown of yet another machine.

Because the function immediately decreases the desired capacity of the Auto Scaling group, the group may attempt to shut down another machine while the instance we intend to shut down is still in the running state. A good solution to this problem was the scale-in protection of the Auto Scaling group, which prevents machines from being shut down just because the desired capacity was lowered.
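We will not claim this is the exact SDK call we used, but with boto3 the combination of terminating one specific worker, decrementing the desired capacity, and protecting the remaining machines from scale-in looks roughly like this (scale-in protection can also be enabled for all new instances at the group level):

```python
import boto3

autoscaling = boto3.client("autoscaling")

ASG_NAME = "transcription-asg"  # placeholder Auto Scaling group name

def protect(instance_id: str) -> None:
    """Mark a worker as protected from scale-in events of the group."""
    autoscaling.set_instance_protection(
        InstanceIds=[instance_id],
        AutoScalingGroupName=ASG_NAME,
        ProtectedFromScaleIn=True,
    )

def stop_machine(instance_id: str) -> None:
    """Terminate one specific worker and shrink the desired capacity by one.

    Scale-in protection does not block this explicit termination; it only
    stops the group from picking a different, still-working instance when
    the desired capacity drops.
    """
    autoscaling.terminate_instance_in_auto_scaling_group(
        InstanceId=instance_id,
        ShouldDecrementDesiredCapacity=True,
    )
```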

Automated Launch Template Creation

As mentioned in the introduction of this blog post, we previously used a single EC2 machine for our transcription, which we deployed automatically via Terraform and initialized via the user data script. To create an Auto Scaling group based on this EC2 instance, we initially created an AMI manually from a snapshot of our machine and then manually created a launch template based on that AMI.

Since our goal is to automate the deployment of our entire infrastructure so that we can make changes to our transcription pipeline efficiently, we had to find a way to automate the creation of the launch template used by our Auto Scaling group.

For simplicity’s sake, we decided to automate the creation of our AMI and launch template based on our already existing automated EC2 creation. In future work we could compare our solution to Amazon EC2 Image Builder, Amazon’s solution for creating images, or Packer, which enables the creation of images for multiple cloud providers.

The migration to Image Builder would be straightforward, as the same user data script that we use to set up our machine could be used in the initialization of the image. An advantage could be the integration of tests directly in the image building pipeline.

The migration to Packer could be more difficult, but as our user data script is simply a list of shell commands, we could use a provisioner, as described in Provision | Packer, to set up the packages and our Python application on the image. The advantage of Packer would be platform independence, which could be crucial if we planned or had to move to another cloud provider in the future.

Launch Template Creation

An advantage of our approach, however, is that we can test our entire application on the exact EC2 instance type we will use in our Auto Scaling group. As displayed in the following image, we initialize our EC2 instance with the user data script, which installs all packages, pulls our Python transcription application from GitHub, and creates an autostart cron job that launches our application whenever the machine is started. After the installation and configuration, we run our Python application tests. Only if they pass do we invoke our Lambda function to create an AMI and launch template based on the machine; otherwise we log the error to CloudWatch.

Our build-image Lambda then executes the following steps (sketched in code below the list):

  • Shut down the EC2 instance and wait until the machine is fully stopped: this is necessary because data integrity is otherwise not guaranteed, and we experienced corrupted AI model files when we did not wait for the shutdown.
  • Create a Snapshot and an AMI based on the machine
  • Create or update our launch template and its default version number
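A hedged boto3 sketch of these three steps follows; the launch template is assumed to already exist (its initial creation is omitted), and the template and AMI names are placeholders:

```python
import time
import boto3

ec2 = boto3.client("ec2")

LAUNCH_TEMPLATE_NAME = "transcription-worker"  # placeholder template name

def build_image(instance_id: str) -> str:
    """Stop the build instance, bake it into an AMI, and update the launch template."""
    # 1. Stop the instance and wait until it is fully stopped; otherwise the
    #    snapshot may contain corrupted model files.
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    # 2. Create an AMI from the stopped instance (this also creates the
    #    underlying EBS snapshot).
    ami = ec2.create_image(
        InstanceId=instance_id,
        Name=f"transcription-worker-{int(time.time())}",
    )
    ec2.get_waiter("image_available").wait(ImageIds=[ami["ImageId"]])

    # 3. Add a new launch template version based on the AMI and make it the default.
    version = ec2.create_launch_template_version(
        LaunchTemplateName=LAUNCH_TEMPLATE_NAME,
        LaunchTemplateData={"ImageId": ami["ImageId"]},
    )["LaunchTemplateVersion"]["VersionNumber"]
    ec2.modify_launch_template(
        LaunchTemplateName=LAUNCH_TEMPLATE_NAME,
        DefaultVersion=str(version),
    )
    return ami["ImageId"]
```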

We can then simply reference our automatically created launch template via its name in the Terraform setup of our Auto Scaling group and configure it to always use the latest available version of the launch template.

Python Testing

Before introducing a new version of our transcription application to our live system, it is important to ensure that it runs transcriptions successfully. Therefore we created Python tests that cover all important parts of our application: file writing, file uploads and downloads, helper functions, and an entire transcription process, including speaker diarization with Pyannote as well as transcription with Whisper. The only functions we did not test were those that directly interact with our Next.js live system.

An advantage of testing an entire transcription is that the required AI models are downloaded from Hugging Face during the test and therefore become part of our AMI; otherwise they would have to be downloaded before transcription on every machine launched by our Auto Scaling group. The following chart shows the code coverage of our Python tests:

A Challenge with Slow Snapshot Restoration

So far so good, but when we used the launch template in our Auto Scaling group, transcription suddenly became incredibly slow. Thanks to our CloudWatch logs we were able to track the issue down to the point where our AI models are initialized, and we figured out that our snapshot is kept in cold storage, which leads to very slow read and write speeds when data is restored from the snapshot. One option AWS offers is fast snapshot restore, which comes at a cost of around $540 per month for a single snapshot in a single region. Even after intense discussions with our managers, we were unfortunately not able to get the budget, so we had to find another solution. 🙃

An alternative option is to prewarm the entire volume when a machine is initialized, which means that all data is first retrieved from cold storage. Unfortunately, this took more than 50 minutes, as shown in the following image, making it unsuitable for our intended use case.

Finally, we managed to get around the problem by increasing our machines’ EBS storage throughput and IOPS, which also increases cost, but is only billed as long as our machines are running.
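One place to express this is the block device mapping of the launch template; the device name and the numbers below are illustrative rather than our exact values:

```python
# Illustrative block device mapping for the launch template data: gp3 volumes
# allow IOPS and throughput to be provisioned independently of volume size.
block_device_mappings = [
    {
        "DeviceName": "/dev/xvda",
        "Ebs": {
            "VolumeType": "gp3",
            "VolumeSize": 100,   # GiB
            "Iops": 6000,        # above the gp3 baseline of 3000
            "Throughput": 500,   # MiB/s, above the baseline of 125
        },
    }
]
```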

Load Testing with Apache JMeter: Time to Verify Our Goals

As described at the beginning, we needed a way to compare the old infrastructure with the new one to see whether we could meet our self-imposed goal of transcribing 10 hours of audio in just 10 minutes. We had split the 10 hours of audio into 37 files ranging from 30 seconds to 30 minutes to cover the range of lengths we would naturally encounter in production. And since we didn’t want to upload 37 files one at a time through the Ton-Texter dashboard UI, and since we wanted to keep the tests identical and comparable, doing it manually was out of the question.

So we needed a tool that could automatically perform this load test for us. Although there were many alternative tools that might have had a more user-friendly interface, a simpler workflow, and better debugging support, such as Locust or Artillery, we decided to go with Apache JMeter because we wanted to try an established and robust piece of Apache Foundation software. 

In JMeter, we had to programmatically recreate a test user’s login flow and call all the right endpoints in the right order, just as our application code would normally do to log in, create tasks, upload files, and trigger the transcription process. While some things took quite a while to figure out or required workarounds, the file upload itself was quite simple using a CSV file. The tricky parts were programmatically figuring out the correct tRPC login action to send the login request to, and dealing with Next.js not returning a location header on an HTTP 301 response, which resulted in a JMeter error and forced us to handle the cookies manually.

Comparison of Ton-Texter’s Performance Before and After Scaling

Looking at the before-and-after comparison in our CloudWatch dashboard, we were able to do in 10 minutes what used to take over an hour. In the two CloudWatch screenshots below, we can see when transcriptions were completed: first the slow sequential processing of the old setup, and below it the much faster parallel processing with the new infrastructure.

We can conclude that we indeed managed to meet our self-imposed goal and were able to prove that our new infrastructure based on the Auto Scaling group far exceeds the performance of our previous implementation!

Quick Regression Test to Verify New Deployments

To incorporate an end-to-end regression test into our development routine, we implemented another, smaller JMeter test that uploads only one audio file instead of 37 and waits a maximum of five minutes for its successful transcription. Because the JMeter tests are kept in a separate repository, this test can be included in multiple places, such as GitHub Actions. Even though the test was created in the JMeter UI, we were able to easily run it via the JMeter CLI. On every push to the main branch of our Terraform repository we redeploy our AWS infrastructure. To always check whether the new deployment works as expected, we introduced the JMeter test as a second step in this pipeline, so that a failed test shows up directly as a failed action in the GitHub UI.

Edge cases – the Introduction of a Heartbeat

To make our transcription pipeline more robust and bring it closer to market maturity, we thought about how to deal with edge cases that might lead to failed transcriptions. In theory, a failure during transcription should always be caught by our Python application, and the state of the transcription should be set to “FAILED” with an API call to Next.js. But what if the transcription is interrupted for some reason, for example because of an unexpected shutdown of the EC2 machine?

With the existing implementation, the transcription would simply remain in the processing state forever. To counteract these edge cases we introduced a so-called “heartbeat”, which is basically another timestamp entry in our transcript database that is updated every 5 seconds while a machine is working on the specific transcript. This way, we have an indicator for unhealthy transcripts: if the heartbeat timestamp is significantly older than 5 seconds, the processing appears to have been interrupted.

Not just a Heartbeat – the Introduction of Progress Metrics in Python

To send a heartbeat every 5 seconds to our Next.js application, we implemented a separate thread in our Python application. Since we were sending requests anyway, we figured we could use the opportunity to also transfer progress updates and improve the user experience. For this, we had to find a way for our main thread and heartbeat thread to exchange information. We chose a JSON file that contains the current transcript id, state, and progress of the transcription; it is updated by our main thread and consumed by our heartbeat thread, which propagates the information to Next.js with its API calls.
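Stripped down to its essentials, the heartbeat thread could look like the following sketch; the endpoint URL, file path, and payload shape are illustrative, and error handling is omitted:

```python
import json
import threading
import time

import requests

PROGRESS_FILE = "progress.json"                       # written by the main thread
HEARTBEAT_URL = "https://example.com/api/heartbeat"   # placeholder Next.js endpoint
HEARTBEAT_INTERVAL = 5                                # seconds

def heartbeat_loop(stop_event: threading.Event) -> None:
    """Periodically forward the current transcription progress to Next.js."""
    while not stop_event.is_set():
        with open(PROGRESS_FILE) as f:
            progress = json.load(f)  # e.g. {"transcriptId": ..., "state": ..., "progress": ...}
        requests.post(HEARTBEAT_URL, json=progress, timeout=5)
        stop_event.wait(HEARTBEAT_INTERVAL)

# Started by the main thread before transcription begins:
stop_event = threading.Event()
threading.Thread(target=heartbeat_loop, args=(stop_event,), daemon=True).start()
# ... main thread runs diarization and transcription, updating progress.json ...
# stop_event.set()  # signal the heartbeat thread to stop once processing is done
```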

When and How the Heartbeat is Checked

Because the receiver of the heartbeats, our Next.js application, is hosted in a serverless environment, a cron-job approach for finding unhealthy transcripts that haven’t received a heartbeat in a long time would have been impractical. Instead, we run a health check whenever a machine finishes a transcription and asks for a new transcription task. Before the next task in the queue is returned, the request handler on the Next.js side checks for unhealthy transcripts and puts any that are found back into the queue.
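The actual check lives in our Next.js request handler; reduced to the decision logic (the 60-second timeout is an assumption, the real threshold only needs to be comfortably larger than the 5-second heartbeat interval), it amounts to:

```python
from datetime import datetime, timedelta, timezone

# Assumed timeout: heartbeats arrive every 5 seconds, so anything much older
# indicates an interrupted transcription.
HEARTBEAT_TIMEOUT = timedelta(seconds=60)

def is_unhealthy(last_heartbeat: datetime, now: datetime | None = None) -> bool:
    """Return True if a transcript in the processing state looks abandoned."""
    now = now or datetime.now(timezone.utc)
    return now - last_heartbeat > HEARTBEAT_TIMEOUT
```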

But what if the last machine running is interrupted and there is no machine left asking for new transcripts?

To also catch this remaining edge case, we introduced an extra health check call as soon as the Auto Scaling group is scaled back to 0, to ensure that at this point all transcripts have been processed correctly. If that is not the case, a machine is launched to process the remaining transcription tasks.

Improved User Experience

The regular health check and progress updates also allowed us to improve the user experience in the Ton-Texter application dashboard by displaying a progress bar. 

Now that our infrastructure is capable of processing multiple transcription tasks at the same time, and thanks to the introduced progress bar, we can see parallel progress in our dashboard!

Conclusion

To summarize, we learned a lot of new things while implementing our new architecture, whether it was the usefulness of centralized logging, how to deal with AWS launch templates, how to plan and execute a load test, or simply how to think about edge cases in advance.

Thanks to the, perhaps unintuitive, approach of implementing logging and testing ahead of the actual changes, we were able to compare our two infrastructures and prove that we made significant improvements to our transcription pipeline.

One important takeaway for us, though, is that we built a valid general-purpose architecture for scaling AI models that could be used in one way or another for a variety of use cases in applications that require scalable GPU power.
