Audio in the form of podcasts has become increasingly popular as a digital medium in recent years. In the U.S., one in two people have already consumed podcasts and 32% are monthly listeners. In Germany, one in five listens to a podcast at least once a month. In addition to the podcast phenomenon, which has existed and established itself in the market for some time, a new medium is now blossoming – social audio. Unlike podcasts, which are a one-way broadcast medium (they are recorded to be listened to in the future), this is real-time audio. Listeners can actively participate in the dialogue.
Social audio apps like Clubhouse seem to benefit from several factors. Headphones and earbuds have become ubiquitous; Bluetooth-based offerings, like Apple’s AirPods, have built-in microphones, making communication easy and intuitive. The podcast boom also appears to be a driving factor. Audio-based content has an enormous reach, and users are now accustomed to audio formats. Last but not least, digital voice communication has become commonplace, driven by voice messaging and phone calls via messenger platforms like WhatsApp, among others.
The COVID-19 pandemic also seems to have played its part, as people began to experience Zoom fatigue after only a short time. On the one hand, text messages simply don’t convey the emotions and nuances that human communication requires, especially during isolation. On the other hand, Zoom calls and video calls are simply too exhausting and too demanding for us humans in the long run. For example, we mainly look at our own faces to make sure that the vegetables from lunch are not still stuck between our teeth. In general, this can be summarized as the camera effect: many people feel inhibited and do not dare to speak if they feel they are being watched.
Currently, there are more than 40 companies or apps that are active in the field of “social audio”. The largest include Clubhouse, Twitter Spaces, Facebook Live Audio, Spotify Greenroom and Reddit Talk. In June 2020, Discord announced a new tagline, “Your place to talk,” in an attempt to make the service seem less gamer-centric and to capitalize on people’s need to connect during the pandemic. In response to the hype around Clubhouse, a new feature has now also been implemented that further supports this new direction: Discord Stages. Other notable mentions are Fireside, Cappuccino and Angle Audio. Interestingly enough, Angle Audio recently switched to a SAaaS (Social Audio as a Service) offering.
ProductHunt reported a significant increase in new audio products for 2020, explaining that face-to-face encounters are currently becoming more difficult to maintain due to social distancing, but social connections and exchange are needed more than ever.
Since audio as a medium is still a niche market in organizations, we at InspireNow wanted to find out whether this hype, this success in terms of audio-based content, can also find resonance in the business and professional environment. We perceive great potential, especially in terms of knowledge transfer.
That is why we have decided to delve deeper into the topic by means of an innovation project. The overall goal of this innovation project was to evaluate whether Enterprise Social Audio works, i.e. whether the added value and appeal of social audio can also be used in the context of companies, institutions and organizations. This was evaluated in the context of our existing SaaS solution InspireNow.
If you're looking for a step-by-step guide on how to include and implement social audio in your own mobile app, I'm afraid this blog post is not for you. Rather, I'm going to give a bird's-eye view of the topic itself and describe the procedure of analyzing requirements for the enterprise context. I'll describe some challenges encountered during development and provide insights into security, GDPR and compliance concerns.
In the first step, a feasibility study was carried out on the basis of a proof of concept in order to obtain initial certainty for the technical viability.
After validating the technical viability, a 3-phase question model was then used in order to develop an initial concept. Existing applications in the social audio field were analyzed based on the following questions:
- What is good and can be adopted directly for the enterprise context?
- What can be adapted and tailored for enterprise?
- What doesn’t fit at all – what doesn’t make sense in the enterprise and organizational context?
With those results, we conducted interviews with several customers and contacts from the business world. The gathered feedback was then used to further refine the concept. The goal was to get a first grasp of the answers to the research questions and potentially identify further, previously unknown concerns and needs.
By working closely with the customer and obtaining continuous feedback, the prototype was incrementally developed up to the point of a minimum viable product (MVP).
State of Research in the Field
Some products already exist that are geared more towards work-related audio communication. Worth mentioning here are Tandem, Yac, Walkie and Watercooler. However, their focus is mostly on pure efficiency and productivity enhancement, improvement of collaboration and optimization of meetings.
Due to the maturity of some of these products, it naturally made sense to apply the 3-phase question model in order to benefit from the learnings already made by these companies and to incorporate as many insights as possible into the development of the MVP.
Status of Own Research
As a basis for decision-making for this innovation project, discussions have already been held with several customers and contacts from the business community. In these conversations, a clear interest in an audio functionality similar to Clubhouse could be identified and confirmed.
Preliminary research was conducted to assess technical feasibility and the real-time engagement platform Agora.io was identified as promising. Research work related to existing audio drop-in apps additionally confirmed that Clubhouse itself uses Agora to provide audio functionality.
- Is audio at all interesting for knowledge management and communication in organizations?
- Can audio be used for further education, innovation and also everyday discussions?
- Are there certain concepts or use cases that seem particularly promising?
- Panel discussions?
- Expert discussions?
- Expert debriefing to enable knowledge transfer?
- “Water cooler” discussions with colleagues?
- Non-binding exchange among colleagues
⇒ while running, in the car, sports, bad hair day, etc.?
- Sales training?
- What should audio functionality look like in the app?
- What do customers want?
- Can audio lead to greater inclusion?
→ in principle, there are many reasons why people avoid video.
- How can audio be made as accessible as possible?
⇒ live captions, etc.
- Which features offer the greatest added value?
- Record and Publish
Do organizations want to be able to record broadcasts to make available as podcasts later?
- Live Captions
How important, how in demand is a live transcription of the spoken word into text?
Should the exchange be limited purely to audio or does the additional exchange via text add further value? What might a potential MVP look like?
- Are there certain features or rules that need to be implemented and provided to enable better conversational discipline?
→ some kind of baton – whose turn is it?
- Which concepts have to be considered with regard to moderation, security and privacy?
- Which measures are absolutely necessary from the perspective of GDPR?
- Does GDPR possibly also represent a hurdle that restricts functionality and usefulness?
- To what extent could the use of the platform Agora.io cause problems with regard to data protection?
⇒ According to the researchers, this creates a major privacy problem (Stanford analysis in original). This is because the Chinese government has extensive access to IT companies in the country and can demand the handover of data (Link to Handelsblatt Post).
Planned Goals to Be Achieved
- Feasibility study through prototype creation
The underlying goal is to reduce risk. The prototype additionally serves as an experimental platform for collaborating with customers in an application-oriented manner.
- Evaluation of existing social audio apps & platforms based on the 3-phase question model
By analyzing them, a compilation of all possible features and use cases can be made. This is then used to elaborate potential unique selling points.
- Customer discussions and interviews
A benefit and desire ranking of the elaborated features and use cases will be made. Based on the identified research questions, a quantitative survey will be created and distributed to clients.
- Extraction of the most important features for the MVP
- Further development of the prototype into the MVP
- Evaluation of the MVP through pilot project at a customer site
- Creation of the prototype fails
Solution: research and investigate alternative technologies
(similar to next risk)
- Use of technology agora.io raises privacy and compliance concerns
This is due to potential access or intervention by the Chinese government. Alternatives might be Plivo, a custom implementation using WebRTC (which is in principle unsuitable due to the high overhead and effort), or other solutions like Antmedia, MirrorFly or EnableX.
- Customers have many different opinions, needs and wishes
Therefore creating one clear concept for the MVP is difficult.
Solution: own prioritization of customer opinions, survey of other / additional customers
- MVP will not be ready in time
Solution: using an agile approach, getting frequent and immediate customer feedback, and limiting the scope to the most important features at least helps reduce the risk. Staking out a realistic scope is also very important.
Significance of the Research
Audio is still a niche medium in most companies, but we see great potential – especially for knowledge transfer. The pandemic in particular has shown how far Germany is lagging behind in digitalization.
The successful implementation of Enterprise Social Audio can therefore make an important contribution to the digital ecosystem in Germany and help prepare companies and organizations for the future.
Enterprise Social Audio offers many benefits and opportunities. For example, executives can be made more approachable through audio and podcasts, organizations can effortlessly build up audio databases, and feedback sessions can be easily set up. There is also great potential in the possibility to connect colleagues who work from home and/or in the office.
From virtual fireside chats to expert circles, to launching initiatives – anything is possible.
A total of approximately 35 contacts from the business community were interviewed, mainly from upper management: managing directors, department heads and board members. The size of the companies surveyed was mixed, ranging from 40 to 300,000 employees.
|Training can be held regardless of location, and employees from different sites can attend together.
|Panel discussions captivate the audience because this is where controversial viewpoints and opinions collide.
Whether experts, interested parties or communities of interest – this is the place for other perspectives, new impressions or to form one’s own opinion.
|Information, data and knowledge are multiplying ever faster. This makes the systematic acquisition of qualitative knowledge (internal and external) all the more important.
“Deep Learning / Understanding” as a clear competitive advantage!
|Expert Debriefing / Teachback
|Whether retirement, job change, sabbatical or parental leave – it is not too early to initiate the transfer of knowledge to potential successors.
It saves a lot of time and keeps special and ephemeral things alive.
|Non-committal Exchange with Colleagues
|In the coffee break, “water cooler” conversations, in the car, doing sports, lunch break, …
|Asynchronous Standup Meetings
|In agile projects, each member of the team gives their input via audio. The given input can then be accessed by all team members at the individually desired time.
|Contacts can be invited (within the InspireNow app) to a scheduled broadcast or even a currently running broadcast.
|Invitation Links & Sharing Broadcasts
|An invitation link can also be sent to those who are not users of the InspireNow app. These external people can then participate in the corresponding broadcast via a web browser (so they do not need an account or the app itself).
|In addition to audio, there is the option for those in the broadcast to chat with each other (for example, questions could be asked in writing to the hosts/moderators).
|The spoken content within a broadcast is transcribed in real time and can be accessed in the form of live subtitles (for the hearing impaired, deaf or simply in situations where audio is not possible).
|Record and Publish
|Broadcasts can be recorded and saved as a retrievable audio file. This creates a searchable database of podcasts/recordings that can be filtered based on interests.
Large events or general conversations with added value can thus be retrievable even for those who were not present at the time of the conversation or event.
|Searchable Audio Transcription
|A transcription is available for saved recordings/podcasts. Based on this, the contents of recordings can be quickly searched and filtered textually. For example, the user enters “data protection” or “DSGVO” (the German abbreviation for GDPR) and receives the broadcasts in which these words were mentioned during the conversation.
|Video Option for Hosts & Speakers
|Optionally the hosts and speakers have the possibility to switch on their cameras.
- Twitter Spaces
- Facebook Live Audio Rooms
- Spotify Greenroom
- Discord Stages
- Reddit Talk
- Angle Audio
- and many more …
Focused on work/enterprise
Security, GDPR and Compliance
At the beginning of the innovation project, the importance of data protection and information security in the context of (enterprise) social audio was rated as very high and essential. Since additional compliance concerns arose after the pre-selection of the technology to be used – Agora.io – an explicit research phase for security, GDPR and compliance was scheduled.
Pre-established hypotheses and concerns were:
- The use of agora.io technology raises privacy and compliance concerns (for companies and organizations)
This is due to possible access or intervention by the Chinese government.
Question: To what extent could privacy issues arise from using the agora.io platform?
“Authorities could have recorded discussions.”
⇒ According to the researchers, this creates a major privacy problem (Stanford analysis in original). This is because the Chinese government has extensive access to IT companies in the country and can demand that data be handed over (Link to Handelsblatt Post).
- GDPR preliminary work is important
Audio is a sensitive topic and data protection is key.
- how far can one go?
- what can be done?
- as much as possible must be automated, otherwise usability suffers
- Does GDPR possibly also represent a hurdle that restricts functionality and usefulness?
- end-to-end encryption may reduce the speed of the process
- definitely increases implementation effort
⇒ AccessTokens have to be distributed to all users, the encryption key for symmetric encryption has to be managed and also correctly distributed to all users
To get an impression of the challenges and requirements in the context of social audio apps, security and GDPR issues at similar apps (Clubhouse, etc.) were analyzed. The results are listed below.
Chinese Data-Sharing Concerns
“Further, any unencrypted data that is transmitted via servers in the PRC (People’s Republic of China) would likely be accessible to the Chinese government. Given that SIO observed room metadata being transmitted to servers we believe to be hosted in the PRC, the Chinese government can likely collect metadata without even accessing Agora’s networks.” (Stanford Internet Observatory)
This problem may occur under the following conditions:
A: the traffic is sent unencrypted
B: the traffic is routed through servers in China
What did Clubhouse do?
“backend changes that will boost the service’s encryption and prevent user ID pings from being routed through servers in China” (Source)
What can we do on our end to address these concerns?
- Encryption – end-to-end encryption of audio
- Geofencing – restriction of network traffic to European servers
Transmission of Users’ Unique Data in Plaintext
“SIO has determined that a user’s unique Clubhouse ID number and chatroom ID are transmitted in plaintext, and Agora would likely have access to users’ raw audio.” (Stanford Internet Observatory)
Detailed Description: Joining a channel, for instance, generates a packet directed to Agora’s back-end infrastructure. That packet contains metadata about each user, including their unique Clubhouse ID number and the room ID they are joining. That metadata is sent over the internet in plaintext (not encrypted), meaning that any third-party with access to a user’s network traffic can access it. In this manner, an eavesdropper might learn whether two users are talking to each other, for instance, by detecting whether those users are joining the same channel.
What can we do on our end to address these concerns?
Agora.io has already fixed this vulnerability, which was discovered by McAfee. Since this fix, data is now also transmitted encrypted when initializing or entering a channel.
Therefore this vulnerability doesn’t exist anymore and we don’t have to take any measures.
Access to users’ raw audio is not possible for Agora if we implement the above-mentioned end-to-end encryption.
User Concerns About Audio
Stored audio formats can be abused more easily than reactions made in a video call, for example.
Word fragments or parts of sentences could be cut out, spliced together and then used out of context.
A spoken “yes” captured as audio could, for example, be misused to fake consent to a distance contract.
How did other apps address those concerns?
Clubhouse has a very strict policy on not allowing any audio recording to protect the privacy of its users. Speakers, moderators, and listeners are not allowed to record audio on Clubhouse.
They include the following warning in their terms and conditions: “Recording a Speaker’s audio without their permission is a punishable offense, and depending on your location, you could be charged for the same”.
However, a voice memo or voice recorder app can still be started on the phone to easily record the audio. There are no restrictions or measures to prevent this from happening.
Twitter Spaces seems to want to offer native recording of spaces and audio content soon. However, I haven’t found any info on whether “foreign” recording by voice or screen recorders will be actively prevented.
With Facebook’s “Live Audio Rooms”, conversations can also be converted into a podcast, so they provide native recordings of the conversations as well. Here too, I did not find any information on whether “foreign” recording is prevented.
What can we do on our end to address these concerns?
Unfortunately not much besides adding a descriptive and prohibitive section to our terms and conditions. There’s currently no way to detect whether a user is running an audio or screen recorder app on their smartphone. Screen recording itself can be prevented by blacking out the screen, but this wouldn’t prevent the audio from being recorded, which renders this method useless for our purpose.
Other privacy concerns and faux pas of similar apps are e.g. the need for access to the complete contact list (Clubhouse), the creation of shadow profiles based on this or the non-deletion of user accounts (also Clubhouse). However, those issues do not apply to our product, as we do not require access to the contact list, do not create shadow profiles and allow users to delete their account without any delay.
Conclusion: Security Best Practices
Based on the findings and learnings from the research, the following was implemented.
- Audience members start out with a token for the role RtcRole.SUBSCRIBER
- When they make a request to talk and it is accepted, they first get a new token from the backend with the role RtcRole.PUBLISHER, and only then is the ClientRole in the app changed to Broadcaster
- When degrading back to Audience, the same happens in reverse: they get a new token with the role RtcRole.SUBSCRIBER, then the app switches back to the Audience role
- This ensures that a) only authenticated users are able to join a channel and listen to the audio stream and b) only users with the role Broadcaster are able to publish audio streams
- Further info: https://docs.agora.io/en/Interactive Broadcast/faq/token_cohost
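To make the token flow above more tangible, here is a minimal sketch of role-bound tokens. It is not Agora's token format or API: the Role enum merely stands in for RtcRole, and SECRET, issue_token, verify_token and may_publish are hypothetical names chosen for illustration. The only point it demonstrates is that the role is signed by the backend, so a client cannot promote itself by editing its own token.

```python
import base64
import hashlib
import hmac
import json
import time
from enum import Enum
from typing import Optional


class Role(Enum):
    """Hypothetical stand-in for Agora's RtcRole."""
    SUBSCRIBER = "subscriber"
    PUBLISHER = "publisher"


# Assumption: a signing secret that only the backend knows.
SECRET = b"demo-secret-do-not-use"


def issue_token(user_id: str, channel: str, role: Role,
                ttl_s: int = 3600, now: Optional[float] = None) -> str:
    """Sign a small claim set so the role cannot be changed on the client."""
    issued = time.time() if now is None else now
    claims = {"uid": user_id, "ch": channel,
              "role": role.value, "exp": issued + ttl_s}
    payload = base64.urlsafe_b64encode(
        json.dumps(claims, sort_keys=True).encode())
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + sig


def verify_token(token: str, now: Optional[float] = None) -> Optional[dict]:
    """Return the claims if signature and expiry are valid, else None."""
    try:
        payload, sig = token.rsplit(".", 1)
    except ValueError:
        return None
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None
    claims = json.loads(base64.urlsafe_b64decode(payload.encode()))
    current = time.time() if now is None else now
    if claims["exp"] < current:
        return None
    return claims


def may_publish(token: str) -> bool:
    """Only a valid token carrying the PUBLISHER role may send audio."""
    claims = verify_token(token)
    return claims is not None and claims["role"] == Role.PUBLISHER.value
```

In this sketch, promoting a user simply means calling issue_token again with Role.PUBLISHER after a host has accepted the speak request; the old subscriber token never grants publishing rights.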
Channel Encryption (Symmetric Encryption)
- In our backend an individual 128-bit encryption key is generated for each broadcast and then distributed to all users who want to join this broadcast. If a user is not authenticated or authorized to join the broadcast, he never receives the encryption key
- When joining a broadcast / channel, the encryption key is included to guarantee end-to-end encryption (128-bit AES encryption, GCM mode) right from the start
Requirement for this to work: all users of a broadcast must use the same encryption key
If required, the encryption method can be increased to 256-bit AES encryption, GCM mode (e.g. if due to quantum computers sufficient security can no longer be guaranteed).
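The key handling described above could look roughly like the following sketch. BroadcastKeyStore and its methods are hypothetical names, and the in-memory dictionaries are an assumption standing in for whatever storage the backend really uses. It only covers key generation and distribution; passing the key to the audio SDK is left out.

```python
import secrets
from typing import Dict, Optional, Set

KEY_BYTES = 16  # 128-bit key; 32 bytes would give a 256-bit key


class BroadcastKeyStore:
    """Hypothetical backend component: one random key per broadcast."""

    def __init__(self) -> None:
        self._keys: Dict[str, bytes] = {}
        self._allowed: Dict[str, Set[str]] = {}

    def create_broadcast(self, broadcast_id: str) -> None:
        # secrets uses the operating system's CSPRNG.
        self._keys[broadcast_id] = secrets.token_bytes(KEY_BYTES)
        self._allowed[broadcast_id] = set()

    def authorize(self, broadcast_id: str, user_id: str) -> None:
        self._allowed[broadcast_id].add(user_id)

    def key_for(self, broadcast_id: str, user_id: str) -> Optional[bytes]:
        """Hand out the key only to users authorized for this broadcast."""
        if user_id not in self._allowed.get(broadcast_id, set()):
            return None
        return self._keys[broadcast_id]
```

Every authorized participant receives the identical key for a given broadcast, which matches the requirement that all users of a broadcast share one key.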
- Network traffic is limited to European servers
“Once a customer specifies a region using geofencing, no audio, video, or message can access Agora servers outside that region.”
By analyzing network traffic using Wireshark, it was verified that network geofencing works correctly and also that the vulnerability described above (transmitting data in plaintext traffic) was indeed fixed.
The aim of creating this PoC was a feasibility study of the most important technical requirements. This is intended to reduce the development risk. In addition, the prototype resulting from the PoC serves as an experimental platform for further application-oriented collaboration with customers and for continuously obtaining feedback.
Based on the survey and the analysis of existing and similar apps the following features were extracted as crucial for PoC development and thus implemented.
End-to-End Encryption of Broadcasts
Before joining a broadcast, each user sends a request to the backend to get a token that authorizes joining the broadcast. The encryption key is provided along with it. Using this authorization token and encryption key together ensures that the user is authorized to join the requested broadcast and that, upon joining, all data being sent is end-to-end encrypted.
Only if the auth request succeeds and the provided encryption key is considered valid does the user initialize his connection and join the broadcast.
The entire infrastructure of InspireNow runs on servers within the EU, with the majority even located in Germany. Therefore it was a logical step to make sure that the network traffic of the broadcast functionality is limited to European servers. Agora assures that, once a region is specified by using geofencing, no audio, video or message can access any Agora servers outside that region.
When a broadcast has been joined, it must be possible to minimize the app, use other apps, or even lock the smartphone and leave it in your pocket. In short: audio needs to keep running all the time, even in an inactive state. To achieve this, a foreground service is needed for Android.
A host or speaker can also use the app in background mode and thus continue to speak. Naturally, speaking while the smartphone is in the pocket requires an appropriate headset with a microphone.
Pub / Sub for Events and Messages
The communication between the participants takes place via pub/sub messaging. Each participant subscribes to the channels that are currently relevant for his role.
All messages in regards to broadcast state are sent here. For example if a host/speaker/listener joins or leaves, or if the broadcast has been closed.
Each user has a private channel, in which he listens for messages that are only meant for him personally. For example if he was promoted to speaker or degraded back to audience.
There’s a separate channel for hosts. Sample events are the speak requests of users (since only hosts / co-hosts can accept those) and the withdrawal of them.
The channels for speakers will be used for smart speak requests (automatic promotion and degradation) that will be implemented in the MVP.
Channel for chat messages that are being exchanged during a broadcast (will be implemented in MVP).
This generic and flexible setup makes it easy to add – if needed – further channels in the future.
Another advantage is that role-based authentication and authorization can be handled based on the channels. For example, only hosts can subscribe to the host channel, and no one but the specific user can subscribe to that user’s private channel.
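A small sketch of how such channel-based authorization might be expressed follows. The channel naming scheme, the Role enum and both functions are assumptions made for illustration, not the actual InspireNow implementation. In particular, treating hosts as also subscribed to the speakers channel is a guess.

```python
from enum import Enum
from typing import List


class Role(Enum):
    AUDIENCE = "audience"
    SPEAKER = "speaker"
    HOST = "host"  # co-hosts are treated like hosts in this sketch


def channels_for(broadcast_id: str, user_id: str, role: Role) -> List[str]:
    """List the pub/sub channels a participant is allowed to subscribe to."""
    prefix = f"broadcast:{broadcast_id}"
    channels = [
        f"{prefix}:state",            # joins, leaves, broadcast closed
        f"{prefix}:user:{user_id}",   # private channel of this user
        f"{prefix}:chat",             # chat messages
    ]
    if role in (Role.SPEAKER, Role.HOST):
        channels.append(f"{prefix}:speakers")
    if role is Role.HOST:
        channels.append(f"{prefix}:hosts")  # speak requests and withdrawals
    return channels


def can_subscribe(channel: str, broadcast_id: str,
                  user_id: str, role: Role) -> bool:
    """Authorization check the pub/sub layer could run on every subscribe."""
    return channel in channels_for(broadcast_id, user_id, role)
```

Because the allowed set is derived from the role, a promotion or degradation only requires recomputing channels_for and subscribing or unsubscribing accordingly.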
- Creation of broadcasts
- A user should see all available broadcasts in his app
- If a broadcast is live, it should be highlighted accordingly
- Creators can start and stop their broadcast
- Any user can join a broadcast once it’s live
- Audience members can raise their hand to indicate a speak request
- The host can promote audience members to speakers
- The host can degrade speakers back to audience
Following is a short video demonstrating the base functionality for a live broadcast.
With the creation of this proof of concept, all key technical requirements were tested for their feasibility. Since all requirements could be successfully implemented, the feasibility of the innovation project and the further development steps has been proven in principle.
Features – MoSCoW Prioritization
Based on the survey, the analysis of existing and similar apps and the results and learnings from implementing the PoC, the following features were extracted as crucial for MVP development.
For clarity, I’ll briefly summarize the meaning of MoSCoW in the context of MVP development.
M: Must Have – The MVP cannot be shipped without them.
S: Should Have – Features that are not critical for shipping, but add significant value if included.
C: Could Have – Nice to have features that have little or no impact if not included.
W: Will Not Have – Least critical features that can be planned for the next phase (omitted in this case).
Most of the following features are very high level and include a multitude of other child features and tasks that could further be prioritised with the MoSCoW method. For simplicity and ease of reading those have been omitted though, so only the most high-level tasks remain.
|Association to Tags/Topics
|A broadcast can be associated with existing topics, skills & interests. This enables the possibility that users can be notified if a broadcast relevant to their interests has been created or is about to start.
|Within each broadcast, there’s a live chat in which users can chat. Thereby listeners can for example ask questions to the speakers/hosts without having to join the stage.
|Scheduling of Broadcasts
|Broadcasts can be started spontaneously, but also planned as an appointment in the future.
|Remind Me Logic
|If the user is interested in a broadcast, he may activate the bell to be reminded as soon as the broadcast starts. If the user connected his calendar, this functionality will result in those broadcasts being synced automatically with his calendar as well.
|Reporting of Broadcasts
|Broadcasts can be reported and flagged as inappropriate. Reasons might be hate speech, harassment, spam, offensive language or other.
Due to the nature of iOS App Store evaluation and guidelines, this is a must by default.
|Reporting of Users
|Users can be reported for the same reasons as above, additionally, a user can be reported for having an inappropriate profile image.
|Users can share broadcasts and invite their contacts within the app. This should work both for scheduled as well as live broadcasts.
|In order to share the load of responsibility co-hosts can be invited and assigned to a broadcast. They should be able to start/stop a broadcast, promote audience members and degrade speakers, mute & kick speakers (but not co-hosts or the host), block/kick audience members, and edit the general broadcast information.
|Offline/Unstable Connection Handling
|Some sort of heartbeat logic must be implemented in order to guarantee resilience of the broadcast state. If a user disconnects or has an unstable internet connection, others should be notified about that. Consider a user sending a speak request: if there’s no logic for redundancy, the initial speak request might not reach the hosts because of connectivity issues. Or if a user joins and fails to send his initial join information, it must be ensured that all other participants will receive information about this user sooner rather than later. Otherwise, this user might operate in “ghost mode” (he participates, but no one else can ‘see’ him). This is not desired and needs to be addressed.
Additionally, show the user if he himself is offline or has trouble keeping his connection alive.
|Broadcasts can be created at varying privacy levels.
Public = everyone can join
Limited = open to the public, but limited to a threshold, e.g. only 25 participants
Private = only invited users can join (this needs additional logic so users can be invited as participants)
|Smart Speak Requests
|When creating a broadcast, it is possible to specify whether speak requests should be handled automatically & smartly. If so, the host & co-hosts don’t have the burden of promoting and degrading participants.
An algorithm detects whether the requester can join the stage immediately or whether someone else needs to step down first in order to make room on the stage (since there’s a max stage limit of 15). The users on stage agree among themselves who will give up their seat.
|Offline/Unstable Connection Handling (Chat)
|If the user tries to send messages while he’s disconnected, save them locally and automatically try and resend them once he’s back online.
|If the last host/co-host left the broadcast without closing it, an ultimatum is started which terminates the broadcast automatically after 2 minutes – if no host/co-host returns until then.
This tolerates accidental disconnects of hosts (e.g. due to connectivity issues) and enables them to rejoin within a 2-minute window, but also ensures that users cannot freely hold the broadcast “hostage” and do whatever they please without a moderator present.
|Joining this Conversation
|For each broadcast show the users that have set a reminder for this broadcast and are interested in joining.
|Invite Links and Web Client
|For broadcasts, also provide the ability to send/share a link for people to listen in via their browser. People joining via the web client do not need an account or the app itself.
This is similar to known functionality from apps like Zoom.
|Guidelines and Tips
|Create rules, i.e. dos and don’ts, so that users get a sense of how they should behave and what is expected from each participant (and role).
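The “Smart Speak Requests” row in the table above lends itself to a short sketch. The Stage class below is hypothetical. The stage limit of 15 comes from the text, while the first-come-first-served waiting queue is my own assumption, since the text only says that the users on stage agree among themselves who steps down.

```python
from collections import deque
from typing import Deque, List, Optional

MAX_STAGE = 15  # stage limit mentioned in the feature description


class Stage:
    """Hypothetical model of automatic promotion and degradation."""

    def __init__(self, limit: int = MAX_STAGE) -> None:
        self.limit = limit
        self.speakers: List[str] = []
        self.waiting: Deque[str] = deque()

    def request_to_speak(self, user_id: str) -> bool:
        """Return True if promoted right away, False if queued."""
        if user_id in self.speakers:
            return True
        if len(self.speakers) < self.limit:
            self.speakers.append(user_id)
            return True
        if user_id not in self.waiting:
            self.waiting.append(user_id)
        return False

    def step_down(self, user_id: str) -> Optional[str]:
        """A speaker gives up the seat; return who was promoted, if anyone."""
        if user_id in self.speakers:
            self.speakers.remove(user_id)
        if self.waiting and len(self.speakers) < self.limit:
            promoted = self.waiting.popleft()
            self.speakers.append(promoted)
            return promoted
        return None
```

With this shape, hosts and co-hosts never have to touch the queue: a free seat is filled as soon as somebody on stage steps down.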
Implemented Features and MVP Overview
All features listed in the table above have been implemented. Slight alterations have been made for the features “Privacy Levels” and “Joining this Conversation”. The privacy level “limited” is not yet implemented, as it was identified as non-crucial for the MVP. For each broadcast, the number of users that set a reminder and intend to join is shown, but not the user info itself. This decision was made for privacy reasons.
During development, some additional features were identified that weren’t planned initially, but still seemed practical and useful. Namely, the following have been added:
- First Time Modals
The user receives info when navigating to and using the broadcast module for the first time.
- Countdown for Broadcasts
When a broadcast is due to start within the next 30 minutes, a countdown is shown.
Additionally, a countdown is shown if an ultimatum is active and the broadcast will be closed within the next 2 minutes.
- Universal Links
When sending an invite link to someone and the other user opens this link on his smartphone, he has the option to either open the broadcast directly within the InspireNow-App (if installed) or via his web browser.
- Contact Requests
If there are users in the broadcast with whom the user is not yet connected, he has the opportunity to send a contact request which references the currently joined broadcast.
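The ultimatum countdown mentioned in the list above could be derived from a small pure function like the one below. It is a sketch under assumptions: the names `evaluateUltimatum`, `UltimatumState` and `formatCountdown` are hypothetical, timestamps are assumed to be in milliseconds, and the server-side scheduling of the actual termination is not shown. Only the 2-minute window comes from the text.

```typescript
// Hypothetical sketch of the ultimatum countdown; all names are illustrative.
const ULTIMATUM_MS = 2 * 60 * 1000; // 2-minute window stated in the text

interface UltimatumState {
  remainingMs: number; // time left before the broadcast is closed
  shouldTerminate: boolean; // true once the window has elapsed
}

// Returns null while a host/co-host is present (no ultimatum active).
function evaluateUltimatum(
  lastHostLeftAtMs: number | null,
  nowMs: number
): UltimatumState | null {
  if (lastHostLeftAtMs === null) {
    return null;
  }
  const remainingMs = Math.max(0, lastHostLeftAtMs + ULTIMATUM_MS - nowMs);
  return { remainingMs, shouldTerminate: remainingMs === 0 };
}

// Formats the remaining time as m:ss for display in the countdown.
function formatCountdown(remainingMs: number): string {
  const totalSeconds = Math.ceil(remainingMs / 1000);
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = totalSeconds % 60;
  return minutes + ":" + (seconds < 10 ? "0" + seconds : String(seconds));
}
```

If a host rejoins, the caller would pass `null` again and the countdown disappears; keeping the check free of timers makes it easy to test.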
The slideshow gives an insight into the implemented features and the appearance of the MVP.
Test Runs and Identified Issues
Multiple test runs were crucial for identifying previously unsuspected issues and hardening the architecture and event logic. For example, one user operated in ghost mode during one test run. All other participants saw him join momentarily, but soon after this user “vanished” from the audience list. However, the user hadn’t left the broadcast and was still there, listening. The actual problem was that heartbeats and user events were not being sent correctly, so he was flagged as disconnected.
Funnily enough, this user was still able to promote himself to Speaker and all other participants were able to hear him. Imagine being in your room and hearing a voice, but no one is there except you. It was just like that 👻😉.
As it was discovered later, the underlying issue was the same user momentarily disconnecting and then reconnecting with a new UID. The UID is a unique ID assigned to each user by Agora itself, thereby identifying who is in the audio channel. We internally couple this UID with the profile information about the user (from our database) and then use both in conjunction to send events like heartbeats. During those test runs, it was not expected that one user could use multiple UIDs in one session, which caused weird issues and bugs when users had connectivity issues or manually left and rejoined.
Another related bug was one user being rendered multiple times because he joined with different UIDs. However, as soon as the root cause (varying UIDs) was identified, both bugs could be fixed.
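One way to make the participant list robust against varying UIDs is to key it by the profile ID and treat UIDs as a set of connections belonging to that profile. The sketch below illustrates the idea only; `PresenceRegistry` and its methods are hypothetical, and numeric UIDs are an assumption. It is not the fix that was actually shipped.

```typescript
// Hypothetical sketch: one profile may own several UIDs within a session.
class PresenceRegistry {
  // profile id -> all UIDs currently associated with that profile
  private uidsByProfile: { [profileId: string]: number[] } = {};

  // Registers a (possibly new) UID for a profile, e.g. after a reconnect.
  join(profileId: string, uid: number): void {
    const known = this.uidsByProfile[profileId] || [];
    if (known.indexOf(uid) === -1) {
      this.uidsByProfile[profileId] = known.concat(uid);
    }
  }

  // Removes a UID; the profile only disappears once its last UID is gone.
  leave(uid: number): void {
    for (const profileId of Object.keys(this.uidsByProfile)) {
      const remaining = (this.uidsByProfile[profileId] || []).filter(
        (known) => known !== uid
      );
      if (remaining.length === 0) {
        delete this.uidsByProfile[profileId];
      } else {
        this.uidsByProfile[profileId] = remaining;
      }
    }
  }

  // Each profile is listed once, regardless of how many UIDs it used.
  participants(): string[] {
    return Object.keys(this.uidsByProfile);
  }
}
```

A user who reconnects with a second UID is then rendered once, and does not vanish from the list when the stale UID times out.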
During another test run, about 10 people had already joined, when one of the co-hosts accidentally touched the “Terminate Broadcast” button within the app. Since – at that point – there was no additional confirmation logic, the broadcast was closed immediately and all participants were kicked. In the test environment not a big deal, but imagine hosting a panel discussion with 100+ people. One accidental touch that results in kicking everyone and closing the broadcast is a no-go. Therefore an additional confirmation dialog was added afterward.
One of the more mind-boggling and tougher issues to debug was an apparent increase in battery consumption – at least on iPhones. Users reported their battery being drained by about 40% or more after an hour-long broadcast. Since users should be able to join a broadcast and consume it passively, e.g. while going on a walk, without worrying about whether their battery is fully charged, this needs to be addressed. Debugging this issue proved really tough and was like looking for a needle in a haystack.
One promising discovery was that updates in React hooks seemed to bubble up their hierarchy. Rerenders in a child hook caused the parent component/hook to rerender as well. I discovered that one of the core hooks was rerendering/recalculating about every 50-100 ms. At one rerender every 100 ms, this adds up to at least 18,000 rerenders within 30 minutes. For a simple hook, this might not be an issue; however, this hook had a lot of logic inside of it. It was therefore only logical that this unnecessary and constant recalculation could cause increased CPU usage and therefore battery consumption. Splitting this hook into multiple more specific ones and using them in a more isolated way (so they can’t bubble up and cause parents to rerender) proved worthwhile, resulting in a stark decrease in rerenders and CPU usage. Further performance optimizations, like using memoization and fixing issues within redux and reselect, reduced recalculation and rerendering even further. Android Studio Profiler and Xcode Debug Navigator were used to inspect CPU, memory and network usage.
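The effect of memoization on such recalculations can be illustrated with a minimal single-entry cache. This is a simplified sketch of the general idea behind memoized selectors, not reselect's API and not the project's code; `memoizeLast`, `BroadcastMember` and `selectSpeakers` are hypothetical names.

```typescript
// Single-entry memoization: recompute only when the input reference changes.
function memoizeLast<A, R>(compute: (input: A) => R): (input: A) => R {
  let cache: { input: A; result: R } | null = null;
  return (input: A): R => {
    if (cache !== null && cache.input === input) {
      return cache.result; // same input reference: skip the expensive work
    }
    const result = compute(input);
    cache = { input, result };
    return result;
  };
}

// Hypothetical example: deriving the speaker ids from all broadcast members.
interface BroadcastMember {
  id: string;
  onStage: boolean;
}

let derivationRuns = 0; // counts how often the derivation actually executes

const selectSpeakers = memoizeLast((members: readonly BroadcastMember[]) => {
  derivationRuns += 1;
  return members.filter((m) => m.onStage).map((m) => m.id);
});
```

As long as the member list keeps the same reference, repeated calls return the cached array, so components that depend on it see an unchanged value and the derivation does not run again on every tick.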
The implemented measures and improvements seem very promising and resulted in decreased battery usage in the following test runs. However, this issue has to be monitored and evaluated over a larger timeframe, different devices and a multitude of broadcasts. So no final conclusion can be made yet.
Conclusion and Challenges
Revisiting the Research Questions
Is audio at all interesting for knowledge management and communication in organizations?
Yes, the results of the survey indicate a general interest and curiosity. Further discussions and feedback interviews confirmed this base interest additionally. For example, during a recent acquisition meeting, it became apparent that the Broadcasts module (=Enterprise Social Audio) would add the most value for this client.
Can audio be used for further education, innovation and also everyday discussions?
So far, no meaningful results could be collected from the field. Therefore this needs to be further evaluated and tested. However, the above-mentioned client intends to use audio for mostly educational and training purposes.
Are there certain concepts or use cases that seem particularly promising?
Yes – panel discussions, knowledge transfer, and casual feedback talks stand out as the most promising so far.
What should audio functionality look like in the app?
Mostly answered and depicted in PoC and MVP development. Overall it needs to be simple, intuitive and clean.
What do customers want?
Well, this question basically equals the search for the holy grail. Every customer has his own specific context, problems and goals. One universal need seems to be the ability to easily reach and communicate with their specific audience. The audience, in this case, can either be the entire workforce (in a company) or members/residents (in a federation/institution). Audio serves as a modern and efficient approach for tackling this need.
Can audio lead to greater inclusion?
There is no sound evidence yet. In theory it should be possible, provided additional features like live captions (for hearing-impaired people) and live translations – both speech-to-speech and speech-to-text – are added.
Which features would offer the greatest added value?
Partly answered and shown in PoC and MVP development. In addition to the core functionality (stage and audience, speech requests, promotions, demotions, et cetera), the following features seem to be most essential: chat, invitations (of contacts and external people) and smart handling of speech requests.
Further features like record and publish, live captions and transcriptions were confirmed as valuable as well. They will most likely be taken up and implemented in the course of further development.
Are there certain features or rules that need to be implemented and provided to enable better conversational discipline?
The implemented “Smart Speak Requests” feature seems promising for lifting the management burden from the hosts and co-hosts. However, further evaluation is needed to determine whether this feature is sufficient and mature enough. For example, it has not yet been evaluated how well it works with 100+ participants. If a lot of audience members (15+) want to join the stage simultaneously, that might lead to issues not yet identified.
Which concepts have to be considered with regard to moderation, security and privacy?
Being able to report broadcasts and kick, block and report users is crucial for moderation. For security and privacy reasons sophisticated authentication and authorization are needed and were therefore implemented. Additional security and GDPR concerns were discussed in Research – Security, GDPR and Compliance and PoC implementation.
The following are the most notable frameworks, libraries and services that were used within this innovation project. This is not a complete list; if you’re interested in further specifics, please feel free to contact me.
- React Native
- Agora React Native SDK
- SocketCluster Client – for pub/sub
- Miscellaneous – redux, reselect, immer, axios etc.
- SocketCluster (Server, Broker and State) – for pub/sub; scaled horizontally and included in Kubernetes
- Agora Access Token – server-side authentication logic and generation of AccessTokens
- Agenda – for job scheduling, i.e. handling of broadcast ultimatum
- Bcrypt – for hashing
Infrastructure & CI/CD
TypeScript was used for all projects.
One of the most challenging parts – for me personally – was efficient testing of audio functionality. My approach for “real” testing was to open up 3-4 simulators, attach 2-3 physical devices and possibly add another emulator to the mix as well. This resulted in a maximum of 8 devices, or rather users, with which I could simulate a real broadcast scenario. This may be sufficient for evaluating and debugging basic functionality, but it’s nowhere near the desired and planned capacity of broadcasts. Broadcasts should be easily possible with up to 15 users on stage and 1000+ users in the audience. So far, I’ve honestly no idea how to simulate that appropriately. One approach that might be worth looking into is End-to-End Testing.
This challenge has been particularly noticeable with more obscure and generic bugs, e.g. the high battery consumption issue or the bug where some users operated in ghost mode (while for other users it worked totally fine).
As already mentioned above, at this point in time, I cannot say with full confidence that the implemented logic is stable and resilient with increasing broadcast size. It should work with the desired capacity (up to 15 stage users and 1000+ in audience), but I have no confirming evidence yet.
In a previous project, we migrated our infrastructure to Docker, Kubernetes and Rancher and used load testing to verify the capacity and increased performance. Similarly, the next step for this project would be to find a way to confidently test the performance and resilience of this newly added broadcast module.
Prospects and Future Development
Depending on the result of the upcoming meetings with the above-mentioned client, the following features are scheduled to be included next (descriptions can be found in the table of the survey results):
- Record and Publish
- Live Captions
- Searchable Audio Transcriptions
Those features were additionally discussed with the existing client who used the beta version (MVP) in his production environment. He too confirms that those features seem quite interesting and valuable.
Additional features that have potential and are tied to audio are:
Impressions, total reach, click/listen-through rate, et cetera.
- Audio Memos
Voice recordings can be attached to users’ profiles or sent via chat.
- Question Voting
In the broadcast chat, users can create questions and other users can vote for them. Thereby the most favored questions rise to the top and are easily visible for moderators. Stage users can then go ahead and answer those questions, basically ticking off their agenda, and make room for new ones.
- Broadcast / Podcast Editor
If Record and Publish is implemented, it might make sense to provide an additional editor in which the resulting audio files can be edited and optimized.
- Manually Setting Audio Quality
Give users the possibility to adjust their audio quality. This is useful for users with a poor internet connection or connectivity issues in general. Connectivity issues could even be detected automatically and the quality adjusted accordingly (similar to the option in streaming services to set the video quality to automatic).
- Further Development of the Web Client
Add more features, improve usability, basically making it possible to use either the app or the web client with the same user experience.
Overall, this innovation project was very challenging due to the inherent nature of real-time communication and the correspondingly increased difficulty in debugging. Nevertheless, it was also very rewarding, interesting and educational. And most importantly – to pick up the overarching research question from the start – it can be provisionally stated:
Enterprise Social Audio works.
Even though the hype around Clubhouse only lasted for a couple of months, the topic of social audio still seems to be very hot. Over the course of this project, multiple companies have announced their own version of social audio, several new apps and platforms have spawned and at least one company made the pivot to a completely new product offering – Social Audio as a Service.
Thus I’m very excited to observe the changes and further development in this field – and maybe I could ignite a spark of interest in you as well 😉✨