Project management and technical expertise form a perfect marriage in this tale of implementing a large-scale user verification system.
In September 2023 our tech team decided to implement a user identity verification solution – checking you are who you say you are – for the company’s most important product connected to the reality TV show, Big Brother Brasil. To achieve this goal, we leveraged good project practices in tandem with our focus on quality technology.
The Big Brother brief
Big Brother Brasil is a long-standing, popular show in my country. Part of the TV program includes a voting segment that allows the audience to eliminate participants during the show. Every user must have their identity verified before voting, which creates a sharp traffic peak on the verification API.
To lower this strain, we looked to update our architecture to handle a new identity verification system that functions under the weight of one of the largest audiences in the country. Since the show was airing in January 2024, we had roughly three months to complete a major refactor of the API.
The challenges faced and how we aimed to tackle them
To effectively divide and prioritize the project, we needed a deep understanding of the technical requirements and team dynamics: the number of available developers, scheduled vacations, and team seniority. Once we calculated these elements, we started to identify which of our hurdles to tackle.
Verifying user identity
One major hurdle was that only a few companies in Brazil are able to verify user identity. Every citizen has a Cadastro de Pessoa Física (CPF), a government-issued identification number that functions much like a Social Security Number (SSN) in the United States. Some companies are certified by the government to hold CPF databases and provide APIs that give access to that information. We therefore needed to use some of those companies as external APIs for verification, but even then, their APIs would not be able to handle the high traffic from the Big Brother audience. As a result, an additional architectural solution was required: a waiting room.
Technical resilience
Resilience was essential in the core verification API. We already knew we needed a waiting room architecture to improve capacity and the user experience, and our last requirement was to enhance user security by adding phone number verification. We therefore split the project into three related chapters:
- The user identity verification core
- The virtual waiting room solution
- Phone verification
Each chapter would run independently and theoretically be ready for production in January.
Chapter 1: The user identity verification core
The first chapter’s objective was to deliver a core logic API that superseded the existing one’s limited capabilities. We managed to refactor the API with the following enhancements.
Understanding existing risks in the system
The core API was already integrated with one external partner for CPF number verification, but we needed to be sure that downtime on an external server would not take us down with it. To guard against this, we integrated the core API with a second partner for CPF number verification. This setup provided a fallback option and let us load balance requests between the two external APIs.
Load balance
During our initial integration tests with external APIs, we realized that one external API was handling far more requests than the other. Without correction, this could overload the system, resulting in a temporary outage of the verification core. Our next step was to load balance according to each API’s capacity.
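One common way to balance load according to capacity is weighted random selection. The following is a minimal sketch of that idea, not the team's actual implementation: the partner names, the capacity figures, and the `pick_partner` helper are all assumptions made for illustration.

```python
import random

# Hypothetical capacities in requests per second; the real figures are not
# given in the article.
PARTNER_CAPACITY = {"partner_a": 300, "partner_b": 100}


def pick_partner(capacities, rng=random):
    """Pick a partner with probability proportional to its capacity."""
    names = list(capacities)
    weights = [capacities[name] for name in names]
    # random.choices performs weighted sampling with replacement.
    return rng.choices(names, weights=weights, k=1)[0]


def share_of_traffic(capacities, samples=10_000, seed=0):
    """Simulate many requests and report the fraction sent to each partner."""
    rng = random.Random(seed)
    counts = {name: 0 for name in capacities}
    for _ in range(samples):
        counts[pick_partner(capacities, rng)] += 1
    return {name: count / samples for name, count in counts.items()}
```

With these assumed capacities, roughly three out of four requests go to `partner_a`, so neither partner receives more than its share.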
Backup strategy
To introduce an additional layer of risk management, we made sure that our verification core could handle errors from external API partners. If one API failed, the system would retry verification with the other. Because an error message was shown only after both APIs had been tried, this strategy significantly improved the success rate of user identity verification.
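The retry-with-another-partner behavior can be sketched as a loop over partner clients. This is an illustrative sketch under assumptions: `PartnerError`, the `Verifier` callable type, and `verify_with_fallback` are hypothetical names, and real partner clients would wrap HTTP calls.

```python
from typing import Callable, Optional, Sequence


class PartnerError(Exception):
    """Hypothetical error for a partner that could not answer (timeout, 5xx)."""


# A verifier takes a CPF and returns True (valid) or False (invalid).
Verifier = Callable[[str], bool]


def verify_with_fallback(cpf: str, partners: Sequence[Verifier]) -> bool:
    """Try each partner in order; fail only if every partner fails."""
    last_error: Optional[PartnerError] = None
    for verify in partners:
        try:
            # A definitive answer (valid or invalid) ends the loop.
            return verify(cpf)
        except PartnerError as exc:
            # This partner is unavailable, so move on to the next one.
            last_error = exc
    raise PartnerError("all partners failed") from last_error
```

Note that only an outage triggers the fallback. A definitive "invalid CPF" answer from the first partner is returned as-is rather than retried.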
Circuit breaker
Circuit breakers reduce traffic whenever there is a sign of system degradation. We implemented them for those cases where our verification core detected an unresponsive external API. At that point, the system would temporarily deactivate that partner and redirect traffic to another. We made sure to add a separate circuit breaker for each partner, helping mitigate downtime in user verification.
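A per-partner circuit breaker can be sketched as a small state holder that opens after repeated failures and allows a trial request after a cool-down. The thresholds, the partner names, and the class itself are assumptions for illustration; the article does not describe the actual implementation.

```python
import time


class CircuitBreaker:
    """Opens after consecutive failures; allows a trial call after a cool-down."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def allow_request(self):
        if self._opened_at is None:
            return True
        # Once the cool-down has passed, let a trial request through.
        return self._clock() - self._opened_at >= self.reset_timeout

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.failure_threshold:
            # Open (or re-open) the circuit and restart the cool-down.
            self._opened_at = self._clock()


def available_partners(breakers):
    """Return the partners whose breaker currently allows traffic."""
    return [name for name, breaker in breakers.items() if breaker.allow_request()]


# One breaker per partner (hypothetical names), so an outage at one partner
# does not block traffic to the other.
breakers = {"partner_a": CircuitBreaker(), "partner_b": CircuitBreaker()}
```

Keeping a separate breaker for each partner means a failure count on one side never deactivates the healthy side.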
Caching
At some point, some users would try to validate their CPF with more than one account. Although the unique voting logic would block such actions, they would still unnecessarily amplify requests to the verification core. To handle the anticipated high audience traffic, we optimized CPF verification by caching the result of each verification attempt, whether successful or failed. This increased the core API’s efficiency and reduced costs.
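Storing both successful and failed results can be sketched as a small cache with an expiry time. This is a simplified in-memory sketch under assumptions: the `VerificationCache` class, the one-hour TTL, and the `cached_verify` helper are hypothetical, and a production system would more likely use a shared cache service.

```python
import time


class VerificationCache:
    """In-memory cache of CPF verification results with a time-to-live."""

    def __init__(self, ttl_seconds=3600.0, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._entries = {}  # cpf -> (is_valid, expires_at)

    def get(self, cpf):
        """Return the cached result, or None if missing or expired."""
        entry = self._entries.get(cpf)
        if entry is None:
            return None
        is_valid, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[cpf]
            return None
        return is_valid

    def put(self, cpf, is_valid):
        self._entries[cpf] = (is_valid, self._clock() + self.ttl_seconds)


def cached_verify(cpf, cache, verify):
    """Call the partner only when no fresh result is cached."""
    cached = cache.get(cpf)
    if cached is not None:
        return cached
    result = verify(cpf)  # an exception here propagates and is not cached
    cache.put(cpf, result)
    return result
```

Caching the negative results matters as much as the positive ones: a repeated attempt with an invalid CPF is answered locally instead of costing another partner request.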
Chapter 2: The virtual waiting room solution
Virtual waiting rooms protect the user experience when a large audience tries to access a limited resource at the same time; in this case, the user identity verification core. If at any time there were more users than our systems could handle, the virtual waiting room would line them up to be processed by the verification core.
Choosing the right tools to implement this was, therefore, important. We prioritized:
- Scalability: we chose a solution that could handle one of the largest audiences in the company, having the capacity to scale to manage millions of simultaneous users trying to vote on Big Brother Brasil.
- Latency: the entire ecosystem should be able to react to our users as fast as possible, improving the user experience and diminishing the waiting time.
- Metrics: we collected data on the user experience in the waiting room to better understand the ecosystem. This data guided us in making the necessary improvements.
Implementing a waiting room solution can be hard as there are multiple factors to consider, including user experience throughout the entire flow and guaranteeing that the API is correctly tuned to deliver the fastest output for the user (reducing the waiting time).
To find this sweet spot, we needed to balance the API load and user wait time. This is tricky; if you reduce waiting time too much, you risk overloading the API. If you’re too cautious with request flow, users end up waiting longer.
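The trade-off between API load and wait time can be made concrete with a first-in, first-out queue that releases a fixed number of users per interval. This is a toy sketch, assuming a single tunable `admit_per_tick` value; the class and function names are hypothetical and the real solution had to work at a far larger scale.

```python
from collections import deque


class WaitingRoom:
    """FIFO queue that admits at most `admit_per_tick` users per interval."""

    def __init__(self, admit_per_tick):
        self.admit_per_tick = admit_per_tick
        self._queue = deque()

    def join(self, user_id):
        """Add a user to the line and return their position (1-based)."""
        self._queue.append(user_id)
        return len(self._queue)

    def admit_batch(self):
        """Release the next users, in arrival order, to the verification core."""
        batch = []
        while self._queue and len(batch) < self.admit_per_tick:
            batch.append(self._queue.popleft())
        return batch


def estimated_wait_ticks(position, admit_per_tick):
    """Number of intervals until the user at `position` is admitted."""
    return -(-position // admit_per_tick)  # ceiling division
```

Raising `admit_per_tick` shortens the estimated wait for everyone in line but sends more concurrent requests to the verification core, which is exactly the balance described above.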
Using the right metrics, such as the number of users queued per minute, the number of users verified per minute, and the average user waiting time, can inform improvements to the user experience. The metrics we derived from the waiting room gave us insight into how the typical user experienced the flow. Ensuring a frictionless voting mechanism was essential for a good product experience. Metrics made it possible to see the user’s perspective, leading to the delivery of not only features but a well-designed product.
The waiting room solution is presented to the user in the following flow:
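The three metrics named above can be computed from timestamps alone. The sketch below assumes each visit is recorded with the time the user entered the queue and the time verification completed; the `Visit` record and the function name are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class Visit:
    queued_at: float                      # seconds since the start of voting
    verified_at: Optional[float] = None   # None while the user is still waiting


def waiting_room_metrics(visits, window_start, window_seconds=60.0):
    """Summarize one time window: users queued, users verified, average wait."""
    window_end = window_start + window_seconds

    def in_window(timestamp):
        return window_start <= timestamp < window_end

    queued = sum(1 for v in visits if in_window(v.queued_at))
    # Wait times of the users whose verification finished inside the window.
    waits = [
        v.verified_at - v.queued_at
        for v in visits
        if v.verified_at is not None and in_window(v.verified_at)
    ]
    return {
        "queued": queued,
        "verified": len(waits),
        "avg_wait_seconds": mean(waits) if waits else 0.0,
    }
```

When the queued count stays above the verified count for several windows in a row, the line is growing, which is a signal to revisit the admission rate.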
Fig.1. First version of the user identity verification mechanism flow
Chapter 3: Phone validation
The last chapter of our journey addressed potential fraud concerns in the case of leaked CPF databases. The worry was that bad actors would take advantage of leaked personal information to cast fraudulent votes.
As part of this, we implemented a step in the voting process whereby users had to enter a valid cell phone number that we then validated with an SMS code. With that, our final product’s flow looked like the below:
Fig.2. Entire flow of the user identity verification mechanism
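The SMS step in this flow can be sketched as a one-time code with an expiry and a limit on attempts. Everything here is an assumption for illustration: the six-digit length, the five-minute lifetime, the three-attempt limit, and the `PhoneVerifier` class are not taken from the article, and actually sending the SMS is left out.

```python
import hmac
import secrets
import time


class PhoneVerifier:
    """Issues one-time codes and checks them, with expiry and attempt limits."""

    def __init__(self, ttl_seconds=300.0, max_attempts=3, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.max_attempts = max_attempts
        self._clock = clock
        self._pending = {}  # phone -> [code, expires_at, attempts_used]

    def start(self, phone):
        """Create a six-digit code for this phone number."""
        code = f"{secrets.randbelow(10**6):06d}"
        self._pending[phone] = [code, self._clock() + self.ttl_seconds, 0]
        # A real system would hand the code to an SMS gateway here instead
        # of returning it to the caller.
        return code

    def confirm(self, phone, submitted):
        """Return True only for a correct, unexpired, not-yet-used code."""
        entry = self._pending.get(phone)
        if entry is None:
            return False
        code, expires_at, attempts = entry
        if self._clock() >= expires_at or attempts >= self.max_attempts:
            del self._pending[phone]
            return False
        entry[2] += 1
        # Constant-time comparison avoids leaking how many digits matched.
        if hmac.compare_digest(code.encode(), submitted.encode()):
            del self._pending[phone]  # codes are single-use
            return True
        return False
```

The attempt limit matters because a six-digit code has only a million possible values, so unlimited guessing would defeat the check.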
The last mile
We completed all three chapters within two months. The entire endeavor involved almost 30 people, including developers, DevOps, and UX designers. But it was the last mile that was the hardest part.
We used the last month to review the completed work, stress test scenarios, implement extensive monitoring, and run load tests (large amounts of requests in a controlled environment) on the main APIs. The main challenges were coordinating all the work being done by the team, choosing our priorities, and delegating tasks efficiently, all while ensuring a quality result in the final product.
That final month was as crucial as the first two. It reinforced the importance of saving extra time ahead of every deadline for a final review.
Final thoughts
How you manage challenging projects like this can be as important as the technology you choose to deliver. It can contribute a lot to the result, especially when there is a short deadline and a vast scope to satisfy.
We managed to split the project into three parts, but identifying the top priority – the verification core API – was essential to guarantee good results and reduce risks. This meant that even in a hypothetical situation where we only had enough time to complete one aspect of the project, we could still deliver commendable results. After completing the highest priority, we worked backward from there, always focusing on the most challenging task at the moment. Each delivery from each chapter would build on features from the previous version.
Even in the worst time crunch, metrics can be a saving grace. In our situation, they were immeasurably helpful when we were optimizing the voting waiting room, but in every project, there will be valuable data to monitor and use as a guide for improvement. Invest some time in finding the right metrics for your project, and the reward will be tenfold.
Do not forget about the last mile, as you will need it to fine-tune your project.