Where we come from
Like many startups, Applivery was born as a side project: a simple but ambitious service focused on distributing mobile applications in an efficient, simplified way.
Given its humble origins, the costs of the service had to be tightly controlled, so we decided to deploy the first versions on the most affordable and straightforward Cloud provider of the moment, DigitalOcean, alongside a simple storage layer based on AWS S3.
The service was originally conceived as a single monolith whose processes, databases and services ran on a single droplet.
However, that simple but ambitious project gradually became a platform that had to support dozens of customers and thousands of daily active users. In addition, the project's ambitions grew to include numerous development ideas, resulting in an extensive roadmap spanning the next two years. Developing a product of that scale on a single droplet no longer seemed like such a good idea. That is why we decided to grow the team and rebuild the project from scratch.
At that point, we decided to move to a microservices-oriented architecture, separating the frontend from the APIs. All services would be deployed as Docker containers through a sophisticated continuous integration pipeline built on CircleCI. To simplify the transition between the old and the new version, we kept DigitalOcean as the compute and DNS provider, but we took the opportunity to move to external services for the database (MongoDB Atlas), the queuing system (CloudAMQP) and the Docker image registry (AWS ECR).
This new microservices system went live at the end of 2019 and made use of Dokku for container orchestration.
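For context, Dokku follows a Heroku-style, git-push deployment model: pushing a branch to the server builds the Docker image and redeploys the container. A typical flow looks roughly like this (app and host names are illustrative, not our actual configuration):

```shell
# On the server: create the app and set its runtime configuration
dokku apps:create api
dokku config:set api NODE_ENV=production

# On the developer machine: add the Dokku host as a git remote
git remote add dokku dokku@apps.example.com:api

# Pushing builds the image and deploys the new container
git push dokku main
```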
However, although we had made a great effort to rebuild and re-architect the entire platform, we were still serving all the traffic from a single virtual machine that acted as a bottleneck. That’s why, shortly after, we decided to make the leap to Kubernetes.
DigitalOcean was still the most affordable Cloud provider for a managed Kubernetes service, so we didn’t hesitate and migrated the container orchestrator to Kubernetes, adding numerous technical, architectural and security improvements along the way, including:
Likewise, to provide the custom-domain service for the App Stores we used another virtual machine with an elastic IP, because DigitalOcean did not provide a NAT gateway to control the outgoing IP of the workers.
At that point, the architecture looked like the following:
Growing faster than expected
This whole system, built mainly on DigitalOcean and supported by AWS, worked decently and with very controlled costs for many months. However, as the volume of customers grew and, with it, the traffic, we began to suffer occasional problems that we could neither control nor foresee:
However, despite all efforts, the infrastructure had proven complex and, at times, unstable. Our ambitious expansion plans and product evolution were severely hampered by the time invested in troubleshooting and incident resolution.
We needed greater control over the managed Kubernetes service, and improved network stability was a must. In addition, given the expansion of the service and the product roadmap, we wanted to take advantage of the opportunity to keep improving our platform in the following business-critical areas:
The jump to Google Cloud Platform
At this point, we had decided to leave DigitalOcean and move to a hyperscaler better suited to our reality.
After a few weeks invested in proofs of concept on Amazon Web Services (AWS), Google Cloud Platform (GCP) and Azure, and despite the fact that the technical team's Cloud experience was mainly focused on AWS, we ended up opting for GCP, for several reasons:
Gaining greater control over the managed Kubernetes service
The quality and ease of use of Google Kubernetes Engine (GKE) surprised us from the first moment we tried it. The range of options and the observability it offers over the cluster by default is the best we have tested so far.
Improving network stability
Google gives us the option to run on its own globally connected network, which provides much better performance and lower latency than the standard network tier, resulting in a better experience for users anywhere in the world.
Reducing the number of external services
Both AWS and GCP offered plenty of options to internalize the external services we were using, but given our strong ties to the Android ecosystem, we were already deeply integrated with several GCP services (such as Pub/Sub or the Android Management API). On top of that, GCP offered other unique services, like BigQuery, Analytics, Workspace or Maps, that we planned to use in the near future.
Securing all connections
Due to the limitations we were experiencing at DigitalOcean, there was ample room for improvement on the security front. We identified two main ways GCP would help us: first, VPC peering with automatic import and export of routes would greatly simplify our integration with MongoDB Atlas; second, Workload Identity Federation would let GitHub Actions authenticate without exchanging keys, avoiding the exposure of any kind of access credentials.
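As a sketch of what keyless authentication looks like on the GitHub Actions side, using the official google-github-actions/auth action (the pool, provider and service-account names below are hypothetical):

```yaml
permissions:
  contents: read
  id-token: write   # lets the job request a GitHub OIDC token

steps:
  - uses: actions/checkout@v4

  # Exchange the OIDC token for short-lived GCP credentials: no stored keys
  - uses: google-github-actions/auth@v2
    with:
      workload_identity_provider: projects/123456/locations/global/workloadIdentityPools/github/providers/github-oidc
      service_account: ci-deployer@example-project.iam.gserviceaccount.com
```

The workload identity pool is configured to trust tokens only from specific repositories, so a compromised fork or unrelated workflow cannot impersonate the deployer.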
Simplifying monitoring and observability
Previously, we had run Grafana and Prometheus inside the cluster to handle monitoring and observability, services we had to maintain ourselves. With GKE this was no longer necessary, because the console itself already offered everything we needed, such as access to logs, node and pod visibility, dashboards and alerts.
Switching to a more suitable region for our service
Finally, GCP was about to launch its new region here in Madrid, Spain (europe-southwest1), so, true to our early-adopter spirit, and after several tests, we went for it. Testing showed that, compared to our previous region in Frankfurt, we substantially improved latency throughout the Americas without penalizing Central Europe or the rest of the world.
How did it go?
We started by rebuilding our Pulumi recipes to stand up a completely new pre-production environment, migrating to multiple GCP services along the way, among them:
Thanks to the options provided by GCP, we also made changes to our complementary services:
Once the whole stack was tuned and verified, we brought up the production environment in a matter of minutes thanks to our Pulumi recipes. We also started a live migration of MongoDB Atlas to the new region.
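As an illustration, the core of such a recipe might look like the following Pulumi sketch (Python SDK; resource names and parameters are hypothetical, not our actual configuration). Note how Cloud Router plus Cloud NAT give the workers a controlled egress IP, replacing the elastic-IP virtual machine we needed on DigitalOcean:

```python
import pulumi_gcp as gcp

region = "europe-southwest1"  # the new Madrid region

# Regional GKE cluster using VPC-native (alias IP) networking
cluster = gcp.container.Cluster(
    "main-cluster",
    location=region,
    initial_node_count=3,
    networking_mode="VPC_NATIVE",
    ip_allocation_policy=gcp.container.ClusterIpAllocationPolicyArgs(),
)

# Cloud Router + Cloud NAT: all outbound worker traffic leaves
# through a managed, predictable NAT IP
router = gcp.compute.Router("egress-router", region=region, network="default")
nat = gcp.compute.RouterNat(
    "egress-nat",
    router=router.name,
    region=region,
    nat_ip_allocate_option="AUTO_ONLY",
    source_subnetwork_ip_ranges_to_nat="ALL_SUBNETWORKS_ALL_IP_RANGES",
)
```

Because the whole environment is declared as code, spinning up an identical production copy is a matter of running the same program against a new stack.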
When everything was ready, we stopped deployments to the old cluster, completed the live migration in MongoDB Atlas and switched the DNS in Cloudflare. From then on, all traffic was reaching the new environment. In total, we achieved less than one minute of downtime.
What this new architecture has brought us
After the migration, we have improved in every measurable aspect, such as response times, transfer speed and processing capacity.
Applivery has been in constant evolution since its birth; new needs and functionalities are continuously being developed. We have always maintained a commitment to innovation, adapting to business requirements to stay at the forefront of technology and offer a reliable, scalable service to all our customers.
This move to GCP is the latest example: a major effort at every level to consolidate the foundations of our platform, ensuring its long-term robustness and allowing us to keep innovating in an agile, sustainable way.
What’s next? A shot for whoever says generative AI 😀