Go to bed with Hetzner

ServerBee Blog
5 min read · Mar 19, 2024


https://status.hetzner.com/incident/1db7f5bb-0436-4362-b56a-53539aa9a622

Well, I can say that this Hetzner network failure has become the most serious issue we have faced in ServerBee’s history.

This article is a set of post-mortem notes: an attempt to analyze the situation and draw conclusions that will help us react as quickly as possible next time or, better yet, avoid such failures altogether.

Background: One of our clients rents high-capacity dedicated enterprise servers from Hetzner. Our team set up the infrastructure for the project: we installed Kubernetes on-premise using the virtualization and HA best practices we have gathered over the years. The existing k8s clusters had been running for years without any major problems.

The story: Starting from 1 p.m. on March 12, we received alerts about the Failover IP flapping on the external LB servers, which we resolved by switching to manual mode. We checked Hetzner’s status page and there were no open issues. Nevertheless, at around 4 p.m. the same day, we began receiving alerts about network interruptions. Within an hour, the internal network failed and we lost connectivity between some servers in all environments. We immediately submitted a support request to Hetzner, but since this was a truly global issue for the hosting provider, the wait time could be significant, so we started fixing the situation on our own right away.

Thanks to the fact that it was Kubernetes with 5 master nodes running in different zones of the same region (but not in all 5 zones, thank God), we were able to bring production back online rather quickly. With the help of k8s event messages, we identified the healthy nodes, stopped unnecessary parallel processes, and optimized cluster resource utilization. Within 3–4 hours the system was fully serving clients. It ran a bit slower than usual, but about 95% of all microservice groups were working. It was also a very good demonstration of how well the DevOps methodology works in such situations: we worked side by side with the backend team, and there were more than 300 groups of microservices to take care of.
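
To give an idea of what that triage looked like, here is a minimal sketch of the kind of kubectl commands involved. The node name is hypothetical, and the exact drain flags depend on your workloads:

```bash
# List nodes together with their internal IPs and Ready/NotReady status
kubectl get nodes -o wide

# Recent cluster events, oldest first; events such as NodeNotReady point to the affected nodes
kubectl get events -A --sort-by=.metadata.creationTimestamp | tail -n 50

# Keep new pods away from an unreachable worker and reschedule what can be moved
kubectl cordon worker-03    # hypothetical node name
kubectl drain worker-03 --ignore-daemonsets --delete-emptydir-data
```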

Then we continued following the Hetzner support team’s recommendations, such as removing servers from the vSwitch VLANs, re-adding them, and reconnecting. We tried different scenarios, but none of them solved the problem. We worked exclusively in the test environment to avoid breaking production and causing data loss or other damage. I might joke that we literally went to bed with Hetzner, as the work continued until late at night.
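
For reference, re-adding a dedicated server to a vSwitch also means re-creating the VLAN sub-interface on the host side. A minimal sketch is below; the NIC name, VLAN ID, and addresses are illustrative, while the 1400-byte MTU limit comes from Hetzner’s vSwitch documentation.

```bash
# Re-create the vSwitch VLAN sub-interface on the dedicated server
# (NIC name, VLAN ID and addressing are illustrative; use your own values)
ip link add link enp0s31f6 name enp0s31f6.4000 type vlan id 4000
ip link set enp0s31f6.4000 mtu 1400    # vSwitch traffic is limited to an MTU of 1400
ip addr add 10.0.1.11/24 dev enp0s31f6.4000
ip link set enp0s31f6.4000 up

# Verify reachability and the effective MTU towards another server on the same vSwitch
ping -c 3 -M do -s 1372 10.0.1.12    # 1372 bytes payload + 28 bytes headers = 1400
```

The oversized `ping -M do` check confirms both that the peer answers over the VLAN and that nothing on the path silently fragments packets above the vSwitch MTU.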

After 20 hours, we finally received hands-on support from a dedicated Hetzner network engineer, and the problem was completely resolved in all environments.

What was the problem: While dealing with the consequences, I already had a hypothesis about the cause of the failure, and in the end it was confirmed. Hetzner uses its virtual-router vSwitch feature to manage communication inside all of their VLANs. It was precisely an error in the operation of this software on their routers that caused a desynchronization (failure) in the communication between servers in different zones of one region, blocking the operation of some of the k8s nodes.

It is important to know that a feature similar to Hetzner’s vSwitch is also used for routing by other hosting providers. Hence, the likelihood of such a failure always exists, even if it has never happened before.

Conclusions: Kubernetes is a must. If a network error occurs, the primary task is to keep the production environment running, and Kubernetes significantly facilitates this and minimizes the consequences of any failure (with the correct architecture, of course). It helped us greatly that we had not spread the 5 k8s masters across all 5 zones of the region: the masters sat in only 3 zones, enough of them stayed reachable to keep the etcd quorum (3 out of 5), and we never lost control over cluster management.
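
As a quick sanity check during such an incident, you can verify how many control-plane nodes and etcd members are still healthy. A minimal sketch, assuming a kubeadm-style stacked etcd with the default certificate paths and the standard control-plane node label (adjust both for your own setup):

```bash
# How many control-plane nodes are still Ready?
# (older clusters label masters as node-role.kubernetes.io/master instead)
kubectl get nodes -l node-role.kubernetes.io/control-plane

# With a kubeadm-style stacked etcd, check member health from any surviving master
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health --cluster
```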

I should say that if we hadn’t used k8s on this project and had worked, for example, with regular VMs + Packer, our options would have been very limited (imagine not being able to use zones at the VM level). All client services would simply have stopped until we received engineering support from the provider, which could have led to data loss, equipment failures, and far worse consequences.

Other infrastructure components matter as well. Monitoring, for example, means you receive error notifications promptly and can start working on the issue right away. Additionally, we had invested a lot of time in the past in HA for all the databases in use, and that helped us too.
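
This is not our actual monitoring stack, just an illustration of how little it takes to get an early signal: a tiny watchdog that pings peers on the private VLAN and can be hooked into cron and whatever alerting channel you already use (the peer addresses are hypothetical).

```bash
#!/usr/bin/env bash
# Ping each peer on the private vSwitch VLAN and report the ones that stopped answering.
# Wire the non-zero exit code into cron plus your alerting channel of choice.
PEERS="10.0.1.11 10.0.1.12 10.0.1.13"   # hypothetical private addresses
failed=0
for peer in $PEERS; do
  if ! ping -c 2 -W 2 "$peer" > /dev/null 2>&1; then
    echo "$(date -Is) vSwitch peer $peer unreachable" >&2
    failed=1
  fi
done
exit $failed
```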

How to prevent or protect your infrastructure from similar failures: So far, I see 4 possible actions for the future:

  1. Ensure dedicated, “priority” support. This could be an additional option in your contract that obliges the hosting provider to prioritize resolving your issue within strictly defined time limits that are not critical for your project. Not 20 hours :) come on!
  2. Avoid the vSwitch function and instead insist on a physical switch for your servers. This may also require additional payment but will help resolve the problem much faster.
  3. Develop and test your own plan “B” solution, using, for example, Open vSwitch or native Linux ip route features (GRE over IPsec, etc.); see the sketch after this list.
  4. Start using zones even more prudently, taking into account most of the possible failure modes. We constantly have to make improvements based on new experience.
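
To make point 3 more concrete, here is a minimal sketch of the GRE part of such a plan B, assuming two dedicated servers that can still reach each other over their public interfaces; the public IPs, interface names, and tunnel addressing are illustrative.

```bash
# Plan B: a GRE tunnel between two servers as a stand-in for the failed vSwitch VLAN
# On server A (public IP 203.0.113.10), with server B at 203.0.113.20:
ip tunnel add gre-backup mode gre local 203.0.113.10 remote 203.0.113.20 ttl 255
ip addr add 192.168.255.1/30 dev gre-backup
ip link set gre-backup mtu 1400
ip link set gre-backup up

# Route the private subnet that used to live on the vSwitch through the tunnel
ip route add 10.0.2.0/24 dev gre-backup

# Repeat on server B with local/remote and the tunnel address swapped (192.168.255.2/30).
# GRE itself is unencrypted, so for anything sensitive wrap it in IPsec (e.g. strongSwan).
```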

Additional: A DevOps partner often plays a vital role in your business. It’s good if you have a team that takes on all the pain of such situations, and it’s even better when that DevOps team can scale up quickly :) For example, under normal conditions 3 of our DevOps engineers work on this project. For the troubleshooting described above, we could draw on the knowledge, expertise, and consultations of our most experienced team members, which allowed us to resolve the problem dramatically faster. And that was only possible because we already had this professionally prepared team!

Of course, this Hetzner network incident made me nervous, but on the other hand, it was a good experience from both a team management and a technical point of view.

I hope this information was useful. I am open to your feedback and comments, as well as any possible questions.


ServerBee Blog

We specialize in scalable DevOps solutions. We help companies support critical software applications and infrastructure on AWS, GCP, Azure, and even bare metal.