Kubernetes on Azure: Lessons learned from the field

A few months ago we started a big project to transform a monolithic application running on IIS/SQL Server into a microservices one hosted on a Kubernetes cluster on Azure. It was an interesting journey where everything was new for developers, ops and support.

The project will go live in production in a few weeks, and everything on the infrastructure side is running smoothly now, but we faced many issues along the way and found a solution for each one. This blog post relates all of them, in more or less detail depending on what the NDA with my employer allows.


Lesson learned #1 – Never use the portal to deploy a Kubernetes cluster!

I love automation, scripting and configuration management, so this advice obviously comes from that background, but the first thing you are tempted to do when testing a Kubernetes cluster on Azure is to use this “wonderful” UI. Big mistake! Everything there is outdated! And worst of all, you can’t customize your deployment, which is a real failure, because you NEED to customize it in order to match your company’s needs.

A little research on Google will turn up many results about how to really get the most out of Azure Container Service. You have the choice between:

  • ACS Engine, the engine behind the scenes of the ACS/AKS offers on Azure: a powerful tool that builds an ARM template with everything you need in it to deploy your cluster exactly as you need it.
  • Azure CLI, the multi-OS command line that can deploy ACS or AKS clusters (see the sketch after this list). It’s actually a shame that these commands aren’t available as PowerShell modules.
  • Terraform, the hyped tool from HashiCorp, which has everything you need to deploy a fully ACS-powered cluster.
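
For instance, here is roughly what the CLI route looks like. This is a minimal sketch: the resource group and cluster names are placeholders, and the regions and versions on offer depend on when you run it.

```bash
# Create a resource group, then let the CLI deploy and wire up the cluster
# (all names below are placeholders).
az group create --name k8s-rg --location westeurope

# AKS (managed masters), in the regions where the service is available:
az aks create --resource-group k8s-rg --name demo-aks \
  --node-count 3 --generate-ssh-keys

# Fetch the kubeconfig credentials and check that the nodes are up.
az aks get-credentials --resource-group k8s-rg --name demo-aks
kubectl get nodes
```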


Lesson learned #2 – Do not use default options when deploying your ACS K8S cluster!

Since we want to deploy our cluster in francecentral and AKS isn’t available there right now, we decided to use ACS Engine to get the job done. I have already written about ACS Engine on this blog. The tool has evolved like hell over the past two months; every new build brings excellent new possibilities. Most of them are requested by users filing issues, or simply keep the product in sync with Kubernetes itself, which is moving forward very quickly.

For example, in ACS Engine you can add many plugins to your cluster by design:

  • Networking: Calico, CNI, kubenet
  • Tooling: Tiller, horizontal autoscaling
  • Specific Kubernetes version
  • Azure Container Instance plugin
  • Kubernetes dashboard
  • Etc…

But that’s not all; you can also customize many other options, such as:

  • Use a previously created virtual network
  • Change the local subnetting
  • Use Virtual Machine Scale Sets instead of Availability Sets
  • Change the load balancer SKU from the default to standard
  • Create a fully private cluster without any public access
  • Enable RBAC with AAD authorization
  • Mix Windows and Linux nodes

With all these possibilities you can create your own unique cluster matching your security and governance needs; the sketch below gives a taste of what that looks like.
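
To make this concrete, here is a minimal sketch of an ACS Engine api model with a few of these options set, followed by the generate-and-deploy steps. The field names follow the “vlabs” schema as I remember it, and every ID, key and name below is a placeholder you must replace with your own values.

```bash
# Write a minimal api model (every value below is a placeholder).
cat > kubernetes.json <<'EOF'
{
  "apiVersion": "vlabs",
  "properties": {
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes",
      "orchestratorRelease": "1.9",
      "kubernetesConfig": {
        "networkPolicy": "calico",
        "enableRbac": true
      }
    },
    "masterProfile": {
      "count": 3,
      "dnsPrefix": "myk8s",
      "vmSize": "Standard_DS2_v2"
    },
    "agentPoolProfiles": [
      {
        "name": "linuxpool",
        "count": 3,
        "vmSize": "Standard_DS3_v2",
        "availabilityProfile": "VirtualMachineScaleSets"
      }
    ],
    "linuxProfile": {
      "adminUsername": "azureuser",
      "ssh": { "publicKeys": [ { "keyData": "ssh-rsa AAAA...placeholder" } ] }
    },
    "servicePrincipalProfile": {
      "clientId": "00000000-0000-0000-0000-000000000000",
      "secret": "<service-principal-secret>"
    }
  }
}
EOF

# Turn the api model into an ARM template, then deploy it.
acs-engine generate kubernetes.json
az group deployment create \
  --resource-group myk8s-rg \
  --template-file _output/myk8s/azuredeploy.json \
  --parameters @_output/myk8s/azuredeploy.parameters.json
```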


Lesson learned #3 – Do not use the ACS Engine upgrade command!

ACS Engine upgrade is like Russian roulette: you never know how it will end until you try it. Each time we tested it, we got a different significant error. Definitely not production ready… To work around this problem, we tested two options:

Build parallel clusters with the new Kubernetes versions and/or the new ACS Engine options we need in the cluster.

At first glance this is the easiest way of doing it. But when you have customers running their applications on the cluster, it’s a real feat to migrate them without downtime, even a little of it. The operation can also be quite complex for IT ops, especially if no deployment pipeline is in place. Indeed, you need to change DNS records and wait for them to replicate worldwide before your ingress controllers work as expected. And I’m not even talking about the NTP problems that can exist.

Deploy our clusters behind Azure Traffic Manager

With an Azure Traffic Manager endpoint deployed in front of each ingress controller’s public IP, and your microservices using the CNAME you created for the Traffic Manager URLs, you can add as many endpoints for each published service, deployed in as many clusters, as you want. Thanks to that, you can easily migrate services between clusters.
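
As a rough sketch of that setup, assuming each ingress controller fronts a static public IP resource and that all the names below are hypothetical, the Azure CLI side looks something like this:

```bash
# One Traffic Manager profile per published service (all names are placeholders).
az network traffic-manager profile create \
  --resource-group myk8s-rg --name myservice-tm \
  --routing-method Weighted --unique-dns-name myservice-tm

# Register each cluster's ingress controller public IP as an endpoint.
az network traffic-manager endpoint create \
  --resource-group myk8s-rg --profile-name myservice-tm \
  --name cluster-blue --type azureEndpoints --weight 100 \
  --target-resource-id "$BLUE_INGRESS_PUBLIC_IP_ID"

az network traffic-manager endpoint create \
  --resource-group myk8s-rg --profile-name myservice-tm \
  --name cluster-green --type azureEndpoints --weight 1 \
  --target-resource-id "$GREEN_INGRESS_PUBLIC_IP_ID"

# Your own DNS stays stable: myservice.example.com is a CNAME to
# myservice-tm.trafficmanager.net, so migrating a service between clusters
# is just a matter of shifting the endpoint weights.
```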

This solution offers other interesting possibilities:

  • Geo clustering: if you want services hosted in specific Azure regions, you can build a cluster located wherever you need it.
  • Multi-versioning: you can have multiple versions deployed in production and track bugs with monitoring.


Lesson learned #4 – Never use Azure Accelerated Networking in your k8s cluster!! I really mean NEVER!!

Azure Accelerated Networking is a feature that bypasses the Azure virtual switch to give your VMs more direct access to the physical network. This is quite powerful for many scenarios, but with a Kubernetes cluster it’s painful: pods intermittently fail to reach the DNS service hosted inside Kubernetes, and you’ll face 5-second timeouts on your HTTP requests.
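
A quick way to see whether you’re affected is to resolve an in-cluster name from a throwaway pod; a minimal check, assuming kubectl is pointed at the cluster:

```bash
# Resolve an in-cluster service name from a throwaway pod.
kubectl run dns-test --rm -it --image=busybox --restart=Never -- \
  nslookup kubernetes.default.svc.cluster.local
# If the lookup hangs for ~5 seconds before answering (or fails outright),
# you are likely hitting the in-cluster DNS issue described above.
```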

Another annoying thing that happens if you enable Accelerated Networking on a VM is that you’ll face NTP issues: your VM (and its pods) can drift by a few seconds, and everything that needs exact time synchronization, authentication first and foremost, will break.
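
To measure the drift, you can query an external NTP server from the node without touching its clock. A minimal check, assuming ntpdate is installed on the node (it may not be by default):

```bash
# Query (without setting) an external NTP server to measure the clock offset.
sudo ntpdate -q pool.ntp.org
# An offset of even a few seconds is enough to break Kerberos/AAD token
# validation and anything else that relies on accurate time.
```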


I hope these few lines can help you during your Kubernetes journey on Azure!