Our K8s Cluster
We use Kubernetes for nearly all of our application deployment needs. Previously, DevOps was conducted manually by SSHing into a VM and git cloning code,
environment variables, and secrets, followed by restarting any running services. About six months into the Mobility Scooter Web App (MSWA)
project, we realized that we would need stronger security and reliability guarantees due to the sensitive nature of patient data. We experimented with Coolify
and a scripted setup of k3s, but found both to be lacking the appropriate reliability guarantees, and ultimately settled on an Ubuntu-based k8s cluster.
What follows is a description of all the features integrated into our cluster. Note that some of these may be broken or incomplete due to a past incident.
Core Features
ArgoCD
Argo is the bread and butter of our cluster. While k8s maintains the desired number of replicas for a given deployment, it seldom provides this kind of self-healing for other resources. Not only does Argo enable us to reduce drift, it also allows us to import helm charts and other remote manifests, which is what powers the GitOps push-to-deploy feature.
It is not without its flaws, however. Being git-based can have its drawbacks, including requiring a commit to update image versions when a new release is created. Tools such as Kargo attempt to fill the gap, but it’s important to recognize that no system is perfect.
Argo is pull-based, meaning it constantly polls target repos for updates. To reduce the likelihood of getting rate limited, we supply Argo with a scoped GitHub access token with read-only permissions.
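To make the drift correction and token setup concrete, here is a minimal sketch of an Argo CD Application with automated sync, plus a repository credential Secret. The repository URL, paths, and names are placeholders rather than our actual configuration.

```yaml
# Hypothetical example: repoURL, path, and all names are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: example-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/example-infra.git
    targetRevision: main
    path: charts/example-app
  destination:
    server: https://kubernetes.default.svc
    namespace: example-app
  syncPolicy:
    automated:
      prune: true     # delete resources that were removed from git
      selfHeal: true  # revert manual changes made directly on the cluster
    syncOptions:
      - CreateNamespace=true
---
# Repository credentials that Argo CD picks up through the secret-type label.
apiVersion: v1
kind: Secret
metadata:
  name: example-infra-repo
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: repository
stringData:
  type: git
  url: https://github.com/example-org/example-infra.git
  username: git                               # placeholder username paired with the token
  password: "REPLACE_WITH_READ_ONLY_TOKEN"    # scoped, read-only GitHub token
```

With `selfHeal` enabled, a manual `kubectl edit` on a managed resource is reverted on the next reconciliation, which is the drift reduction described above.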
cert-manager
cert-manager is how we provision TLS certificates for public-facing ingresses. In layman’s terms, this is how we get browsers to trust https, as an invalid TLS
cert will cause most browsers to block the site until the visitor explicitly consents. Outside of UX, it’s also critical for maintaining security by encrypting
data in transit between the client (your browser) and the server. Internal apps use something similar known as mTLS, but one of the nice features of using a
backend-for-frontend (BFF) architecture is that the BFF can directly communicate with the API rather than having to deal with another network hop over the public
internet.
CloudNative-pg (cnpg)
CNPG is an operator for creating and managing Postgres clusters on k8s. One feature that stands out is the ability to create databases
(e.g. CREATE DATABASE, not another instance of the db) using yaml manifests - this enables further customization via helm charts.
The primary instance stores data and the write-ahead log (WAL) in two separate volumes on JS2, meaning that if the cluster were to go down, the data is still persisted.
CNPG database instances are less mutable than regular k8s resources. Some properties, such as the size of the storage capacity, are immutable and may require you to resize the volume manually on Horizon.
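As an illustration of the points above, the following sketch declares a CNPG Cluster with separate data and WAL volumes, and a Database resource that maps to a CREATE DATABASE statement. All names and sizes are hypothetical, and the Database kind assumes a CNPG release recent enough to ship that CRD.

```yaml
# Hypothetical names and sizes; storage classes are omitted and fall back to the cluster default.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: example-pg
  namespace: database
spec:
  instances: 1
  storage:
    size: 20Gi      # data volume
  walStorage:
    size: 5Gi       # separate volume for the write-ahead log
---
apiVersion: postgresql.cnpg.io/v1
kind: Database
metadata:
  name: example-db
  namespace: database
spec:
  name: example_db  # name passed to CREATE DATABASE
  owner: app        # assumes the default "app" role exists in the cluster
  cluster:
    name: example-pg
```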
DevPod
DevPod was implemented out of the need for developers with less RAM to run a full instance of the MSWA locally during development. DevPod is entirely client-side; the only cluster-side configuration is to limit the access of people using the DevPod kubeconfig. Scripts can be found in the infra repo for generating said configs to share with new members during onboarding.
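One way such an access limit could be expressed is a namespaced Role bound to a dedicated ServiceAccount, whose token then backs the shared kubeconfig. This is only a sketch: the namespace, names, and the exact resource list DevPod needs are assumptions, not a copy of what the infra repo generates.

```yaml
# Hypothetical namespace, ServiceAccount, and resource list.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: devpod-user
  namespace: devpod
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: devpod-workspace
  namespace: devpod
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/exec", "pods/log", "persistentvolumeclaims"]
    verbs: ["get", "list", "watch", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: devpod-workspace
  namespace: devpod
subjects:
  - kind: ServiceAccount
    name: devpod-user
    namespace: devpod
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: devpod-workspace
```

Because both the Role and the RoleBinding are namespaced, a kubeconfig built from this ServiceAccount cannot read or modify anything outside the `devpod` namespace.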
external-dns
external-dns is a tool for automatically provisioning DNS records for Ingresses. Below
is an example of how to use external-dns, cert-manager, and traefik to expose a Service via https to the public internet.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: argocd-ingress
  namespace: argocd
  annotations:
    # *.cis240470.projects.jetstream-cloud.org is our project's domain
    external-dns.alpha.kubernetes.io/hostname: argocd.cis240470.projects.jetstream-cloud.org
    traefik.ingress.kubernetes.io/router.entrypoints: websecure
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: traefik
  tls:
    - hosts:
        - argocd.cis240470.projects.jetstream-cloud.org
      secretName: argocd-tls
  rules:
    - host: argocd.cis240470.projects.jetstream-cloud.org
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: argocd-server
                port:
                  number: 80
external-secrets
The External Secrets Operator (ESO) is an operator for syncing secrets imported from external data sources
to our cluster as k8s Secrets. Secret management is perhaps the most difficult part of running the cluster completely open source, with no external
secret provider. Most secrets are imported during cluster bootstrap from local .env files, something we frown upon, but it proved too expensive to
run a separate cluster with Infisical.
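For readers unfamiliar with ESO, a minimal ExternalSecret looks roughly like the sketch below. The store name, remote key, and Secret name are hypothetical, and the store itself (a SecretStore or ClusterSecretStore pointing at some provider) is assumed to exist already.

```yaml
# Hypothetical store, key, and secret names.
apiVersion: external-secrets.io/v1   # older ESO releases serve v1beta1 instead
kind: ExternalSecret
metadata:
  name: example-api-credentials
  namespace: example-app
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: example-store
  target:
    name: example-api-credentials   # name of the k8s Secret that ESO creates
  data:
    - secretKey: API_KEY             # key inside the resulting Secret
      remoteRef:
        key: example/api             # path in the external source
        property: api_key
```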
Grafana
Grafana is an open source monitoring platform that ships by default in clusters created by OpenStack Magnum. Primarily used for monitoring CPU, RAM and storage usage on cluster nodes, it can also be configured to push alarm messages to Discord or other platforms when certain conditions are met.
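As a rough sketch of the Discord integration, Grafana's alerting can be provisioned from a file that declares a contact point. The contact point name and webhook URL below are placeholders, and whether the Magnum-provided Grafana mounts provisioning files this way is an assumption.

```yaml
# Hypothetical contact point; the webhook URL is a placeholder.
apiVersion: 1
contactPoints:
  - orgId: 1
    name: discord-alerts
    receivers:
      - uid: discord-alerts
        type: discord
        settings:
          url: https://discord.com/api/webhooks/REPLACE_ME
```

Alert rules that route to `discord-alerts` would then post to the channel behind that webhook.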
Kargo
Kargo is the missing piece from Argo; it allows us to monitor container registries and update Argo Applications to reflect certain
image constraints, such as always using the most recent version of a service. Kargo treats build artifacts as their own resources in k8s, and allows you
to create pipelines called Stages that in turn modify Argo apps, among other things.
A Kargo promotion pipeline ships with every one of our custom helm charts, enabling that sweet deploy-on-push functionality.
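The registry-watching half of this can be sketched with a Kargo Warehouse. The project namespace and image repository are hypothetical. The matching Stage is left out because its promotion step syntax differs between Kargo versions.

```yaml
# Hypothetical project namespace and image repository.
apiVersion: kargo.akuity.io/v1alpha1
kind: Warehouse
metadata:
  name: example-service
  namespace: example-project   # must be the namespace of a Kargo Project
spec:
  subscriptions:
    - image:
        repoURL: ghcr.io/example-org/example-service
        imageSelectionStrategy: NewestBuild   # track the most recently built image
```

Each new image the Warehouse discovers becomes a piece of Freight, which a Stage can then promote into an Argo Application.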
Supplemental Features
The following are either work-in-progress features or upstream dependencies of some of the above features.
clustersecret
ClusterSecret is a simple helm chart for replicating Secrets across multiple Namespaces. This is notably used to propagate the randomly
generated password to a connection string for CNPG so it can be accessed by MSWA.
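A minimal sketch of that replication, assuming the clustersecret.io API and hypothetical Secret and namespace names:

```yaml
# Hypothetical names and namespaces.
apiVersion: clustersecret.io/v1
kind: ClusterSecret
metadata:
  name: example-pg-credentials
  namespace: clustersecret
matchNamespace:
  - example-app            # entries are matched as patterns against namespace names
data:
  valueFrom:
    secretKeyRef:
      name: example-pg-password   # source Secret to copy from
      namespace: database
```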
designate-cert-manager-webhook
This helm chart provides a single pod that enables DNS-01 challenges for cert-manager via OpenStack Designate. This is the most reliable way to integrate with Let’s Encrypt.
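For context, a ClusterIssuer that delegates DNS-01 to a webhook solver looks roughly like this. The issuer name matches the `letsencrypt-prod` annotation used in the Ingress example, but the email, `groupName`, and `solverName` are placeholders and must be replaced with the values the webhook chart actually registers.

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.org               # placeholder contact address
    privateKeySecretRef:
      name: letsencrypt-prod-account-key   # Secret holding the ACME account key
    solvers:
      - dns01:
          webhook:
            groupName: acme.example.org    # placeholder: must match the webhook chart's group name
            solverName: example-designate  # placeholder: must match the solver the webhook registers
```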
Eraser
Eraser is an operator for managing the storage space on VMs within a cluster. It was integrated after an ICM occurred where two 32GB docker images starved the root volumes of all of our nodes, leading to an entire re-roll. It automatically cleans up unused images and temporary files periodically.
Istio
Istio was created by Google, and serves as a service mesh, or an interface for monitoring and coordinating networking within a cluster. While not currently in use, this would be a great candidate to be implemented alongside the model-service.
KEDA
KEDA is an operator for auto-scaling workloads such as Deployments based on external criteria. While not currently in use, it will become useful during beta tests of MSWA
when we may wish to increase the number of video workers based on the volume of videos in the queue.
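Since this is not deployed yet, the following is only a sketch of what queue-based scaling could look like, using KEDA's Kafka scaler against a Kafka-compatible broker. The Deployment name, broker address, topic, consumer group, and thresholds are all hypothetical.

```yaml
# Hypothetical names, topic, and thresholds.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: video-worker
  namespace: example-app
spec:
  scaleTargetRef:
    name: video-worker        # Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 5
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: redpanda.redpanda.svc.cluster.local:9092
        consumerGroup: video-workers
        topic: videos
        lagThreshold: "10"    # target number of unprocessed messages per replica
```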
KEDA does not natively support auto-scaling based on HTTP traffic, interestingly. If you are interested in that or other serverless tech for k8s, check out Knative, but be warned that it requires at least 60GB of RAM as a baseline.
Redpanda
Redpanda, deployed here through the Redpanda Operator, is a modern replacement for Kafka, a queue service. Its protocol is 100% compatible with Kafka, which is why MSWA uses KafkaJS to interact with the queue.
SecretGenerator
kubernetes-secret-generator is a helm chart for randomly generating secrets, which is useful for automating parts of cluster bootstrapping, such as generating a password for our CNPG cluster.
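Assuming this refers to the mittwald kubernetes-secret-generator project, a generated secret is requested by annotating an otherwise empty Secret. The Secret name and namespace here are hypothetical.

```yaml
# Hypothetical Secret name and namespace.
apiVersion: v1
kind: Secret
metadata:
  name: example-pg-password
  namespace: database
  annotations:
    secret-generator.v1.mittwald.de/autogenerate: password   # key(s) to fill with random values
    secret-generator.v1.mittwald.de/length: "32"
type: Opaque
```

The controller fills in the `password` key once, so the value stays stable across later syncs.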
Traefik
Traefik is an incredibly popular proxy for both k8s and standalone applications. It allows routing rules to be defined via annotations on
Ingresses, and provides TLS termination (https -> http).
Valkey
Valkey was created during the Redis closed-source debacle, and is similar to Redpanda in the sense that it is a drop-in replacement for Redis that supports clustering and failover.