
I Promise Not to Break Production

Promoting Promises through Environments with Kratix



At the heart of Kratix, the framework for building platforms, are Promises. Via Kratix Promises, Platform engineers can codify their organisation's policies and opinions (security, compliance, etc.) into their as-a-Service software supply chain, making it straightforward for Application teams to consume the services via the Platform.


As our customers start developing their own Promises, a question we are often asked is:


How should I move my Promise from Development into Production?


As more Promises are developed, and more teams rely on those Promises to develop and ship their applications, getting the path to production wrong can have a significant impact:


  1. Teams may be surprised with new, unexpected Promise features impacting their workflows

  2. Teams may be blocked due to unforeseen downtime of the Platform

  3. Ultimately, applications running in production may be impacted


To effectively answer the question, we will need to cover a few points:

  1. How to develop your Promise to support promotion through environments

  2. What “Production” means for the Platform Engineering team

  3. What “Production” means for the users of your platform


In this blog post, we will go through each of these points and highlight some patterns you can follow through the development lifecycle of your Promises, in such a way that the transition from dev to prod is straightforward. You should take this post as a starting point, but do adapt it to the realities of your organisation!


As we go through the sections, we will build an example architecture of how you could build and promote Promises within your organisation.


How to develop your Promises

To answer the dev-to-prod journey question, we first need to cover what the development lifecycle of a Promise could look like.

A Promise is, in a nutshell, a piece of software. As such, it should be developed following the same patterns and processes you would use when building any other piece of software within your organisation. All the standard good practices apply: use version control, write tests at different levels, implement CI/CD for your changes, etc. These are well-understood practices that you are probably (hopefully!) already applying to other software you build.


But what’s a good testing strategy for Promises?


Promise writers usually have the following concerns in mind when changing their Promises:


  1. Is my Promise API correct? Are my resource requests still compliant/accepted by the API?

  2. Do my Pipeline containers work?

  3. Does my Promise installation work, and can a user request a new resource successfully?


Let’s look at each of these in more detail.


Testing the Promise API


The Promise API is a Kubernetes Custom Resource Definition. You can use the same strategies you use to test CRDs to validate the Promise API.


For example, you can extract the CRD and verify that you can dry-run apply it:


cat promise.yaml | yq '.spec.api' > crd.yaml

kubectl apply --dry-run=server --filename crd.yaml


You may also want to validate that a set of resource requests are still compliant with the new API definition. To do that, you can use tools like kubeconform:


kubeconform \
  -schema-location default \
  -schema-location promise-schema.json \
  /path/to/resource-request.yaml


To generate a JSON schema from a Kubernetes CRD, please refer to the kubeconform documentation.


Once you are satisfied with your Promise API, you can move on to testing your Workflow containers.


Testing the Pipeline containers


A Pipeline container works by accepting a series of inputs, running a series of commands, and generating a series of outputs. These containers are then chained one after the other in a Kratix Pipeline, which is then added to the Promise under the appropriate Workflow.


At this point, just by reading the description, you can probably already come up with a testing strategy for your containers (spoiler: it’s not that different than how you would test a function!). You should:


  1. Define a set of inputs that provide good coverage of your container

  2. For each input, define the expected output

  3. Execute the container

  4. Validate that the generated output is the expected one


Note that you don’t really need a Kubernetes cluster running Kratix to test your containers. In the Improving the Workflows section of the Kratix Workshop, we demonstrate how you could run the tests in Docker.
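
As an illustration, a minimal Docker-based test could mount a fixture at the paths Kratix would normally mount, run the container, and compare the generated output against a golden copy. The image name and fixture layout below are hypothetical:

#!/usr/bin/env bash
set -euo pipefail

# Hypothetical image name and fixture layout
image="my-registry/namespace-pipeline:dev"
fixture_dir="$(pwd)/test/fixtures/simple-request"
output_dir="$(mktemp -d)"

# Run the pipeline container with the fixture mounted where Kratix
# would normally mount the request, and capture its output
docker run --rm \
  --volume "${fixture_dir}/input:/kratix/input" \
  --volume "${output_dir}:/kratix/output" \
  "${image}"

# Assert the generated documents match the expected ("golden") output
diff --recursive "${fixture_dir}/expected-output" "${output_dir}"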


You can also write your containers in such a way that they are executable locally. For example, your container may get the input and output locations from environment variables, falling back to the Kratix-provided mounted volumes. For instance, the container script for a "Namespace" Promise could look like this:


#!/usr/bin/env bash

set -euxo pipefail

# Fall back to the Kratix-mounted locations when the env vars are not set
input_file="${INPUT_FILE:-"/kratix/input/object.yaml"}"
output_dir="${OUTPUT_DIR:-"/kratix/output"}"

name="$(yq '.spec.namespaceName' "${input_file}")"

kubectl create namespace "${name}" \
  --dry-run=client --output yaml > "${output_dir}/namespace.yaml"


In this example, the script assigns the INPUT_FILE environment variable value to the input_file variable if the former is set. If not, the script sets input_file to /kratix/input/object.yaml, which is where Kratix provides the container's input.


By setting these environment variables, you can easily execute the script locally, which in turn makes testing straightforward.
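
For example, running the script locally against a test fixture might look like this (the script path and fixture file are hypothetical):

mkdir -p /tmp/pipeline-output

INPUT_FILE=test/fixtures/request.yaml \
OUTPUT_DIR=/tmp/pipeline-output \
  ./scripts/create-namespace.sh

cat /tmp/pipeline-output/namespace.yaml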


Note that your container may execute imperative commands against external APIs, causing side effects when it runs. Again, the usual testing strategies apply: mocking, stubbing, dependency injection, etc.
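
As a small illustration of the stubbing approach for a shell-based Pipeline, you could place a fake kubectl first on the PATH that records its arguments instead of talking to a real API (the script path and fixture file are again hypothetical):

stub_dir="$(mktemp -d)"
calls_log="$(mktemp)"

# A fake kubectl that records how it was invoked instead of doing anything
cat > "${stub_dir}/kubectl" <<EOF
#!/usr/bin/env bash
echo "kubectl \$*" >> "${calls_log}"
EOF
chmod +x "${stub_dir}/kubectl"

# Run the Pipeline script with the stub taking precedence over the real kubectl
PATH="${stub_dir}:${PATH}" \
INPUT_FILE=test/fixtures/request.yaml \
OUTPUT_DIR="$(mktemp -d)" \
  ./scripts/create-namespace.sh

# Assert the expected imperative call was made
grep --quiet "create namespace" "${calls_log}"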


Testing it all together


With your API and Pipeline containers properly tested, the next level of testing is whether it will actually work for your users. At this level, you are probably looking at some level of integration or system testing.


For that, you will likely need a Kubernetes cluster where you can install Kratix and your Promises. How much to test, in what way, and with how many clusters are all decisions you must make. In the most basic scenario, you could have a KinD cluster with Kratix installed in single-cluster mode, and your tests could be basic assertions on the state of the cluster after installing Promises and sending requests.
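
A rough sketch of that basic scenario might look like the following. The install manifest, Promise file, CRD name, and label are placeholders; follow the Kratix documentation for the actual single-cluster install steps:

# Create a throwaway cluster for the test run
kind create cluster --name promise-test

# Install Kratix in single-cluster mode (placeholder manifest name)
kubectl apply --filename kratix-single-cluster-install.yaml

# Install the Promise under test and make a request against it
kubectl apply --filename promise.yaml
kubectl apply --filename resource-request.yaml

# Basic assertions on the state of the cluster
kubectl get crds | grep my-promise-api
kubectl wait --for=condition=Ready pod --selector app=my-service --timeout=300s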


This basic setup is exactly how the Promises available in the Kratix Marketplace are tested. For instance, for the Redis Promise, our testing includes two assertions at the system level (the test script is available here):


  1. On Promise installation, are the Redis Operator CRDs installed correctly? Does the operator deployment start?

  2. On Redis request, does a Redis cluster start successfully?


You could include further assertions, for example:


  1. Can I connect and use the created Redis?

  2. Does my resource get the right `status`?

  3. Is it being scheduled to the right clusters?


It all depends on your test pyramid and how much confidence prior tests gave you.

Consider incorporating upgrades into your plan: most platform users won't be creating resources from scratch, so including upgrade tests in your testing suite could be beneficial.
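
An upgrade test could follow the same shape: install the currently-released Promise version, create a resource, apply the new version over the top, and assert the pre-existing resource is still healthy. The file names and readiness check below are illustrative:

# Install the current Promise version and create a resource with it
kubectl apply --filename promise-v1.yaml
kubectl apply --filename resource-request.yaml
kubectl wait --for=condition=Ready pod --selector app=my-service --timeout=300s

# Apply the new Promise version over the top
kubectl apply --filename promise-v2.yaml

# Assert the pre-existing resource is reconciled and remains healthy
kubectl wait --for=condition=Ready pod --selector app=my-service --timeout=300s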

At this stage, your testing infrastructure may look something like this:



Eventually, you will be confident with your Promise and ready to put it in the hands of the developers using the platform. This brings us to the second section of this post.


What “Production” means for Platform Engineers


Platform Engineers should view any environment actively used by developers (the platform users) as their production environment. This includes environments labeled as "dev", "staging", or similar by the developers. This perspective is important because if any of these environments become unavailable, it could block the platform users for the duration of the outage, costing the business time and money.



Therefore, only thoroughly-tested Promises should be promoted to user-facing environments, typically starting with their development environments, as seen in the diagram above. Once promoted, you can run another set of automated tests, monitor metrics as upgrades roll out, and rely on users to raise any issues as they use the new features.


Your development infrastructure may, at this stage, look similar to this:



What “Production” means for Platform Users


Similarly, "production" for platform users refers to the environments where their customer-facing applications run. For the Platform team, this "production" is simply another one of their production environments. However, the service level agreements (SLAs) and service level objectives (SLOs) may differ between the developers' production and development environments.



Tried-and-tested Promises in the development environments can then be promoted to the production environments, following the previous patterns. Ultimately, your entire configuration may appear as follows:



One Platform vs Many Platforms


The architecture above is an example of what you could build to enable the promotion of Promises across the different environments. The multiple Kratix instances provide a clear separation and allow multiple versions of the same Promise to co-exist in the ecosystem while they are being tested and developed.


This approach doesn’t come without downsides, though. You will have multiple Kratix instances to manage. There may be drift between instances. Having a single view of everything that’s deployed across all environments may be complex.


An alternative is to have a single Platform. You would still need your own controlled testing environment, but all of your users would be using a single Platform cluster running Kratix. You can leverage Kratix’s Multiple Destination scheduling features to ensure that production and development workloads are isolated from each other. Your environments could look something similar to this:

This architecture would address some of the problems you’d have with the multi-platform approach: there’s only one Kratix installation to manage; the “one platform” provides a single pane of glass to visualize your entire fleet; since there’s only “one Promise,” there’s no drift between the customers' dev and prod environments.
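
To hint at what that isolation could look like in practice, the Destinations registered with the single platform can be labelled per environment, and Promises can then schedule work by matching on those labels. The Destination names and label key below are illustrative; see the Kratix scheduling documentation for the exact selector syntax in your version:

# Label the Destinations registered with the single Kratix platform
kubectl label destination worker-dev environment=dev
kubectl label destination worker-prod environment=prod

# Confirm the labels that scheduling will match on
kubectl get destinations --show-labels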


On the other hand, Platform Engineers must be more diligent in the testing of their Promises. Since this model skips the user-testing step, promoting a Promise may directly impact production workloads. That means a more robust testing setup is needed, testing things like upgrades, updates, the effect on running instances, etc.


Expanding on this model would make this blog post extra long. Let us know in the comments section if you’d be curious to read a follow-up!


Conclusion


An architecture like the one proposed in this article allows Platform engineers to truly embrace continuous delivery, automatically integrating their changes with different environments as tests pass throughout their systems.


As discussed, this is not the only possible architecture. You may have your own business requirements and may want to introduce more (or fewer) checkpoints between environments. Kratix includes an array of features that empower Platform engineers to design the process that makes the most sense for their organisations.


Hopefully, this post gives you a good idea of where to start. If you would like to know more, or have any questions, please reach out to us at the Kratix community slack and we'll be happy to help!
