RFC 513: Application Specific Staging Resources (#515)

rix0rrr · web-flow · commit 55d07e0470a0 · 2023-12-19T16:23:14.000+01:00
Retrospective RFC #513 Not the best write-up, but it'll give us a location to start discussing. [Rendered version](https://github.com/aws/aws-cdk-rfcs/blob/huijbers/app-specific-bootstrapping/text/0513-app-specific-staging.md) --- _By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license_
diff --git a/text/0513-app-specific-staging.md b/text/0513-app-specific-staging.md
@@ -0,0 +1,305 @@
+# Application Specific Staging Resources
+
+* **Original Author(s)**: @rix0rrr
+* **Tracking Issue**: #513
+* **API Bar Raiser**: -
+
+Currently, to deploy any interesting applications the CDK requires an account to be bootstrapped: it requires the
+provisioning of roles and staging resources to hold "assets" (files and Docker images) before any application can
+be deployed.
+
+If those staging resources could be created as part of a normal application deployment, the requirement to precreate
+those resources is dropped. Users can choose to provision roles if they want to enable CI/CD or cross-account
+deployments, or they can choose not to bootstrap at all if they want to use existing credentials.
+
+## A brief history of synthesizers and bootstrapping
+
+The AWS CDK needs some infrastructure to deploy applications into an account and region. What supporting resources exist
+and what their names are is a contract between the CDK application and the AWS account. "Synthesizers" are the part of
+a CDK application that encode this contract: users prepare their account a certain way, and then pick a synthesizer
+that matches the resources they have provisioned (optionally configuring it with non-default parameters). Synthesizers
+were introduced in CDKv2; before that, there was only "the" default set of assumptions that the CDK would make about
+"the" account, and none of it was configurable.
+
+The process of preparing an AWS account to be used with a synthesizer is called "bootstrapping".
+
+### V1
+
+In the original bootstrapping stack, we create an S3 bucket to hold files: large CloudFormation templates and assets
+such as Lambda code. ECR repositories are created on-demand by the CLI if Docker images need to be uploaded.
+Originally, we added a Custom Resource to the template that would clean up the ECR repository when the Stack gets
+cleaned up. In 1.21.0, we removed this, and now leave cleanup of dynamically created ECR repositories to users. Asset
+locations are completely controlled by the CLI via parameters.
+
+All deployments are being done with the credentials of the user that runs the CLI.
+
+DOWNSIDES
+
+* Assets take up template parameters, of which there is a limited number (~50 when we built this system)
+* The dynamism and arbitrary ECR repo creation does not work well in CI/CD systems.
+* The user must have CLI credentials for each account they want to deploy to, and if a single app deployment should
+  go into multiple accounts they must selectively deploy stacks into different accounts using different sets of
+  credentials.
+
+### V2
+
+The bootstrap resources were redesigned as part of the development of CDK Pipelines, an opinionated construct that
+allows trivial deployment of any number of CDK stacks to any number of accounts and regions. The new design is meant to
+work for the CLI, a CodePipeline-based solution, as well as other CI/CD solutions in general. It also allows
+cross-region deployments.
+
+To that end, the bootstrap stack now creates (for each account and region combination):
+
+* A single S3 bucket and single ECR repository with well-known names (that need to be reflected in the CDK app if they are non-standard).
+* An encryption key for the S3 bucket
+* An Execution Role for the CloudFormation deployment
+* A role to trigger the deployment, a role to write to the S3 bucket, a role to write to the ECR repository
+* A role to look up context in the account
+* An SSM parameter with a version number of the bootstrap stack
+
+This solution supports CI/CD and cross-environment deployments through pre-provisioned roles, and removes
+the need for parameters by rendering the location of each asset directly into the template.
+
+DOWNSIDES
+
+* Some users don’t like the pre-provisioned roles and prefer the v1 situation where their existing credentials were used
+  for permissions.
+* A common complaint about the bootstrap stack is that the resources we create by default do not comply with a given
+  corporate policy, followed by an endless stream of feature requests to add this-and-that feature to the bootstrap
+  stack (block public access, block SSL, tag immutability, image scanning, etc. etc). We solve this by telling customers
+  to take the bootstrap template and customize it themselves, but CloudFormation templates can’t be patched simply and
+  this requires users to effectively “fork” our bootstrap stack and manually keep it up-to-date with incoming changes.
+* Because all staging resources need to be provisioned a priori and need to serve all types of applications, we can't
+  depend on application knowledge. Specifically, we won't know how many Docker images will be used in the application,
+  so we create a single ECR repository to hold all images. This has a number of downsides:
+  * Docker caching relies on pulling the “latest” image from a repository and skipping layers that were already built.
+    This doesn’t work if images built off of various different Dockerfiles are in the same repository.
+  * Lifecycle policies cannot be used because different images from potentially different applications with very
+    different life cycles are all in the same repository. The same was already true for S3, but the problem is
+    less severe because S3 is pretty cheap while ECR is not.
+  * Some people were using the V1 Docker image publishing mechanism not as a vehicle for uploading Docker images to be used
+    by the CDK’s CloudFormation deployment, but simply as a mechanism for building and publishing Docker images, to be
+    used by a completely different deployment later. The lack of control over the target ECR repository breaks this
+    use case (it required the development of an `aws-ecr-deployments` construct module, which does give the necessary
+    control but racks up costs by doubling ECR storage requirements, and still does not allow staging resource cleanup).
+  * We always create an empty ECR repository because we cannot know whether apps deployed into the account will need
+    it or not, so the ECR repository may go unused. AWS Security Hub will throw warnings about empty ECR repositories,
+    which makes customers uneasy.
+* Bootstrap stacks are expected to be account-wide, and mix assets from all applications. Some customers that deploy
+  multiple applications into the same account are very sensitive to this mixing, and would rather keep these resources
+  separate. They can do multiple bootstrap stacks in the same account, but this is all a bit onerous.
+
+## A new proposal: application specific staging resources
+
+The bootstrap stack contains two classes of resources: staging resources, which hold assets (bucket and ECR repo), and
+roles, which allow for unattended (CI/CD) and cross-account access. In the new proposal, we will separate out the
+staging resources from the roles. Roles will still be bootstrapped (if used), but staging resources will not.
+
+* Staging resources will be created on a per-CDK app basis. We will create one S3 bucket with different object prefixes
+  for different types of assets (see Appendix A: two types of assets), and an ECR repository per Docker image. Resource
+  access roles can also be created on an as-needed basis. This solves the problems of asset resources of different
+  applications mixing together, and it would also remove the need for garbage collection by allowing use of life cycle
+  rules.
+* Since the roles are now the only things that need to be bootstrapped, that will have a number of advantages:
+  * Bootstrapping will be faster since the heavy resource of a KMS key is no longer involved.
+  * Because roles are a global resource, every account now only needs to be bootstrapped once. Not needing
+    per-region control also works a lot better with Control Tower + automatic StackSets (which do not
+    allow region control).
+
+If we can make the bootstrapping resources part of the CDK application, then users now have a familiar way to customize
+them to their heart’s content, so the treadmill of bootstrap stack customization requests is going to disappear, and
+customers will also not need to customize the bootstrap template anymore (assuming their customizations have to do with
+the resources instead of the roles).
+
+A downside is potentially that we lose the ability to have a version number on the bootstrapped resources (because SSM
+is not global), but we might say that’s not necessary anymore since the Roles are unlikely to change often.
+
+> If we wanted to maintain versioning on the Roles, we could say that the stack always must be deployed in `us-east-1`
+> and that’s where we look for the version; however, this may require cross-internet traffic and therefore be considered
+> dodgy from a reliability perspective, and we could only do the versioning check using the CLI, not from the
+> CloudFormation template.  Of course we’ll have to pick the correct leader region per partition, `aws-cn`, `aws-iso`, etc.
+
+### How it will work in practice
+
+Bootstrapping resources are currently designed the way they are because the CLI relies on the assumption that the
+bootstrap resources are present with a well-known name, before the first CloudFormation deployment starts. In other
+words, this is purely a limitation of the orchestration, that we can take away.
+
+Here’s what we’re going to do:
+
+* We will introduce a new Stack Synthesizer, called `AppStagingSynthesizer`.
+* This synthesizer will create a support stack with the bucket, and an ECR repository per Docker image.
+* Assets will have a dependency on the support stack. This is a new concept: assets are an orchestration artifact
+  that looks independent, like stacks are, but in practice the orchestration ignores everything except stacks and
+  treats assets as being part of a stack.
+  * Docker assets may still be built before the first deployment (although for proper caching we need the repository
+    to exist first), but will only be uploaded when it’s their time in the orchestration workflow.
+* For a minimal diff these resources could have fixed names, but we could add support for Stack Outputs and assets could
+  have support for Parameters, so that we can thread generated bucket and repository names through the system. For now,
+  we will do fixed names for the staging resources.
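The ordering constraint described in the list above (staging stack first, then asset publishing, then the application stack) can be sketched as a small dependency graph. This is a minimal illustration under stated assumptions, not the CLI's actual orchestration code: the `WorkNode` shape, the `deploymentOrder` function, and the node ids are hypothetical names introduced here.

```typescript
// Hypothetical model of the orchestration graph; these names are not CDK APIs.
type NodeKind = 'stack' | 'asset-publish';

interface WorkNode {
  id: string;
  kind: NodeKind;
  dependsOn: string[]; // ids of nodes that must finish first
}

// Return node ids in an order that respects every dependency
// (depth-first topological sort with cycle detection).
function deploymentOrder(nodes: WorkNode[]): string[] {
  const byId = new Map<string, WorkNode>();
  for (const n of nodes) byId.set(n.id, n);

  const done = new Set<string>();
  const inProgress = new Set<string>();
  const order: string[] = [];

  const visit = (id: string): void => {
    if (done.has(id)) return;
    if (inProgress.has(id)) throw new Error(`Dependency cycle at ${id}`);
    const node = byId.get(id);
    if (!node) throw new Error(`Unknown node ${id}`);
    inProgress.add(id);
    for (const dep of node.dependsOn) visit(dep);
    inProgress.delete(id);
    done.add(id);
    order.push(id);
  };

  for (const n of nodes) visit(n.id);
  return order;
}

// Assets depend on the staging stack; the application stack depends on its assets.
const order = deploymentOrder([
  { id: 'AppStack', kind: 'stack', dependsOn: ['publish:lambda.zip', 'publish:web-image'] },
  { id: 'publish:lambda.zip', kind: 'asset-publish', dependsOn: ['StagingStack'] },
  { id: 'publish:web-image', kind: 'asset-publish', dependsOn: ['StagingStack'] },
  { id: 'StagingStack', kind: 'stack', dependsOn: [] },
]);
console.log(order.join(' -> '));
```

In this sketch the staging stack always sorts ahead of every asset upload, which is the property the new asset-to-stack dependency is meant to guarantee. Building a Docker image could still happen earlier; only the upload step is modelled as a node.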
+
+### What the API looks like
+
+To use the new synthesizer:
+
+```ts
+import { App, Duration } from 'aws-cdk-lib';
+import { AppStagingSynthesizer, DeploymentIdentities } from '@aws-cdk/app-staging-synthesizer';
+
+const app = new App({
+  defaultStackSynthesizer: AppStagingSynthesizer.defaultResources({
+    appId: 'my-app-id', // put a unique id here
+    deploymentIdentities: DeploymentIdentities.defaultBootstrapRoles({ bootstrapRegion: 'us-east-1' }),
+
+    // How long to keep File and Docker assets around for rollbacks (without requiring resynth)
+    deployTimeFileAssetLifetime: Duration.days(100),
+    imageAssetVersionCount: 10,
+  }),
+});
+```
+
+For any additional customization (such as using custom buckets or ECR repositories), `DefaultStagingStack`
+can be subclassed or a full reimplementation of `IStagingResources` can be provided:
+
+```ts
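+import { App, FileAssetSource } from 'aws-cdk-lib';
+import * as s3 from 'aws-cdk-lib/aws-s3';
+import { AppStagingSynthesizer, DefaultStagingStack, FileStagingLocation } from '@aws-cdk/app-staging-synthesizer';
+
+class MyStagingStack extends DefaultStagingStack {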
+  private bucket?: s3.Bucket;
+
+  public addFile(asset: FileAssetSource): FileStagingLocation {
+    // Make sure the bucket exists before handing out its (fixed) name
+    this.createOrGetBucket();
+
+    return {
+      bucketName: 'my-asset-bucket',
+      dependencyStack: this,
+    };
+  }
+
+  private createOrGetBucket() {
+    if (!this.bucket) {
+      this.bucket = new s3.Bucket(this, 'Bucket', {
+        bucketName: 'my-asset-bucket',
+      });
+    }
+    return this.bucket;
+  }
+}
+
+const app = new App({
+  defaultStackSynthesizer: AppStagingSynthesizer.customFactory({
+    factory: {
+      obtainStagingResources(stack, context) {
+        const myApp = App.of(stack);
+        return new MyStagingStack(myApp, `CustomStagingStack-${context.environmentString}`, {});
+      },
+    },
+  }),
+});
+```
+
+---
+
+Ticking the box below indicates that the public API of this RFC has been
+signed-off by the API bar raiser (the `api-approved` label was applied to the
+RFC pull request):
+
+```
+[ ] Signed-off by API Bar Raiser @xxxxx
+```
+
+## Public FAQ
+
+### What are we launching today?
+
+We are launching a new synthesizer that has fewer demands on the AWS account that CDK apps are deployed into. It only
+needs preprovisioned Roles, and those are only necessary for CI/CD deployments or for cross-account deployments. For
+same-account CLI deployments, no bootstrapping is necessary anymore. If you are using bootstrapped roles anyway,
+they only need to be provisioned in one region, making it easier to use with StackSets.
+
+The new staging resources are specific to an application and can be cleaned up alongside the application. In addition,
+the way the staging resources are structured, they now allow the use of lifecycle rules, keeping costs down for
+running CDK applications over a long period of time.
+
+### Why should I use this feature?
+
+You should use this feature if you:
+
+- Want to take advantage of lifecycle rules on asset staging resources;
+- Do not use ECR and don't want to see the Security Hub warning that tells you that you have an empty ECR repository;
+- Need to deploy to multiple regions in a set of accounts and want to use StackSets to bootstrap the accounts;
+- Want to deploy an application and remove it and be sure that the assets have been cleaned up as well.
+
+## Internal FAQ
+
+### Why should we _not_ do this?
+
+Users generally don't appreciate change, especially if it saddles them with busywork. While the migration path will be
+purely optional, and there are definite benefits to be had, synthesis+bootstrapping is already a sore spot for users
+(it’s hard to explain and therefore a bit under-documented) and introducing more churn may lead to backlash.
+
+### What is the high-level project plan?
+
+- We will release the new synthesizer as an optional feature, initially only for the CLI.
+- CDK Pipelines support can be added later. When Pipelines support is added, it should be taken into
+  account that the time interval between stage deployments may be significant, especially if it involves manual
+  approval steps. We must take care that the docker images published to the Testing stage are not rebuilt for
+  the Production stage, but are replicated.
+- We have to clearly explain, in the Developer Guide, the concept of Synthesizers, the account contract, and
+  Bootstrapping, along with the choices users have and how they should navigate them.
+- Customization by subclassing is possible, but we will probably have to selectively expose some protected helper
+  functions to make it more convenient. We will do that when feature requests start coming in.
+- After a tryout period, we will move the synthesizer into the core library and document it as a possible alternative
+  in the developer guide, and we will probably vend a bootstrap template specifically for this synthesizer.
+
+### New bootstrap template
+
+By introducing a new template, we technically have an opportunity to rename roles and get rid of the `hnb659fds`
+identifier that customers hate. However, to make the migration from the current bootstrap stack as smooth as possible,
+we probably should NOT be taking this opportunity and just keep the same role names.
+
+The new bootstrap template will contain exactly the **CloudFormation Execution Role**, **Deployment Role**, and **Lookup Role**
+from the current template, and nothing else.
+
+We can put a version on it for informational purposes, but that version will not be checkable by CloudFormation deployments;
+perhaps it could be made checkable by the CLI during `cdk deploy` time. At least `cdk bootstrap` will be able to look at the
+version to prevent downgrading.
+
+The bootstrap template will be selected by either running `cdk bootstrap` in an app directory that uses the `AppStagingSynthesizer`,
+or passing a command-line flag to CDK bootstrap: `cdk bootstrap --synthesizer=[legacy|default|appstaging]`. If `cdk bootstrap` detects
+it is changing the "type" of bootstrap stack, it will throw up a confirmation prompt with an explanation of the consequences:
+
+```
+$ cdk bootstrap --synthesizer=appstaging
+This operation will change the style of bootstrap stack from "default" version 18 to "appstaging" version 1.
+This bootstrap stack style has been designed for the AppStagingSynthesizer; make sure that you are using that synthesizer
+in the CDK apps you plan to deploy to this environment. For more information, see http://amzn.to/5vjQYrtejA.
+Continue (y/N)?
+```
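The two checks described in this section (prompt when the bootstrap "type" changes, and prevent version downgrades within the same type) can be sketched as one decision function. This is only an illustration of the intended behaviour: `planBootstrap`, `BootstrapState`, and `BootstrapDecision` are hypothetical names, and how the CLI would actually read the deployed style and version is not modelled.

```typescript
// Hypothetical model of the check described above; none of these names are CDK CLI APIs.
type BootstrapStyle = 'legacy' | 'default' | 'appstaging';

interface BootstrapState {
  style: BootstrapStyle;
  version: number;
}

type BootstrapDecision =
  | { action: 'proceed' }
  | { action: 'confirm'; message: string }
  | { action: 'refuse'; reason: string };

function planBootstrap(current: BootstrapState | undefined, requested: BootstrapState): BootstrapDecision {
  // Nothing deployed yet: there is no existing contract to protect.
  if (current === undefined) return { action: 'proceed' };

  // Switching styles changes the account contract, so ask the user first.
  if (current.style !== requested.style) {
    return {
      action: 'confirm',
      message:
        `This operation will change the style of bootstrap stack from "${current.style}" ` +
        `version ${current.version} to "${requested.style}" version ${requested.version}.`,
    };
  }

  // Same style: only the version can differ, and it must not go backwards.
  if (requested.version < current.version) {
    return {
      action: 'refuse',
      reason: `Refusing to downgrade from version ${current.version} to ${requested.version}.`,
    };
  }
  return { action: 'proceed' };
}

// The situation from the console transcript above: "default" v18 -> "appstaging" v1.
const decision = planBootstrap({ style: 'default', version: 18 }, { style: 'appstaging', version: 1 });
console.log(decision.action);
```

Note that the version comparison is only meaningful within one style: version 1 of "appstaging" is not a downgrade from version 18 of "default", which is why the style check comes first.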
+
+### Are there any open issues that need to be addressed later?
+
+- The template for the staging resources stack must be small enough to fit into a CloudFormation API call, which means
+  it may not exceed 50kB. Since every ECR repository will add to this size, we have to limit the count. We may need
+  to create multiple stacks using an overflow strategy to lift this limit.
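One possible shape for the overflow strategy mentioned above is a greedy packing of staging resources into as many stacks as needed. The sketch below is an assumption-laden illustration: only the 50kB ceiling comes from the text, while the per-repository and per-template byte estimates, `StagedResource`, and `packIntoStacks` are made up for the example.

```typescript
// The 50kB ceiling from the text, taken here as 50 * 1024 bytes.
const TEMPLATE_LIMIT_BYTES = 50 * 1024;

// Hypothetical description of one resource in the staging stack.
interface StagedResource {
  logicalId: string;
  estimatedBytes: number; // rough size of the rendered resource in the template
}

// Greedy packing in input order: start a new overflow stack whenever the next
// resource would push the current template over the limit.
function packIntoStacks(
  resources: StagedResource[],
  limitBytes: number,
  baseBytes: number, // fixed overhead of every template (bucket, outputs, ...)
): StagedResource[][] {
  const stacks: StagedResource[][] = [];
  let current: StagedResource[] = [];
  let used = baseBytes;

  for (const r of resources) {
    if (baseBytes + r.estimatedBytes > limitBytes) {
      throw new Error(`${r.logicalId} can never fit in a single template`);
    }
    if (used + r.estimatedBytes > limitBytes) {
      stacks.push(current);
      current = [];
      used = baseBytes;
    }
    current.push(r);
    used += r.estimatedBytes;
  }
  if (current.length > 0) stacks.push(current);
  return stacks;
}

// Made-up estimate: 120 repositories of ~1,000 bytes each, 2,000 bytes of overhead per template.
const repos: StagedResource[] = [];
for (let i = 0; i < 120; i++) {
  repos.push({ logicalId: `Repo${i}`, estimatedBytes: 1000 });
}
const packed = packIntoStacks(repos, TEMPLATE_LIMIT_BYTES, 2000);
console.log(packed.map((s) => s.length));
```

With these invented numbers each template holds 49 repositories, so 120 repositories spread over three stacks. A real implementation would also need stable assignment of repositories to stacks across synths, which this sketch does not address.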
+
+## Appendix A: two types of assets
+
+There are two types of assets:
+
+* “Handoff” assets: these are temporarily put somewhere, so that in the course of a service call we can point to them.
+  The service will make its own copy of these assets. Large CloudFormation templates and Lambda code bundles are
+  examples of this: CloudFormation will only read the template once during the deployment,
+  and the Lambda service will make a private copy of the S3 file.
+  * Rollbacks by means of a pure-CloudFormation deployment (so not a fresh deployment that involves a CLI call) may
+    require presence of the old handoff asset for a while, so it shouldn’t be deleted right away, but it is reasonable
+    to put a lifecycle policy on handoff assets, equal to the longest period of time a user should still reasonably
+    expect to want to do a rollback in (see the BONES sev2 and damage control campaign from a couple of years ago when
+    the BONES team decided a month was a reasonable period and some service team wanted to roll back to a version of 2
+    months old).
+* “Live” assets: these get continuously accessed in their staged location by the running application. Examples are ALL
+  Docker images (ECS will constantly pull from the user’s ECR repository, and never makes its own copy), and some
+  asset-assisted conveniences like CodeBuild shellables or CFN-init scripts.
+  * These can in principle only be garbage collected by mark-and-sweep: we must know they are not needed by any
+    current CDK stacks, nor by any CDK stack revisions the user might want to roll back to.
+  * However, for ECR images we can do slightly better: since we have an ECR repository per docker image per
+    application, we can use a lifecycle policy of the form “keep only the most recent 5 images.”
+  * That leaves only certain eccentric types of file assets which are not collectible (until the entire application
+    gets deleted). This might be a “good enough” position to be in.
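The “keep only the most recent 5 images” rule for per-image ECR repositories can be sketched as follows. The policy text follows the lifecycle policy JSON format that ECR documents (`imageCountMoreThan` with an `expire` action); the helper names `keepMostRecentPolicy` and `imagesToExpire`, the `ImageRecord` shape, and the sample digests are assumptions for illustration, and the second function is only a local simulation of what such a rule would select.

```typescript
// Hypothetical record of an image in one per-asset repository.
interface ImageRecord {
  digest: string;
  pushedAt: number; // push time, e.g. epoch seconds
}

// Render an ECR lifecycle policy that expires everything beyond the newest `count` images.
function keepMostRecentPolicy(count: number): string {
  return JSON.stringify(
    {
      rules: [
        {
          rulePriority: 1,
          description: `Keep only the ${count} most recent images`,
          selection: { tagStatus: 'any', countType: 'imageCountMoreThan', countNumber: count },
          action: { type: 'expire' },
        },
      ],
    },
    null,
    2,
  );
}

// Local simulation: sort newest first and return the images past the first `count`.
function imagesToExpire(images: ImageRecord[], count: number): ImageRecord[] {
  return [...images].sort((a, b) => b.pushedAt - a.pushedAt).slice(count);
}

// Seven pushes of the same Docker asset; with count = 5 the two oldest are selected.
const pushed: ImageRecord[] = [1, 2, 3, 4, 5, 6, 7].map((n) => ({ digest: `sha256:img${n}`, pushedAt: n }));
const expired = imagesToExpire(pushed, 5);
console.log(expired.map((i) => i.digest));
```

A count-based rule like this only works because each repository holds revisions of a single image; in a shared repository the same rule would expire images that other applications still run.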