Considering that the current deployment to GCP is driven by a Terraform script, this should be relatively doable, though ephemeral IPs will complicate the CI steps for the BOSH release.
The problem comes after the Terraform calls: the spin-up of CF & cf-mysql are processes that take multiple attempts to succeed, and we have no idea why that is.
Even without the 1-3 retries, spin-up takes up to 30 minutes for the entire environment.
Sadly, this is still being worked on, as we're hitting deployment issues at this time. I am unsure whether these are caused by mysterious changes in recent GCP, or by something oddly broken in how we're spawning the NATS controller (which, despite the name, isn't doing the NAT'ing), etc. Quite annoying.
It turns out I broke the nat-instance-primarys configuration behavior by moving from Ubuntu Trusty (14.04) to Ubuntu Xenial (16.04), where the network interface names changed. This has been fixed, and we're moving on with the deployment & CI scripting.
We're now at the point where we can sanely have the bastion, nat-primary, and a BOSH director online at any given time, keeping the always-on footprint to 3 VMs. This could possibly be reduced to 2 VMs by making the director something that is also spun up and down, but we'll see whether that's feasible.
We're safely (although slowly) spinning up Cloud Foundry, as well as MySQL and Redis services for Cloud Foundry. The fully operational infrastructure is approximately 30 VMs, most of which have been set to optimized VM sizes based on load testing and recommendations from GCP. All the same, it is far more cost-effective to keep only what is needed online. Spin-up time is between 45 minutes and 2 hours.
Here's a summary of the plan at this point. We're going to automate absolutely as much of this infrastructure as possible, and rely on GitLab CI/CD to do it, too.
gcp-infra
I have created https://gitlab.com/gitlab-pivotal/gcp-infra, which may get renamed in the future. The intent of this repository is twofold: a Dockerfile that builds a container with the tools necessary for CI/CD deployment of the infrastructure, and a pipeline for creating that infrastructure via explicitly manual steps, one for spin-up and one for spin-down. These two steps will then make API calls to https://gitlab.com/gitlab-pivotal/gitlab-ee-bosh-release in order to update it with the data needed to use the produced infrastructure/environment in CI/CD of the BOSH release itself. That infrastructure consists of:
- A VM to provide NAT'ing for containers inside the network
- A BOSH Director VM
- A deployment of Cloud Foundry
- A deployment of MySQL services for Cloud Foundry
  - For providing services on-demand to the deployments of GitLab
- A deployment of Redis services for Cloud Foundry
  - For providing services on-demand to the deployments of GitLab
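As a rough sketch of what the manual spin-up job would do once Terraform finishes, it can notify gitlab-ee-bosh-release through GitLab's pipeline trigger API. `TRIGGER_TOKEN`, `PROJECT_ID`, and the `ENVIRONMENT_READY` variable name are placeholders here, not settled names:

```shell
#!/bin/sh
# Manual spin-up job (sketch): build the environment, then tell the
# release project's CI that a fresh environment exists.
terraform apply

# Kick off a pipeline in gitlab-ee-bosh-release via the trigger API.
# TRIGGER_TOKEN and PROJECT_ID are placeholders for this sketch.
curl --request POST \
  --form "token=${TRIGGER_TOKEN}" \
  --form "ref=master" \
  --form "variables[ENVIRONMENT_READY]=true" \
  "https://gitlab.com/api/v4/projects/${PROJECT_ID}/trigger/pipeline"
```

The spin-down job would be the mirror image: `terraform destroy`, followed by a call marking the environment as gone.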
By taking these steps we:
- Reduce manual steps
- Automate spin-up and tear-down, which can average between 1h30 and 4h00 depending on a number of factors
- Reduce costs of this infrastructure even further; we expect the average lifetime of an environment to be no more than 2 weeks per month
- Centralize the state of the infrastructure, removing the possibility of conflicting state between developers with access
gitlab-ee-bosh-release
https://gitlab.com/gitlab-pivotal/gitlab-ee-bosh-release will be updated with the cf-cli package for Cloud Foundry, and will spin up new instances of MySQL and Redis for each branch, deleting them once tests have completed. This removes any need for a long-standing instance of MySQL and Redis, as is currently in place on the outdated AWS infrastructure.
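The per-branch service lifecycle could look roughly like the following, assuming the cf CLI is already logged in and the brokers expose `p-mysql` and `redis` offerings (the actual offering and plan names depend on the deployed brokers, and the branch variable is illustrative):

```shell
#!/bin/sh
# Sketch: dedicated MySQL and Redis instances per branch, torn down after tests.
BRANCH="${CI_COMMIT_REF_NAME:-my-branch}"

# Spin up service instances named after the branch.
cf create-service p-mysql 100mb "mysql-${BRANCH}"
cf create-service redis shared-vm "redis-${BRANCH}"

# ... run the BOSH release tests against these instances ...

# Delete them once tests complete, so nothing long-standing remains.
cf delete-service -f "mysql-${BRANCH}"
cf delete-service -f "redis-${BRANCH}"
```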
The GCP-targeted BOSH deployment manifests will also employ embedded ERB, as we've recently learned this is possible through the investigation of this new infrastructure. Additionally, with functioning internal DNS, we will use FQDNs instead of IP addresses for all GitLab configuration items. What this means is that we can now sanely stop editing these files manually, and have the deployment's name match the branch being worked on, entirely dynamically.
In gcp-infra, we're using a Docker-in-Docker runner instance because the version of BOSH currently in use is v1, which does not support SOCKS or any other proxy and therefore needs direct network access. To handle this, we're placing the runner directly inside the environment. When BOSH v2 is available, the entire layout can be changed to a smaller bastion, and the runner can even be a shared runner: v2 supports SOCKS proxies, so we can use `ssh -D` to create a dynamic SOCKS5 proxy. BOSH v2 is almost ready for GA, but is not there yet because documentation is missing. We could switch to it early, but large sections of the manifests would need to be rewritten, so this can/will come in a future revision.
I've spoken with Dmitriy Kalinin about BOSH CLI v2, what does and doesn't need to change, and how we can implement things. This has been quite enlightening: in reality, very little needs to change at all, and it allows us to make all of the Terraform and bosh (v2) calls from a single runner, with only a single SOCKS5 proxy thanks to gcloud compute ssh.
I am completing the test now, but this will greatly reduce the headaches involved with this entire process.
Okay, so. After spending a few hours over each of the past few work days with Dmitriy over at Pivotal, we're going to change this up a little bit. We'll still have two projects, but we'll clean up & simplify the gcp-infra project.
We've made the necessary modifications so that everything can be deployed with the BOSH v2 CLI (2.0.13 at this time) and an SSH dynamic SOCKS proxy tunnel. This allows us to entirely remove the requirement to jump inside the terraformed bastion host. Now we can terraform all of the BOSH and Cloud Foundry infrastructure in one go. From that point onward, it is a simple matter of keeping one tfstate, one state.json (the director deployment state), and one creds.yml (automatically generated credentials!) that need to be stored and passed around a bit.
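The tunnel setup amounts to a few commands; the BOSH v2 CLI honors the `BOSH_ALL_PROXY` environment variable, so every director call can go through the SOCKS tunnel without ever landing on the bastion. The instance name, zone, port, and director IP below are illustrative assumptions:

```shell
#!/bin/sh
# Open a dynamic SOCKS5 tunnel through the bastion (names/zone are examples).
gcloud compute ssh bastion --zone us-east1-d -- -D 1080 -N -f

# Point the BOSH v2 CLI at the tunnel; all director traffic flows through it.
export BOSH_ALL_PROXY=socks5://localhost:1080

# Talk to the director on its internal address, using the generated CA cert.
bosh alias-env gcp -e 10.0.0.6 \
  --ca-cert <(bosh int creds.yml --path /director_ssl/ca)
bosh -e gcp deployments
```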
We can extract the appropriate portions of creds.yml and the tfstate to send over to the gitlab-ee-bosh-release project's CI configuration, giving it everything it needs to perform its tasks.
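Pulling out those values could look like this sketch; the `--path` keys and the Terraform output name are assumptions about how the credentials and outputs are laid out, and `PROJECT_ID`/`API_TOKEN` are placeholders:

```shell
#!/bin/sh
# Extract the handful of values the downstream CI needs.
ADMIN_PASSWORD="$(bosh int creds.yml --path /admin_password)"
DIRECTOR_CA="$(bosh int creds.yml --path /director_ssl/ca)"
BASTION_IP="$(terraform output -state=terraform.tfstate bastion_external_ip)"

# Hand one of them to gitlab-ee-bosh-release as a CI variable via the
# GitLab API; PROJECT_ID and API_TOKEN are placeholders for this sketch.
curl --request POST \
  --header "PRIVATE-TOKEN: ${API_TOKEN}" \
  --form "key=DIRECTOR_ADMIN_PASSWORD" \
  --form "value=${ADMIN_PASSWORD}" \
  "https://gitlab.com/api/v4/projects/${PROJECT_ID}/variables"
```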
We're hitting a snag tied to GCP for the CI of gitlab-ee-bosh-release, as seen below. I am reaching out to the Cloud Foundry Redis folks to find out why.
```
$ cf create-service redis shared-vm redis-1
Creating service instance redis-1 in org system / space test as admin...
FAILED
Server error, status code: 502, error code: 10001, message: Service broker error: redis failed to start: exit status 1
```
I have tried completely redeploying the entire project (terraform, bosh, cf, cf-redis) from scratch, but this error still comes up for some reason. It was not an issue previously, so I have no idea what is happening.
In debugging, and discussing with the CF Redis folks, it seems the Go binary is failing to find a free port, for some reason. We're attempting to work out why this has suddenly started occurring.