Considering that the current deployment to GCP is driven by a Terraform script, this should be relatively doable, though ephemeral IPs will complicate the CI steps for the BOSH release.
The problem comes after the Terraform calls: the spin-up of CF & cf-mysql are processes that take multiple attempts to succeed, and we have no idea why that is.
Even without the 1-3 retries, spin-up takes up to 30 minutes for the entire environment.
Sadly, this is still being worked on, as we're hitting deployment issues at this time. I am unsure whether these are caused by mysterious changes in recent GCP, or by something oddly broken in how we're spawning the NATS controller (which, despite the name, isn't doing the NAT'ing), etc. Quite annoying.
It turns out I broke the nat-instance-primarys configuration behavior by moving from Ubuntu Trusty (14.04) to Ubuntu Xenial (16.04), where the network interface names changed. This has been fixed, and we're moving on with the deployment & CI scripting.
We're now at the point where we can sanely have the bastion, nat-primary, and a BOSH director online at any given time, keeping the always-on footprint to 3 VMs. This could possibly be reduced to 2 VMs by making the director something that is also spun up and down, but we'll see whether that's feasible.
We're safely (although slowly) spinning up Cloud Foundry, as well as MySQL and Redis services for Cloud Foundry. The fully operational infrastructure is approximately 30 VMs, most of which have been set to optimized VM sizes based on load testing and recommendations from GCP. All the same, it is far more cost-effective to keep only what is needed online. Spin-up time is between 45 minutes and 2 hours.
Here's a summary of the plan at this point. We're going to automate absolutely as much of this infrastructure as possible, and rely on GitLab CI/CD to do it, too.
gcp-infra
I have created https://gitlab.com/gitlab-pivotal/gcp-infra, which may get renamed in the future. The intent of this repository is twofold: a Dockerfile that builds a container with the tools necessary for CI/CD deployment of the infrastructure, and a pipeline for creating that infrastructure via explicitly manual steps, one for spin-up and one for spin-down. These two steps will then make API calls to https://gitlab.com/gitlab-pivotal/gitlab-ee-bosh-release in order to update it with the data needed to use the produced infrastructure/environment in CI/CD of the BOSH release itself. That infrastructure consists of:
- A VM to provide NAT'ing for containers inside the network
- A BOSH Director VM
- A deployment of Cloud Foundry
- A deployment of MySQL services for Cloud Foundry
  - For providing services on-demand to the deployments of GitLab
- A deployment of Redis services for Cloud Foundry
  - For providing services on-demand to the deployments of GitLab
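As a rough sketch of what the manual spin-up job would do once Terraform finishes, it can notify gitlab-ee-bosh-release through GitLab's pipeline trigger API. `TRIGGER_TOKEN`, `PROJECT_ID`, and the `ENVIRONMENT_READY` variable name are placeholders here, not settled names:

```shell
#!/bin/sh
# Manual spin-up job (sketch): build the environment, then tell the
# release project's CI that a fresh environment exists.
terraform apply

# Kick off a pipeline in gitlab-ee-bosh-release via the trigger API.
# TRIGGER_TOKEN and PROJECT_ID are placeholders for this sketch.
curl --request POST \
  --form "token=${TRIGGER_TOKEN}" \
  --form "ref=master" \
  --form "variables[ENVIRONMENT_READY]=true" \
  "https://gitlab.com/api/v4/projects/${PROJECT_ID}/trigger/pipeline"
```

The spin-down job would be the mirror image: `terraform destroy`, followed by a call marking the environment as gone.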
By taking these steps we:
- Reduce manual steps
- Automate spin-up and tear-down, which can average between 1h30 and 4h00 depending on a number of factors
- Reduce costs of this infrastructure even further; we expect the average lifetime of an environment to be no more than 2 weeks per month
- Centralize the state of the infrastructure, removing the possibility of conflicting state between developers with access
gitlab-ee-bosh-release
https://gitlab.com/gitlab-pivotal/gitlab-ee-bosh-release will be updated with the cf-cli package for Cloud Foundry, and will spin up new instances of MySQL and Redis for each branch, deleting them once tests have completed. This removes any need for a long-standing instance of MySQL and Redis, as is currently in place on the outdated AWS infrastructure.
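The per-branch service lifecycle could look roughly like the following, assuming the cf CLI is already logged in and the brokers expose `p-mysql` and `redis` offerings (the actual offering and plan names depend on the deployed brokers, and the branch variable is illustrative):

```shell
#!/bin/sh
# Sketch: dedicated MySQL and Redis instances per branch, torn down after tests.
BRANCH="${CI_COMMIT_REF_NAME:-my-branch}"

# Spin up service instances named after the branch.
cf create-service p-mysql 100mb "mysql-${BRANCH}"
cf create-service redis shared-vm "redis-${BRANCH}"

# ... run the BOSH release tests against these instances ...

# Delete them once tests complete, so nothing long-standing remains.
cf delete-service -f "mysql-${BRANCH}"
cf delete-service -f "redis-${BRANCH}"
```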
The GCP-targeted BOSH deployment manifests will also employ embedded ERB, as we've recently learned this is possible through the investigation of this new infrastructure. Additionally, with functioning internal DNS, we will use FQDNs instead of IP addresses for all GitLab configuration items. What this means is that we can now sanely stop editing these files manually, and have the deployment's name match the branch being worked on, entirely dynamically.
In gcp-infra, we're using a Docker-in-Docker runner instance because the version of BOSH currently in use is v1, which does not support SOCKS or any other proxy and therefore needs direct network access. To handle this, we're placing the runner directly inside the environment. When BOSH v2 is available, the entire layout can be changed to a smaller bastion, and the runner can even be a shared runner: v2 supports SOCKS proxies, so we can use `ssh -D` to create a dynamic SOCKS5 proxy. BOSH v2 is almost ready for GA, but is not there yet because documentation is missing. We could switch to it early, but large sections of the manifests would need to be rewritten, so this can/will come in a future revision.
I've spoken with Dmitriy Kalinin about BOSH CLI v2, what does and doesn't need to change, and how we can implement things. This has been quite enlightening: in reality, very little needs to change at all, and it allows us to make all of the Terraform and bosh (v2) calls from a single runner, with only a single SOCKS5 proxy thanks to gcloud compute ssh.
I am completing the test now, but this will greatly reduce the headaches involved with this entire process.
Okay, so. After spending a few hours over each of the past few work days with Dmitriy over at Pivotal, we're going to change this up a little bit. We'll still have two projects, but we'll clean up & simplify the gcp-infra project.
We've made the necessary modifications so that everything can be deployed with the BOSH v2 CLI (2.0.13 at this time) and an SSH dynamic SOCKS proxy tunnel. This allows us to entirely remove the requirement to jump inside the terraformed bastion host. Now we can terraform all of the BOSH and Cloud Foundry infrastructure in one go. From that point onward, it is a simple matter of keeping one tfstate, one state.json (the director deployment state), and one creds.yml (automatically generated credentials!) that need to be stored and passed around a bit.
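The tunnel setup amounts to a few commands; the BOSH v2 CLI honors the `BOSH_ALL_PROXY` environment variable, so every director call can go through the SOCKS tunnel without ever landing on the bastion. The instance name, zone, port, and director IP below are illustrative assumptions:

```shell
#!/bin/sh
# Open a dynamic SOCKS5 tunnel through the bastion (names/zone are examples).
gcloud compute ssh bastion --zone us-east1-d -- -D 1080 -N -f

# Point the BOSH v2 CLI at the tunnel; all director traffic flows through it.
export BOSH_ALL_PROXY=socks5://localhost:1080

# Talk to the director on its internal address, using the generated CA cert.
bosh alias-env gcp -e 10.0.0.6 \
  --ca-cert <(bosh int creds.yml --path /director_ssl/ca)
bosh -e gcp deployments
```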
We can extract the appropriate portions of creds.yml and the tfstate to send over to the gitlab-ee-bosh-release project's CI configuration, giving it everything it needs to perform its tasks.
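Pulling out those values could look like this sketch; the `--path` keys and the Terraform output name are assumptions about how the credentials and outputs are laid out, and `PROJECT_ID`/`API_TOKEN` are placeholders:

```shell
#!/bin/sh
# Extract the handful of values the downstream CI needs.
ADMIN_PASSWORD="$(bosh int creds.yml --path /admin_password)"
DIRECTOR_CA="$(bosh int creds.yml --path /director_ssl/ca)"
BASTION_IP="$(terraform output -state=terraform.tfstate bastion_external_ip)"

# Hand one of them to gitlab-ee-bosh-release as a CI variable via the
# GitLab API; PROJECT_ID and API_TOKEN are placeholders for this sketch.
curl --request POST \
  --header "PRIVATE-TOKEN: ${API_TOKEN}" \
  --form "key=DIRECTOR_ADMIN_PASSWORD" \
  --form "value=${ADMIN_PASSWORD}" \
  "https://gitlab.com/api/v4/projects/${PROJECT_ID}/variables"
```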
We're hitting a snag tied to GCP for the CI of gitlab-ee-bosh-release, as seen below. I am reaching out to the Cloud Foundry Redis folks to find out why.
```
$ cf create-service redis shared-vm redis-1
Creating service instance redis-1 in org system / space test as admin...
FAILED
Server error, status code: 502, error code: 10001, message: Service broker error: redis failed to start: exit status 1
```
I have tried completely redeploying the entire project (terraform, bosh, cf, cf-redis) from scratch, but this error still comes up for some reason. It was not an issue previously, so I have no idea what is happening.
In debugging, and discussing with the CF Redis folks, it seems the Go binary is failing to find a free port, for some reason. We're attempting to work out why this has suddenly started occurring.