Commit 72b1bdbb authored by Daniele Valeriani

Merge branch 'pc-add-arch-diagram' into 'master'

Added architecture diagram

See merge request !7
parents 50a7ee87 fe484de2
@@ -2,11 +2,6 @@
[These are not the runbooks you are looking for](
## Design decisions
* [Storage design](design/
* [Network design](design/
## Where and how to look for data
### General System Health
@@ -25,3 +20,7 @@
* [Postgres Queries]( use this dashboard to understand if we have blocked or slow queries, dead tuples, etc.
* [Business Stats]( shows how many pushes, new repos and CI builds we have.
* [Daily overview]( shows endpoints with their number of calls and performance metrics. Useful for understanding what is slow in general.
## Production Architecture
![Architecture](img/GitLab Infrastructure Architecture.png)
# GitLab Infrastructure Design
* [Storage](design/
* [Networking](design/
# Networking
## Edge Routing
We will take delivery of two diverse 1 Gb network connections, each receiving a full BGP feed.
Routers will need to terminate a 1 Gb Ethernet handoff, with an uplink into the core network.
## Core Routing & Switching
We will have a two-node collapsed core architecture built on 40 Gb Open Network Switch
hardware running Cumulus Networks OS. The ASIC chipset should be a Broadcom Tomahawk or Broadcom Trident2+.
## Host Connectivity
Hosts will be dual-connected to the core switches, with a 40 Gb interconnect to each switch.
Hosts will run Cumulus Quagga for end-to-end L3 connectivity and dynamic routing.
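
As a rough sanity check on the dual-connected design, the sketch below computes the uplink bandwidth a host keeps when one core switch fails. It assumes one 40 Gb link per core switch and that traffic is spread evenly across the surviving links (for example via equal-cost routing, which is an assumption here and not something the design states). The function name is illustrative only.

```python
def host_bandwidth_gbps(links_per_core, link_gbps, cores_total, cores_failed=0):
    """Aggregate uplink bandwidth of one host, in Gb/s.

    Assumes every link to a surviving core switch is fully usable.
    """
    if not 0 <= cores_failed <= cores_total:
        raise ValueError("cores_failed must be between 0 and cores_total")
    surviving_cores = cores_total - cores_failed
    return surviving_cores * links_per_core * link_gbps


# Two core switches, one 40 Gb link from the host to each of them.
normal = host_bandwidth_gbps(1, 40, cores_total=2)
degraded = host_bandwidth_gbps(1, 40, cores_total=2, cores_failed=1)
print(f"normal: {normal} Gb/s, one core switch down: {degraded} Gb/s")
```

Under these assumptions a host loses half of its uplink capacity when a core switch fails but stays reachable, which is the point of connecting it to both switches.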
# Storage
## CephFS in hardware
This is our general plan and reasoning:
* We are moving forward with the CephFS cluster in hardware.
* Our general architecture is based on nodes with:
  * 12-core processors
  * 2 sockets
  * 96 GB of RAW storage
  * A minimum spindle count of 16 drives
  * A minimum HBA count of 2
  * 2 drives for the OS in RAID 1
  * An NVMe drive on the PCIe bus for the Ceph journal and frequently used Ceph PGs
  * A 40 Gb NIC for general networking
  * A 1 Gb NIC for management
* As a backup for git repos we will use the GitLab GEO feature, pushing to a secondary node hosted at Amazon with an EFS drive (we don't care if it's slow).
  * This makes Amazon DirectConnect critical for our colo, as we will need high bandwidth to Amazon.
  * We will start backfilling this Amazon instance as soon as we finish draining CephFS, so that when we are done we can start moving from Amazon to the colo.
* To prevent a total loss in the case of another MDS meltdown, we will create periodic snapshots that we can recover from (hourly, daily, whatever makes sense).
* We will push forward with having the GEO feature use object storage, in which case we will use RADOS as the object store to simplify our installation.
* CephFS supports having some OSDs on BlueStore and others on XFS at the same time (but don't leave it like this); we will move to BlueStore when it's stable and available.
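
The periodic snapshots mentioned above could be driven by something like the following sketch. CephFS creates a snapshot when a directory is made under the hidden `.snap` directory of a mounted path, and removes it when that directory is removed. Everything else here is a hypothetical illustration rather than the team's actual tooling: the `hourly-`/`daily-` naming scheme, the retention counts, and the function names.

```python
import os
from datetime import datetime
from typing import Dict, List


def snapshot_name(kind: str, now: datetime) -> str:
    """Build a name that sorts chronologically, e.g. 'hourly-20170101T0300'."""
    return f"{kind}-{now.strftime('%Y%m%dT%H%M')}"


def create_snapshot(snap_dir: str, kind: str, now: datetime) -> str:
    """Create a snapshot by making a directory inside snap_dir.

    On CephFS, snap_dir would be '<mounted path>/.snap' (path is an assumption).
    """
    path = os.path.join(snap_dir, snapshot_name(kind, now))
    os.mkdir(path)
    return path


def snapshots_to_delete(names: List[str], keep: Dict[str, int]) -> List[str]:
    """Return every name older than the newest keep[kind] of each kind."""
    doomed: List[str] = []
    for kind, limit in keep.items():
        mine = sorted(n for n in names if n.startswith(kind + "-"))
        doomed.extend(mine[:-limit] if limit > 0 else mine)
    return doomed


def prune(snap_dir: str, keep: Dict[str, int]) -> None:
    """Remove expired snapshots; on CephFS, rmdir deletes the snapshot."""
    for name in snapshots_to_delete(os.listdir(snap_dir), keep):
        os.rmdir(os.path.join(snap_dir, name))
```

A scheduler (cron, for instance) would call `create_snapshot` once per interval and then `prune` with something like `{"hourly": 24, "daily": 7}`; those retention counts are placeholders for "whatever makes sense".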
## CephFS in the cloud
**Don't do it.**
* Latencies will kill you.
* Random hosts going down at any time will double your workload.
* Network attached storage, as premium as it is, is shared and slow.
* CephFS will lock when it can't write to the journal.
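
To make the latency point concrete, here is a back-of-the-envelope sketch. It assumes a workload of strictly sequential small operations, each waiting for one storage round trip. The operation count and both latency figures are made-up illustrations, not measurements from this infrastructure.

```python
def serial_time_seconds(operations: int, latency_ms: float) -> float:
    """Wall-clock time for a run of sequential round trips."""
    return operations * latency_ms / 1000.0


# Hypothetical per-operation latencies, for comparison only.
for label, latency_ms in [("low-latency local disk", 0.1),
                          ("network-attached volume", 2.0)]:
    total = serial_time_seconds(10_000, latency_ms)
    print(f"{label}: {total:.1f} s for 10,000 serial operations")
```

Because the operations are serial, total time scales linearly with per-operation latency, so a storage path that is 20 times slower per round trip makes the whole job 20 times slower regardless of available bandwidth.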
The good side:
* CephFS survives locking and injecting latencies remarkably well.