Commit fe484de2 authored by Pablo Carranza

Added architecture diagram

parent 50a7ee87
@@ -2,11 +2,6 @@
 
[These are not the runbooks you are looking for](https://gitlab.com/gitlab-com/runbooks)
 
## Design decisions
* [Storage design](design/storage.md)
* [Network design](design/networking.md)
## Where and how to look for data
 
### General System Health
@@ -25,3 +20,7 @@
* [Postgres Queries](http://performance.gitlab.net/dashboard/db/postgres-queries) use this dashboard to understand if we have blocked or slow queries, dead tuples, etc.
* [Business Stats](http://performance.gitlab.net/dashboard/db/business-stats): shows how many pushes, new repos and CI builds we get.
* [Daily overview](http://performance.gitlab.net/dashboard/db/daily-overview): shows endpoints with amount of calls and performance metrics. Useful to understand what is slow generally.
## Production Architecture
![Architecture](img/GitLab Infrastructure Architecture.png)
# GitLab Infrastructure Design
* [Storage](design/storage.md)
* [Networking](design/networking.md)
# Networking
## Edge Routing
We will take delivery of two diverse 1Gb network connections, each receiving a full BGP feed.
The routers will need to terminate the 1Gb Ethernet handoffs, with uplinks into the core network.
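The document does not say what software these edge routers run, so as a rough illustration only, here is a minimal monitoring sketch, assuming a Quagga/FRR-style `vtysh` shell and placeholder peer addresses, that checks both upstream sessions are established and carrying roughly a full table:

```python
#!/usr/bin/env python3
"""Rough health check for the two upstream BGP sessions (sketch only).

Assumes the edge device exposes a Quagga/FRR `vtysh` shell; the peer
addresses and the full-table threshold are placeholders, not taken from
the document.
"""
import subprocess

UPSTREAMS = ["192.0.2.1", "198.51.100.1"]  # placeholder peer IPs (RFC 5737)
FULL_TABLE_MIN = 500000                    # rough lower bound for a full IPv4 feed


def check_upstreams():
    summary = subprocess.check_output(
        ["vtysh", "-c", "show ip bgp summary"], text=True)
    for line in summary.splitlines():
        fields = line.split()
        if not fields or fields[0] not in UPSTREAMS:
            continue
        # The last column is the received prefix count when the session is
        # Established, otherwise the session state (Idle, Active, ...).
        last = fields[-1]
        if not last.isdigit():
            print(f"WARNING: {fields[0]} is not Established (state: {last})")
        elif int(last) < FULL_TABLE_MIN:
            print(f"WARNING: {fields[0]} is only sending {last} prefixes")
        else:
            print(f"OK: {fields[0]} ({last} prefixes received)")


if __name__ == "__main__":
    check_upstreams()
```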
## Core Routing & Switching
We will have a two-node collapsed-core architecture composed of 40Gb open network switch
hardware running the Cumulus Networks OS. The ASIC chipset should be a Broadcom Tomahawk or Broadcom Trident2+.
## Host Connectivity
Hosts will be dual-connected, with a 40Gb interconnect to each of the core switches.
Hosts will run Cumulus Quagga for end-to-end L3 connectivity and dynamic routing.
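As a quick way to confirm that a host's dynamic routing is actually using both uplinks, here is a minimal sketch, assuming Linux iproute2 and that the routing daemon installs the default route as a two-way ECMP route (one next-hop per core switch):

```python
#!/usr/bin/env python3
"""Sketch: check that a host sees one ECMP next-hop per core switch.

Assumes Linux iproute2 and a multipath default route; the expected count
matches the two core switches described above.
"""
import subprocess

EXPECTED_NEXTHOPS = 2  # one per core switch


def default_route_nexthops():
    out = subprocess.check_output(["ip", "route", "show", "default"], text=True)
    # A multipath route prints one "nexthop via <ip> dev <iface> ..." line per
    # path; a single-path default route has no "nexthop" lines at all.
    hops = [l for l in out.splitlines() if l.strip().startswith("nexthop")]
    if hops:
        return len(hops)
    return 1 if out.strip() else 0


if __name__ == "__main__":
    count = default_route_nexthops()
    if count < EXPECTED_NEXTHOPS:
        print(f"WARNING: only {count} next-hop(s) on the default route")
    else:
        print(f"OK: {count} ECMP next-hops on the default route")
```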
# Storage
## CephFS in hardware
This is our general plan and reasoning:
* We are moving forward with the CephFS cluster in hardware.
* Our general node architecture will use:
  * 12-core processors
  * 2 sockets
  * 96 GB of RAM
  * Minimum spindle count of 16 drives
  * Minimum HBA count of 2
  * 2 drives for the OS in RAID 1
  * NVMe drive on the PCIe bus for the Ceph journal and frequently used Ceph PGs
  * 40Gb NIC for general networking
  * 1Gb NIC for management
* As a backup for git repos we will use the GitLab Geo feature, pushing into a secondary node hosted at Amazon with an EFS drive (we don't care if it's slow).
* This makes Amazon Direct Connect a critical feature for our colo, as we will need high bandwidth to it.
* We will start backfilling this Amazon instance as soon as we finish draining CephFS, so when we are done we can start moving from Amazon to the colo.
* To prevent a total loss in the case of another MDS meltdown we will create snapshots periodically so we can recover (hourly, daily, whatever makes sense); see the sketch after this list.
* We will push forward with the Geo feature to use object storage, in which case we will use RADOS as the object store to simplify our installation.
* CephFS supports having some OSDs on BlueStore and XFS at the same time (but don't leave it like this); we will move to BlueStore when it's stable and available.
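For the periodic snapshots mentioned above, CephFS exposes snapshots through the special `.snap` directory: creating a subdirectory there takes a snapshot of that tree, and removing it drops the snapshot. A minimal sketch, assuming a placeholder mount point and a retention policy of our own choosing:

```python
#!/usr/bin/env python3
"""Sketch: hourly CephFS snapshots for the MDS-meltdown recovery plan.

CephFS takes a snapshot when a subdirectory is created under `.snap` and
drops it when that subdirectory is removed. The mount point and retention
count are assumptions, not taken from the document.
"""
import os
import time

CEPHFS_MOUNT = "/var/opt/gitlab/git-data"  # placeholder mount point
KEEP = 24                                  # keep roughly one day of hourly snapshots


def take_snapshot():
    snap_dir = os.path.join(CEPHFS_MOUNT, ".snap")
    name = time.strftime("hourly-%Y%m%d-%H%M")
    os.mkdir(os.path.join(snap_dir, name))  # creating the directory takes the snapshot

    # Expire the oldest snapshots beyond the retention window.
    snaps = sorted(s for s in os.listdir(snap_dir) if s.startswith("hourly-"))
    for old in snaps[:-KEEP]:
        os.rmdir(os.path.join(snap_dir, old))  # removing the directory drops it


if __name__ == "__main__":
    take_snapshot()
```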
## CephFS in the cloud
**Don't do it.**
* Latencies will kill you.
* Random hosts going down at any time will double your workload.
* Network-attached storage, as premium as it is, is shared and slow.
* CephFS will lock when it can't write to the journal.
The good side:
* CephFS survives locking and injecting latencies remarkably well.
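That last point is easy to reproduce on a test client. Here is a minimal sketch, assuming Linux `tc`/netem and a placeholder interface name, that injects artificial latency toward the cluster while you exercise the CephFS mount (run it on a disposable test node, never in production):

```python
#!/usr/bin/env python3
"""Sketch: inject artificial latency on a CephFS test client with tc/netem.

The interface name and delay values are placeholders; requires root on a
throwaway test node.
"""
import subprocess
import time

IFACE = "eth0"  # placeholder: the interface facing the Ceph cluster


def run(cmd):
    print("+", " ".join(cmd))
    subprocess.check_call(cmd)


def inject_latency(delay_ms=100, jitter_ms=20, duration_s=300):
    # Add a netem qdisc that delays all egress traffic on IFACE.
    run(["tc", "qdisc", "add", "dev", IFACE, "root", "netem",
         "delay", f"{delay_ms}ms", f"{jitter_ms}ms"])
    try:
        # Exercise the CephFS mount (git operations, fio, etc.) while this sleeps.
        time.sleep(duration_s)
    finally:
        run(["tc", "qdisc", "del", "dev", IFACE, "root"])


if __name__ == "__main__":
    inject_latency()
```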
img/GitLab Infrastructure Architecture.png (new file, 108 KiB)
