Commit fe484de2 authored by Pablo Carranza

Added architecture diagram

parent 50a7ee87
@@ -2,11 +2,6 @@
 
[These are not the runbooks you are looking for](https://gitlab.com/gitlab-com/runbooks)
 
## Design decisions
* [Storage design](design/storage.md)
* [Network design](design/networking.md)
## Where and how to look for data
 
### General System Health
@@ -25,3 +20,7 @@
* [Postgres Queries](http://performance.gitlab.net/dashboard/db/postgres-queries) use this dashboard to understand if we have blocked or slow queries, dead tuples, etc.
* [Business Stats](http://performance.gitlab.net/dashboard/db/business-stats): shows how many pushes, new repos and CI builds we get.
* [Daily overview](http://performance.gitlab.net/dashboard/db/daily-overview): shows endpoints with amount of calls and performance metrics. Useful to understand what is slow generally.
## Production Architecture
![Architecture](img/GitLab Infrastructure Architecture.png)
# GitLab Infrastructure Design
* [Storage](design/storage.md)
* [Networking](design/networking.md)
# Networking
## Edge Routing
We will take delivery of two diverse 1Gb network connections, each receiving a full BGP feed.
The routers will need to terminate the 1Gb Ethernet handoffs, with uplinks into the core network.
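The document does not say what software these edge routers run, so as a rough illustration only, here is a minimal monitoring sketch, assuming a Quagga/FRR-style `vtysh` shell and placeholder peer addresses, that checks both upstream sessions are established and carrying roughly a full table:

```python
#!/usr/bin/env python3
"""Rough health check for the two upstream BGP sessions (sketch only).

Assumes the edge device exposes a Quagga/FRR `vtysh` shell; the peer
addresses and the full-table threshold are placeholders, not taken from
the document.
"""
import subprocess

UPSTREAMS = ["192.0.2.1", "198.51.100.1"]  # placeholder peer IPs (RFC 5737)
FULL_TABLE_MIN = 500000                    # rough lower bound for a full IPv4 feed


def check_upstreams():
    summary = subprocess.check_output(
        ["vtysh", "-c", "show ip bgp summary"], text=True)
    for line in summary.splitlines():
        fields = line.split()
        if not fields or fields[0] not in UPSTREAMS:
            continue
        # The last column is the received prefix count when the session is
        # Established, otherwise the session state (Idle, Active, ...).
        last = fields[-1]
        if not last.isdigit():
            print(f"WARNING: {fields[0]} is not Established (state: {last})")
        elif int(last) < FULL_TABLE_MIN:
            print(f"WARNING: {fields[0]} is only sending {last} prefixes")
        else:
            print(f"OK: {fields[0]} ({last} prefixes received)")


if __name__ == "__main__":
    check_upstreams()
```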
## Core Routing & Switching
We will have a two-node collapsed-core architecture composed of 40Gb open network switch
hardware running the Cumulus Networks OS. The ASIC chipset should be a Broadcom Tomahawk or Broadcom Trident2+.
## Host Connectivity
Hosts will be dual-connected, with a 40Gb interconnect to each of the core switches.
Hosts will run Cumulus Quagga for end-to-end L3 connectivity and dynamic routing.
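As a quick way to confirm that a host's dynamic routing is actually using both uplinks, here is a minimal sketch, assuming Linux iproute2 and that the routing daemon installs the default route as a two-way ECMP route (one next-hop per core switch):

```python
#!/usr/bin/env python3
"""Sketch: check that a host sees one ECMP next-hop per core switch.

Assumes Linux iproute2 and a multipath default route; the expected count
matches the two core switches described above.
"""
import subprocess

EXPECTED_NEXTHOPS = 2  # one per core switch


def default_route_nexthops():
    out = subprocess.check_output(["ip", "route", "show", "default"], text=True)
    # A multipath route prints one "nexthop via <ip> dev <iface> ..." line per
    # path; a single-path default route has no "nexthop" lines at all.
    hops = [l for l in out.splitlines() if l.strip().startswith("nexthop")]
    if hops:
        return len(hops)
    return 1 if out.strip() else 0


if __name__ == "__main__":
    count = default_route_nexthops()
    if count < EXPECTED_NEXTHOPS:
        print(f"WARNING: only {count} next-hop(s) on the default route")
    else:
        print(f"OK: {count} ECMP next-hops on the default route")
```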
# Storage
## CephFS in hardware
This is our general plan and reasoning:
* We are moving forward with the CephFS cluster in hardware.
* Our general node architecture will use:
  * 12-core processors
  * 2 sockets
  * 96 GB of RAM
  * Minimum spindle count of 16 drives
  * Minimum HBA count of 2
  * 2 drives for the OS in RAID 1
  * NVMe drive on the PCIe bus for the Ceph journal and frequently used Ceph PGs
  * 40Gb NIC for general networking
  * 1Gb NIC for management
* As a backup for git repos we will use the GitLab Geo feature, pushing into a secondary node hosted at Amazon with an EFS drive (we don't care if it's slow).
* This makes Amazon Direct Connect a critical feature for our colo, as we will need high bandwidth to it.
* We will start backfilling this Amazon instance as soon as we finish draining CephFS, so when we are done we can start moving from Amazon to the colo.
* To prevent a total loss in the case of another MDS meltdown we will create snapshots periodically so we can recover (hourly, daily, whatever makes sense); see the sketch after this list.
* We will push forward with the Geo feature to use object storage, in which case we will use RADOS as the object store to simplify our installation.
* CephFS supports having some OSDs on BlueStore and XFS at the same time (but don't leave it like this); we will move to BlueStore when it's stable and available.
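For the periodic snapshots mentioned above, CephFS exposes snapshots through the special `.snap` directory: creating a subdirectory there takes a snapshot of that tree, and removing it drops the snapshot. A minimal sketch, assuming a placeholder mount point and a retention policy of our own choosing:

```python
#!/usr/bin/env python3
"""Sketch: hourly CephFS snapshots for the MDS-meltdown recovery plan.

CephFS takes a snapshot when a subdirectory is created under `.snap` and
drops it when that subdirectory is removed. The mount point and retention
count are assumptions, not taken from the document.
"""
import os
import time

CEPHFS_MOUNT = "/var/opt/gitlab/git-data"  # placeholder mount point
KEEP = 24                                  # keep roughly one day of hourly snapshots


def take_snapshot():
    snap_dir = os.path.join(CEPHFS_MOUNT, ".snap")
    name = time.strftime("hourly-%Y%m%d-%H%M")
    os.mkdir(os.path.join(snap_dir, name))  # creating the directory takes the snapshot

    # Expire the oldest snapshots beyond the retention window.
    snaps = sorted(s for s in os.listdir(snap_dir) if s.startswith("hourly-"))
    for old in snaps[:-KEEP]:
        os.rmdir(os.path.join(snap_dir, old))  # removing the directory drops it


if __name__ == "__main__":
    take_snapshot()
```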
## CephFS in the cloud
**Don't do it.**
* Latencies will kill you.
* Random hosts going down at any time will double your workload.
* Network-attached storage, as premium as it is, is shared and slow.
* CephFS will lock when it can't write to the journal.
The good side:
* CephFS survives locking and injecting latencies remarkably well.
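That last point is easy to reproduce on a test client. Here is a minimal sketch, assuming Linux `tc`/netem and a placeholder interface name, that injects artificial latency toward the cluster while you exercise the CephFS mount (run it on a disposable test node, never in production):

```python
#!/usr/bin/env python3
"""Sketch: inject artificial latency on a CephFS test client with tc/netem.

The interface name and delay values are placeholders; requires root on a
throwaway test node.
"""
import subprocess
import time

IFACE = "eth0"  # placeholder: the interface facing the Ceph cluster


def run(cmd):
    print("+", " ".join(cmd))
    subprocess.check_call(cmd)


def inject_latency(delay_ms=100, jitter_ms=20, duration_s=300):
    # Add a netem qdisc that delays all egress traffic on IFACE.
    run(["tc", "qdisc", "add", "dev", IFACE, "root", "netem",
         "delay", f"{delay_ms}ms", f"{jitter_ms}ms"])
    try:
        # Exercise the CephFS mount (git operations, fio, etc.) while this sleeps.
        time.sleep(duration_s)
    finally:
        run(["tc", "qdisc", "del", "dev", IFACE, "root"])


if __name__ == "__main__":
    inject_latency()
```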
img/GitLab Infrastructure Architecture.png (new file, 108 KiB)
