Key Design Concerns
System Design Principles
Scalability
The ability of a system to handle growing amounts of work. Read more about Scalability on Google Cloud.
- Vertical Scaling (Scale Up): Adding resources (CPU/RAM) to a single node. Limited by hardware ceilings.
- Horizontal Scaling (Scale Out): Adding more nodes to a pool. Not bound by a single machine's hardware ceiling, but it works best when the application is stateless or keeps shared state in an external store.
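
To illustrate why horizontal scaling favors stateless nodes, here is a minimal sketch. The names `Node`, `RoundRobinPool`, and `shared_sessions` are hypothetical and exist only for this example; the dictionary stands in for an external store such as a cache or database.

```python
# Hypothetical shared store standing in for an external cache or database.
shared_sessions: dict[str, int] = {}


class Node:
    """A stateless worker: it keeps no per-user data between requests."""

    def __init__(self, name: str) -> None:
        self.name = name

    def handle(self, user_id: str) -> str:
        # All state is read from and written to the shared store,
        # so any node can serve any user.
        count = shared_sessions.get(user_id, 0) + 1
        shared_sessions[user_id] = count
        return f"{self.name} served {user_id} (visit {count})"


class RoundRobinPool:
    """Spreads requests across nodes in turn; nodes can be added at any time."""

    def __init__(self, nodes: list[Node]) -> None:
        self.nodes = list(nodes)
        self._index = 0

    def add_node(self, node: Node) -> None:
        self.nodes.append(node)

    def route(self, user_id: str) -> str:
        node = self.nodes[self._index % len(self.nodes)]
        self._index += 1
        return node.handle(user_id)


pool = RoundRobinPool([Node("node-a"), Node("node-b")])
first = pool.route("alice")    # handled by node-a
second = pool.route("alice")   # handled by node-b, visit count is preserved
pool.add_node(Node("node-c"))  # scale out without touching existing nodes
third = pool.route("alice")    # handled by node-c
```

Because the visit counter lives outside the nodes, adding `node-c` requires no data migration. If each node kept its own counter, the same user would see inconsistent results depending on where the request landed.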
Availability & Reliability
- SLA (Service Level Agreement): Contractual uptime guarantee (e.g., 99.9% allows about 43 minutes of downtime per 30-day month). Learn about SLAs from Atlassian.
- Redundancy: Eliminating Single Points of Failure (SPOF).
  - Active-Passive: Standby node takes over on failure.
  - Active-Active: All nodes handle traffic; load redistributed on failure.
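- CAP Theorem: A distributed system cannot guarantee Consistency, Availability, and Partition Tolerance all at once. Because network partitions cannot be ruled out in practice, the real choice during a partition is between consistency and availability. CAP Theorem explained by IBM.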
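
The SLA arithmetic and the benefit of redundancy can be checked with a short calculation. This is a sketch under stated assumptions: a 30-day month, node failures that are independent of each other, and a setup where one healthy node is enough to serve traffic. The function names are made up for this example.

```python
def downtime_minutes(availability_pct: float, period_days: float = 30) -> float:
    """Allowed downtime in minutes for a given availability percentage."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


def combined_availability(node_availability_pct: float, nodes: int) -> float:
    """Availability (%) of N redundant nodes.

    Assumes failures are independent and that a single healthy node
    can carry the full load.
    """
    p_fail = 1 - node_availability_pct / 100
    return (1 - p_fail ** nodes) * 100


three_nines = downtime_minutes(99.9)       # roughly 43.2 minutes per 30 days
two_nodes = combined_availability(99, 2)   # roughly 99.99 percent
```

Real deployments rarely meet the independence assumption (shared power, network, or deploy pipelines cause correlated failures), so treat the combined figure as an upper bound.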
Observability
Understanding the internal state of the system based on external outputs. What is Observability? (New Relic).
- Logs: Discrete events (e.g., “User logged in”). Centralized via ELK/Splunk.
- Metrics: Aggregated numerical data (e.g., “CPU usage”, “Requests/sec”). Visualized via Grafana/Datadog.
- Tracing: Lifecycle of a single request as it crosses microservices. Instrumented via OpenTelemetry.
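
The three signals can be sketched with the Python standard library alone. This is an illustrative toy, not how ELK, Grafana, or OpenTelemetry are wired up: `metrics`, `JsonFormatter`, and `handle_login` are hypothetical names, the counter is a stand-in for a metrics client, and the `trace_id` is a plain random ID rather than a real trace context.

```python
import json
import logging
import uuid
from collections import Counter
from typing import Optional

# Hypothetical in-process metrics registry (stand-in for a real metrics client).
metrics: Counter = Counter()


class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON object so it can be indexed centrally."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
        })


logger = logging.getLogger("observability-demo")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)


def handle_login(user: str, trace_id: Optional[str] = None) -> str:
    # Tracing: reuse the caller's ID if one was passed, otherwise start a new one.
    trace_id = trace_id or uuid.uuid4().hex
    # Log: a discrete event, tagged with the trace ID for correlation.
    logger.info("User logged in: %s", user, extra={"trace_id": trace_id})
    # Metric: an aggregated number, with no per-event detail.
    metrics["logins_total"] += 1
    return trace_id


trace = handle_login("alice")
```

Passing the returned `trace` value to downstream calls is what lets logs from different services be stitched into one request timeline.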
Latency
- TTFB (Time to First Byte): Time from sending a request until the first byte of the response arrives, roughly server processing time plus network latency (including connection setup). What is TTFB? (Cloudflare).
- RTT (Round Trip Time): Time for a packet to go client->server->client. What is RTT? (Cloudflare).
- Edge Computing: Reducing latency by moving compute closer to the user (e.g., Cloudflare Workers).
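
RTT and TTFB can be approximated from a client with a raw socket. The sketch below makes several assumptions: plain HTTP without TLS, TCP connect time as a proxy for one round trip, and a timer that starts before name resolution, so DNS lookup time is included in both figures. `measure_latency` is a hypothetical helper written for this example.

```python
import socket
import time


def measure_latency(host: str, port: int = 80, path: str = "/") -> dict[str, float]:
    """Return approximate RTT and TTFB in milliseconds for a plain HTTP GET."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=5) as sock:
        # The TCP handshake completes after roughly one round trip.
        connected = time.perf_counter()
        request = (
            f"GET {path} HTTP/1.1\r\n"
            f"Host: {host}\r\n"
            "Connection: close\r\n\r\n"
        )
        sock.sendall(request.encode("ascii"))
        sock.recv(1)  # blocks until the first response byte arrives
        first_byte = time.perf_counter()
    return {
        "rtt_ms": (connected - start) * 1000,
        "ttfb_ms": (first_byte - start) * 1000,
    }
```

The gap between `ttfb_ms` and `rtt_ms` is a rough indicator of server processing time plus one more network round trip, which is the portion edge computing aims to shrink by shortening the path to the user.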