Key Design Concerns

System Design Principles

The ability of a system to handle growing amounts of work. Read more about Scalability on Google Cloud.

Vertical Scaling (Scale Up): Adding resources (CPU/RAM) to a single node. Limited by hardware ceilings.
Horizontal Scaling (Scale Out): Adding more nodes to a pool. Theoretically infinite but requires stateless application design.

SLA (Service Level Agreement): Contractual uptime guarantee (e.g., 99.9% = ~43m downtime/month). Learn about SLAs from Atlassian.
Redundancy: Eliminating Single Points of Failure (SPOF).
- Active-Passive: Standby node takes over on failure.
- Active-Active: All nodes handle traffic; load redistributed on failure.
CAP Theorem: In a distributed system, you can only pick two: Consistency, Availability, Partition Tolerance. CAP Theorem explained by IBM.

Understanding the internal state of the system based on external outputs. What is Observability? (New Relic).

Logs: Discrete events (e.g., “User logged in”). Centralized via ELK/Splunk.
Metrics: Aggregated numerical data (e.g., “CPU usage”, “Requests/sec”). Visualized via Grafana/Datadog.
Tracing: Lifecycle of a request across microservices. (OpenTelemetry).

TTFB (Time to First Byte): Server processing time + Network latency. What is TTFB? (Cloudflare).
RTT (Round Trip Time): Time for a packet to go client->server->client. What is RTT? (Cloudflare).
Edge Computing: Reducing latency by moving compute closer to the user (Cloudflare Workers).