Scaling — Flash Reference

What is scalability?

Ability of a system to handle increased load by adding resources without a full architectural overhaul.

Metric	What it is
RPS	Requests per second (e.g. API calls)
Concurrent users	Users active at same time
Throughput	Data transferred per unit time (e.g. GB/s)
QPS	Database queries per second
Message rate	Messages through queues per second
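RPS, QPS, and message rate are all event counts over a time window. Below is a minimal sketch of a trailing-window counter. It assumes the caller supplies timestamps in seconds in non-decreasing order. `RateCounter` is a hypothetical name, not a monitoring-library API.

```python
from collections import deque


class RateCounter:
    """Counts events (requests, queries, messages) in a trailing time window."""

    def __init__(self, window_seconds: float = 1.0) -> None:
        self.window = window_seconds
        self.events = deque()  # event timestamps in seconds, oldest first

    def record(self, timestamp: float) -> None:
        self.events.append(timestamp)

    def rate(self, now: float) -> float:
        """Events per second over the window (now - window, now]."""
        # Evict events that fell out of the trailing window.
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()
        return len(self.events) / self.window


counter = RateCounter(window_seconds=2.0)
for t in [0.1, 0.5, 1.2, 1.9, 2.4]:
    counter.record(t)
print(counter.rate(now=2.5))  # 1.5 (three events in the last 2 s)
```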

Load increase	Response time	Verdict
2x	About the same or slightly higher	Excellent: sublinear, caching works
10x	Increases linearly	Acceptable: predictable
10x	Spikes or times out	Critical: bottleneck or breaking point

Goal: Linear or sublinear degradation; avoid superlinear spikes.
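One way to make sublinear, linear, and superlinear concrete is to solve latency ratio = load ratio^k for k. Under this definition, k < 1 is sublinear, k near 1 is linear, and k > 1 is superlinear. The sketch below assumes a ±0.1 band around 1, which is an arbitrary choice.

```python
import math


def scaling_exponent(load_ratio: float, latency_ratio: float) -> float:
    """Solve latency_ratio = load_ratio ** k for k (load_ratio must be > 1)."""
    return math.log(latency_ratio) / math.log(load_ratio)


def verdict(load_ratio: float, latency_ratio: float, tolerance: float = 0.1) -> str:
    k = scaling_exponent(load_ratio, latency_ratio)
    if k < 1 - tolerance:
        return "sublinear"    # latency grows slower than load
    if k <= 1 + tolerance:
        return "linear"       # latency grows roughly in step with load
    return "superlinear"      # latency outpaces load: look for a bottleneck


print(verdict(2, 1.1))    # sublinear
print(verdict(10, 10))    # linear
print(verdict(10, 50))    # superlinear
```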

Vertical scaling (scale up)

Definition: Add more power to the same machine (more CPU, RAM, SSD, network).

Pros: Simple, no code changes, low latency, no distributed-systems complexity.
Cons: Ceiling set by the largest available hardware, single point of failure, cost grows faster than capacity (a machine with 2x the resources often costs more than 2x as much), downtime during upgrades.
Use when: Scaling a database before sharding, strong consistency is required, early-stage products, predictable moderate growth.

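Horizontal scaling (scale out)

Definition: Add more machines; a load balancer distributes traffic across them.

Pros: No hardware ceiling, fault tolerance (losing one node doesn't take the service down), often cheaper (commodity machines), nodes can be placed near users.
Cons: Operational complexity, keeping data consistent across nodes, extra network latency between components, app servers usually need to be stateless.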
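The simplest policy a load balancer can use is round-robin. Skipping nodes that fail health checks is where the fault tolerance comes from. The sketch below uses hypothetical server names and a hypothetical `RoundRobinBalancer` class. Real load balancers such as NGINX or HAProxy also offer other policies, such as least-connections.

```python
import itertools


class RoundRobinBalancer:
    """Hands each incoming request to the next healthy server in turn."""

    def __init__(self, servers: list[str]) -> None:
        self.servers = list(servers)
        self.healthy = set(servers)
        self._cycle = itertools.cycle(self.servers)

    def mark_down(self, server: str) -> None:
        self.healthy.discard(server)  # e.g. after a failed health check

    def mark_up(self, server: str) -> None:
        if server in self.servers:
            self.healthy.add(server)

    def pick(self) -> str:
        # Try each server at most once per call, skipping unhealthy ones.
        for _ in range(len(self.servers)):
            server = next(self._cycle)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers")


lb = RoundRobinBalancer(["app-1", "app-2", "app-3"])
print([lb.pick() for _ in range(4)])  # ['app-1', 'app-2', 'app-3', 'app-1']
lb.mark_down("app-2")
print([lb.pick() for _ in range(3)])  # ['app-3', 'app-1', 'app-3']
```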

Stateless	Stateful
No session on server; any server can serve any request	Session on one server → all requests must go there
Session in a shared store (Redis) or a client-side token (JWT); files in object storage (S3)	Sticky sessions; hotspots; hard to remove servers
Easy to scale	Hard to scale

Make stateless: Redis/Memcached for sessions, JWT for client-side auth state, object storage (S3) for uploaded files.
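The JWT option works because state travels with the request, so any server can serve it. What makes this safe is a signature that every server can check.

The sketch below is a simplified version built on Python's standard `hmac` module. It is not the JWT format (there is no header and no expiry claim). The `SECRET` value and the function names are hypothetical. Production code should use a vetted JWT library.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional

# Hypothetical shared secret; every app server loads the same value from config.
SECRET = b"replace-with-a-long-random-secret"


def _sign(payload: str) -> str:
    return hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()


def issue_token(claims: dict) -> str:
    """Encode claims and append an HMAC-SHA256 signature: '<payload>.<signature>'."""
    payload = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode()).decode()
    return f"{payload}.{_sign(payload)}"


def verify_token(token: str) -> Optional[dict]:
    """Return the claims if the signature checks out, else None. No server-side lookup."""
    payload, _, signature = token.rpartition(".")
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(signature.encode(), _sign(payload).encode()):
        return None
    return json.loads(base64.urlsafe_b64decode(payload))


# Issued by one app server...
token = issue_token({"user_id": 42, "role": "member"})
# ...verified by any other server that shares SECRET.
print(verify_token(token))  # {'role': 'member', 'user_id': 42}
```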

Tier	Ease	Main levers
App	Easiest	Stateless + LB + auto-scale + multi-AZ
Database	Hardest	Read replicas, sharding, or NoSQL
Cache	Easy	Redis Cluster, consistent hashing, cache-aside
Queues	Easy	Decouple producers/consumers; partition (e.g. Kafka)
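Below is a minimal consistent-hashing ring for the cache tier. It assumes SHA-256 for placement and 100 virtual nodes per server; both are arbitrary choices, and `HashRing` is a hypothetical name. The key property is that removing a node only remaps the keys that lived on it.

Redis Cluster itself uses a different scheme: 16384 fixed hash slots rather than a ring.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Stable hash (Python's built-in hash() is randomized per process).
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes: list[str], vnodes: int = 100) -> None:
        self.vnodes = vnodes
        self._ring = []  # sorted (hash, node) pairs, one per virtual node
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def node_for(self, key: str) -> str:
        # First virtual node clockwise from the key's hash, wrapping at the end.
        # Assumes the ring holds at least one node.
        idx = bisect.bisect(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["cache-a", "cache-b", "cache-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}
ring.remove("cache-b")
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
# Only keys that lived on cache-b should move.
print(all(before[k] == "cache-b" for k in moved))  # True
```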

Database scaling

Read-heavy (10:1–100:1 read:write): Read replicas; the primary takes writes, replicas serve reads. Trade-off: replication lag (a replica can briefly return stale data).
Write-heavy or huge data: Sharding; partition by key (range / hash / directory). Trade-off: cross-shard queries and joins become expensive or unsupported; rebalancing is hard.
Need both / flexibility: Sharding + replicas, or consider NoSQL (many stores offer built-in sharding, often with eventual consistency and limited or no joins).
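The read-replica split can be sketched as a query router: writes go to the primary, and reads rotate across replicas. A read that must see the caller's own recent write is sent to the primary to sidestep replication lag. The class name, the `read_your_writes` flag, and the naive SELECT check are all hypothetical simplifications.

```python
import itertools


class ReadWriteRouter:
    """Sends writes to the primary and spreads reads across replicas round-robin."""

    def __init__(self, primary: str, replicas: list[str]) -> None:
        self.primary = primary
        self._replicas = itertools.cycle(replicas) if replicas else None

    def route(self, sql: str, read_your_writes: bool = False) -> str:
        # Naive classification: only plain SELECTs count as reads. A real router
        # should also keep locking reads (e.g. SELECT ... FOR UPDATE) and reads
        # inside a write transaction on the primary.
        is_read = sql.lstrip().upper().startswith("SELECT")
        if not is_read or read_your_writes or self._replicas is None:
            return self.primary
        return next(self._replicas)


router = ReadWriteRouter("db-primary", ["db-replica-1", "db-replica-2"])
print(router.route("INSERT INTO orders VALUES (1)"))                 # db-primary
print(router.route("SELECT * FROM orders"))                          # db-replica-1
print(router.route("SELECT * FROM orders"))                          # db-replica-2
print(router.route("SELECT * FROM orders", read_your_writes=True))   # db-primary
```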

Sharding strategies: Range (A–H, I–P…), Hash (key mod N), Directory (lookup table).
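The three strategies can be sketched as routing functions. The shard boundaries, tenant names, and function names are hypothetical. The hash variant hashes string keys first so they can be taken mod N. The last function measures what happens to integer keys under plain key mod N when a fifth shard is added to four.

```python
import bisect
import hashlib

# Range sharding: hypothetical split on the first letter of a username.
RANGE_UPPER_BOUNDS = ["I", "Q"]  # shard 0: A-H, shard 1: I-P, shard 2: Q-Z


def range_shard(username: str) -> int:
    return bisect.bisect_right(RANGE_UPPER_BOUNDS, username[:1].upper())


def hash_shard(key: str, num_shards: int) -> int:
    # Stable hash, then mod N (Python's built-in hash() varies between processes).
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Directory sharding: an explicit lookup table (hypothetical tenants).
SHARD_DIRECTORY = {"tenant-a": 0, "tenant-b": 2, "tenant-c": 1}


def directory_shard(tenant: str) -> int:
    return SHARD_DIRECTORY[tenant]  # raises KeyError for unknown tenants


def fraction_moved(num_keys: int, old_n: int, new_n: int) -> float:
    """Share of integer keys whose shard changes under key mod N when N changes."""
    moved = sum(1 for k in range(num_keys) if k % old_n != k % new_n)
    return moved / num_keys


print(fraction_moved(10_000, 4, 5))  # 0.8
```

With these inputs, 80% of keys change shards. A key stays put only if key mod 4 equals key mod 5, which happens exactly when key mod 20 is 0 to 3. This is why rebalancing is hard, and why consistent hashing or a directory is often used instead of plain mod N.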

Typical growth path (user counts are rough guides)

Stage	Users	Change	Next bottleneck
1	0–10K	Single server (app + DB)	CPU/memory contention
2	10K–100K	DB on separate server	DB read load
3	100K–500K	Add Redis cache	Single app server
4	500K–2M	LB + multiple stateless app servers	DB again
5	2M–10M	Read replicas (primary writes, replicas read)	Write throughput
6	10M+	Sharding (partition by user ID etc.)	Cross-shard queries, rebalancing

Key takeaways

Vertical: Replace the box with a bigger one; simple but capped.
Horizontal: Add more boxes; needs a stateless design and a data strategy (replication, partitioning).
Always find the bottleneck before scaling (more app servers don’t help if the DB is the limit).
Patterns: LB, caching, async queues, read replicas, and sharding show up in almost every scalable system.
Scalability = handle more load; availability = stay up when things fail (next topic).
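The "find the bottleneck first" point can be checked with back-of-the-envelope arithmetic: end-to-end throughput is capped by the slowest tier. The sketch below uses hypothetical numbers: 500 RPS per app server, a database limited to 5,000 QPS, and 4 queries per request. The function name and all figures are illustrative assumptions, not measurements.

```python
def capacity(app_servers: int, rps_per_server: float,
             db_qps_limit: float, queries_per_request: float) -> tuple[float, str]:
    """Return (max end-to-end RPS, limiting tier); the slowest tier sets the cap."""
    app_rps = app_servers * rps_per_server
    db_rps = db_qps_limit / queries_per_request  # requests/s the DB can absorb
    return (app_rps, "app") if app_rps < db_rps else (db_rps, "database")


for n in (2, 4, 8):
    rps, tier = capacity(n, rps_per_server=500, db_qps_limit=5000, queries_per_request=4)
    print(f"{n} app servers -> {rps:.0f} RPS, limited by {tier}")
# 2 app servers -> 1000 RPS, limited by app
# 4 app servers -> 1250 RPS, limited by database
# 8 app servers -> 1250 RPS, limited by database
```

Going from 4 to 8 app servers adds nothing here. Two changes move the cap instead: caching, which means fewer DB queries per request, and read replicas, which raise the DB limit for reads.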