🔒 Distributed Locking

In a single-machine environment, we use local locks (mutexes) to protect shared resources. In a distributed system, we need a Distributed Lock to ensure that only one node performs a specific action at a time.

1. Why do we need it?

Mutual Exclusion: Prevent multiple instances of a background job from running simultaneously.
Resource Protection: Ensure only one service is writing to a specific file or database record.

2. Common Implementation Patterns

A. Redis-based (Redlock)

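Using the Redis SET command with the NX option (set only if the key does not exist) and an expiration time, on a single instance. Redlock extends this idea by acquiring the same lock on a majority of several independent Redis nodes.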

Pros: Extremely fast, simple.
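Cons: Not safe under all failures. Redis replication is asynchronous, so after a network partition or failover, a replica promoted to primary may lack the lock key, letting a second client acquire the same lock.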
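
The single-instance pattern above can be sketched as follows. This is a minimal illustration, not a complete lock: it assumes `client` is a redis-py style object exposing `set(..., nx=True, px=...)` and `eval(...)`, and the function names `acquire` and `release` are hypothetical. The random token lets a client release only a lock it still owns.

```python
import uuid

# Lua script: delete the key only if it still holds our token, so a client
# whose lock already expired cannot delete a lock now owned by someone else.
RELEASE_SCRIPT = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
end
return 0
"""


def acquire(client, key, ttl_ms):
    """Try once to take the lock; return the owner token, or None if held."""
    token = uuid.uuid4().hex
    # Equivalent to: SET key token NX PX ttl_ms
    if client.set(key, token, nx=True, px=ttl_ms):
        return token
    return None


def release(client, key, token):
    """Release the lock only if we still own it; return True on success."""
    return client.eval(RELEASE_SCRIPT, 1, key, token) == 1
```

A caller would typically retry `acquire` with a delay and always call `release` in a `finally` block. The compare-and-delete runs as a script because a separate GET followed by DEL would leave a window in which the key could expire and be taken by another client.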

B. ZooKeeper-based

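A client takes the lock by creating an ephemeral node. If the client's session ends (for example, because it crashes or stays disconnected past the session timeout), ZooKeeper deletes the node automatically and the lock is released.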

Pros: Strongly consistent (a CP system in CAP terms).
Cons: Slower than Redis, more complex to manage.
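
To make the ephemeral-node behaviour concrete, here is a toy in-memory model. It is not the ZooKeeper client API: `ToyCoordinator` and its methods are hypothetical names, and the use of sequence numbers (lowest number holds the lock) follows the common ZooKeeper lock recipe rather than anything stated above.

```python
import itertools


class ToyCoordinator:
    """In-memory stand-in for a ZooKeeper-like service (illustration only)."""

    def __init__(self):
        self._seq = itertools.count()
        self._nodes = {}  # sequence number -> session id that created it

    def create_ephemeral_sequential(self, session_id):
        """Create a lock node tied to a session; return its sequence number."""
        seq = next(self._seq)
        self._nodes[seq] = session_id
        return seq

    def holder(self):
        """The session owning the lowest-numbered node holds the lock."""
        if not self._nodes:
            return None
        return self._nodes[min(self._nodes)]

    def close_session(self, session_id):
        """Simulate a session ending: all of its nodes are deleted."""
        self._nodes = {
            seq: sid for seq, sid in self._nodes.items() if sid != session_id
        }


coord = ToyCoordinator()
coord.create_ephemeral_sequential("client-a")
coord.create_ephemeral_sequential("client-b")
first = coord.holder()            # client-a created the lowest node
coord.close_session("client-a")   # client-a crashes or disconnects
second = coord.holder()           # the lock passes to client-b
```

The point of the model is that nobody has to call an explicit release for a dead client: removing the session removes its node, and the next waiter becomes the holder.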

C. Etcd-based

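Similar to ZooKeeper, but built on leases: the lock key is attached to a lease that the client must keep alive. If the keep-alives stop, the lease expires and the key is deleted.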

Pros: Strongly consistent (Raft-based) and proven in production as the backing store for Kubernetes cluster state.
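
The lease mechanism can be modelled in a few lines. The sketch below is an in-memory illustration, not the etcd API: `ToyLeaseStore`, `grant`, `keep_alive`, and `put_if_absent` are hypothetical names, and time is passed in explicitly as `now` so the behaviour is deterministic.

```python
class ToyLeaseStore:
    """In-memory model of lock keys attached to leases (not the etcd API)."""

    def __init__(self):
        self._leases = {}  # lease id -> expiry time
        self._keys = {}    # key -> (value, lease id)
        self._next_id = 1

    def grant(self, ttl, now):
        """Create a lease that expires `ttl` time units after `now`."""
        lease_id = self._next_id
        self._next_id += 1
        self._leases[lease_id] = now + ttl
        return lease_id

    def keep_alive(self, lease_id, ttl, now):
        """Refresh a lease that has not expired; return True if refreshed."""
        self._expire(now)
        if lease_id not in self._leases:
            return False
        self._leases[lease_id] = now + ttl
        return True

    def put_if_absent(self, key, value, lease_id, now):
        """Take the lock key if nobody holds it; return True on success."""
        self._expire(now)
        if key in self._keys or lease_id not in self._leases:
            return False
        self._keys[key] = (value, lease_id)
        return True

    def _expire(self, now):
        """Drop expired leases and every key attached to them."""
        dead = {lid for lid, exp in self._leases.items() if exp <= now}
        for lid in dead:
            del self._leases[lid]
        self._keys = {k: v for k, v in self._keys.items() if v[1] not in dead}
```

For example, a client that grants a lease with `ttl=10` at time 0 and takes the key keeps the lock as long as it calls `keep_alive` before each expiry. Once it stops, the first `put_if_absent` by another client after the expiry time succeeds.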

3. Critical Considerations

Lock Expiration (TTL)

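If a process crashes while holding a lock, the lock must eventually be released, which is why every lock carries a TTL. However, if the TTL is too short, or the process stalls (for example, in a long garbage-collection pause), the process may still be working when the lock expires, so another process can acquire it and both run at once.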
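
One partial mitigation is for the holder to track how much of its TTL is left and stop before it runs out. The sketch below assumes a hypothetical `HeldLock` helper created just before the acquire request is sent, so the local estimate is conservative. It narrows the window but cannot close it, since the process can still stall right after the check.

```python
import time


class HeldLock:
    """Client-side estimate of how long a TTL-based lock remains valid."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        # A monotonic clock is used so wall-clock adjustments do not
        # change the estimate.
        self._clock = clock
        self._deadline = clock() + ttl_seconds

    def remaining(self):
        """Seconds left before the lock is assumed to have expired."""
        return max(0.0, self._deadline - self._clock())

    def safe_to_continue(self, margin_seconds):
        """True if more than `margin_seconds` of validity is left."""
        return self.remaining() > margin_seconds
```

A worker might call `safe_to_continue(2)` before each step of a long job and either renew the lock or abort when it returns False.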

Fencing Tokens

To prevent the "late write" problem (where a process whose lock has already expired finally finishes its task and writes stale data), every lock acquisition should return a Fencing Token (a monotonically increasing ID) that the client sends along with each write. The storage system should only accept writes with a token greater than the last one seen.
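
The token check can be sketched as below. `LockService` and `FencedStore` are hypothetical in-memory stand-ins, and the store applies the rule exactly as stated above (strictly greater), which permits one write per acquisition. A variant that also accepts a token equal to the last one seen would allow several writes under the same lock.

```python
import itertools


class LockService:
    """Hands out a strictly increasing fencing token per acquisition."""

    def __init__(self):
        self._counter = itertools.count(1)

    def acquire(self):
        return next(self._counter)


class FencedStore:
    """Storage that rejects writes carrying a stale fencing token."""

    def __init__(self):
        self.value = None
        self._last_token = 0

    def write(self, token, value):
        """Apply the write only if its token is newer than the last seen."""
        if token <= self._last_token:
            return False
        self._last_token = token
        self.value = value
        return True


locks = LockService()
store = FencedStore()

token_a = locks.acquire()   # client A gets token 1, then stalls past its TTL
token_b = locks.acquire()   # the lock expired, so client B gets token 2
b_ok = store.write(token_b, "from B")   # accepted
a_ok = store.write(token_a, "from A")   # rejected: token 1 is stale
```

The important design point is that the check lives in the storage system, not in the client: a stalled client cannot know its lock has expired, but the store can always compare tokens.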

4. Summary

Choosing a distributed lock depends on your consistency requirements. For high-speed, "good enough" locking where an occasional double execution is tolerable, use Redis. For mission-critical correctness, use ZooKeeper or etcd, and pair the lock with fencing tokens.