Before we explain the solution, let’s examine the parts of the system that need the reliability guarantee.
Reliability analysis
Conceptually, everything a node does revolves around two concepts: data and computation. As long as the data is durable, it’s easy to recover the computational result by running the same code on another node. Lost data, on the other hand, is lost forever, so we only need to worry about the reliability of data. The reliability of the system is mostly about the reliability of the data.
There are four types of data in our system.
File-based command
File-based event
File-based state
State snapshot
Let us take a close look at how to ensure the reliability of each type of data.
State and snapshots can always be regenerated by replaying the event list. To improve the reliability of the state and snapshot, we just need to ensure the event list has strong reliability.
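The replay idea can be illustrated with a minimal sketch. The event shape (`BalanceChanged`), the `replay` function, and the sequence numbers are hypothetical names chosen for illustration, not the system's actual design; the sketch only assumes that each event records a signed change to one account's balance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BalanceChanged:
    # Hypothetical event shape: a signed change to one account's balance.
    sequence: int
    account: str
    delta: int


def replay(events, snapshot=None, snapshot_sequence=0):
    """Rebuild account balances by applying events in order.

    If a snapshot is given, start from a copy of it and skip every event
    whose sequence number the snapshot already covers.
    """
    state = dict(snapshot or {})
    for event in events:
        if event.sequence <= snapshot_sequence:
            continue  # already reflected in the snapshot
        state[event.account] = state.get(event.account, 0) + event.delta
    return state


events = [
    BalanceChanged(1, "A", 100),
    BalanceChanged(2, "A", -30),
    BalanceChanged(3, "B", 30),
]

# Rebuild the state from scratch.
full_state = replay(events)

# Take a snapshot after event 2, then rebuild from the snapshot plus the tail.
snapshot = replay(events[:2])
from_snapshot = replay(events, snapshot=snapshot, snapshot_sequence=2)
```

In this sketch both paths arrive at the same balances, which is why losing the state or a snapshot is recoverable as long as the event list survives.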
Now let us examine the command. On the face of it, an event is generated from a command, so we might think that a strong reliability guarantee for commands is sufficient. This seems correct at first glance, but it misses something important: event generation is not guaranteed to be deterministic. It may depend on random factors such as random numbers or external I/O, so replaying the same command can produce a different event. Commands therefore cannot guarantee the reproducibility of events.
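A small sketch shows why a stored command is not enough. The `TransferCommand`, `TransferEvent`, and `handle` names are hypothetical, and the random event ID stands in for any non-deterministic input (random numbers, external I/O) that event generation might use.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferCommand:
    # Hypothetical command: a request to move money between accounts.
    from_account: str
    to_account: str
    amount: int


@dataclass(frozen=True)
class TransferEvent:
    # Hypothetical event: the recorded fact, including a generated ID.
    event_id: str
    from_account: str
    to_account: str
    amount: int


def handle(command):
    """Turn a command into an event.

    uuid4() is random, so handling the same command twice yields
    two events that differ in event_id.
    """
    return TransferEvent(
        event_id=str(uuid.uuid4()),
        from_account=command.from_account,
        to_account=command.to_account,
        amount=command.amount,
    )


command = TransferCommand("A", "B", 30)
first = handle(command)
second = handle(command)
```

Here `first` and `second` carry the same amount but are not equal, so re-running a stored command does not reproduce the original event.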
Now it’s time to take a close look at the event. An event represents a historical fact that changes the state (the account balance). Events are immutable and can be used to rebuild the state.
From this analysis, we conclude that event data is the only type of data that requires a high-reliability guarantee. We will explain how to achieve this in the next section.