Reliability in Software

Most systems today are data intensive rather than compute intensive. CPU power is rarely the limiting factor. The bigger problems are usually the amount of data, the complexity of the data, and the speed at which the data is changing. When designing a data-intensive application, we should ask ourselves these questions:

How do you make sure data remains correct and complete even when things go wrong internally?
How do you provide good performance to your clients even when parts of your system are degraded?
How do you scale to handle an increase in load?
What does a good API for the service look like?

When dealing with data-intensive applications, we must look at the following:

Reliability
Maintainability
Scalability

Reliability

Reliability is the ability of a system to continue working correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware faults, software faults, human errors).

Expectations of reliability in software

The application performs the function the user expects
It can tolerate the user making mistakes or using the software in unexpected ways
Its performance is good enough for the required use case under the expected load and data volume
The system prevents any unauthorized access or abuse

A system that continues to work well even when things go wrong is called a fault-tolerant or resilient system. A fault is not a failure. A fault is when one component of the system deviates from its spec.

A failure is when the whole system stops providing the services the user expects from it.

A failure is caused by a fault, so the real goal is to minimize or contain faults. It is not practically possible to build a system with no faults at all, which is why we need to design systems that continue to work correctly even when faults occur.
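To make the distinction concrete, here is a minimal sketch of tolerating a fault so that it does not become a failure. Everything in it is hypothetical: the replicas are plain Python functions, and `read_with_failover` is a name invented for this example. One replica deviating from its spec (a fault) is absorbed, and the caller only sees a failure if every replica is down.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when no replica could serve the request (a failure)."""


def read_with_failover(replicas: Sequence[Callable[[str], str]], key: str) -> str:
    """Try each replica in turn; a single faulty replica is tolerated."""
    errors = []
    for replica in replicas:
        try:
            return replica(key)
        except Exception as exc:  # a fault in one component
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replicas failed for key {key!r}")


def broken_replica(key: str) -> str:
    # Hypothetical replica that is always unreachable.
    raise ConnectionError("replica unreachable")


def healthy_replica(key: str) -> str:
    # Hypothetical replica that always answers.
    return f"value-for-{key}"


# The first replica is faulty, but the system as a whole still answers.
result = read_with_failover([broken_replica, healthy_replica], "user:42")
print(result)
```

The design choice here is simply redundancy plus a fallback order; real systems usually add timeouts and limits on retries as well.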

What causes systems to be unreliable?

Hardware faults
Software faults
Human errors

Hardware Faults

Hardware faults are things like servers going down, machines shutting off unexpectedly, disks failing, RAM going bad, or someone unplugging the wrong network cable. This happens more often than we might imagine. On average, we expect the mean time to failure (MTTF) of a disk in a data center to be about 5 to 10 years, so in a data center with 10,000 disks we should expect at least one disk to fail per day.

The usual response is to add redundancy: extra disks arranged in RAID, hot-swappable CPUs, and spare components that take over when one fails. For a long time this was enough. In recent years, however, systems have gained more and more complex components, and making each of them redundant is not as easy as it seems, which means there are more points of failure. As a result, most systems are now adopting software fault-tolerance techniques in preference to, or in addition to, hardware redundancy.
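The disk failure estimate in this section can be checked with a quick back-of-the-envelope calculation. The sketch below assumes failures are spread evenly over time, so the expected number of failures per day is the number of disks divided by the MTTF in days. The function name is made up for illustration.

```python
def expected_failures_per_day(num_disks: int, mttf_years: float) -> float:
    """Rough expected disk failures per day, assuming a constant failure rate."""
    mttf_days = mttf_years * 365
    return num_disks / mttf_days


for mttf_years in (5, 10):
    rate = expected_failures_per_day(10_000, mttf_years)
    print(f"MTTF {mttf_years} years -> about {rate:.1f} failures per day")
```

Under these assumptions, 10,000 disks give roughly 3 to 5 failures per day, which is consistent with expecting at least one failed disk a day.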

Software Faults

What about software faults? Software faults tend to cause more trouble than hardware faults. Hardware faults are usually independent of each other: when one disk breaks, it does not cause other disks to break, unless there is a common factor such as vibration or high temperature in the server rack. Software bugs are different. When a bug is triggered, the trouble it causes can cascade through the system and is not easy to resolve. Such bugs are hard to catch because they usually stay unnoticed until an unusual condition triggers them, at which point the system starts to behave unexpectedly. This happens because software makes assumptions about its environment, and those assumptions usually hold until, suddenly and unexpectedly, they do not. There is no quick fix for this; what helps is a series of small steps:

Think carefully about the assumptions the system makes
Isolate processes
Measure and monitor
Let the program run and fail so that you can find the failure points
Test thoroughly
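Two of these steps, questioning assumptions and monitoring, can be combined into a self-check that runs alongside the program. The sketch below is a hypothetical in-memory queue wrapper (the class and method names are assumptions, not from any library). It counts how many messages went in and came out, and reports a discrepancy instead of silently trusting that nothing is ever lost.

```python
from collections import deque


class MonitoredQueue:
    """In-memory queue that counts traffic so an invariant can be checked."""

    def __init__(self) -> None:
        self._items: deque = deque()
        self.enqueued = 0
        self.dequeued = 0

    def put(self, item) -> None:
        self._items.append(item)
        self.enqueued += 1

    def get(self):
        item = self._items.popleft()
        self.dequeued += 1
        return item

    def check_invariant(self) -> bool:
        """Assumption under test: messages in == messages out + messages waiting."""
        return self.enqueued == self.dequeued + len(self._items)


queue = MonitoredQueue()
for n in range(3):
    queue.put(n)
queue.get()

if not queue.check_invariant():
    print("ALERT: message counts do not add up")  # hook for real alerting
else:
    print("queue invariant holds")
```

In a real service the check would run periodically and feed a monitoring system rather than print to the console.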

Human Errors

Humans are known to be unreliable. In fact, more outages are caused by human configuration mistakes than by hardware faults. We cannot avoid this entirely, since to err is human, but we can reduce the damage by doing the following:

Provide a test environment, such as a sandbox, that does not affect production
Design systems that minimize the opportunity for human error and encourage doing the right thing
Make it easy to roll back from a mistake before it causes more issues
Set up detailed and clear monitoring
Implement good management practices and training
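As one illustration of making rollback easy, the sketch below keeps every configuration version so that a bad change can be undone in one call. `ConfigStore` and its methods are hypothetical names made up for this example; a real system would more likely rely on version control or a deployment tool.

```python
import copy


class ConfigStore:
    """Keeps a history of configurations so a bad change is easy to undo."""

    def __init__(self, initial: dict) -> None:
        self._history = [copy.deepcopy(initial)]

    @property
    def current(self) -> dict:
        # Return a copy so callers cannot modify the stored history.
        return copy.deepcopy(self._history[-1])

    def apply(self, changes: dict) -> None:
        """Record a new version instead of mutating the old one in place."""
        new_version = {**self._history[-1], **changes}
        self._history.append(new_version)

    def rollback(self) -> None:
        """Return to the previous version; the initial one is never dropped."""
        if len(self._history) > 1:
            self._history.pop()


store = ConfigStore({"max_connections": 100, "timeout_seconds": 30})
store.apply({"timeout_seconds": 0})   # a mistaken change
store.rollback()                      # quick recovery
print(store.current)
```

Because old versions are never overwritten, recovering from a mistake is a cheap operation rather than a manual reconstruction.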

We’ve only scratched the surface of software reliability—the foundation upon which scalable systems are built. Understanding how systems remain correct and resilient under failure is the first critical step.

Next, we’ll go deeper into scalability: what it really means, how to measure it effectively, and how to recognize when your system has reached its limits. More importantly, we’ll explore the practical strategies used to handle growth in real-world systems.
