Problems in Distributed Computing
A distributed system is a system whose components run on different machines and communicate with each other by sending messages. Distributed systems are of great importance: they move data from one machine to another to fulfill a required function.
On a single machine, an internal fault usually brings down the whole machine: it either works or it crashes. A distributed system behaves differently. Some machines may be down because of a fault while others keep working. We call this partial failure.
In distributed systems, the network through which messages travel is very important. Delays, dropped messages and other network problems have to be handled for the system to function smoothly, and they are what makes distributed systems difficult to work with. Though there are several problems, here we will discuss two main ones: unreliable networks and unreliable clocks.
The machines in a distributed system coordinate and communicate over a network, and that network is unreliable. Suppose a machine sends a request. Any of the following can go wrong:
- Request lost or delayed
- Request queued
- Remote node is not present or is disconnected
- Response to this request is delayed
- Response lost
To deal with these problems, distributed systems use timeouts: if no response arrives within a set time, the sender gives up waiting. A timeout does not completely solve the problem of the unreliable network, and it raises a new one: how long should the timeout be? If the timeout is too long, a dead node goes undetected for that whole period, and during this time the system is unreliable. If the timeout is too short, a node that is only busy with some other action may be wrongly declared dead. When a node is declared dead, all of its responsibilities are transferred to other nodes, so an action that the original node was still performing can end up happening twice.
Most network delays are caused by queues that build up when traffic is high, because all the traffic shares the same resources, such as CPUs and network links. When data reaches the destination machine where it is to be processed and all the CPUs are busy, the operating system queues the data until the machine is free to handle it.
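The timeout trade-off can be illustrated with a small simulation. This is only a sketch: `SlowNode`, `send_with_timeout`, `run_demo` and the "charge-card" request are hypothetical names invented for this example, and a thread with `time.sleep` stands in for a real remote machine and network.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


class SlowNode:
    """Hypothetical remote node that is alive but slow to respond."""

    def __init__(self, delay_s):
        self.delay_s = delay_s
        self.executed = 0              # how many times the action really ran
        self._lock = threading.Lock()

    def handle(self, request):
        time.sleep(self.delay_s)       # stands in for queueing or a busy CPU
        with self._lock:
            self.executed += 1         # the side effect of the request
        return f"done: {request}"


def send_with_timeout(pool, node, request, timeout_s, attempts):
    """Send a request and retry if no response arrives within timeout_s."""
    for _ in range(attempts):
        future = pool.submit(node.handle, request)
        try:
            return future.result(timeout=timeout_s)
        except FutureTimeout:
            # No reply in time. The sender cannot tell whether the request
            # was lost, the node is dead, or the node is merely slow.
            continue
    return None


def run_demo(delay_s, timeout_s, attempts=2):
    node = SlowNode(delay_s)
    with ThreadPoolExecutor(max_workers=attempts) as pool:
        reply = send_with_timeout(pool, node, "charge-card", timeout_s, attempts)
    # Leaving the `with` block waits for in-flight handlers to finish,
    # so node.executed counts every time the action actually ran.
    return reply, node.executed
```

Calling `run_demo(0.2, 0.05)` models a timeout that is too short: the sender gives up on both attempts and gets no reply, yet the slow node still carries out the action for each attempt. Calling `run_demo(0.02, 1.0)` models a generous timeout, where the first attempt succeeds and the action runs once.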
A distributed system has more than one machine, and the machines communicate over a network. Every message between two machines has to cross that network, so with multiple machines there are multiple network and message delays, as already discussed. Message delay is not uniform. It varies from one network to another, which makes handling time tricky.
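Every machine has its own hardware clock. Small differences in calibration make some clocks run slightly slower and others slightly faster. These clocks are synchronized using the Network Time Protocol (NTP), which adjusts a machine's clock according to the time reported by a group of servers. Those servers in turn get their time from a more accurate source, such as a GPS receiver.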
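To get a feel for how a small calibration difference adds up, the arithmetic can be sketched as below. The function name `worst_case_drift_s` and the drift rate of 200 parts per million are assumptions chosen for illustration, not figures from this article; real hardware may drift more or less.

```python
def worst_case_drift_s(drift_ppm: float, seconds_since_sync: float) -> float:
    """Upper bound on the clock error accumulated since the last sync.

    drift_ppm is the assumed drift rate in parts per million, that is,
    microseconds of error per second of elapsed time.
    """
    return drift_ppm * 1e-6 * seconds_since_sync


# Assumed drift rate of 200 ppm, evaluated for three sync intervals:
# 30 seconds, one hour and one day.
for interval_s in (30, 3600, 86400):
    print(interval_s, worst_case_drift_s(200, interval_s))
```

Under this assumed rate, a clock resynchronized every 30 seconds could be off by about 6 milliseconds, while one synchronized only once a day could be off by roughly 17 seconds.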
Computer systems have two kinds of clock. The first is the time-of-day clock, which gives the current wall-clock time. This type of clock is synchronized using NTP.
The second kind is the monotonic clock, whose value always moves forward. Each CPU may have its own timer, so a system with more than one CPU can have multiple clocks. A monotonic clock is not reset to a new value. Instead, its rate is adjusted: it is slowed down if it runs fast and sped up if it runs slow. Several things cause synchronization issues. In virtual machines especially, the clock is also virtual. When the virtual machine is paused, its clock is paused too, and when the virtual machine resumes, the clock makes a sudden jump.
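The difference between the two kinds of clock shows up when measuring elapsed time. The sketch below uses Python's standard `time.monotonic()` and `time.time()`; the `timed` helper and the `JumpingWallClock` class are hypothetical, and the class only imitates a time-of-day clock being stepped backward between two readings, as might happen after a correction.

```python
import time


def timed(work, clock=time.monotonic):
    """Run work() and return the elapsed seconds as measured by `clock`."""
    start = clock()
    work()
    return clock() - start


class JumpingWallClock:
    """Hypothetical time-of-day clock that is stepped back by step_s
    after its first reading, imitating a clock correction."""

    def __init__(self, step_s):
        self.step_s = step_s
        self.reads = 0

    def __call__(self):
        self.reads += 1
        # Every reading after the first one is shifted back by step_s.
        offset = -self.step_s if self.reads > 1 else 0.0
        return time.time() + offset


def short_task():
    time.sleep(0.01)


# Measured with the monotonic clock, the duration is never negative.
print(timed(short_task))

# Measured with a time-of-day clock that jumps back 5 seconds mid-task,
# the computed "duration" comes out negative.
print(timed(short_task, clock=JumpingWallClock(5.0)))
```

For this reason, durations such as timeouts are usually measured with the monotonic clock, while the time-of-day clock is kept for timestamps that need to be compared with the outside world.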
Most of our work is now distributed, and we rely on distributed systems such as the cloud. In this age of AI, how can we minimize these kinds of problems in distributed systems?