Consistent hash - techintroduce

Introduction

The consistent hashing algorithm was proposed in the paper Consistentthashingandrandomtrees in 1997, and it is widely used in distributed systems. Consistent hashing is a kind of hashing algorithm. Simply put, when removing or adding a server, this algorithm can change the mapping relationship between the existing service request and the processing request server as little as possible, so as to satisfy the monotonicity as much as possible. Sexual requirements. In an ordinary distributed cluster, there can be a one-to-one correspondence between service requests and processing request servers, which means that the mapping relationship between service requests and processing servers is fixed, and a certain request is processed by a fixed server. This method cannot load balance the entire system, and may cause some servers to be too busy to handle new requests. Other servers are too idle, the resource utilization of the overall system is low, and when a server in the distributed cluster goes down, some service requests cannot be processed directly.

Further improvements can use the hash algorithm to map the relationship between the service request and the processing server to achieve the purpose of dynamic allocation. The service request is converted through the hash algorithm, and the converted result is a modulo operation on the server node value, and the modulo value is the request processing server corresponding to the service request. This method can cope with node failure. When a distributed cluster node goes down, service requests can be redistributed to other available servers through the hash algorithm. This avoids the situation where the request cannot be processed.

But the shortcomings of this method are also obvious. If the server saves the data corresponding to the service request, if the hash value of the request is recalculated, a large number of requests will be relocated to different servers. Causes the data to be used by the request to be invalid, which is very bad in a distributed system. A well-designed distributed system should have good monotonicity, that is, the addition and removal of servers will not cause a large number of hash relocations, and consistent hashing can just solve this problem.

The consistent hash algorithm maps the entire hash value space into a virtual circle, and the value range of the entire hash space is 0~2³²-1. The entire space is organized in a clockwise direction. 0~2^32-1 coincides in the direction of the zero point. Next, use the following algorithm to map the service request, use the hash algorithm to calculate the corresponding hash value of the service request, and then look up clockwise along the circle according to the location of the hash value. The first server encountered is the corresponding processing request Server. When a new server is added, the affected data is only the data between the newly added server and the previous server in its ring space (that is, the first server encountered in the counterclockwise direction), and everything else Will not be affected. To sum up, the consistent hash algorithm only needs to relocate a small part of the data in the ring space for the increase or decrease of nodes, which has good fault tolerance and scalability.

Working principle

The consistent hash algorithm is one of the current mainstream distributed hash table protocols. It modifies the simple hash algorithm and solves the hotPot ) Problem, its principle is divided into two steps:

Consistent hash

First, the hash value of the storage node is calculated, which abstracts the storage space into a ring, and configures the storage node on the ring. All nodes on the ring have a value. Secondly, hash the data and map it to the nearest node in a clockwise direction. When a node fails offline, according to the mapping method of the algorithm, only the data objects in the interval between the faulty node on the ring and the next node are affected, and these objects themselves are mapped to the faulty node. . When there is an increase in nodes, for example, adding a node H between nodes A and B, only the data objects between node H traverse counterclockwise until B are affected, and these are remapped to H. Therefore, when a node changes, the data on the entire storage space will not be remapped, which solves the problem of inefficiency caused by the simple hash algorithm to add or delete nodes and remap all data.

As an important algorithm in the field of distributed storage, the consistent hash algorithm basically solves a key problem in the storage environment represented by P2P-how to perform data processing in a dynamic network topology. Distribute and select routing. In the storage topology formed by the algorithm, each storage node only needs to maintain a small amount of information about neighboring nodes, and when a node joins/leaves the system, only a small number of related nodes participate in the maintenance of the topology, which makes consistency The Greek algorithm has become a practical DHT (DistributedHashTable) algorithm. But the consistent hashing algorithm still has its shortcomings. First, in the query process, the query message has to go through O(n) steps (n represents the total number of nodes in the system) to reach the node being queried. It is not difficult to imagine that when the scale of the system is very large, the number of nodes may exceed one million, and such query efficiency is obviously difficult to meet the needs of use. Second, when a new physical node is added or deleted in a distributed storage system that uses a consistent hash algorithm, the data related to the next node must be migrated, and the query hit rate and storage efficiency will decrease, affecting the overall system. Performance.

The relationship with the hash algorithm

The consistent hash algorithm is proposed on the basis of the hash algorithm. In a dynamically changing distributed environment, the hash algorithm should satisfy Several conditions: balance, monotonicity and dispersion.

①Balancing means that the result of the hash should be evenly distributed to each node, which solves the problem of load balancing algorithmically.

② Monotonicity means that when adding or deleting nodes, it does not affect the normal operation of the system.

③Decentralization means that data should be stored dispersedly on each node in a distributed cluster (the nodes can have backups themselves), and it is not necessary for each node to store all the data.

Advantages

Scalability. The consistent hashing algorithm ensures that when servers are added or reduced, the data storage changes are the least, which greatly saves the overhead of data movement compared to traditional hashing algorithms.
To better adapt to the rapid growth of data. Use consistent hashing algorithm to distribute data. When data continues to grow, some virtual nodes may contain a lot of data, resulting in uneven distribution of data on virtual nodes. At this time, virtual nodes with a lot of data can be split. This split is only It divides the original virtual node into two without re-hashing and dividing all the data. After the virtual node is split, if the load of the physical server is still unbalanced, you only need to adjust the storage distribution of some virtual nodes among the servers. In this way, the number of physical servers can be dynamically expanded with the growth of data, and the cost is much smaller than the redistribution of all data by traditional hash algorithms.

Application

The distributed storage system HepyCloud is a massive data storage system independently developed by the Institute of High Energy, Chinese Academy of Sciences. The system uses key-value technology. Realize the fast storage, positioning and high scalability of massive data, and support EB-level storage. The system proposes the idea of a unified layout and improves the consistent hash algorithm.

The HepyCloud system adopts an improved consistent hashing algorithm to achieve uniform distribution and rapid positioning of data. When choosing a hash function, it is mainly considered from the following two aspects: (1) Operation efficiency; ( 2) The hash is uniform. Operational efficiency means that the selected hash function has high computational efficiency, realizes rapid data positioning, and achieves a good user experience; hash uniform means that the selected hash function has good distribution, ensuring that the data is stored Even distribution on the device. The Davies-Meyer algorithm is a better choice. On the one hand, the efficient operation efficiency ensures the rapid positioning of data; on the other hand, the uniform hash distribution ensures the even distribution of data. From the actual use point of view, the improved consistent hash and Davies-Meyer algorithm are applied to the HepyCloud system to realize the even distribution of data on the storage device. The system has 23 storage devices with a storage capacity of 186TB and 14478054 files. The number of files on each device is about 629410 (total number of files/number of devices). In terms of data positioning, after testing and actual use, its performance is comparable to other distributed file systems, and it is sufficient to meet the performance requirements of the storage system.