Sharding CAT

From CloudScale
Jump to: navigation, search

A Sharding pattern is used for enabling scalability in case of storing and accessing large volumes of data. The Sharding Pattern improves system performance and scalability by separating data into partitions called shards. Since workload is distributed over more storage nodes, sharding helps in resolving challenges resulting from computing resources and throughput limits of the single storage node.

Contents

Also Known As

Database Sharding, Horizontal Data Partitioning

Problem

Large-scale cloud applications may use large volume of data, resulting in exceeding limits of single-server storage space. Adding more disc capacity can solve this problem only temporarily since the data storage requirements can increase over time and the system will eventually reach a hard limit for expanding storage space. Similar problems occur with supporting a large number of concurrent users requiring more computing resources and network bandwidth than a single server can offer in order to serve users without degrading performance resulting in extended response times and time-outs. Scaling vertically can provide an initial solution for this problems, but it can only do so to a certain extend, which might be insufficient for a commercial cloud application.

Solution

An example of sharding pattern implementation using lookup strategy. Each data is mapped to a shard based on the shard key.

A more efficient solution to the problem is dividing the data store into horizontal partitions (i.e., shards). All shards have the same database schema and they are structurally identical, with the rows of data distributed across the shards. In this way each partition holds a subset of data, and data storage is not based on a single storing node. This approach enables load balancing between shards, hence improving system performance. Scalability is achieved through adding supplementary shards, avoiding the limits of vertical scaling.

The partitioning of the data is done based on data attributes that form the shard key. In order to store the data to the appropriate shard, as well as to identify the shard that the data is stored in while retrieving it, a sharding strategy is used.

Sharding Strategies

Sharding strategy is used to point the appropriate shard for the data that is being stored or retrieved. The sharding strategy is commonly implemented in the application layer, although some storage systems offer an integrated sharding support. The choice of the sharding strategy should be made based on the most commonly used queries and business requirements. There are three most commonly used strategies for distributing the data across shards.

The Lookup strategy

The Lookup sharding strategy uses the sharding key to map the data to the appropriate shard. The mapping can be based on physical shards (each sharding key refers to a physical storage partition) or virtual shards (each sharding key refers to a virtual partition that is mapped to a physical partition). This approach enables more control over the data shards. Also, if a sharding key is referring to the virtual partition, modifying shards becomes much easier since the application source code does not have to be modified. However, this approach can introduce overhead as a result of locating the appropriate shard. Moving the data across the shards, or adding a new shard requires updating mapping information, usually during the period of lowered user activity.

The Range strategy

The Range sharding strategy distributes the data based on the range of the sharding key value. The data within a specific key value range will be stored in the same shard, ordered by the shard key. Typically, sharding key can be composed of the part used for identifying the shard, and a part used for identifying a specific row of data in the shard. This sharding strategy is most convenient if commonly used queries retrieve data records within a given range of a key values (e.g., retrieve all receipts created in a certain time period). Using the Range strategy can make the data management easier, but it might result in unbalanced load among the shards, since a particular range of data can be accessed more often. Adding a new shard, or migrating data across the shards can require merging and splitting the data, and updating information about mapping of ranges to the storage partitions.

The Hash strategy

The Hash strategy determines the appropriate shard based on the computed hash of the attributes in the sharding key. The hashing function should be defined with the goal of achieving equally distributed data size and load for each of the shards. The main benefit of this strategy is achieving balance among shards. However, computing the hash can introduce an overhead. One of the advantages of this strategy is that it is not necessary to update the mapping information in case of adding a new shard, since the shard information is determined from the hash function.

Implementation Considerations

Several considerations should be taken into account while implementing the sharding pattern.

  • Shard key should be composed of attributes that hold unvarying data. Changing the shard key can require migrating the data to another shard and may create an update overhead.
  • Using auto-incrementing fields for a shard key should be avoided since it may result in shard keys not being unique across the shards.
  • Sharding strategy and a shard key should be designed according to the most commonly used queries.
  • Sharding strategy should be designed in a way that will reduce and avoid retrieving data from different shards. If necessary, denormalizing data can be considered to keep the data that is often retrieved together.
  • Sharding strategy should aim to maintain the balance of the workload among the shards. Updating data often may require shards re-balancing.
  • Retrieving the data from different shards can be done using fan-out queries, i.e., retrieving data in parallel, and combining it into the final result
  • Reducing the number of operations that update data in different shards will make the process of maintaining consistency between shards easier
  • Creating more smaller shards can be more convenient than fewer larger shards, since having greater number of shards can be more beneficial for load balancing. Also, smaller shards are more easily migrated.
  • Distribution of data across the shards can also be based on the data isolation and privacy issues

Known Uses

The sharding pattern should be used for improving system performance and scalability by balancing load and decreasing contention in a data store. It should be applied in case it is necessary to ensure an application will be able to scale beyond the possibilities offered by a single storage node, regarding any type of computing or network resource.

Benefits

  • System scalability enabled by adding more shards
  • Avoiding possibility of reaching the hard limit for capacity expansion
  • Balancing the workload across shards improves performance
  • Shards can be geographically distributed and hence located near the users accessing the data

References