Dynamo: Amazon’s Highly Available Key-value Store 论文笔记

Dynamo: Amazon’s Highly Available Key-value Store

  1. System Assumptions and Requirements in this case

    1. High write availability (this is based on their use cases like shopping carts, user should be able to update the shopping carts anytime). So the design is also writable and resolve conflicts when read.
    2. Query model is simple read and write operations to a data item which is uniquely identified by unique keys. No need for relational schemas. (Which is also based on the observation of some Amazon’s services.)
    3. ACID(Atomicity, Consistency, Isolation, Durability) are not strictly followed since it targets applications that tolerant weaker consistency, which is called eventually consistency.
  2. Design Considerations

    1. When to resolve update conflicts? Read or Write?
      1. Since it focus on high write availability, so it pushes conflict resolution to reads (which unlike many traditional DBs which execute conflict resolution during writes and has simple policy for reads)
    2. Who to resolve the conflicts? The data store or application?
      1. The application is responsible to resolve conflict updates. Since data store only has simple police like “last write wins” to resolve conflicts while application has more knowledge of each different situations and could have different strategy to resolve conflicts.
    3. Incremental scalability
      1. Add/Delete one node at a time without having a huge impact on both read/writes of the system.
    4. Symmetry
      1. No outstanding nodes. Each node should have the same responsibilities as its peers.
  3. Architecture

    Problem

    Technique

    Advantage

    Partitioning

    Consistent Hashing

    Incremental Scalability

    High Availability for writes

    Vector clocks with reconciliation during reads

    Version size is decoupled from update rates.

    Handling temporary failures

    Sloppy Quorum and hinted handoff

    Provides high availability and durability guarantee when some of the replicas are not available.

    Recovering from permanent failures

    Anti-entropy using Merkle trees

    Synchronizes divergent replicas in the background.

    Membership and failure detection

    Gossip-based membership protocol and failure detection.

    Preserves symmetry and avoids having a centralized registry for storing membership and node liveness information.

    sosp-figure2-small

    1. Partitioning (Consistent Hashing)
      1. Both node and key are mapped to the same hash space (eg. 00~FF)
      2. Key K is stored in B, which means B is responsible for K
      3. Pros:
        1. Load balance (each node would get roughly similar number of keys)
        2. Scalability (add/delete one nodes, only its neighbors would be affected)
    2. Replication  
      1. Dynamo is setup, N is assigned as a parameter indicating each data item is replicated on N nodes.
      2. Each key contains a list of nodes which is responsible for its read/write operation. Which is called Preference List. Length of the preference list should be larger than N just in case nodes failures.
      3. Using the consistent hashing, each node finds its coordinator, who is responsible to replicate the data to N-1 clockwise successor nodes.
    3. Versioningsosp-figure3-small
      1. Vector Clock is used to show if there are update conflicts. Mainly used in key-value storage which doesn’t have locks for writes to pursue better performance.
      2. D5([Sx, 3],[Sy, 1],[Sz,1]) means data item 5 which was updated by Sx 3 times, Sy 1 time, Sz 1time. Using the vector, it is easily to find out if two different version are parallel.
      3. When reads the data, the vector clock is also included in the data item.
      4. Deep understanding and examples, please check here
      5. Cons: Vector Clock some times could be too long if there are many different servers involved in writes. But in real cases it should not happen since writes are generally handled by top N nodes in the preference list of that key. Even if it happens, we can have a upper bound size of the vector clock and get rid of the old vectors depending on the timestamp, which might potentially cause problems when trying to resolve conflicts.
    4. Get() & Put() operation
      1. Only first N healthy nodes in the preference list are involved. (those are down and inaccessible are skipped)
      2. W + R > N (W/R: number of nodes which should success for writes/reads)
      3. When put(), the coordinator generates the vector clock with the new version and writes the new version locally. Then replicates the new version to first N reachable node in the preference list. Consider write successful as long as there is W-1 nodes respond.
      4. Similarly, for get(), the coordinates request the data from first N reachable nodes from the preference list and as long as there are R-1 response it will then returns all version of the data.
    5. Failure handling (Hinted Handoff)
      1. Check the Dynamo ring above, if node A is down, the data item which is supposed to written to A is now written to D (suppose N=3) along with the metadata (indicating which node it is supposed to be at) which is stored separately in D
      2. Once such hint is discovered, and A is recovered, D will send the replica to A and then delete the replica from itself.
      3. Hinted Handoff ensures read/writes won’t be rejected due to single node down or network failure.
    6. Recovering from permanent failures、Membership and failure detection待进一步整理。

Reference: http://blog.ddup.us/2011/11/07/amazon-dynamo/

FacebookTwitterGoogle+Share