Scaling out Postgres, Part 2: Routing
In the first part of this post I described our thought process for evolving connect.me from a single database to a partitioned database with a fixed number of partitions. We decided to partition on the dimension of user profiles because we expect this dimension to exhibit the most growth. We call a partition a cell, and we use an external key provided during authentication to choose the cell where a profile is or will be located. Such a key is a combination of a provider name, such as twitter or facebook, and a unique identifier within that provider, which we call a proid. Internally we identify a profile by a 64-bit integer identifier called a puid. We use puids to express relationships between profiles, as in vouching. Each cell handles a dedicated range of puids from which it allocates new ones sequentially, starting with some puid base. It is possible to quickly find the cell for a puid by checking puid ranges. Routing is the process of resolving a provider-proid pair or a puid to its corresponding cell.
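To make the puid-range lookup concrete, here is a minimal sketch in Python. The cell names and range boundaries are illustrative placeholders, not our actual configuration:

```python
# Hypothetical cell configuration: each cell owns a dedicated puid range.
# The names and boundaries below are made up for illustration.
CELLS = [
    {"name": "cell-0", "puid_base": 0, "puid_max": 2**32 - 1},
    {"name": "cell-1", "puid_base": 2**40, "puid_max": 2**40 + 2**32 - 1},
]

def cell_for_puid(puid):
    """Resolve a puid to its cell by checking puid ranges."""
    for cell in CELLS:
        if cell["puid_base"] <= puid <= cell["puid_max"]:
            return cell["name"]
    raise LookupError(f"no cell covers puid {puid}")
```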
With the design as described so far we have a fixed number of cells. In production we may start with some initial number and place them on a small number of servers. We will be able to scale out until each cell has two dedicated servers, one for each cell replica. The process of moving databases between servers has the potential to add downtime. Even if a configurable SAN is used to store data files, we may have performance issues with cold database caches. We needed the ability to add new cells without repartitioning. In such a setting, existing profiles and their related data would remain stored and accessed in their original cells; only new profiles would be allocated in the new cells. This means that old cells would still need the capacity to keep up with the growth of existing profiles. We call a set of newly added cells a generation. When databases are initially provisioned, we are at Generation 0. Each new generation adds a certain number of new cells. This number is usually larger than, but can also be equal to, the number of cells in the previous generation. Let us see how we evolved the design with a fixed number of cells to support these requirements.
In order to route a provider-proid pair to its corresponding cell, we apply a hash function that produces a 32-bit hash value. We place cells on a ring that represents the complete space of 32-bit values. For example, in the case of an unsigned integer interpretation of hash values, the ring starts at 0 and connects back to 0 after a full circle, with the last value being 2^32-1. Each cell owns an equally sized segment of the ring, meaning all provider-proid pairs whose hash values fall within a cell's segment will be located in that cell. This way we make sure that, with a sufficient number of profiles in the system, they are spread evenly between cells. We assumed that profiles and their related data have a reasonably small standard deviation of size and workload, and that we can store a reasonably large number of profiles in one cell. With these assumptions we have a well-balanced partitioning.
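As a sketch of this scheme, the following Python snippet maps a provider-proid pair onto the 32-bit ring and picks the cell owning that segment. CRC-32 is used here purely as a stand-in for whatever 32-bit hash function is actually in use:

```python
import zlib

def cell_for_external_key(provider, proid, num_cells):
    """Hash a provider-proid pair onto the 32-bit ring and return the
    index of the cell whose equal-sized segment covers the hash value.
    CRC-32 is only a placeholder for the real hash function."""
    hash_value = zlib.crc32(f"{provider}:{proid}".encode())  # unsigned 32-bit
    segment_width = 2**32 // num_cells
    # min() guards the edge where 2**32 is not an exact multiple of num_cells
    return min(hash_value // segment_width, num_cells - 1)
```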
Similarly we can visualize the puid space. The bigint type in Postgres is a 64-bit signed integer. To represent the complete space of its values as a ring, we start at -2^63 and connect back with the last value of 2^63-1. Each cell has a base for puid allocation, and we place these bases on the ring with an equal amount of space between them. Interestingly, if we assume that each cell has a capacity of 2^32 profiles, we would need 2^32 cells to make their puid ranges adjacent on the ring. With a much smaller practical number of cells, and a smaller practical capacity per cell, there is significant space between the ranges on the ring.
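Placing the puid bases evenly on the signed 64-bit ring can be sketched as follows, assuming a cell count that divides the ring size evenly:

```python
def puid_bases(num_cells):
    """Place num_cells puid bases with equal spacing on the signed
    64-bit ring that runs from -2**63 to 2**63 - 1."""
    spacing = 2**64 // num_cells
    return [-2**63 + i * spacing for i in range(num_cells)]
```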
Let's see how we can add a new generation of cells to our provider-proid ring. The existing Generation 0 cells have hash ranges covering the full ring. The new Generation 1 ranges can be layered on top of them. If Generation 1 doubles the number of cells from Generation 0, then for each Generation 0 range there will be two Generation 1 ranges layered on top. All new profile allocations are handled by the highest generation, which we call the current generation. This way new profiles are spread evenly across the current generation's cells. But what happens with profiles located in older, non-current generations? Routing a provider-proid pair now potentially requires multiple hops to locate an existing profile. Routing starts with the current generation: we try to locate the profile in the cell whose range matches the provider-proid pair's hash value. If the profile is not found in Generation G, we repeat the query in Generation G-1. We terminate routing when we either locate the profile or fail to find it in Generation 0, meaning the profile does not yet exist and needs to be allocated in the current generation. With each hop being a roundtrip to a database and an index lookup, how many hops do we expect in the worst case? If we grow the number of cells exponentially, the number of hops is at most the logarithm of the ratio of the number of cells in the current generation to the number of cells in Generation 0. For example, if we start with 16 cells and grow to 16384 cells in the current generation by repeatedly doubling the number of cells, we will have at most 10 hops. This would be a serious performance problem if performed for all profile lookups internally. But remember that routing a provider-proid pair, which is the external key, happens only during authentication or when external references to profiles need to be resolved.
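The generation-walking lookup described above can be sketched like this. The `profile_exists` callable stands in for the per-cell database query, and the cell-count list is illustrative:

```python
def route_external_key(hash_value, generation_sizes, profile_exists):
    """Walk generations from current to Generation 0, probing the cell
    whose hash-ring segment covers hash_value.

    generation_sizes: cell counts indexed by generation, e.g. [16, 32, 64].
    profile_exists(gen, cell): placeholder for the database roundtrip.
    Returns (generation, cell) on a hit, or None when the profile does
    not yet exist and must be allocated in the current generation.
    """
    for gen in reversed(range(len(generation_sizes))):
        num_cells = generation_sizes[gen]
        segment_width = 2**32 // num_cells
        cell = min(hash_value // segment_width, num_cells - 1)
        if profile_exists(gen, cell):
            return gen, cell
    return None
```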
Both cases are biased toward a subset of profiles, such as active users or highly connected social profiles. This provides an opportunity for caching.
For internal identification of profiles we use puids. Let us see how we add a new generation of cells to the puid ring. Assume Generation 1 doubles the number of cells in Generation 0. The odd cells in Generation 1 then get unique puid ranges of their own, which use only a small fraction of the space between the Generation 0 ranges. But the even cells in Generation 1 would have their ranges collide with Generation 0. To avoid this collision we again rely on the fact that the space between ranges is significantly larger than the width of a single range. With each new generation we shift the ranges on the ring by adding the range width to their puid base and puid maximum. This way ranges don't collide and we can keep adding new generations of cells. The non-collision of cell ranges in the puid space gives us a great property: a routing function with constant complexity. Such routing depends only on the cell configuration and requires no database roundtrips.
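A minimal sketch of these two pieces: the per-generation base shift, and a roundtrip-free puid router over the local cell configuration. The tuple layout is an assumption for illustration; the lookup here uses a `bisect` over sorted bases, so it touches only local configuration, never the database:

```python
import bisect

def shifted_base(base, generation, range_width):
    """Each generation shifts a cell's puid range by one range width,
    so ranges from different generations never collide."""
    return base + generation * range_width

def build_puid_router(cells):
    """cells: list of (puid_base, puid_max, cell_name) tuples from the
    local, rarely changing configuration. Routing needs no database
    roundtrips, only a lookup over the sorted bases."""
    cells = sorted(cells)
    bases = [base for base, _max, _name in cells]

    def route(puid):
        i = bisect.bisect_right(bases, puid) - 1
        if i >= 0 and puid <= cells[i][1]:
            return cells[i][2]
        raise LookupError(f"no cell covers puid {puid}")

    return route
```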
In summary, we now have a partitioning scheme with the ability to add new cells without stopping the system. We incur logarithmic complexity when routing external profile identifiers, the provider-proid pairs, which does not happen often and can be further optimized by local caching of resolved routes. We achieve constant complexity when routing internal profile identifiers, the puids, which we rely upon for expressing relationships between profiles. With trivial local caching of a rarely changing cell configuration, we have a very fast implementation for mapping internal identifiers in bulk to their corresponding cells. We believe this approach is fairly reusable. Within connect.me we are currently applying it to additional dimensions of our data model where we expect significant growth. We also hope that this design can be applied to other applications with a similar separation between external and internal identification and a need for scalability. We would love to hear your thoughts and comments.