Why Erlang matters
I still vividly remember my first encounter with an SMP computer. This was around 1996, and it was a dual-socket Pentium Pro monster. It blew my mind. Having grown up along with the evolution of x86, I was always curious about how the hardware worked. My first 8088 board was simple enough to understand every signal line on it. Now this monster machine had two processors, two sets of L1 and L2 caches, and RAM shared between them. I wondered how it was possible to keep the caches and RAM in sync, and how someone would write software for this incredible beast.
Fast-forward to 2012. Two things happened: the Internet and the power wall.
The Internet ignited a renaissance of distributed computing. We often cite a scale of operation that was unprecedented in pre-Internet information systems. But this is true for just a handful of companies. What truly undid traditional information systems is how fast one’s capacity requirements can grow when a product hits the hockey stick. This made old systems terribly uneconomical. Scaling up expensive proprietary systems was soon replaced with scaling out generic consumer-level kits.
The power wall killed the great hope that we could scale single-threaded performance forever. While superscalar improvements were still marginally increasing the number of instructions executed per cycle, vendors started an SMP renaissance. As with the theory of distributed computing in the 1970s, parallel computers were well understood both practically and theoretically by the 1980s.
Scaling out Postgres, Part 2: Routing
In the first part of this post I described our thought process for evolving connect.me from a single database to a partitioned database with a fixed number of partitions. We decided to partition on the dimension of user profiles because we expect this dimension to exhibit the most growth. We call a partition a cell, and we use an external key provided during authentication to choose the cell where a profile is or will be located. Such a key is a combination of a provider name, such as twitter or facebook, and a unique identifier within that provider, which we call a proid. Internally we identify a profile by a 64-bit integer identifier called a puid. We use puids to express relationships between profiles, as in vouching. Each cell handles a dedicated range of puids from which it allocates new ones sequentially, starting with some puid base. It is possible to quickly find the cell for a puid by checking puid ranges. Routing is the process of resolving a provider-proid pair or a puid to the corresponding cell.
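The puid-range lookup described above can be sketched as follows. The cell bases here are illustrative values, not our actual configuration, and the code is a minimal model rather than our production implementation:

```python
import bisect

# Assumed per-cell puid bases, sorted ascending. Each cell owns puids from
# its base up to (but not including) the next cell's base. Values are
# illustrative only.
PUID_BASES = [1, 1_000_000_001, 2_000_000_001]

def cell_for_puid(puid: int) -> int:
    """Resolve a puid to a cell index by finding the puid range containing it."""
    i = bisect.bisect_right(PUID_BASES, puid) - 1
    if i < 0:
        raise KeyError(f"puid {puid} precedes all cell ranges")
    return i
```

Because the bases are kept sorted, the lookup is a binary search, so routing by puid stays fast no matter how many cells exist.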
With the design as described so far we have a fixed number of cells. In production we may start with some initial number and place them on a small number of servers. We will be able to scale out until each cell has two dedicated servers, one for each cell replica. The process of moving databases between servers has the potential to add downtime. Even if a configurable SAN is used to store data files, we may have performance issues with cold database caches. We needed the ability to add new cells without repartitioning. In such a setting, existing profiles and their related data would remain stored and accessed in their original cell; only new profiles would be allocated in new cells. This means that old cells would still need the capacity to keep up with the growth of existing profiles. We call the set of newly added cells a generation. When databases are initially provisioned, they are at Generation 0. Each new generation has a certain number of new cells. This number is usually larger than, but can also be equal to, the number of cells in the previous generation. Let us see how we evolved the design with a fixed number of cells to support these requirements.
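As a hypothetical sketch of how new profiles might be steered into the newest generation only: the generation layout, cell names, and hashing below are illustrative assumptions, not our production code.

```python
import hashlib

# Illustrative generations: each is a list of cell names added together.
# Existing profiles never move; only profile creation consults the newest
# generation.
GENERATIONS = [
    ["cell0", "cell1"],             # Generation 0: initial cells
    ["cell2", "cell3", "cell4"],    # Generation 1: cells added later
]

def cell_for_new_profile(provider: str, proid: str) -> str:
    """Place a brand-new profile in one of the newest generation's cells,
    chosen by hashing its external provider-proid key."""
    newest = GENERATIONS[-1]
    digest = hashlib.sha1(f"{provider}:{proid}".encode("utf-8")).digest()
    return newest[int.from_bytes(digest[:8], "big") % len(newest)]
```

The key property is that the choice is deterministic in the external key, so the same provider-proid pair always resolves to the same cell once created.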
Scaling out Postgres, Part 1: Partitioning
At connect.me we started building our web application with one relational database implemented in Postgres. The idea was to quickly iterate on the feature set to solidify our data model. In a later stage we planned to work on a scale-out approach that would allow us to add server machines without major service disruption. The objective was to support exponential growth, which we could face given the viral elements in our product. It turned out to be an interesting and challenging problem, and I want to share our thought process and the resulting design in this blog post.
The essence of the connect.me data model is to represent people’s profiles and vouches. A vouch is a directed edge in the reputation graph: there is one giving and one receiving profile. A vouch also always has a specified tag, such as “guitarist” or “cyclist”. There are other aspects that make the model more complicated, but for the purposes of this discussion this simplified version is sufficient. With a single relational database this model has a trivial implementation. The profiles table has a unique 64-bit integer id, which we called a puid, along with other profile attributes. A new puid is generated with a sequence starting at 1. The tags table also has a sequence-generated 64-bit integer id along with a tag name. The vouches table has a source puid, a target puid and a tag id.
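A minimal in-memory rendering of this simplified model; the real tables live in Postgres, and the class and field names here are illustrative, not our actual schema:

```python
from dataclasses import dataclass, field
from itertools import count

# Sequences starting at 1, mirroring the Postgres sequences described above.
_puid_seq = count(1)
_tag_seq = count(1)

@dataclass
class Profile:
    name: str = ""
    puid: int = field(default_factory=lambda: next(_puid_seq))  # unique 64-bit id

@dataclass
class Tag:
    name: str = ""
    tag_id: int = field(default_factory=lambda: next(_tag_seq))

@dataclass
class Vouch:
    source_puid: int   # the giving profile
    target_puid: int   # the receiving profile
    tag_id: int        # e.g. the id of the "guitarist" tag
```

A vouch is just the triple of two puids and a tag id, exactly the directed, tagged edge of the reputation graph.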
When a database is limited to running on a single computer, only a certain load can be served with acceptable latency. Adding hardware to this single computer helps only so much: doubling or tripling the load may require significantly more than twice or thrice the cost in hardware to scale up. Such an approach is expensive and eventually hits its limit. Ideally, we would start with a single inexpensive computer and, as load increases, keep adding the same inexpensive computers, resulting in a near-linear function between load and cost. Such horizontal scaling out is commonplace in today’s web applications because it provides a more predictable cost model.
Clusters of inexpensive commodity hardware led to a disruption in the database ecosystem. What further amplifies this disruption is the falling price of RAM and solid-state storage. A number of NoSQL databases are truly leveraging this disruption. At Meshin, one of our key requirements is the lightning-fast delivery of query results. Think of Google Instant search: showing results with “as you type” latency enables a whole new class of use cases. This led us to consider various in-memory database engines. Redis, with its simple and elegant data model and very transparent performance characteristics, came out on top. For in-memory databases, a single-threaded design becomes an obvious choice, essentially removing significant overhead common to traditional database architectures. Another assumption in the Meshin design is that if the index is partially or completely lost, it can be recovered by re-indexing the original data. Sure, this introduces short-term inconsistencies, but on the other hand it allows us to relax durability and further simplify the design. Such a trade-off is a perfect fit for the Redis persistence strategy.
The single-threaded design of Redis brings up another interesting issue. The scaling of a single Redis instance is limited not just by the computer it runs on, which would amount to RAM size, but also by the single core or hardware thread it will utilize. So if your load exceeds a single hardware thread or the size of available RAM, you need a second instance of Redis, and this requires some approach to clustering. Effective clustering is all about figuring out the right partitioning scheme. Ideally, you need to split the load into identically sized partitions. If the load becomes biased towards one or more partitions, you will have a bottleneck. Such a bottleneck will limit your system’s scalability, or in other words will make your load-to-cost function non-linear. It is also important that partitions are as independent as possible. If an operation spans a group of partitions, its scalability will be limited by the number of such groups in the cluster. In the extreme case of spanning all partitions in a cluster, the scalability will be no better than running on a single partition.
Another aspect is differentiating the load by reads and writes. If the Redis hardware thread capacity is exceeded by reads, it is easy to scale out by adding read-only replicas of the same data. It is much harder to scale out writes by replication, where trading off consistency is often required. Redis again takes a simple approach with its master-slave replication. A writable master replica asynchronously updates one or more read-only slave replicas: right after replying to a write operation, the master notifies all replicas, and no acknowledgment is required by the master. This means there is a short period of time when slave replicas may return old data. This provides write-write consistency guarantees that are as strong as without replication; however, the write-read consistency guarantees are now weaker, or “eventual”. It is important to note that replication not only helps the scalability of reads, but also improves reliability when replicas are placed on different machines. For the Meshin application, eventual write-read consistency is an acceptable tradeoff for higher reliability.
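The replication lag described above can be illustrated with a toy simulation. This is not real Redis, just a minimal model of a master that replies first and propagates updates to its slaves afterwards:

```python
class Replica:
    """A read-only replica: just a key-value dictionary."""
    def __init__(self):
        self.data = {}

class Master(Replica):
    """A writable master that propagates writes asynchronously, no acks."""
    def __init__(self, slaves):
        super().__init__()
        self.slaves = slaves
        self.pending = []  # writes replied to the client but not yet applied

    def set(self, key, value):
        self.data[key] = value             # reply to the client first...
        self.pending.append((key, value))  # ...slaves are notified later

    def drain(self):
        """Deliver queued updates; in reality this happens asynchronously."""
        for key, value in self.pending:
            for slave in self.slaves:
                slave.data[key] = value
        self.pending.clear()

slave = Replica()
master = Master([slave])
master.set("x", 1)
stale = slave.data.get("x")  # still None: the replica lags behind
master.drain()
fresh = slave.data.get("x")  # now 1: the replica has caught up
```

Between `set` and `drain` a reader hitting the slave sees old data, which is exactly the eventual write-read consistency window.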
With the Meshin application, we have a large number of user indexes, each of roughly equal size. Meshin keeps a sliding-window index of email, Twitter, Facebook and other messages to maintain a predictable maximum index size. This provides a great opportunity for partitioning: if each partition handles roughly an equal number of users, we have balanced the storage and load requirements and at the same time directed the most frequent operations at only one partition. One approach to choosing among N partitions is hashing a unique user identifier, for example an email address, such that the hash space is in the [0..N) range. If the quality of the hash function is sufficient, we will get a well-balanced distribution.
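A minimal sketch of this hashing approach, assuming N = 16 and MD5 purely for its distribution quality; both the partition count and the hash choice here are assumptions, not our actual configuration:

```python
import hashlib

N_PARTITIONS = 16  # assumed cluster size

def partition_for_user(email: str, n: int = N_PARTITIONS) -> int:
    """Map a unique user identifier into the [0..N) partition space.
    The hash only needs to distribute uniformly; it is not used for security."""
    digest = hashlib.md5(email.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n
```

Every operation on a user's index first computes this number, so all of that user's data lands on, and is served from, a single partition.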
Redis based triple database
The Meshin application relies on a back-end triple store for holding a person’s semantic index and for processing front-end queries. This back-end store is built on top of Redis, an open-source in-memory key-value database. Before getting into the details of how triples are represented and queried, I will briefly introduce the essential Redis features. Feel free to skip the following paragraph if you are familiar with Redis.
Redis is a key-value store where keys are binary strings and values can be either simple binary strings or higher-order data structures. These data structures include ordered lists, unordered unique sets, second-level hash tables (hsets) and weight-sorted sets (zsets). Redis exposes its functionality through a simple text-based protocol, which defines a number of commands and their corresponding replies. Commands are either generic across all kinds of keys or specialized for a value type. For example, SET A X associates the binary string value X with the key A.
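As a rough mental model, these value types map onto Python structures as follows; this is illustrative only, with Redis commands shown in comments next to their closest Python equivalent (real Redis holds this state server-side):

```python
# The keyspace: binary-string key -> value of one of the Redis value types.
store = {}

# SET A X          -- associate the binary string value X with key A
store[b"A"] = b"X"

# RPUSH mylist a b -- ordered list, appended on the right
store.setdefault(b"mylist", []).extend([b"a", b"b"])

# SADD myset a     -- unordered unique set
store.setdefault(b"myset", set()).add(b"a")

# HSET myhash f v  -- second-level hash table (hset)
store.setdefault(b"myhash", {})[b"f"] = b"v"

# ZADD myzset 1.5 m -- weight-sorted set (zset): member -> numeric score
store.setdefault(b"myzset", {})[b"m"] = 1.5
```

Type-specialized commands like RPUSH or ZADD only succeed against a key holding the matching value type, which is why the keyspace can stay a single flat dictionary.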
Under the hood of personal semantics application
The Meshin application attempts to unify various personal communication streams and provide value on top of them. A user registers streams from cloud-based carriers and installs the Meshin app on their smartphone. Cloud-based carriers, including Gmail, Facebook messages, Twitter DMs and LinkedIn messages, as well as smartphone data sources like the phone book, call history and SMS messages, are continuously crawled and indexed by Meshin. A separate index is maintained for each user. The index includes aggregated contacts and each contact’s corresponding message stream across all the carriers and smartphone data sources. Contacts represent both people and organizations. A person contact is derived from the display name of an email address or similar metadata in other carriers. An organization contact is currently derived from the Internet domain name of an email address.

Once initial indexing is performed, the Meshin smartphone app allows the user to group and filter communications in various ways. We call it slicing and dicing streams. Simple slices include streams from specific contacts, both people and organizations. It is also possible to define explicit groups of contacts. Each sliced and diced stream of messages allows further filtering by time and keyword search. With every message Meshin associates a read/unread status, so that the user can see what’s new in each stream. Currently the Meshin read/unread status is maintained independently of the carrier read/unread status.

In addition to explicit user-defined groups of contacts, Meshin creates an implicit group of contacts with high importance. The importance metric is calculated by a proprietary algorithm and is based on interaction patterns across all carriers. Factors like frequency of communication, recency, directness and reply timing are used.