Why Erlang matters
I still remember vividly my first encounter with an SMP computer. This was around 1996, and it was a dual-socket Pentium Pro monster. It blew my mind. Having grown up alongside the evolution of x86, I was always curious about how hardware worked. My first 8088 board was simple enough that I could understand every signal line on it. Now this monster machine had two processors, two sets of L1 and L2 caches, and RAM shared between them. I wondered how it was possible to keep the caches and RAM in sync, and how anyone would write software for this incredible beast.
Fast-forward to 2012. Two things happened: the Internet and the power wall.
The Internet ignited a renaissance of distributed computing. We often cite a scale of operation unprecedented in pre-Internet information systems, but that is true for just a handful of companies. What truly undid traditional information systems is how fast one's capacity requirements can grow when a product hits the hockey stick. This made the old systems terribly uneconomical. Scaling up expensive proprietary systems was soon replaced with scaling out on generic consumer-level kit.
The power wall killed the great hope that we could scale single-threaded performance forever. While superscalar improvements were still marginally increasing the number of instructions executed per cycle, vendors started an SMP renaissance. As with the theory of distributed computing in the 1970s, parallel computers were well understood, both practically and theoretically, by the 1980s.
In a distributed system we have two or more processes that talk to each other over the network. Process A sends a message to process B by copying it from the memory of process A, over the network wire, into the memory of process B. Now it seems that in an SMP computer we don't have this problem, because we can share a memory block between threads. Wrong! In an SMP system, when thread B reads a memory block written by thread A, actual copying goes on from A's CPU cache over the interconnect to B's CPU cache. Cache coherence tries to pretend this is not happening and that memory is really shared. It gets worse if you are running a NUMA system, where a message from thread A to B travels over multiple hops in the interconnect—much like a network packet travels over multiple hops in a switched data center. The more CPU cores we place on the interconnect, the less it looks like shared memory and the more it looks like separate networked computers—cache coherence does not scale.
So here we are, with these awesome systems built out of processes talking over the network and threads pretending to share memory. Threads are used when communication is more frequent and/or the amount of shared data is substantial. Networked processes are used when scaling requirements are less predictable and/or fault tolerance is required. Some applications are embarrassingly parallel. There, we can nicely package the computation to saturate a given SMP computer and scale out—and speed up—by adding more computers.
Unfortunately, these applications are rare. Most web applications form trees of waiting networked processes, with some leaves having to do disk IO on occasion. This makes it really hard to nicely saturate even today's entry-level SMP computer. Enter another renaissance—virtualization. With economic efficiency in mind, we would love to co-locate multiple parts of an application on the same physical computer. The caveat is that they may have different or even conflicting dependencies within an operating system. Virtualization gives us full isolation by running many instances of an operating system on a single physical computer. I say renaissance because the utility of virtualization was well known long before today. Mainframes used virtualization extensively for decades: they simply became far too powerful to run any one application, and it made economic sense to start virtualizing. Mainframes still exist, but they were largely disrupted by minicomputers in the 1970s, and minicomputers were in turn disrupted by PC servers in the 1990s.
It is arguable that PC servers are becoming too powerful to run distributed systems economically. Much as the economies of scale in personal computers once made PC servers superior to the previous generation of computers, today's smartphones will do the same for future ARM-based servers. At the same time, the speed of local networking will approach the speed of the CPU interconnect, while the complexity of the CPU interconnect will head toward that of a switched packet network. The two may coalesce as data-center-on-a-chip or high-density computing, whichever way you look at it.
What abstractions do we have today to deal with this type of computer? Processes over a TCP/IP network and threads over shared memory seem like two very different ways of dealing with one problem—communication between concurrent workflows. Both are arguably antiquated for this new computer. TCP/IP was designed for unreliable global networks and carries significant overhead on fast local interconnects. Shared memory does not scale to a large number of CPU cores. But all we really want is to send a message from workflow A to workflow B. Enter Erlang.
Erlang was designed in the 1980s by a group of engineers at Ericsson. They were specifically trying to address the shortcomings of existing languages for highly concurrent telephony applications with extreme reliability requirements. Concurrency meant handling millions of small processes that would occasionally communicate with each other. Reliability meant guarding against hardware failure and, more importantly, against bugs in the program. The Erlang designers made a great tradeoff from the outset: immutability and single assignment—very uncommon in conventional programming languages—are enforced by the VM. On top of immutability, the Erlang VM defines lightweight processes that can send and receive messages. This level of isolation enables much nicer garbage collection—processes can be garbage-collected independently. Further, Erlang discourages defensive error handling in favor of letting processes crash quickly. To address error recovery, Erlang defined the supervision tree—a hierarchy of watchdog processes whose only responsibility is to restart failed worker processes. Other benefits include the ability to send and receive messages between processes in VMs running on connected computers, and to hot-load code without stopping processes.
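The core of this model—spawning lightweight processes and exchanging immutable messages—fits in a few lines of Erlang. A minimal sketch (the module and message shapes here are illustrative, not from any real system):

```erlang
-module(echo).
-export([start/0, loop/0]).

%% Spawn a lightweight process and exchange a message with it.
start() ->
    Pid = spawn(echo, loop, []),   % spawn/3 returns the new process id
    Pid ! {self(), hello},         % send an immutable message to its mailbox
    receive
        {reply, Msg} -> Msg        % block until the echo comes back
    end.

%% The spawned process: take a message from the mailbox, reply, recurse.
loop() ->
    receive
        {From, Msg} ->
            From ! {reply, Msg},
            loop()
    end.
```

Note that nothing is shared: `{self(), hello}` is copied into the other process's heap, which is what lets each process be garbage-collected independently.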
I would argue that the Erlang execution model is very well suited to run directly on a future data-center-on-a-chip, without an operating system or virtualization. Immutability removes the need for cache coherence (although signaling inside the VM may still need it, or should use specialized hardware). The Erlang VM already maps one process scheduler to one hardware thread. Processes are load-balanced to evenly saturate the available CPU cores while maintaining cache locality. Messages become precisely what they already are—immutable memory blocks shuttled between caches over a switched interconnect. Supervision trees can span the hardware topology and handle both hardware failures and software bugs for great fault tolerance. It would likely be a much-evolved Erlang that runs on this future computer. It may well be the Erlang VM hosting other languages on top, while still enforcing the important semantics. Or it may be new software altogether. Erlang matters today because it demonstrates how these semantics can be elegantly packaged in one language, execution model, and virtual machine.
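The supervision idea can be sketched with raw process links (production systems use the OTP `supervisor` behaviour instead; this hand-rolled watchdog is only an illustration of the mechanism):

```erlang
-module(sup_sketch).
-export([supervise/1]).

%% A watchdog process: link to a worker, trap its exit signal,
%% and restart the worker whenever it crashes.
supervise(WorkerFun) ->
    process_flag(trap_exit, true),     % convert exit signals into messages
    Pid = spawn_link(WorkerFun),
    receive
        {'EXIT', Pid, normal} ->
            ok;                        % clean shutdown, nothing to do
        {'EXIT', Pid, _Reason} ->
            supervise(WorkerFun)       % crashed: restart the worker
    end.
```

Because supervisors are themselves ordinary processes, they can be supervised in turn, which is how the hierarchy of watchdogs described above is built.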
HN discussion: http://news.ycombinator.com/item?id=4779703