The ERTS memory allocators manage memory blocks in two types of raw memory chunks. We call these chunks of raw memory carriers. Singleblock carriers which only contain one large block, and multiblock carriers which contain multiple blocks. A carrier is typically created using mmap() on unix systems. However, how a carrier is created is of minor importance. An allocator instance typically manages a mixture of single- and multiblock carriers.
Before OTP R16 when an Erlang code module was loaded, all other execution in the VM were halted while the load operation was carried out in single threaded mode. This might not be a big problem for initial loading of modules during VM boot, but it can be a severe problem for availability when upgrading modules or adding new code on a VM with running payload. This problem grows with the number of cores as both the time it takes to wait for all schedulers to stop increases as well as the potential amount of halted ongoing work.
In OTP R16, modules are loaded without blocking the VM. Erlang processes may continue executing undisturbed in parallel during the entire load operation. The code loading is carried out by a normal Erlang process that is scheduled like all the others. The load operation is completed by making the loaded code visible to all processes in a consistent way with one single atomic instruction. Non-blocking code loading will improve real-time characteristics when modules are loaded/upgraded on a running SMP system
An easy way to handle memory allocation in a multi-threaded environment is to protect the memory allocator with a global lock which threads performing memory allocations or deallocations have to have locked during the whole operation. This solution of course scales very poorly, due to heavy lock contention. An improved solution of this scheme is to use multiple thread specific instances of such an allocator. That is, each thread allocates in its own allocator instance which is protected by a lock. In the general case references to memory need to be passed between threads. In the case where a thread that needs to deallocate memory that originates from another threads allocator instance a lock conflict is possible. In a system as the Erlang VM where memory allocation/deallocation is frequent and references to memory also are passed around between threads this solution will also scale poorly due to lock contention.
The process table is a mapping from process identifiers to process structure pointers. The process structure contains miscellaneous information about a process, as for example pointers to its heap, message queue, etc. When the runtime system needs to operate on a process, it looks up the process structure in the process table using the process identifier. An example of this is when passing a message to a process.
The process table has for a very long time just been an array of pointers to process structures. Since process identifiers internally in the runtime system are 28-bit integers it is quite easy to map a process identifier to index into the array. The 28-bits were divided into two sets. The least significant set of bits was used as index into the array. The most significant set of bits was only used to be able to distinguish between a number of identifiers with which map to the same index in the array. As long as process table sizes of a power of two was used we had 2^28 unique process identifiers.
When the first SMP support was implemented, the table still was kept more or less the same way, but protected by two types of locks. One lock that protected the whole table against modifications and an array of locks protecting different parts of the table. The exact locking strategy previously used isn’t interesting. What is interesting is that it suffered from heavy lock contention especially when lots of modifications was being made, but also when only performing lookups.
In order to be able to detect when it is safe to deallocate a previously used process structure, reference counting of the structure was used. Also this was problematic, since simultaneous lookups needed to modify the reference counter which also caused contention on the cache line where the reference counter was located. This since all modifications needs to be communicated between all involved processors.
The port table is very similar to the process table. The major difference, at least in concept, is that it is a mapping from port identifiers to port structures. It had a similar implementation, but with some differences. Instead of being an array of pointers it was an array of structures, and instead of being protected by two types of locks it was only protected by one global lock. This table also suffered from lock contention in various situations.
Early versions of the SMP support for the runtime system completely relied on locking in order to protect data accesses from multiple threads. In some cases this isn’t that problematic, but in some cases it really is. It complicates the code, ensuring all locks needed are actually held, and ensuring that all locks are acquired in such an order that no deadlock occur. Acquiring locks in the right order often also involve releasing locks held, forcing threads to reread data already read. A good recipe for creation of bugs. Trying to use more fine-grained locking in order to increase possible parallelism in the system makes the complexity situation even worse. Having to acquire a bunch of locks when doing operations also often cause heavy lock contention which cause poor scalability.
Management of processes internally in the runtime system suffered from these problems. When changing state on a process, for example from waiting to runnable, a lock on the process needed to be locked. When inserting a process into a run queue also a lock protecting the run queue had to be locked. When migrating a process from one run queue to another run queue, locks on both run queues and on the process had to be locked.
This last example is a quite common case in during normal operation. For example, when a scheduler thread runs out of work it tries to steal work from another scheduler threads run queue. When searching for a victim to steal from there was a lot of juggling of run queue locks involved, and during the actual theft finalized by having to lock both run queues and the process. When one scheduler runs out of work, often others also do, causing lots of lock contention.
Knowing When Threads Have Completed Accesses to a Data Structure
When multiple threads access the same data structure you often need to know when all threads have completed their accesses. For example, in order to know when it is safe to deallocate the data structure. One simple way to accomplish this is to reference count all accesses to the data structure. The problem with this approach is that the cache line where the reference counter is located needs to be communicated between all involved processors. Such communication can become extremely expensive and will scale poorly if the reference counter is frequently accessed. That is, we want to use some other approach of keeping track of threads than reference counting.
Knowing That Modifications of Memory is Consistently Observed
Different hardware architectures have different memory models. Some architectures allows very aggressive reordering of memory accesses while other architectures only reorder a few specific cases. Common to all modern hardware is, however, that some type of reordering will occur. When using locks to protect all memory accesses made from multiple threads such reorderings will not be visible. The locking primitives will ensure that the memory accesses will be ordered. When using lock free algorithms one do however have to take this reordering made by the hardware into account.
Hardware memory barriers or memory fences are instructions that can be used to enforce order between memory accesses. Different hardware architectures provide different memory barriers. Lock free algorithms need to use memory barriers in order to ensure that memory accesses are not reordered in such ways that the algorithm breaks down. Memory barriers are also expensive instructions, so you typically want to minimize the use of these instructions.
Before OTP R16 when trace settings were changed by erlang:trace_pattern, all other execution in the VM were halted while the trace operation was carried out in single threaded mode. Similar to code loading, this can impose a severe problem for availability that grows with the number of cores.
In OTP R16, trace breakpoints are set in the code without blocking the VM. Erlang processes may continue executing undisturbed in parallel during the entire operation. The same base technique is used as for code loading. A staging area of breakpoints is prepared and then made active with a single atomic operation.
Erlang ports conceptually are very similar to Erlang processes. Erlang processes execute Erlang code in the virtual machine, while an Erlang port execute native code typically used for communication with the outside world. For example, when an Erlang process wants to communicate using TCP over the network, it communicates via an Erlang port implementing the TCP socket interface in native code. Both Erlang Processes and Ports communicate using asynchronous signaling. The native code executed by an Erlang port is a collection of callback functions, called a driver. Each callback more or less implements the code of a signal to, or from the port.
Even though processes and ports conceptually always have been very similar, the implementations have been very different. Originally, more or less all port signals were handled synchronously at the time they occurred. Very early in the development of the SMP support for the runtime system we recognized that this was a huge problem for signals between ports and the outside world. That is, I/O events to and from the outside world, or I/O signals. This was one of the first things that had to be rewritten in order to be able to do I/O in parallel at all. The solution was to implement scheduling of these signals. I/O signals corresponding to different ports could then be executed in parallel on different scheduler threads. Signals from processes to ports was not as big of a problem as the I/O signals, and the implementation of those was left as they were.
Each port is protected by its own lock to protect against simultaneous execution in multiple threads. Previously when a process, executing on a scheduler thread, sent a port a signal, it locked the port lock and synchronously executed the code corresponding to the signal. If the lock was busy, the scheduler thread blocked waiting until it could lock the lock. If multiple processes executing simultaneously on different scheduler threads, sent signals to the same port, schedulers suffered from heavy lock contention. Such contention could also occur between I/O signals for the port executing on one scheduler thread, and a signal from a process to the port executing on another scheduler thread. Beside the contention issues, we also loose potential work to execute in parallel on different scheduler threads. This since the process sending the asynchronous signal is blocked while the code implementing the signal is executed synchronously.
Post Footer automatically generated by wp-posturl plugin for wordpress.