<p>The most direct and naive way to approach this is to allocate one thread per connection. When a client connection comes in, we allocate a thread and hand off the connection for handling, and that thread will live until the response is fully generated. When we make network calls to downstream services, this thread will block on the responses, giving us a nice linear application control flow.</p>
<p><img src="../img/media/fibers/many-threads.png" alt="loads of threads diagram"></p>
<p>Implementation-wise, this is very easy to reason about. Your code will all take on a highly imperative structure, with <em>A</em> followed by <em>B</em> followed by <em>C</em>, etc., and it will behave entirely reasonably at small scales! Unfortunately, the problem here is that threads are not particularly cheap. The reasons for this are relatively complex, but they manifest in two places: the OS kernel scheduler, and the JVM itself.</p>
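<p>As a minimal sketch of this model (where <code>Connection</code> and <code>handle</code> are hypothetical stand-ins for a real socket and request handler), the thread-per-connection approach looks something like this:</p>

```scala
// A toy sketch of thread-per-connection: one thread is allocated per
// connection and lives until the response is fully produced.
final case class Connection(id: Int)

def handle(conn: Connection): String = {
  // in a real server, this is where the thread would block on
  // downstream network calls; a sleep simulates that blocking
  Thread.sleep(10)
  s"response for connection ${conn.id}"
}

def serve(connections: List[Connection]): List[String] = {
  val results = new Array[String](connections.length)
  val threads = connections.zipWithIndex.map { case (conn, i) =>
    new Thread(() => results(i) = handle(conn)) // one thread per connection
  }
  threads.foreach(_.start())
  threads.foreach(_.join()) // each thread lives for the whole request
  results.toList
}
```

<p>Note how the handler itself reads as plain, linear code; the cost is hidden entirely in the threads.</p>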
<p>The JVM is a bit easier to understand here. Whenever you create a new thread, the garbage collector must establish an awareness of that thread and its current call stack so that it can accurately determine what parts of the shared memory space are reachable and what parts are not (literally, figuring out what memory needs to be freed). For this reason, threads are considered <em>GC roots</em>, which basically means that when the garbage collector scans the heap for active/inactive memory, it will do so by starting from each thread individually and will traverse its referenced object graph starting from the call stack and working upwards. Modern garbage collectors are extremely clever, but this bit of juggling between the GC and the actively-running threads is still a source of meaningful overhead, even today. The more threads you have, and the larger their respective object graphs, the more time the GC has to take to perform this scan, and the slower your whole system becomes.</p>
<p>In practice, on most <em>practical</em> hardware, you really can't have more than a few hundred threads before the JVM starts to behave poorly. Keeping things bounded to around a few <em>dozen</em> threads is generally considered best practice.</p>
<p>This is a huge problem, and we run face-first into it in architectures like the above where every request is given its own thread. How can we possibly serve tens of thousands of requests simultaneously when each request implies the allocation and maintenance of such an expensive resource? We need some way of handling multiple requests on the <em>same</em> thread, or at the very least <em>bounding</em> the number of threads we have floating around. Enter thread pools and the famous "disruptor pattern".</p>
<p><img src="../img/media/fibers/few-threads.png" alt="thread pool diagram"></p>
<p>In this kind of architecture, incoming requests are handed off to a scheduler (usually a shared lock-free work queue) which then hands them off to a fixed set of worker threads for processing. This is the kind of thing you'll see in almost every JVM application written in the past decade or so.</p>
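<p>On the JVM, this architecture is exactly what <code>Executors.newFixedThreadPool</code> gives you: a bounded set of workers pulling tasks off a shared internal queue. A rough sketch (with hypothetical request payloads):</p>

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

val handled = new AtomicInteger(0)
val pool = Executors.newFixedThreadPool(8) // bounded worker count

// many more requests than workers; the pool's shared queue buffers them
(1 to 10000).foreach { _ =>
  pool.execute { () =>
    // per-request work runs on whichever worker picks this task up
    handled.incrementAndGet()
    ()
  }
}

pool.shutdown()
pool.awaitTermination(10, TimeUnit.SECONDS)
```

<p>Ten thousand requests, eight threads: the thread count stays bounded no matter the load.</p>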
<p>The problem is that it <em>still</em> doesn't quite resolve the issue. Notice how we're still making that network call to <strong>downstream</strong>, which presumably doesn't return instantaneously. We have to block our <em>carrier thread</em> (the worker responsible for handling our specific request) while waiting for that downstream to respond. This creates a situation generally known as "starvation", where every thread in the pool is wasted on blocking operations, waiting for the downstream to respond, while new requests continue to flood in at the top, waiting for a worker to pick them up.</p>
<p>This is extremely wasteful, because we have a scarce resource (threads) which <em>could</em> be utilized to make some progress on our ever-filling request queue, but instead is just sitting there in <code>BLOCKED</code> state waiting for <strong>downstream</strong> to respond and just generally being a nuisance. The classic solution to this problem is to evolve to an <em>asynchronous</em> model for the network connections, allowing the downstream to take as long as it needs to get back to us, and only using the thread when we actually have real work to do.</p>
<p><img src="../img/media/fibers/async.png" alt="async pool diagram"></p>
<p>This is much more efficient! It's also incredibly confusing, and it gets exponentially worse the more complexity you have in your control flow. In practice most systems like this one have <em>multiple</em> downstreams that they need to talk to, often in parallel, which makes this whole thing get crazy in a hurry. It also doesn't get any easier when you add in the fact that just talking to a downstream (like a database) often involves some form of resource management which has to be correctly threaded across these asynchronous boundaries and carried between threads, not to mention problems like timeouts and fallback races and such. It's a mess.</p>
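<p>To see why it's confusing, here is a small sketch of the callback style with two hypothetical downstreams (<code>callUserService</code> and <code>callOrderService</code> are made-up names): even the simple linear flow "A then B" becomes nested callbacks, and every additional downstream, timeout, or error path compounds the nesting.</p>

```scala
import java.util.concurrent.{CountDownLatch, Executors}

val executor = Executors.newFixedThreadPool(4)

// hypothetical non-blocking clients: they return immediately and invoke
// the callback later, from some other thread
def callUserService(id: Int)(cb: String => Unit): Unit =
  executor.execute(() => cb(s"user-$id"))

def callOrderService(user: String)(cb: String => Unit): Unit =
  executor.execute(() => cb(s"orders-for-$user"))

val done = new CountDownLatch(1)
var response: String = null

// "A then B" is now a nest of callbacks; state must be threaded across
// threads by hand, and nothing here even attempts error handling
callUserService(42) { user =>
  callOrderService(user) { orders =>
    response = s"$user / $orders"
    done.countDown()
  }
}

done.await()
executor.shutdown()
```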
<p>However, this is essentially what modern systems look like <em>without</em> Cats Effect. It's incredibly complicated to reason about, you pretty much always get it wrong in some way, and it's actually <em>still</em> very wasteful and naive in terms of resource efficiency! For example, imagine if any one of these pipelines takes a particularly long time to execute. While it's executing on a thread, everything else that might be waiting for that thread has to sit there in the queue, not receiving a response. This is another form of starvation, and it leads to the problem space of <em>fairness</em> and related questions. When this happens in your system, you could see extremely high CPU utilization (since all the worker threads are trying very hard to answer requests as quickly as possible) and very <em>very</em> good p50 response times, but your p99 and maybe even p90 response times climb into the stratosphere, since a subset of requests under load end up sitting in the work queue for a long time, just waiting for a thread.</p>
<p>All in all, this is very bad, and it starts to hint at <em>why</em> it is that Cats Effect is able to do so much better. The above architecture is already insane and effectively impossible to get right if you're doing it by hand. Cats Effect raises your program up to a higher abstraction layer and allows you to think about things in terms of a simpler model: fibers.</p>
<p>This diagram looks a lot like the first one! In here, we're just allocating a new fiber for each request that comes in, much like how we <em>tried</em> to allocate a new thread per request. Each fiber is a self-contained, sequential unit which <em>semantically</em> runs from start to finish and we don't really need to think about what's going on under the surface. Once the response has been produced to the client, the fiber goes away and we never have to think about it again.</p>
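<p>A minimal sketch of fiber-per-request, assuming Cats Effect 3 is on the classpath (<code>callDownstream</code> is a hypothetical stand-in for a real network call; <code>IO.sleep</code> suspends the fiber without blocking any thread):</p>

```scala
import cats.effect.IO
import cats.effect.unsafe.implicits.global
import cats.syntax.all._
import scala.concurrent.duration._

// semantically "waits" for the downstream, but no thread is blocked
def callDownstream(id: Int): IO[String] =
  IO.sleep(5.millis).as(s"resp-$id")

// each request handler reads as plain sequential code, start to finish
def handleRequest(id: Int): IO[String] =
  callDownstream(id).map(r => s"request $id -> $r")

// one fiber per request: parTraverse forks a fiber for each element;
// a thousand of these are cheap, unlike a thousand threads
val responses: List[String] =
  (1 to 1000).toList.parTraverse(handleRequest).unsafeRunSync()
```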
<p>Of course, you can't map fibers 1-to-1 with threads, otherwise we end up recreating the first architecture, only with extra overhead in all directions. The solution here is to implement a <strong>Scheduler</strong> which understands the nature of fibers and figures out how to optimally map them onto some tightly controlled set of underlying worker threads. In Cats Effect 2, as in almost all asynchronous runtimes on the JVM, this was done using an <code>Executor</code> which simply maintained an internal queue of fibers. Each time a worker thread finished with a previous fiber, the queue would dictate which fiber it needed to work on <em>next</em>.</p>
<p>Critically, fibers are free to <em>suspend</em> at any time. In our example, fibers suspend whenever they talk to the <strong>downstream</strong>, since they're semantically waiting for a response and cannot make any forward progress within their sequential flow. Fiber suspension is safe and efficient though, because the scheduler responds to this by simply removing the fiber from the work queue until such time as the <strong>downstream</strong> responds, at which point the scheduler seamlessly re-enqueues the fiber and a new thread is able to pick it up. This ensures that the threads are always kept busy and any suspended fibers are truly free: the only resource they consume is memory.</p>
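<p>This suspend-and-resume mechanism is what <code>IO.async_</code> exposes, assuming Cats Effect 3. In this sketch the scheduled executor plays the role of a downstream's event loop (a hypothetical stand-in for a real non-blocking client): the fiber suspends at <code>IO.async_</code> and is re-enqueued when the callback fires, with no thread blocked in between.</p>

```scala
import cats.effect.IO
import cats.effect.unsafe.implicits.global
import java.util.concurrent.{Executors, TimeUnit}

val downstream = Executors.newSingleThreadScheduledExecutor()

def downstreamCall: IO[String] =
  IO.async_ { cb =>
    // the fiber is suspended here; a real non-blocking client would
    // invoke cb from its own event loop when the response arrives
    downstream.schedule(
      (() => cb(Right("downstream-response"))): Runnable,
      10, TimeUnit.MILLISECONDS
    )
    ()
  }

// the fiber resumes seamlessly once the callback fires
val result = downstreamCall.unsafeRunSync()
```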
<p>Cats Effect 3 has a <em>much</em> smarter and more efficient scheduler than any other asynchronous framework on the JVM. It was heavily inspired by the <a href="https://tokio.rs">Tokio</a> Rust framework, which is fairly close to Cats Effect's problem space. As you might infer from the diagram, the scheduler is no longer a central clearing house for work, and instead is dispersed among the worker threads. This <em>immediately</em> results in some massive efficiency wins, but the real magic is still to come.</p>
<p>Work-stealing schedulers are not a new idea, and in fact, they are the default when you use Monix, Scala's <code>Future</code>, or even the upcoming Project Loom. The whole idea behind a work-stealing scheduler is that the overhead of the scheduler itself can be dispersed and amortized across the worker pool. This has a number of nice benefits, but the most substantial is that it removes the single point of contention for all of the worker threads.</p>
<p>In a conventional implementation of the disruptor pattern (which is what a fixed thread pool <em>is</em>), all workers must <code>poll</code> from a single work queue each time they iterate, fetching their next task. The queue must solve the relatively complex synchronization problem of ensuring that every task is handed to <em>exactly</em> one worker, and as it turns out, this synchronization is extremely expensive. To make matters worse, the cost of this synchronization increases <em>quadratically</em> with the amount of contention: as you add worker threads, the number of synchronization points grows with the square of the number of threads. This is conceptually intuitive, since each thread creates contention faults with every other thread, and only one thread actually "wins" and gets the task. This overhead is barely perceptible with a small number of workers, but with larger pools it begins to dominate very quickly. And note that the ideal number of workers is exactly equal to the number of CPUs backing your JVM, meaning that scaling up is almost impossible with this kind of design.</p>
<p>Work-stealing, by contrast, allows the individual worker threads to manage their own <em>local</em> queue, and when that queue runs out, they simply take work from each other on a one-to-one basis. Thus, the only contention that exists is between the stealer and the "stealee", entirely avoiding the quadratic growth problem. In fact, contention becomes <em>less</em> frequent as the number of workers and the load on the pool increases. You can conceptualize this with the following extremely silly plot (higher numbers are <em>bad</em> because they represent overhead):</p>
<p><img src="../img/media/fibers/overhead.png" alt="plot of work stealing overhead vs standard disruptor pattern"></p>
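<p>To make the core idea concrete, here is a deliberately tiny toy model of work-stealing (this is <em>not</em> the Cats Effect implementation): each worker owns a local deque and only touches another worker's deque when its own runs dry, so any contention is pairwise rather than all-against-one.</p>

```scala
import java.util.concurrent.ConcurrentLinkedDeque

// Toy work-stealing worker: tasks are popped from the front of the
// local deque; steals come off the *tail* of a victim's deque, so the
// owner and the thief contend as little as possible.
final class Worker(val local: ConcurrentLinkedDeque[Runnable]) {
  def next(others: Vector[Worker]): Option[Runnable] =
    Option(local.pollFirst()).orElse {
      // only consulted when the local deque is empty
      others.iterator.flatMap(w => Option(w.local.pollLast())).nextOption()
    }
}
```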
<p>Work-stealing is simply very, very, very good. But we can do even better.</p>
<p>Most work-stealing implementations on the JVM delegate to <code>ForkJoinPool</code>. There are a number of problems with this thread pool implementation, not the least of which is that it tries to be a little too clever around thread blocking. In essence, the work-stealing implementation in the JDK has to render itself through a highly generic interface – literally just <code>execute(Runnable): Unit</code>! – and cannot take advantage of the deep system knowledge that we have within the Cats Effect framework. Specifically, we know some very important things:</p>