Using functional paradigm with immutable types does take care of a lot of the co...

sirclueless · on June 3, 2013

The kicker for me that makes me think that working too hard to make easily parallelizable constructs isn't worth it is that communication is expensive and single cores are actually pretty darn powerful on their own. The implication is that you don't want to be parallelizing at too low a level. You want to be doing it at a much higher level if you possibly can.

If you have an 8-core machine and thirty independent tasks at the top-level, then all the functional programming research and parallel algorithm wizardry in the world isn't going to make you want to parallelize anything except the top-level tasks. So even if that's a verbose and error-prone task due to procedural programming constructs, at least you only need to do it once. Heck, in my own (admittedly limited) experience 90% of the hard work I have needed done has been handled by my OS's process scheduler and Redis.

The fact is that business applications, the web, and consumer software are all perfectly happy to accept the current limitations of hardware -- 12 processes running on 4 cores is efficient and easy, even with no parallelization at all below the OS process level (with the possible exception of GUI threads in desktop applications).

The only really interesting work happens in areas like HPC where you have 1 task and 900 cores and if you can't parallelize you are dead in the water, and latency-critical applications where the sacrifices made to run intra-task parallelization pay big dividends.

tracker1 · on June 3, 2013

I think with the increasing availability of more, weaker CPUs with a decent thermal envelope, and efficient processors like ARM, then distributing work makes even more sense... in a SOA system, where a lot of processing may well be IO bound, why not break out the load even more... I think the future will be thousands of compute nodes/workers along with hundreds of service nodes handling millions of simultaneous requests.

Functional constructs make such scaling nearly effortless once the practical issues of breaking up work are handled. Yes, communications has a cost, but there are faster channels available than what are used... and distributing data persistence into redundant, sharded clusters can yield a lot of other benefits.

Given, most line of business applications are fine on current hardware... the problem is scaling to 10x or 100x the workload. You can do this by creating a system that can scale horizontally, or to be more performance oriented with current hardware. One solution gains you a single generation of increased output... another gets you N scale expansion.

I don't think it's just macro parallel tasks... but many-micro tasks that can work within such a system.

_pmf_ · on June 3, 2013

You are missing the per-CPU parallelization via vectorization that lies at the core of every modern video decoding library. The conundrum with C is that it is one of the very few languages that allow (via assembler callouts) to directly use CPU vecorization instructions, but at the same time does so in a manner that absolutely prevents automatic optimizations (i.e. by observing data dependencies).