
    More complex jobs cause the nodes to fail and it gets
    into infinite loops of restarting the nodes, replaying 
    the job, and them dying again.

This is probably a bug in your code. Debugging cluster applications does take some getting used to. Look at the stderr output of the failing executor; you'll probably see that it's dying due to some kind of exception. You can find that output by visiting port 8080 of the master node over HTTP, e.g. http://mymaster:8080. Feel free to email the Spark users list if you have any questions: https://spark.incubator.apache.org/mailing-lists.html
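
If the stderr output does show an exception, one way to narrow it down is to run the function your job applies to each record over a small local sample, with no cluster involved. Here is a rough sketch in plain Python (no Spark at all). `parse_record`, `find_failures`, and the sample data are all made up for illustration; substitute your own function and a slice of your real input.

```python
import traceback


def parse_record(line):
    # Hypothetical per-record function, standing in for whatever
    # your job maps over its input.
    key, value = line.split(",")
    return key, int(value)


def find_failures(func, records):
    """Apply func to each record and collect (record, traceback) for each failure."""
    failures = []
    for record in records:
        try:
            func(record)
        except Exception:
            # Keep the offending record next to its traceback so the
            # bad input is easy to identify.
            failures.append((record, traceback.format_exc()))
    return failures


# Made-up sample; in practice, use a small slice of the real input.
sample = ["a,1", "b,2", "c,oops", "d"]
failures = find_failures(parse_record, sample)

for record, trace in failures:
    print("failed on %r:\n%s" % (record, trace))
```

Once you know which records break the function, you can fix the bug or handle those records explicitly before rerunning the job.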

It's true that you do have to understand the programming model and some details of how it's implemented to use Spark effectively. However, any abstraction that was "pure" and perfectly non-leaky would necessarily sacrifice some performance and transparency to achieve that goal. Spark aims to be both high-level and high-performance.

Full disclosure: I'm on the Spark team at the UC Berkeley AMPLab.