Categories
Code camp Coding Distributed Systems First programming job Getting Started in Coding JAVA LinkedIn reddit Scala Software Engineering Spark Uncategorized Web

Why Spark can’t foldLeft: Monoids and Associativity.

cover (3)
Apache Spark is the the elephant in a room full of data processing engines, yet Spark does not supply a foldLeft() or foldRight() method on its RDD class. Strange right? Such a fundamental collection method. How could it be forgotten? Or, was this not an accident?

null

scoreAverageByPlayer(), which would take an RDD, and return an RDD of tuples of each player with their average score. Note: foldLeft() is not an available method on the scores RDD class.

Remembering associativity

Lets dig deeper, because the realization of the answer is more useful than the answer itself.

Back to Algebra class we go! Associativity is one of the many algebraic properties defining functional mathematics and therefore functional programming. Without its understanding we cannot truly appreciate parallelism in computing and its limitations.

Mathematical Associativity:
"When the order in which the operations are performed does not matter as long as the sequence of the operands is not changed. That is, rearranging the parentheses in such an expression will not change its value."
The following expressions are associative:

null

Even though the parentheses were rearranged in the equation for res2, the values of res1 and res2 remained equivalent. It can then be said that the act of addition of real numbers is an associative operation.

How Spark achieves Parallelism

In order for Spark to become a leader in computational speed, it needed to incorporate operational parallelism. Parallelism will ultimately be the reason foldLeft is not found on the RDD class.

Clustering

At a high level, Spark clusters computational “worker” nodes or machines, partitions the data to be computed on in the master, distributes the data partitions from the master to the worker nodes where the computations are done on each node’s respective shard of data, then aggregates the resulting dataset(s) on the master node.

Parallelization

You can force Spark to parallelize computation on an RDD by using parallelize() on a SparkContext.

val scores = Array(68, 71, 73)
val parScores = sc.parallelize(scores)

Below is a function f being applied to an input dataset concurrently on a spark cluster. This can be thought of as a map transformation.

Parallelization Visualized

Spark-Clustering

Parallelizing reduce() in Spark

Let’s look at how spark parallelizes the reduce operation on an RDD.

reduce() from the Spark Documentation

Action Meaning
reduce(func) Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.

“The function should be commutative and associative so that it can be computed correctly in parallel.”

Signature of reduce()

def reduce[A](op: (A, A) => A): A

This reads:

Execute the function “op” on each element (type A), with the result of the previous op computation (accumulator of type A) and respective element (type A) as inputs, eventually returning the resulting accumulator value from the last iteration (type A).

Spark’s reduce() in action

Now let’s say we have a set of “score” integers and want to determine the lowest score. We can execute a reduce action on the RDD with a monoid findMin() (more on monoids later) as an operational parameter to solve this.

In code we would solve this like:

val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
def findMin(first: Int, second: Int): Int = first.min(second)

val min = distData.reduce(findMin)

This would evaluate in Spark as:

Spark-Reduce

Monoids

We know what parallelism looks like in Spark, but why can’t we use foldLeft()? This will come together, but we need to understand Monoids first.

The laws surrounding Monoids are tightly coupled to associativity and state a Monoid operation:

  • Is of some type A
  • Consists of an operation, op, taking two values of type A, combining them into a single: op(op(x,y), z) == op(x, op(y,z)), where the type of x, y and z is A
  • Has an identity for the operation that maintains: op(x, zero) == x and op(zero, x) == x for any x of type A

Balanced Folds

reduce() can be categorized as a balanced fold, or a fold that allows for parallelism. Compare the following for the Sequence (a, b, c, d).

foldLeft() with the operation op would look like:

op(op(op(a, b), c), d)

While a balancedFold looks like:

op(op(a, b), op(c, d))

Can you see how this operation could be parallelized with a fork-join data structure?

Thinking of reduce() as a Balanced Fold

If we look back at the reduce of the findMin operation, its operational execution looks like:

Parallelizing reduce - Visualized

This is a balanced fold! Which can be written as either:

List(
  List(66,72).reduce(findMin),
  List(81,68).reduce(findMin)
).reduce(<strong>findMin</strong>)

or:

(66,72,81,68).reduce(findMin)

foldLeft()/foldRight()

foldLeft/right are methods made available on many monadic collections. However, let’s focus on the List collection which provides the following signature and implementation of foldLeft():

def foldLeft[B](z: B)(op: (B, A) => B): B
  if (this.isEmpty) z
  else op(head, tail.foldRight(z)(op))

This reads:

foldLeft is of type B, takes an initial element z, performs the operation op on each element in the Traversable object, returning a Type A in each accumulator iteration, and eventually returns a type B

foldLeft()’s operation is NOT a Monoid

We know that foldLeft’s predicate operation has a non-Monoidal signature, as it breaks all three monoid laws, but why again is this not a transformation supported by Spark? Simply put, it’s because foldLeft is not sufficiently parallelizable!

Looking back at the third Monoid law, it states the following must be true:

op(op(x,y), z) == op(x, op(y,z))

This law is what drives the ability to parallelize. Spark can fork a monoidal operation across a dataset into n number of operations and join the resulting values within the master. This fork-join parallelization results in a best-case decrease in execution time by a factor of n.

foldLeft() cannot be Parallelized

If we pretend foldLeft was a transformation available on Spark collections, and visually walked through its execution, maybe we could more easily understand it’s limitations.

Given the following collection and transformation lets see it in action…

val nums = List(2.2, 3.3, 4.4)
nums.foldLeft(1)((agg, next) => (agg * next).toInt)

Parallelizing foldLeft - Visualized

We can see why foldLeft() was never implemented within Spark, as the fork-join execution model would still result in serial blocking of computation!