Scala for Machine Learning - Part 2

Apache Spark

Besides Akka, Apache Spark is one of the solutions for parallel computing on several cores within one machine or on several nodes within a cluster of computers.

Apache Spark addresses the limitations of MapReduce in Apache Hadoop. It offers better performance and an in-memory approach.

Note: Apache Hadoop is a platform for distributed computation which includes a distributed file system (HDFS), libraries for MapReduce processing and a tool for managing resources and jobs in a cluster (YARN).

MapReduce

MapReduce is a simple method that distributes data to partitions, performs parallel calculations on the nodes and subsequently aggregates the results into the final output. This approach has two basic operations:

  • Map: It consumes an input and emits key-value pairs. Data are distributed to nodes by their keys.
  • Reduce: It takes a key and the list of all values emitted for that key in the Map step, aggregates these values and emits a new key-value pair.

Finally, the MapReduce process collects all results from the Reduce step by key and produces the final output, as the diagram and the short sketch below illustrate.

|-|           /(k1,v1)\                                     |-|
|I|          /         >(k1,(v1,v2))           (k1,v5)      |O|
|N|         /--(k1,v2)/             \         /       \     |U|
|P|--->|MAP|                         >|REDUCE|         |--->|T|
|U|         \--(k2,v3)\             /         \       /     |P|
|T|          \         >(k2,(v3,v4))           (k2,v6)      |U|
|-|           \(k2,v4)/                                     |T|
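
The two steps can be sketched with plain Scala collections (a hypothetical word count, independent of any MapReduce framework): the map step emits (word, 1) pairs and the reduce step sums the values of each key.

val input = Seq("spark is fast", "scala is fun")

// Map: emit a (word, 1) pair for every word in every line
val mapped = input.flatMap(line => line.split(" ").map(word => (word, 1)))

// Group all emitted values by their key (the word)
val grouped = mapped.groupBy(_._1).map { case (word, pairs) => (word, pairs.map(_._2)) }

// Reduce: aggregate the list of values of each key into a single value
val reduced = grouped.map { case (word, counts) => (word, counts.sum) }
// reduced contains: (spark,1), (is,2), (fast,1), (scala,1), (fun,1)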

Spark solution

Apache Spark improves on this MapReduce approach. It reduces the number of read/write cycles to disk and prefers in-memory processing, therefore it can be 10x - 100x faster than the Hadoop MapReduce model.

Apache Spark is built on the Akka framework in Scala and its architecture has many connections with Scala structures:

  • A Spark resilient distributed dataset (RDD) is very similar to parallel collections in Scala. Once we create a Spark dataset from an input, we can process it in parallel, just as we can process standard Scala collections after converting them to parallel collections (see the sketch after this list).
  • Apache Spark is written in Scala and uses common techniques from functional programming and Scala practice, such as immutability, lambda expressions and anonymous functions, lazy collections and methods without side effects.
  • The operations on a Spark dataset (RDD) are very similar to Scala higher-order methods.
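
A minimal sketch of this similarity, assuming a SparkContext named sc (its creation is shown in the Data loading section below): the same collection can be processed either as a Scala parallel collection or as a distributed Spark RDD.

import scala.collection.parallel.CollectionConverters._ // Scala 2.13 with the scala-parallel-collections module; not needed in Scala 2.12

val numbers = (1 to 1000).toList

// Scala parallel collection: the work is spread over local cores
val localSum = numbers.par.map(_ * 2).reduce(_ + _)

// Spark RDD built from the same collection: the work is spread over partitions of a cluster
val distributedSum = sc.parallelize(numbers).map(_ * 2).reduce(_ + _)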

Instead of Map/Reduce operations, Apache Spark uses transformation and action methods on an RDD object, as the diagram and the example below illustrate.

|-|                                                            |-|
|I|                 |-|                       |-|              |O|
|N|                 |R|                       |R|              |U|
|P|-->|LOAD DATA|-->|D|-->|TRANSFORMATIONS|-->|D|-->|ACTION|-->|T|
|U|                 |D|                       |D|              |P|
|T|                 |-|                       |-|              |U|
|-|                                                            |T|
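
For example, the classic word count follows exactly this flow (a sketch, assuming a SparkContext named sc, created later in the Data loading section, and an input file data.txt): load, transform lazily and finally call an action.

// load data into an RDD (lazy)
val lines = sc.textFile("data.txt")

// transformations: split lines into words, emit (word, 1) pairs, sum the counts per word (still lazy)
val counts = lines.flatMap(_.split(" "))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)

// action: bring the results back to the driver program
val result = counts.collect()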

RDD - Resilient Distributed Dataset

The core element of Spark is the resilient distributed dataset (RDD), a special collection of anything which allows us to partition data across the nodes of a cluster and launch calculations in parallel (it is similar to parallel collections in Scala).

Once we have loaded data into an RDD, we can call the parallel transformation or action operations defined on this class. These operations may be performed on a local computer on several cores, or on a cluster (using Hadoop YARN, Mesos or the Spark standalone cluster).
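
Where the computation runs is selected by the master URL in the Spark configuration. The following sketch shows the standard URL formats (the application name and host names are placeholders):

import org.apache.spark.SparkConf

// run locally on 4 cores
val localConf = new SparkConf().setAppName("example").setMaster("local[4]")

// run on a Spark standalone cluster (placeholder host)
val standaloneConf = new SparkConf().setAppName("example").setMaster("spark://master-host:7077")

// run on Hadoop YARN (typically set via spark-submit --master yarn instead of in code)
val yarnConf = new SparkConf().setAppName("example").setMaster("yarn")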

Data loading

We can load data into an RDD from a local file, a Scala collection, a Hadoop HDFS file or other storage systems (Cassandra, HBase, Amazon S3, etc.).

Creating an RDD object from a data source is a lazy operation. It means the RDD object is prepared, but no data have been loaded into memory. Data are actually loaded into partitions during the main computing process.

import org.apache.spark.{SparkConf, SparkContext}

val configuration = new SparkConf().setAppName("example").setMaster("local[*]")
val sc = new SparkContext(configuration)
/*
textFile creates an RDD from a local text file.
This RDD represents a lazy sequence of all lines in the text file.
*/
val rdd = sc.textFile("data.txt")

Transformations/Actions

On an RDD object, two types of operations are defined: transformations, which create a new RDD from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.

  • Transformation: Conversions, filters and manipulations on partitions
  • Action: Aggregations, collecting and reductions from all partitions

All transformations in Spark are lazy, in that they do not compute their results right away. Any defined transformation is applied only once we call an action method.

Transformation        Action
map(func)             reduce(func)
flatMap(func)         count()
filter(func)          first()
distinct()            take(n)
reduceByKey(func)     foreach(func)

//load data - lazy operation
val lines = sc.textFile("data.txt")
//transform each line string to its length - lazy operation
val lineLengths = lines.map(s => s.length())
//perform the loading and transformations and reduce all lengths by the summation action
//the result is the total length of all lines in the input file
val numberOfChars = lineLengths.reduce((a, b) => a + b)
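
A few more methods from the table above, applied to the lines RDD from the previous example (a sketch):

// transformations: keep non-empty lines and remove duplicates (lazy)
val nonEmpty = lines.filter(_.nonEmpty)
val uniqueLines = nonEmpty.distinct()

// actions: these trigger the actual computation
val totalLines = uniqueLines.count()
val firstLine = uniqueLines.first()
val firstFive = uniqueLines.take(5)
uniqueLines.foreach(println) // on a cluster this prints on the executors, not on the driver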

Other Spark features

In Spark we can also use SQL syntax for querying an RDD. These queries are implicitly converted into RDD transformation and action operations (similar to Hive for Hadoop MapReduce).
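
A minimal sketch of such a query, assuming a SparkSession named spark in addition to the SparkContext sc (the Person schema and the values are only illustrative):

import spark.implicits._

case class Person(name: String, age: Int)

// build a DataFrame with a schema from an RDD of case class instances
val people = sc.parallelize(Seq(Person("Alice", 29), Person("Bob", 31))).toDF()

// register the DataFrame as a temporary view and query it with SQL
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()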

We may also use Apache Spark in Python, R and Java.

Any RDD can be persisted to disk or memory for repeated usage (the persist() or cache() methods).
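
For example (a sketch), caching an RDD that is used by several actions avoids recomputing it from the source:

val words = sc.textFile("data.txt").flatMap(_.split(" "))

// keep the RDD in memory after the first computation
words.cache()

// both actions reuse the cached partitions instead of re-reading the file
val totalWords = words.count()
val distinctWords = words.distinct().count()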

Although we should prefer immutable access and the functional paradigm, in some cases and for better performance we need to operate with variables shared among partitions. Spark offers two types of shared variables - broadcast variables and accumulators.
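
A short sketch of both shared variable types, assuming an existing SparkContext sc:

// broadcast: a read-only value distributed to every node once
val stopWords = sc.broadcast(Set("a", "the", "and"))

// accumulator: a shared counter that tasks can only add to
val stopWordCount = sc.longAccumulator("stop words")

sc.textFile("data.txt")
  .flatMap(_.split(" "))
  .foreach { word =>
    if (stopWords.value.contains(word)) stopWordCount.add(1)
  }

println(stopWordCount.value)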

MLlib

Apache Spark provides an official library of machine learning algorithms adapted for Spark, called MLlib. This library offers the following tools:

  • ML Algorithms: common learning algorithms such as classification, regression, clustering, frequent pattern mining and collaborative filtering
  • Featurization: feature extraction, transformation, dimensionality reduction, and selection
  • Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
  • Persistence: saving and loading algorithms, models, and Pipelines
  • Utilities: linear algebra, statistics, data handling, etc.

With these tools we can easily use machine learning algorithms for big data in Scala, Python, R and Java.
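
As a small illustration of the Pipelines tool (a sketch; the stages, column names and the trainingData DataFrame are only an example), several stages are chained together and fitted as one model:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// trainingData is assumed to be a DataFrame with "text" and "label" columns
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(trainingData)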

An example of association rule mining in Spark:

import org.apache.spark.ml.fpm.FPGrowth

// dataset is assumed to be a DataFrame with an array column named "items"
new FPGrowth().setItemsCol("items")
              .setMinSupport(0.5)
              .setMinConfidence(0.6)
              .fit(dataset)
              .associationRules
              .show()

Resources

  • NICOLAS, Patrick R. Scala for Machine Learning. Packt Publishing Ltd, 2015.
  • MENG, Xiangrui, et al. MLlib: Machine Learning in Apache Spark. The Journal of Machine Learning Research, 2016, 17.1: 1235-1241.
  • https://spark.apache.org