Scala for Machine Learning - Part 1

About Scala

Scala is language, which integrates object-oriented and functional programming. It is designed to express common programming patterns in a concise, elegant, and type-safe way.

Scala is compatible with Java (Java libraries and frameworks can be used without glue code or additional declarations).

Scala is great for concurrency and distributed usages. It offers lots of powerful libraries for parallel computing, such as Akka.io or Apache Spark.

Scala is good choice for big data processing and data mining and machine learning on clusters. Spark applications written in Scala may also be run in Hadoop.

Why is Scala good for concurrency?

  • Pure functions without side effects
  • Immutability

Pure functions

Scala is designed for functional programming and allows us to write a program mostly from pure functions; in other words, functions that have no side effects.

A function has a side effect if it does something other than simply return a result, for example:

  • Modifying a variable
  • Modifying a data structure in place
  • Setting a field on an object
  • Throwing an exception or halting with an error
  • Printing to the console or reading user input
  • Reading from or writing to a file
  • Drawing on the screen

In general, when writing a function (or method), your goal should be to write it as a pure function!

//pure function
def addOne(number: Int) = number + 1

//impure function
var i = 0
def addOne() = i = i + 1

Best practice in Scala: separate the code to two parts. The first one is the functional part which consists only of pure functions (the core of application logic and main calculations). The second part is on top of the functional base and handles the user interface, printing, database interactions, and other methods that have “side effects”.

Immutability

Pure functions never mutate the state of an input, therefore we can not change any field or variable within any object in our functional application. This fact tends to make immutable objects.

Immutable object is any data structure, which can never be changed as well as its inner state. So how can we change an object or variable if we want to update some field? Simply, create a new one with the changed field. In FP we should prefer immutable objects!

In Scala we can use both pure and impure function, mutable and immutable objects. But general advice is:

Prefer vals, immutable objects, and methods without side effects. Reach for them first. Use vars, mutable objects, and methods with side effects when you have a specific need and justification for them.

Example in Scala:

var a = 0 //define a variable with zero which is mutable
a = a + 1 //this changes a variable by adding one

val a = 0 //define a value with zero which is immutable (same as final in Java)
a = a + 1 //error!!!
//function setName has side effect
//class User is not immutable, because its inner state can be changed by setName function
class User {
  private var name = ""
  def setName(newName: String) = name = newName
}

//function withName returns a new object of the User class with a new name (it is a pure function)
//User class is immutable - it can not be changed
class User(val name: String) {
  def withName(newName: String) = new User(newName)
}

Benefits of Scala

If our Scala code uses pure functions and immutable objects it minimizes the need for locks in multi-processor programming - any thread can not change any immutable object (no dealing with locks and synchronization). This shift in approach leads to keep the thread-safety and make transformative code that has fewer defects and is easier to maintain.

These two things, pure functions and immutability, are one of the main reasons why Scala is so popular for multithreading and distributed computing.

Other reasons to use Scala for common programming:

  • Scala is a type-safe JVM language (compatible with Java)
  • Modular mixin-composition for classes - multiple inheritance via traits
  • High level type system with variance annotations, compound types, lower and upper bounds for types
  • Usage of inner and anonymous functions and classes (lambda expression)
  • Implicit conversions
  • Pattern matching
  • Fast implementation

Parallel computing in Scala

Scala has implemented several well known paradigms for paralel computing. The first one is a parallel collection:

Parallel collections

If we have created some scala collection we can easily transform it to the parallel variant and do computing operations in parallel across all core units. The parallel collection has implemented several common methods for data processing, such as: map, flatMap, reduce, fold, filter, etc.

//first we transform a list to the parallel collection
//then we multiply all values of the collection with two - in parallel
//finally we sum all values - in parallel
//result is 30
List(1, 2, 3, 4, 5).par.map(x => x * 2).sum()

Future

The Future data structure allows us to simply write an asynchronous block of code in a separated thread.

This structure has several methods for the asynchronous transformation and calling of reactive callback functions, such as onComplete, onSuccess and onFailure. Callback functions have side effects, but sometimes are suitable for asynchronous message forwarding and high throughput, especially in Actor systems.

//this operation is non-blocking and creates a new thread where the async block of code is performed
val futureResult = Future {
 (0 to 1000000).map(x => x * 2).sum()
}
...
//do something here while the previous block is still in progress!
...
//we can map a result of Future object (the map method is also running asynchronously)
//after completion of the asynchronous block, the callback function onSuccess is fired!
//all of these operations are also non-blocking
futureResult.map(x => x + 10).onSuccess(x => println(x))
//this is a blocking operation which is waiting for a result of the asynchronous block
//if the result is not computed within 5 seconds, TimeOut exception will be thrown
Await.result(futureResult, 5 seconds)

Actor

Sometimes we need to keep a state among threads and be able to change behaviour of some objects when we obtain some special instruction from outside (http requests, interface events etc). But in Scala we prefer immutable and pure functions which do not allow us to keep states and react to user inputs. For this purpose there was designed a thread-safe structure, called Actor, which is able to keep/change states and behaviour and react to user requests.

Actor is an object which only can obtain messages and optionally respond to them. Based on input messages we can change states inside an actor, forwarding messages and answer to other actors or start/terminate them. We have certainty that messages are not processed concurrently - we can serve only one message at a time, other messages are queued. Therefore we may safety change a variable inside an actor when we process a message, because messages go in series and everything is thread-safe.

In a nutshell: The mutable state is encapsulated in Actor, that guarantees the exclusive access to a variable only by one thread which reacts to queued incoming messages. Changes of states are safe without any threads collisions and without dealing with locks and synchronized blocks.

class CounterActor extends Actor {
  //this variable causes mutability of this class, but it is an Actor that guarantees the exclusive access
  var i = 0
  //this actor can receive several text messages
  def receive = {
    case "addOne"  i = i + 1 //this message increases the variable by one
    case "printResult"  println(i) //this message only prints the current state of the variable
    case "getResult"  sender() ! i //this message responds to an ask request with the current variable
    case "stop"  context stop self //this message stops this actor/thread
    case _  println("received unknown message") //for any other incoming message we print this message
  }
}

//create actor system
val system = ActorSystem("mySystem")
//create and start the actor
val myActor = system.actorOf(Props[CounterActor])

//all these operations are non-blocking
//send "addOne" messages to the actor
myActor ! "addOne"
myActor ! "addOne"
myActor ! "addOne"
//send the "printResult" message to the actor
myActor ! "printResult"
//ask message "getResult" returns a Future object
//the asynchronous callback "onSuccess" function prints result as soon as it is available
val futureResult = myActor ? "getResult"
futureResult.onSuccess(x => println(x))
//we send an uknown message
myActor ! "unknown"
//we send a stop message
myActor ! "stop"
//during processing of this message the actor is stopped and can not be handled
//the message is forwarded to the DeadLetter mailbox
myActor ! "printResult"

Console output:

> 3
> 3
> received unknown message
> DeadLetter log

Actor structure is implemented in Akka.io which is a toolkit for building highly concurrent, distributed, and resilient message-driven applications for Java and Scala. It also offers streaming data processing, computing in a cluster and Actor HTTP facade for creating web services.

Resources

  • NICOLAS, Patrick R. Scala for machine learning. Packt Publishing Ltd, 2015.
  • CHIUSANO, Paul; BJARNASON, Rúnar. Functional programming in Scala. Manning, 2015.
  • https://docs.scala-lang.org/overviews/
  • https://akka.io/