Scala for Machine Learning - Part 1

SBT Scala project in IntelliJ Idea

  1. Download the Scala project directory “scala-apriori-parallel” from github repository
  2. Run IntelliJ Idea
  3. Open the project “scala-apriori-parallel”
  4. Run SBT Console
  5. type: compile

Exercise 1 - Frequent Itemsets Mining in Scala

First we introduce the Apriori algorithm for the frequent itemsets mining task.

  • Input: a database of transactions and a minimal support threshold.
  • Output: a list of itemsets which have a support greater than or equal to the minimal support threshold.

Glossary of terms:

  • Transaction: a set of items
  • Item: any value
  • Database: a sequence of transactions
  • Itemset: any subset of the set of all items which occurs in the database
  • Frequent itemset: an itemset that has support greater than or equal to the minimal support threshold
  • Support of an itemset: the number of transactions in the database that are superset of the particular itemset

Example of a database:

Transaction # Transaction
1 {a, b, c}
2 {a, b, d}
3 {d}
4 {b, c}
5 {c, d}
  • Set of all items: {a, b, c, d}
  • Minimal support: 2
  • Possible itemsets (bold itemsets are frequent):
            {a}                {b}              {c}              {d}
         /   |   \            /   \              |
     {a,b} {a,c} {a,d}     {b,c} {b,d}         {c,d}
     /   \      \             \
{a,b,c} {a,b,d} {a,c,d}      {b,c,d}

Apriori algorithm enumerates this itemset space by the breadth-first method. If some itemset is not frequent then all supersets of this itemset are not also frequent. For this case we needn’t count a support for: {a,b,c} {a,b,d} {a,c,d} {b,c,d} and {a,b,c,d}.

Sequential Apriori algorithm in Scala

In the “scala-apriori-parallel” project there are defined structures for database, transactions, items and itemsets. This program is able to create a database of transactions from the tabular format in CSV.

Input data: KO Bank - a table of clients which have loans in the bank

  • loan_id
  • age
  • salary
  • district
  • amount
  • payments
  • duration
  • rating: quality of repayment

In the NonParallelApriori object there is an implementation of the Apriori algorithm using just one main thread. This solution is not parallel and may take several tens of seconds.

Your task: Try to run the NonParallelApriori process and watch the progress.

  1. Go to the SBT console of the “scala-apriori-parallel” project
  2. Type: run
  3. Choose NonParallelApriori

Exercise 2 - Parallel Frequent Itemsets Mining in Scala

In Scala we can easily parallelize the sequential Apriori process by parallel collections, Future objects or actors.

Parallel Apriori - first solution

If we look at the itemsets state space we can easily divide this space into several isolated branches. Each branch can be processed in parallel within a single thread:

********* BRANCH 1 ******|*** BRANCH 2 ****|*** BRANCH 3 ***|*** BRANCH 4 **
            {a}          |      {b}        |     {c}        |      {d}
         /   |   \       |     /   \       |      |         |
     {a,b} {a,c} {a,d}   |  {b,c} {b,d}    |    {c,d}       |
     /   \      \        |     \           |                |
{a,b,c} {a,b,d} {a,c,d}  |    {b,c,d}      |                |
     \                   |                 |                |
     {a,b,c,d}           |                 |                |

If we have a collection of all items {a, b, c, d} we can divide it by number of processors (or cores) to partitions. Then we process each partition separately and concurently.

val items = Set("a", "b", "c", "d")
//with parallel collections
//with futures
items.grouped(math.ceil(item.size / numberOfProcessors).toInt).foreach { items =>
  Future {

Your task: Try to start parallel solutions of the Apriori algorithm using parallel collections or futures and watch the speed-up:

  1. Go to the SBT console of the “scala-apriori-parallel” project
  2. Type: run
  3. Choose ParallelCollectionApriori
  4. Type: run
  5. Choose FutureApriori

This solution is naive but is able to multiply speed-up the calculation. Try to think about a problem of this aproach. How can we improve it?

Parallel Apriori - second solution with Actors

The main problem of the previous solution is that some partitions have less work than others; therefore several processors do nothing while others are still working.

We can improve previous aproaches by the message-driven solution with Actors. The main idea is: if some separated process has no work, then another active process divides its working space and sends one part to the idle process. In this case we can fully use all processors during the main mining process.

If we have p processors (or cores) we create p actors which are able to accept these types of messages:

  • RegisterWorker(worker): Register an other working actor and save it to the workers collection.
  • Work(data): Obtain data for processing and save them to the queue.
  • ProcessData: Pull the current itemset and item from queue and expand the itemset with the item, count support and add the new itemset to the working space if it is frequent. If the actor has no work, send a request for a new piece of work to an other working actor.
  • AskForWork: Obtain an ask for work from an actor. If the target actor has enough work then it divides work to two parts and sends one part to the asking actor; otherwise it sends the NoWork message.
  • NoWork: Target actor has no work, we can ask another. If all actors have no work we can stop this actor.

This approach is implemented within the ActorApriori object.

Your task: Try to use the ActorApriori solution for frequent itemset mining and watch speed-up against other solutions.

  1. Go to the SBT console of the “scala-apriori-parallel” project
  2. Type: run
  3. Choose ActorApriori

Exercise 3 - Individual Work