Software Project

There are two software projects:

  • Python project (group project, teams of two)
  • R project (individual project)

Both projects are performed on the same data, and the basic requirements for both are the same; only the tools differ. Additionally, you will submit

  • a review of another team's project

Deadlines

Please refer to the important dates.

Choosing data

There are two main routes:

  • use one of the sample datasets suggested below
  • find a custom dataset of your own

An example of an interesting dataset is the Fatality Analysis Reporting System (FARS). Another interesting sample dataset, accompanied by an example solution, concerns the US presidential elections.

Important note: you can use the same dataset for both assignments! Important note 2: the selected dataset should lead to a classification task (not regression).

Project outline

Once you have a dataset, the goal is to create a Python project that will

  • perform some preprocessing using the pandas library
  • build a model
  • tune its metaparameters
  • evaluate the model

Requirements on preprocessing

Any two of the following operations are mandatory:

  • remove rows based on subsetting
  • derive new columns
  • use aggregation operators
  • treat missing values

Python: use pandas
R: it is possible, but not mandatory, to use the tibble package (part of the tidyverse)
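For illustration, a minimal pandas sketch covering three of the operations above; the file name data.csv and the columns age and income are hypothetical:

    import pandas as pd

    df = pd.read_csv("data.csv")  # hypothetical input file

    # remove rows based on subsetting: keep only adult records
    df = df[df["age"] >= 18]

    # treat missing values: fill missing income with the column median
    df["income"] = df["income"].fillna(df["income"].median())

    # use aggregation operators: mean income per age value
    print(df.groupby("age")["income"].mean())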

Requirements on model building

Use any classifier. Choose one of the following two options:

  • perform a train/test split
  • use cross-validation

Also, evaluate and compare at least two algorithms of different types (e.g. logistic regression and random forest).

Python: use any classifier from sklearn
R: no restriction
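For illustration, a minimal sketch of the train/test split option in sklearn, comparing two classifiers of different types; X and y stand for your preprocessed feature matrix and class labels:

    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier

    # X, y: preprocessed features and class labels from the previous step
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42)

    # fit and compare two algorithms of different types
    for model in (LogisticRegression(max_iter=1000), RandomForestClassifier()):
        model.fit(X_train, y_train)
        print(type(model).__name__, model.score(X_test, y_test))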

Requirements on metaparameter tuning

If the chosen classifier has any metaparameters that can be tuned, use one of the following methods:

  • try several configurations and describe the best result in the final report
  • perform grid search or other similar automatic method

Python recommendation: sklearn.model_selection.GridSearchCV
R recommendation: caret package
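For illustration, a sketch of the recommended GridSearchCV route; the grid below is only an example for a random forest, and X_train, y_train come from the split above:

    from sklearn.model_selection import GridSearchCV
    from sklearn.ensemble import RandomForestClassifier

    # keep the grid small so the script stays within the runtime limit
    param_grid = {"n_estimators": [50, 100], "max_depth": [5, 10, None]}

    search = GridSearchCV(RandomForestClassifier(random_state=42),
                          param_grid, cv=5)
    search.fit(X_train, y_train)
    print(search.best_params_, search.best_score_)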

Requirements on model evaluation

  • report the accuracy on the test set or from cross-validation
  • if you are performing a binary classification task, also include the ROC curve

Python: use model_selection.cross_val_score; plot the ROC curve using sklearn.metrics.roc_curve
R: print the model learned using the caret package; the ROC curve can be plotted using the plotROC package
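For illustration, a Python sketch of both evaluation steps, assuming a binary task and a classifier model plus the X, y and test split from the previous steps:

    from sklearn.model_selection import cross_val_score
    from sklearn.metrics import roc_curve
    import matplotlib.pyplot as plt

    # accuracy estimated by 5-fold cross-validation
    print(cross_val_score(model, X, y, cv=5).mean())

    # ROC curve from predicted probabilities of the positive class
    scores = model.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, scores)
    plt.plot(fpr, tpr)
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.show()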

Submission

The first submission is the group (Python) project; the second submission is the individual (R) project. For both submissions, observe the following guidelines:

  • submit as .zip file, not .rar
  • the archive contains the analyzed data (preferably .csv files)
  • input files are loaded using relative paths (instead of c:\Documents\file.csv use just file.csv; in R this may require setting the working directory with the setwd() command)
  • make sure that you specify the casing of input file names correctly (for an input file csco.csv, use csco <- read.csv("csco.csv"), not csco <- read.csv("CSCO.csv"))
  • check that the submitted file is runnable
  • if you use packages beyond the standard Anaconda (Python) or R installation, justify the use of each package in a comment - explain why the functionality in standard packages is not sufficient. Also, put a command installing all these packages at the beginning of the file, e.g. # install.packages(c("precrec", "VIM"))
  • the runtime of the submitted file should be below 2 minutes. If parameter tuning takes longer, change the script so that by default it complies with this limit, e.g. by searching a smaller parameter space
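For illustration, a Python script header following these guidelines might look like this (the package name in the install comment is hypothetical):

    # pip install some_package   (hypothetical; justify each non-standard package in a comment)
    import pandas as pd

    # input loaded with a relative path, casing matching the actual file name
    csco = pd.read_csv("csco.csv")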

Report structure

  • Description of the dataset
  • Exploratory Data Analysis
  • Data preprocessing
  • Modeling
  • Results and Evaluation

The report should contain commented code that will allow for

  • replication of the analysis, including all data (.csv) files
  • explanation of any choices made (such as parameter selection, removal of features, or adding new features)

Review report structure

The review should answer the following questions:

  • Did the authors select the right modeling algorithm?
  • Was metaparameter tuning (grid search) performed where appropriate?
  • Are the results replicable? If you have the same data, does the report describe all steps in sufficient detail to obtain the same results as reported by the authors?
  • Were proper evaluation metrics selected? Are the results correctly interpreted?
  • Are all important steps explained and justified?
  • What is the quality of writing? Is the language clear and concise?

The review report should be submitted as a PDF.

Presentation outline

  • Introduction of the dataset
  • Preprocessing steps
  • Modeling techniques used
  • Lessons learned
  • Response to review feedback

You are encouraged to prepare slides for the presentation. The target length is 5 minutes, plus 5 minutes for discussion.

Final Exam

The final exam will focus on these topics:

  • Python: List comprehension, …
  • R: data frame indexing, …
  • Exploratory data analysis (e.g. understanding histograms, distributions)
  • Model building (e.g. difference between linear and logistic regression)
  • Code understanding (for example, which functionality is provided by which package)

The final exam will take the form of a quiz.

If you do not pass the final exam, it is possible to retake it.

Grading

  • 20 points for the Python project (group)
  • 5 points for review of Python project of another team
  • 25 points for the individual R project
  • 50 points for the final exam

Minimum requirements

  • At least 25 points from the projects (points for the review are not counted)
  • At least 25 points from the final exam

Standard ECTS grading scheme applies.