Software Project

There are two software projects:

  • Python project (group project - teams of two)
  • R project (individual project)

Both projects are performed on the same data, and the basic requirements for both are the same; only the tools differ. Additionally, you will submit:

  • a review of another team's project

Deadlines

Please refer to the important dates.

Choosing data

There are two main routes:

Some tips for your custom assignment project:

An example of an interesting dataset is the Fatality Analysis Report. Another interesting sample dataset, accompanied by an example solution, covers the US presidential elections.

Important note: you can use the same dataset for both assignments! Important note 2: the selected dataset should lead to a classification task (not regression).

Project outline

Once you have a dataset, the goal is to create a Python project that will:

  • perform some sort of preprocessing using the Pandas library
  • build a model
  • perform tuning of metaparameters
  • evaluate the model

Requirements on preprocessing

Any two of the following operations are mandatory:

  • remove rows based on subsetting
  • derive new columns
  • use aggregation operators
  • treat missing values

Python: use pandas
R: it is possible, but not mandatory, to use the tibble package (part of the tidyverse)
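
A minimal pandas sketch covering the mandatory operations (row subsetting and a derived column, plus missing-value treatment and an aggregation for illustration); the file name and column names are hypothetical placeholders for your own dataset:

    import pandas as pd

    df = pd.read_csv("accidents.csv")                  # hypothetical input file, loaded via a relative path

    df = df[df["year"] >= 2015]                        # remove rows based on subsetting
    df["fatal"] = (df["fatalities"] > 0).astype(int)   # derive a new (target) column
    df["age"] = df["age"].fillna(df["age"].median())   # treat missing values
    print(df.groupby("fatal")["age"].mean())           # aggregation operator: mean age per class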

Requirements on model building

Use any classifier. Choose one of the following two options:

  • perform train/test split
  • use cross-validation

Also, evaluate and compare at least two algorithms of different types (e.g. logistic regression and random forest).

Python: use any classifier from sklearn
R: no restriction
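
A minimal sklearn sketch of the train/test split option, comparing two algorithms of different types; the built-in breast cancer dataset is only a placeholder for your own data:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_breast_cancer(return_X_y=True)          # placeholder data; use your own features/target

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y)

    # compare two classifiers of different types, as required
    for clf in (LogisticRegression(max_iter=5000), RandomForestClassifier(random_state=42)):
        clf.fit(X_train, y_train)
        print(type(clf).__name__, clf.score(X_test, y_test))   # accuracy on the held-out test set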

Requirements on metaparameter tuning

If the chosen classifier has any metaparameters that can be tuned, use one of the following methods:

  • try several configurations and describe the best result in the final report
  • perform a grid search or a similar automatic method
  • once you have tuned the metaparameters on a dedicated development (training) set, e.g. with GridSearchCV, you can retrain the model on the complete training data, as described for Python at https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html and https://stackoverflow.com/questions/26962050/confused-with-repect-to-working-of-gridsearchcv

Python recommendation: sklearn.model_selection.GridSearchCV
R recommendation: caret package
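
A minimal GridSearchCV sketch following the recommendation above; the random forest and the parameter grid are illustrative choices, and by default GridSearchCV refits the best configuration on the complete training data:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_breast_cancer(return_X_y=True)          # placeholder data
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

    param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}
    search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
    search.fit(X_train, y_train)                         # grid search with cross-validation on the training set

    print(search.best_params_)                           # describe the best configuration in the report
    print(search.score(X_test, y_test))                  # refit best model, evaluated on the test set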

Requirements on model evaluation

  • report the accuracy on the test set / via cross-validation
  • if you are performing a binary classification task, also include the ROC curve
  • make sure to use a dedicated dataset for evaluation

Python: use model_selection.cross_val_score; plot the ROC curve using sklearn.metrics.roc_curve
R: print the model learned using the caret package; the ROC curve can be plotted using the plotROC package.
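
A minimal sketch of the recommended Python evaluation calls (cross_val_score for accuracy, roc_curve for a binary task); the classifier and the placeholder dataset are illustrative:

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_curve, roc_auc_score

    X, y = load_breast_cancer(return_X_y=True)           # placeholder data
    clf = LogisticRegression(max_iter=5000)

    print(cross_val_score(clf, X, y, cv=5))               # accuracy estimated by cross-validation

    # ROC curve on a dedicated test set (binary classification only)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)
    clf.fit(X_train, y_train)
    scores = clf.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, scores)
    print("AUC:", roc_auc_score(y_test, scores))
    plt.plot(fpr, tpr)
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.show()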

Submission

The first (group project) submission:

The second (individual project) submission:

For both submissions, observe the following guidelines:

  • submit as .zip file, not .rar
  • the archive contains the analyzed data (preferably .csv files)
  • input files will be loaded using relative paths (instead of c:\Documents\file.csv, use just file.csv; this may require setting the working directory using the setwd() command in R)
  • make sure that you correctly specify the casing of input files (for the input .csv file csco.csv, use csco <- read.csv("csco.csv"), not csco <- read.csv("CSCO.csv")); see the sketch after this list
  • check that the submitted file is runnable
  • If you use packages other than those in the standard Anaconda (Python) or R installation, justify the use of each package by a comment - explain why the functionality in the standard packages is not sufficient. Also, put a command installing all these packages at the beginning of the file, e.g. #install.packages(c("precrec", "VIM"))
  • The runtime of the submitted file should be below 2 minutes. If parameter tuning takes longer, change the script so that by default it complies with this limit, e.g. by searching a smaller parameter space.
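
A minimal Python illustration of the relative-path and casing guidelines, reusing the csco.csv example above (the file name is just an example):

    import pandas as pd

    # relative path with the exact casing of the file in the archive,
    # not an absolute path such as c:\Documents\csco.csv
    csco = pd.read_csv("csco.csv")
    print(csco.head())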

Report structure

  • Description of the dataset
  • Exploratory Data Analysis
  • Data preprocessing
  • Modeling
  • Results and Evaluation

The report should contain commented code that will allow for:

  • replication of the analysis, including all data (csv) files
  • explanation of any choices made (such as parameter selection, removal of features, adding new features, etc.)

Review report structure

The review should answer the following questions:

  • Did the authors select the right modeling algorithm?
  • Was metaparameter tuning (grid search) performed where appropriate?
  • Are the results replicable? If you have the same data, does the report describe all steps in sufficient detail to obtain the same results as reported by the authors?
  • Were proper evaluation metrics selected? Are the results correctly interpreted?
  • Are all important steps explained and justified?
  • What is the quality of writing? Is the language clear and concise?

The review report should be submitted as a PDF.

Presentation outline

  • Introduction of the dataset
  • Preprocessing steps
  • Modeling techniques used
  • Lessons learned
  • Response to review feedback

It is recommended that you prepare slides for the presentation. The target length is 5 minutes, plus 5 minutes for discussion.

Midterm

The midterm exam is scheduled after the end of the block on Python.

Final Exam

The final exam will focus on these topics:

  • Python: List comprehension, …
  • R: data frame indexing, …
  • Exploratory data analysis (e.g. understanding histograms, distributions)
  • Model building (e.g. difference between linear and logistic regression)
  • Code understanding (for example, which functionality is provided by which package)

The final exam will take the form of a quiz.

If you do not pass the final exam, it is possible to retake it as per ECTS rules (4+).

Grading

  • 10 points for activity at seminars and lectures
  • 20 points for the Python project (group)
  • 5 points for review of Python project of another team
  • 20 points for the midterm exam
  • 15 points for the individual R project
  • 30 points for the final exam

Students can also earn bonus points for completing additional home assignments. These bonus points can improve your final grade, but you need at least 60 points without the bonus points to pass!

What happens if the course goes on-line

  • We will use MS Teams
  • Some requirements for completing the course may change

Standard ECTS grading scheme applies.