Software Project

There is one software project:

Python project (group project - teams of two)

Additionally, you will submit

a review of another team's project

Deadlines

Choosing data

IMPORTANT: The dataset you choose should lead to a classification task, i.e. the target variable is nominal. In other words, the target variable cannot be numerical unless the numbers correspond to a small number of categories.

Generic datasets

You can also propose any other dataset to work on.

Project outline

Once you have a dataset, the goal is to create a Python project that will

perform some preprocessing using the pandas library
build a model
tune the hyperparameters
evaluate the model

Requirements on preprocessing

Any two of the following operations are mandatory:

remove rows based on subsetting
derive new columns
use aggregation operators
treat missing values

Python: use pandas
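
The four preprocessing operations listed above could look roughly like the following sketch. The toy data and all column names (age, income, city, label) are invented for illustration and are not part of the assignment; only two of the four operations are mandatory.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative assumptions.
df = pd.DataFrame({
    "age": [25, 41, np.nan, 17, 33],
    "income": [30000, 52000, 41000, 0, 47000],
    "city": ["Prague", "Brno", "Prague", "Brno", "Prague"],
    "label": ["yes", "no", "no", "yes", "no"],
})

# Treat missing values: fill missing ages with the median age.
df["age"] = df["age"].fillna(df["age"].median())

# Remove rows based on subsetting: keep adults only.
adults = df[df["age"] >= 18].copy()

# Derive a new column from existing ones.
adults["income_per_age"] = adults["income"] / adults["age"]

# Use an aggregation operator: mean income per city.
mean_income = adults.groupby("city")["income"].mean()
print(mean_income)
```

In a real project, statistics used for imputation (such as the median here) should be computed on the training data only, in line with the "Avoid common mistakes" section.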

Requirements on model building

Use any classifier. Choose one of the following two options:

perform train/test split
use cross-validation

Also, evaluate and compare at least two algorithms of different types (e.g. logistic regression and random forest).

Python: use any classifier from sklearn
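
As a minimal sketch of the train/test split option, the example below compares two classifiers of different types. The breast cancer dataset bundled with scikit-learn stands in for your own data, and the split ratio and model settings are arbitrary assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder dataset; replace with your own features X and target y.
X, y = load_breast_cancer(return_X_y=True)

# Hold out 25% of the rows for testing; stratify keeps class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Two algorithms of different types: a linear model and a tree ensemble.
models = {
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                 # fit on training data only
    scores[name] = model.score(X_test, y_test)  # accuracy on held-out data
    print(f"{name}: test accuracy = {scores[name]:.3f}")
```

The scaler is placed inside the pipeline so that it is fitted on the training portion only.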

Requirements on hyperparameter tuning

If the chosen classifier has any hyperparameters that can be tuned, use one of the following methods:

try several configurations and describe the best result in the final report
perform grid search or other similar automatic method
once you have tuned the hyperparameters on a dedicated development (training) set, e.g. with GridSearchCV, you can retrain the model on the complete training data

Python recommendation: sklearn.model_selection.GridSearchCV
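
A grid search along these lines might be sketched as follows. The dataset, the classifier, and the parameter grid are placeholder assumptions chosen to keep the example fast. With its default refit=True, GridSearchCV retrains the best configuration on the complete training data after the search.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Placeholder dataset; replace with your own features X and target y.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Illustrative grid: 2 x 2 = 4 configurations.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

# Each configuration is scored by 3-fold cross-validation on the training set.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_params = search.best_params_
# The test set is used once, after tuning has finished.
test_accuracy = search.score(X_test, y_test)
print(best_params, f"test accuracy = {test_accuracy:.3f}")
```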

Requirements on model evaluation

report the accuracy on the test set or from cross-validation
if you are performing a binary classification task, show the ROC curve
make sure to use a dedicated dataset for evaluation

Python: use sklearn.model_selection.cross_val_score; plot the ROC curve using sklearn.metrics.roc_curve
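
The sketch below shows one possible way to combine these two functions for a binary task. The dataset and the classifier are placeholders, and plot_roc is a hypothetical helper defined here, not a scikit-learn function; it assumes matplotlib is installed.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder binary dataset; replace with your own features X and target y.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Cross-validated accuracy, computed on the training portion only.
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

# ROC curve on the dedicated test set, using predicted probabilities
# of the positive class rather than hard class labels.
model.fit(X_train, y_train)
y_score = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_score)
roc_auc = auc(fpr, tpr)


def plot_roc(fpr, tpr, roc_auc):
    """Hypothetical helper: draw the ROC curve with matplotlib."""
    import matplotlib.pyplot as plt

    plt.plot(fpr, tpr, label=f"AUC = {roc_auc:.3f}")
    plt.plot([0, 1], [0, 1], linestyle="--", label="chance")
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.legend()
    plt.show()
```

Calling plot_roc(fpr, tpr, roc_auc) in a notebook cell draws the curve.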

Avoid common mistakes

Feature selection on train and test set
Pre-processing on train and test sets together
Temporal leakage
Non-independence between train and test sets
Illegitimate features

See also
https://reproducible.cs.princeton.edu/
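
One common way to avoid the first two mistakes is to wrap preprocessing and feature selection in a scikit-learn Pipeline, so that they are refitted on the training part of every cross-validation fold. The following is a sketch under that assumption; the dataset, the selector, and k=10 are illustrative choices, and a pipeline does not by itself protect against temporal leakage or illegitimate features.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder dataset; replace with your own features X and target y.
X, y = load_breast_cancer(return_X_y=True)

# Scaling and feature selection live inside the pipeline, so within each
# fold they see only that fold's training rows, never the held-out rows.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])

leak_free_scores = cross_val_score(pipeline, X, y, cv=5)
print(f"CV accuracy without leakage: {leak_free_scores.mean():.3f}")
```

By contrast, calling SelectKBest or StandardScaler on the full dataset before splitting lets information from the test rows influence the model.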

Submission

The submission consists of

IPython (Jupyter) notebook (.ipynb)
IPython (Jupyter) notebook as html (.html)
Data (.csv)

If the data exceed the size limits, provide a download link instead.

Observe the following guidelines:

submit as a .zip file, not .rar
load input files using relative paths (use just file.csv instead of c:\Documents\file.csv; in R this may require setting the working directory with the setwd() command)
check that the submitted file runs without errors after restarting the Python kernel
If you use packages beyond the standard Anaconda (Python) or R installation, justify each package in a comment: explain why the functionality in the standard packages is not sufficient. Also, put the command that installs all these packages at the beginning of the file.
The runtime of the submitted file should be roughly below 5 minutes. If parameter tuning takes longer, change the script so that it complies with this limit by default, e.g. by searching a smaller parameter space.
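
One possible way to respect the runtime limit is a switch that selects a reduced parameter grid by default. The flag name FULL_SEARCH and both grids below are hypothetical; they only illustrate the idea.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical flag: leave False in the submitted notebook so that the
# default run stays within the time limit.
FULL_SEARCH = False

# Illustrative grids for a random forest; the values are assumptions.
full_grid = {
    "n_estimators": [50, 100, 200, 500],
    "max_depth": [3, 5, 10, None],
}
quick_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
}

param_grid = full_grid if FULL_SEARCH else quick_grid

# ParameterGrid enumerates every combination, which gives a quick estimate
# of how many models the search will have to fit per fold.
n_configurations = len(ParameterGrid(param_grid))
print(f"Configurations to evaluate: {n_configurations}")
```

The selected param_grid can then be passed to GridSearchCV, and the report can describe the results obtained with the full grid.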

Report structure

Description of the dataset
Exploratory Data Analysis
Data preprocessing
Modeling
Results and Evaluation

The report should contain commented code that will

allow replication of the analysis, including all data (csv) files
explain any choices made (such as parameter selection, removal of features, adding new features, etc.)

Review report structure

The review should answer the following questions:

Did the authors select the right modeling algorithm?
Was hyperparameter tuning (grid search) performed where appropriate?
Are the results replicable? If you have the same data, does the report describe all steps in sufficient detail to obtain the same results as reported by the authors?
Were proper evaluation metrics selected? Are the results correctly interpreted?
Are all important steps explained and justified?
What is the quality of writing? Is the language clear and concise?

The review report should be submitted as a pdf.

Presentation outline

Introduction of the dataset
Preprocessing steps
Modeling techniques used
Lessons learned
Response to review feedback

We recommend preparing slides for the presentation. The presentation should take up to 10 minutes (instructions given in class take priority).

Midterm

The midterm exam will cover topics from the first half of the semester, with a mix of theory and coding questions.

Final Exam

The final exam covers mainly these topics:

Python basics: List comprehension, …
Exploratory data analysis (e.g. understanding histograms, distributions)
Model building
Code understanding (for example, which functionality is provided by which package)
Basics of other data science languages and frameworks covered in the lectures (e.g. the R language, big data processing with Spark, etc.)

The final exam will take the form of a quiz.

If you do not pass the final exam, it is possible to retake it as per ECTS rules (4+).

Grading

10 points for activity (mostly minitests)
20 points for the Python project
5 points for review of Python project of another team
5 points for presentation
15 points for homework (coding quizzes, elearning)
15 points for the midterm exam
30 points for the final exam - minimum

Students can also earn bonus points for completing additional assignments.

What happens if the course goes online

We will use MS Teams
Some requirements for completing the course may change

Standard ECTS grading scheme applies.