Software Project

There is one software project:

  • Python project (group project - teams of two)

Additionally, you will submit

  • a review of another team's project

Deadlines

Please refer to the important dates

Choosing data

IMPORTANT: The dataset you choose should lead to a classification task (target variable is nominal). In other words, the target variable cannot be numerical (unless the numbers correspond to a small number of categories).

Generic datasets

You can also propose any other dataset to work on.

Datasets specific to the Czech Republic and current challenges:

Open data used in theses supervised at KIZI FIS VSE:

Linked open data

Proprietary data sets used in theses supervised at VSE:

Project outline

Once you have a dataset, the goal is to create a Python project that will

  • perform some sort of preprocessing using the Pandas library
  • build a model
  • tune the metaparameters
  • evaluate the model

Requirements on preprocessing

Any two of the following operations are mandatory:

  • remove rows based on subsetting
  • derive new columns
  • use aggregation operators
  • treat missing values

Python: use pandas
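
The four operations listed above can be sketched with pandas as follows. This is only an illustration on made-up data: the column names (age, income, city, label) and the filtering rule are assumptions, not part of the assignment. Any two of the four steps would satisfy the requirement.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; in the project, load your own file instead,
# e.g. df = pd.read_csv("file.csv")
df = pd.DataFrame({
    "age": [25, 40, np.nan, 33, 51, 17],
    "income": [30000, 52000, 41000, np.nan, 75000, 0],
    "city": ["Prague", "Brno", "Prague", "Brno", "Prague", "Brno"],
    "label": ["no", "yes", "no", "yes", "yes", "no"],
})

# 1. Remove rows based on subsetting: keep adults (rows with unknown age stay)
df = df[(df["age"] >= 18) | df["age"].isna()].copy()

# 2. Treat missing values: fill numeric gaps with the column median.
#    In a real project, compute such statistics on the training split only.
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].median())

# 3. Derive a new column from existing ones
df["income_per_year_of_age"] = df["income"] / df["age"]

# 4. Use an aggregation operator: mean income per city, attached as a feature
df["city_mean_income"] = df.groupby("city")["income"].transform("mean")

print(df)
```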

Requirements on model building

Use any classifier. Choose one of the following two options:

  • perform train/test split
  • use cross-validation

Also, evaluate and compare at least two algorithms of different types (e.g. logistic regression and random forest).

Python: use any classifier from sklearn
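
A minimal sketch of the train/test split option, comparing two algorithms of different types. The dataset (scikit-learn's built-in breast cancer data), the split ratio, and the model settings are placeholders chosen for illustration; your project will use its own data and choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder dataset with a binary (nominal) target
X, y = load_breast_cancer(return_X_y=True)

# Hold out 25 % of the rows; stratify keeps the class ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Two classifiers of different types: a linear model and a tree ensemble
models = {
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)                         # train on training part only
    test_accuracy[name] = model.score(X_test, y_test)   # accuracy on held-out part
    print(f"{name}: {test_accuracy[name]:.3f}")
```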

Requirements on metaparameter tuning

If the chosen classifier has any metaparameters that can be tuned, use one of the following methods:

  • try several configurations and describe the best result in the final report
  • perform grid search or another similar automatic method

Once you have tuned the metaparameters on a dedicated development (training) set, e.g. with GridSearchCV, you can retrain the model on the complete training data.

Python recommendation: sklearn.model_selection.GridSearchCV
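
A short sketch of automatic tuning with GridSearchCV. The classifier, the parameter grid, and the dataset are illustrative assumptions. With the default refit=True, GridSearchCV retrains the best configuration on the complete training data after the search, so the test set is touched only once at the end.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Placeholder dataset; the test part is kept aside during tuning
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Small illustrative grid (4 configurations); values are assumptions
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                 # 3-fold cross-validation inside the training data
    scoring="accuracy",
)
search.fit(X_train, y_train)  # refit=True: best model is refit on all of X_train

print("best parameters:", search.best_params_)
print("cross-validated accuracy:", round(search.best_score_, 3))

# Final check on data that played no role in the tuning
test_acc = search.score(X_test, y_test)
print("test accuracy:", round(test_acc, 3))
```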

Requirements on model evaluation

  • report the accuracy on the test set or from cross-validation
  • if you are performing a binary classification task, show the ROC curve
  • make sure to use a dedicated dataset for evaluation

Python: use sklearn.model_selection.cross_val_score; plot the ROC curve using sklearn.metrics.roc_curve
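
The sketch below illustrates both evaluation steps under assumed settings (built-in breast cancer data, logistic regression, 5 folds). The helper plot_roc is a hypothetical name introduced here; it imports matplotlib only when called.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # placeholder binary task
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Accuracy estimated by 5-fold cross-validation
cv_scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

# ROC curve computed on a dedicated test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)
model.fit(X_train, y_train)
y_score = model.predict_proba(X_test)[:, 1]   # probability of the positive class
fpr, tpr, thresholds = roc_curve(y_test, y_score)
auc_value = roc_auc_score(y_test, y_score)


def plot_roc(fpr, tpr, auc_value):
    """Hypothetical helper: draw the ROC curve with matplotlib."""
    import matplotlib.pyplot as plt

    plt.plot(fpr, tpr, label=f"AUC = {auc_value:.3f}")
    plt.plot([0, 1], [0, 1], linestyle="--", label="chance")
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.legend()
    plt.show()
```

In the notebook, calling plot_roc(fpr, tpr, auc_value) displays the curve.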

Avoid common mistakes

  • Feature selection performed on the train and test sets together
  • Pre-processing performed on the train and test sets together
  • Temporal leakage
  • Non-independence between train and test sets
  • Illegitimate features

See also
https://reproducible.cs.princeton.edu/
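
One possible way to avoid the first two mistakes listed above is to wrap pre-processing and feature selection in a scikit-learn Pipeline, so that both are fitted only on the training part of each split. The steps and the value k=10 are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # placeholder dataset

# Scaling and feature selection are part of the model, not a separate
# step applied to the whole dataset beforehand.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# In every fold, the scaler and the selector are fitted on the training
# rows only and then applied to the held-out rows.
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"leak-free CV accuracy: {scores.mean():.3f}")
```

Temporal leakage and illegitimate features cannot be fixed by a Pipeline; they need to be ruled out when choosing the split and the input columns.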

Submission

The submission consists of

  • IPython (Jupyter) notebook (.ipynb)
  • IPython (Jupyter) notebook as html (.html)
  • Data (.csv)

If the data don’t fit the size limits, provide a download link instead.

Observe the following guidelines:

  • submit as .zip file, not .rar
  • load input files using relative paths (use just file.csv instead of c:\Documents\file.csv; in R, this may require setting the working directory with the setwd() command)
  • check that the submitted file runs without errors after restarting the Python kernel
  • If you use packages that are not part of the standard Anaconda (Python) or R installation, justify the use of each package in a comment: explain why the functionality in the standard packages is not sufficient. Also, put the command that installs all these packages at the beginning of the file
  • The runtime of the submitted file should be roughly below 5 minutes. If parameter tuning takes longer, change the script so that by default it complies with this limit, e.g. by searching a smaller parameter space.
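
One way to keep the default runtime short is a switch between a quick and a full parameter grid. The flag name FULL_SEARCH and both grids are hypothetical; ParameterGrid is used here only to count how many model fits each setting would trigger.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical switch: leave False in the submitted notebook
FULL_SEARCH = False

# Grid used during development (may run for a long time)
full_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 5, 10],
}

# Reduced grid, e.g. around the best values found with the full grid
quick_grid = {
    "n_estimators": [100],
    "max_depth": [5, None],
}

param_grid = full_grid if FULL_SEARCH else quick_grid

cv = 5  # number of cross-validation folds passed to GridSearchCV
n_fits = len(ParameterGrid(param_grid)) * cv
print(f"{len(ParameterGrid(param_grid))} configurations, {n_fits} model fits")
```

The selected param_grid can then be passed to GridSearchCV; the report can state the results obtained with the full grid.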

Report structure

  • Description of the dataset
  • Exploratory Data Analysis
  • Data preprocessing
  • Modeling
  • Results and Evaluation

The report should contain commented code that will

  • allow for replication of the analysis, including all data (csv) files
  • explain any choices made (such as parameter selection, removal of features, adding new features, etc.)

Review report structure

The review should answer the following questions:

  • Did the authors select the right modeling algorithm?
  • Was metaparameter tuning (grid search) performed where appropriate?
  • Are the results replicable? If you have the same data, does the report describe all steps in sufficient detail to obtain the same results as reported by the authors?
  • Were proper evaluation metrics selected? Are the results correctly interpreted?
  • Are all important steps explained and justified?
  • What is the quality of writing? Is the language clear and concise?

The review report should be submitted as a pdf.

Presentation outline

  • Introduction of the dataset
  • Preprocessing steps
  • Modeling techniques used
  • Lessons learned
  • Response to review feedback

We recommend preparing slides for the presentation. Plan for a presentation of up to 10 minutes (instructions given in class take priority).

Midterm

The midterm exam will cover topics from the first half of the semester, with a mix of theory and coding questions.

Final Exam

The final exam covers mainly these topics:

  • Python basics: List comprehension, …
  • Exploratory data analysis (e.g. understanding histograms, distributions)
  • Model building
  • Code understanding (for example, which functionality is provided by which package)
  • Basics of other data science languages and frameworks covered in the lectures (e.g., R language, big data processing with Spark, etc.)

The final exam will take the form of a quiz.

If you do not pass the final exam, it is possible to retake it as per ECTS rules (4+).

Grading

  • 10 points for activity (mostly minitests)
  • 20 points for the Python project
  • 5 points for the review of another team's Python project
  • 5 points for presentation
  • 15 points for homework (coding quizzes, elearning)
  • 15 points for the midterm exam
  • 30 points for the final exam

Students can also earn bonus points for completing additional assignments.

What happens if the course goes on-line

  • We will use MS Teams
  • Some requirements for completing the course may change

The standard ECTS grading scheme applies.