There are two software projects:
- Python project (group project, in teams of two)
- R project (individual project)
Both projects are performed on the same data, and the basic requirements for both software projects are the same; only the tools differ. Additionally, you will submit:
- a review of another team's project
Please refer to the important dates.
There are two main routes:
- standard assignment: you will be assigned a dataset from the UCI Machine Learning Repository or OpenML to work on.
- custom assignment: propose any dataset (even a proprietary one) to work on.
Some tips for your custom assignment project:
- kaggle.com data mining competitions
- data.world data repository
- Problems from guest lecturers
- Wikidata, for example: Explore the Czech Republic using Wikidata: 10 best queries of a data journalist (in Czech)
Important note: you can use the same dataset for both assignments! Important note 2: the selected dataset should lead to a classification task (not regression).
Once you have a dataset, the goal is to create a Python project that will:
- perform some sort of preprocessing using the Pandas library
- build a model
- tune metaparameters
- evaluate the model
Requirements on preprocessing
Any two of the following operations are mandatory:
- remove rows based on subsetting
- derive new columns
- use aggregation operators
- treat missing values
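As an illustration, these operations might look like the following sketch in pandas; the DataFrame and its column names are purely made up, standing in for your assigned dataset:

```python
import numpy as np
import pandas as pd

# Made-up toy data standing in for your assigned dataset
df = pd.DataFrame({
    "age":    [25, 40, np.nan, 31, 58],
    "income": [30000, 52000, 41000, np.nan, 75000],
    "city":   ["Prague", "Brno", "Prague", "Brno", "Prague"],
})

# treat missing values: impute before subsetting, so rows with NaN
# are not silently dropped by the comparison below
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].mean())

# remove rows based on subsetting
df = df[df["age"] < 60]

# derive a new column
df["income_per_age"] = df["income"] / df["age"]

# use aggregation operators
summary = df.groupby("city")["income"].agg(["mean", "count"])
print(summary)
```

Only two of the four operations are mandatory; note that the order matters here, since a NaN in `age` would make `df["age"] < 60` evaluate to False and drop the row before it could be imputed.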
R: it is possible, but not mandatory, to use the tibble package (part of the tidyverse).
Requirements on model building
Use any classifier. Choose one of the following two options:
- perform train/test split
- use cross-validation
Also, evaluate and compare at least two algorithms of different types (e.g. logistic regression and random forest).
Python: use any classifier from the scikit-learn library.
R: no restriction
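In Python, both options and the two-algorithm comparison might be sketched as follows with scikit-learn; the built-in breast cancer dataset is only a stand-in for your assigned data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # stand-in for your own dataset

# Option 1: a single train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Two algorithms of different types, as required
models = {
    "logistic regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.3f}")

# Option 2: 5-fold cross-validation on the full data
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: CV accuracy = {scores.mean():.3f}")
```

Either option alone satisfies the requirement; showing both here simply illustrates the difference. Scaling matters for logistic regression but not for random forests, which is why only the former is wrapped in a pipeline.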
Requirements on metaparameter tuning
If the chosen classifier has any metaparameters that can be tuned, use one of the following methods:
- try several configurations and describe the best result in the final report
- perform grid search or other similar automatic method
- once you have tuned the metaparameters on a dedicated development (training) set, e.g. with GridSearchCV, you can retrain the model on the complete training data, as described for Python here: https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html and https://stackoverflow.com/questions/26962050/confused-with-repect-to-working-of-gridsearchcv
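A minimal grid-search sketch in scikit-learn (the dataset and the parameter grid are placeholders; the grid is kept deliberately small so tuning stays fast):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)  # stand-in for your own dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Small illustrative grid; expand it for your own metaparameters
param_grid = {"max_depth": [3, 5, None], "n_estimators": [50, 100]}

# refit=True (the default) retrains the best configuration on all of X_train
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3)
search.fit(X_train, y_train)
print("best params:", search.best_params_)
print("test accuracy:", round(search.score(X_test, y_test), 3))
```

Because the search uses only the training portion, the final score on `X_test` remains an unbiased estimate of performance.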
Requirements on model evaluation
- report the accuracy on the test set / in cross-validation
- if you are performing a binary classification task, also include the ROC curve
- make sure to use a dedicated dataset for evaluation
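For the binary case, the evaluation might look like this sketch in scikit-learn (the dataset is again only a placeholder):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # stand-in for your own dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

clf = make_pipeline(StandardScaler(),
                    LogisticRegression(max_iter=1000)).fit(X_train, y_train)

# Accuracy on the dedicated test set
y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print("accuracy:", round(acc, 3))

# The ROC curve needs probability scores, not hard labels
y_score = clf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_score)
print("AUC:", round(roc_auc_score(y_test, y_score), 3))
# plot with e.g. matplotlib: plt.plot(fpr, tpr)
```

Note that `roc_curve` takes the positive-class probabilities from `predict_proba`, never the 0/1 predictions.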
The first (group project) submission:
- IPython (Jupyter) notebook with the Python solution. The notebook will have the extension .ipynb. Example of an IPython project: Baseball - Part 1 + Part 2. Another, more complex example: School Budget.
The second (individual project) submission:
- R Markdown report. The report will have the extension .Rmd. Example of an R Markdown report on the Titanic dataset; another example on breast cancer data.
- an HTML file created by 'knitting' the R Markdown.
For both submissions, observe the following guidelines:
- submit as .zip file, not .rar
- the archive contains the analyzed data (preferably .csv files)
- input files will be loaded using relative paths such as file.csv, not absolute paths (this may require setting the working directory using the setwd() command in R)
- make sure that you correctly specify the casing of input files (for an input .csv file, csco <- read.csv("csco.csv"), not csco <- read.csv("CSCO.csv"))
- check that the submitted file is runnable
- If you use packages other than those in the standard Anaconda (Python) or R installation, justify the use of each package with a comment - explain why the functionality in the standard packages is not sufficient. Also, put the commands installing all these packages at the beginning of the file (e.g. install.packages() in R).
- The runtime of the submitted file should be below 2 minutes. If parameter tuning takes longer, change the script so that by default it complies with this limit, e.g. by searching a smaller parameter space.
The report should contain the following sections:
- Description of the dataset
- Exploratory Data Analysis
- Data preprocessing
- Results and Evaluation
The report should contain commented code that will allow for:
- replication of the analysis, **including all data (csv) files**
- explanation of any choices made (such as parameter selection, removal of features, adding new features, etc.)
Review report structure
The review should answer the following questions:
- Did the authors select the right modeling algorithm?
- Was metaparameter tuning (grid search) performed where appropriate?
- Are the results replicable? If you have the same data, does the report describe all steps in sufficient detail to obtain the same results as reported by the authors?
- Were proper evaluation metrics selected? Are the results correctly interpreted?
- Are all important steps explained and justified?
- What is the quality of writing? Is the language clear and concise?
The review report should be submitted as a PDF file.
The presentation should cover:
- Introduction of the dataset
- Preprocessing steps
- Modeling techniques used
- Lessons learned
- Response to review feedback
It is recommended that you prepare slides for the presentation. The target length is 5 minutes, plus 5 minutes for discussion.
The final exam will focus on these topics:
- Python: List comprehension, …
- R: data frame indexing, …
- Exploratory data analysis (e.g. understanding histograms, distributions)
- Model building (e.g. difference between linear and logistic regression)
- Code understanding (for example, which functionality is provided by which package)
The final exam will take the form of a quiz.
If you do not pass the final exam, you can retake it as per ECTS rules (4+).
Points are allocated as follows:
- 20 points for the Python project (group)
- 5 points for review of Python project of another team
- 25 points for the individual R project
- 50 points for the final exam
To pass the course, you need:
- At least 25 points from the projects (points for the review are not counted)
- At least 25 points from the final exam
Standard ECTS grading scheme applies.