There are two software projects
- Python project (group project - teams of two)
- R project (individual project)
Both projects use the same data and have the same basic requirements; only the tools differ. Additionally, you will submit
- a review of another team's project
Please refer to the important dates
There are two main routes:
- standard assignment: You will be assigned a dataset from the UCI Machine Learning Repository or OpenML to work on.
- custom assignment: Propose any dataset (even a proprietary one) to work on.
Some tips for your custom assignment project:
- kaggle.com Data Mining Competitions
- Data World Data repository.
- Problems from guest lecturers
- Wikidata, for example: Explore Czech Republic using Wikidata: 10 best Queries of data journalist (in Czech)
Datasets specific to the Czech Republic and current challenges:
- Traffic accidents in the Czech Republic - may require extra steps to obtain machine-readable data
- Open Data exports from the Information System of Public Tenders of the Czech Republic
- Twitter Fake News
- CORD-19: large dataset with about 500,000 research articles on COVID-19 (suitable for text mining)
Open data used in theses supervised at KIZI FIS VSE:
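- Strojové účenie ako služba - predikovanie výsledkov verejných obstarávaní (Machine learning as a service: predicting the results of public procurement) - Vestník verejného obstarávania (Public Procurement Bulletin), data available via data.gov.sk
- Software pro konsolidaci informací o právních osobách z otevřených zdrojů (Software for consolidating information on legal entities from open sources) - Insolvenční rejstřík (Insolvency Register), Obchodní rejstřík (Commercial Register), Administrativní registr obchodních subjektů (ARES).
- Evaluace srozumitelnosti pro shlukování textu (Evaluating comprehensibility for text clustering) - disinformation dataset Russian Troll Tweets
- Extrakce pravidel ze znalostních grafů (Rule extraction from knowledge graphs) - large dataset with biomedical entities and their interactions, KGCovid-19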
Proprietary data sets used in theses supervised at VSE:
- Predikce insolvence podniku s využitím metod datové vědy (Predicting company insolvency using data science methods) - database of Czech companies Albetina
- spaCy - natural language processing framework
For processing text in Czech
- UDPipe - trainable pipeline for tokenization, tagging, lemmatization and dependency parsing
- UDPipe integration into spaCy
- Optical Character Recognition (OCR): Tesseract; for a use case report, cf. the thesis Software pro konsolidaci informací o právních osobách z otevřených zdrojů.
Important note: you can use the same dataset for both assignments!
Important note 2: the selected dataset should lead to a classification task (not regression).
Once you have a dataset, the goal is to create a Python project that will
- perform some preprocessing using the pandas library
- build a model
- tune the model's metaparameters
- evaluate the model
Requirements on preprocessing
Any two of the following operations are mandatory:
- remove rows based on subsetting
- derive new columns
- use aggregation operators
- treat missing values
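For the Python project, the four operations listed above could look roughly as follows. This is only a sketch: the toy data frame and its column names (`age`, `income`, `city`, `label`) are made up for illustration, and the median imputation is computed on the whole toy frame for brevity, whereas in the actual project such statistics should be derived from the training data only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are invented for this example.
df = pd.DataFrame({
    "age": [25, 17, 40, np.nan, 33],
    "income": [30000, 0, 52000, 41000, np.nan],
    "city": ["Prague", "Brno", "Prague", "Brno", "Prague"],
    "label": [1, 0, 1, 0, 1],
})

# Treat missing values: fill numeric gaps with the column median.
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].median())

# Remove rows based on subsetting: keep adults only.
adults = df[df["age"] >= 18].copy()

# Derive a new column from existing ones.
adults["income_per_year_of_age"] = adults["income"] / adults["age"]

# Use aggregation operators: mean income and number of rows per city.
summary = adults.groupby("city").agg(
    mean_income=("income", "mean"),
    n_rows=("label", "size"),
)
print(summary)
```

Any two of these steps would satisfy the requirement; the sketch shows all four only so that each has an example.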
R: it is possible, but not mandatory, to use the
tibble package (part of
Requirements on model building
Use any classifier. Choose one of the following two options:
- perform train/test split
- use cross-validation
Also, evaluate and compare at least two algorithms of different types (e.g. logistic regression and random forest).
Python: use any classifier from
R: no restriction
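For the Python project, the two evaluation options and the comparison of two algorithm types could be sketched as below. The sketch assumes scikit-learn and uses its bundled breast cancer dataset purely as a stand-in for your own data; the split ratio, number of folds and model settings are illustrative choices, not course requirements.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in binary classification data (an assumption for this sketch).
X, y = load_breast_cancer(return_X_y=True)

# Two algorithms of different types.
models = {
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Option 1: a single stratified train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)
    print(f"{name}: test accuracy = {test_accuracy[name]:.3f}")

# Option 2: 5-fold cross-validation (each fold refits a fresh clone of the model).
cv_accuracy = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}
for name, acc in cv_accuracy.items():
    print(f"{name}: mean CV accuracy = {acc:.3f}")
```

Only one of the two options is required; both are shown here so that either can be copied as a starting point.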
Requirements on metaparameter tuning
If the chosen classifier has any metaparameters that can be tuned, use one of the following methods:
- try several configurations and describe the best result in the final report
- perform grid search or another similar automatic method
- once you have tuned the metaparameters on a dedicated development (training) set, e.g. with GridSearchCV, you can retrain the model on the complete training data, as described for Python here: https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html and https://stackoverflow.com/questions/26962050/confused-with-repect-to-working-of-gridsearchcv
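The automatic variant could be sketched as follows, assuming scikit-learn. The dataset, the classifier and the parameter grid are illustrative assumptions; the grid is deliberately tiny so that the search runs quickly. With `refit=True` (the default), `GridSearchCV` retrains the best configuration on the complete training data after the search finishes.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Stand-in data (an assumption for this sketch).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Illustrative grid: 2 x 2 = 4 configurations.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                # tuning uses cross-validation inside the training set
    scoring="accuracy",
    refit=True,          # retrain the best configuration on all training data
)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("mean CV accuracy of best configuration:", round(search.best_score_, 3))

# The test set is touched only once, after tuning is finished.
test_acc = search.score(X_test, y_test)
print("test accuracy:", round(test_acc, 3))
```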
Requirements on model evaluation
- report the accuracy on the test set or from cross-validation
- if you are performing a binary classification task, also include the ROC curve
- make sure to use a dedicated dataset for evaluation
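For a binary task in Python, accuracy and the ROC curve could be computed as in the sketch below. It assumes scikit-learn; the dataset and the classifier are placeholders. The ROC curve is built from predicted probabilities of the positive class, not from hard class predictions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in binary classification data (an assumption for this sketch).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

# Accuracy uses hard predictions on the dedicated test set.
accuracy = accuracy_score(y_test, clf.predict(X_test))

# The ROC curve uses the predicted probability of the positive class.
y_score = clf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_score)
auc = roc_auc_score(y_test, y_score)

print(f"accuracy = {accuracy:.3f}, AUC = {auc:.3f}")
```

The arrays `fpr` and `tpr` can then be drawn with any plotting library, for example with matplotlib's `plt.plot(fpr, tpr)`.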
Avoid common mistakes
- Feature selection on train and test set
- Pre-processing on train and test sets together
- Temporal leakage
- Non-independence between train and test sets
- Illegitimate features
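One way to avoid the first two mistakes in Python is to wrap preprocessing and feature selection in a pipeline, so that they are fitted only on the training part of each split. The following is a sketch under the assumption that scikit-learn is used; the dataset, the scaler, the selector and `k=10` are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (an assumption for this sketch).
X, y = load_breast_cancer(return_X_y=True)

# Scaling and feature selection are steps of the model itself, so in every
# cross-validation fold they are fitted on the training folds only and merely
# applied to the held-out fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipeline, X, y, cv=5)
print("fold accuracies:", scores.round(3))
print("mean accuracy:", round(scores.mean(), 3))
```

Calling `StandardScaler().fit(X)` or `SelectKBest(...).fit(X, y)` on the full data before splitting would instead let information from the test rows influence the model.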
For more details refer to
The first (group project) submission:
- IPython (Jupyter) notebook with the Python solution. The notebook will have the extension .ipynb. Example of an IPython project: Baseball - Part 1 + Part 2. Another complex example: School Budget.
The second (individual project) submission:
- R Markdown report. The report will have the extension .Rmd. Example of an R Markdown report on the Titanic dataset; another example on breast cancer data.
- HTML file that is created by ‘knitting’ the R Markdown.
For both submissions, observe the following guidelines:
- submit as a .zip file, not .rar
- the archive contains the analyzed data (preferably .csv files)
- input files will be loaded using relative paths (use just file.csv instead of an absolute path; this may require setting the working directory using the setwd() command in R)
- make sure that you correctly specify the casing of input file names (for an input file named csco.csv, use csco <- read.csv("csco.csv"), not csco <- read.csv("CSCO.csv"))
- check that the submitted file is runnable
- If you use packages other than those in the standard Anaconda (Python) or R installation, justify the use of each package in a comment: explain why the functionality in standard packages is not sufficient. Also, put the command installing all these packages at the beginning of the file, e.g.
- The runtime of the submitted file should be below 2 minutes. If parameter tuning takes longer, change the script so that by default it complies with this limit, e.g. by searching a smaller parameter space.
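One possible way to respect the runtime limit in the Python notebook is a switch that selects a reduced parameter grid by default. The sketch below is only an illustration: the `FAST_MODE` flag, both grids, the iris dataset and the decision tree are assumptions, not part of the assignment.

```python
import time

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.tree import DecisionTreeClassifier

# Hypothetical switch: True in the submitted version, False for the full search.
FAST_MODE = True

# Illustrative grids; the full one documents what was explored offline.
full_grid = {"max_depth": [2, 3, 5, 8, None], "min_samples_leaf": [1, 2, 5, 10]}
small_grid = {"max_depth": [3, None], "min_samples_leaf": [1]}
param_grid = small_grid if FAST_MODE else full_grid

# Stand-in data (an assumption for this sketch).
X, y = load_iris(return_X_y=True)

# Measure how long the tuning step takes.
start = time.perf_counter()
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
elapsed = time.perf_counter() - start

n_configs = len(ParameterGrid(param_grid))
print(f"{n_configs} configurations tried in {elapsed:.1f} s")
```

The final report can still describe the best result found with the full grid, as long as the submitted default stays within the limit.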
- Description of the dataset
- Exploratory Data Analysis
- Data preprocessing
- Results and Evaluation
The report should contain commented code that will
- allow for replication of the analysis, including all data (csv) files
- explain any choices made (such as parameter selection, removal of features, adding new features, etc.)
Review report structure
The review should answer the following questions:
- Did the authors select the right modeling algorithm?
- Was metaparameter tuning (grid search) performed where appropriate?
- Are the results replicable? If you have the same data, does the report describe all steps in sufficient detail to obtain the same results as reported by the authors?
- Were proper evaluation metrics selected? Are the results correctly interpreted?
- Are all important steps explained and justified?
- What is the quality of writing? Is the language clear and concise?
The review report should be submitted as a pdf.
- Introduction of the dataset
- Preprocessing steps
- Modeling techniques used
- Lessons learned
- Response to review feedback
We recommend preparing slides for the presentation. The target length is 5 minutes, plus 5 minutes for discussion.
The midterm exam is scheduled after the end of the block on Python.
The final exam will focus on these topics:
- Python: List comprehension, …
- R: data frame indexing, …
- Exploratory data analysis (e.g. understanding histograms, distributions)
- Model building (e.g. difference between linear and logistic regression)
- Code understanding (for example, which functionality is provided by which package)
The final exam will take the form of a quiz.
If you do not pass the final exam, it is possible to retake it as per ECTS rules (4+).
- 10 points for activity (mostly minitests)
- 20 points for the Python project (group)
- 5 points for the review of another team's Python project
- 20 points for the midterm exam
- 15 points for the individual R project
- 30 points for the final exam
Students can also earn bonus points for completing additional assignments. These bonus points can improve your final grade, but you need at least 60 points without the bonus points to pass!
What happens if the course goes on-line
- We will use MS Teams
- Some requirements for completing the course may change
Standard ECTS grading scheme applies.