Requirements for completing the course
Software Project
The software project has two deliverables:
- Python project (group project, in teams of two)
- a review of another team's project
Deadlines
Please refer to the important dates
Choosing data
IMPORTANT: The dataset you choose should lead to a classification task (i.e. the target variable is nominal). In other words, the target variable cannot be numerical (unless the numbers correspond to a small number of categories).
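To sanity-check a candidate dataset, one can count the distinct values of the intended target column with pandas. The toy data and column names below are invented for illustration:

```python
import pandas as pd

# Toy stand-in for a candidate dataset; "label" is a hypothetical target column.
df = pd.DataFrame({"label": ["spam", "ham", "spam", "ham", "spam"],
                   "length": [120, 40, 95, 33, 87]})

# A nominal target has few distinct values; many distinct numeric values
# would indicate a regression task instead.
n_classes = df["label"].nunique()
is_classification_target = n_classes <= 20
```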
Generic datasets
- UCI Machine Learning Repository
- OpenML
- Kaggle data mining competitions (kaggle.com)
- Papers With Code
- United Nations Data Catalog
- Data.gov
- WorldData.AI
- OpenEI
- Portál otevřených dat (Czech Open Data Portal)
You can also propose any other dataset to work on.
Datasets specific to the Czech Republic and current challenges:
- Traffic accidents in the Czech Republic - may require extra steps to obtain machine-readable data
- Open Data exports from the Information System of Public Tenders of the Czech Republic
- Twitter Fake News
- CORD-19 - a large dataset with about 500,000 research articles on COVID-19 (suitable for text mining)
- Problems from guest lecturers
Open data used in theses supervised at KIZI FIS VSE:
- Machine learning as a service - predicting public procurement outcomes: Vestník verejného obstarávania (Public Procurement Journal), data available via data.gov.sk
- Software for consolidating information about legal entities from open sources - Insolvenční rejstřík (Insolvency Register), Obchodní rejstřík (Commercial Register), Administrativní registr ekonomických subjektů (ARES)
- Evaluation of comprehensibility for text clustering - the Russian Troll Tweets disinformation dataset
- Rule extraction from knowledge graphs - KG-COVID-19, a large dataset of biomedical entities and their interactions
Linked open data
- Wikidata - for example, "Explore the Czech Republic using Wikidata: 10 best queries of a data journalist" (in Czech)
Proprietary data sets used in theses supervised at VSE:
- Predicting company insolvency using data science methods - Albertina, a database of Czech companies
Project outline
Once you have a dataset, the goal is to create a Python project that will
- perform some sort of preprocessing using the pandas library
- build a model
- tune hyperparameters
- evaluate the model
Requirements on preprocessing
Any two of the following operations are mandatory:
- remove rows based on subsetting
- derive new columns
- use aggregation operators
- treat missing values
Python: use pandas
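The four operations above can be sketched with pandas as follows. The toy data and column names are invented for illustration; any two of the operations suffice for the project:

```python
import pandas as pd

# Invented example data; replace with your own dataset.
df = pd.DataFrame({
    "age": [25, 41, 35, None, 52],
    "income": [30000, 52000, None, 41000, 60000],
    "city": ["Prague", "Brno", "Prague", "Ostrava", "Brno"],
    "target": ["yes", "no", "yes", "no", "no"],
})

# 1) remove rows based on subsetting: keep rows with age below 50
df = df[df["age"] < 50].copy()

# 2) treat missing values: fill numeric gaps with the column median
df["income"] = df["income"].fillna(df["income"].median())

# 3) derive a new column from an existing one
df["income_k"] = df["income"] / 1000

# 4) aggregation: mean income per city (e.g. for EDA or a derived feature)
income_per_city = df.groupby("city")["income"].mean()
```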
Requirements on model building
Use any classifier. Choose one of the following two options:
- perform a train/test split
- use cross-validation
Also evaluate and compare at least two algorithms of different types (e.g. logistic regression and random forest).
Python: use any classifier from sklearn
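A minimal sketch of such a comparison, using the built-in Iris data as a stand-in for your own dataset:

```python
# Compare two classifiers of different types on a held-out test set.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

scores = {}
for name, clf in [("logreg", LogisticRegression(max_iter=1000)),
                  ("rf", RandomForestClassifier(random_state=42))]:
    clf.fit(X_train, y_train)
    scores[name] = clf.score(X_test, y_test)  # accuracy on the test set
```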
Requirements on hyperparameter tuning
If the chosen classifier has hyperparameters that can be tuned, use one of the following methods:
- try several configurations and describe the best result in the final report
- perform a grid search or a similar automatic method
- once you have tuned the hyperparameters on a dedicated development (training) set, e.g. with GridSearchCV, you can retrain the model on the complete training data
Python recommendation: sklearn.model_selection.GridSearchCV
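A sketch of grid search with GridSearchCV; the parameter grid below is illustrative only, not a recommendation. Note that with the default refit=True, the best configuration is automatically retrained on all the data passed to fit:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Small illustrative grid; expand for a real project.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)  # refit=True (default) retrains on all of X with best params

best_params = search.best_params_
best_score = search.best_score_  # mean cross-validated accuracy of best config
```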
Requirements on model evaluation
- report the accuracy on the test set / in cross-validation
- if you are performing a binary classification task, show the ROC curve
- make sure to use a dedicated dataset for evaluation
Python: use sklearn.model_selection.cross_val_score; plot the ROC curve using sklearn.metrics.roc_curve
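A sketch combining both evaluation tools, with the built-in breast-cancer data standing in for a real binary task:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)
clf = LogisticRegression(max_iter=5000)

# Accuracy estimated by 5-fold cross-validation.
cv_acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()

# ROC curve needs predicted probabilities on a dedicated test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
clf.fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, probs)
roc_auc = auc(fpr, tpr)
# In a notebook: plt.plot(fpr, tpr) draws the curve.
```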
Avoid common mistakes
- feature selection performed on train and test sets together
- pre-processing performed on train and test sets together
- temporal leakage (training on data from the future)
- non-independence between train and test sets
- illegitimate features (features that would not be available at prediction time)
See also
https://reproducible.cs.princeton.edu/
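One way to avoid the first two mistakes is to wrap all pre-processing in an sklearn Pipeline, so that scaling and feature selection are fitted on the training folds only. A sketch, using the built-in breast-cancer data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scaling and feature selection are re-fitted inside each CV training fold,
# so no information from the validation fold leaks into pre-processing.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])
acc = cross_val_score(pipe, X, y, cv=5, scoring="accuracy").mean()
```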
Submission
The submission consists of
- IPython (Jupyter) notebook (.ipynb)
- IPython (Jupyter) notebook as html (.html)
- Data (.csv)
If the data don't fit the size limits, provide a download link instead.
Observe the following guidelines:
- submit as a .zip file, not .rar
- input files must be loaded using relative paths (instead of c:\Documents\file.csv use just file.csv; this may require setting the working directory, e.g. using the setwd() command in R)
- check that the submitted file runs without errors after restarting the Python kernel
- if you use packages beyond the standard Anaconda (Python) or R installation, justify the use of each package in a comment - explain why the functionality in standard packages is not sufficient. Also, put the commands installing all these packages at the beginning of the file
- the runtime of the submitted file should be roughly below 5 minutes. If parameter tuning takes longer, change the script so that by default it complies with this limit, e.g. by searching a smaller parameter space
Report structure
- Description of the dataset
- Exploratory Data Analysis
- Data preprocessing
- Modeling
- Results and Evaluation
The report should contain commented code that allows for
- replication of the analysis, including all data (csv) files
- explanation of any choices made (such as parameter selection, removal of features, or addition of new features)
Review report structure
The review should answer the following questions:
- Did the authors select the right modeling algorithm?
- Was hyperparameter tuning (grid search) performed where appropriate?
- Are the results replicable? If you have the same data, does the report describe all steps in sufficient detail to obtain the same results as reported by the authors?
- Were proper evaluation metrics selected? Are the results correctly interpreted?
- Are all important steps explained and justified?
- What is the quality of writing? Is the language clear and concise?
The review report should be submitted as a pdf.
Presentation outline
- Introduction of the dataset
- Preprocessing steps
- Modeling techniques used
- Lessons learned
- Response to review feedback
We recommend preparing slides for the presentation. The presentation should take up to 10 minutes (instructions given in class take priority).
Midterm
The midterm exam will cover topics covered in the first half of the semester, including a mix of theory and coding questions.
Final Exam
The final exam covers mainly these topics:
- Python basics: List comprehension, …
- Exploratory data analysis (e.g. understanding histograms, distributions)
- Model building
- Code understanding (for example, which functionality is provided by which package)
- Basics of other data science languages and frameworks covered in the lectures (e.g., the R language, big data processing with Spark, etc.)
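As a quick refresher on the list-comprehension topic:

```python
# List comprehension: squares of the even numbers from 0-9.
squares_of_evens = [n * n for n in range(10) if n % 2 == 0]
# Equivalent to a for-loop with .append(), but more concise.
```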
The final exam will have the form of a quiz.
If you do not pass the final exam, it is possible to retake it as per ECTS rules (4+).
Grading
- 10 points for activity (mostly minitests)
- 20 points for the Python project
- 5 points for review of Python project of another team
- 5 points for presentation
- 15 points for homework (coding quizzes, elearning)
- 15 points for the midterm exam
- 30 points for the final exam
Students can also earn bonus points for completing additional assignments.
What happens if the course goes online
- We will use MS Teams
- Some requirements for completing the course may change
Standard ECTS grading scheme applies.