Data Mining Methods for Business Analytics (IT101)

Program code:
Teaching language:
English

Course description

The amount of data generated each year is increasing rapidly, which makes extracting useful information from that data ever more important. Companies use a range of data techniques to answer business questions and power their decision-making. This data is often stored in data warehouses and databases, and it has to be extracted, pre-processed, and analyzed before it can be modeled using statistical techniques.

Course goals

The goal of this course is to provide the technical expertise needed to extract and explore data stored in databases and to build and evaluate common statistical models. Specifically, the course covers four broad topics: (1) extracting data from databases using SQL and performing exploratory data analysis; (2) building data-based supervised statistical models to find a predictive function; (3) evaluating the results of the models; and (4) working with non-tabular data and unsupervised models. The main statistical methods covered include shrinkage methods (L1 and L2 regularization), maximum margin classifiers, tree-based bagging and boosting algorithms, and clustering methods. Students are also equipped with the tools to evaluate models using cross-validation, estimate uncertainty using bootstrapping, apply common classification evaluation metrics (e.g. F1-Score, ROC-AUC), and extract feature importance, among others.

The course focuses heavily on predictive modeling using Python. Foundational knowledge of statistics, mathematical analysis, and Python programming is assumed.

Course results

  • Understand the key elements of relational databases and data storage. Extract data from databases by writing SQL queries.
  • Perform exploratory data analysis on tabular data using relevant Python packages. Visualize the relationship between variables.
  • Understand the differences between supervised and unsupervised models, the bias-variance trade-off, and train-test splits. Critically evaluate which models are best suited for specific tasks.
  • Understand and apply linear regression-based shrinkage methods for variable selection and for addressing high-dimensional data problems.
  • Understand and apply multiple classification models for modeling binary and non-binary response variables. Understand the concepts of decision boundaries and maximum margin classifiers.
  • Understand and apply tree-based methods for both classification and regression problems. Understand the differences between bagging and boosting algorithms.
  • Evaluate models for regression and classification problems. Understand the different choices of metrics available for imbalanced data. Extract key features from statistical models.
  • Write clear, reproducible, and well-documented code in Python using the Jupyter Notebook environment. Be able to use the most relevant Python packages for data wrangling.