The rapidly increasing amount of data generated each year makes extracting useful information from that data ever more important. Companies use a variety of data analysis techniques to answer business questions and power their decision-making. This data is often stored in data warehouses and databases, and it has to be extracted, pre-processed, and analyzed before it can be modeled with statistical techniques.
The goal of this course is to provide the necessary technical expertise for extracting and exploring data stored in databases as well as building and evaluating common statistical models. Specifically, the course covers four broad topics: (1) Extracting data from databases using SQL and performing exploratory data analysis; (2) Building supervised statistical models that learn a predictive function from data; (3) Evaluating the results of the models; and (4) Working with non-tabular data and unsupervised models. The main statistical methods covered include shrinkage methods (L1 and L2 regularization), maximum margin classifiers, tree-based bagging and boosting algorithms, and clustering methods. Students are also equipped with the tools to evaluate models using cross-validation, estimate uncertainty with bootstrapping, apply common classification metrics (e.g., F1-score, ROC-AUC), and extract feature importance, among other techniques.
The course focuses heavily on predictive modeling in Python. Foundational knowledge of statistics, mathematical analysis, and Python programming is assumed.
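As a minimal sketch of the kind of workflow covered, not taken from the course materials, the snippet below fits an L2-regularized classifier and evaluates it with cross-validation and ROC-AUC; it assumes scikit-learn is available and uses a synthetic dataset in place of data extracted from a database.

```python
# Illustrative sketch only: an L2-regularized logistic regression evaluated
# with 5-fold cross-validation and ROC-AUC, assuming scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic tabular data standing in for data extracted from a database.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Logistic regression with an L2 (ridge) penalty; C is the inverse
# regularization strength.
model = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)

# 5-fold cross-validation scored with ROC-AUC.
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(f"Mean ROC-AUC across folds: {scores.mean():.3f}")
```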