Foundations of Data Science using R
by Luis Torgo
Location: Faculty of Computer Science, Dalhousie University
Past Editions: Fall 2019, Fall 2020
Next Edition: Fall 2021
Course Web Page: contact Luis Torgo
Course Description
The Foundations of Data Science using R course provides an introduction to the main steps of a typical data science project, using the R language and environment as the illustration tool.
The course addresses the theoretical foundations of several key methods and techniques used in data science projects, from data import and manipulation, through data visualization and model building, to reporting and deployment.
The course is aimed at students without a strong technical background in programming or computer science tools, so it can be taken by CS students as well as students from other faculties. The theoretical concepts are motivated and introduced through practical case studies with the help of the R environment.
Learning Outcomes
- Understand and explain the main steps in a data science project.
- Choose and apply appropriate data manipulation solutions.
- Describe key concepts of data summarization and visualization and how to apply them in R.
- Understand and describe the basic foundations of the predictive models studied in the course.
- Understand and describe the basic foundations of several key descriptive modeling methods, such as clustering, outlier detection and association rules.
- Describe the problem of model evaluation and selection and be able to apply appropriate methods to select a model for a given data set.
- Perform reporting and deployment using R Markdown and Shiny web apps.
Topics
- Overview of the data science process
  - The CRISP-DM model
  - Types of data and data sources
  - Types of tasks and models
- Data manipulation
  - Importing and cleaning data
  - Handling missing values
  - Data transformation and variable creation
- Exploratory data analysis
  - Data summarization and data visualization
- Predictive modeling
  - Classification and regression tasks
  - Evaluation metrics
  - Linear models: linear discriminant and linear regression
  - Support vector machines
  - Tree-based models: classification and regression trees
- Descriptive modeling
  - Clustering: distance functions; k-means; hierarchical clustering
  - Outlier detection
- Model evaluation and selection
  - Holdout, cross-validation, bootstrap
- Reporting and deployment
  - Tools and examples (dynamic reports and Shiny web apps in R)
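To give a flavour of how the predictive modeling and model evaluation topics fit together, the following is a minimal sketch of a holdout estimate for a linear regression model in base R. It is not taken from the course materials: the `mtcars` data set (shipped with R), the 70/30 split, the choice of predictors and the use of RMSE as the metric are all assumptions made for illustration.

```r
# Illustrative sketch only: data set, split ratio and predictors are assumptions.
set.seed(42)                                   # make the random split reproducible

n         <- nrow(mtcars)                      # mtcars ships with base R (32 rows)
train_idx <- sample(n, size = floor(0.7 * n))  # randomly pick 70% of rows for training
train     <- mtcars[train_idx, ]
test      <- mtcars[-train_idx, ]              # remaining rows form the holdout set

# Fit a linear regression of fuel consumption on weight and horsepower
model <- lm(mpg ~ wt + hp, data = train)

# Predict on the unseen holdout rows and compute the root mean squared error
preds <- predict(model, newdata = test)
rmse  <- sqrt(mean((test$mpg - preds)^2))
print(rmse)
```

Because the error is measured on rows the model never saw during fitting, it gives a less optimistic estimate than the training error. Cross-validation and the bootstrap repeat this idea over several resamples to reduce the dependence on one particular split.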