Course Information
Course Title
大數據分析專題
Seminar on Big Data Analytics 
Semester
107-1 
Intended Students
College of Social Sciences, Department of Political Science
Instructor
張佑宗 
Course Number
PS5687 
Course Identifier
322 U2050 
Section
 
Credits
2.0 
Full/Half Year
Half year
Required/Elective
Elective
Class Time
Thursday, periods 8 and 9 (15:30~17:20)
Classroom
社科研605 
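Remarks
Not open during preliminary course selection. Political thought, international relations, public administration, domestic politics, comparative politics.
Restricted to third-year undergraduates and above, and to students of this department (including minor and double-major students).
Enrollment cap: 38 students
Ceiba Course Website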
http://ceiba.ntu.edu.tw/1071PS5687_BDATA 
Course Introduction Video
 
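Core Competency Mapping
Core competency and course planning relation chart
Course Syllabus
To protect everyone's rights, please respect intellectual property rights and do not make illegal photocopies.
Course Description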

Data science is a field whose goals overlap with those of many disciplines, in particular mathematics, statistics, algorithms, engineering, and optimization theory. It also has wide applications in scientific areas such as the natural sciences, social sciences, life sciences, business, and medicine. Data science has become an integral part of many research projects and has started to affect social science research. The promise of the “big data” revolution is that these data hold the answers to fundamental questions in business, government, and social sciences such as political science and sociology. Most importantly, these quantitative techniques provide “better predictions” across different systems. Many of the most astonishing results come from computational fields, which have little experience with the difficulty of social scientific inquiry. As social scientists, we have extensive experience in and observations of our own research fields, and we can apply advances in these new computational methods to our studies.
The course objective is to study the theory and practice of constructing algorithms that learn from data. This is an applied, graduate-level course for social scientists. Students will learn practical ways to build machine learning solutions for their own research. While some mathematical and statistical details are needed, we will give an overview of the quantitative tools required and emphasize the conceptual underpinnings of the methods rather than their theoretical properties. Specifically, the course will cover k-nearest neighbors, naive Bayes, decision trees, random forests, boosting, k-means clustering, kernels, scaling, and ensemble learning. We will also discuss best practices, including overfitting and underfitting, error rates, cross-validation, and the use of bootstrapping to develop uncertainty estimates.
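As an illustration of the last point, using bootstrapping to develop an uncertainty estimate, here is a minimal sketch in plain Python. It is not course material: the function name `bootstrap_se`, its parameters, and the toy `scores` data are hypothetical and invented for this example. The idea is to resample the observed data with replacement many times, recompute the statistic on each resample, and take the spread of those recomputed values as an estimate of the statistic's standard error.

```python
import random
import statistics


def bootstrap_se(values, stat=statistics.mean, n_resamples=1000, seed=0):
    """Estimate the standard error of `stat` by resampling `values` with replacement.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    estimates = []
    for _ in range(n_resamples):
        # Draw n observations with replacement from the original sample.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        estimates.append(stat(resample))
    # The standard deviation of the resampled statistics approximates the SE.
    return statistics.stdev(estimates)


# Toy data, invented for illustration only.
scores = [52, 61, 58, 70, 66, 49, 73, 64, 57, 68]
se = bootstrap_se(scores)
print(f"sample mean = {statistics.mean(scores):.2f}, bootstrap SE = {se:.2f}")
```

The same function works for statistics without a simple standard-error formula, for example by passing `stat=statistics.median`.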
Statistical Software:
R is a programming language and free software environment for statistical compu 

Course Objectives
By the end of this course, students should be able to:
(1) Understand the fundamental concepts and applications of data science.
(2) Learn the advantages and shortcomings of widely used machine learning algorithms.
(3) Uncover patterns and structure embedded in data with machine learning methods.
(4) Test and improve model specification and predictions.
(5) Apply their learning to a social science research project.
As a result, we hope that this course will appeal not just to mathematicians/statisticians but also to researchers in a wide variety of social science research fields. 
Course Requirements
Prerequisites:
One year of calculus, basic linear algebra, basic probability theory, applied statistics, and proficiency in Python/R/MATLAB, or permission of the instructors.

Grading Policy:
Quizzes: 10%
Assignments: 30%
Midterm: 30%
Final Exam: 30%

Assignments:
There will be 5-6 problem sets during the semester, with 3-5 questions apiece, drawn mostly from the two textbooks. The datasets we will use come mainly, but not exclusively, from the social sciences and business. You are encouraged to discuss the problems with your classmates, but you must write and turn in your own answers. To be blunt, rote copying of an answer from your classmates or other sources is a waste of your time and the grader's time.

Class Policy:
1. An important component of this course is active engagement with the material in classes. Regular attendance is essential and expected.
2. Quizzes are closed book, closed notes.
3. No makeup quizzes will be given.
4. No food in class.

Academic Honesty:
Lack of knowledge of the academic honesty policy is not a reasonable explanation for a violation.

Expected Weekly Study Hours Outside Class
 
Office Hours
 
Required Readings
To be added
References
I. Required Readings (please detail the required readings for each week)
Main References:
There are two required books for the course:
1. An Introduction to Statistical Learning: with Applications in R. Springer, 2013.
Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.

2. Applied Predictive Modeling. Springer, 2013. Max Kuhn and Kjell Johnson.

II. Further Readings (please detail the further readings for each week)
Here are some recommended readings. Students are not required to read all of these books prior to class.
1. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009. Trevor Hastie, Robert Tibshirani, and Jerome Friedman.
2. All of Statistics: A Concise Course in Statistical Inference. Springer, 2013. Larry Wasserman.
3. Python Machine Learning. PACKT Publishing, 2015. Sebastian Raschka. 
Grading Method
(For reference only)
   
Course Schedule
Week      Date     Topic
Week 1    9/13     Overview of Data Science
Week 2    9/20     Review Session 1
Week 3    9/27     Review Session 2
Week 4    10/04    Linear Regression 1
Week 5    10/11    Linear Regression 2
Week 6    10/18    Classification 1
Week 7    10/25    Classification 2
Week 8    11/01    Resampling Methods
Week 9    11/08    Midterm Exam
Week 10   11/15    No class due to school anniversary
Week 11   11/22    Linear Model Selection
Week 12   11/29    Regularization
Week 13   12/06    Nonlinear Methods 1
Week 14   12/13    Nonlinear Methods 2
Week 15   12/20    Tree Based Methods 1
Week 16   12/27    Tree Based Methods 2
Week 17   1/03     Unsupervised Learning
Week 18   1/10     Final Exam