Big Data Analytics: Optimization and Randomization
Tianbao Yang†, Qihang Lin♮, Rong Jin∗‡
Tutorial@SIGKDD 2015
Sydney, Australia
†Department of Computer Science, The University of Iowa, IA, USA
♮Department of Management Sciences, The University of Iowa, IA, USA
∗Department of Computer Science and Engineering, Michigan State University, MI, USA
‡Institute of Data Science and Technologies at Alibaba Group, Seattle, USA
August 10, 2015
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 1 / 234
URL
http://www.cs.uiowa.edu/~tyng/kdd15-tutorial.pdf
Some Claims
No
This tutorial is not an exhaustive literature survey
It is not a survey on different machine learning/data mining algorithms
Yes
It is about how to efficiently solve machine learning/data mining (formulated as optimization) problems for big data
Outline
Part I: Basics
Part II: Optimization
Part III: Randomization
Big Data Analytics: Optimization and Randomization
Part I: Basics
Outline
1 Basics
Introduction
Notations and Definitions
Three Steps for Machine Learning
Data, Model, Optimization
[Figure: distance to optimal objective versus iterations, with curves 0.5^T, 1/T^2, and 1/T]
Big Data Challenge
Big Data
Big Data Challenge
Big Model
60 million parameters
Learning as Optimization
Ridge Regression Problem:
$$\min_{w \in \mathbb{R}^d} \; \underbrace{\frac{1}{n} \sum_{i=1}^{n} (y_i - w^\top x_i)^2}_{\text{Empirical Loss}} + \underbrace{\frac{\lambda}{2} \|w\|_2^2}_{\text{Regularization}}$$
$x_i \in \mathbb{R}^d$: d-dimensional feature vector
$y_i \in \mathbb{R}$: target variable
$w \in \mathbb{R}^d$: model parameters
$n$: number of data points
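
The ridge regression objective above can be evaluated and, for small d, minimized in closed form. The sketch below is illustrative only and is not taken from the tutorial: the function names, the synthetic data, and the value of lam are assumptions. Setting the gradient (2/n) X^T (Xw − y) + λw to zero gives the linear system (X^T X + (nλ/2) I) w = X^T y, which the code solves directly.

```python
import numpy as np


def ridge_objective(w, X, y, lam):
    """(1/n) * sum_i (y_i - w^T x_i)^2 + (lam/2) * ||w||_2^2."""
    residual = y - X @ w
    return np.mean(residual ** 2) + 0.5 * lam * np.dot(w, w)


def ridge_closed_form(X, y, lam):
    """Solve (X^T X + (n*lam/2) I) w = X^T y, the stationarity condition."""
    n, d = X.shape
    A = X.T @ X + 0.5 * n * lam * np.eye(d)
    return np.linalg.solve(A, X.T @ y)


# Synthetic data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)

lam = 0.1  # regularization strength, chosen arbitrarily
w_star = ridge_closed_form(X, y, lam)

# The gradient of the objective should vanish at the minimizer.
grad = (2.0 / n) * X.T @ (X @ w_star - y) + lam * w_star
print("objective:", ridge_objective(w_star, X, y, lam))
print("gradient norm:", np.linalg.norm(grad))
```

The closed form costs O(nd^2 + d^3) time, which is practical only when d is modest; that cost is one reason to look for cheaper iterative methods on large problems.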
Learning as Optimization
Classification Problems:
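$$\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \ell(y_i w^\top x_i) + \frac{\lambda}{2} \|w\|_2^2$$
$y_i \in \{+1, -1\}$: label
Loss function $\ell(z)$, where $z = y w^\top x$
1. SVMs: (squared) hinge loss $\ell(z) = \max(0, 1 - z)^p$, where $p = 1, 2$
2. Logistic Regression: $\ell(z) = \log(1 + \exp(-z))$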
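
The two loss functions listed above are simple functions of the margin z. The sketch below is illustrative and not from the tutorial; the function names and the toy data are assumptions. The logistic loss uses `np.logaddexp` so that large negative margins do not overflow.

```python
import numpy as np


def hinge_loss(z, p=1):
    """(Squared) hinge loss max(0, 1 - z)^p, with p = 1 or 2."""
    return np.maximum(0.0, 1.0 - z) ** p


def logistic_loss(z):
    """log(1 + exp(-z)), computed stably as logaddexp(0, -z)."""
    return np.logaddexp(0.0, -z)


def classification_objective(w, X, y, lam, loss=hinge_loss):
    """(1/n) * sum_i loss(y_i * w^T x_i) + (lam/2) * ||w||_2^2."""
    z = y * (X @ w)  # margins, one per example
    return np.mean(loss(z)) + 0.5 * lam * np.dot(w, w)


# Toy data (hypothetical): two points with labels in {+1, -1}.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
w = np.zeros(2)

# At w = 0 every margin is 0, so the hinge loss is 1 for each example.
print(classification_objective(w, X, y, lam=0.1))
print(classification_objective(w, X, y, lam=0.1, loss=logistic_loss))
```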
Learning as Optimization
Feature Selection:
$$\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \ell(w^\top x_i, y_i) + \lambda \|w\|_1$$
$\ell_1$ regularization: $\|w\|_1 = \sum_{i=1}^{d} |w_i|$
$\lambda$ controls sparsity level
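
To see how λ controls sparsity, the sketch below solves the problem for a squared loss with proximal gradient descent (ISTA, soft-thresholding). This solver is a swapped-in technique chosen for illustration and is not described in this section; the squared loss, the function names, the synthetic data, and the values of lam are all assumptions.

```python
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def lasso_ista(X, y, lam, n_iters=500):
    """Minimize (1/n) * ||Xw - y||^2 + lam * ||w||_1 by proximal gradient."""
    n, d = X.shape
    # Lipschitz constant of the gradient of the smooth part.
    L = 2.0 * np.linalg.norm(X, 2) ** 2 / n
    w = np.zeros(d)
    for _ in range(n_iters):
        grad = 2.0 * X.T @ (X @ w - y) / n
        w = soft_threshold(w - grad / L, lam / L)
    return w


# Synthetic data (hypothetical): only 3 of 20 features matter.
rng = np.random.default_rng(0)
n, d = 100, 20
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.5, 1.0]
y = X @ w_true + 0.01 * rng.standard_normal(n)

nnz_small = np.count_nonzero(lasso_ista(X, y, lam=0.01))
nnz_large = np.count_nonzero(lasso_ista(X, y, lam=0.5))

# Once lam reaches max_j |(2/n) x_j^T y|, the all-zero vector is optimal.
lam_max = np.max(np.abs(2.0 * X.T @ y / n))
nnz_max = np.count_nonzero(lasso_ista(X, y, lam=lam_max))

print(nnz_small, nnz_large, nnz_max)
```

Increasing lam raises the threshold applied at every step, so more coordinates are set exactly to zero.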
Learning as Optimization
Feature Selection using Elastic Net:
$$\min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \ell(w^\top x_i, y_i) + \lambda \left( \|w\|_1 + \gamma \|w\|_2^2 \right)$$
Elastic net regularizer, more robust than the $\ell_1$ regularizer
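
As a small illustration, the elastic net penalty can be computed as below. This sketch assumes that λ multiplies both terms, that is, the penalty is λ(‖w‖_1 + γ‖w‖_2^2); the function name and the example values are hypothetical.

```python
import numpy as np


def elastic_net_penalty(w, lam, gamma):
    """lam * (||w||_1 + gamma * ||w||_2^2), assuming lam scales both terms."""
    return lam * (np.sum(np.abs(w)) + gamma * np.dot(w, w))


w = np.array([1.0, -2.0, 0.0])

# ||w||_1 = 3 and ||w||_2^2 = 5, so the penalty is 0.5 * (3 + 2 * 5) = 6.5.
penalty = elastic_net_penalty(w, lam=0.5, gamma=2.0)

# With gamma = 0 the penalty reduces to the plain l1 regularizer.
l1_only = elastic_net_penalty(w, lam=0.5, gamma=0.0)

print(penalty, l1_only)
```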