2021-03-02

PRML上巻 P14

prml

昨日の続き

確率変数 $B$ の確率分布を単に $p(B)$ と書き，その分布が特定の値 $r$ を取る確率を $p(r)$ と書くことにする．このような簡潔な記法を使うと，確率論の2つの基本法則を以下のように書くことができる．

確率の基本法則

$\begin{align*} &\textbf{加法定理}\qquad &p(X)&=\sum_{Y}p(X,Y) \tag{1.10}\\ &\textbf{乗法定理}\qquad &p(X,Y)&=p(Y | X)p(X) \tag{1.11}\\ \end{align*}$

ここで， $p(X,Y)$ は同時確率， $p(Y|X)$ は条件付き確率， $p(X)$ は周辺確率（単に $X$ の確率）である．

乗法定理及び対称性 $p(X,Y)=p(Y,X)$ から，条件付き確率の間の以下の関係を得る．

$\displaystyle{ p(Y|X)=\frac{p(X|Y)p(Y)}{p(X)}\tag{1.12} }$

これはベイズの定理（Bayes' theorem）と呼ばれ，パターン認識や機械学習において中心的役割保果たす．

加法定理を使えば，ベイズの定理の分母は分子に現れる量を使って表すことができる．

$\displaystyle{ p(X)=\sum_{Y}p(X|Y)p(Y)\tag{1.13} }$

すなわち，

$\displaystyle{ p(Y|X)=\frac{p(X|Y)p(Y)}{\sum_{Y}p(X|Y)p(Y)} }$

と表せる．

すべての $Y$ について和を取ると，

$\begin{align*} \sum_{Y}p(Y|X)&=\frac{\sum_{Y}p(X|Y)p(Y)}{\sum_{Y}p(X|Y)p(Y)}\\ &=1 \end{align*}$

となる．

つまり，ベイズの定理の分母は $(1.12)$ 式の左辺の条件付き確率をすべての $Y$ について和をとったものが1になることを保証するための規格化（正規化）定数とみなすことができる．

今日はここまで．

2021-03-01

PRML上巻 P13

prml

昨日の続き．

$i$ 列の事例数は，その列にある枠内の事例数の総和のため， $c_i=\sum_j n_{ij}$ である． $(1.5)$ と $(1.6)$ より，

$\displaystyle{ p(X=x_i)=\sum_{j=1}^{L}p(X=x_i,Y=y_j)\tag{1.7} }$

が成り立つ．これが確率の加法定理(sum rule)である．

$p(X=x_i)$ は他の変数についての周辺化，すなわち $Y$ についての足し合わせであるため，周辺確率(marginal probability)と呼ばれることもある．

$X=x_i$ の事例だけを考え，その中での $Y=y_j$ の事例の比率を $p(Y=y_j|X=x_i)$ と書き， $X=x_i$ が与えられた下での $Y=y_j$ の条件付き確率 (conditional probability)と呼ぶ．

これは $i$ 列の点の中で $i, j$ の枠内にある点の数の比率であるため，

$\displaystyle{ p(Y=y_j|X=x_i)=\frac{n_{ij}}{c_i}\tag{1.8} }$

となる．

$(1.5)$ , $(1.6)$ および $(1.8)$ から次の関係を得る．

$\begin{align*} p(X=x_i,Y=y_j)&=\frac{n_{ij}}{N} \\ &=\frac{n_{ij}}{c_i}\cdot\frac{c_i}{N} \\ &=p(Y=y_j|X=x_i)p(X=x_i) \tag{1.9} \end{align*}$

これは確率の乗法定理(product rule)である．

今日はここまで．

2021-02-28

PRML上巻 P11-13

prml

1.2 確率論

確率論(probability theory)は，不確実性に関する定量化と操作に関して一貫した枠組みを与え，パターン認識の基礎の中心を担っている．

確率に関する２つの基本的な法則がある．

確率の加法定理(sum rule of probability)
確率の乗法定理(product rule of probability)

2つの確率変数 $X, Y$ を考える． $X$ は任意の値 $x_i (i=1,\ldots,M)$ ， $Y$ は任意の値 $y_j (j=1,\ldots,L)$ を取れるものとする． $X$ と $Y$ の両方についてサンプルを取り，全部で $N$ 回の試行を行う．そのうち $X=x_i,Y=y_j$ となる試行の数を $n_{ij}$ とする．また， $X$ が値 $x_i$ を取る試行の数を $c_i$ ， $Y$ が値 $y_j$ を取る試行の数を $r_j$ とする．

このとき， $X$ が $x_i$ ， $Y$ が $y_j$ を取る確率を $p(X=x_i,Y=y_j)$ と書き， $X=x_i$ と $Y=y_j$ の同時確率（結合確率; joint probability）と呼ぶ．

これは， $i,j$ という枠内の中にある点の個数を点の総数で割った数であるため，

$\displaystyle{ p(X=x_i,Y=y_j)=\frac{n_{ij}}{N}\tag{1.5} }$

で与えられる．

同様に， $X$ が $x_i$ を取る確率を $p(X=x_i)$ と書くと， $i$ 列にある点の数を点の総数で割った数で，

$\displaystyle{ p(X=x_i)=\frac{c_i}{N}\tag{1.6} }$

となる．

眠いのでここまで．

2021-02-23

PRML上巻 P10-11

prml

昨日の続き．

多項式曲線フィッティングの線形方程式verを作成した．

動作確認のため，図 $1.6$ を再描画してみる．

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import numpy.polynomial.polynomial as poly
np.random.seed(42)

# test data set (N=15)
# t = sin(2 \pi x) + gaussian noise
x4 = np.linspace(0, 1, 15)
t4 = np.sin(2 * np.pi * x4) + np.random.normal(0, 0.2, 15)
# test data set (N=100)
# t = sin(2 \pi x) + gaussian noise
x1 = np.linspace(0, 1, 100)
t3 = np.sin(2 * np.pi * x1) + np.random.normal(0, 0.2, 100)

# figure 1.6

def polyfit_linear_equation(x: np.ndarray, t: np.ndarray, M: int) -> np.ndarray:
    """solve linear equation
    
    Args:
        x (np.ndarray): explanatory variable　(N dims)
        t (np.ndarray): response variable (N dims)
        M (int): order
    
    Returns:
        np.ndarray: coefficient (M dims)
    """
    # add intercept term
    M += 1
    # A: M x M matrix
    A = np.zeros(shape=(M, M))
    for i in range(M):
        for j in range(M):
            A[i][j] = np.sum(np.power(x, i + j))
    # T: 1 x M vector
    T = np.zeros(M)
    for i in range(M):
        T[i] = np.dot(np.power(x, i), t)
    # fitting with linear equations
    # FIXME: the result is flipped. To make the result in proper order, we must flip it.
    return np.linalg.solve(A, T)[::-1]

# fitting with polynomial functions
w9_15 = polyfit_linear_equation(x4, t4, 9)
w9_100 = polyfit_linear_equation(x1, t3, 9)

# visualize
fig, axs = plt.subplots(1, 2, figsize=(12, 4))
axs[0].set_title('N = 15')
axs[1].set_title('N = 100')

for ax in axs.flat:
    ax.plot(x1, t1, color='green')
    ax.set(xlabel='x', ylabel='t')

axs[0].scatter(x4, t4, facecolor='None', edgecolor='blue')
axs[0].plot(x1, np.poly1d(w9_15)(x1), color='red')
axs[1].scatter(x1, t3, facecolor='None', edgecolor='blue')
axs[1].plot(x1, np.poly1d(w9_100)(x1), color='red')

plt.tight_layout()

f:id:twakamori:20210223225533p:plain — figure 1.6 (linear equation)

正しく動作しているようだ．

あらためて，線形方程式verを用いて図 $1.7$ を描画してみる．

なお，正則化した線形方程式には，演習問題1.2の解答である次式を用いる．

$\displaystyle{ \sum_{j=0}^{M}(A_{ij}-\lambda I_{ij})w_j=T_i }$

# figure 1.7

def polyfit_linear_equation_regularize(x: np.ndarray, t: np.ndarray, M: int, alpha: float) -> np.ndarray:
    """solve linear equation
    
    Args:
        x (np.ndarray): explanatory variable　(N dims)
        t (np.ndarray): response variable (N dims)
        M (int): order
        alpha (float): regularization parameter
    
    Returns:
        np.ndarray: coefficient (M dims)
    """
    # add intercept term
    M += 1
    # A: M x M matrix
    A = np.zeros(shape=(M, M))
    for i in range(M):
        for j in range(M):
            A[i][j] = np.sum(np.power(x, i + j))
    # I: M x M identity matrix
    I = np.identity(M)
    # T: 1 x M vector
    T = np.zeros(M)
    for i in range(M):
        T[i] = np.dot(np.power(x, i), t)
    A_reg = A - alpha * I
    # fitting with linear equations
    # FIXME: the result is flipped. To make the result in proper order, we must flip it.
    return np.linalg.solve(A_reg, T)[::-1]

w9_l18 = polyfit_linear_equation_regularize(x4, t4, 9, np.exp(-18))
w9_l0 = polyfit_linear_equation_regularize(x4, t4, 9, np.exp(0))

# visualize
fig, axs = plt.subplots(1, 2, figsize=(12, 4))
axs[0].set_title(r'$\ln{\lambda} = -18$')
axs[1].set_title(r'$\ln{\lambda} = 0$')

for ax in axs.flat:
    ax.plot(x1, t1, color='green')
    ax.set(xlabel='x', ylabel='t')

axs[0].scatter(x2, t2, facecolor='None', edgecolor='blue')
axs[0].plot(x1, np.poly1d(w9_l18.flatten())(x1), color='red')
axs[1].scatter(x2, t2, facecolor='None', edgecolor='blue')
axs[1].plot(x1, np.poly1d(w9_l0.flatten())(x1), color='red')

plt.tight_layout()

f:id:twakamori:20210223225737p:plain — figure 1.7

（元々のデータが書籍記載のものと異なるためか，書籍記載の図を再現できていない． $\lambda$ の値を変化させて確認したところ，正則化自体はうまく動作しているようだった．よってこの図はこのままとする．）

図から， $\lambda$ によって過学習が抑制され，背後の関数 $\sin(2\pi x)$ に近い表現が得られることがわかる．ただし， $\lambda$ の値を大きくしすぎると図 $1.7$ の右側に示すように当てはまりは再び悪くなる．

正則化が期待通り係数が大きい値を取らないようにしているかを見るために，当てはめた多項式の係数を次表に示す．（コードは省略）

	$\ln{\lambda}=-\infty$	$\ln{\lambda}=-18$	$\ln{\lambda}=0$
$w_ 0^ *$	20812.84	-2005.15	-0.95
$w_ 1^ *$	-92408.71	3634.86	-0.85
$w_ 2^ *$	170879.63	614.50	-0.71
$w_ 3^ *$	-170219.15	-3401.47	-0.54
$w_ 4^ *$	98565.04	-1228.53	-0.31
$w_ 5^ *$	-33325.18	4548.16	-0.02
$w_ 6^ *$	6273.03	-2740.68	0.37
$w_ 7^ *$	-609.66	618.99	0.84
$w_ 8^ *$	32.65	-40.79	1.19
$w_ 9^ *$	-0.38	0.20	-0.38

表より， $\lambda$ の値を増やすと共にほとんどの係数は小さな値を取るようになることがわかる．

正則化項が汎化誤差に与える影響は，訓練集合とテスト集合の両方に対するRMS誤差 $(1.3)$ の値を $\ln{\lambda}$ に対してプロットしてみれば良い．これを図 $1.8$ に示す．

# figure 1.8

RMSE_reg_train = []
RMSE_reg_test = []
w_values_reg = []
for alpha in range(-380, -199):
    w = polyfit_linear_equation_regularize(x2, t2, 9, np.exp(alpha / 10))
    w_values_reg.append(w.flatten())
    y_train = np.poly1d(w.flatten())(x2)
    y_test = np.poly1d(w.flatten())(x1)
    E_train = 0.5 * np.sum(np.square(y_train - t2))
    E_test = 0.5 * np.sum(np.square(y_test - t3))
    RMSE_reg_train.append(np.sqrt(2 * E_train / t2.size))
    RMSE_reg_test.append(np.sqrt(2 * E_test / t3.size))

ln_alpha = np.linspace(-38, -19, 181)
p1 = plt.plot(ln_alpha, RMSE_reg_train, color='blue')
p2 = plt.plot(ln_alpha, RMSE_reg_test, color='red')
plt.xlabel(r'$\ln{\lambda}$')
plt.ylabel(r'$E_{RMS}$')
plt.ylim(0, 1)
plt.legend([p1[0], p2[0]], ['訓練', 'テスト'])

f:id:twakamori:20210223230131p:plain — figure 1.8

むむ．書籍と随分違うグラフになってしまった．このグラフから何か読み取るのは困難だが，本来は，テスト誤差が最小となるような正則化パラメータ $\lambda$ を探索すればよい，ということになるはずである．

モデルの複雑さの問題は重要で，誤差関数を最小にするようなアプローチで実際の応用問題を解こうとする際には，モデルの複雑さを適切に決める方法を見つけなければならない．そのためには，得られたデータを，係数 $\mathbf{w}$ を決めるために使われる訓練集合と，それとは別の確認用集合（検証用集合; validation set）に分ける方法がある．確認用集合はホールドアウト集合（hold-out set）とも呼ばれ，モデルの複雑さ（ $M$ または $\lambda$ ）を最適化するのに使われる．

これまでの多項式曲線フィッティングの議論は，主として直感に訴えるものであった．ここからは，確率論的な議論の枠組みを使い，パターン認識における問題を解くためのより原理的なアプローチを探っていく．

以上．

2021-02-22

PRML上巻 P9

prml

昨日の続き．

過学習を制御するためによく使われるテクニックに正則化(regularization)がある．これは誤差関数 $(1.2)$ にpenalty項を付加することにより係数が大きな値になることを防ごうとするものである．

$\displaystyle{ E(\mathbf{w})=\frac{1}{2}\sum_{n=1}^{N}\left\{y(x_n, \mathbf{w}) - t_n \right\}^2\tag{1.2} }$

最も単純なものは，係数を二乗して和をとったもの（L2正則化）で，誤差関数は以下となる．

$\displaystyle{ \tilde{E}(\mathbf{w})=\frac{1}{2}\sum_{n=1}^N\left\{y(x_n,\mathbf{w}-t_n)\right\}^2+\frac{\lambda}{2}\|\mathbf{w}\|^2\tag{1.4} }$

ここで， $\|\mathbf{w}\|^ 2=\mathbf{w}^ T\mathbf{w}=w_0^ 2+w_1^ 2+\cdots w_M^ 2$ であり，係数 $\lambda$ は正則化項と二乗誤差の和の項との相対的な重要度を調整している．（ただし，係数 $w_0$ は正則化から外すことも多い）

このようなテクニックは統計学の分野で縮小推定(shrinkage)と呼ばれている．特に2次の正則化の場合はリッジ回帰(ridge regression)と呼ばれる．ニューラルネットワークの文脈では荷重減衰(weight decay)として知られている．

演習問題 1.2

正則化された二乗和誤差関数 $(1.4)$ を最小にする係数 $w_i$ が満たす， $(1.122)$ に類似した線型方程式系を書き下せ．

演習問題 1.2 解答

$(1.4)$ に $(1.2)$ を代入する．

$\displaystyle{ \tilde{E}(\mathbf{w})=E(\mathbf{w})+\frac{\lambda}{2}\|\mathbf{w}\|^2 }$

$(1.4)$ が最小のとき， $(1.4)$ の $w_i$ についての偏微分が0となる．よって，

$\begin{align*} \frac{\partial}{\partial w_i}\tilde{E}(\mathbf{w}) &= \frac{\partial}{\partial w_i}E(\mathbf{w})-\lambda w_i \\ &= \frac{\partial}{\partial w_i}E(\mathbf{w})-\lambda \sum_{j=0}^M I_{ij}w_j\tag{a} \end{align*}$

ここで，

$\displaystyle{ I_{ij}=\left\{\begin{array}{ll} 1 & (i=j) \\ 0 & (\mathrm{otherwise}) \end{array}\right. }$

を用いた．

演習問題 1.1 解答より，

$\displaystyle{ \frac{\partial}{\partial w_i}E(\mathbf{w})=\sum_{j=0}^MA_{ij}w_j-T_i }$

なので，これを $(a)$ に代入すると．

$\begin{aligned} \frac{\partial}{\partial w_i}\tilde{E}(\mathbf{w})&=\sum_{j=0}^MA_{ij}w_j-T_i-\lambda \sum_{j=0}^M I_{ij}w_j\\ &=\sum_{j=0}^M(A_{ij}-\lambda I_{ij})w_j - T_i\\ &=0 \end{aligned}$

上式の $T_i$ を移行すると， $(1.122)$ に似た次式が得られる．

$\displaystyle{ \sum_{j=0}^{M}(A_{ij}-\lambda I_{ij})w_j=T_i }$

本日は以上．

2021-02-20

PRML上巻 P8

prml

昨日の続き．

いろいろな次数の多項式について得られた係数 $\mathbf{w}^*$ の値を検証してみる．（コードは省略）

	$M=0$	$M=1$	$M=3$	$M=9$
$w_ 0^ *$	0.09	-1.36	21.09	524.49
$w_ 1^ *$		0.77	-31.88	-8747.08
$w_ 2^ *$			10.88	30200.27
$w_ 3^ *$			0.01	-46656.18
$w_ 4^ *$				38429.23
$w_ 5^ *$				-17576.78
$w_ 6^ *$				4319.81
$w_ 7^ *$				-522.06
$w_ 8^ *$				28.31
$w_ 9^ *$				0.10

$M$ の増加に伴って，係数の多くが大きな値をとるようになっている．

次に，モデルの次数は固定し，データ集合のサイズを変えてみたときの振る舞いを確認する．（一時変数名が適当ですいません）

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import numpy.polynomial.polynomial as poly
np.random.seed(42)

# test data set (N=15)
# t = sin(2 \pi x) + gaussian noise
x4 = np.linspace(0, 1, 15)
t4 = np.sin(2 * np.pi * x4) + np.random.normal(0, 0.2, 15)
# test data set (N=100)
# t = sin(2 \pi x) + gaussian noise
x1 = np.linspace(0, 1, 100)
t3 = np.sin(2 * np.pi * x1) + np.random.normal(0, 0.2, 100)

# figure 1.6
# fitting with polynomial functions
w9_15 = np.polyfit(x4, t4, 9)
w9_100 = np.polyfit(x1, t3, 9)

# visualize
fig, axs = plt.subplots(1, 2, figsize=(12, 4))
axs[0].set_title('N = 15')
axs[1].set_title('N = 100')

for ax in axs.flat:
    ax.plot(x1, t1, color='green')
    ax.set(xlabel='x', ylabel='t')

axs[0].scatter(x4, t4, facecolor='None', edgecolor='blue')
axs[0].plot(x1, np.poly1d(w9_15)(x1), color='red')
axs[1].scatter(x1, t3, facecolor='None', edgecolor='blue')
axs[1].plot(x1, np.poly1d(w9_100)(x1), color='red')

plt.tight_layout()

f:id:twakamori:20210220195022p:plain — figure 1.6

図から，データ集合のサイズが大きくなるにつれて過学習の問題は深刻でなくなってくることがわかる．

大雑把な経験則としては，データの点の数はモデル中の適応パラメータの数の何倍か（例えば5とか10）よりは小さくてはいけない，といわれているらしい．

今日はここまで．

2021-02-20

PRML上巻 P6-7

prml

昨日の続き．

f:id:twakamori:20210219012715p:plain — figure 1.4

図 $1.4$ をみると，次数 $M=9$ のケースは関数 $sin(2\pi x)$ の表現としては明らかに不適切であることがわかる．このような振る舞いは過学習(over-fitting)として知られている．

我々の目標は，新たなデータに対して正確な予測を行える高い汎化性能の達成である．

そこで汎化性能が $M$ にどう依存するかを定量的に評価する．

$\displaystyle{ E(\mathbf{w})=\frac{1}{2}\sum_{n=1}^{N}\left\{y(x_n, \mathbf{w}) - t_n \right\}^2\tag{1.2} }$

テスト集合に対して，誤差関数 $E(\mathbf{w}^*)$ を評価する．このとき，

$\displaystyle{ E_{RMS}=\sqrt{2E(\mathbf{w}^*)/N}\tag{1.3} }$

で定義される平均二乗平方根誤差(root-mean-square error, RMS error)を用いると便利なことがある． $N$ で割ることによりサイズの異なるデータ集合を比較できるようになり，平方根を取ることによって $E_{RMS}$ は目的変数 $t$ と同じ尺度であることが保証される．

いろいろな $M$ に対する訓練集合とテスト集合のRMS誤差のグラフを描いてみる．ここで，テスト集合は新たに生成した100個のデータ点からなる．

%matplotlib inline
import matplotlib.pyplot as plt
import japanize_matplotlib
import numpy as np
import numpy.polynomial.polynomial as poly
np.random.seed(42)

# t = sin(2 \pi x)
x1 = np.linspace(0, 1, 100)
t1 = np.sin(2 * np.pi * x1)
# training data set (N=10)
# t = sin(2 \pi x) + gaussian noise
x2 = np.linspace(0, 1, 10)
t2 = np.sin(2 * np.pi * x2) + np.random.normal(0, 0.2, 10)
# test data set (N=100)
# t = sin(2 \pi x) + gaussian noise
t3 = np.sin(2 * np.pi * x1) + np.random.normal(0, 0.2, 100)

RMSE_train = []
RMSE_test = []
for m in range(10):
    w = np.polyfit(x2, t2, m)
    y_train = np.poly1d(w)(x2)
    y_test = np.poly1d(w)(x1)
    E_train = 0.5 * np.sum(np.square(y_train - t2))
    E_test = 0.5 * np.sum(np.square(y_test - t1))
    RMSE_train.append(np.sqrt(2 * E_train / t2.size))
    RMSE_test.append(np.sqrt(2 * E_test / t1.size))

M = np.linspace(0, 9, 10)
p1 = plt.plot(M, RMSE_train, color='blue', marker='o', markerfacecolor='None', markeredgecolor='blue')
p2 = plt.plot(M, RMSE_test, color='red', marker='o', markerfacecolor='None', markeredgecolor='red')
plt.xlabel('M')
plt.ylabel(r'$E_{RMS}$')
plt.legend([p1[0], p2[0]], ['訓練', 'テスト'])

f:id:twakamori:20210220005238p:plain — figure 1.5

上図をみると，（書籍の図 $1.5$ ほど顕著ではないものの）3<Mの範囲で徐々にテスト誤差が大きくなっていることがわかる．

また， $M=9$ では訓練誤差が0になっている．これは多項式の自由度が10であり，訓練集合の10個のデータ点にちょうど当てはめられるからである．

今日はここまで．

一日坊主

雰囲気でやっている

PRML上巻 P14

PRML上巻 P13

PRML上巻 P11-13

1.2 確率論

PRML上巻 P10-11

PRML上巻 P9

演習問題 1.2

演習問題 1.2 解答

PRML上巻 P8

PRML上巻 P6-7