# Machine Learning (Stanford University): Week 1 Notes

### Contents:

• (Week 1)
• I. Introduction
• 1. What is machine learning
• 2. Machine learning algorithms
• 1) Supervised learning
• 2) Unsupervised learning
• 3) Others: reinforcement learning, recommender systems
• II. Model & Cost Function
• 1. Model Representation
• 2. Cost Function
• 1) The gradient descent algorithm
• 4) gradient descent for linear regression
• III. Matrix
• 1. Matrix Multiplication Properties
• 2. Inverse and Transpose matrix

## I. Introduction

#### 1. What is machine learning

1) Arthur Samuel described machine learning as "the field of study that gives computers the ability to learn without being explicitly programmed." This is an older, informal definition.

2）Tom Mitchell provides a more modern definition: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

Example: playing checkers.

E = the experience of playing many games of checkers

T = the task of playing checkers.

P = the probability that the program will win the next game.

#### 2. Machine learning algorithms

##### 1）Supervised learning

Supervised learning problems are categorized into “regression” and “classification” problems. In a regression problem, we are trying to predict results within a continuous output, meaning that we are trying to map input variables to some continuous function. In a classification problem, we are instead trying to predict results in a discrete output. In other words, we are trying to map input variables into discrete categories.
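The difference between the two problem types shows up in the type of the output. The sketch below is only an illustration: both functions, their coefficients, and the 1500 square-foot threshold are hypothetical and are not learned from any data.

```python
# Hypothetical toy predictors; every number here is made up for illustration.

def predict_price(living_area_sqft: float) -> float:
    """Regression: the output is a continuous value (a price)."""
    return 50.0 + 0.2 * living_area_sqft


def predict_dwelling_type(living_area_sqft: float) -> str:
    """Classification: the output is one of a few discrete categories."""
    return "house" if living_area_sqft >= 1500 else "apartment"


print(predict_price(2000))          # a real number
print(predict_dwelling_type(2000))  # a category label
```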

##### 2）Unsupervised learning

Unsupervised learning allows us to approach problems with little or no idea what our results should look like. We can derive structure from data where we don’t necessarily know the effect of the variables.

We can derive this structure by clustering the data based on relationships among the variables in the data.

With unsupervised learning there is no feedback based on the prediction results.

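• Unsupervised learning automatically derives structure from the relationships among the variables in the data, even when we cannot say in advance what the results should look like.

• There is no way to score the predictions of unsupervised learning against known answers.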
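As a rough sketch of clustering, the code below groups unlabeled one-dimensional points into two clusters. It uses a basic k-means loop, which is one common clustering method and is not named in these notes; the function name `two_means_1d`, the initialization, and the data are all assumptions made for illustration.

```python
def two_means_1d(points, iterations=10):
    """Split 1-D points into two clusters with a basic k-means loop (k = 2)."""
    # Assumption: start the two centers at the smallest and largest point.
    centers = [min(points), max(points)]
    clusters = ([], [])
    for _ in range(iterations):
        clusters = ([], [])
        for p in points:
            # Assign each point to its nearest center.
            nearest = 0 if abs(p - centers[0]) <= abs(p - centers[1]) else 1
            clusters[nearest].append(p)
        # Move each center to the mean of its points (keep it if it has none).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters


data = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]  # no labels are provided
centers, clusters = two_means_1d(data)
print(centers)
print(clusters)
```

No "correct answer" is passed in anywhere: the grouping comes only from the distances between the points.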

Example:

Non-clustering: The "Cocktail Party Algorithm" allows you to find structure in a chaotic environment (i.e., identifying individual voices and music from a mesh of sounds at a cocktail party).

## II. Model & Cost Function



### 1. Model Representation

To describe the supervised learning problem slightly more formally, our goal is, given a training set, to learn a function h : X → Y so that h(x) is a "good" predictor for the corresponding value of y. For historical reasons, this function h is called a hypothesis. The process is therefore: the training set is fed to a learning algorithm, which outputs h; h then takes an input x and returns a predicted y.

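h is the hypothesis function: the function fitted to the training examples. For linear regression with one variable it is a function of $x$ parameterized by $\theta_0$ and $\theta_1$, that is, $h_\theta(x) = \theta_0 + \theta_1 x$.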
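A minimal sketch of such a hypothesis, assuming the one-variable linear form $h_\theta(x) = \theta_0 + \theta_1 x$ that these notes use for linear regression. The helper name `make_hypothesis` and the parameter values are hypothetical.

```python
def make_hypothesis(theta0: float, theta1: float):
    """Return the function h(x) = theta0 + theta1 * x for fixed parameters."""
    def h(x: float) -> float:
        return theta0 + theta1 * x
    return h


# Hypothetical parameters and input, chosen only for illustration.
h = make_hypothesis(theta0=1.0, theta1=2.0)
print(h(3.0))  # 1.0 + 2.0 * 3.0 = 7.0
```

Learning then amounts to choosing `theta0` and `theta1` so that `h(x)` is close to `y` on the training set.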

When the target variable that we’re trying to predict is continuous, such as in our housing example, we call the learning problem a regression problem. When y can take on only a small number of discrete values (such as if, given the living area, we wanted to predict if a dwelling is a house or an apartment, say), we call it a classification problem.

### 2. Cost Function
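These notes use $J(\theta_0, \theta_1)$ without writing it out. The sketch below assumes the squared-error cost commonly used for linear regression, $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x_i) - y_i)^2$, where $m$ is the number of training examples. The function name `cost` and the data are hypothetical.

```python
def cost(theta0, theta1, xs, ys):
    """Squared-error cost J = (1 / (2m)) * sum((h(x_i) - y_i)^2)."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        prediction = theta0 + theta1 * x  # h(x) for the current parameters
        total += (prediction - y) ** 2
    return total / (2 * m)


# Made-up training set lying exactly on the line y = x.
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]
print(cost(0.0, 1.0, xs, ys))  # perfect fit, so the cost is 0.0
print(cost(0.0, 0.5, xs, ys))  # worse fit, so the cost is larger
```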

#### 1) The gradient descent algorithm

repeat until convergence:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$$

Note: $\alpha$ is not multiplied by $J$ itself; it is multiplied by the partial derivative of $J$ with respect to each of its two parameters in turn.

where $j = 0, 1$ represents the feature index number.

At each iteration, one should simultaneously update all of the parameters $\theta_1, \theta_2, \dots, \theta_n$ (in the two-parameter case above, $\theta_0$ and $\theta_1$). Updating one parameter before computing the update for another within the same iteration yields an incorrect implementation.

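• Both theta parameters must be updated simultaneously (this is very important!).

• If the two parameters are not updated simultaneously, the already-updated value of the first parameter is fed into the computation of the second and changes the second parameter's value.

• When the cost function is at its lowest point, we know that we have succeeded.

• The slope of the tangent is the derivative at that point, and it gives us a direction to move in. We then step down the cost function in the direction of steepest descent.

• The size of each step is determined by the parameter $\alpha$, which is called the learning rate.

• Whatever the sign of the derivative of the curve with respect to the parameter, $\theta_1$ moves toward the minimum of the cost function, provided the learning rate is not too large.

• Even with a fixed learning rate, the steps become smaller and smaller, because the derivative shrinks as we approach the minimum.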
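The simultaneous-update rule can be sketched as follows. The cost $J(\theta_0, \theta_1) = \theta_0^2 + \theta_0 \theta_1 + \theta_1^2$ is a hypothetical stand-in chosen so that each partial derivative depends on both parameters; the function names and the value of `ALPHA` are also assumptions.

```python
ALPHA = 0.1  # learning rate, chosen arbitrarily for this sketch


def grad0(t0, t1):
    # dJ/dtheta0 for the hypothetical cost J = t0^2 + t0*t1 + t1^2
    return 2 * t0 + t1


def grad1(t0, t1):
    # dJ/dtheta1 for the same hypothetical cost
    return t0 + 2 * t1


def step_simultaneous(t0, t1):
    """Correct: both updates are computed from the old (t0, t1)."""
    temp0 = t0 - ALPHA * grad0(t0, t1)
    temp1 = t1 - ALPHA * grad1(t0, t1)
    return temp0, temp1


def step_sequential_wrong(t0, t1):
    """Incorrect: t1's update sees the already-updated t0."""
    t0 = t0 - ALPHA * grad0(t0, t1)
    t1 = t1 - ALPHA * grad1(t0, t1)
    return t0, t1


print(step_simultaneous(1.0, 1.0))
print(step_sequential_wrong(1.0, 1.0))
```

Starting from the same point, the two functions return different values for the second parameter, which is exactly the error the bullets above warn about.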

The intuition behind the convergence is that $\frac{d}{d\theta_1} J(\theta_1)$ approaches 0 as we approach the bottom of our convex function (at a local minimum the derivative is 0). At the minimum, the derivative will always be 0, so the update leaves $\theta_1$ unchanged:

$$\theta_1 := \theta_1 - \alpha \cdot 0$$
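To see the shrinking steps numerically, here is a small sketch on the hypothetical one-parameter cost $J(\theta) = \theta^2$, whose derivative is $2\theta$. The function name, starting points, and learning rate are assumptions for illustration.

```python
def gradient_descent_1d(theta, alpha=0.1, iterations=5):
    """Minimize the hypothetical cost J(theta) = theta^2 with a fixed alpha."""
    step_sizes = []
    for _ in range(iterations):
        derivative = 2 * theta      # dJ/dtheta for J = theta^2
        step = alpha * derivative   # alpha is fixed; only the derivative changes
        theta = theta - step
        step_sizes.append(abs(step))
    return theta, step_sizes


theta, step_sizes = gradient_descent_1d(4.0)
print(theta)
print(step_sizes)  # each step is smaller than the one before

# Starting left of the minimum, the negative derivative pushes theta upward.
theta_from_left, _ = gradient_descent_1d(-4.0)
print(theta_from_left)
```

Starting exactly at the minimum (`theta = 0.0`) gives a derivative of 0, so the parameter never moves.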

#### 4）gradient descent for linear regression

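When specifically applied to the case of linear regression, a new form of the gradient descent equation can be derived. We substitute the hypothesis $h_\theta(x_i) = \theta_0 + \theta_1 x_i$ and the squared-error cost $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x_i) - y_i)^2$, where $m$ is the number of training examples, and the equation becomes:

repeat until convergence:

$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x_i) - y_i\right)$$

$$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x_i) - y_i\right) x_i$$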
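A minimal sketch of gradient descent for one-variable linear regression, assuming the hypothesis $h_\theta(x) = \theta_0 + \theta_1 x$ and the squared-error cost with the $\frac{1}{2m}$ factor. The function name `fit_line`, the learning rate, the iteration count, and the data are hypothetical.

```python
def fit_line(xs, ys, alpha=0.1, iterations=2000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # Prediction errors h(x_i) - y_i under the current parameters.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients were computed from the old values.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Made-up data lying exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta0, theta1 = fit_line(xs, ys)
print(theta0, theta1)  # should be close to 1 and 2
```

With a much larger `alpha` the iterates can diverge instead of converging, so the learning rate usually needs some tuning.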

## III. Matrix

### 1. Matrix Multiplication Properties

• Matrix multiplication is not commutative: in general, $A * B \neq B * A$.

• Matrix multiplication is associative: $(A * B) * C = A * (B * C)$.

• The identity matrix, when multiplied by any matrix of the same dimensions, results in the original matrix. It's just like multiplying numbers by 1. The identity matrix simply has 1's on the diagonal (upper left to lower right diagonal) and 0's elsewhere.

• When multiplying the identity matrix after some matrix ($A * I$), the square identity matrix's dimension should match the other matrix's columns. When multiplying the identity matrix before some other matrix ($I * A$), the square identity matrix's dimension should match the other matrix's rows.

Here $I$ denotes the identity matrix.

```matlab
% Initialize matrices A and B
A = [1,2;4,5]
B = [1,1;0,2]

% Initialize a 2 by 2 identity matrix
I = eye(2)

% The above notation is the same as I = [1,0;0,1]

% What happens when we multiply I*A ?
IA = I*A

% How about A*I ?
AI = A*I

% Compute A*B
AB = A*B

% Is it equal to B*A?
BA = B*A

% Note that IA = AI but AB != BA
```
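The dimension rule for the identity matrix and the associative law can also be checked on non-square matrices. The sketch below uses plain Python lists of rows; `matmul` and `identity` are small hypothetical helpers, not library functions, and the matrices are made up.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    assert len(A[0]) == len(B), "columns of A must equal rows of B"
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def identity(n):
    """Build the n-by-n identity matrix."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


A = [[1, 2, 3],
     [4, 5, 6]]                  # 2 rows, 3 columns

# A*I needs a 3x3 identity (A's columns); I*A needs a 2x2 identity (A's rows).
print(matmul(A, identity(3)) == A)
print(matmul(identity(2), A) == A)

# Associativity: (A*B)*C equals A*(B*C).
B = [[1, 0], [2, 1], [0, 3]]     # 3x2
C = [[2, 1], [1, 1]]             # 2x2
print(matmul(matmul(A, B), C) == matmul(A, matmul(B, C)))
```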


### 2. Inverse and Transpose matrix

The inverse of a matrix $A$ is denoted $A^{-1}$.

• Multiplying a matrix by its inverse results in the identity matrix.

• A non-square matrix does not have an inverse matrix. A square matrix can also lack an inverse; such a matrix is called singular (or degenerate).
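As a small check of these properties, the sketch below inverts a 2-by-2 integer matrix with the standard adjugate formula and uses exact fractions to avoid rounding. The heading also mentions the transpose, which these notes do not define; the sketch assumes the standard definition (rows become columns). All helper names are hypothetical.

```python
from fractions import Fraction


def inverse_2x2(M):
    """Invert a 2x2 integer matrix; return None if it is singular."""
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        return None  # singular: no inverse exists
    return [[Fraction(d, det), Fraction(-b, det)],
            [Fraction(-c, det), Fraction(a, det)]]


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def transpose(M):
    """Swap rows and columns."""
    return [list(row) for row in zip(*M)]


A = [[1, 2], [4, 5]]
A_inv = inverse_2x2(A)
print(matmul(A, A_inv) == [[1, 0], [0, 1]])  # A times its inverse is I
print(inverse_2x2([[1, 2], [2, 4]]))         # singular matrix: None
print(transpose([[1, 2, 3], [4, 5, 6]]))     # the 2x3 matrix becomes 3x2
```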