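Systems Thinking

Published 2025-06-22

System

A whole made up of multiple interdependent parts that together serve a specific purpose.

Multiple parts
Interdependence
A specific purpose

Perspectives of systems thinking

Deep thinking: from phenomenon to essence
Global thinking: from thinking about a part to thinking about the whole
Dynamic thinking: understanding that the relationships between people and between events change over time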

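Causal loop diagram

Variable

A factor in the system structure: a noun whose value changes over time.

Link: a causal relationship between two variables, also called a causal link. A positive link means that when A increases, B increases (A+ B+); a negative link means that when A increases, B decreases (A+ B-).

Loop

Several links that form a closed circle. Loops are either reinforcing loops or balancing loops.

Reinforcing loop

In the simplest case every link is positive; more generally, the loop contains an even number of negative links.
A change in one variable travels around the loop and pushes that same variable further in the same direction, so the trend keeps growing or shrinking unchecked.
Everyday phrases such as "vicious circle" and "the strong get stronger" describe reinforcing loops.

Balancing loop

The loop mixes positive and negative links, with an odd number of negative links.
A change in one variable is counteracted by the other variables in the loop, so over the long run the variable settles around an equilibrium.
The most common example: if pork prices rise sharply, more people go into pig farming, supply becomes plentiful the following year, and prices fall, so over the long term the price stays around a balanced level.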
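The difference between the two loop types can be sketched numerically. The snippet below is only an illustration under simple assumptions: the function names, the proportional growth rule, and the "close part of the gap to a target each step" rule are hypothetical choices, not part of any causal loop notation.

```python
def simulate_reinforcing(x0, growth_rate, steps):
    """Reinforcing loop: each step feeds the current value back into its own growth."""
    values = [x0]
    for _ in range(steps):
        values.append(values[-1] * (1 + growth_rate))
    return values


def simulate_balancing(x0, target, adjust_rate, steps):
    """Balancing loop: each step closes part of the gap between the value and a target."""
    values = [x0]
    for _ in range(steps):
        gap = target - values[-1]
        values.append(values[-1] + adjust_rate * gap)
    return values


growth = simulate_reinforcing(x0=1.0, growth_rate=0.5, steps=3)
price = simulate_balancing(x0=20.0, target=10.0, adjust_rate=0.5, steps=3)
print(growth)  # keeps moving away from the start
print(price)   # moves toward the target
```

The reinforcing series drifts further from its starting point at every step, while the balancing series converges on the target, which mirrors the pork price example.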

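Delay

A change in one variable does not necessarily affect another variable right away; the causal relationship may carry a time delay.
Drawing a "||" mark on a link indicates that the causal relationship between its two variables is delayed.

Typical examples of delay at work:
the effect of hiring on a project's staffing gap, the effect of unit tests on product quality, the effect of learning on job skills, and so on.
Being aware of delays is also one of the keys to understanding the complexity of a system.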

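Five basic archetypes of systems thinking

Each archetype describes how a series of decisions ends up producing balancing or reinforcing loops without solving the actual problem.
They help you understand, analyze, and locate the causal chain behind problems in a system.

Fixes that fail ("drinking poison to quench thirst")

B: take shortcuts to speed up
R: the resulting defects drag the project down

"Fixes that fail" describes how, under schedule pressure, we give up our principles again and again. Because of the delay on the links we hope to get away with it, and in the end the system is saddled with heavy technical debt.
The left half is a balancing loop, but the whole outer loop is a reinforcing loop (two negatives make a positive).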

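Shifting the burden

B1: take shortcuts to speed up (the symptomatic fix)
B2: improve the architecture to speed up (the fundamental fix)
R: stuck in an architectural quagmire with no way out (caused by delay)

"Shifting the burden" describes the conflict between a short-term symptomatic solution and a long-term fundamental solution. Because of the reinforcing loop, the fundamental solution, architecture improvement, never gets a higher priority, and we end up addicted to the short-term symptomatic solution.

Eroding goals

B1: make progress through action (the fundamental fix)
B2: make progress by changing the goal (the shortcut)
R: the motivation to reach the goal is eroded (which hurts real progress)

"Eroding goals" describes how, under pressure to hit a target, we stop doing the right thing and instead hit the target by lowering it. Real measures to speed up usually take longer to pay off. That delay is what gradually pushes us toward the upper balancing loop in the diagram, until postponing requirements and lowering targets become a habit.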

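Limits to growth

R: the business growth flywheel
B: the business growth ceiling (the population is finite, so growth cannot continue forever)

"Limits to growth" describes how a reinforcing loop cannot keep going on its own: at a larger scale there is always another factor (or balancing loop) that constrains it. That constraint is the limit to growth.

Tragedy of the commons

R: more pushes, more growth
B: the experience degrades and users vote with their feet (the outermost circle is a balancing loop)

"Tragedy of the commons" describes a limited resource that everyone shares (the push channel), where each individual (business unit) tries to maximize its own benefit.
The more units use it, the more it wears down users' trust in the platform experience.
As total push volume grows quickly and hits the limit of what users will tolerate, they find it unbearable and vote with their feet.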

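A more complex example

Back to the mental model of thinking

How to find suitable "variables"

Start from the goal or the problem
- Clarify the system goal: first determine the system's core goal (such as "increase product sales" or "reduce cost"), then work backward from the goal to the key factors that affect it.
- Focus on the core problem: start from a concrete problem and break down the drivers behind it. For example, "high employee turnover" may involve variables such as "salary level", "promotion opportunities", and "work pressure".
Decompose the system layer by layer
- MECE principle (mutually exclusive, collectively exhaustive): break a complex system into subsystems that do not overlap and together cover everything. For example, when analyzing a company's operations, split them into modules such as "market", "production", "finance", and "human resources", then extract variables from each.
- 5 Whys: keep asking "why" to dig out deeper variables.
Start from the participants' point of view
- List all stakeholders: identify the individuals or groups related to the system (such as PM, development, testing, DA) and analyze how their behavior affects the system.
- Look for cross-effects: pay attention to interactions between stakeholders. For example, "schedule pressure" may affect "software quality", which in turn affects "user experience".

The mental model of systems thinking

Thinking about a complex problem happens in layers: from the events on the surface (what is happening), to the patterns behind the events (what the trend is), to the structure of the problem (what explains the trend), and finally to values (the beliefs that drive the structure), each layer deeper than the last.

After drawing the causal loop diagram of your own business system, use this mental model to ask which layer your thinking has reached and whether you can drill down to a deeper one.

Note that "systems thinking" is only a tool. Different people facing the same system will draw different causal loop diagrams, because they know different amounts of information, look at the problem from different angles, and expect the system to develop in different directions.

So "systems thinking" is a tool that helps you keep zooming out and zooming in to see a complex problem completely and systematically; the process of using the tool helps you think about and understand the complex problem in front of you.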

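RNN: Recurrent Neural Networks

Published 2025-02-25

A recurrent neural network (RNN) is a class of neural network for processing sequence data. Unlike a traditional feedforward network, it has a form of "memory".
Its connections are not only feedforward but also recurrent: the hidden state from the previous time step is fed in together with the current input, which lets the network capture dependencies between different points in a sequence.

Main characteristics:
Sequence processing: RNNs are good at sequence data (text, time series, speech, and so on) and can take the previous state into account at every time step.
Weight sharing: an RNN uses the same weights at every time step, so it performs the same kind of computation at each point in time.
Memory: through its recurrent structure an RNN retains information from earlier inputs, which helps the model understand temporal dependencies.

Common sequence tasks: speech recognition, music generation, sentiment classification, DNA sequence analysis, machine translation, video activity recognition, named entity recognition

The following uses named entity recognition to show how a many-to-many RNN works when the output sequence has the same length as the input sequence.

Named entity recognition

Task: identify the entities in a text, such as names of people, places, and organizations
How it works:

Prepare a dictionary that maps each word to a number
Represent each word of the input sentence as a one-hot vector; each word is one time step of the sequence
Feed the sequence into the RNN, which processes the time steps one by one and emits an output at each step
Each output indicates whether the corresponding word is part of a named entity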
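The dictionary and one-hot steps above can be sketched as follows. This is a minimal illustration: the tiny vocabulary, the "<unk>" entry for unknown words, and the function name are assumptions made for the example.

```python
import numpy as np


def one_hot_encode(sentence, vocab):
    """Return a (vocab_size, T_x) matrix with one one-hot column per word."""
    words = sentence.lower().split()
    x = np.zeros((len(vocab), len(words)))
    for t, word in enumerate(words):
        # Words missing from the dictionary fall back to the "<unk>" index.
        idx = vocab.get(word, vocab["<unk>"])
        x[idx, t] = 1.0
    return x


# Hypothetical 4-word dictionary; a real one would hold thousands of words.
vocab = {"<unk>": 0, "harry": 1, "potter": 2, "met": 3}
x = one_hot_encode("Harry met Sally", vocab)
print(x.shape)  # one column per time step
```

Each column of `x` is the input x^t for one time step of the RNN.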

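RNN notation

RNN computation

Using a standard (fully connected) neural network for this task has three main problems

Inputs and outputs can have different lengths in different examples; padding everything to a maximum length gives a poor representation
Features learned at one position of the text are not shared with other positions
The input layer is huge (maximum number of words × 10,000, assuming a 10,000-word one-hot dictionary), so the first layer's weight matrix is very large

An RNN addresses these problems

Input and output lengths may differ between examples, and an RNN can process sequences of any length
Because the same weights are applied at every position, features learned at one position carry over to named entities at other positions

The recurrence

Reading words from left to right, once the first word has been processed, processing the second word takes not only the second word as input
but also the activation computed from the first word, and so on until the last word has been processed and an output has been produced for every word.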

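Forward propagation

Backward propagation

Implementation

Notation

Dimensions of the input

rnn cell vs rnn_cell_forward

rnn cell is a function that takes x^t and a^(t-1) as input and outputs a^t
rnn cell forward is a function that takes a^t as input and outputs y^t
In the diagram of a single time step, the solid line is rnn cell and the dashed line is rnn cell forward
This is the most basic RNN unit; the GRU and LSTM units, which mitigate vanishing gradients, are introduced later

recurrent neural network

An RNN is implemented by calling the rnn cell function in a loop over the time steps; the parameters passed to rnn cell are the same at every time step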
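The loop over time steps can be sketched in NumPy. This is an illustrative sketch, not the post's own implementation: the parameter names (Wax, Waa, Wya, ba, by), the tanh activation, and the softmax output are assumptions following a common convention, and here a single function computes both a^t and the prediction for one time step.

```python
import numpy as np


def softmax(z):
    """Column-wise softmax, shifted by the max for numerical stability."""
    e = np.exp(z - np.max(z, axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)


def rnn_cell_forward(xt, a_prev, params):
    """One time step: new hidden state a^t and prediction y^t."""
    a_next = np.tanh(params["Waa"] @ a_prev + params["Wax"] @ xt + params["ba"])
    yt_pred = softmax(params["Wya"] @ a_next + params["by"])
    return a_next, yt_pred


def rnn_forward(x, a0, params):
    """Run the cell over all T_x steps, reusing the same params at every step."""
    n_x, m, T_x = x.shape
    n_a = a0.shape[0]
    n_y = params["Wya"].shape[0]
    a = np.zeros((n_a, m, T_x))
    y_pred = np.zeros((n_y, m, T_x))
    a_next = a0
    for t in range(T_x):
        a_next, yt = rnn_cell_forward(x[:, :, t], a_next, params)
        a[:, :, t] = a_next
        y_pred[:, :, t] = yt
    return a, y_pred


rng = np.random.default_rng(0)
n_x, n_a, n_y, m, T_x = 3, 5, 2, 4, 6
params = {
    "Wax": rng.standard_normal((n_a, n_x)),
    "Waa": rng.standard_normal((n_a, n_a)),
    "Wya": rng.standard_normal((n_y, n_a)),
    "ba": np.zeros((n_a, 1)),
    "by": np.zeros((n_y, 1)),
}
a, y_pred = rnn_forward(rng.standard_normal((n_x, m, T_x)), np.zeros((n_a, m)), params)
print(a.shape, y_pred.shape)
```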

Situations when this RNN will perform better:

This will work well enough for some applications, but it suffers from the vanishing gradient problems.
The RNN works best when each output 𝑦̂ ⟨𝑡⟩ can be estimated using “local” context.
“Local” context refers to information that is close to the prediction’s time step 𝑡 .
More formally, local context refers to inputs 𝑥⟨𝑡′⟩ and predictions 𝑦̂ ⟨𝑡⟩ where 𝑡′ is close to 𝑡 .

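Different RNN architectures

Music generation –> one to many
Sentiment classification –> many to one
DNA sequence analysis –> many to many
Machine translation –> many to many (input and output lengths may differ)
Named entity recognition –> many to many (input and output lengths are equal)

Video activity recognition –> many to one when the whole clip gets one label, many to many when each frame is labeled
Speech recognition –> many to many (audio frames in, text out)
If the task maps an input time series to another sequence (for example speech to text, or per-frame action recognition), it is many to many.
If the task maps a whole sequence to a single class (for example classifying a whole video, or recognizing the emotion in speech), it is many to one.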

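Language model

In essence, a language model computes how probable an output is
For example, for the sentence "The apple and pear salad"
the probability is P(The) P(apple | The) P(and | The apple) … P(salad | The apple and pear)
where P(and | The apple) is the probability that the third word is "and" given that the first two words are "The apple"
So a language model is judged by the probability it assigns to normal, correct sentences: the higher the probability it gives to correct sentences, the better the model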
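The chain-rule product above can be sketched in code. The `cond_prob` callback is a hypothetical stand-in for a trained model: it is assumed to return P(word | history). Summing log probabilities instead of multiplying raw probabilities is a common way to avoid numerical underflow on long sentences.

```python
import math


def sentence_log_prob(tokens, cond_prob):
    """Sum of log P(token_t | tokens before t) over the whole sentence."""
    log_p = 0.0
    for t, token in enumerate(tokens):
        history = tuple(tokens[:t])
        log_p += math.log(cond_prob(history, token))
    return log_p


def uniform_model(history, token):
    """Toy model (assumption): every word has probability 1/4 whatever the history."""
    return 0.25


log_p = sentence_log_prob(["the", "apple", "and"], uniform_model)
print(math.exp(log_p))  # probability of the whole sentence under the toy model
```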

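Training

For a language model, take a single training example
The first step is tokenization
This gives the time steps of the training example: each word is one x^t
The end of the sentence is usually marked with an <EOS> token, which is also one-hot encoded
Words that are not in the dictionary are usually replaced by <UNK>, so that training and sampling can apply a specific policy to unknown words,
for example emitting them as usual, or skipping them and sampling again

The second step is training
At the first time step the input is a zero vector, and the output is a probability for every word in the dictionary being the first word
At each later step, let x^t = y^(t-1); ŷ^t is then the probability of the next word y^t given the preceding words, and this continues until the end of the sentence

Sampling

Sampling takes a word from the dictionary according to ŷ^t and uses it, instead of y^t, as the input to the next step
It works like training, except that from the second step on the input is a word drawn with np.random.choice according to the probabilities ŷ^t, and the model then predicts the next word
The stopping rule is up to you: if the dictionary contains <EOS>, sampling can stop when it is drawn; otherwise you can stop after a fixed number of words
This is the same as predicting the next word from the preceding words
Try to avoid <UNK>: if it is drawn, discard it and sample again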
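The sampling loop described above might look like the sketch below. The `next_word_probs` callback is a hypothetical stand-in for the trained RNN, and the toy model, vocabulary, and token spellings are assumptions made for the example.

```python
import numpy as np


def sample_sequence(next_word_probs, vocab, eos="<EOS>", unk="<UNK>", max_len=20, seed=0):
    """Draw words one at a time until <EOS> is drawn or max_len words are collected."""
    rng = np.random.default_rng(seed)
    words = []
    while len(words) < max_len:
        probs = next_word_probs(words)           # distribution over vocab given the words so far
        word = vocab[rng.choice(len(vocab), p=probs)]
        if word == unk:
            continue                             # discard <UNK> and sample again
        if word == eos:
            break                                # end-of-sentence token stops sampling
        words.append(word)
    return words


vocab = ["the", "cat", "<EOS>", "<UNK>"]


def toy_model(words):
    """Toy distribution (assumption): force <EOS> once three words have been produced."""
    if len(words) >= 3:
        return [0.0, 0.0, 1.0, 0.0]
    return [0.4, 0.4, 0.0, 0.2]


sampled = sample_sequence(toy_model, vocab)
print(sampled)
```

Resampling on <UNK> assumes the model gives other words non-zero probability; otherwise the loop would need an explicit retry limit.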

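Character-level language model

Like the word-level model above, except that tokenization goes down to individual characters: each character is one time step, and the dictionary is the 26 English letters plus the relevant symbols
The advantage is that <UNK> never occurs
The drawback is that sequences become much longer, so it captures dependencies within a sentence (earlier parts influencing later parts) less well than a word-level language model, which handles long-range relationships better
Training is also computationally expensive
It is mainly worth using for applications with a lot of unseen text or unknown words, or for domains with specialized vocabulary

Exercise

Vanishing and exploding gradients

The usual remedy for exploding gradients is gradient clipping:
if a gradient value exceeds a threshold, it is clipped or rescaled so that it always stays within the threshold range; see the Exercise above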
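Element-wise gradient clipping can be sketched in a few lines. The dictionary-of-arrays layout and the function name are assumptions for the example; clipping by the overall gradient norm is another common variant that is not shown here.

```python
import numpy as np


def clip_gradients(grads, max_value):
    """Clip every gradient entry into the range [-max_value, max_value]."""
    return {name: np.clip(g, -max_value, max_value) for name, g in grads.items()}


grads = {"dW": np.array([[-20.0, 3.0], [0.5, 12.0]])}
clipped = clip_gradients(grads, max_value=10.0)
print(clipped["dW"])
```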

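As for vanishing gradients: as in a deep network, the error at the output y has a hard time influencing the weights of the early layers
For an RNN, vanishing gradients mean that the output error at later time steps can hardly affect the computation at earlier time steps, so long-range dependencies within a sentence cannot be learned

Below are two remedies for vanishing gradients; both introduce a memory cell that carries features over long ranges

GRU unit

Gated Recurrent Unit (GRU)
It introduces a memory cell that stores features from earlier time steps, and uses an update gate to decide whether those features are passed on to later time steps, which gives long-range influence

The update gate is a sigmoid function. When its value is very close to 0, the memory cell keeps its previous value instead of taking the candidate value, so information is carried across many time steps and the vanishing gradient problem is alleviated
Γu: controls how much the current state blends the candidate value with the previous state.
Γr: decides how much of the previous memory cell is used when computing the current candidate value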
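A single GRU step with the two gates Γu and Γr can be sketched as below. The weight names (Wu, Wr, Wc and the matching biases) and the concatenation layout are assumptions following one common formulation; the example sets the update-gate bias to a large negative value to show that the memory cell is then carried over almost unchanged.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gru_cell_forward(xt, c_prev, p):
    """One GRU step: returns the new memory cell c^t."""
    concat = np.concatenate([c_prev, xt], axis=0)
    gamma_u = sigmoid(p["Wu"] @ concat + p["bu"])        # update gate
    gamma_r = sigmoid(p["Wr"] @ concat + p["br"])        # reset (relevance) gate
    concat_r = np.concatenate([gamma_r * c_prev, xt], axis=0)
    c_tilde = np.tanh(p["Wc"] @ concat_r + p["bc"])      # candidate value
    # Blend candidate and previous cell; gamma_u near 0 keeps the old memory.
    return gamma_u * c_tilde + (1.0 - gamma_u) * c_prev


rng = np.random.default_rng(0)
n_c, n_x = 3, 2
p = {
    "Wu": np.zeros((n_c, n_c + n_x)), "bu": -100.0 * np.ones((n_c, 1)),  # gate forced toward 0
    "Wr": rng.standard_normal((n_c, n_c + n_x)), "br": np.zeros((n_c, 1)),
    "Wc": rng.standard_normal((n_c, n_c + n_x)), "bc": np.zeros((n_c, 1)),
}
c_prev = rng.standard_normal((n_c, 1))
c_next = gru_cell_forward(rng.standard_normal((n_x, 1)), c_prev, p)
print(np.allclose(c_next, c_prev))
```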

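The gating mechanism alleviates the gradient problem, though somewhat less effectively than LSTM

LSTM (long short-term memory) unit

The LSTM unit predates the GRU unit and is more complex
It adds a forget gate and an output gate
The update gate now decides how much of the current input information is written into the cell state
The forget gate decides how much past information is forgotten
The output gate decides how much of the cell state affects the current hidden state
The activation no longer equals the memory cell; it is determined jointly by the output gate and the memory cell
Through the combination of these three gates, an LSTM controls information precisely and handles long-sequence dependencies effectively

The cell state alleviates vanishing gradients, which makes LSTM better suited to long sequences

Implementation of the LSTM unit

Implementation of a single LSTM unit

Computation and meaning of each state and gate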

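GRU unit vs LSTM unit

1. Structure

Item	GRU	LSTM
Gates	2 gates: update gate, reset gate	3 gates: input gate (the update gate above), forget gate, output gate
Memory	Updates the hidden state directly	Maintains a separate "cell state"
Computational complexity	Relatively low	Higher
Number of parameters	Fewer	More
Vanishing/exploding gradients	Gating alleviates the gradient problem, though somewhat less effectively than LSTM	The cell state alleviates vanishing gradients; better suited to long sequences

2. Comparison

Item	GRU	LSTM
Training time	Faster (fewer parameters, less computation)	Slower (more parameters, more computation)
Performance	Suited to medium-length dependencies	Better at long-sequence dependencies
Memory footprint	Low (fewer parameters)	High (more parameters)
Typical uses	Machine translation, speech recognition, text generation	Time series forecasting, long text processing

3. Application scenarios

GRU is suited to:
Devices with limited compute (such as mobile)
Tasks that need fast training and inference
Speech recognition, machine translation, and similar tasks

LSTM is suited to:
Long-sequence dependency problems
Tasks that need finer control over what is stored in memory
Generative tasks (such as text generation and music generation)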

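More: backpropagation through the basic RNN unit and the LSTM unit

Bidirectional RNN

First compute the activation at every time step in the forward direction, then compute a second set of activations in the backward direction (starting from the last time step)
The final output at each time step is determined jointly by the forward and backward activations
It is mainly used when the prediction has to take both the preceding and the following context into account
The advantage is that it can make a prediction at any position in the sentence using the whole sequence
The drawback is that it needs the complete sequence before it can predict anything; in speech recognition, for example, it has to wait until the speaker finishes the sentence

Deep RNN

A multi-layer recurrent network built from basic RNN units, GRU units, or LSTM units
Three recurrent layers is common; each layer has its own parameters, shared across all time steps
Deeper architectures exist, but the extra layers stacked on top are no longer connected recurrently
There are even bidirectional deep RNNs, but they are costly to train and need more compute and time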

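Convolutional Neural Networks

Published 2025-02-03

Background

If a neural network processes a 1000x1000 image directly, the input alone has 1000x1000x3 = 3 million features; with the parameters of the following layers on top,
the network needs a huge amount of memory and a long training time, and it is hard to collect enough data to prevent overfitting or to meet the computational requirements
So computer vision introduced convolution to make recognition on large images feasible

Convolution

Convolution is widely used for edge detection in image processing and computer vision.
Edge detection finds the regions of an image where pixel values change sharply; these regions usually correspond to object boundaries.
Convolution does this by applying a specific filter (or kernel) to the image.

The operation slides a small matrix (the kernel or filter) over the image and, at each position, computes the dot product of the kernel with the local region of the image.
The kernel is usually small (for example 3x3 or 5x5), while the image may be large.

Steps

Choose a kernel: pick a suitable kernel, for example a Sobel kernel, to detect edges in the image.
Slide the kernel: starting at the top-left corner, move the kernel across the image pixel by pixel. At each position, take the dot product of the kernel with the matching region of the image.
Compute the value: multiply the kernel element-wise with the pixels of the local region and sum the results to get the value at that position.
Build the output image: the values from all positions form a new image, in which each pixel represents the edge strength at the corresponding position of the original image.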
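The four steps above can be sketched with a naive NumPy loop. This is an illustration under assumptions: the function name, the optional `stride` and `pad` arguments, and the simple vertical-edge kernel (used here instead of a Sobel kernel) are choices made for the example. As is conventional in deep learning, the kernel is not flipped, so strictly speaking this is cross-correlation.

```python
import numpy as np


def conv2d(image, kernel, stride=1, pad=0):
    """Slide a square kernel over a 2-D image and sum the element-wise products."""
    if pad > 0:
        image = np.pad(image, pad, mode="constant")  # zero padding on all four sides
    n_h, n_w = image.shape
    f = kernel.shape[0]
    out_h = (n_h - f) // stride + 1
    out_w = (n_w - f) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + f, j * stride:j * stride + f]
            out[i, j] = np.sum(patch * kernel)
    return out


# 6x6 image: bright left half, dark right half, so there is one vertical edge in the middle.
image = np.hstack([np.full((6, 3), 10.0), np.zeros((6, 3))])
vertical_edge = np.array([[1.0, 0.0, -1.0]] * 3)
edges = conv2d(image, vertical_edge)
print(edges)
```

The non-zero columns in the middle of `edges` mark where the bright region meets the dark region.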

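padding

The computation above has two drawbacks
First, the output image is smaller than the input after each convolution; if the network has 100 layers and each shrinks the image a little, the final image may be only 1x1
Second, information at the image border is under-used: a corner pixel is covered by the kernel only once while pixels in the middle are covered many times, so different parts of the image do not contribute equally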

The main benefits of padding are the following:
It allows you to use a CONV layer without necessarily shrinking the height and width of the volumes. This is important for building deeper networks, since otherwise the height/width would shrink as you go to deeper layers. An important special case is the “same” convolution, in which the height/width is exactly preserved after one layer.
It helps us keep more of the information at the border of an image. Without padding, very few values at the next layer would be affected by pixels at the edges of an image.

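To address both problems, p rings of zeros are added around the border of the image. Border pixels are then covered by the kernel several times, and the output no longer shrinks
This is convolution with padding

There are two padding modes, valid and same
valid means no padding; the output size is n x n * f x f = (n-f+1) x (n-f+1)
same means padding so that the output has the same size as the input:
(n + 2p - f + 1) x (n + 2p - f + 1) = n x n
p = (f - 1) / 2
This is why the filter size f is usually odd: it allows symmetric padding (the same amount on all four sides); otherwise the padding would be asymmetric (more on one side than the other)
An odd-sized filter also has a center pixel, which makes it easy to refer to the filter's position

stride

stride is the step by which the kernel slides over the image; the default is 1, meaning one pixel at a time
With a stride of 2 the kernel moves 2 pixels at a time, which reduces the computation and also shrinks the output
The output size is math.floor((n + 2p - f) / stride) + 1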
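The size formulas for valid padding, same padding, and stride can be checked with a small helper. The function names are hypothetical; the arithmetic follows the formulas in the text.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output side length for an n x n input, f x f filter, padding p, stride s."""
    return (n + 2 * p - f) // s + 1


def same_padding(f):
    """Padding that keeps the output size equal to the input size at stride 1."""
    if f % 2 == 0:
        raise ValueError("symmetric 'same' padding needs an odd filter size")
    return (f - 1) // 2


print(conv_output_size(6, 3))                     # valid convolution shrinks the image
print(conv_output_size(6, 3, p=same_padding(3)))  # same convolution keeps the size
print(conv_output_size(7, 3, s=2))                # stride 2 shrinks it further
```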

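Convolution over volumes

The discussion so far covers a single grayscale image. For a color image, the convolution is computed on each channel
and the per-channel results are summed to give the final output
The output size is the same as above, (n-f+1) x (n-f+1)
The number of channels in the kernel must equal the number of channels in the input, otherwise the convolution cannot be computed

When several filters are applied to a multi-channel image at once,
the output gains a dimension equal to the number of filters: each filter produces one output channel
The output size is (n-f+1) x (n-f+1) x number of filters

Convolutional layer

In the earlier networks each input x and weight w was a single number; here x and w are three-dimensional arrays
Each filter acts like one neuron whose input is a three-dimensional array
Each neuron's w is a three-dimensional array with the same depth as the previous layer's output, and its b is a single number
So each neuron outputs a two-dimensional matrix
and the outputs of all neurons stack into a three-dimensional array, like an image with several channels

How a single layer works

Implementation of the convolutional layer

Pooling layer

Pooling reduces the size of the model and speeds up computation
It is like a convolutional layer, except that within the filter window it takes the maximum or the average instead of computing a convolution
Note that the pooling filter size and stride are fixed hyperparameters: they are static throughout training and there are no parameters for gradient descent to learn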
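Max and average pooling on a single channel can be sketched as below. The function name and the default 2x2 window with stride 2 are assumptions for the example; the layer has no weights, only the fixed window size and stride.

```python
import numpy as np


def pool2d(x, f=2, stride=2, mode="max"):
    """Take the max (or mean) of every f x f window of a 2-D array."""
    n_h, n_w = x.shape
    out_h = (n_h - f) // stride + 1
    out_w = (n_w - f) // stride + 1
    reduce_fn = np.max if mode == "max" else np.mean
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + f, j * stride:j * stride + f]
            out[i, j] = reduce_fn(window)
    return out


x = np.arange(16, dtype=float).reshape(4, 4)
max_pooled = pool2d(x, mode="max")
avg_pooled = pool2d(x, mode="average")
print(max_pooled)
print(avg_pooled)
```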

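Convolutional neural network

Besides the convolutional and pooling layers above, the final layers of a convolutional network are fully connected layers
They work like the standard neural network from before: the output of the convolutional and pooling layers is flattened and then passed through fully connected computation

As convolutional and pooling layers are added, the image gets smaller and smaller while the number of channels grows
Pooling layers have no learnable parameters, and convolutional layers have far fewer learnable parameters than fully connected layers
The number of activations also decreases gradually going deeper into the network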
Advantages

Parameter sharing: A feature detector (such as a vertical edge detector) that’s useful in one part of the image is probably useful in another part of the image.
Sparsity of connections: In each layer, each output value depends only on a small number of inputs.

Parameter sharing reduces the number of parameters
Each output depends on only part of the input, so computation is fast

Lab

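Three common convolutional network architectures

LeNet - 5

Paper: LeCun et al., 1998. Gradient-based learning applied to document recognition

AlexNet

Paper: Krizhevsky et al., 2012. ImageNet classification with deep convolutional neural networks

VGG-16

It simplifies the network structure: every convolutional layer uses the same filter configuration, and every pooling layer uses the same configuration
The 16 refers to the number of layers that have weights (the convolutional and fully connected layers); pooling layers are not counted
The network is large, with about 138 million parameters
but its structure is regular:
each pooling layer halves the height and width of the image
the number of filters grows by whole multiples (doubling) as the network gets deeper

Paper: Simonyan & Zisserman 2015. Very deep convolutional networks for large-scale image recognition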

Residual networks

The problem of very deep neural networks

The main benefit of a very deep network is that it can represent very complex functions. It can also learn features at many different levels of abstraction, from edges (at the shallower layers, closer to the input) to very complex features (at the deeper layers, closer to the output).

However, using a deeper network doesn’t always help. A huge barrier to training them is vanishing gradients: very deep networks often have a gradient signal that goes to zero quickly, thus making gradient descent prohibitively slow.

More specifically, during gradient descent, as you backprop from the final layer back to the first layer, you are multiplying by the weight matrix on each step, and thus the gradient can decrease exponentially quickly to zero (or, in rare cases, grow exponentially quickly and “explode” to take very large values).

During training, you might therefore see the magnitude (or norm) of the gradient for the shallower layers decrease to zero very rapidly as training proceeds:

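Residual block

When computing a^(l+2), the activation function no longer takes z^(l+2) alone as input but z^(l+2) + a^(l)
That is, a^(l+2) = g(z^(l+2) + a^(l))
The two layers together with this added a^(l) term form a residual block
More generally, a^(l) does not have to be added exactly two layers later; it can be added further down the network
This extra path alongside the main path is called a shortcut or skip connection

Residual network

Stacking many residual blocks gives a residual network. It lets training behave the way theory predicts: as the network gets deeper, the training error keeps going down

Why residual networks work

From the residual block
a^(l+2) = g(z^(l+2) + a^(l))
a^(l+2) = g(w^(l+2) * a^(l+1) + b^(l+2) + a^(l))
If L2 regularization (weight decay) during gradient descent shrinks w^(l+2) and b^(l+2) toward 0,
the residual block reduces to
a^(l+2) = g(a^(l))
If the activation function is ReLU, then, because a^(l) is itself a ReLU output and therefore non-negative,
a^(l+2) = a^(l)
In other words, by adding a skip connection before the activation function,
the block can easily learn the identity mapping without hurting the network's performance,
so a^(l+2) = a^(l) can be used directly,
and the two hidden layers added in between at worst leave the network unchanged
Extending the idea: with skip connections added to deeper layers, even up to the last layer of the network,
those layers may learn something more useful than the identity mapping and so improve the whole network

Without skip connections, as the network gets deeper it becomes very hard for the deep layers to find suitable parameters, let alone learn the identity mapping
That is why, for plain networks, training gets worse rather than better as depth increases

So adding skip connections to form a residual network starts from a point that does not hurt performance (the identity mapping), and gradient descent can only improve from there,
which mitigates vanishing or exploding gradients and improves the overall network

The discussion above assumes that a^(l+2) and a^(l) have the same dimensions, in fully connected or convolutional layers
In convolutional layers where a^(l+2) and a^(l) have different dimensions, a^(l) needs an extra parameterized transformation (for example a convolution applied before the addition) so that it matches the dimensions of a^(l+2)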
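The identity-mapping argument can be checked numerically with a fully connected residual block. This is a sketch under assumptions: the function and weight names are hypothetical, the layers are dense rather than convolutional, and a^(l) is taken to be non-negative, as it would be after a ReLU.

```python
import numpy as np


def relu(z):
    return np.maximum(0.0, z)


def residual_block(a_l, W1, b1, W2, b2):
    """Two dense layers with a skip connection added before the second activation."""
    a_1 = relu(W1 @ a_l + b1)   # a^(l+1)
    z_2 = W2 @ a_1 + b2         # z^(l+2)
    return relu(z_2 + a_l)      # a^(l+2) = g(z^(l+2) + a^(l))


rng = np.random.default_rng(0)
a_l = np.abs(rng.standard_normal((4, 3)))   # non-negative, like a ReLU output
W1, b1 = rng.standard_normal((4, 4)), np.zeros((4, 1))
W2, b2 = np.zeros((4, 4)), np.zeros((4, 1))  # weights shrunk all the way to zero

a_out = residual_block(a_l, W1, b1, W2, b2)
print(np.allclose(a_out, a_l))  # the block falls back to the identity mapping
```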

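Paper: He et al., 2015. Deep residual learning for image recognition

Lab

Inception network

1 x 1 convolution

Applying a 1x1 convolution to a three-dimensional volume takes one slice through all the channels at each position and computes a weighted sum over it
The number of 1x1 filters sets the number of output channels, so 1x1 convolutions can shrink or grow the channel dimension of the volume
and thereby change the dimensions of the input data

In the Inception network below, 1x1 convolutions are used to build the bottleneck layer:
the volume is first compressed and then expanded again, which greatly reduces the computational cost

Paper: [Lin et al., 2013. Network in network]

Inception network

Lets the network itself choose among filters and pooling, instead of a person deciding the filter types for each convolutional layer (1x1, 3x3, 5x5, 7x7, and how many of each)

Computing all of these convolutions directly is very expensive

A bottleneck layer can be introduced to reduce the computation

inception module

Add bottleneck layers to the ordinary convolutions and concatenate all the results at the end; this module is the inception module

inception network

Stacking several inception modules gives the inception network

Paper: Szegedy et al., 2014, Going Deeper with Convolutions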

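Transfer learning

Transfer learning for convolutional networks is no different from elsewhere: continue training from a publicly available model
With little training data, freeze the parameters of all hidden layers and change only the softmax layer so that it matches your own classes
This amounts to training only the parameters of the output layer

One way to speed this up: because the frozen hidden layers act as a fixed function, when the training set is small
you can precompute the last frozen layer's activations for all training examples and train on those activations, avoiding repeated computation

With more training data, freeze the first few layers and retrain only the later layers so that they match your own classes
This amounts to training only the parameters of the later layers

With a very large amount of training data, unfreeze all layers and train the whole network, using the existing parameters as initialization

Data augmentation

Images

Mirroring, random cropping, rotation, shearing, local warping

Color shifting (adding to or subtracting from the RGB values, PCA color augmentation)

Choosing an architecture and implementation

With little data, more hand engineering is usually needed
With a lot of data, simpler algorithms and less hand engineering usually suffice, and careful design is not required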


Use open source code
• Use architectures of networks published in the literature
• Use open source implementations if possible
• Use pretrained models and fine-tune on your dataset

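Object detection

Building on image classification, we can do classification with localization
but classification with localization deals with a single object
Classifying and localizing several objects in one image is object detection

Defining the output

On top of classification, the output is defined as a vector that contains both the class and the location
pc indicates whether a target object is present: 1 if present, 0 if not; when it is 0 the remaining values are ignored
bx: x coordinate of the object's center
by: y coordinate of the object's center
bh: height of the object relative to the height of the whole image
bw: width of the object relative to the width of the whole image
c1 c2 c3 indicate which class the object is: when one of them is 1 the other two are 0, and pc is 1

The loss function is computed in two different ways depending on whether pc is 1
In practice, the y vector may be split into pc, bx ~ bw, and c1 ~ c3, with a different loss used for each part

Landmark detection

Same principle as localization, except that the output vector holds the coordinates of N landmarks plus the pc value

Sliding windows for detection

Detecting whether an image contains an object is usually done by defining windows of different sizes and sliding them over the image to crop it,
then running recognition on each crop, which greatly increases the computation
The solution is to use convolution so that overlapping regions of neighboring windows are not recomputed
The whole image is fed in once and the detection results come out directly

To implement convolutional sliding windows, first turn the FC layers into convolutional layers so that the result for every window comes out of a single pass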

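YOLO

The YOLO (you only look once) algorithm divides the image into a grid and detects objects within each cell directly, which locates objects more accurately
The final output holds the classification and localization result for each cell

How to read the result?
Treat each cell as having (0,0) at its top-left corner and (1,1) at its bottom-right corner
bx and by are the object's center relative to the cell, so they are between 0 and 1
bh and bw are also expressed as a proportion of the cell size and may be greater than 1, which means the object spans several cells

Intersection over union (IoU)

Used to evaluate how good a localization algorithm is
The area where the predicted box overlaps the ground-truth box is the intersection –> S1
The total area covered by the two boxes (the sum of their areas minus the intersection) is the union –> S2
IoU = S1/S2
The threshold is chosen by hand; 1 is of course the best and means the object was located exactly
In YOLO, IoU is used in non_max_suppression to eliminate overlapping boxes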
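The IoU computation can be sketched for axis-aligned boxes. The (x1, y1, x2, y2) corner format and the function name are assumptions for the example; YOLO's own (bx, by, bh, bw) outputs would first have to be converted to corners.

```python
def iou(box1, box2):
    """Intersection over union of two boxes given as (x1, y1, x2, y2) corners."""
    xi1, yi1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    xi2, yi2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    # Width and height of the overlap are clamped at 0 for disjoint boxes.
    inter = max(0.0, xi2 - xi1) * max(0.0, yi2 - yi1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - inter
    return inter / union


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # partial overlap
print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # no overlap
```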

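Non-max suppression (non_max_suppression)

Handles the case where several detections qualify for the same object

The idea is to keep the highest-probability box among the qualifying ones and drop the boxes that overlap it heavily

From the outputs of all cells, discard the boxes whose pc is below a threshold
Loop over the remaining boxes: take the box A with the largest pc as a final detection, compute the IoU of every other box with A, and discard the boxes whose IoU exceeds a threshold (they are near the maximum); repeat on whatever remains
When detecting several classes at once, run non-max suppression separately for each class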
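The filtering loop described above can be sketched for a single class. The thresholds, the (x1, y1, x2, y2) box format, and the function names are assumptions for the example, and a small IoU helper is included so the snippet stands on its own.

```python
def iou(box1, box2):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    xi1, yi1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    xi2, yi2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    inter = max(0.0, xi2 - xi1) * max(0.0, yi2 - yi1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    return inter / (area1 + area2 - inter)


def non_max_suppression(boxes, scores, score_threshold=0.6, iou_threshold=0.5):
    """Return the indices of the boxes kept for one class."""
    # Step 1: drop low-confidence boxes, then sort the rest by score, best first.
    candidates = [i for i, s in enumerate(scores) if s >= score_threshold]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    keep = []
    # Step 2: repeatedly keep the best box and remove boxes that overlap it too much.
    while candidates:
        best = candidates.pop(0)
        keep.append(best)
        candidates = [i for i in candidates if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep


boxes = [(0, 0, 2, 2), (0.1, 0, 2.1, 2), (5, 5, 7, 7), (0, 0, 1, 1)]
scores = [0.9, 0.8, 0.7, 0.3]
kept = non_max_suppression(boxes, scores)
print(kept)
```

For several classes, the same function would be called once per class on that class's boxes and scores.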

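Anchor boxes

Handle the case where two objects fall in the same cell
Define two anchor boxes of different shapes in advance and use a different anchor box for each object
Then, based on the IoU with each anchor box, decide which anchor box an object in the cell belongs to and fill in the values at that anchor box's position in the output
Note that the output grows along the channel dimension: instead of a single result per cell there are now two
Three objects in the same cell is rare and is not considered here

Summary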

Summary for YOLO:
Input image (608, 608, 3)
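Image is split into a grid of 19x19 cells. Each cell predicts 5 boxes (uses 5 anchor boxes).
That is, the image is cut directly into a 19x19 grid and each cell is checked against 5 anchor boxes, which avoids the repeated computation of sliding windows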
The input image goes through a CNN, resulting in a (19,19,5,85) dimensional output.
After flattening the last two dimensions, the output is a volume of shape (19, 19, 425):
Each cell in a 19x19 grid over the input image gives 425 numbers.
425 = 5 x 85 because each cell contains predictions for 5 boxes, corresponding to 5 anchor boxes, as seen in lecture.
85 = 5 + 80 where 5 is because $(p_c, b_x, b_y, b_h, b_w)$ has 5 numbers, and 80 is the number of classes we’d like to detect
You then select only few boxes based on:
Score-thresholding: throw away boxes that have detected a class with a score less than the threshold
Non-max suppression: Compute the Intersection over Union and avoid selecting overlapping boxes
This gives you YOLO’s final output.

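Paper (difficult): Redmon et al., 2015, You Only Look Once: Unified real-time object detection

Lab

R-CNN

Uses an image segmentation algorithm to propose regions, then runs recognition and localization within those regions
Its drawback is that it is slower than YOLO's single pass

Face recognition

Verification and recognition

Verification is the simpler of the two, a 1-to-1 problem; recognition takes only an image as input and has to match it against K people, so it is harder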

Face recognition problems commonly fall into two categories:

Face Verification - “is this the claimed person?”. For example, at some airports, you can pass through customs by letting a system scan your passport and then verifying that you (the person carrying the passport) are the correct person. A mobile phone that unlocks using your face is also using face verification. This is a 1:1 matching problem.

Face Recognition - “who is this person?”. For example, the video lecture showed a face recognition video of Baidu employees entering the office without needing to otherwise identify themselves. This is a 1:K matching problem.

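one shot & similarity

One-shot learning means recognizing an input image from a single existing image of the person
A conventional classifier trained this way overfits easily, and the model has to be retrained whenever the database changes, which is costly

The solution is to verify images by their similarity: compare the input image with the images in the database,
if some image differs from it by less than a threshold, the person is in the database; if none does, the person is not in the database

Siamese network

A network architecture for verifying images by similarity
The idea is to run both images through the same network to obtain a 128-dimensional encoding that represents each image as well as possible
Then take the norm of the difference between the two encodings: it is small for the same person and large otherwise

Training also works by comparing these norms and adjusting the model parameters accordingly

Paper: Taigman et. al., 2014. DeepFace closing the gap to human level performance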

So, an encoding is a good one if:

The encodings of two images of the same person are quite similar to each other.
The encodings of two images of different persons are very different.

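Triplet loss

During training, the loss is computed on triplets of images
Concretely, prepare the encodings of three images: the anchor A, a positive P that is similar to A (the same person), and a negative N that differs strongly from A (a different person)
The loss is based on comparing the distance between A and P with the distance between A and N
A margin hyperparameter α is introduced to widen the gap between the AP and AN distances, which improves accuracy and also prevents the trivial solution where every encoding is 0 and the comparison always holds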
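The comparison of the AP and AN distances with a margin can be sketched as below. The squared L2 distance, the max(…, 0) form, the default α = 0.2, and the (batch, dimension) layout are assumptions based on the common formulation of the triplet loss.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Sum over the batch of max(||A-P||^2 - ||A-N||^2 + alpha, 0)."""
    pos_dist = np.sum((anchor - positive) ** 2, axis=-1)
    neg_dist = np.sum((anchor - negative) ** 2, axis=-1)
    return float(np.sum(np.maximum(pos_dist - neg_dist + alpha, 0.0)))


# A well-separated triplet: P matches A, N is far away, so the loss is zero.
a = np.array([[1.0, 0.0]])
p = np.array([[1.0, 0.0]])
n = np.array([[0.0, 1.0]])
print(triplet_loss(a, p, n))

# All-zero encodings no longer satisfy the constraint: the margin makes the loss positive.
zeros = np.zeros((1, 2))
print(triplet_loss(zeros, zeros, zeros))
```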

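Paper: Schroff et al., 2015, FaceNet: A unified embedding for face recognition and clustering

Siamese network + binary classification

On top of a Siamese network, compute an output of 0 or 1 from the two input images that directly says whether they show the same person

ŷ can be computed in several ways
In practice the encodings of the database images can be precomputed, so at verification time only the input image's encoding has to be computed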

Key points to remember
Face verification solves an easier 1:1 matching problem; face recognition addresses a harder 1:K matching problem.
The triplet loss is an effective loss function for training a neural network to learn an encoding of a face image.
The same encoding can be used for verification and recognition. Measuring distances between two images’ encodings allows you to determine whether they are pictures of the same person.

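Lab

Neural style transfer

Combine a content image C with a style image S that has an artistic style, generating a new image G that has the artistic style of S and the content of C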

The idea of using a network trained on a different task and applying it to a new task is called transfer learning.

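Visualizing deep networks

(To read)
Paper: Zeiler and Fergus., 2013, Visualizing and understanding convolutional networks

Cost function

It has two parts: the content cost Jcg between C and G, which measures content similarity, and the style cost Jsg between S and G, which measures style similarity

Paper: Gatys et al., 2015. A neural algorithm of artistic style.

Implementation

Content cost

Compare the activations of C and G at one layer of the same pretrained model; if the activations are similar, the images have similar content

Style cost

If two channels are highly correlated, the style features they detect tend to appear together
The style of an image is measured by the correlations between channels of the activations at a given layer
If S and G are dissimilar in style, the cost is high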
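The channel correlations can be computed as a Gram matrix of one layer's activations. The sketch below makes several assumptions: the (n_H, n_W, n_C) activation layout, the function names, and the 1 / (4 · n_C² · (n_H · n_W)²) normalization, which is one commonly used constant rather than the only possible choice.

```python
import numpy as np


def gram_matrix(a):
    """Channel-by-channel correlation matrix of activations shaped (n_H, n_W, n_C)."""
    n_h, n_w, n_c = a.shape
    unrolled = a.reshape(n_h * n_w, n_c).T   # one row per channel
    return unrolled @ unrolled.T             # shape (n_C, n_C)


def layer_style_cost(a_s, a_g):
    """Squared difference between the Gram matrices of the style and generated images."""
    n_h, n_w, n_c = a_s.shape
    diff = gram_matrix(a_s) - gram_matrix(a_g)
    return float(np.sum(diff ** 2) / (4 * n_c ** 2 * (n_h * n_w) ** 2))


rng = np.random.default_rng(0)
a_s = rng.standard_normal((4, 4, 3))   # stand-in for the style image's activations
a_g = rng.standard_normal((4, 4, 3))   # stand-in for the generated image's activations
print(layer_style_cost(a_s, a_g))      # positive when the styles differ
print(layer_style_cost(a_s, a_s))      # zero when the activations are identical
```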

Implementation

Summary

What you should remember
Neural Style Transfer is an algorithm that given a content image C and a style image S can generate an artistic image
It uses representations (hidden layer activations) based on a pretrained ConvNet.
The content cost function is computed using one hidden layer’s activations.
The style cost function for one layer is computed using the Gram matrix of that layer’s activations. The overall style cost function is obtained using several hidden layers.
Optimizing the total cost function results in synthesizing new images.

You can also tune your hyperparameters:

Which layers are responsible for representing the style? STYLE_LAYERS
How many iterations do you want to run the algorithm? num_iterations
What is the relative weighting between content and style? alpha/beta

Takeaway: this is the first time building a model in which the optimization algorithm updates the pixel values rather than the neural network’s parameters. Deep learning has many different types of models and this is only one of them!

Lab

1D, 2D, and 3D convolution

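Hyperparameter Tuning

Published 2025-01-31

How to choose hyperparameter values

Direct tuning

Rank the hyperparameters by importance, based on experience

Learning rate
β for momentum (~0.9); number of hidden units; mini-batch size
Number of hidden layers; learning rate decay
Adam's β1 (~0.9), β2 (~0.999), and ε (10^-8): these three are usually left fixed and rarely need tuning

Choose values by random sampling followed by a finer search
Random sampling quickly shows which hyperparameters matter most; once a range is identified, a finer search within it improves them further, going from coarse to fine

Borrow from existing models in other domains and reuse their settings as candidates to test and evaluate

Choose how to tune according to available compute

1. With limited compute, train and tune one model at a time, trading time for results
2. With ample compute, train many models with different hyperparameters at the same time, trading parallelism for time

Batch normalization

Normalizing the input data speeds up training; by the same principle, batch normalization also normalizes the inputs to the hidden layers
This makes each layer less sensitive to changes in the outputs of the previous layer and weakens the coupling between layers, so each layer can learn more independently, which speeds up training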
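The normalization of a hidden layer's inputs can be sketched for one mini-batch. The learnable scale `gamma` and shift `beta`, the `eps` constant, and the (units, examples) layout are assumptions taken from the usual formulation of batch normalization; the text above does not spell them out.

```python
import numpy as np


def batch_norm_forward(z, gamma, beta, eps=1e-8):
    """Normalize each hidden unit over the mini-batch, then rescale and shift."""
    mu = z.mean(axis=1, keepdims=True)       # per-unit mean over the batch
    var = z.var(axis=1, keepdims=True)       # per-unit variance over the batch
    z_norm = (z - mu) / np.sqrt(var + eps)
    return gamma * z_norm + beta


rng = np.random.default_rng(0)
z = rng.standard_normal((3, 50)) * 5.0 + 2.0   # 3 hidden units, 50 examples
z_bn = batch_norm_forward(z, gamma=np.ones((3, 1)), beta=np.zeros((3, 1)))
print(z_bn.mean(axis=1), z_bn.var(axis=1))     # close to 0 and 1 for every unit
```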

Softmax regression

Softmax is a function used for multi-class classification, usually applied at the output layer of a neural network to turn the network's outputs into a probability distribution.
It converts an unnormalized vector (the raw network outputs) into a normalized distribution in which every class probability lies between 0 and 1 and all probabilities sum to 1.

Learning the TensorFlow framework

What you should remember:
- Tensorflow is a programming framework used in deep learning
- The two main object classes in tensorflow are Tensors and Operators.
- When you code in tensorflow you have to take the following steps:
  - Create a graph containing Tensors (Variables, Placeholders …) and Operations (tf.matmul, tf.add, …)
  - Create a session
  - Initialize the session
  - Run the session to execute the graph
- You can execute the graph multiple times as you’ve seen in model()
- The backpropagation and optimization is automatically done when running the session on the “optimizer” object.

Lab
 Exercise code

Optimization methods for gradient descent

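Published 2025-01-31

Mini-batch gradient descent

Within one pass over the training set (an epoch), the gradient is no longer computed on the whole training set at once; the data is split into batches and trained batch by batch, and the number of examples in each batch is the batch size

When the batch size takes its maximum value, the full number of training examples m, there is a single batch per pass, which is standard batch gradient descent:
the forward pass and the loss have to be computed over the whole training set, which takes a long time, before a single gradient descent step can be taken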

# (Batch) Gradient Descent:
X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
    # Forward propagation
    a, caches = forward_propagation(X, parameters)
    # Compute cost.
    cost = compute_cost(a, Y)
    # Backward propagation.
    grads = backward_propagation(a, caches, parameters)
    # Update parameters.
    parameters = update_parameters(parameters, grads)

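When the batch size takes its minimum value of 1, each batch holds a single example. Each update is fast, but vectorization cannot be used to speed up the computation. This is stochastic gradient descent (SGD):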

# Stochastic Gradient Descent:
X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
    cost = 0  # cost accumulated over this pass through the data
    for j in range(0, m):
        # Forward propagation
        a, caches = forward_propagation(X[:,j], parameters) # X[:,j] takes one of the m columns at a time, i.e. one example
        # Compute cost
        cost += compute_cost(a, Y[:,j])
        # Backward propagation
        grads = backward_propagation(a, caches, parameters)
        # Update parameters.
        parameters = update_parameters(parameters, grads)

When the batch size is between 1 and m, this is mini-batch gradient descent
Preparing the data takes two steps:
first, shuffle: randomize the order of the data set while keeping X(i) and Y(i) paired
second, partition: split the data into batches of batch size examples each; if m is not divisible by the batch size, the last batch has fewer than batch size examples and needs separate handling

Shuffling and Partitioning are the two steps required to build mini-batches.
Powers of two are often chosen to be the mini-batch size, e.g., 16, 32, 64, 128.

import math
import numpy as np

def random_mini_batches(X, Y, mini_batch_size = 64, seed = 0):
    """
    Creates a list of random minibatches from (X, Y)
    
    Arguments:
    X -- input data, of shape (input size, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples)
    mini_batch_size -- size of the mini-batches, integer
    
    Returns:
    mini_batches -- list of synchronous (mini_batch_X, mini_batch_Y)
    """
    
    np.random.seed(seed)            # To make your "random" minibatches the same as ours
    m = X.shape[1]                  # number of training examples
    mini_batches = []
        
    # Step 1: Shuffle (X, Y)
    permutation = list(np.random.permutation(m))
    shuffled_X = X[:, permutation]
    shuffled_Y = Y[:, permutation].reshape((1,m))

    # Step 2: Partition (shuffled_X, shuffled_Y). Minus the end case.
    num_complete_minibatches = math.floor(m/mini_batch_size) # number of mini batches of size mini_batch_size in your partitioning
    for k in range(0, num_complete_minibatches):
        mini_batch_X = shuffled_X[:, k * mini_batch_size : (k + 1) * mini_batch_size]
        mini_batch_Y = shuffled_Y[:, k * mini_batch_size : (k + 1) * mini_batch_size]
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)
    
    # Handling the end case (last mini-batch < mini_batch_size)
    if m % mini_batch_size != 0:
        mini_batch_X = shuffled_X[:, num_complete_minibatches * mini_batch_size :]
        mini_batch_Y = shuffled_Y[:, num_complete_minibatches * mini_batch_size :]
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)
    
    return mini_batches

The iterations then run over the mini_batches built above

mini_batches = random_mini_batches(X, Y, mini_batch_size = 64, seed = 0)
for i in range(0, num_iterations):
    cost = 0  # cost accumulated over this pass through the data
    for mini_batch_X, mini_batch_Y in mini_batches:
        # Forward propagation
        a, caches = forward_propagation(mini_batch_X, parameters) # update after each batch instead of after the whole training set
        # Compute cost
        cost += compute_cost(a, mini_batch_Y)
        # Backward propagation
        grads = backward_propagation(a, caches, parameters)
        # Update parameters.
        parameters = update_parameters(parameters, grads)

The difference between gradient descent, mini-batch gradient descent and stochastic gradient descent is the number of examples you use to perform one update step.
- You have to tune a learning rate hyperparameter 𝛼.
- With a well-tuned mini-batch size, usually it outperforms either gradient descent or stochastic gradient descent (particularly when the training set is large).

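momentum

Exponentially weighted average

The exponentially weighted moving average (EWMA) is a technique for smoothing time series data.
It computes an average in which data points get different weights: newer points weigh more, older points weigh less.
This reflects recent changes in the data while retaining the trend of the history.
The formula is:

St = β St-1 + (1-β) Xt
where:
St is the exponentially weighted average at time t.
Xt is the actual data value at time t.
β is the smoothing factor, between 0 and 1. A larger β makes the EWMA smoother; a smaller β makes it more sensitive to the latest data.
St-1 is the exponentially weighted average at time t-1.
Notes
Initial value: the initial average S0 is usually set either to 0 (combined with bias correction) or to the first data point X0.
Recursive computation: every new data point updates the EWMA; the new EWMA is a weighted sum of the current data point and the previous EWMA.
Smoothing factor: determines how much new data and history each influence the current EWMA. A larger β gives a smoother EWMA that depends more on history (it averages over roughly 1/(1-β) points); a smaller β makes the EWMA react more strongly to new data points.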

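Bias correction

When computing the EWMA, the choice of initial value strongly affects the early results. With S0 = 0 in particular,
there is not enough history at the start of the series, so the computed average is biased toward 0 and can be far from the true value.
The early results therefore need a correction to reduce this bias.
The bias-corrected value is:
St^ = St / (1 - β^t)
where:
St^ is the bias-corrected exponentially weighted average at time t.
St is the uncorrected exponentially weighted average at time t.
t is the time step, starting from 1, and β^t is β raised to the power t.
With this correction the EWMA is closer to the true value in the early steps; as t grows, β^t approaches 0 and the correction fades away.
The corrected EWMA reflects the trend and changes in the data more accurately.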
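The recursion and its bias correction can be sketched together. The function name and the choice of S0 = 0 are assumptions for the example; with a constant input series, the corrected average recovers the true value from the first step, while the uncorrected one starts near zero.

```python
def ewma(xs, beta=0.9, bias_correction=True):
    """Exponentially weighted average of xs, starting from S0 = 0."""
    s = 0.0
    averages = []
    for t, x in enumerate(xs, start=1):
        s = beta * s + (1 - beta) * x
        # Dividing by (1 - beta^t) undoes the pull toward the zero initial value.
        averages.append(s / (1 - beta ** t) if bias_correction else s)
    return averages


data = [10.0, 10.0, 10.0]
corrected = ewma(data, beta=0.9)
uncorrected = ewma(data, beta=0.9, bias_correction=False)
print(corrected)    # stays at the true level from the first step
print(uncorrected)  # starts far too low and climbs slowly
```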

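momentum

Momentum is an optimization algorithm that speeds up gradient descent by introducing the notion of momentum into the parameter update.
At each iteration it takes the gradient directions of previous iterations into account: the update direction is an exponentially weighted average of past gradients.
This makes the parameter updates smoother and speeds up convergence.
The formulas are:
VdW = β VdW + (1 - β) dW
Vdb = β Vdb + (1 - β) db
W = W - α VdW
b = b - α Vdb
where:
VdW is the momentum (velocity) of the weights W.
Vdb is the momentum (velocity) of the bias b.
dW is the gradient of the weights W.
db is the gradient of the bias b.
α is the learning rate.
β is the momentum factor, usually between 0.9 and 0.99.

The path taken by mini-batch gradient descent oscillates on its way to convergence; momentum reduces these oscillations.
(Substituting VdW into the update for W gives W = W - α dW - α β (VdW - dW), where VdW on the right is the previous velocity: compared with plain gradient descent, part of the step along the current gradient is replaced by the accumulated average, which damps sudden changes in direction.)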

import numpy as np

def initialize_velocity(parameters):
    """
    Initializes the velocity as a python dictionary with:
                - keys: "dW1", "db1", ..., "dWL", "dbL" 
                - values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters.
    Arguments:
    parameters -- python dictionary containing your parameters.
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    
    Returns:
    v -- python dictionary containing the current velocity.
                    v['dW' + str(l)] = velocity of dWl
                    v['db' + str(l)] = velocity of dbl
    """
    
    L = len(parameters) // 2 # number of layers in the neural networks
    v = {}
    
    # Initialize velocity
    for l in range(L):
        v["dW" + str(l+1)] = np.zeros_like(parameters["W" + str(l+1)])
        v["db" + str(l+1)] = np.zeros_like(parameters["b" + str(l+1)])
        
    return v


# GRADED FUNCTION: update_parameters_with_momentum

def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate):
    """
    Update parameters using Momentum
    
    Arguments:
    parameters -- python dictionary containing your parameters:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameters:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    v -- python dictionary containing the current velocity:
                    v['dW' + str(l)] = ...
                    v['db' + str(l)] = ...
    beta -- the momentum hyperparameter, scalar
    learning_rate -- the learning rate, scalar
    
    Returns:
    parameters -- python dictionary containing your updated parameters 
    v -- python dictionary containing your updated velocities
    """

    L = len(parameters) // 2 # number of layers in the neural networks
    
    # Momentum update for each parameter
    for l in range(L):
        
        # compute velocities
        v["dW" + str(l+1)] = beta * v["dW" + str(l+1)] + (1 - beta) * grads["dW" + str(l+1)]
        v["db" + str(l+1)] = beta * v["db" + str(l+1)] + (1 - beta) * grads["db" + str(l+1)]
        # update parameters
        parameters["W" + str(l+1)] -= learning_rate * v["dW" + str(l+1)]
        parameters["b" + str(l+1)] -= learning_rate * v["db" + str(l+1)]
        
    return parameters, v

Note that:

The velocity is initialized with zeros. So the algorithm will take a few iterations to “build up” velocity and start to take bigger steps.
If 𝛽=0 , then this just becomes standard gradient descent without momentum.
How do you choose 𝛽 ?

The larger the momentum 𝛽 is, the smoother the update because the more we take the past gradients into account. But if 𝛽 is too big, it could also smooth out the updates too much.
Common values for 𝛽 range from 0.8 to 0.999. If you don’t feel inclined to tune this, 𝛽=0.9 is often a reasonable default.
Tuning the optimal 𝛽 for your model might need trying several values to see what works best in terms of reducing the value of the cost function 𝐽 .

Momentum takes past gradients into account to smooth out the steps of gradient descent. It can be applied with batch gradient descent, mini-batch gradient descent or stochastic gradient descent.
- You have to tune a momentum hyperparameter 𝛽 and a learning rate 𝛼 .

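Adam

RMSprop

Root Mean Square Propagation
Like the exponentially weighted average, but it changes the update rule: instead of subtracting the learning rate times the gradient, it subtracts the learning rate times the gradient divided by the square root of an exponentially weighted average of the squared gradients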
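The RMSprop update described above can be sketched for a single parameter array. The function name, the default values (β = 0.9, ε = 1e-8), and placing ε outside the square root are assumptions; implementations differ slightly on these details.

```python
import numpy as np


def rmsprop_update(w, dw, s, learning_rate=0.001, beta=0.9, epsilon=1e-8):
    """One RMSprop step: returns the updated parameter and squared-gradient average."""
    s = beta * s + (1 - beta) * dw ** 2                   # running average of dw^2
    w = w - learning_rate * dw / (np.sqrt(s) + epsilon)   # step scaled by the RMS of the gradient
    return w, s


w = np.array([1.0])
dw = np.array([2.0])
s = np.zeros_like(w)
w_new, s_new = rmsprop_update(w, dw, s)
print(w_new, s_new)
```

Dividing by the root of the running average shrinks the step along directions whose gradients are consistently large and enlarges it where gradients are small.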

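Adam optimization algorithm

Combines momentum (an exponentially weighted average of the gradients) with RMSprop to update the parameters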

import numpy as np

def initialize_adam(parameters) :
    """
    Initializes v and s as two python dictionaries with:
                - keys: "dW1", "db1", ..., "dWL", "dbL" 
                - values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters.
    
    Arguments:
    parameters -- python dictionary containing your parameters.
                    parameters["W" + str(l)] = Wl
                    parameters["b" + str(l)] = bl
    
    Returns: 
    v -- python dictionary that will contain the exponentially weighted average of the gradient.
                    v["dW" + str(l)] = ...
                    v["db" + str(l)] = ...
    s -- python dictionary that will contain the exponentially weighted average of the squared gradient.
                    s["dW" + str(l)] = ...
                    s["db" + str(l)] = ...

    """
    
    L = len(parameters) // 2 # number of layers in the neural networks
    v = {}
    s = {}
    
    # Initialize v, s. Input: "parameters". Outputs: "v, s".
    for l in range(L):
        v["dW" + str(l+1)] = np.zeros_like(parameters["W" + str(l+1)])
        v["db" + str(l+1)] = np.zeros_like(parameters["b" + str(l+1)])
        s["dW" + str(l+1)] = np.zeros_like(parameters["W" + str(l+1)])
        s["db" + str(l+1)] = np.zeros_like(parameters["b" + str(l+1)])
    
    return v, s

# GRADED FUNCTION: update_parameters_with_adam

def update_parameters_with_adam(parameters, grads, v, s, t, learning_rate = 0.01,
                                beta1 = 0.9, beta2 = 0.999,  epsilon = 1e-8):
    """
    Update parameters using Adam
    
    Arguments:
    parameters -- python dictionary containing your parameters:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameters:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    v -- Adam variable, moving average of the first gradient, python dictionary
    s -- Adam variable, moving average of the squared gradient, python dictionary
    learning_rate -- the learning rate, scalar.
    beta1 -- Exponential decay hyperparameter for the first moment estimates 
    beta2 -- Exponential decay hyperparameter for the second moment estimates 
    epsilon -- hyperparameter preventing division by zero in Adam updates

    Returns:
    parameters -- python dictionary containing your updated parameters 
    v -- Adam variable, moving average of the first gradient, python dictionary
    s -- Adam variable, moving average of the squared gradient, python dictionary
    """
    
    L = len(parameters) // 2                 # number of layers in the neural networks
    v_corrected = {}                         # Initializing first moment estimate, python dictionary
    s_corrected = {}                         # Initializing second moment estimate, python dictionary
    
    # Perform Adam update on all parameters
    for l in range(L):
        # Moving average of the gradients. Inputs: "v, grads, beta1". Output: "v".
        v["dW" + str(l+1)] = beta1 * v["dW" + str(l+1)] + (1 - beta1) * grads["dW" + str(l+1)]
        v["db" + str(l+1)] = beta1 * v["db" + str(l+1)] + (1 - beta1) * grads["db" + str(l+1)]

        # Compute bias-corrected first moment estimate. Inputs: "v, beta1, t". Output: "v_corrected".
        v_corrected["dW" + str(l+1)] = v["dW" + str(l+1)] / (1 - beta1 ** t)
        v_corrected["db" + str(l+1)] = v["db" + str(l+1)] / (1 - beta1 ** t)

        # Moving average of the squared gradients. Inputs: "s, grads, beta2". Output: "s".
        s["dW" + str(l+1)] = beta2 * s["dW" + str(l+1)] + (1 - beta2) * grads["dW" + str(l+1)] ** 2
        s["db" + str(l+1)] = beta2 * s["db" + str(l+1)] + (1 - beta2) * grads["db" + str(l+1)] ** 2

        # Compute bias-corrected second raw moment estimate. Inputs: "s, beta2, t". Output: "s_corrected".
        s_corrected["dW" + str(l+1)] = s["dW" + str(l+1)] / (1 - beta2 ** t)
        s_corrected["db" + str(l+1)] = s["db" + str(l+1)] / (1 - beta2 ** t)

        # Update parameters. Inputs: "parameters, learning_rate, v_corrected, s_corrected, epsilon". Output: "parameters".
        parameters["W" + str(l+1)] -= learning_rate * v_corrected["dW" + str(l+1)] / (np.sqrt(s_corrected["dW" + str(l+1)]) + epsilon)
        parameters["b" + str(l+1)] -= learning_rate * v_corrected["db" + str(l+1)] / (np.sqrt(s_corrected["db" + str(l+1)]) + epsilon)

    return parameters, v, s

How the Adam algorithm works

Experiment

Exercise for the three algorithms above

Summary

Momentum usually helps, but given the small learning rate and the simplistic dataset, its impact is almost negligible. Also, the huge oscillations you see in the cost come from the fact that some minibatches are more difficult than others for the optimization algorithm.

Adam on the other hand, clearly outperforms mini-batch gradient descent and Momentum. If you run the model for more epochs on this simple dataset, all three methods will lead to very good results. However, you’ve seen that Adam converges a lot faster.

In the experiment, Adam converged faster and reached higher accuracy.

Some advantages of Adam include:

Relatively low memory requirements (though higher than gradient descent and gradient descent with momentum)
Usually works well even with little tuning of hyperparameters (except 𝛼 )

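Learning rate decay

In mini-batch gradient descent, gradually lowering the learning rate as training moves through the batches damps the oscillation of the updates and speeds up convergence.
Some learning rate decay schedules: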
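The schedules themselves are not reproduced in the text, so the following is only a sketch of three commonly used forms. The function names and the sample values are illustrative assumptions, not taken from the original.

```python
import math

def inverse_time_decay(alpha0, decay_rate, epoch):
    """alpha = alpha0 / (1 + decay_rate * epoch)."""
    return alpha0 / (1.0 + decay_rate * epoch)

def exponential_decay(alpha0, base, epoch):
    """alpha = alpha0 * base ** epoch, with 0 < base < 1."""
    return alpha0 * base ** epoch

def sqrt_decay(alpha0, k, epoch):
    """alpha = alpha0 * k / sqrt(epoch), defined for epoch >= 1."""
    return alpha0 * k / math.sqrt(epoch)

# Print how the three schedules shrink the learning rate over a few epochs.
for epoch in range(1, 5):
    print(epoch,
          inverse_time_decay(0.2, 1.0, epoch),
          exponential_decay(0.2, 0.95, epoch),
          sqrt_decay(0.2, 1.0, epoch))
```

All three start near alpha0 and decrease with the epoch number; which one works best is a tuning question.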

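Two rules of thumb

Unlikely to get stuck in a bad local optimum: in high-dimensional parameter spaces, points with zero gradient are far more often saddle points than local optima.
Plateaus can make learning slow: on flat regions the gradient is close to zero, so it takes longer to find a better solution.

Some methods for optimizing the training process of deep neural networks

Posted on 2025-01-29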

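Using the dataset

A dataset is usually split into three parts: the training set, the cross-validation set (dev set), and the test set.
The training set is used to train the model.
The cross-validation set is used to tune hyperparameters and to compare the performance of different models or algorithms.
The test set is used for an unbiased evaluation of the final model (for example, whether it underfits or overfits).

If no final unbiased evaluation is needed, the cross-validation set and the test set can be merged into a single set used for model selection and evaluation.
After the test set is merged into the cross-validation set, only a training set and a combined dev/test set remain.

Also make sure that the training data and the dev/test data come from the same distribution; otherwise the model may do much worse on the data it is evaluated on.
For example, if the training images come from the web (high resolution, sharp) but the test images come from phones (low resolution, blurry),
the trained model may perform poorly on the test set.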

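Evaluating the model with bias and variance

High bias usually means underfitting.
High variance usually means overfitting.
Review: the definitions of bias and variance
Fixing overfitting
 High bias and high variance

Regularization for overfitting

Adding a regularization term to the cost function applies weight decay, which lowers the model's variance and helps avoid overfitting.

L2 implementation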
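The implementation itself is not included in the text. As a minimal sketch, assuming parameters stored in a dictionary with keys "W1", "b1", ... as elsewhere in these notes, the L2 term and its contribution to the gradient could look like this (the helper names `l2_penalty` and `l2_grad_term` are made up for illustration):

```python
import numpy as np

def l2_penalty(parameters, lambd, m):
    """L2 term added to the cost: (lambd / (2 * m)) * sum of squared weights.

    Only the weight matrices (keys starting with "W") are penalized;
    biases are left out, as is common practice.
    """
    total = sum(np.sum(np.square(W))
                for name, W in parameters.items() if name.startswith("W"))
    return lambd / (2 * m) * total

def l2_grad_term(W, lambd, m):
    """Extra term the penalty adds to dW: (lambd / m) * W."""
    return lambd / m * W

params = {"W1": np.array([[1.0, 2.0], [3.0, 4.0]]), "b1": np.zeros((2, 1))}
print(l2_penalty(params, lambd=0.1, m=5))
print(l2_grad_term(params["W1"], lambd=0.1, m=5))
```

Because the gradient gains a term proportional to W, every update pulls the weights slightly toward zero, which is where the name weight decay comes from.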

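Dropout regularization

Dropout generates a random mask for each layer: nodes are dropped at random so that a fraction keep_prob survives, the surviving activations take part in the computation, and the result is divided by keep_prob so that the expected value of the layer's output stays the same.
During backpropagation, the mask from the forward pass is applied to the gradients in the same way, so that the same nodes stay switched off within one iteration. The forward mask therefore has to be cached.

Where L2 shrinks w, dropout randomly cancels the influence of some weights by dropping their nodes, and the final division by keep_prob scales up the contribution of the weights that remain.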
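The mask-and-rescale steps described above can be sketched as follows. This is an illustration of inverted dropout under the stated description, not the code from the experiment; the function names are assumptions.

```python
import numpy as np

def dropout_forward(A, keep_prob, rng):
    """Apply inverted dropout to activations A and return the mask for backprop."""
    mask = rng.random(A.shape) < keep_prob   # True for nodes that are kept
    A_dropped = A * mask / keep_prob         # rescale so the expected value is unchanged
    return A_dropped, mask

def dropout_backward(dA, mask, keep_prob):
    """Apply the cached forward mask and the same scaling to the gradient."""
    return dA * mask / keep_prob

rng = np.random.default_rng(0)
A = np.ones((4, 1000))
A_dropped, mask = dropout_forward(A, keep_prob=0.8, rng=rng)
print("mean before:", A.mean(), "mean after:", A_dropped.mean())
```

Dropout is only applied during training; at prediction time the activations are used as they are.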

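Other regularization methods

Besides adding an L1 or L2 term to the cost function, overfitting can also be reduced with dropout, data augmentation, and early stopping.

L2 and Dropout experiment

Normalizing the training set

Normalizing the training data makes gradient descent converge faster and speeds up training.
Review: feature scaling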
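A small sketch of the normalization step, assuming the data layout used elsewhere in these notes (one feature per row, one example per column). The function names and the `eps` guard are illustrative assumptions.

```python
import numpy as np

def fit_normalizer(X_train):
    """Per-feature mean and standard deviation, computed on the training set only."""
    mu = X_train.mean(axis=1, keepdims=True)
    sigma = X_train.std(axis=1, keepdims=True)
    return mu, sigma

def apply_normalizer(X, mu, sigma, eps=1e-8):
    """Center and scale X with statistics from the training set."""
    return (X - mu) / (sigma + eps)   # eps avoids division by zero for constant features

X_train = np.array([[1.0, 2.0, 3.0],
                    [10.0, 20.0, 30.0]])
mu, sigma = fit_normalizer(X_train)
X_norm = apply_normalizer(X_train, mu, sigma)
print(X_norm.mean(axis=1), X_norm.std(axis=1))
```

The same `mu` and `sigma` should be reused for the dev and test sets so that all data goes through an identical transformation.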

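Initializing parameters

Across the layers of a neural network the weights are multiplied together. If the initial weights of every layer are somewhat larger than 1 or somewhat smaller than 1, the product grows or shrinks exponentially with depth, so y^ becomes extremely large or extremely small, which leads to exploding or vanishing gradients.

Solution

 Different ways to initialize parameters

Gradient checking

The two-sided difference is more accurate than the one-sided difference.

Gradient checking takes the parameters and gradients from one training iteration and flattens each dictionary into an (n, 1) array, so that every weight of every layer becomes one entry.
Each entry is nudged by a very small value, once up and once down, and the cost is recomputed both times; the two costs give the two-sided difference for that entry.
Once every entry has been processed, the approximated gradient is compared with the backpropagation gradient (see the formula).
If the difference is small, the gradient computation is most likely correct.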
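The comparison formula is not reproduced in the text. The sketch below uses the usual relative difference between the two gradient vectors and a toy cost function; `gradient_check` and `cost_fn` are illustrative names, and the parameters are assumed to be already flattened into an (n, 1) array.

```python
import numpy as np

def gradient_check(cost_fn, theta, grad, epsilon=1e-7):
    """Compare an analytic gradient with a two-sided numerical estimate.

    cost_fn -- function mapping an (n, 1) parameter vector to a scalar cost
    theta   -- flattened parameters, shape (n, 1)
    grad    -- flattened analytic gradient, shape (n, 1)
    Returns ||grad - approx|| / (||grad|| + ||approx||).
    """
    grad_approx = np.zeros_like(theta)
    for i in range(theta.shape[0]):
        theta_plus = theta.copy()
        theta_plus[i, 0] += epsilon
        theta_minus = theta.copy()
        theta_minus[i, 0] -= epsilon
        grad_approx[i, 0] = (cost_fn(theta_plus) - cost_fn(theta_minus)) / (2 * epsilon)
    numerator = np.linalg.norm(grad - grad_approx)
    denominator = np.linalg.norm(grad) + np.linalg.norm(grad_approx)
    return numerator / denominator

# Toy example: J(theta) = sum(theta^2), whose exact gradient is 2 * theta.
cost = lambda t: float(np.sum(t ** 2))
theta = np.array([[1.0], [-2.0], [3.0]])
print("correct gradient:", gradient_check(cost, theta, 2 * theta))
print("wrong gradient:  ", gradient_check(cost, theta, 3 * theta))
```

A very small value suggests the analytic gradient is right; a value that is orders of magnitude larger points to a bug in backpropagation.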

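Experiment

Code

Analyzing and solving problems

Some model optimization methods mentioned earlier

Orthogonalization

Orthogonalization is a way of approaching problems: split the problem into independent sub-problems so that each one can be solved on its own with minimal interference from the others.
In the training process below, for example, each stage has its own goal, runs into its own problems, and gets its own fixes.
The sub-goals are independent of each other; once the goal of one stage is met, move on to the next stage.

In effect, one large problem is split into several small, independent problems,
each handled with its own solution, until the whole model has been trained.

Setting up a single-number evaluation metric

When several metrics need to be optimized, follow the orthogonalization idea above and handle them separately by setting up a single-number evaluation metric.
Model evaluation involves several metrics, such as precision, recall, and F1.
These metrics interact: for example, high precision combined with low recall yields a low F1.
So a single-number evaluation metric such as F1 is needed, and models are compared by F1.
F1 then becomes the optimizing metric (the one to keep improving), and the other metrics become satisficing metrics (they only need to reach a threshold).
Training then aims at the optimizing metric, which gives it a clear target.
Alternatively, combine the other metrics into one value and use that to rank models.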
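To make the precision/recall/F1 relationship concrete, here is a small sketch that computes all three from binary labels. The function name and the sample labels are made up for illustration.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# A cautious classifier: everything it flags is right, but it misses two positives.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0]
print(precision_recall_f1(y_true, y_pred))
```

In this example precision is perfect while recall is low, and the harmonic mean pulls F1 well below the precision, which is why F1 is a convenient single number for ranking models.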
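Related review 1
 Related review 2

Shuffling data to keep distributions consistent

If the dev and test sets have very different distributions, a model selected on the dev set may perform poorly on the test set.
Shuffle the data before splitting so that the dev and test sets end up with similar distributions.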

Choose a dev set and test set to reflect data you expect to get in the future and consider important to do well on.

Dataset size guidelines

Set your dev set to be big enough to detect differences in algorithm/models you’re trying out.
Set your test set to be big enough to give high confidence in the overall performance of your system.
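If real-world performance matters, keep a test set for performance testing; otherwise results in actual use may suffer.

Changing the metric

If, during training, the metric set earlier no longer meets the needs of the task,
or it has already led to the wrong model being selected,
then change the metric, change the model structure, or change the dev/test set.

Using human-level performance as a reference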

Why compare to human-level performance
Humans are quite good at a lot of tasks. So long as ML is worse than humans, you can:

Get labeled data from humans.
Gain insight from manual error analysis: Why did a person get this right?
Better analysis of bias/variance.

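Summary

Error analysis

Tally the misclassified examples by error type, then use the counts to decide what to optimize next and in what priority.

Cleaning up incorrect labels

If the dev/test set contains mislabeled examples,
and they account for a large enough share of the errors to affect model selection on the dev set,
then the mislabeled data needs to be fixed.

Correcting incorrect dev/test set examples
• Apply same process to your dev and test sets to make sure they continue to come from the same distribution.
• Consider examining examples your algorithm got right as well as ones it got wrong.
• Train and dev/test data may now come from slightly different distributions.

Handling datasets with different distributions

When the training set and the dev/test sets come from different distributions,
shuffling everything together before splitting is not recommended: the dev/test sets would then consist only partly of the real target data.
A better approach is to move 50% of the dev/test data into the training set and split the remaining 50% evenly between dev and test.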

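Build quickly, then iterate

For a brand-new problem, build a first model quickly and then iterate, finding problems and improving the algorithm along the way.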
• Set up dev/test set and metric
• Build initial system quickly
• Use Bias/Variance analysis & Error analysis to prioritize next steps.
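Still, try to avoid a system that is either too simple or too complex.
For problems backed by a large body of papers, such as face recognition, a complex model can be built quickly from the literature and trained right away.
For problems in an unfamiliar domain, go from simple to complex.

Mismatched data

Analyze the error rates in finer detail: the differences between successive error rates show how much each cause contributes and where to optimize next.
· Avoidable bias
· Variance: if it is large, the model does not generalize to data from the same distribution
· Data mismatch: the training data and the dev data may come from different distributions
· Degree of overfitting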
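The four gaps listed above can be read off a ladder of error rates. The sketch below assumes one extra quantity that the text does not name explicitly: the error on a held-out "training-dev" set drawn from the same distribution as the training data. The function name and all error values are hypothetical.

```python
def error_breakdown(human, train, train_dev, dev, test):
    """Differences between successive error rates (all given as fractions).

    train_dev is the error on held-out data from the training distribution
    (an assumption of this sketch, not a term used in the text above).
    """
    return {
        "avoidable_bias": train - human,      # training error vs. human-level error
        "variance": train_dev - train,        # same distribution, unseen data
        "data_mismatch": dev - train_dev,     # different distribution
        "overfit_to_dev": test - dev,         # dev set tuned on too heavily
    }

gaps = error_breakdown(human=0.01, train=0.03, train_dev=0.04, dev=0.10, test=0.11)
largest = max(gaps, key=gaps.get)
print(gaps)
print("largest gap:", largest)
```

With these made-up numbers the largest gap is the data mismatch, so collecting or synthesizing training data closer to the dev set would be the first thing to try.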

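How to address data mismatch

Run error analysis on the train set and the dev set to find how they differ, then train on as much dev-like data as possible.
Besides collecting real dev-like data, artificially synthesized data can also be used for training.
Keep in mind that synthesized data covers only a subset of the real data, so avoid overfitting to that subset.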
Addressing data mismatch
• Carry out manual error analysis to try to understand difference between training and dev/test sets
• Make training data more similar; or collect more data similar to dev/test sets

Transfer learning

Review
When it applies
• Task A and B have the same input x.
• You have a lot more data for Task A than Task B.
• Low level features from A could be helpful for learning B.

Multi-task learning

Build a single neural network that outputs the results of several tasks at once.
When it applies
• Training on a set of tasks that could benefit from having shared lower-level features.
• Usually: Amount of data you have for each task is quite similar.
• Can train a big enough neural network to do well on all the tasks.

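End-to-end deep learning

End-to-end deep learning means that, when building and training a model, a single model handles the whole path from raw input data to final output.
This avoids the multiple separate stages and hand-crafted feature engineering common in traditional machine learning, and lets one unified model learn and optimize the entire task automatically.

Pros: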
• Let the data speak
• Less hand-designing of components needed
Cons:
• May need large amount of data
• Excludes potentially useful hand-designed components

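When it applies
Do you have sufficient data to learn a function of the complexity needed to map x to y?

In practice, if there is not enough data mapping x -> Y directly, but intermediate data is available for x -> Z -> Y,
splitting the end-to-end task into two smaller tasks is simpler.
Examples are the cases below: face recognition, estimating a child's age from a hand X-ray, and autonomous driving. Machine translation, by contrast, suits end-to-end learning well.

A streaming communication example

Posted on 2025-01-26

Background

The front end consists of a Node layer and a pure browser layer. It needs to call a third-party HTTP API and render a ChatGPT-style typewriter effect on the page.

Approach

Call the third-party HTTP API from the Node layer to avoid cross-origin issues.
Because the third-party API is a streaming API, the response relayed from the Node layer to the browser must also be streamed.
The browser layer processes the streamed data and updates what is shown on the page.

Implementation

The browser layer sends the request with fetch and reads the response through a ReadableStream.
The Node layer sends the request with axios and relays the response through a stream.

Node layer implementation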

import axios from 'axios';
import { PassThrough } from 'stream';

// useContext and HttpContext come from the server framework; their imports are not shown in the original.
export async function faqStream(body?: any): Promise<void> {
  const ctx = useContext<HttpContext>();
  ctx.set({
    Connection: 'keep-alive',
    'Cache-Control': 'no-cache',
    'Content-Type': 'application/octet-stream' // the response body is a stream
  });
  const stream = new PassThrough();
  ctx.body = stream;
  // Call the third-party API
  const headers = {
    'Content-Type': 'application/json'
  };
  const url = 'http://vvvv.xxxx.net/aiBot/oncall_response_stream';
  axios
    .post(url, ctx.request.body, { headers: headers, responseType: 'stream' })
    .then((response) => {
      if (response.status !== 200) {
        console.error('Error status:', response.status);
        stream.end(); // close the stream so the browser is not left waiting
        return;
      }
      // Assumes every chunk holds complete JSON messages separated by blank lines
      response.data.on('data', (chunk: Buffer) => {
        chunk
          .toString()
          .split('\n\n')
          .filter((item) => item)
          .forEach((chunkStr) => {
            let chunkJson: any = {};
            try {
              chunkJson = JSON.parse(chunkStr);
            } catch (error) {
              console.error('Error parse:', error);
              console.error('Error chunkStr:', chunkStr);
              console.error('Error origin chunk:', chunk.toString());
            }
            if (chunkJson?.data?.chunk) {
              // Forward the valid payload to the browser
              stream.write(chunkJson.data.chunk);
            }
          });
      });
      response.data.on('end', () => {
        // The third-party stream has ended: close the stream written to the browser
        stream.end();
      });
    })
    .catch((error) => {
      console.error('Error all:', error);
      stream.end(); // also close the stream when the upstream request fails
    });
}

Browser layer implementation

  import { useState, useCallback, useRef } from 'react';

  // query, userInfo, getJwt and useThrottleFunc are defined elsewhere in the component
  const [listLoading, setListLoading] = useState(0);
  const [hasReport, setHasReport] = useState(-1);
  const [AiRemoteData, setAIRemoteData] = useState('');
  const AiRequestController = useRef<AbortController | null>(null);
  const { addThrottle } = useThrottleFunc();

  // Fetch the AI answer
  const getAIRemoteData = useCallback(
    addThrottle(
      async () => {
        const { keyWord } = query;
        if (!keyWord) {
          return;
        }
        let loadingCleared = false; // makes sure the loading counter is decremented only once
        try {
          setListLoading((val) => val + 1);
          if (AiRequestController.current) {
            // Cancel the request that is still in flight, if any
            AiRequestController.current.abort();
          }
          AiRequestController.current = new AbortController();
          const jwtToken = await getJwt();
          // 1. Send a new request
          const response = await fetch(
            `https://hahahaha.net/api/diagnosisBasic/faqStream`,
            {
              method: 'POST',
              body: JSON.stringify({
                user_id: userInfo.id,
                query: keyWord,
                class: ''
              }),
              headers: {
                'x-jwt-token': jwtToken
              },
              signal: AiRequestController.current.signal
            }
          );
          const reader = response.body.getReader(); // get the stream reader
          const decoder = new TextDecoder(); // text decoder
          let answer = ''; // accumulated answer
          // 2. Read in a loop
          while (true) {
            // value is the chunk sent by the backend; done means the backend has closed the stream
            const { value, done } = await reader.read();
            if (done) {
              break;
            }
            // Decode value; stream: true keeps multi-byte characters that span two chunks intact
            const val = decoder.decode(value, { stream: true });
            if (!loadingCleared) {
              loadingCleared = true;
              setListLoading((count) => count - 1);
              setHasReport(-1);
            }
            answer += val;
            setAIRemoteData(answer);
          }
        } catch {
          setAIRemoteData('');
          console.error('Failed to read or parse the data');
        } finally {
          if (!loadingCleared) {
            setListLoading((val) => val - 1);
          }
        }
      },
      500,
      'getAIRemoteData'
    ),
    [query.keyWord]
  );

Aside: a global throttle function

import { useRef, useEffect } from 'react';

// LooseObject is not defined in the original; a generic object type is assumed here.
type LooseObject = Record<string, unknown>;

// Note: every call cancels the pending timer and restarts the wait,
// so fn only runs once calls have stopped for waitTime ms.
const throttleTimerList: { [key: string]: ReturnType<typeof setTimeout> | null } = {};
export const useThrottleFunc = () => {
  const timer = useRef();
  const addThrottle = (
    fn: (params?: LooseObject | undefined) => void,
    waitTime?: number,
    timerKey?: string
  ) => {
    const timerFlag = timerKey || 'getRemoteData';
    throttleTimerList[timerFlag] = null;
    return (params?: LooseObject | undefined) => {
      const pending = throttleTimerList[timerFlag];
      if (pending) {
        clearTimeout(pending);
      }
      throttleTimerList[timerFlag] = setTimeout(() => {
        fn(params);
        throttleTimerList[timerFlag] = null;
      }, waitTime || 500);
    };
  };

  // Clear every pending timer when the component unmounts
  useEffect(
    () => () => {
      Object.keys(throttleTimerList).forEach((key) => {
        const pending = throttleTimerList[key];
        if (pending) {
          clearTimeout(pending);
        }
        delete throttleTimerList[key];
      });
    },
    []
  );
  return {
    addThrottle,
    throttleTimer: timer
  };
};

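Derivation formulas for neural networks

Posted on 2024-12-01

Matrix dimensions

To make repeated computation easier and cut down on for loops, neural network computations convert data into vectors and matrices wherever possible
and rely on broadcasting for speed. As data passes through the layers, the matrix dimensions generally follow the relationships below.

If the previous (input) layer has dimension (m, 1), the hidden layer (n, 1), and the next (output) layer (p, 1),
then the hidden layer's W has dimension (n, m) and its b has dimension (n, 1); b always matches the dimension of its own layer.
The output layer's W has dimension (p, n) and its b has dimension (p, 1).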
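The dimension rules above can be checked with a few lines of NumPy. The sizes m = 3, n = 4, p = 2 and the batch of 5 samples are arbitrary values chosen for this sketch.

```python
import numpy as np

m, n, p = 3, 4, 2                     # input, hidden and output layer sizes
rng = np.random.default_rng(0)

x = rng.standard_normal((m, 1))       # one input sample, shape (m, 1)
W1 = rng.standard_normal((n, m))      # hidden-layer weights, shape (n, m)
b1 = np.zeros((n, 1))                 # hidden-layer bias, shape (n, 1)
W2 = rng.standard_normal((p, n))      # output-layer weights, shape (p, n)
b2 = np.zeros((p, 1))                 # output-layer bias, shape (p, 1)

a1 = np.tanh(W1 @ x + b1)             # hidden activations, shape (n, 1)
a2 = W2 @ a1 + b2                     # output-layer pre-activation, shape (p, 1)

# Stacking 5 samples as columns computes them all at once; b1 is broadcast across columns.
X = rng.standard_normal((m, 5))
A1 = np.tanh(W1 @ X + b1)             # shape (n, 5)

print(a1.shape, a2.shape, A1.shape)
```

Each column of `A1` is the hidden-layer output for one sample, which is what removes the loop over samples.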

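Matrix dimension relationships of the parameters

Forward propagation

Computation in one neuron

Each neuron computes in two steps: first z, then a through the activation function.
Neurons in the same layer differ only in the parameters they use.

If the parameters w and b are stacked into a matrix and a vector, a single vectorized computation on the current sample
yields the output vector of the whole layer directly.

Computation in one layer

Likewise, every layer can be written with the same expression.

Computation for a batch of samples

Running the per-layer computation in a for loop over the samples gives the prediction y^ for every sample.

But widening the input from an (m, 1) vector to an (m, x) matrix computes x samples at once and removes the for loop.
If the hidden layer has n neurons, the result is an (n, x) matrix:
row i holds x numbers, the value of hidden neuron i for each sample;
column j holds n numbers, the value of each hidden neuron for sample j.

After both layers, the result is a (1, x) vector in which each entry is the network's prediction for one sample.

Relationship between the matrix dimensions of inputs and outputs

Summary

Other activation functions

Besides sigmoid, common activation functions include tanh, ReLU, and leaky ReLU.
The latter three are more common and more widely used; sigmoid is mostly limited to binary classification.

Why not use a linear function as the activation function

Because with a linear activation, no matter how many layers the network has, the whole network is equivalent to a single linear function, and the hidden layers have no effect.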
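The collapse of stacked linear layers can be verified numerically. The sketch below uses the identity function as the "activation"; the layer sizes and random values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal((2, 1))
X = rng.standard_normal((3, 5))

# Two layers with a linear (identity) activation in between.
two_layers = W2 @ (W1 @ X + b1) + b2

# The same mapping written as a single linear layer.
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ X + b

print(np.allclose(two_layers, one_layer))
```

Since W2(W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2), adding more linear layers never adds expressive power; a nonlinear activation is what breaks this equivalence.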

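Derivatives of the activation functions

Backward propagation

Review: how gradient descent is computed

For a single neuron, backward propagation means computing the partial derivatives with respect to that neuron's parameters.

Substituting into a multi-layer network

Summary

As with forward propagation, vectors and matrices are used to reduce for loops.

Why can't the parameters be initialized to 0?

If the parameters are initialized to 0 or to the same value, every node in a layer computes the same output. This symmetry means every node also has the same effect on later computation and receives the same update, so the extra hidden nodes are useless.
The fix is to initialize the parameters randomly.

The parameters are usually initialized to small random values. If they start out large, z becomes large,
and with an activation function such as sigmoid or tanh
the result lands where the curve is flat: the activation saturates, gradient descent slows down, and learning suffers.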
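The symmetry problem can be seen directly in the gradients. The sketch below uses a tiny two-layer network (tanh hidden layer, sigmoid output, biases left out for brevity); the function name, sizes and data are all illustrative assumptions.

```python
import numpy as np

def hidden_weight_grad(W1, W2, X, Y):
    """Gradient of the logistic cost with respect to W1 for a 2-layer network."""
    m = X.shape[1]
    A1 = np.tanh(W1 @ X)                       # hidden activations, shape (n_h, m)
    A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1)))      # output probabilities, shape (1, m)
    dZ2 = A2 - Y
    dZ1 = (W2.T @ dZ2) * (1.0 - A1 ** 2)       # tanh'(z) = 1 - tanh(z)^2
    return dZ1 @ X.T / m

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 6))
Y = np.array([[0, 1, 1, 0, 1, 0]])

# Every weight starts at the same value: all hidden units get the same gradient row.
dW1_same = hidden_weight_grad(np.full((3, 2), 0.5), np.full((1, 3), 0.5), X, Y)
# Small random values: the rows differ, so the hidden units can learn different features.
dW1_rand = hidden_weight_grad(rng.standard_normal((3, 2)) * 0.01,
                              rng.standard_normal((1, 3)) * 0.01, X, Y)

rows_identical_same = np.allclose(dW1_same, dW1_same[0])
rows_identical_rand = np.allclose(dW1_rand, dW1_rand[0])
print(rows_identical_same, rows_identical_rand)
```

With identical starting weights the hidden units stay copies of each other after every update, so the layer behaves like a single unit.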

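Deep networks

Forward propagation

Backward propagation

Some hyperparameters

Summary

Why use deep networks

A deep network that is narrow (few neurons per layer) but has many layers often needs far fewer computation steps than a shallow network (few layers).
Take the circuit of AND/OR/NOT gates below.
With the deep network on the left, each layer has half as many neurons as the one before.
With the single-hidden-layer network on the right, the number of neurons in that layer grows as a power of 2.
Overall, the deep network needs far fewer neurons.

Exercises

Logistic regression from start to finish

GitHub [ipynb] link
 Experiment

Walkthrough

Data processing

Vectorization

The experiment decides whether an image shows a cat.
The image is first converted into a vector.
Each pixel consists of three values (R, G, B). Taking 64 pixels along each side,
a 64x64 image contains 64 x 64 = 4096 [r, g, b] triples,
so one image is represented as

[
    [[196 192 190], [193 186 182], ... 61 more pixels ..., [188 179 174]],   (64 pixels per row)
    [[196 192 190], [193 186 182], ..., [188 179 174]],
    ... 60 more rows ...
    [[196 192 190], [193 186 182], ..., [188 179 174]],
    [[196 192 190], [193 186 182], ..., [88 79 74]]
]   (64 rows in total)

Flattening all the data and transposing it gives a vector of shape [64 x 64 x 3 = 12288, 1],
which is one column of the matrix made up of m examples.

[[196,], [192,],..., [88,], [79,],[74,]]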

A trick when you want to flatten a matrix X of shape (a,b,c,d) to a matrix X_flatten of shape (b ∗ c ∗ d, a) is to use:

X_flatten = X.reshape(X.shape[0], -1).T

The training set train_set_x_orig holds 209 training examples and has shape (209, 64, 64, 3),
so a is 209.
The whole training set can be converted into a matrix with 209 columns in one step:

m_train = train_set_x_orig.shape[0]  # 209
train_set_x_flatten = train_set_x_orig.reshape(m_train, -1).T

train_set_x_flatten now has shape (12288, 209).
Each column holds the pixel data of one image.

Centering and standardizing the data

Since the data here are image pixels, every value lies between 0 and 255.

One common preprocessing step in machine learning is to center and standardize your dataset,
meaning that you subtract the mean of the whole numpy array from each example,
and then divide each example by the standard deviation of the whole numpy array.
But for picture datasets,
it is simpler and more convenient and works almost as well to just divide every row of the dataset by 255 (the maximum value of a pixel channel).

train_set_x = train_set_x_flatten / 255.

Summary

What you need to remember:
Common steps for pre-processing a new dataset are:
Figure out the dimensions and shapes of the problem (m_train, m_test, num_px, …)
Reshape the datasets such that each example is now a vector of size (num_px * num_px * 3, 1)
“Standardize” the data

Building the model

The main steps for building a Neural Network are:

1. Define the model structure (such as number of input features; here, the 12288 RGB values of one image)
2. Initialize the model’s parameters
3. Loop:
   - Calculate current loss (forward propagation)
   - Calculate current gradient (backward propagation)
   - Update parameters (gradient descent)
You often build 1-3 separately and integrate them into one function we call model().

Implementing the sigmoid function

$\mathrm{sigmoid}(w^T x + b) = \frac{1}{1 + e^{-(w^T x + b)}}$

"""
    Compute the sigmoid of z
    Arguments:
    z -- A scalar or numpy array of any size.
    Return:
    s -- sigmoid(z)
"""
def sigmoid(z):
    s = 1/(1+np.exp(-z))
    return s

Initializing the model's parameters

Initialize the parameters with zeros

"""
This function creates a vector of zeros of shape (dim, 1) for w and initializes b to 0.
Argument:
dim -- num_px * num_px * 3
Returns:
w -- initialized vector of shape (dim, 1)
b -- initialized scalar (corresponds to the bias)
"""
def initialize_with_zeros(dim):
    w = np.zeros((dim, 1))
    b = 0
    return w, b

# Test
dim = 2
w, b = initialize_with_zeros(dim)

==>
w = [[0.]
 [0.]]
b = 0.0

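Implementing forward and backward propagation

Implement the formulas in code.

Each training iteration ends with the cost and the gradients.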

# GRADED FUNCTION: propagate

"""
Implement the cost function and its gradient for the propagation explained above

Arguments:
w -- weights, a numpy array of size (num_px * num_px * 3, 1)
b -- bias, a scalar
X -- data of size (num_px * num_px * 3, number of examples)
Y -- true "label" vector (containing 0 if non-cat, 1 if cat) of size (1, number of examples)

Return:
cost -- negative log-likelihood cost for logistic regression
dw -- gradient of the loss with respect to w, thus same shape as w
db -- gradient of the loss with respect to b, thus same shape as b

"""

def propagate(w, b, X, Y):
   
    
    m = X.shape[1]
    
    # FORWARD PROPAGATION (FROM X TO COST)
    A = sigmoid(w.T @ X + b)                                    # compute activation; A has shape (1, m), where m is the number of training examples
    cost = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))    # compute cost; np.squeeze below guards against a stray size-1 dimension
    
    # BACKWARD PROPAGATION (TO FIND GRAD)
    dw = X @ (A - Y).T / m
    db = np.mean(A - Y)

    assert(dw.shape == w.shape)
    assert(db.dtype == float)
    cost = np.squeeze(cost) # drop any size-1 dimensions so that cost is a scalar
    assert(cost.shape == ()) # cost must be zero-dimensional; a shape of (1,) would trigger the assertion
    
    grads = {"dw": dw,
             "db": db}
    
    return grads, cost

Implementing gradient descent

"""
This function optimizes w and b by running a gradient descent algorithm

Arguments:
w -- weights, a numpy array of size (num_px * num_px * 3, 1)
b -- bias, a scalar
X -- data of shape (num_px * num_px * 3, number of examples)
Y -- true "label" vector (containing 0 if non-cat, 1 if cat), of shape (1, number of examples)
num_iterations -- number of iterations of the optimization loop
learning_rate -- learning rate of the gradient descent update rule
print_cost -- True to print the loss every 100 steps

Returns:
params -- dictionary containing the weights w and bias b
grads -- dictionary containing the gradients of the weights and bias with respect to the cost function
costs -- list of all the costs computed during the optimization, this will be used to plot the learning curve.

1) Calculate the cost and the gradient for the current parameters. Use propagate().
2) Update the parameters using gradient descent rule for w and b.
"""

def optimize(w, b, X, Y, num_iterations, learning_rate, print_cost = False):
    costs = []  # cost values recorded during training
    
    for i in range(num_iterations):
        
        # Cost and gradient calculation (≈ 1-4 lines of code)
        # The first iteration uses the initial w and b;
        # later iterations use the updated w and b
        grads, cost = propagate(w, b, X, Y) 
        
        # Unpack the gradients
        dw = grads["dw"]
        db = grads["db"]
        
        # Gradient descent update
        w = w - learning_rate * dw
        b = b - learning_rate * db
        
        # Record the costs
        # Store the cost every 100 iterations
        if i % 100 == 0: 
            costs.append(cost)
        
        # Print the cost every 100 iterations if requested
        if print_cost and i % 100 == 0:
            print ("Cost after iteration %i: %f" %(i, cost))
    
    # After num_iterations iterations, return the final parameters, the last gradients, and the recorded costs (useful for plotting the learning curve)
    params = {"w": w,
              "b": b}
    
    grads = {"dw": dw,
             "db": db}
    
    return params, grads, costs

Prediction function

Implement the prediction function from the formula

$\hat{Y} = A = \sigma(w^T X + b)$

'''
Predict whether the label is 0 or 1 using learned logistic regression parameters (w, b)

Arguments:
w -- weights, a numpy array of size (num_px * num_px * 3, 1)
b -- bias, a scalar
X -- data of size (num_px * num_px * 3, number of examples)

Returns:
Y_prediction -- a numpy array (vector) containing all predictions (0/1) for the examples in X
'''
# X is the flattened data of shape (12288, m): X.shape[0] is the number of features (12288 RGB values) and X.shape[1] is the number of examples
def predict(w, b, X):
    
    m = X.shape[1]
    Y_prediction = np.zeros((1,m))
    w = w.reshape(X.shape[0], 1)
    
    # Compute vector "A" predicting the probabilities of a cat being present in the picture
    A = sigmoid(w.T @ X + b)
    
    for i in range(A.shape[1]):
        # Convert probabilities A[0,i] to actual predictions p[0,i] 
        # Above 0.5: predict 1 (cat); otherwise predict 0 (not a cat)
        Y_prediction[0, i] = A[0, i] > 0.5
    
    assert(Y_prediction.shape == (1, m))
    
    return Y_prediction

Assembling the model


"""
Builds the logistic regression model by calling the function you've implemented previously

Arguments:
X_train -- training set represented by a numpy array of shape (num_px * num_px * 3, m_train)
Y_train -- training labels represented by a numpy array (vector) of shape (1, m_train)
X_test -- test set represented by a numpy array of shape (num_px * num_px * 3, m_test)
Y_test -- test labels represented by a numpy array (vector) of shape (1, m_test)
num_iterations -- hyperparameter representing the number of iterations to optimize the parameters
learning_rate -- hyperparameter representing the learning rate used in the update rule of optimize()
print_cost -- Set to true to print the cost every 100 iterations

Returns:
d -- dictionary containing information about the model.
"""

def model(X_train, Y_train, X_test, Y_test, num_iterations = 2000, learning_rate = 0.5, print_cost = False):

    # initialize parameters with zeros (≈ 1 line of code)
    # Initialize the model's parameters
    w, b = initialize_with_zeros(X_train.shape[0])

    # Gradient descent (≈ 1 line of code)
    # Update the parameters by gradient descent on the training data
    parameters, grads, costs = optimize(w, b, X_train, Y_train, num_iterations, learning_rate, print_cost)
    
    # Retrieve parameters w and b from dictionary "parameters"
    w = parameters["w"]
    b = parameters["b"]
    
    # Predict test/train set examples (≈ 2 lines of code)
    # Predict on the test and training sets with the trained parameters
    Y_prediction_test = predict(w, b, X_test)
    Y_prediction_train = predict(w, b, X_train)

    # Print train/test Errors
    # Print the accuracy on the training and test sets
    print("train accuracy: {} %".format(100 - np.mean(np.abs(Y_prediction_train - Y_train)) * 100))
    print("test accuracy: {} %".format(100 - np.mean(np.abs(Y_prediction_test - Y_test)) * 100))

    # Return the recorded costs (learning curve), the predictions on both sets (to judge the fit), the parameters, the learning rate, and the number of iterations
    d = {"costs": costs,
         "Y_prediction_test": Y_prediction_test, 
         "Y_prediction_train" : Y_prediction_train, 
         "w" : w, 
         "b" : b,
         "learning_rate" : learning_rate,
         "num_iterations": num_iterations}
    
    return d

# Run the model
# Note: train_set_x, train_set_y, test_set_x and test_set_y are the preprocessed datasets
d = model(train_set_x, train_set_y, test_set_x, test_set_y, num_iterations = 2000, learning_rate = 0.005, print_cost = True)

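Model analysis

Analyzing the predictions

The accuracy printed inside the function gives a first indication of whether the model overfits the training data. On top of that, a single test example can be pulled out and its label compared with the prediction.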

index = 14
plt.imshow(test_set_x[:,index].reshape((num_px, num_px, 3)))
print ("y = " + str(test_set_y[0,index]) + ", you predicted that it is a \"" + classes[int(d["Y_prediction_test"][0,index])].decode("utf-8") +  "\" picture.")

Learning curve analysis

# Plot learning curve (with costs)
costs = np.squeeze(d['costs'])
plt.plot(costs)
plt.ylabel('cost')
plt.xlabel('iterations (per hundreds)')
plt.title("Learning rate =" + str(d["learning_rate"]))
plt.show()

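Learning rate analysis (hyperparameter analysis)

The same approach works for increasing the number of iterations and watching how the learning curve changes.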

learning_rates = [0.01, 0.001, 0.0001]
models = {}
for i in learning_rates:
    print ("learning rate is: " + str(i))
    models[str(i)] = model(train_set_x, train_set_y, test_set_x, test_set_y, num_iterations = 1500, learning_rate = i, print_cost = False)
    print ('\n' + "-------------------------------------------------------" + '\n')

for i in learning_rates:
    plt.plot(np.squeeze(models[str(i)]["costs"]), label = str(models[str(i)]["learning_rate"]))

plt.ylabel('cost')
plt.xlabel('iterations (hundreds)')

legend = plt.legend(loc='upper center', shadow=True)
frame = legend.get_frame()
frame.set_facecolor('0.90')
plt.show()

Using the trained model to make predictions

from PIL import Image

my_image = "my_image2.jpg"   # change this to the name of your image file 

fname = "images/" + my_image
image = np.array(plt.imread(fname))   # for a JPEG this is a uint8 array with values 0..255; a 3-channel RGB image is assumed
# Resize first, then flatten to (12288, 1) and scale to [0, 1] so the input matches the training data
resized = np.array(Image.fromarray(image).resize((num_px, num_px)))
my_image = resized.reshape((1, num_px*num_px*3)).T / 255.
my_predicted_image = predict(d["w"], d["b"], my_image)

plt.imshow(image)
print("y = " + str(np.squeeze(my_predicted_image)) + ", your algorithm predicts a \"" + classes[int(np.squeeze(my_predicted_image)),].decode("utf-8") +  "\" picture.")

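Binary classification in practice
 scikit framework

Classifying nonlinear data with a hidden layer

Problem
We have a set of data points arranged in the shape of a flower, which is not linearly separable, as shown below.

Some of the points are red and some are blue. Suppose red means support for Trump and blue means support for Biden.
We want a neural network to classify these points,
so that feeding in a point directly yields its label, red or blue.
The idea is to carve the plane into regions where red or blue points are concentrated.
If we keep using a single sigmoid unit (logistic regression), the decision boundary is linear and the regions come out wrong.

We want a nonlinear function that separates the data more accurately.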

The general methodology to build a Neural Network is to:

1. Define the neural network structure ( # of input units, # of hidden units, etc).
2. Initialize the model’s parameters
3. Loop:
   - Implement forward propagation
   - Compute loss
   - Implement backward propagation to get the gradients
   - Update parameters (gradient descent)

You often build helper functions to compute steps 1-3 and then merge them into one function we call nn_model().
Once you’ve built nn_model() and learnt the right parameters, you can make predictions on new data.

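Model structure

Equations involved

Forward propagation

Backward propagation

Implementation

Focus on how to use the model once it is assembled: how to predict, how to compute accuracy, how to tune the hyperparameters, and how to do transfer learning.

Code for the nonlinear classification model

Implementing an L-layer neural network

Overall architecture

In forward propagation, each of the first L-1 layers is a linear step followed by a ReLU activation, and the last layer is a linear step followed by a sigmoid activation.

The model’s structure is: [LINEAR -> RELU] × (L-1) -> LINEAR -> SIGMOID.
Compute the cost function.
Backward propagation runs in the opposite direction: it starts with the backward step for the output layer (sigmoid, then linear) and continues with the backward steps for the remaining L-1 layers (ReLU, then linear).

The backward order is: SIGMOID -> LINEAR -> RELU -> LINEAR -> … -> RELU -> LINEAR.
For every forward function, there is a corresponding backward function. That is why at every step of your forward module you will be storing some values in a cache. The cached values are useful for computing gradients. In the backpropagation module you will then use the cache to calculate the gradients. This assignment will show you exactly how to carry out each of these steps.

Apply the chain rule to the linear function below to obtain dW, db and dA.

This gives the backward propagation formulas.

Pay attention to the derivative of the sigmoid function in the output layer.

Code for the L-layer neural network

Runtime problems

Parameter initialization issue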

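Statement	Initialization	Pros and cons
`np.random.randn(layer_dims[l], layer_dims[l - 1])`	Standard normal initialization	Simple, but can cause exploding or vanishing gradients, especially in deep networks.
`np.random.randn(layer_dims[l], layer_dims[l-1]) / np.sqrt(layer_dims[l-1])`	A variant of Xavier initialization; `layer_dims[l-1]` is the number of neurons in the previous layer	Gives more stable gradients and activations; suits symmetric activation functions (such as Sigmoid and Tanh). Reduces exploding and vanishing gradients.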

With ReLU or Leaky ReLU as the activation function, He initialization can be used:

parameters['W' + str(l)] = np.random.randn(layer_dims[l], layer_dims[l-1]) * np.sqrt(2 / layer_dims[l-1])

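Some fundamentals

Posted on 2024-11-27

Why deep learning took off

More data, more computing power and better algorithms have shortened the training cycle of deep learning, so models can be iterated on and improved quickly
and then put to use in industrial production.

The sections below use binary classification to review the related math and concepts.

Loss function and cost function

A quick recap:
The loss function measures the error between the prediction and the actual value for a single training example.
The cost function is the average of the loss over all training examples.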

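Common derivative formulas

1. $C' = 0$ (C is a constant)
2. $(x^n)' = n x^{n-1}$ ($n \in \mathbb{R}$)
3. $(\sin x)' = \cos x$
4. $(\cos x)' = -\sin x$
5. $(a^x)' = a^x \ln a$ (ln is the natural logarithm)
6. $(\log_a x)' = \frac{1}{x \ln a}$ ($a > 0$ and $a \neq 1$)
7. $(\tan x)' = \frac{1}{\cos^2 x} = \sec^2 x$
8. $(\cot x)' = -\frac{1}{\sin^2 x} = -\csc^2 x$
9. $(\sec x)' = \tan x \sec x$
10. $(\csc x)' = -\cot x \csc x$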

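Computation graph

Forward propagation

Computing the cost function

Backward propagation

The chain rule gives the derivative of each parameter in every iteration, which gradient descent then uses.

Gradient descent computation

Neural network programming guideline
Whenever possible, avoid explicit for-loops.
Explicit for loops waste compute. Vectorizing the two inner loops above (over training examples and over parameters) leaves only the loop over training iterations.

Broadcasting

Vector broadcasting greatly reduces the cost of for loops.
Common broadcasting patterns are shown below.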
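The patterns themselves are not reproduced in the text, so here is a short sketch of three typical cases in NumPy. The sample numbers are arbitrary.

```python
import numpy as np

# 1. Scalar with array: the scalar is stretched to the shape of the array.
v = np.array([[1], [2], [3]]) + 100            # shape (3, 1)

# 2. (m, n) with (n,): the vector is added to every row.
M = np.array([[1, 2, 3],
              [4, 5, 6]]) + np.array([100, 200, 300])

# 3. (m, n) with (1, n): turn each column into percentages without a loop.
A = np.array([[56.0, 0.0, 4.4, 68.0],
              [1.2, 104.0, 52.0, 8.0],
              [1.8, 135.0, 99.0, 0.9]])
col_sum = A.sum(axis=0)                        # shape (4,)
percent = 100 * A / col_sum.reshape(1, 4)      # (3, 4) divided by (1, 4)

print(v.ravel())
print(M)
print(percent.sum(axis=0))
```

In each case NumPy virtually repeats the smaller operand along the missing or size-1 dimension, so no explicit loop or copy is needed.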

Exercise

Use broadcasting to implement the softmax function below

def softmax(x):
    """Calculates the softmax for each row of the input x.

    Your code should work for a row vector and also for matrices of shape (m,n).

    Argument:
    x -- A numpy matrix of shape (m,n)

    Returns:
    s -- A numpy matrix equal to the softmax of x, of shape (m,n)
    """
    # Apply exp() element-wise to x. Use np.exp(...).
    x_exp = np.exp(x)

    # Create a vector x_sum that sums each row of x_exp. Use np.sum(..., axis = 1, keepdims = True).
    x_sum = np.sum(x_exp, axis=1, keepdims=True)
    
    # Compute softmax(x) by dividing x_exp by x_sum. It should automatically use numpy broadcasting.
    s = x_exp / x_sum
    
    return s

x = np.array([
    [9, 2, 5, 0, 0],
    [7, 5, 0, 0 ,0]])
print("softmax(x) = " + str(softmax(x)))
==>
softmax(x) = [[9.80897665e-01 8.94462891e-04 1.79657674e-02 1.21052389e-04
  1.21052389e-04]
 [8.78679856e-01 1.18916387e-01 8.01252314e-04 8.01252314e-04
  8.01252314e-04]]