Deep Dive | Information Flow Mechanisms in LSTMs and Their Comparison

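The previous post on character-level models (reply code: GH023) mentioned a new neural network called Highway Networks, or HW-Net for short. Several LSTM variants are related to it. They share one motivation: although LSTM was proposed to solve the vanishing gradient problem, it cannot be said to solve it very well. If the input/forget gate design of LSTM is one mechanism for countering vanishing gradients, the variants in the first two papers below propose more, and arguably more flexible, mechanisms. The third paper then compares the roles and performance of the design units (such as gates and cells) across such variants.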

The relevant papers are listed below (external links are not supported here, so please search for the paper, code and notes yourself):

Training Very Deep Networks: arXiv preprint; paper, code
Grid Long Short-Term Memory: arXiv preprint; paper
LSTM: A Search Space Odyssey: arXiv preprint; paper, note

Training Very Deep Networks (Highway networks)

Rupesh Kumar Srivastava, Klaus Greff, Jürgen Schmidhuber

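This paper grew out of "Highway Networks", which was presented at an ICML workshop. At the time of writing this version had not been formally published and was only available on arXiv. As mentioned above, the motivation is again the vanishing gradient problem. Vanishing gradients essentially mean that information cannot be passed effectively to deeper layers, as if it were blocked. The authors wanted a way to let information flow freely again, like traffic on a highway, hence the name Highway Networks (HW-Nets).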
To overcome this, we take inspiration from Long Short Term Memory (LSTM) recurrent networks. We propose to modify the architecture of very deep feedforward networks such that information flow across layers becomes much easier. This is accomplished through an LSTM-inspired adaptive gating mechanism that allows for paths along which information can flow across many layers without attenuation. We call such paths information highways. They yield highway networks, as opposed to traditional ‘plain’ networks.

The word "adaptive" deserves emphasis: it is the key to this mechanism.

adaptive gating mechanism

In the paper, equation (2) defines the mechanism.
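For reference, the mechanism can be written as follows, with symbols as in the paper as far as I recall them: H is the layer's ordinary nonlinear transform, T is the transform gate, C is the carry gate, and the products are elementwise. The second line is the simplification the authors adopt by setting C = 1 - T, with a sigmoid transform gate.

```latex
y = H(x, W_H) \cdot T(x, W_T) + x \cdot C(x, W_C)

% with C = 1 - T and T(x) = \sigma(W_T^{\top} x + b_T):
y = H(x, W_H) \cdot T(x, W_T) + x \cdot \bigl(1 - T(x, W_T)\bigr)
```

When T is close to 0 the layer copies its input (y ≈ x), and when T is close to 1 it behaves like a plain layer (y ≈ H(x)).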

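The two gates are T, the transform gate, and C, the carry gate. The idea is that, for the current input at this layer, the gates decide how much of it goes through the nonlinear transform (the hidden layer) and how much keeps its original form and is passed directly to the next layer, unimpeded.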
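With this design, the authors go one step further and make the processing in each layer blockwise rather than layerwise, so that for each part of the input (individual dimensions) the network can choose separately whether to transform or carry.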
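The per-dimension choice between transform and carry can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function name `highway_layer`, the use of tanh for H, and the weight shapes are all assumptions made for illustration, and the carry gate is taken to be 1 - T.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_H, b_H, W_T, b_T):
    """One highway layer: y = H(x) * T(x) + x * (1 - T(x)), elementwise."""
    h = np.tanh(x @ W_H + b_H)      # candidate nonlinear transform H(x) (tanh is an assumption)
    t = sigmoid(x @ W_T + b_T)      # transform gate T(x), one value per unit
    return h * t + x * (1.0 - t)    # carry gate is C = 1 - T

rng = np.random.default_rng(0)
d = 4                               # layer width (arbitrary)
x = rng.normal(size=(2, d))
W_H = rng.normal(size=(d, d)) * 0.1
W_T = rng.normal(size=(d, d)) * 0.1
b_H = np.zeros(d)

# A very negative gate bias makes every unit carry its input unchanged;
# a very positive one makes the layer behave like a plain tanh layer.
y_carry = highway_layer(x, W_H, b_H, W_T, b_T=np.full(d, -20.0))
y_transform = highway_layer(x, W_H, b_H, W_T, b_T=np.full(d, 20.0))
```

Because `t` has one entry per unit, different dimensions of the same input can sit anywhere between the two extremes, which is what the blockwise view amounts to.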

Implementation Details

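At initialization, the authors found it works better to set the bias of the transform gate to a negative value, which biases the network toward carrying its input at the start of training.
Because this mechanism requires the input, the hidden layer, and the transform gate to have the same dimensionality, zero-padding can be used to fill in where the input falls short.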
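Both details can be sketched in a few lines of NumPy. The helper `zero_pad_features` and the bias value -2.0 are illustrative assumptions, not values taken from the authors' code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def zero_pad_features(x, width):
    """Pad the last (feature) axis of x with zeros up to the layer width."""
    missing = width - x.shape[-1]
    if missing < 0:
        raise ValueError("input is wider than the layer")
    pad_width = [(0, 0)] * (x.ndim - 1) + [(0, missing)]
    return np.pad(x, pad_width)     # default mode pads with zeros

width = 6
b_T = np.full(width, -2.0)          # negative initial transform-gate bias (value assumed)
initial_gate = sigmoid(b_T)         # gate value when the weighted input is near zero

x = np.ones((3, 4))                 # batch of 3 inputs with only 4 features
x_padded = zero_pad_features(x, width)
```

With a bias of -2 the gate starts near 0.12, so roughly 88% of each unit's output is the carried input until training moves the gate.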

Analysis and Conclusions

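The paper analyzes the experimental results to find out, in a deep network, how much of the signal in each layer is transformed (passed through the hidden nonlinearity) and how much is simply carried.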

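The analysis is in the last two columns of Figure 2, and the conclusion is that the vast majority of units are not transformed.
A second conclusion, in the authors' words: The last column of Figure 2 displays the block outputs and visualizes the concept of “information highways”. Most of the outputs stay constant over many layers forming a pattern of stripes. Most of the change in outputs happens in the early layers (≈ 15 for MNIST and ≈ 40 for CIFAR-100).
The third thing verified and analyzed is whether the blockwise mechanism is necessary. Figure 3 addresses this by checking whether different parts of the input behave the same way across layers. If they do not, that is, if they are not transformed or carried in the same places, then they should be treated differently, which supports the blockwise design.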
This data-dependent routing mechanism is further investigated in Figure 3. In each of the columns we show how the average over all samples of one specific class differs from the total average shown in the second column of Figure 2. For MNIST digits 0 and 7 substantial differences can be seen within the first 15 layers, while for CIFAR class numbers 0 and 1 the differences are sparser and spread out over all layers. In both cases it is clear that the mean activity pattern differs between classes. The gating system acts not just as a mechanism to ease training, but also as an important part of the computation in a trained network.

Grid Long Short-Term Memory

Nal Kalchbrenner, Ivo Danihelka, Alex Graves

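Overall, the contribution of this paper is a framework that is both more flexible and more computationally capable.
To understand the paper, it helps to first understand three concepts: grid/block, stacked, and depth.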

Grid/Block, Stacked, Depth

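A Grid/Block is a component built from a modified LSTM mechanism. The component can be multi-dimensional, and the number of dimensions determines along how many directions it propagates. Each dimension has its own memory vector and hidden vector. A 1-dimensional Grid LSTM looks very much like the Highway Networks described above.
Stacked means the same as in a stacked LSTM: the output of one block is connected to the input of the next. Stacking does not change the dimensionality of a Grid LSTM, though: a stacked 2D Grid LSTM is still 2D, not 3D. Visually, it just tiles the blocks across space (extending along every dimension).
Depth, by contrast, adds a dimension. Making a block deep means adding layers inside it; a block made of k layers is a Grid LSTM that is k layers deep.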

Multidimensional

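With only 1D or 2D, Grid LSTM shows no particularly large advantage. Once it becomes multidimensional, however, it handles vanishing gradients better than the traditional multidimensional LSTM. The reason is that the traditional multidimensional LSTM computes a single memory cell per block by summing the forget-gated memory cells coming in from every dimension.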
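Written out in the style of the Grid LSTM paper (reproduced from memory, so treat the exact symbols as an assumption), with $m_i$ the memory arriving from dimension $i$, $g^f_i$ the forget gate for that dimension, $g^u$ the input gate and $g^c$ the candidate content, the traditional multidimensional update is:

```latex
m' = \sum_{i=1}^{N} g^f_i \odot m_i + g^u \odot g^c
```

All N incoming memories are folded into the single vector $m'$.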
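This is clearly problematic, since a sum over N incoming memories can keep growing as the grid gets larger. Grid LSTM does not do this: it computes the memory cell of each dimension separately. Every block has N incoming memory cells and hidden cells and N outgoing memory cells and hidden cells, where N is the number of dimensions. What the dimensions share is one large hidden vector H, the concatenation of the incoming hidden cells. This preserves both the interaction between dimensions and the information flow within each one.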
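A minimal NumPy sketch of this idea follows. It is an illustration under assumptions, not the paper's implementation: the names `lstm_transform` and `grid_lstm_block` are hypothetical, biases are omitted, and the output uses the standard LSTM form o * tanh(m).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_transform(H, m, W):
    """Standard LSTM update driven by the shared vector H and one memory m."""
    u, f, o, c = np.split(H @ W, 4, axis=-1)   # input, forget, output gates and candidate
    m_new = sigmoid(f) * m + sigmoid(u) * np.tanh(c)
    h_new = sigmoid(o) * np.tanh(m_new)
    return h_new, m_new

def grid_lstm_block(hs, ms, Ws):
    """N-dimensional block: shared H, but one memory update per dimension."""
    H = np.concatenate(hs, axis=-1)            # all incoming hidden vectors, concatenated
    outs = [lstm_transform(H, m, W) for m, W in zip(ms, Ws)]
    new_hs = [h for h, _ in outs]
    new_ms = [m for _, m in outs]
    return new_hs, new_ms

rng = np.random.default_rng(0)
N, d = 3, 5                                    # 3 dimensions, 5 units each (arbitrary)
hs = [rng.normal(size=d) for _ in range(N)]
ms = [rng.normal(size=d) for _ in range(N)]
Ws = [rng.normal(size=(N * d, 4 * d)) * 0.1 for _ in range(N)]
new_hs, new_ms = grid_lstm_block(hs, ms, Ws)
```

Each dimension sees every incoming hidden vector through H, yet its memory is updated only from its own incoming memory, so no memory is ever a sum over dimensions.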

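Later the paper has a rather interesting application: it casts machine translation as a 3D Grid LSTM problem, in which two dimensions handle reading and writing in the forward and backward directions, as in a bi-LSTM, and the third dimension is depth. The results are quite good.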

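The main value of this framework may be that it makes LSTM variants a little more systematic. As for how much it actually improves performance, I remain fairly skeptical.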

LSTM: A Search Space Odyssey

Klaus Greff, Rupesh Kumar Srivastava, Jan Koutník, Bas R. Steunebrink, Jürgen Schmidhuber

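There is also someone else's note on this paper (please search for it yourself; external links are not supported for now). The title made me expect a hard read, so I kept putting it off. Having finished it, I found it reads smoothly and is quite clear, and the points I took away are almost the same as those in that note.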

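First, the 8 variants of the vanilla LSTM discussed in this paper have not all been proposed and used in practice. They were designed specifically for controlled comparison, changing one factor at a time. Apart from the vanilla LSTM, the most meaningful one is probably the GRU-style variant (the seventh).
Second, the paper's explanation of the vanilla LSTM is very clear. If I were asked to recommend an introductory explanation of the LSTM model, I would recommend this one too (the author of the note above felt the same).
Third, the paper reports a very large number of experiments, about 15 years of CPU time in total, yet the results are presented and analyzed in a well-organized way. It uses a method called functional Analysis of Variance (fANOVA) to measure how much each hyperparameter contributes to the variation in results. It looks like a very good method.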
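The idea of attributing variance to individual hyperparameters can be sketched on a toy example. As far as I understand, the real fANOVA fits a random forest to the observed runs and marginalizes over that model; the sketch below is a much simpler stand-in that works only on a complete grid of scores. The function `main_effect_fractions` and the error values are hypothetical.

```python
import numpy as np

def main_effect_fractions(scores):
    """Fraction of total variance explained by each axis of a full grid of scores."""
    total_var = scores.var()
    fractions = []
    for axis in range(scores.ndim):
        other_axes = tuple(a for a in range(scores.ndim) if a != axis)
        marginal = scores.mean(axis=other_axes)   # average out all other hyperparameters
        fractions.append(marginal.var() / total_var)
    return fractions

# Hypothetical error surface over 5 learning rates x 4 network sizes,
# built to be purely additive so there is no interaction term.
lr_effect = np.array([0.9, 0.5, 0.2, 0.4, 0.8])[:, None]
size_effect = np.array([0.10, 0.05, 0.02, 0.00])[None, :]
scores = lr_effect + size_effect

lr_frac, size_frac = main_effect_fractions(scores)
interaction_frac = 1.0 - lr_frac - size_frac     # whatever the main effects do not explain
```

On this made-up surface the learning rate accounts for almost all of the variance and the interaction share is zero by construction; a small interaction share is the situation in which tuning hyperparameters one at a time is justified.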

Conclusions

The important conclusions can be read straight from the final Conclusions section, which I excerpt here:

The most commonly used LSTM architecture (vanilla LSTM) performs reasonably well on various datasets and using any of eight possible modifications does not significantly improve the LSTM performance.
Certain modifications such as coupling the input and forget gates or removing peephole connections simplify LSTM without significantly hurting performance.
The forget gate and the output activation function are the critical components of the LSTM block. While the first is crucial for LSTM performance, the second is necessary whenever the cell state is unbounded.
Learning rate and network size are the most crucial tunable LSTM hyperparameters. Surprisingly, the use of momentum was found to be unimportant (in our setting of online gradient descent). Gaussian noise on the inputs was found to be moderately helpful for TIMIT, but harmful for other datasets.
The analysis of hyperparameter interactions revealed that even the highest measured interaction (between learning rate and network size) is quite small. This implies that the hyperparameters can be tuned independently. In particular, the learning rate can be calibrated first using a fairly small network.
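One more conclusion worth adding: although the variants bring no overall improvement, the effect is task-specific.
In other words, since there is no overall gain, the many variants that reduce the number of parameters are worth using, as they do not hurt performance. So the GRU-style modification is quite reasonable.
The paper is also a warning about mistakes made when tuning learning rates.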
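Coupling the input and forget gates, one of the simplifications named in the conclusions, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function `lstm_step`, the gate ordering inside `W`, and the omission of biases and peepholes are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, coupled=False):
    """One LSTM step without peepholes or biases; W maps [x, h] to gate pre-activations."""
    z = np.concatenate([x, h], axis=-1) @ W
    if coupled:
        i_pre, o_pre, g_pre = np.split(z, 3, axis=-1)
        i = sigmoid(i_pre)
        f = 1.0 - i                    # forget gate tied to the input gate
    else:
        i_pre, f_pre, o_pre, g_pre = np.split(z, 4, axis=-1)
        i, f = sigmoid(i_pre), sigmoid(f_pre)
    c_new = f * c + i * np.tanh(g_pre)
    h_new = sigmoid(o_pre) * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4                     # arbitrary sizes
x, h, c = rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid)
W_full = rng.normal(size=(n_in + n_hid, 4 * n_hid)) * 0.1
W_cifg = rng.normal(size=(n_in + n_hid, 3 * n_hid)) * 0.1
h_full, c_full = lstm_step(x, h, c, W_full)
h_cifg, c_cifg = lstm_step(x, h, c, W_cifg, coupled=True)
```

The coupled version needs three weight blocks instead of four, which is where the saving in parameters comes from.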
