Sequence Modeling with CTC
Introduction
Consider speech recognition. We have a dataset of audio clips and corresponding transcripts. Unfortunately, we don’t know how the characters in the transcript align to the audio. This makes training a speech recognizer harder than it might at first seem.
Without this alignment, the simple approaches aren’t available to us. We could devise a rule like “one character corresponds to ten inputs”. But people’s rates of speech vary, so this type of rule can always be broken. Another alternative is to hand-align each character to its location in the audio. From a modeling standpoint this works well; we’d know the ground truth for each input time-step. However, for any reasonably sized dataset this is prohibitively time consuming.
This problem doesn’t just turn up in speech recognition. We see it in many other places. Handwriting recognition from images or sequences of pen strokes is one example. Action labelling in videos is another.
Handwriting recognition: The input can be $(x, y)$ coordinates of a pen stroke or pixels in an image.
Speech recognition: The input can be a spectrogram or features from some other frequency-based feature extractor.
Connectionist Temporal Classification (CTC) is a way to get around not knowing the alignment between the input and the output. As we’ll see, it’s especially well suited to applications like speech and handwriting recognition.
To be a bit more formal, let’s consider mapping input sequences $X = [x_1, x_2, \ldots, x_T]$, such as audio, to corresponding output sequences $Y = [y_1, y_2, \ldots, y_U]$, such as transcripts. We want to find an accurate mapping from $X$’s to $Y$’s.
There are challenges which get in the way of using simpler supervised learning algorithms. In particular:
Both $X$ and $Y$ can vary in length.
The ratio of the lengths of $X$ and $Y$ can vary.
We don’t have an accurate alignment (correspondence of the elements) of $X$ and $Y$.
The CTC algorithm overcomes these challenges. For a given $X$ it gives us an output distribution over all possible $Y$’s. We can use this distribution either to infer a likely output or to assess the probability of a given output.
Not all ways of computing the loss function and performing inference are tractable. We’ll require that CTC do both of these efficiently.
Loss Function: For a given input, we’d like to train our model to maximize the probability it assigns to the right answer. To do this, we’ll need to efficiently compute the conditional probability $p(Y \mid X)$. The function $p(Y \mid X)$ should also be differentiable, so we can use gradient descent.
Inference: Naturally, after we’ve trained the model, we want to use it to infer a likely $Y$ given an $X$. This means solving $Y^* = \mathop{\text{argmax}}_{Y} \, p(Y \mid X)$. Ideally $Y^*$ can be found efficiently. With CTC we’ll settle for an approximate solution that’s not too expensive to find.
The Algorithm
The CTC algorithm can assign a probability for any $Y$ given an $X$. The key to computing this probability is how CTC thinks about alignments between inputs and outputs. We’ll start by looking at these alignments and then show how to use them to compute the loss function and perform inference.
Alignment
The CTC algorithm is alignment-free; it doesn’t require an alignment between the input and the output. However, to get the probability of an output given an input, CTC works by summing over the probability of all possible alignments between the two. We need to understand what these alignments are in order to understand how the loss function is ultimately calculated.
To motivate the specific form of the CTC alignments, first consider a naive approach. Let’s use an example. Assume the input has length six and $Y =$ [c, a, t]. One way to align $X$ and $Y$ is to assign an output character to each input step and collapse repeats.
This approach has two problems.
Often, it doesn’t make sense to force every input step to align to some output. In speech recognition, for example, the input can have stretches of silence with no corresponding output.
We have no way to produce outputs with multiple characters in a row. Consider the alignment [h, h, e, l, l, l, o]. Collapsing repeats will produce “helo” instead of “hello”.
To get around these problems, CTC introduces a new token to the set of allowed outputs. This new token is sometimes called the blank token. We’ll refer to it here as $\epsilon$. The $\epsilon$ token doesn’t correspond to anything and is simply removed from the output.
The alignments allowed by CTC are the same length as the input. We allow any alignment which maps to $Y$ after merging repeats and removing $\epsilon$ tokens.
If $Y$ has two of the same character in a row, then a valid alignment must have an $\epsilon$ between them. With this rule in place, we can differentiate between alignments which collapse to “hello” and those which collapse to “helo”.
Let’s go back to the output [c, a, t] with an input of length six. Here are a few more examples of valid and invalid alignments.
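To make the collapsing rule concrete, here is a minimal sketch in Python. The `collapse` helper and the string "ε" standing in for the blank token are illustrative choices, not part of any library:

```python
def collapse(alignment, blank="ε"):
    """Map a CTC alignment to an output: first merge repeats, then remove blanks."""
    merged = []
    prev = None
    for token in alignment:
        if token != prev:        # merge repeated tokens
            merged.append(token)
        prev = token
    return [t for t in merged if t != blank]   # drop the blank token

# Two valid alignments for Y = [c, a, t] with an input of length six:
assert collapse(["ε", "c", "c", "ε", "a", "t"]) == ["c", "a", "t"]
assert collapse(["c", "c", "a", "a", "t", "t"]) == ["c", "a", "t"]
# Without a blank between the repeated l's, "hello" collapses to "helo":
assert collapse(["h", "h", "e", "l", "l", "l", "o"]) == ["h", "e", "l", "o"]
```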
The CTC alignments have a few notable properties. First, the allowed alignments between $X$ and $Y$ are monotonic. If we advance to the next input, we can keep the corresponding output the same or advance to the next one. A second property is that the alignment of $X$ to $Y$ is many-to-one. One or more input elements can align to a single output element but not vice-versa. This implies a third property: the length of $Y$ cannot be greater than the length of $X$.
Loss Function
The CTC alignments give us a natural way to go from probabilities at each time-step to the probability of an output sequence.
To be precise, the CTC objective for a single $(X, Y)$ pair is:

$$p(Y \mid X) = \sum_{A \in \mathcal{A}_{X,Y}} \; \prod_{t=1}^T \; p_t(a_t \mid X)$$
The CTC conditional probability marginalizes over the set of valid alignments, computing the probability for a single alignment step-by-step.
Models trained with CTC typically use a recurrent neural network (RNN) to estimate the per time-step probabilities, $p_t(a_t \mid X)$. An RNN usually works well since it accounts for context in the input, but we’re free to use any learning algorithm which produces a distribution over output classes given a fixed-size slice of the input.
If we aren’t careful, the CTC loss can be very expensive to compute. We could try the straightforward approach and compute the score for each alignment, summing them all up as we go. The problem is there can be a massive number of alignments. For a $Y$ of length $U$ without any repeat characters and an $X$ of length $T$ the size of the set is ${T + U \choose T - U}$. For $T = 100$ and $U = 50$ this number is almost $10^{40}$. For most problems this would be too slow.
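For tiny inputs, though, the brute-force sum is still useful as a reference. The sketch below assumes the `collapse` helper from earlier and represents the per time-step distributions as a list of dictionaries; none of these names come from a particular library:

```python
from itertools import product

def ctc_brute_force(probs, labels, blank="ε"):
    """Sum the probability of every alignment of length T that collapses to `labels`.

    probs: list of T dicts, probs[t][a] = p_t(a | X).
    """
    T = len(probs)
    alphabet = list(probs[0].keys())
    total = 0.0
    for alignment in product(alphabet, repeat=T):
        if collapse(list(alignment), blank=blank) == list(labels):
            p = 1.0
            for t, a in enumerate(alignment):
                p *= probs[t][a]     # probability of this single alignment
            total += p
    return total
```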
Thankfully, we can compute the loss much faster with a dynamic programming algorithm. The key insight is that if two alignments have reached the same output at the same step, then we can merge them.
Summing over all alignments can be very expensive.
Dynamic programming merges alignments, so it’s much faster.
Since we can have an $\epsilon$ before or after any token in $Y$, it’s easier to describe the algorithm using a sequence which includes them. We’ll work with the sequence

$$Z = [\epsilon, \; y_1, \; \epsilon, \; y_2, \; \ldots, \; \epsilon, \; y_U, \; \epsilon]$$

which is $Y$ with an $\epsilon$ at the beginning, end, and between every character.
Let’s let $\alpha$ be the score of the merged alignments at a given node. More precisely, $\alpha_{s,t}$ is the CTC score of the subsequence $Z_{1:s}$ after $t$ input steps. As we’ll see, we’ll compute the final CTC score, $P(Y \mid X)$, from the $\alpha$’s at the last time-step. As long as we know the values of $\alpha$ at the previous time-step, we can compute $\alpha_{s,t}$. There are two cases.
Case 1:
In this case, we can’t jump over $z_{s-1}$, the previous token in $Z$. The first reason is that the previous token can be an element of $Y$, and we can’t skip elements of $Y$. Since every element of $Y$ in $Z$ is followed by an $\epsilon$, we can identify this when $z_s = \epsilon$. The second reason is that we must have an $\epsilon$ between repeat characters in $Y$. We can identify this when $z_s = z_{s-2}$.
To ensure we don’t skip $z_{s-1}$, we can either be there at the previous time-step or have already passed through at some earlier time-step. As a result there are two positions we can transition from.
$$\alpha_{s,t} = (\alpha_{s-1,t-1} + \alpha_{s,t-1}) \cdot p_t(z_s \mid X)$$

Here $(\alpha_{s-1,t-1} + \alpha_{s,t-1})$ is the CTC probability of the two valid subsequences after $t-1$ input steps, and $p_t(z_s \mid X)$ is the probability of the current character at input step $t$.
Case 2:
In the second case, we’re allowed to skip the previous token in $Z$. We have this case whenever $z_{s-1}$ is an $\epsilon$ between unique characters. As a result there are three positions we could have come from at the previous step.
$$\alpha_{s,t} = (\alpha_{s-2,t-1} + \alpha_{s-1,t-1} + \alpha_{s,t-1}) \cdot p_t(z_s \mid X)$$

Here $(\alpha_{s-2,t-1} + \alpha_{s-1,t-1} + \alpha_{s,t-1})$ is the CTC probability of the three valid subsequences after $t-1$ input steps, and $p_t(z_s \mid X)$ is the probability of the current character at input step $t$.
Below is an example of the computation performed by the dynamic programming algorithm. Every valid alignment has a path in this graph.
The dynamic programming graph for $Y =$ [a, b]: the horizontal axis is the input $X$ and the vertical axis is the output. Node $(s, t)$ in the diagram represents $\alpha_{s,t}$, the CTC score of the subsequence $Z_{1:s}$ after $t$ input steps.
There are two valid starting nodes and two valid final nodes since the $\epsilon$ at the beginning and end of the sequence is optional. The complete probability is the sum of the two final nodes.
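As a sketch of how this recursion might look in code (working in log space, which the Numerical Stability section later also recommends; the function and variable names are illustrative, not from a particular library):

```python
import numpy as np

NEG_INF = -np.inf

def ctc_forward(log_probs, labels, blank=0):
    """Compute log p(Y | X) with the CTC forward (alpha) recursion.

    log_probs: array of shape (T, V) of per time-step log probabilities.
    labels: non-empty list of label indices without blanks, e.g. [c, a, t].
    """
    T = log_probs.shape[0]
    # Z = [blank, y_1, blank, y_2, ..., y_U, blank]
    z = [blank]
    for label in labels:
        z += [label, blank]
    S = len(z)

    # alpha[s, t] is the log CTC score of Z[0:s+1] after t+1 input steps.
    alpha = np.full((S, T), NEG_INF)
    # Two valid start nodes: the leading blank or the first label.
    alpha[0, 0] = log_probs[0, blank]
    alpha[1, 0] = log_probs[0, z[1]]

    for t in range(1, T):
        for s in range(S):
            # Case 1: stay on the same node or come from the previous one.
            terms = [alpha[s, t - 1]]
            if s > 0:
                terms.append(alpha[s - 1, t - 1])
            # Case 2: may also skip z[s-1] when it is a blank sitting
            # between two different characters.
            if s > 1 and z[s] != blank and z[s] != z[s - 2]:
                terms.append(alpha[s - 2, t - 1])
            alpha[s, t] = np.logaddexp.reduce(terms) + log_probs[t, z[s]]

    # Two valid final nodes: the last label or the trailing blank.
    return np.logaddexp(alpha[S - 1, T - 1], alpha[S - 2, T - 1])
```

The training loss is then the negative of this value; a full implementation would also need gradients, for example via the backward recursion or automatic differentiation.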
Now that we can efficiently compute the loss function, the next step is to compute a gradient and train the model. The CTC loss function is differentiable with respect to the per time-step output probabilities since it’s just sums and products of them. Given this, we can analytically compute the gradient of the loss function with respect to the (unnormalized) output probabilities and from there run backpropagation as usual.
For a training set $\mathcal{D}$, the model’s parameters are tuned to minimize the negative log-likelihood

$$\sum_{(X, Y) \in \mathcal{D}} -\log\, p(Y \mid X)$$

instead of maximizing the likelihood directly.
Inference
After we’ve trained the model, we’d like to use it to find a likely output for a given input. More precisely, we need to solve:

$$Y^* = \mathop{\text{argmax}}\limits_{Y} \; p(Y \mid X)$$

One heuristic is to take the most likely output at each time-step. This gives us the alignment with the highest probability:

$$A^* = \mathop{\text{argmax}}\limits_{A} \; \prod_{t=1}^{T} p_t(a_t \mid X)$$

We can then collapse repeats and remove $\epsilon$ tokens to get $Y$.
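This best-path heuristic is only a few lines of code. A sketch, assuming the same (T, V) array of log probabilities as above:

```python
import numpy as np

def best_path_decode(log_probs, blank=0):
    """Take the most likely token at each time-step, merge repeats, drop blanks."""
    best_alignment = np.argmax(log_probs, axis=1)   # A*, the most likely alignment
    output = []
    prev = None
    for s in best_alignment:
        if s != prev and s != blank:
            output.append(int(s))
        prev = s
    return output
```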
For many applications this heuristic works well, especially when most of the probability mass is allotted to a single alignment. However, this approach can sometimes miss easy-to-find outputs with much higher probability. The problem is, it doesn’t take into account the fact that a single output can have many alignments.
Here’s an example. Assume the alignments [a, a, $\epsilon$] and [a, a, a] individually have lower probability than [b, b, b]. But the sum of their probabilities is actually greater than that of [b, b, b]. The naive heuristic will incorrectly propose $Y =$ [b] as the most likely hypothesis. It should have chosen $Y =$ [a]. To fix this, the algorithm needs to account for the fact that [a, a, a] and [a, a, $\epsilon$] collapse to the same output.
We can use a modified beam search to solve this. Given limited computation, the modified beam search won’t necessarily find the most likely $Y$. It does, at least, have the nice property that we can trade off more computation (a larger beam-size) for an asymptotically better solution.
A regular beam search computes a new set of hypotheses at each input step. The new set of hypotheses is generated from the previous set by extending each hypothesis with all possible output characters and keeping only the top candidates.
A standard beam search algorithm with an alphabet of $\{\epsilon, a, b\}$ and a beam size of three.
We can modify the vanilla beam search to handle multiple alignments mapping to the same output. In this case instead of keeping a list of alignments in the beam, we store the output prefixes after collapsing repeats and removing $\epsilon$ characters. At each step of the search we accumulate scores for a given prefix based on all the alignments which map to it.
The CTC beam search algorithm with an output alphabet $\{\epsilon, a, b\}$ and a beam size of three.
A proposed extension can map to two output prefixes if the character is a repeat. This is shown at $T = 3$ in the figure above where ‘a’ is proposed as an extension to the prefix [a]. Both [a] and [a, a] are valid outputs for this proposed extension.
When we extend [a] to produce [a, a], we only want to include the part of the previous score for alignments which end in $\epsilon$. Remember, the $\epsilon$ is required between repeat characters. Similarly, when we don’t extend the prefix and produce [a], we should only include the part of the previous score for alignments which don’t end in $\epsilon$.
Given this, we have to keep track of two probabilities for each prefix in the beam: the probability of all alignments which end in $\epsilon$, and the probability of all alignments which don’t end in $\epsilon$. When we rank the hypotheses at each step before pruning the beam, we’ll use their combined scores.
The implementation of this algorithm doesn’t require much code, but it is dense and tricky to get right. Check out this gist for an example implementation in Python.
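Since the linked gist isn’t reproduced here, the sketch below shows the core of the CTC prefix beam search without a language model. It keeps, for every prefix, the log probability of alignments ending in blank and not ending in blank; the names and structure are illustrative rather than a reference implementation:

```python
import math
from collections import defaultdict

NEG_INF = -float("inf")

def log_sum_exp(*args):
    """Stable log(exp(a_1) + ... + exp(a_n))."""
    a_max = max(args)
    if a_max == NEG_INF:
        return NEG_INF
    return a_max + math.log(sum(math.exp(a - a_max) for a in args))

def ctc_prefix_beam_search(log_probs, blank=0, beam_size=3):
    """log_probs: T x V list of lists of per time-step log probabilities."""
    # Each beam entry maps a prefix (tuple of labels) to (p_blank, p_no_blank):
    # the log probabilities of all alignments that collapse to the prefix
    # and end / don't end in blank.
    beam = {(): (0.0, NEG_INF)}

    for t in range(len(log_probs)):
        next_beam = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (p_b, p_nb) in beam.items():
            for s, p in enumerate(log_probs[t]):
                if s == blank:
                    # A blank doesn't change the prefix.
                    n_pb, n_pnb = next_beam[prefix]
                    next_beam[prefix] = (log_sum_exp(n_pb, p_b + p, p_nb + p), n_pnb)
                    continue
                new_prefix = prefix + (s,)
                n_pb, n_pnb = next_beam[new_prefix]
                if prefix and s == prefix[-1]:
                    # Repeat character: only alignments ending in blank extend the
                    # prefix; the others collapse back onto the old prefix.
                    next_beam[new_prefix] = (n_pb, log_sum_exp(n_pnb, p_b + p))
                    o_pb, o_pnb = next_beam[prefix]
                    next_beam[prefix] = (o_pb, log_sum_exp(o_pnb, p_nb + p))
                else:
                    next_beam[new_prefix] = (n_pb, log_sum_exp(n_pnb, p_b + p, p_nb + p))
        # Prune: keep the top prefixes ranked by their combined score.
        beam = dict(sorted(next_beam.items(),
                           key=lambda kv: log_sum_exp(*kv[1]),
                           reverse=True)[:beam_size])

    best, (p_b, p_nb) = max(beam.items(), key=lambda kv: log_sum_exp(*kv[1]))
    return list(best), log_sum_exp(p_b, p_nb)
```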
In some problems, such as speech recognition, incorporating a language model over the outputs significantly improves accuracy. We can include the language model as a factor in the inference problem.
$$Y^* = \mathop{\text{argmax}}\limits_{Y} \; p(Y \mid X) \cdot p(Y)^\alpha \cdot L(Y)^\beta$$

Here $p(Y \mid X)$ is the CTC conditional probability, $p(Y)^\alpha$ is the language model probability, and $L(Y)^\beta$ is the “word” insertion bonus.
The function $L(Y)$ computes the length of $Y$ in terms of the language model tokens and acts as a word insertion bonus. With a word-based language model $L(Y)$ counts the number of words in $Y$. If we use a character-based language model then $L(Y)$ counts the number of characters in $Y$. The language model scores are only included when a prefix is extended by a character (or word) and not at every step of the algorithm. This causes the search to favor shorter prefixes, as measured by $L(Y)$, since they don’t include as many language model updates. The word insertion bonus helps with this. The parameters $\alpha$ and $\beta$ are usually set by cross-validation.
The language model scores and word insertion term can be included in the beam search. Whenever we propose to extend a prefix by a character, we can include the language model score for the new character given the prefix so far.
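For example, when ranking hypotheses the search might combine the three terms in the log domain like this. The helper name and the default values of $\alpha$ and $\beta$ are hypothetical; in practice they come from cross-validation:

```python
import math

def rescore(log_p_ctc, log_p_lm, length, alpha=0.5, beta=1.5):
    """Combined log score: log p(Y|X) + alpha * log p(Y) + beta * log L(Y)."""
    return log_p_ctc + alpha * log_p_lm + beta * math.log(max(length, 1))
```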
Properties of CTC
We mentioned a few important properties of CTC so far. Here we’ll go into more depth on what these properties are and what trade-offs they offer.
Conditional Independence
One of the most commonly cited shortcomings of CTC is the conditional independence assumption it makes.
Graphical model for CTC.
The model assumes that every output is conditionally independent of the other outputs given the input. This is a bad assumption for many sequence-to-sequence problems.
Say we had an audio clip of someone saying “triple A”. Another valid transcription could be “AAA”. If the first letter of the predicted transcription is ‘A’, then the next letter should be ‘A’ with high probability and ‘r’ with low probability. The conditional independence assumption does not allow for this.
If we predict an ‘A’ as the first letter then the suffix ‘AA’ should get much more probability than ‘riple A’. If we predict ‘t’ first, the opposite should be true.
In fact speech recognizers using CTC don’t learn a language model over the output nearly as well as models which are conditionally dependent. However, a separate language model can be included and usually gives a good boost to accuracy.
The conditional independence assumption made by CTC isn’t always a bad thing. Baking in strong beliefs over output interactions makes the model less adaptable to new or altered domains. For example, we might want to use a speech recognizer trained on phone conversations between friends to transcribe customer support calls. The language in the two domains can be quite different even if the acoustic model is similar. With a CTC acoustic model, we can easily swap in a new language model as we change domains.
Alignment Properties
The CTC algorithm is alignment-free. The objective function marginalizes over all alignments. While CTC does make strong assumptions about the form of alignments between $X$ and $Y$, the model is agnostic as to how probability is distributed amongst them. In some problems CTC ends up allocating most of the probability to a single alignment. However, this isn’t guaranteed. We could force the model to choose a single alignment by replacing the sum with a max in the objective function,

$$p(Y \mid X) = \max_{A \in \mathcal{A}_{X,Y}} \; \prod_{t=1}^T \; p(a_t \mid X).$$

As mentioned before, CTC only allows monotonic alignments. In problems such as speech recognition this may be a valid assumption. For other problems like machine translation where a future word in a target sentence can align to an earlier part of the source sentence, this assumption is a deal-breaker.
Another important property of CTC alignments is that they are many-to-one. Multiple inputs can align to at most one output. In some cases this may not be desirable. We might want to enforce a strict one-to-one correspondence between elements of $X$ and $Y$. Alternatively, we may want to allow multiple output elements to align to a single input element. For example, the characters “th” might align to a single input step of audio. A character-based CTC model would not allow that.
The many-to-one property implies that the output can’t have more time-steps than the input. If $Y$ has $r$ consecutive repeats, then the length of $Y$ must be less than the length of $X$ by $2r - 1$. This is usually not a problem for speech and handwriting recognition since the input is much longer than the output. However, for other problems where $Y$ is often longer than $X$, CTC just won’t work.
CTC in Context
In this section we’ll discuss how CTC relates to other commonly used algorithms for sequence modeling.
HMMs
At first glance, a Hidden Markov Model (HMM) seems quite different from CTC. But the two algorithms are actually quite similar. Understanding the relationship between them will help us understand what advantages CTC has over HMM sequence models and give us insight into how CTC could be changed for various use cases.
Let’s use the same notation as before: $X$ is the input sequence and $Y$ is the output sequence with lengths $T$ and $U$ respectively. We’re interested in learning $p(Y \mid X)$. One way to simplify the problem is to apply Bayes’ Rule:

$$p(Y \mid X) \propto p(X \mid Y) \, p(Y).$$

The $p(Y)$ term can be any language model, so let’s focus on $p(X \mid Y)$. Like before we’ll let $\mathcal{A}$ be a set of allowed alignments between $X$ and $Y$. Members of $\mathcal{A}$ have length $T$. Let’s otherwise leave $\mathcal{A}$ unspecified for now; we’ll come back to it later. We can marginalize over alignments to get

$$p(X \mid Y) = \sum_{A \in \mathcal{A}} p(X, A \mid Y).$$

To simplify notation, let’s remove the conditioning on $Y$; it will be present in every $p(\cdot)$. With two assumptions we can write down the standard HMM.
$$p(X) = \sum_{A \in \mathcal{A}} \; \prod_{t=1}^T \; p(x_t \mid a_t) \cdot p(a_t \mid a_{t-1})$$

Here $p(X)$ is the probability of the input, the sum and product marginalize over alignments, $p(x_t \mid a_t)$ is the emission probability, and $p(a_t \mid a_{t-1})$ is the transition probability.
The first assumption is the usual Markov property. The state $a_t$ is conditionally independent of all historic states given the previous state $a_{t-1}$. The second is that the observation $x_t$ is conditionally independent of everything given the current state $a_t$.
The graphical model for an HMM.
Now we can take just a few steps to transform the HMM into CTC and see how the two models relate. First, let’s assume that the transition probabilities $p(a_t \mid a_{t-1})$ are uniform. This gives

$$p(X) \propto \sum_{A \in \mathcal{A}} \; \prod_{t=1}^T \; p(x_t \mid a_t).$$

There are only two differences between this equation and the CTC loss function. The first is that we are learning a model of $X$ given $Y$ as opposed to $Y$ given $X$. The second is how the set $\mathcal{A}$ is produced. Let’s deal with each in turn.
The HMM can be used with discriminative models which estimate $p(a \mid x)$. To do this, we apply Bayes’ rule and rewrite the model as

$$p(X) \propto \sum_{A \in \mathcal{A}} \; \prod_{t=1}^T \; \frac{p(a_t \mid x_t)\, p(x_t)}{p(a_t)} \propto \sum_{A \in \mathcal{A}} \; \prod_{t=1}^T \; \frac{p(a_t \mid x_t)}{p(a_t)}.$$

If we assume a uniform prior over the states $a$ and condition on all of $X$ instead of a single element at a time, we arrive at

$$p(X) \propto \sum_{A \in \mathcal{A}} \; \prod_{t=1}^T \; p(a_t \mid X).$$
The above equation is essentially the CTC loss function, assuming the set $\mathcal{A}$ is the same. In fact, the HMM framework does not specify what $\mathcal{A}$ should consist of. This part of the model can be designed on a per-problem basis. In many cases the model doesn’t condition on $Y$ and the set $\mathcal{A}$ consists of all possible length $T$ sequences from the output alphabet. In this case, the HMM can be drawn as an ergodic state transition diagram in which every state connects to every other state. The figure below shows this model with the alphabet or set of unique hidden states as $\{a, b, c\}$.
In our case the transitions allowed by the model are strongly related to $Y$. We want the HMM to reflect this. One possible model could be a simple linear state transition diagram. The figure below shows this with the same alphabet as before and $Y =$ [a, b]. Another commonly used model is the Bakis or left-right HMM. In this model any transition which proceeds from the left to the right is allowed.
Ergodic HMM: Any node can be either a starting or final state.
Linear HMM: Start on the left, end on the right.
CTC HMM: The first two nodes are the starting states and the last two nodes are the final states.
In CTC we augment the alphabet with $\epsilon$ and the HMM model allows a subset of the left-right transitions. The CTC HMM has two start states and two accepting states.
One possible source of confusion is that the HMM model differs for any unique $Y$. This is in fact standard in applications such as speech recognition. The state diagram changes based on the output $Y$. However, the functions which estimate the observation and transition probabilities are shared.
Let’s discuss how CTC improves on the original HMM model. First, we can think of the CTC state diagram as a special case HMM which works well for many problems of interest. Incorporating the blank as a hidden state in the HMM allows us to use the alphabet of $Y$ as the other hidden states. This model also gives a set of allowed alignments which may be a good prior for some problems.
Perhaps most importantly, CTC is discriminative. It models $p(Y \mid X)$ directly, an idea that’s been important in the past with other discriminative improvements to HMMs. Discriminative training lets us apply powerful learning algorithms like the RNN directly towards solving the problem we care about.
Encoder-Decoder Models
The encoder-decoder is perhaps the most commonly used framework for sequence modeling with neural networks. These models have an encoder and a decoder. The encoder maps the input sequence $X$ into a hidden representation. The decoder consumes the hidden representation and produces a distribution over the outputs. We can write this as

$$\begin{aligned} H &= \textsf{encode}(X) \\ p(Y \mid X) &= \textsf{decode}(H). \end{aligned}$$

The $\textsf{encode}(\cdot)$ and $\textsf{decode}(\cdot)$ functions are typically RNNs. The decoder can optionally be equipped with an attention mechanism. The hidden state sequence $H$ has the same number of time-steps as the input, $T$. Sometimes the encoder subsamples the input. If the encoder subsamples the input by a factor $s$ then $H$ will have $T/s$ time-steps.
We can interpret CTC in the encoder-decoder framework. This is helpful to understand the developments in encoder-decoder models that are applicable to CTC and to develop a common language for the properties of these models.
Encoder: The encoder of a CTC model can be just about any encoder we find in commonly used encoder-decoder models. For example the encoder could be a multi-layer bidirectional RNN or a convolutional network. There is a constraint on the CTC encoder that doesn’t apply to the others. The input length cannot be sub-sampled so much that $T/s$ is less than the length of the output.
Decoder: We can view the decoder of a CTC model as a simple linear transformation followed by a softmax normalization. This layer should project all $T$ steps of the encoder output $H$ into the dimensionality of the output alphabet.
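As a simplified picture, the CTC “decoder” applied to an encoder output $H$ of shape $(T, d)$ is just a shared linear layer and a per time-step softmax. A NumPy sketch with made-up parameter names W and b:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def ctc_decoder(H, W, b):
    """Project every time-step of H (T x d) to the output alphabet of size
    V = W.shape[1], then normalize, giving the distributions p_t(a | X)."""
    logits = H @ W + b      # shape (T, V)
    return softmax(logits, axis=-1)
```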
We mentioned earlier that CTC makes a conditional independence assumption over the characters in the output sequence. This is one of the big advantages that other encoder-decoder models have over CTC: they can model the dependence over the outputs. However, in practice CTC is still more commonly used in tasks like speech recognition, as we can partially overcome the conditional independence assumption by including an external language model.
Practitioner’s Guide
So far we’ve mostly developed a conceptual understanding of CTC. Here we’ll go through a few implementation tips for practitioners.
Software: Even with a solid understanding of CTC, the implementation is difficult. The algorithm has several edge cases and a fast implementation should be written in a lower-level programming language. Open-source software tools make it much easier to get started:
Baidu Research has open-sourced warp-ctc. The package is written in C++ and CUDA. The CTC loss function runs on either the CPU or the GPU. Bindings are available for Torch, TensorFlow and PyTorch.
TensorFlow has built-in CTC loss and CTC beam search functions for the CPU.
Nvidia also provides a GPU implementation of CTC in cuDNN versions 7 and up.
Numerical Stability: Computing the CTC loss naively is numerically unstable. One method to avoid this is to normalize the $\alpha$’s at each time-step. The original publication has more detail on this including the adjustments to the gradient. In practice this works well enough for medium length sequences but can still underflow for long sequences. A better solution is to compute the loss function in log-space with the log-sum-exp trick. When computing the sum of two probabilities in log space use the identity

$$\log(e^a + e^b) = \max\{a, b\} + \log(1 + e^{-|a - b|})$$

Most programming languages have a stable function to compute $\log(1 + x)$ when $x$ is close to zero. Inference should also be done in log-space using the log-sum-exp trick.
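A minimal log-space addition helper based on this identity, using Python’s math.log1p for the stable $\log(1 + x)$:

```python
import math

def log_add(a, b):
    """Stable computation of log(exp(a) + exp(b))."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))
```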
Beam Search: There are a couple of good tips to know about when implementing and using the CTC beam search.
The correctness of the beam search can be tested as follows.
Run the beam search algorithm on an arbitrary input.
Save the inferred output $\bar{Y}$ and the corresponding score $\bar{c}$.
Compute the actual CTC score $c$ for $\bar{Y}$.
Check that $\bar{c} \approx c$ with the former being no greater than the latter. As the beam size increases the inferred output $\bar{Y}$ may change, but the two numbers should grow closer.
A common question when using a beam search decoder is the size of the beam to use. There is a trade-off between accuracy and runtime. We can check if the beam size is in a good range. To do this first compute the CTC score for the inferred output, $c_i$. Then compute the CTC score for the ground truth output, $c_g$. If the two outputs are not the same, we should have $c_g < c_i$. If $c_i \ll c_g$ then the ground truth output actually has a higher probability under the model and the beam search failed to find it. In this case a large increase to the beam size may be warranted.
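A sketch of such a diagnostic, assuming the ctc_forward and ctc_prefix_beam_search sketches from earlier and a non-empty inferred output:

```python
import numpy as np

def check_beam(log_probs, ground_truth, beam_size, blank=0):
    """Compare the CTC scores of the inferred output and the ground truth."""
    inferred, _ = ctc_prefix_beam_search(log_probs.tolist(), blank=blank,
                                         beam_size=beam_size)
    c_i = ctc_forward(log_probs, inferred, blank=blank)       # log p of inferred output
    c_g = ctc_forward(log_probs, list(ground_truth), blank=blank)  # log p of ground truth
    if inferred != list(ground_truth) and c_i < c_g:
        print("Ground truth scores higher than the inferred output;"
              " consider a larger beam.")
    return c_i, c_g
```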