Number plate recognition with Tensorflow

Introduction

Over the past few weeks I’ve been dabbling with deep learning, in particular convolutional neural networks. One standout paper from recent times is Google’s Multi-digit Number Recognition from Street View. This paper describes a system for extracting house numbers from street view imagery using a single end-to-end neural network. The authors then go on to explain how the same network can be applied to breaking Google’s own CAPTCHA system with human-level accuracy.

In order to get some hands-on experience with implementing neural networks I decided I’d design a system to solve a similar problem: automated number plate recognition (automated license plate recognition if you’re in the US). My reasons for doing this are three-fold:

  • I should be able to use the same (or a similar) network architecture as the Google paper: The Google architecture was shown to work equally well at solving CAPTCHAs, as such it’s reasonable to assume that it’d perform well on reading number plates too. Having a known-good network architecture will greatly simplify things as I learn the ropes of CNNs.

  • I can easily generate training data. One of the major issues with training neural networks is the requirement for lots of labelled training data. Hundreds of thousands of labelled training images are often required to properly train a network. Fortunately, the relative uniformity of UK number plates means I can synthesize training data.

  • Curiosity. Traditional ANPR systems have relied on hand-written algorithms for plate localization, normalization, segmentation, character recognition and so on. As such these systems tend to be many thousands of lines long. It’d be interesting to see how good a system I can develop with minimal domain-specific knowledge and a relatively small amount of code.

For this project I’ve used Python, TensorFlow, OpenCV and NumPy. Source code is available here.

Inputs, outputs and windowing

In order to simplify generating training images and to reduce computational requirements, I decided my network would operate on 128x64 grayscale input images.

128x64 was chosen as the input resolution as this is small enough to permit training in a reasonable amount of time with modest resources, but also large enough for number plates to be somewhat readable:

Image credit

In order to detect number plates in larger images, a sliding window approach is used at various scales:

Image credit

The image on the right is the 128x64 input that the neural net sees, whereas the left shows the window in the context of the original input image.

For each window the network should output:

  • The probability a number plate is present in the input image. (Shown as a green box in the above animation.)

  • The probability of the digit in each position, i.e. for each of the 7 possible positions it should return a probability distribution across the 36 possible characters. (For this project I assume number plates have exactly 7 characters, as is the case with most UK number plates.)

A plate is considered present if and only if:

  • The plate falls entirely within the image bounds.

  • The plate’s width is less than 80% of the image’s width, and the plate’s height is less than 87.5% of the image’s height.

  • The plate’s width is greater than 60% of the image’s width, or the plate’s height is greater than 60% of the image’s height.
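
These conditions can be captured in a small predicate. Here’s a minimal sketch (the function name and default window size are my own; the real labelling logic lives in gen.py):

```python
def plate_present(plate_w, plate_h, img_w=128, img_h=64, in_bounds=True):
    # A plate counts as "present" only if it lies fully inside the image,
    # is not too large in either dimension, and is big enough in at least
    # one dimension.
    small_enough = plate_w < 0.8 * img_w and plate_h < 0.875 * img_h
    big_enough = plate_w > 0.6 * img_w or plate_h > 0.6 * img_h
    return in_bounds and small_enough and big_enough
```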

With these numbers we can use a sliding window that moves 8 pixels at a time and zooms in by a fixed factor between zoom levels, and be guaranteed not to miss any plates, while at the same time not generating an excessive number of matches for any single plate. Any duplicates that do occur are combined in a post-processing step (explained later).
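
The windowing loop can be sketched as follows (the zoom factor here is a stand-in; the value used in practice follows from the size constraints above):

```python
def sliding_windows(img_w, img_h, win_w=128, win_h=64, stride=8, zoom=1.3):
    # Yield (scale, x, y) for a win_w x win_h window placed at every
    # stride-spaced position, at every zoom level that still fits.
    scale = 1.0
    while win_w * scale <= img_w and win_h * scale <= img_h:
        w, h = int(win_w * scale), int(win_h * scale)
        for y in range(0, img_h - h + 1, stride):
            for x in range(0, img_w - w + 1, stride):
                yield scale, x, y
        scale *= zoom
```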

Synthesizing images

To train any neural net, a set of training data along with correct outputs must be provided. In this case this will be a set of 128x64 images along with the expected output. Here’s an illustrative sample of training data generated for this project:

  • expected output HH41RFP 1.
  • expected output FB78PFD 1.
  • expected output JW01GAI 0. (Plate partially truncated.)
  • expected output AM46KVG 0. (Plate too small.)
  • expected output XG86KIO 0. (Plate too big.)
  • expected output XH07NYO 0. (Plate not present at all.)

The first part of the expected output is the number the net should output. The second part is the “presence” value that the net should output. For data labelled as not present I’ve included an explanation in brackets.

The process for generating the images is illustrated below:

The text and plate colour are chosen randomly, but the text must be a certain amount darker than the plate. This is to simulate real-world lighting variation. Noise is added at the end not only to account for actual sensor noise, but also to avoid the network depending too much on sharply defined edges, which would be absent from an out-of-focus input image.
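
The colour constraint can be implemented by rejection sampling. A minimal sketch (the 0.3 contrast margin is an illustrative value, not the one used in gen.py):

```python
import random

def pick_colours(min_contrast=0.3):
    # Draw plate and text greys (0 = black, 1 = white) until the text is
    # at least min_contrast darker than the plate.
    while True:
        plate, text = random.uniform(0, 1), random.uniform(0, 1)
        if plate - text > min_contrast:
            return plate, text
```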

Having a background is important as it means the network must learn to identify the bounds of the number plate without “cheating”: Were a black background used, for example, the network may learn to identify plate location based on non-blackness, which would clearly not work with real pictures of cars.

The backgrounds are sourced from the SUN database, which contains over 100,000 images. It’s important that the number of images is large, to avoid the network “memorizing” background images.

The transformation applied to the plate (and its mask) is an affine transformation based on a random roll, pitch, yaw, translation, and scale. The range allowed for each parameter was selected according to the ranges in which number plates are likely to be seen. For example, yaw is allowed to vary much more than roll (you’re more likely to see a car turning a corner than one on its side).
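
One way to build such a transformation is to compose a 3D rotation from the Euler angles and then flatten it to 2D. A sketch with NumPy (all parameter ranges here are illustrative guesses, not the ones in gen.py):

```python
import numpy as np

def random_plate_transform(rng):
    # Illustrative ranges: yaw gets the widest range, roll the narrowest,
    # per the reasoning above.
    roll = rng.uniform(-0.3, 0.3)
    pitch = rng.uniform(-0.2, 0.2)
    yaw = rng.uniform(-1.2, 1.2)
    scale = rng.uniform(0.6, 0.9)
    trans = rng.uniform(-10, 10, size=2)

    # Compose the 3D rotation, then keep the top-left 2x2 block: viewed
    # from afar, a rotated plane maps to the image plane by
    # (approximately) an affine transformation.
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    rz = np.array([[cr, -sr, 0.], [sr, cr, 0.], [0., 0., 1.]])
    rx = np.array([[1., 0., 0.], [0., cp, -sp], [0., sp, cp]])
    ry = np.array([[cy, 0., sy], [0., 1., 0.], [-sy, 0., cy]])
    m = (rz @ rx @ ry)[:2, :2] * scale
    return np.hstack([m, trans.reshape(2, 1)])  # 2x3, as cv2.warpAffine expects
```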

The code to generate the images is relatively short (~300 lines). It can be read in gen.py.

The network

Here’s the network architecture used:

See the Wikipedia page for a summary of CNN building blocks. The above network is in fact based on this paper by Stark et al., as it gives more specifics about the architecture used than the Google paper.

The output layer has one node (shown on the left) which is used as the presence indicator. The rest encode the probability of a particular number plate: Each column as shown in the diagram corresponds with one of the digits in the number plate, and each node gives the probability of the corresponding character being present. For example, the node in column 2, row 3 gives the probability that the second digit is a C.

As is standard with deep neural nets, all but the output layers use ReLU activation. The presence node has sigmoid activation, as is typically used for binary outputs. The other output nodes use softmax across characters (i.e. so that the probability in each column sums to one), which is the standard approach for modelling discrete probability distributions.
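
In NumPy terms, the output activations amount to the following (decode_outputs is my own name; the shapes follow the 7-position, 36-character layout described above):

```python
import numpy as np

def decode_outputs(presence_logit, char_logits):
    # presence_logit: scalar raw output; char_logits: shape (7, 36),
    # one row per plate position, one column per character.
    presence = 1.0 / (1.0 + np.exp(-presence_logit))       # sigmoid
    shifted = char_logits - char_logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    char_probs = exp / exp.sum(axis=1, keepdims=True)      # per-position softmax
    return presence, char_probs
```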

The code defining the network is in model.py.

The loss function is defined in terms of the cross-entropy between the label and the network output. For numerical stability, the activation functions of the final layer are rolled into the cross-entropy calculation using softmax_cross_entropy_with_logits and sigmoid_cross_entropy_with_logits. For a detailed and intuitive introduction to cross-entropy, see this section in Michael A. Nielsen’s free online book.
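
The stability point can be seen in a NumPy re-derivation of the fused operation (a sketch of what softmax_cross_entropy_with_logits computes, not TensorFlow’s actual implementation):

```python
import numpy as np

def softmax_xent_with_logits(labels, logits):
    # Computing softmax and then log separately overflows for large
    # logits; fusing them via the log-sum-exp trick stays stable.
    m = logits.max(axis=-1, keepdims=True)
    log_z = m + np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))
    return -(labels * (logits - log_z)).sum(axis=-1)
```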

Training (train.py) takes about 6 hours using an NVIDIA GTX 970, with training data being generated on-the-fly by a background process on the CPU.

Output Processing

To actually detect and recognize number plates in an input image, a network much like the above is applied to 128x64 windows at various positions and scales, as described in the windowing section.

The network differs from the one used in training in that the last two layers are convolutional rather than fully connected, and the input image can be any size rather than 128x64. The idea is that the whole image at a particular scale can be fed into this network, which yields an image with presence / character probability values at each “pixel”. Adjacent windows share many convolutional features, so rolling them into the same network avoids calculating the same features multiple times.
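
The equivalence between a fully-connected layer and a convolution can be seen in a toy NumPy example (sizes here are illustrative, far smaller than the real feature maps):

```python
import numpy as np

rng = np.random.default_rng(0)
feat = rng.standard_normal((6, 6))  # feature map larger than the training size
w = rng.standard_normal((4, 4))     # fully-connected weights, reshaped to 2D

# Dense output for the top-left 4x4 window:
dense_out = np.sum(feat[:4, :4] * w)

# Convolutional application: slide the same weights over every window,
# producing one output per window position instead of a single value.
conv_out = np.array([[np.sum(feat[i:i + 4, j:j + 4] * w)
                      for j in range(3)] for i in range(3)])

assert np.isclose(conv_out[0, 0], dense_out)
```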

Visualizing the “presence” portion of the output yields something like the following:

Image credit

The boxes here are regions where the network detects a greater than 99% probability that a number plate is present. The reason for the high threshold is to account for a bias introduced in training: About half of the training images contained a number plate, whereas in real-world images of cars number plates are much rarer. As such, if a 50% threshold is used the detector is prone to false positives.

To cope with the obvious duplicates, we apply a form of non-maximum suppression to the output:

Image credit

The technique used here first groups together overlapping rectangles, and for each group outputs:

  • The intersection of all the bounding boxes.
  • The license number corresponding with the box in the group that had the highest probability of being present.
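
A sketch of this grouping in pure Python (the detection tuple format and function names are my own, not those of detect.py):

```python
def overlaps(a, b):
    # Axis-aligned boxes (x1, y1, x2, y2) overlap iff they intersect
    # on both axes.
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def suppress(detections):
    # detections: list of (box, presence_prob, plate_text). Transitively
    # group overlapping boxes, then emit each group's box intersection
    # together with the text of its most confident member.
    groups = []
    for det in detections:
        hits = [g for g in groups if any(overlaps(det[0], d[0]) for d in g)]
        for g in hits:
            groups.remove(g)
        groups.append(sum(hits, []) + [det])
    results = []
    for g in groups:
        x1 = max(d[0][0] for d in g)
        y1 = max(d[0][1] for d in g)
        x2 = min(d[0][2] for d in g)
        y2 = min(d[0][3] for d in g)
        best = max(g, key=lambda d: d[1])
        results.append(((x1, y1, x2, y2), best[2]))
    return results
```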

Here’s the detector applied to the image at the top of this post:

Image credit

Whoops, the R has been misread as a P. Here’s the window from the above image which gives the maximum presence response:

Image credit

On first glance it appears that this should be an easy case for the detector; however, it turns out to be an instance of overfitting. Here’s the R from the number plate font used to generate the training images:

Note how the leg of the R is at a different angle to the leg of the R in the input image. The network has only ever seen R’s as shown above, so it gets confused when it sees R’s in a different font. To test this hypothesis, I modified the image in GIMP to more closely resemble the training font:

And sure enough, the detector now gets the correct result:

The code for the detector is in detect.py.

Conclusion

I’ve shown that with a relatively small amount of code (~800 lines), it’s possible to build an ANPR system without importing any domain-specific libraries, and with very little domain-specific knowledge. Furthermore, I’ve side-stepped the problem of needing thousands of training images (as is usually the case with deep neural networks) by synthesizing images on the fly.

On the other hand, my system has a number of drawbacks:

  1. It only works with number plates in a specific format. More specifically, the network architecture assumes exactly 7 characters are visible in the output.

  2. It only works on specific number plate fonts.

  3. It’s slow. The system takes several seconds to run on a moderately sized image.

The Google team solves 1) by splitting the higher levels of their network into different sub-networks, each one assuming a different number of digits in the output. A parallel sub-network then decides how many digits are present. I suspect this approach would work here; however, I’ve not implemented it for this project.

I showed an instance of 2) above, with the misdetection of an R due to a slightly varied font. The effects would be further exacerbated if I were trying to detect US number plates rather than UK number plates, since US plates have much more varied fonts. One possible solution would be to make my training data more varied by drawing from a selection of fonts, although it’s not clear how many fonts I would need for this approach to be successful.

The slowness (3) is a killer for many applications: A modestly sized input image takes a few seconds to process on a reasonably powerful GPU. I don’t think it’s possible to get away from this without introducing a (cascade of) detection stages, for example a Haar cascade, a HOG detector, or a simpler neural net.

It would be an interesting exercise to see how other ML techniques compare; in particular, pose regression (with the pose being an affine transformation corresponding with 3 corners of the plate) looks promising. A much more basic classification stage could then be tacked on the end. This solution should be similarly terse if an ML library such as scikit-learn is used.

In conclusion, I’ve shown that a single CNN (with some filtering) can be used as a passable number plate detector / recognizer; however, it does not yet compete with the traditional hand-crafted (but more verbose) pipelines in terms of performance.

Image Credits

Original “Proton Saga EV” image by Somaditya Bandyopadhyay, licensed under the Creative Commons Attribution-Share Alike 2.0 Generic license.

Original “Google Street View Car” image by Reedy, licensed under the Creative Commons Attribution-Share Alike 3.0 Unported license.


