ABSTRACT
In this paper we present Rosetta, a deployed, scalable optical character recognition (OCR) system designed to process images uploaded daily at Facebook scale. Sharing image content has become one of the primary ways internet users communicate on social networks such as Facebook, and understanding such media, including its textual information, is essential for search and recommendation applications. We present modeling techniques for efficient detection and recognition of text in images and describe Rosetta's system architecture. We extensively evaluate the presented technologies, explain practical approaches to building an OCR system at scale, and offer intuitions as to why and how certain components work, based on lessons learned during the development and deployment of the system.