research-article

Rosetta: Large Scale System for Text Detection and Recognition in Images

Authors:
Fedor Borisyuk, Facebook Inc., Menlo Park, CA, USA
Albert Gordo, Facebook Inc., Menlo Park, CA, USA
Viswanath Sivakumar, Facebook Inc., Menlo Park, CA, USA
KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, July 2018, Pages 71–79. https://doi.org/10.1145/3219819.3219861

Published: 19 July 2018

ABSTRACT

In this paper we present a deployed, scalable optical character recognition (OCR) system, which we call Rosetta, designed to process images uploaded daily at Facebook scale. Sharing of image content has become one of the primary ways to communicate information among internet users within social networks such as Facebook, and the understanding of such media, including its textual information, is of paramount importance to facilitate search and recommendation applications. We present modeling techniques for efficient detection and recognition of text in images and describe Rosetta's system architecture. We perform extensive evaluation of the presented technologies, explain useful practical approaches to build an OCR system at scale, and provide insightful intuitions as to why and how certain components work based on the lessons learnt during the development and deployment of the system.

Supplemental Material

borisyuk_rosetta.mp4 (mp4, 291.3 MB)

Index Terms

Applied computing → Document management and text processing → Document capture → Optical character recognition

Published in
KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
July 2018
2925 pages
ISBN: 9781450355520
DOI: 10.1145/3219819
General Chairs: Yike Guo (Imperial College London), Faisal Farooq (IBM)
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
optical character recognition
text detection
text recognition
Acceptance Rates
KDD '18 Paper Acceptance Rate: 107 of 983 submissions, 11%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%