skip to main content
10.1145/3196494.3196504acmconferencesArticle/Chapter ViewAbstractPublication Pagesasia-ccsConference Proceedingsconference-collections
short-paper
Public Access

BCD: Decomposing Binary Code Into Components Using Graph-Based Clustering

Published: 29 May 2018 Publication History

Abstract

Complex software is built by composing components implementing largely independent blocks of functionality. However, once the sources are compiled into an executable, that modularity is lost. This is unfortunate for code recipients, for whom knowing the components has many potential benefits, such as improved program understanding for reverse-engineering, identifying shared code across different programs, binary code reuse, and authorship attribution. A novel approach for decomposing such source-free program executables into components is here proposed. Given an executable, the approach first statically builds a decomposition graph, where nodes are functions and edges capture three types of relationships: code locality, data references, and function calls. It then applies a graph-theoretic approach to partition the functions into disjoint components. A prototype implementation, BCD, demonstrates the approach's efficacy: Evaluation of BCD with 25 C++ binary programs to recover the methods belonging to each class achieves high precision and recall scores for these tested programs.

References

[1]
Saed Alrabaee, Noman Saleem, Stere Preda, Lingyu Wang, and Mourad Debbabi. 2014. OBA2: An Onion Approach to Binary Code Authorship Attribution. In Digital Investigation, Vol. 11. S94--S103.
[2]
Nicolas Anquetil and Timothy C. Lethbridge. 1999. Experiments with Clustering as a Software Remodularization Method. In Proc. 6th Working Conf. Reverse Engineering (WCRE). 235--255.
[3]
Tiffany Bao, Jonathan Burket, Maverick Woo, Rafael Turner, and David Brumley. 2014. BYTEWEIGHT: Learning to Recognize Functions in Binary Code. In Proc. 23rd USENIX Security Sym. 845--860.
[4]
David Brumley, JongHyup Lee, Edward J. Schwartz, and Maverick Woo. 2013. Native x86 Decompilation Using Semantics-preserving Structural Analysis and Iterative Control-flow Structuring. In Proc. 22nd USENIX Security Sym.
[5]
Juan Caballero, Noah M. Johnson, Stephen McCamant, and Dawn Song. 2010. Binary Code Extraction and Interface Identification for Security Applications. In Proc. 17th Annual Network &Distributed System Security Sym. (NDSS).
[6]
Juan Caballero, Pongsin Poosankam, Stephen McCamant, Domagoj Babic, and Dawn Song. 2010. Input Generation Via Decomposition and Re-Stitching: Finding Bugs in Malware. In Proc. 17th ACM Conf. Computer and Communications Security (CCS). 413--425.
[7]
Richard Cole, Lee-Ad Gottlieb, and Moshe Lewenstein. 2004. Dictionary Matching and Indexing with Errors and Don't Cares. In Proc. 36th Annual ACM Sym. Theory of Computing (STOC). 91--100.
[8]
Chris Eagle. 2008. The IDA Pro Book: The Unofficial Guide to the World's Most Popular Disassembler. No Starch Press, San Francisco, CA, USA.
[9]
Manuel Egele, Theodoor Scholte, Engin Kirda, and Christopher Kruegel. 2012. A Survey on Automated Dynamic Malware-analysis Techniques and Tools. ACM Computing Surveys (CSUR) 44, 2 (2012).
[10]
Manuel Egele, Maverick Woo, Peter Chapman, and David Brumley. 2014. Blanket Execution: Dynamic Similarity Testing for Program Binaries and Components. In Proc. 23rd USENIX Security Sym.
[11]
M. Van Emmerik and T. Waddington. 2004. Using a Decompiler for Real-world Source Recovery. In Proc. 11th Working Conf. Reverse Engineering (WCRE). 27--36.
[12]
Free Software Foundation. 1983. GNU Software Repository. www.gnu.org/ software/software.html. (1983). Retrieved 3/30/2018.
[13]
Debin Gao, Michael K. Reiter, and Dawn Song. 2008. Binhunt: Automatically finding semantic differences in binary programs. In ICICS.
[14]
Emily R. Jacobson, Nathan Rosenblum, and Barton P. Miller. 2011. Labeling Library Functions in Stripped Binaries. In Proc. 10th ACM SIGPLAN-SIGSOFT Work. Program Analysis for Software Tools and Engineering (PASTE). 1--8.
[15]
Lingxiao Jiang and Zhendong Su. 2009. Automatic Mining of Functionally Equiv- alent Code Fragments Via Random Testing. In Proc. 18th Int. Sym. Software Testing and Analysis (ISSTA). 81--92.
[16]
Wei Ming Khoo, Alan Mycroft, and Ross Anderson. 2013. Rendezvous: A Search Engine for Binary Code. In Proc. 10th Working Conf. Mining Software Repositories (MSR). 329--338.
[17]
Dohyeong Kim, William N. Sumner, Xiangyu Zhang, Dongyan Xu, and Hira Agrawal. 2014. Reuse-oriented Reverse Engineering of Functional Components From x86 Binaries. In Proc. 36th Int. Conf. Software Engineering (ICSE). 1128--1139.
[18]
Clemens Kolbitsch, Thorsten Holz, Christopher Kruegel, and Engin Kirda. 2010. Inspector Gadget: Automated Extraction of Proprietary Gadgets From Malware Binaries. In Proc. 31st IEEE Sym. Security &Privacy (S&P).
[19]
Spiros Mancoridis, Brian S. Mitchell, Yihfarn Chen, and Emden R. Gansner. 1999. Bunch: A Clustering Tool for the Recovery and Maintenance of Software System Structures. In Proc. IEEE Int. Conf. Software Maintenance (ICSM). 50--59.
[20]
Xiaozhu Meng. 2016. Fine-grained Binary Code Authorship Identification. In Proc. 24th ACM SIGSOFT Int. Sym. Foundations Software Engineering (FSE). 1097--1099.
[21]
Microsoft. 2007. Visual Studio sample codes. code.msdn.microsoft.com/vstudio. (2007). Retrieved 3/30/2018.
[22]
Mark E.J. Newman. 2004. Fast Algorithm for Detecting Community Structure in Networks. Physical Review E 69, 6 (2004), 066133.
[23]
Beng Heng Ng and Atul Prakash. 2013. Exposé: Discovering Potential Binary Code Re-use. In Proc. 37th IEEE Annual Computer Software and Applications Conf. (COMPSAC). 492--501.
[24]
Andreas Noack. 2009. LinLogLayout: Graph Clustering and Force-directed Graph Layout. code.google.com/p/linloglayout. (2009). Retrieved 3/30/2018.
[25]
Arzucan Özgür, Levent Özgür, and Tunga Güngör. 2005. Text Categorization with Class-based and Corpus-based Keyword Selection. In Proc. 20th Int. Sym. Computer and Information Sciences (ISCIS). 606--615.
[26]
Nathan Rosenblum, Xiaojin Zhu, and Barton P. Miller. 2011. Who Wrote This Code? Identifying the Authors of Program Binaries. In Proc. 16th European Conf. Research in Computer Security (ESORICS). 172--189.
[27]
Andreas Sæbjørnsen, Jeremiah Willcock, Thomas Panas, Daniel Quinlan, and Zhendong Su. 2009. Detecting Code Clones in Binary Executables. In Proc. 18th Int. Sym. Software Testing and Analysis (ISSTA). 117--128.
[28]
Slashdot Media. 1999. SourceForge. sourceforge.net. (1999). Retrieved 3/30/2018.
[29]
Randy Smith and Susan Horwitz. 2009. Detecting and Measuring Similarity in Code Clones. In Proc. 3rd REF/TCSE Int. Work. Software Clones (IWSC). 28--34.
[30]
Venkatesh Srinivasan and Thomas Reps. 2014. Recovery of Class Hierarchies and Composition Relationships From Machine Code. In Proc. 23rd Int. Conf. Compiler Construction (CC). 61--84.
[31]
Wenhao Wang, Xiaoyang Xu, and Kevin W. Hamlen. 2017. Object Flow Integrity. In Proceedings of the 24th ACM Conference on Computer and Communications Security (CCS). 1909--1924.
[32]
Xinran Wang, Yoon-Chan Jhi, Sencun Zhu, and Peng Liu. 2009. Behavior Based Software Theft Detection. In Proc. 16th ACM Conf. Computer and Communications Security (CCS). 280--290.
[33]
Fangfang Zhang, Yoon-Chan Jhi, Dinghao Wu, Peng Liu, and Sencun Zhu. 2012. A First Step Towards Algorithm Plagiarism Detection. In Proc. 21st Int. Sym. Software Testing and Analysis (ISSTA). 111--121.
[34]
Zynamics. 2004. BinDiff. www.zynamics.com/bindiff.html. (2004). Retrieved 3/30/2018.

Cited By

View all
  • (2024)DeLink: Source File Information Recovery in BinariesProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis10.1145/3650212.3680338(1009-1021)Online publication date: 11-Sep-2024
  • (2024)PromeTrans: Bootstrap binary functionality classification with knowledge transferred from pre-trained modelsEmpirical Software Engineering10.1007/s10664-024-10593-y30:1Online publication date: 27-Nov-2024
  • (2023)ModDiff: Modularity Similarity-Based Malware Homologation DetectionElectronics10.3390/electronics1210225812:10(2258)Online publication date: 16-May-2023
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
ASIACCS '18: Proceedings of the 2018 on Asia Conference on Computer and Communications Security
May 2018
866 pages
ISBN:9781450355766
DOI:10.1145/3196494
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 29 May 2018

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. binary code decomposition
  2. components
  3. graph-based clustering

Qualifiers

  • Short-paper

Funding Sources

Conference

ASIA CCS '18
Sponsor:

Acceptance Rates

ASIACCS '18 Paper Acceptance Rate 52 of 310 submissions, 17%;
Overall Acceptance Rate 418 of 2,322 submissions, 18%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)205
  • Downloads (Last 6 weeks)31
Reflects downloads up to 20 Feb 2025

Other Metrics

Citations

Cited By

View all
  • (2024)DeLink: Source File Information Recovery in BinariesProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis10.1145/3650212.3680338(1009-1021)Online publication date: 11-Sep-2024
  • (2024)PromeTrans: Bootstrap binary functionality classification with knowledge transferred from pre-trained modelsEmpirical Software Engineering10.1007/s10664-024-10593-y30:1Online publication date: 27-Nov-2024
  • (2023)ModDiff: Modularity Similarity-Based Malware Homologation DetectionElectronics10.3390/electronics1210225812:10(2258)Online publication date: 16-May-2023
  • (2023)Improving Security Tasks Using Compiler Provenance Information Recovered At the Binary-LevelProceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security10.1145/3576915.3623098(2695-2709)Online publication date: 15-Nov-2023
  • (2022)ModXProceedings of the 44th International Conference on Software Engineering10.1145/3510003.3510627(1393-1405)Online publication date: 21-May-2022
  • (2022)Software Module Clustering: An In-Depth Literature AnalysisIEEE Transactions on Software Engineering10.1109/TSE.2020.304255348:6(1905-1928)Online publication date: 1-Jun-2022
  • (2022)DeMal: Module decomposition of malware based on community discoveryComputers & Security10.1016/j.cose.2022.102680117(102680)Online publication date: Jun-2022
  • (2022)Improving binary diffing speed and accuracy using community detection and locality-sensitive hashing: an empirical studyJournal of Computer Virology and Hacking Techniques10.1007/s11416-022-00452-z19:2(319-337)Online publication date: 18-Oct-2022
  • (2021)A Survey of Binary Code SimilarityACM Computing Surveys10.1145/344637154:3(1-38)Online publication date: 17-Apr-2021
  • (2019)Function Identification in Android Binaries with Deep Learning2019 Seventh International Symposium on Computing and Networking (CANDAR)10.1109/CANDAR.2019.00019(92-101)Online publication date: Nov-2019

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media