AI-driven reverse engineering of Java applications

Conference July 1, 2025 Uncategorized

Traditional reverse engineering

Understanding the inner workings of an existing Java project, whether proprietary or open source, is a common task. It can range from something as simple as decompiling and reviewing the source code of a library to understanding how a large codebase is architected, built and deployed. In many cases developers look for documentation that describes, in detail and with examples, the concepts implemented in the target project, but quite often such documentation is simply missing. For decompilation there are tools like JD or IDE-specific decompiler plugins that do the job straight away. However, when we face a completely new and unknown code repository there are a number of things we typically start with to understand how it is structured:

What build tools do we use? Are there any build tool-specific plugins and configuration we should consider during the build? Here we need to review the build configuration files, such as pom.xml or build.gradle, to understand the build process.
How is the code structured? What is the relationship between the different packages and classes? Here we can generate UML class diagrams to visually analyse and comprehend how the code is structured.
How do the different objects interact with each other? Here we can generate UML sequence diagrams to visually analyse and comprehend this information.
What proprietary and open source libraries do we use? How do they work in general and where are they used in the project? This typically requires a bit of digging and research if we haven’t used a particular library and want to understand, at least at a very basic level, how it works.
What patterns and code smells are found in the code? Describing them requires a good understanding of design patterns and of general bad coding practices specifically related to Java applications.
What application-specific configuration do we use and for what purpose? This requires reviewing how the application stores its configuration: via a plain Spring application.yml, generated from Kubernetes ConfigMaps, or served by a config server like Spring Cloud Config, to give a few examples.
What external systems do we interact with and via what protocols? In many cases this boils down to the previous point, since this information is stored in the application configuration, but it can also come from other places, e.g. an external database the application interacts with.
How is the application initialized upon startup? This typically requires starting manually from the main() method and diving into the initialization process.
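The first question in the list above, which build tool a repository uses, can often be answered by looking for well-known file names in the repository root. The following is a minimal sketch of that idea, not a complete detector: the class name BuildToolDetector and the set of marker files are assumptions made for illustration, only the root directory is checked (multi-module layouts are ignored), and Path.of assumes Java 11 or later.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: guesses the build tool(s) of a repository from well-known file names. */
final class BuildToolDetector {

    // Marker files expected in the repository root and the build tool each one suggests.
    // This mapping is an illustrative assumption, not an exhaustive list.
    private static final Map<String, String> MARKERS = new LinkedHashMap<>();
    static {
        MARKERS.put("pom.xml", "Maven");
        MARKERS.put("build.gradle", "Gradle");
        MARKERS.put("build.gradle.kts", "Gradle");
        MARKERS.put("build.xml", "Ant");
    }

    /** Returns the distinct build tools whose marker files exist directly under repoRoot. */
    static List<String> detect(Path repoRoot) {
        List<String> tools = new ArrayList<>();
        for (Map.Entry<String, String> marker : MARKERS.entrySet()) {
            boolean present = Files.isRegularFile(repoRoot.resolve(marker.getKey()));
            // A tool is reported once even if several of its marker files are present.
            if (present && !tools.contains(marker.getValue())) {
                tools.add(marker.getValue());
            }
        }
        return tools;
    }

    public static void main(String[] args) {
        // Uses the first argument as the repository root, or the current directory.
        Path root = Path.of(args.length > 0 ? args[0] : ".");
        System.out.println("Build tools found in " + root.toAbsolutePath() + ": " + detect(root));
    }
}
```

Running the class against a checked-out repository prints the tools it recognised; an empty list means none of the assumed marker files were found and the build configuration has to be located by hand.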

And this is not a complete list of the reverse engineering activities you may need to perform when trying to understand an existing Java project. Certain tools simplify many of these activities:

Tools that generate UML diagrams from existing code, such as proprietary products like Sparx Enterprise Architect or IDE-specific plugins.
Static analysis tools that provide pattern mining and analysis of code smells or potential vulnerabilities, like PMD, FindBugs (or its successor SpotBugs) and SonarQube.
Dynamic analysis tools, such as Burp and Veracode, that for example scan for potential vulnerabilities.

But it is the 21st century and everyone is talking about AI. So can AI help further in the process of reverse engineering, beyond the code analysis tools we already mentioned?

AI-assisted reverse engineering

Not only can AI help in that area, it is even defined as a distinct area of research called AI-assisted reverse engineering (AIARE). Certain AI techniques like deep learning, and specifically the LLMs built using these techniques, overlap with what can be achieved by static analysis tools alone. Still, there are certain activities where AI does considerably better than traditional code analysis tools at “comprehending” how a system behaves:

provide an analysis of the interaction of components in a system and give answers such as “class A is used as a wrapper for the communication with the Kafka message bus”;
provide an in-depth vulnerability and malware analysis based on patterns recognized by the LLM;
provide information about calls to external systems such as “system X is called by first retrieving configuration from a Postgres database and then making an API call to it from class A”;
reconstruct high-level code constructs from binary and obfuscated code.

At present many of these capabilities are built on top of existing reverse engineering tools like Ghidra (with plugins like RevEng.AI or ReVa), Radare2 (with plugins like r2ai and decai) or IDA Pro (with a third-party MCP server to facilitate reverse engineering during decompilation). You can think of these as a step beyond disassemblers, where the LLM is used to refine decompiled code and provide higher-level code constructs.

On the other hand, certain AI code assistants provide features that can facilitate the reverse engineering of an existing codebase. This is the case, for example, with GitHub Copilot, which provides a feature to explain existing code, and Claude Code, which can answer questions about the architecture and logic of a codebase.

A case study: Understanding third-party libraries used by a Java project

Let’s take a look at a particular use case: briefly describing the Java libraries used by an application. Traditionally, when we look at an existing repository we identify the third-party dependencies, and if we see an unfamiliar library we typically search for official documentation or blog posts to do some basic research on how it works. LLMs make this task straightforward if we craft a proper prompt for the purpose. Let’s assume we have a tool that parses the build files (Maven or Gradle) and identifies that we are using the org.apache.pdfbox library to generate PDFs. If we want ChatGPT to give us an example, we can try the following prompt:

As a professional software developer
Give me a code example of the org.apache.pdfbox:pdfbox:3.0.5 Maven library

What the model gives back is a detailed explanation, not just the code example we wanted.

If we refine our prompt a bit and use the following one instead:

As a professional software developer
Give me a code example of the org.apache.pdfbox:pdfbox:3.0.5 Maven library. 
Reply only with code sample.

Then we get just a code sample.

As you can see, given proper prompts, or even the code itself (decompiled if need be), the model can facilitate a number of activities that we typically need to do manually and that are hard to automate with conventional code analysis tools.
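The tool assumed in this case study, one that reads the build files and turns each dependency into a prompt, could be sketched roughly as follows for Maven. This is only an illustration under stated assumptions: the class name DependencyPrompts and its methods are hypothetical, the POM is parsed with the JDK's built-in XML parser, every dependency element in the file is picked up (including those under dependencyManagement or plugins), and versions that are inherited or written as ${...} properties are not resolved. Sending the prompt to a model is left out.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical helper: turns the dependencies of a pom.xml into LLM prompts. */
final class DependencyPrompts {

    /** Extracts "groupId:artifactId[:version]" for every dependency element in the POM text. */
    static List<String> coordinates(String pomXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Refuse DOCTYPE declarations, since the POM may come from an untrusted repository.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(pomXml.getBytes(StandardCharsets.UTF_8)));
        NodeList deps = doc.getElementsByTagName("dependency");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < deps.getLength(); i++) {
            Element dep = (Element) deps.item(i);
            String coordinate = child(dep, "groupId") + ":" + child(dep, "artifactId");
            String version = child(dep, "version");
            // The version is optional in a POM (it may be managed elsewhere); omit it if absent.
            result.add(version.isEmpty() ? coordinate : coordinate + ":" + version);
        }
        return result;
    }

    /** Returns the trimmed text of the first direct child element named tag, or "" if none. */
    private static String child(Element parent, String tag) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && tag.equals(n.getNodeName())) {
                return n.getTextContent().trim();
            }
        }
        return "";
    }

    /** Builds the refined prompt from the article for one dependency coordinate. */
    static String prompt(String coordinate) {
        return "As a professional software developer\n"
                + "Give me a code example of the " + coordinate + " Maven library.\n"
                + "Reply only with code sample.";
    }

    public static void main(String[] args) throws Exception {
        // A tiny in-memory POM used only for demonstration.
        String pom = "<project><dependencies><dependency>"
                + "<groupId>org.apache.pdfbox</groupId>"
                + "<artifactId>pdfbox</artifactId>"
                + "<version>3.0.5</version>"
                + "</dependency></dependencies></project>";
        for (String coordinate : coordinates(pom)) {
            System.out.println(prompt(coordinate));
        }
    }
}
```

Keeping the prompt template in one method makes it easy to experiment with the wording, which, as the two prompts above show, changes what the model returns.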
