CodeQL workshop for Java: Unsafe deserialization in Apache Dubbo

@adityasharad and @pwntester, moderated by @aeisenberg @jkcso @jf205 @xcorail

If you are attending this workshop at GitHub Universe, or watching the recording, the facilitators will guide you through the steps below. You can use this document as a written reference.

Prerequisites and setup instructions

Please complete this section before the workshop, if possible.

Install Visual Studio Code.
Install the CodeQL extension for Visual Studio Code.
You do not need to install the CodeQL CLI: the extension will handle this for you.
Clone this repository:
```
git clone --recursive https://github.com/githubuniverseworkshops/codeql
```
- Please don't forget --recursive: This allows you to obtain the standard CodeQL query libraries, which are included as a Git submodule of this repository.
- What if I forgot to add --recursive? If you've already cloned the repository, please set up the submodule by running:
```
git submodule update --init
```
Open the repository in Visual Studio Code: File > Open (or Open Folder) > Browse to the checkout of githubuniverseworkshops/codeql.
Import the CodeQL database to be used in the workshop:
- Click the CodeQL rectangular icon in the left sidebar.
- Place your mouse over Databases, and click the icon labelled Download Database.
- Copy and paste this URL into the box, then press OK/Enter: https://github.com/githubuniverseworkshops/codeql/releases/download/universe-2021/codeql-java-workshop-apache-dubbo.zip
- Click on the database name, and click Set Current Database.
Create a new file in the workshop-2021 directory called UnsafeDeserialization.ql.

Problem statement

Serialization is the process of converting in-memory objects to text or binary output formats, usually for the purpose of sharing or saving program state. This serialized data can then be loaded back into memory at a future point through the process of deserialization.

In languages such as Java, Python and Ruby, deserialization provides the ability to restore not only primitive data, but also complex types such as library and user-defined classes. This provides great power and flexibility, but introduces a significant attack vector if the deserialization happens on untrusted user data without restriction.
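To make the risk concrete, here is a minimal sketch using Java's built-in serialization (not one of the formats Dubbo wraps, and not Dubbo code). The class and method names are hypothetical and chosen for illustration. It shows a round trip, and one possible restriction: an allow-list checked in `ObjectInputStream.resolveClass`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InvalidClassException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamClass;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration class; not part of Dubbo or the workshop repository.
class DeserializationSketch {
    /** Serialization: turn an in-memory object into bytes. */
    static byte[] serialize(Object value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Unrestricted deserialization: the bytes decide which classes get instantiated. */
    static Object deserializeUnrestricted(byte[] data) {
        return deserializeWithAllowList(data, null);
    }

    /** Restricted deserialization: a null allow-list means "no restriction". */
    static Object deserializeWithAllowList(byte[] data, Set<String> allowedClassNames) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data)) {
            @Override
            protected Class<?> resolveClass(ObjectStreamClass desc)
                    throws IOException, ClassNotFoundException {
                // Reject any class descriptor in the stream that is not allow-listed.
                if (allowedClassNames != null && !allowedClassNames.contains(desc.getName())) {
                    throw new InvalidClassException(desc.getName(), "class is not allow-listed");
                }
                return super.resolveClass(desc);
            }
        }) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] numberBytes = serialize(Integer.valueOf(42));
        byte[] listBytes = serialize(new ArrayList<String>(List.of("a", "b")));
        // Integer's class descriptor is followed by one for its superclass, Number.
        Set<String> allowed = Set.of("java.lang.Integer", "java.lang.Number");

        System.out.println(deserializeWithAllowList(numberBytes, allowed));
        System.out.println(deserializeUnrestricted(listBytes));
        try {
            deserializeWithAllowList(listBytes, allowed);
        } catch (UncheckedIOException e) {
            System.out.println("rejected: " + e.getCause().getMessage());
        }
    }
}
```

In the unrestricted case the receiver has no say in which classes are loaded, which is the situation this workshop looks for.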

Apache Dubbo is a popular open-source RPC framework in Java. In 2021, a researcher from the GitHub Security Lab found multiple vulnerabilities leading to remote code execution (RCE) through different deserialization formats.

In this workshop, we will write a query to find variants of CVE-2020-11995 in a database built from the known vulnerable version of Apache Dubbo.

The problem occurred because user-controlled data received by the different network libraries used by Apache Dubbo were deserialized using insecure deserialization formats.

Documentation links

If you get stuck, try searching our documentation and blog posts for help and ideas. Below are a few links to help you get started:

CodeQL overview
CodeQL for Java
Analyzing data flow in Java
Using the CodeQL extension for VS Code
GitHub Security Lab research
CodeQL on GitHub Learning Lab
For more advanced CodeQL development in future, you may wish to set up the CodeQL starter workspace for all languages.

Useful commands

Run a query using the following commands from the Command Palette (Cmd/Ctrl + Shift + P) or right-click menu:
- CodeQL: Run Query (run the entire query)
- CodeQL: Quick Evaluation (run only the selected predicate or snippet)
Click the links in the query results to navigate to the source code.
Explore the CodeQL libraries in your IDE using:
- autocomplete suggestions (Cmd/Ctrl + Space)
- jump-to-definition (F12, or Cmd/Ctrl + F12 in a Codespace)
- documentation hovers (place your cursor over an element)
- the AST viewer on an open source file (View AST from the CodeQL sidebar or Command Palette)

Workshop

The workshop is split into several steps. You can write one query per step, or work with a single query that you refine at each step. Each step has a hint that describes useful classes and predicates in the CodeQL standard libraries for Java.

Section 1: Finding ObjectInput deserialization

Apache Dubbo uses an abstraction layer to wrap multiple deserialization formats. Most of the supported serialization libraries might lead to arbitrary code execution upon deserialization of untrusted data. The SPI interface used for deserialization is called ObjectInput. It provides multiple readXXX methods for deserializing data to a Java object. By default, the input is not validated in any way, and is vulnerable to remote code execution exploits.

In this section, we will identify calls to ObjectInput.readXXX methods in the codebase. The qualifiers of these calls are the values being deserialized, and hence are sinks for deserialization vulnerabilities.

Find all method calls in the program.
Hint
- A method call is represented by the MethodAccess type in the CodeQL Java library.
Solution
```
import java

from MethodAccess call
select call
```
Update your query to report the method being called by each method call.
Hints
- Add a CodeQL variable called method with type Method.
- Add a where clause.
- MethodAccess has a predicate called getMethod() for returning the method.
- Use the equality operator = to assert that two CodeQL expressions are the same.
Solution
```
import java

from MethodAccess call, Method method
where call.getMethod() = method
select call, method
```
Find all calls in the program to methods starting with read.
Hint
- Method.getName() returns a string representing the name of the method.
- string.matches("foo%") can be used to check if a string starts with foo.
- Use the and keyword to add multiple conditions to the where clause.
Solution
```
import java

from MethodAccess read, Method method
where
  read.getMethod() = method and
  method.getName().matches("read%")
select read
```
Refine your query to only match calls to read methods on classes implementing the org.apache.dubbo.common.serialize.ObjectInput interface.
Hint
- Method.getDeclaringType() returns the RefType this method is declared on. A Class is one kind of RefType.
- RefType.getASourceSupertype() returns the immediate parent/supertypes for a given type, as defined in the Java source. (Hover to see the documentation.)
- Use the "reflexive transitive closure" operator * on a call to a predicate with 2 arguments, e.g. getASourceSupertype*(), to apply the predicate 0 or more times in succession.
- RefType.hasQualifiedName("package", "class") holds if the given RefType has the fully-qualified name package.class. For example, the query
```
from RefType r
where r.hasQualifiedName("java.lang", "String")
select r
```
  will find the type java.lang.String.
Solution
```
import java

from MethodAccess read, Method method
where
  read.getMethod() = method and
  method.getName().matches("read%") and
  method.getDeclaringType().getASourceSupertype*().hasQualifiedName("org.apache.dubbo.common.serialize", "ObjectInput")
select read
```
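If the reflexive transitive closure operator is new to you, the following Java sketch may help. It is only an analogy built on Java reflection, not how CodeQL evaluates `getASourceSupertype*()`, and the class and method names are hypothetical. One call to `directSupertypes` corresponds to one application of the predicate; `supertypeClosure` applies it zero or more times, so the starting type is included.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical illustration class; an analogy for the `*` operator, not CodeQL itself.
class SupertypeClosureSketch {
    /** One step: the direct superclass (if any) and the directly implemented interfaces. */
    static Set<Class<?>> directSupertypes(Class<?> type) {
        Set<Class<?>> result = new LinkedHashSet<>();
        if (type.getSuperclass() != null) {
            result.add(type.getSuperclass());
        }
        for (Class<?> implemented : type.getInterfaces()) {
            result.add(implemented);
        }
        return result;
    }

    /** Zero or more steps: the type itself plus everything reachable via directSupertypes. */
    static Set<Class<?>> supertypeClosure(Class<?> type) {
        Set<Class<?>> seen = new LinkedHashSet<>();
        Deque<Class<?>> worklist = new ArrayDeque<>();
        worklist.add(type);
        while (!worklist.isEmpty()) {
            Class<?> current = worklist.remove();
            if (seen.add(current)) {
                worklist.addAll(directSupertypes(current));
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        // Collection is not a direct supertype of ArrayList, but it is in the closure.
        System.out.println(directSupertypes(java.util.ArrayList.class).contains(java.util.Collection.class));
        System.out.println(supertypeClosure(java.util.ArrayList.class).contains(java.util.Collection.class));
    }
}
```

This is why the solution matches read methods declared on ObjectInput itself as well as on any type that implements it, directly or indirectly.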
The ObjectInput.readXXX methods deserialize the qualifier argument (i.e. the this argument, or the object before the .). Update your query to report the deserialized argument.
Hint
- MethodAccess.getQualifier() returns the qualifier of the method call.
- The qualifier is an expression in the program, represented by the CodeQL class Expr.
- Introduce a new variable in the from clause to hold this expression, and output the variable in the select clause.
Solution
```
import java

from MethodAccess read, Method method, Expr qualifier
where
  read.getMethod() = method and
  method.getName().matches("read%") and
  method.getDeclaringType().getASourceSupertype*().hasQualifiedName("org.apache.dubbo.common.serialize", "ObjectInput") and
  qualifier = read.getQualifier()
select read, qualifier
```
Recall that predicates allow you to encapsulate logical conditions in a reusable format. Convert your previous query to a predicate which identifies the set of expressions in the program which are deserialized directly by ObjectInput.readXXX methods. You can use the following template:
```
predicate isDeserialized(Expr arg) {
  exists(MethodAccess read, Method method |
    // TODO fill me in
  )
}
```
exists is a mechanism for introducing temporary variables with a restricted scope. You can think of an exists as its own from-where-select. In this case, we use exists to introduce the variable read with type MethodAccess, and the variable method with type Method.
Hint
- You can translate from the previous query clause to a predicate by:
  - Converting some variable declarations in the from part to the variable declarations of an exists
  - Placing the where clause conditions (if any) in the body of the exists
  - Adding a condition which equates the select to one of the parameters of the predicate.
Solution
```
import java

predicate isDeserialized(Expr qualifier) {
  exists(MethodAccess read, Method method |
    read.getMethod() = method and
    method.getName().matches("read%") and
    method.getDeclaringType().getASourceSupertype*().hasQualifiedName("org.apache.dubbo.common.serialize", "ObjectInput") and
    qualifier = read.getQualifier()
  )
}

from Expr arg
where isDeserialized(arg)
select arg
```

Section 2: Find the implementations of the decodeBody method from DubboCodec

Classes that implement the interface org.apache.dubbo.remoting.Codec2 process user input in their decodeBody methods. In this section we will find these methods and their parameters, which are sources of untrusted user input.

Like predicates, classes in CodeQL can be used to encapsulate reusable portions of logic. Classes represent sets of values, and they can also include operations (known as member predicates) specific to that set of values. You have already seen numerous instances of CodeQL classes (MethodAccess, Method etc.) and associated member predicates (MethodAccess.getMethod(), Method.getName(), etc.).

Create a CodeQL class called DubboCodec to find the interface org.apache.dubbo.remoting.Codec2. You can use this template:

```
class DubboCodec extends RefType {
  // Characteristic predicate
  DubboCodec() {
    // TODO Fill me in
  }
}
```
Hint
- Use RefType.hasQualifiedName("package", "class") to identify classes with the given package name and class name.
- Within the characteristic predicate, use the special variable this to refer to the RefType we are describing.
Solution
```
import java

/** The interface `org.apache.dubbo.remoting.Codec2`. */
class DubboCodec extends RefType {
  DubboCodec() {
    this.hasQualifiedName("org.apache.dubbo.remoting", "Codec2")
  }
}
```

Create a CodeQL class called DubboCodecDecodeBody for identifying Methods called decodeBody on classes whose super-types include DubboCodec.
Hint
- Use Method.getName() to identify the name of the method.
- To identify whether the method is declared on a class whose super-types include DubboCodec, you will need to:
  - Identify the declaring type of the method using Method.getDeclaringType().
  - Identify the super-types of that type using RefType.getASupertype(), applying the * operator to include indirect super-types.
  - Use instanceof to assert that one of the super-types is a DubboCodec.
Solution
```
/** A `decodeBody` method on a subtype of `org.apache.dubbo.remoting.Codec2`. */
class DubboCodecDecodeBody extends Method {
  DubboCodecDecodeBody() {
    this.getDeclaringType().getASupertype*() instanceof DubboCodec and
    this.hasName("decodeBody")
  }
}
```
decodeBody methods should consider the second and third parameters as untrusted user input. Add a member predicate to your DubboCodecDecodeBody class that finds these parameters of decodeBody methods.
Hint
- Create a predicate Parameter getAnUntrustedParameter() { ... } within the class. This has result type Parameter.
- Within the predicate, use the special variable result to refer to the values to be "returned" or identified by the predicate.
- Within the predicate, use the special variable this to refer to the DubboCodecDecodeBody method.
- Use Method.getParameter(int index) to get the parameter at the given index. Indices are 0-based, so we want index 1 and index 2 here.
- Use Quick Evaluation to run your predicate.
Solution
```
class DubboCodecDecodeBody extends Method {
  DubboCodecDecodeBody() {
    this.getDeclaringType().getASupertype*() instanceof DubboCodec and
    this.hasName("decodeBody")
  }

  Parameter getAnUntrustedParameter() { result = this.getParameter([1, 2]) }
}
```

Section 3: Unsafe deserialization

We have now identified (a) places in the program which receive untrusted data and (b) places in the program which potentially perform unsafe deserialization. We now want to tie these two together to ask: does the untrusted data ever flow to the potentially unsafe deserialization call?

In program analysis we call this a data flow problem. Data flow helps us answer questions like: does this expression ever hold a value that originates from a particular other place in the program?

We can visualize the data flow problem as one of finding paths through a directed graph, where the nodes of the graph are elements in the program, and the edges represent the flow of data between those elements. If a path exists, then the data flows between those two nodes.

Consider this example Java method:

```
int func(int tainted) {
  int x = tainted;
  if (someCondition) {
    int y = x;
    callFoo(y);
  } else {
    return x;
  }
  return -1;
}
```

The data flow graph for this method will look something like this: tainted flows into x; x flows into both y and the value returned by return x; and y flows into the argument of callFoo.

This graph represents the flow of data from the tainted parameter. The nodes of the graph represent program elements that have a value, such as function parameters and expressions. The edges of this graph represent flow between these nodes.
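The path-finding view can be sketched in a few lines of Java. This is a simplified model under stated assumptions, not CodeQL's implementation: the node labels are hypothetical strings standing in for the program elements of func, and each variable is collapsed into a single node.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class; node labels are stand-ins for program elements.
class DataFlowGraphSketch {
    /** Edges of a simplified data flow graph for func(int tainted). */
    static final Map<String, List<String>> EDGES = Map.of(
        "tainted", List.of("x"),
        "x", List.of("y", "return x"),
        "y", List.of("callFoo(y)"));

    /** Holds if data can flow from source to sink along zero or more edges. */
    static boolean flowsTo(String source, String sink) {
        Deque<String> worklist = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        worklist.add(source);
        while (!worklist.isEmpty()) {
            String node = worklist.remove();
            if (node.equals(sink)) {
                return true;
            }
            if (visited.add(node)) {
                worklist.addAll(EDGES.getOrDefault(node, List.of()));
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(flowsTo("tainted", "callFoo(y)")); // true
        System.out.println(flowsTo("y", "return x"));         // false
    }
}
```

A data flow query asks the same kind of question, with the sources and sinks described by predicates rather than by name.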

CodeQL for Java provides data flow analysis as part of the standard library. You can import it using semmle.code.java.dataflow.DataFlow or semmle.code.java.dataflow.TaintTracking. The library models nodes using the DataFlow::Node CodeQL class. These nodes are separate and distinct from the AST (Abstract Syntax Tree, which represents the basic structure of the program) nodes, to allow for flexibility in how data flow is modeled.

There are a small number of data flow node types – expression nodes and parameter nodes are most common. We can use the asExpr() and asParameter() methods to convert a DataFlow::Node into the corresponding AST node.

In this section we will create a data flow query by populating this template:

```
/**
 * @name Unsafe deserialization
 * @kind problem
 * @id java/unsafe-deserialization
 */
import java
import semmle.code.java.dataflow.TaintTracking

// TODO add previous class and predicate definitions here

class DubboUnsafeDeserializationConfig extends TaintTracking::Configuration {
  DubboUnsafeDeserializationConfig() { this = "DubboUnsafeDeserializationConfig" }
  override predicate isSource(DataFlow::Node source) {
    exists(/** TODO fill me in **/ |
      source.asParameter() = /** TODO fill me in **/
    )
  }
  override predicate isSink(DataFlow::Node sink) {
    /** TODO fill me in **/
  }
  override predicate isAdditionalTaintStep(DataFlow::Node n1, DataFlow::Node n2) {
    exists(/** TODO fill me in **/ |
      /** TODO fill me in **/
    )
  }
}

from DubboUnsafeDeserializationConfig config, DataFlow::Node source, DataFlow::Node sink
where config.hasFlow(source, sink)
select sink, "Unsafe deserialization"
```

Complete the isSource predicate, using the logic you wrote for Section 2.
Hint
- Remember the DubboCodecDecodeBody class and getAnUntrustedParameter predicate you defined earlier.
- Use asParameter() to convert a DataFlow::Node into a Parameter.
- Use exists to declare new variables, and = to assert that two values are the same.
Solution
```
  override predicate isSource(DataFlow::Node source) {
    exists(DubboCodecDecodeBody decodeBodyMethod |
      source.asParameter() = decodeBodyMethod.getAnUntrustedParameter()
    )
  }
```
Complete the isSink predicate, using the logic you wrote for Section 1.
Hint
- Complete the same process as above.
- Remember the isDeserialized predicate you defined earlier.
- Use asExpr() to convert a DataFlow::Node into an Expr.
Solution
```
  override predicate isSink(DataFlow::Node sink) {
    isDeserialized(sink.asExpr())
  }
```
Teach CodeQL about extra data flow steps that it should follow. Complete the isAdditionalTaintStep predicate by modelling the Serialization.deserialize() method, which connects its second argument (index 1, since argument indices are 0-based) with the return value.
Hint
- As before, use exists to declare new variables, asExpr() to convert from DataFlow::Node to Expr, and = to assert equality.
- isAdditionalTaintStep has two arguments: the node where data starts, and the node where data ends.
Solution
```
  override predicate isAdditionalTaintStep(DataFlow::Node n1, DataFlow::Node n2) {
    exists(MethodAccess ma |
      ma.getMethod().getName() = "deserialize" and
      ma.getMethod().getDeclaringType().getName() = "Serialization" and
      
      ma.getArgument(1) = n1.asExpr() and
      ma = n2.asExpr()
    )
  }
```
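The following Java sketch shows the shape of code that the three predicates describe together. It is not Dubbo's code: the nested interfaces are hypothetical stand-ins with the same names, the URL string is made up, and the toy format only decodes text. The comments mark where the source, the additional taint step and the sink would be.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; the nested types are stand-ins, not Dubbo's interfaces.
class FlowShapeSketch {
    /** Stand-in for an ObjectInput-like interface with a single read method. */
    interface ObjectInput {
        Object readObject();
    }

    /** Stand-in for a Serialization-like factory that wraps a stream in an ObjectInput. */
    interface Serialization {
        ObjectInput deserialize(String url, InputStream input);
    }

    /** Toy format that decodes the body as UTF-8 text. */
    static final class TextSerialization implements Serialization {
        @Override
        public ObjectInput deserialize(String url, InputStream input) {
            return () -> {
                try {
                    return new String(input.readAllBytes(), StandardCharsets.UTF_8);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            };
        }
    }

    static final Serialization SERIALIZATION = new TextSerialization();

    // Source: the parameters at index 1 and 2 (`is` and `header`) are untrusted.
    static Object decodeBody(String channel, InputStream is, byte[] header) {
        // Additional taint step: from argument 1 of deserialize (`is`) to the call's result.
        ObjectInput in = SERIALIZATION.deserialize("dubbo://example", is);
        // Sink: the qualifier `in` of the readObject call.
        return in.readObject();
    }

    public static void main(String[] args) {
        InputStream body = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(decodeBody("channel", body, new byte[0]));
    }
}
```

The additional step is what links the untrusted stream to the object that wraps it, so that taint can reach the qualifier of the read call.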

You can now run the completed query. You should find exactly eleven results, which will include the original CVE-2020-11995 but also new variants that were reported by our security researchers!

For some results, it is easy to verify that they are correct, because the source and sink may be in the same method. However, for many data flow problems this is not the case.

We can update the query so that it reports not only the sink, but also the source and the path between them. To do this, we convert the query to a path-problem query. There are five parts we will need to change:

Convert the @kind from problem to path-problem. This tells the CodeQL toolchain to interpret the results of this query as path results.
Add a new import DataFlow::PathGraph, which will report the path data alongside the query results.
Change source and sink variables from DataFlow::Node to DataFlow::PathNode, to ensure that the nodes retain path information.
Use hasFlowPath instead of hasFlow.
Change the select clause to report the source and sink as the second and third columns. The toolchain combines this data with the path information from PathGraph to build the paths.
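The difference between the two kinds of result can be sketched in Java as follows. This is an illustration only, not the algorithm CodeQL uses: the node labels are hypothetical, and a breadth-first search that remembers predecessors stands in for the path information that PathGraph supplies. A plain problem query corresponds to asking whether the returned path is non-empty; a path-problem query also shows each step.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; node labels are stand-ins for data flow nodes.
class PathSketch {
    static final Map<String, List<String>> EDGES = Map.of(
        "is", List.of("deserialize(url, is)"),
        "deserialize(url, is)", List.of("in"));

    /** Returns the nodes on a shortest path from source to sink, or an empty list. */
    static List<String> findPath(String source, String sink) {
        Map<String, String> predecessor = new HashMap<>();
        Deque<String> worklist = new ArrayDeque<>();
        predecessor.put(source, null);
        worklist.add(source);
        while (!worklist.isEmpty()) {
            String node = worklist.remove();
            if (node.equals(sink)) {
                // Walk the predecessor links back to the source, then reverse.
                List<String> path = new ArrayList<>();
                for (String n = sink; n != null; n = predecessor.get(n)) {
                    path.add(n);
                }
                Collections.reverse(path);
                return path;
            }
            for (String next : EDGES.getOrDefault(node, List.of())) {
                if (!predecessor.containsKey(next)) {
                    predecessor.put(next, node);
                    worklist.add(next);
                }
            }
        }
        return List.of();
    }

    public static void main(String[] args) {
        System.out.println(findPath("is", "in"));
        System.out.println(findPath("header", "in"));
    }
}
```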

Convert your previous query to a path-problem query. Run the query to see the paths in the results view.

Solution

```
/**
 * @name Unsafe deserialization
 * @kind path-problem
 * @id java/unsafe-deserialization
 */
import java
import semmle.code.java.dataflow.TaintTracking
import DataFlow::PathGraph

predicate isDeserialized(Expr qualifier) {
  exists(MethodAccess read, Method method |
    read.getMethod() = method and
    method.getName().matches("read%") and
    method.getDeclaringType().getASourceSupertype*().hasQualifiedName("org.apache.dubbo.common.serialize", "ObjectInput") and
    qualifier = read.getQualifier()
  )
}

/** The interface `org.apache.dubbo.remoting.Codec2`. */
class DubboCodec extends RefType {
  DubboCodec() {
    this.hasQualifiedName("org.apache.dubbo.remoting", "Codec2")
  }
}

/** A `decodeBody` method on a subtype of `org.apache.dubbo.remoting.Codec2`. */
class DubboCodecDecodeBody extends Method {
  DubboCodecDecodeBody() {
    this.getDeclaringType().getASupertype*() instanceof DubboCodec and
    this.hasName("decodeBody")
  }

  Parameter getAnUntrustedParameter() {
    result = this.getParameter([1, 2])
  }
}

class DubboUnsafeDeserializationConfig extends TaintTracking::Configuration {
  DubboUnsafeDeserializationConfig() { this = "DubboUnsafeDeserializationConfig" }
  override predicate isSource(DataFlow::Node source) {
    exists(DubboCodecDecodeBody decodeBodyMethod |
      source.asParameter() = decodeBodyMethod.getAnUntrustedParameter()
    )
  }
  override predicate isSink(DataFlow::Node sink) {
    isDeserialized(sink.asExpr())
  }
  override predicate isAdditionalTaintStep(DataFlow::Node n1, DataFlow::Node n2) {
    exists(MethodAccess ma |
      ma.getMethod().getName() = "deserialize" and
      ma.getMethod().getDeclaringType().getName() = "Serialization" and
      ma.getArgument(1) = n1.asExpr() and
      ma = n2.asExpr()
    )
  }
}

from DubboUnsafeDeserializationConfig config, DataFlow::PathNode source, DataFlow::PathNode sink
where config.hasFlowPath(source, sink)
select sink, source, sink, "Unsafe deserialization"
```

For more information on how the vulnerability was identified, read the blog post on the original problem.

What's next?

CodeQL overview
CodeQL for Java
Analyzing data flow in Java
Using the CodeQL extension for VS Code
Try out the Capture-the-Flag challenges on the GitHub Security Lab website!
Read about more vulnerabilities found using CodeQL on the GitHub Security Lab research blog.
Explore the open-source CodeQL queries and libraries, and learn how to contribute a new query.
Configure CodeQL code scanning in your open-source repository.
