If you are attending this workshop at GitHub Universe, or watching the recording, the facilitators will guide you through the steps below. You can use this document as a written reference.
Please complete this section before the workshop, if possible.
- Install Visual Studio Code.
- Install the CodeQL extension for Visual Studio Code.
- You do not need to install the CodeQL CLI: the extension will handle this for you.
- Clone this repository:

      git clone --recursive https://github.com/githubuniverseworkshops/codeql

  - Please don't forget `--recursive`: this option fetches the standard CodeQL query libraries, which are included as a Git submodule of this repository.
  - What if I forgot to add `--recursive`? If you've already cloned the repository, set up the submodule by running:

        git submodule update --init
- Open the repository in Visual Studio Code: File > Open (or Open Folder) > Browse to the checkout of `githubuniverseworkshops/codeql`.
- Import the CodeQL database to be used in the workshop:
  - Click the CodeQL rectangular icon in the left sidebar.
  - Place your mouse over Databases, and click the icon labelled `Download Database`.
  - Copy and paste this URL into the box, then press OK/Enter: https://github.com/githubuniverseworkshops/codeql/releases/download/universe-2021/codeql-java-workshop-apache-dubbo.zip
  - Click on the database name, and click Set Current Database.
- Create a new file in the `workshop-2021` directory called `UnsafeDeserialization.ql`.
Serialization is the process of converting in-memory objects to text or binary output formats, usually for the purpose of sharing or saving program state. This serialized data can then be loaded back into memory at a future point through the process of deserialization.
In languages such as Java, Python and Ruby, deserialization provides the ability to restore not only primitive data, but also complex types such as library and user-defined classes. This provides great power and flexibility, but introduces a significant attack vector if the deserialization happens on untrusted user data without restriction.
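To make this concrete, here is a minimal sketch using Java's built-in serialization. The `RoundTripDemo` and `Greeting` names are hypothetical and not taken from Apache Dubbo. The sketch restores a user-defined object from bytes and shows that code belonging to the class (a private `readObject` hook) runs during deserialization, which is why deserializing untrusted bytes is risky.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class RoundTripDemo {
    /** Hypothetical user-defined type; not part of Apache Dubbo. */
    static class Greeting implements Serializable {
        private static final long serialVersionUID = 1L;
        static int hookCalls = 0; // counts how often the hook below has run
        final String text;

        Greeting(String text) {
            this.text = text;
        }

        // Java serialization invokes this private hook while deserializing.
        private void readObject(ObjectInputStream in)
                throws IOException, ClassNotFoundException {
            in.defaultReadObject(); // restore the fields from the stream
            hookCalls++;            // class code runs as a side effect
        }
    }

    /** Converts an in-memory object to bytes. */
    static byte[] serialize(Object value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        return bytes.toByteArray();
    }

    /** Rebuilds an object from bytes, with no validation of the input. */
    static Object deserialize(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Greeting copy = (Greeting) deserialize(serialize(new Greeting("hello")));
        System.out.println(copy.text + ", hook calls: " + Greeting.hookCalls);
    }
}
```

In this sketch the hook only increments a counter. In a real attack, the concern is that classes already on the classpath may do something more dangerous when they are instantiated from attacker-chosen bytes.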
Apache Dubbo is a popular open-source RPC framework in Java. In 2021, a researcher from the GitHub Security Lab found multiple vulnerabilities leading to remote code execution (RCE) through different deserialization formats.
In this workshop, we will write a query to find variants for CVE-2020-11995 in a database built from the known vulnerable version of Apache Dubbo.
The problem occurred because user-controlled data received by the different network libraries used by Apache Dubbo was deserialized using insecure deserialization formats.
If you get stuck, try searching our documentation and blog posts for help and ideas. Below are a few links to help you get started:
- CodeQL overview
- CodeQL for Java
- Analyzing data flow in Java
- Using the CodeQL extension for VS Code
- GitHub Security Lab research
- CodeQL on GitHub Learning Lab
- For more advanced CodeQL development in future, you may wish to set up the CodeQL starter workspace for all languages.
- Run a query using the following commands from the Command Palette (`Cmd/Ctrl + Shift + P`) or right-click menu:
  - `CodeQL: Run Query` (run the entire query)
  - `CodeQL: Quick Evaluation` (run only the selected predicate or snippet)
- Click the links in the query results to navigate to the source code.
- Explore the CodeQL libraries in your IDE using:
  - autocomplete suggestions (`Cmd/Ctrl + Space`)
  - jump-to-definition (`F12`, or `Cmd/Ctrl + F12` in a Codespace)
  - documentation hovers (place your cursor over an element)
  - the AST viewer on an open source file (`View AST` from the CodeQL sidebar or Command Palette)
The workshop is split into several steps. You can write one query per step, or work with a single query that you refine at each step. Each step has a hint that describes useful classes and predicates in the CodeQL standard libraries for Java.
Apache Dubbo uses an abstraction layer to wrap multiple deserialization formats. Most of the supported serialization libraries might lead to arbitrary code execution upon deserialization of untrusted data. The SPI interface used for deserialization is called `ObjectInput`. It provides multiple `readXXX` methods for deserializing data to a Java object. By default, the input is not validated in any way, and is vulnerable to remote code execution exploits.
In this section, we will identify calls to `ObjectInput.readXXX` methods in the codebase. The qualifiers of these calls are the values being deserialized, and hence are sinks for deserialization vulnerabilities.
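As a reminder of what "qualifier" means in Java source, the sketch below uses the JDK's `java.io.ObjectInput` as a stand-in for Dubbo's `ObjectInput` interface (an assumption made so that the example is self-contained; the class and method names `QualifierDemo`, `readName` and `encode` are hypothetical). In the call `in.readUTF()`, the expression `in` is the qualifier, and it is this expression that the query will treat as a sink.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

class QualifierDemo {
    /** Reads one string; `in` is the qualifier of the `readUTF` call. */
    static String readName(ObjectInput in) throws IOException {
        return in.readUTF(); // qualifier: `in`, method name starts with "read"
    }

    /** Helper that produces bytes a matching reader can consume. */
    static byte[] encode(String name) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeUTF(name);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = encode("dubbo");
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            System.out.println(readName(in));
        }
    }
}
```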
- Find all method calls in the program.

  Hint
  - A method call is represented by the `MethodAccess` type in the CodeQL Java library.

  Solution

      import java

      from MethodAccess call
      select call
- Update your query to report the method being called by each method call.

  Hints
  - Add a CodeQL variable called `method` with type `Method`.
  - Add a `where` clause.
  - `MethodAccess` has a predicate called `getMethod()` for returning the method.
  - Use the equality operator `=` to assert that two CodeQL expressions are the same.

  Solution

      import java

      from MethodAccess call, Method method
      where call.getMethod() = method
      select call, method
- Find all calls in the program to methods starting with `read`.

  Hint
  - `Method.getName()` returns a string representing the name of the method.
  - `string.matches("foo%")` can be used to check if a string starts with `foo`.
  - Use the `and` keyword to add multiple conditions to the `where` clause.

  Solution

      import java

      from MethodAccess read, Method method
      where
        read.getMethod() = method and
        method.getName().matches("read%")
      select read
- Refine your query to only match calls to `read` methods on classes implementing the `org.apache.dubbo.common.serialize.ObjectInput` interface.

  Hint
  - `Method.getDeclaringType()` returns the `RefType` this method is declared on. A `Class` is one kind of `RefType`.
  - `RefType.getASourceSupertype()` returns the immediate parent/supertypes for a given type, as defined in the Java source. (Hover to see the documentation.)
  - Use the "reflexive transitive closure" operator `*` on a call to a predicate with 2 arguments, e.g. `getASourceSupertype*()`, to apply the predicate 0 or more times in succession.
  - `RefType.hasQualifiedName("package", "class")` holds if the given `RefType` has the fully-qualified name `package.class`. For example, the query

        from RefType r
        where r.hasQualifiedName("java.lang", "String")
        select r

    will find the type `java.lang.String`.

  Solution

      import java

      from MethodAccess read, Method method
      where
        read.getMethod() = method and
        method.getName().matches("read%") and
        method.getDeclaringType().getASourceSupertype*().hasQualifiedName("org.apache.dubbo.common.serialize", "ObjectInput")
      select read
- The `ObjectInput.readXXX` methods deserialize the qualifier argument (i.e. the `this` argument, or the object before the `.`). Update your query to report the deserialized argument.

  Hint
  - `MethodAccess.getQualifier()` returns the qualifier of the method call.
  - The qualifier is an expression in the program, represented by the CodeQL class `Expr`.
  - Introduce a new variable in the `from` clause to hold this expression, and output the variable in the `select` clause.

  Solution

      import java

      from MethodAccess read, Method method, Expr qualifier
      where
        read.getMethod() = method and
        method.getName().matches("read%") and
        method.getDeclaringType().getASourceSupertype*().hasQualifiedName("org.apache.dubbo.common.serialize", "ObjectInput") and
        qualifier = read.getQualifier()
      select read, qualifier
- Recall that predicates allow you to encapsulate logical conditions in a reusable format. Convert your previous query to a predicate which identifies the set of expressions in the program which are deserialized directly by `ObjectInput.readXXX` methods. You can use the following template:

      predicate isDeserialized(Expr arg) {
        exists(MethodAccess read, Method method |
          // TODO fill me in
        )
      }

  `exists` is a mechanism for introducing temporary variables with a restricted scope. You can think of them as their own `from`-`where`-`select`. In this case, we use `exists` to introduce the variable `read` with type `MethodAccess`, and the variable `method` with type `Method`.

  Hint
  - You can translate from the previous query clause to a predicate by:
    - Converting some variable declarations in the `from` part to the variable declarations of an `exists`
    - Placing the `where` clause conditions (if any) in the body of the `exists`
    - Adding a condition which equates the `select` to one of the parameters of the predicate.

  Solution

      import java

      predicate isDeserialized(Expr qualifier) {
        exists(MethodAccess read, Method method |
          read.getMethod() = method and
          method.getName().matches("read%") and
          method.getDeclaringType().getASourceSupertype*().hasQualifiedName("org.apache.dubbo.common.serialize", "ObjectInput") and
          qualifier = read.getQualifier()
        )
      }

      from Expr arg
      where isDeserialized(arg)
      select arg
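The two ideas doing most of the work in the steps above are the name-prefix test (`matches("read%")`) and the reflexive transitive closure over supertypes (`getASourceSupertype*()`). The following Java reflection sketch is only a loose analogy and not how CodeQL evaluates queries. It applies the same two conditions to JDK classes, with `java.io.ObjectInput` standing in for Dubbo's interface. All class and method names in the sketch are hypothetical.

```java
import java.lang.reflect.Method;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class SupertypeClosureDemo {
    /** One step: the superclass (if any) plus the directly implemented interfaces. */
    static List<Class<?>> directSupertypes(Class<?> type) {
        List<Class<?>> result = new ArrayList<>();
        if (type.getSuperclass() != null) {
            result.add(type.getSuperclass());
        }
        for (Class<?> iface : type.getInterfaces()) {
            result.add(iface);
        }
        return result;
    }

    /** Zero or more steps: the start type itself is part of the result. */
    static Set<Class<?>> supertypesStar(Class<?> start) {
        Set<Class<?>> seen = new LinkedHashSet<>();
        Deque<Class<?>> work = new ArrayDeque<>();
        work.push(start);
        while (!work.isEmpty()) {
            Class<?> current = work.pop();
            if (seen.add(current)) {
                work.addAll(directSupertypes(current));
            }
        }
        return seen;
    }

    /**
     * Names of public methods of `type` that start with "read" and whose
     * declaring class has `target` among its reflexive, transitive supertypes.
     */
    static Set<String> readMethodNames(Class<?> type, Class<?> target) {
        Set<String> names = new TreeSet<>();
        for (Method m : type.getMethods()) {
            boolean nameMatches = m.getName().startsWith("read");
            boolean typeMatches = supertypesStar(m.getDeclaringClass()).contains(target);
            if (nameMatches && typeMatches) {
                names.add(m.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(supertypesStar(java.io.ObjectInputStream.class));
        System.out.println(readMethodNames(java.io.ObjectInputStream.class, java.io.ObjectInput.class));
    }
}
```

Because the closure is reflexive, a type always counts as its own supertype. This is why the query also matches `read` methods declared directly on the `ObjectInput` interface itself.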
Classes that implement the interface `org.apache.dubbo.remoting.Codec2` process user input in their `decodeBody` methods. In this section we will find these methods and their parameters, which are sources of untrusted user input.
Like predicates, classes in CodeQL can be used to encapsulate reusable portions of logic. Classes represent sets of values, and they can also include operations (known as member predicates) specific to that set of values. You have already seen numerous instances of CodeQL classes (`MethodAccess`, `Method` etc.) and associated member predicates (`MethodAccess.getMethod()`, `Method.getName()`, etc.).
- Create a CodeQL class called `DubboCodec` to find the interface `org.apache.dubbo.remoting.Codec2`. You can use this template:

      class DubboCodec extends RefType {
        // Characteristic predicate
        DubboCodec() {
          // TODO Fill me in
        }
      }

  Hint
  - Use `RefType.hasQualifiedName("package", "class")` to identify classes with the given package name and class name.
  - Within the characteristic predicate, use the special variable `this` to refer to the `RefType` we are describing.

  Solution

      import java

      /** The interface `org.apache.dubbo.remoting.Codec2`. */
      class DubboCodec extends RefType {
        DubboCodec() {
          this.hasQualifiedName("org.apache.dubbo.remoting", "Codec2")
        }
      }
- Create a CodeQL class called `DubboCodecDecodeBody` for identifying `Method`s called `decodeBody` on classes whose direct super-types include `DubboCodec`.

  Hint
  - Use `Method.getName()` to identify the name of the method.
  - To identify whether the method is declared on a class whose direct super-type includes `DubboCodec`, you will need to:
    - Identify the declaring type of the method using `Method.getDeclaringType()`.
    - Identify the super-types of that type using `RefType.getASupertype()`.
    - Use `instanceof` to assert that one of the super-types is a `DubboCodec`.

  Solution

      /** A `decodeBody` method on a subtype of `org.apache.dubbo.remoting.Codec2`. */
      class DubboCodecDecodeBody extends Method {
        DubboCodecDecodeBody() {
          this.getDeclaringType().getASupertype*() instanceof DubboCodec and
          this.hasName("decodeBody")
        }
      }
- `decodeBody` methods should consider the second and third parameters as untrusted user input. Add a member predicate to your `DubboCodecDecodeBody` class that finds these parameters of `decodeBody` methods.

  Hint
  - Create a predicate `Parameter getAnUntrustedParameter() { ... }` within the class. This has result type `Parameter`.
  - Within the predicate, use the special variable `result` to refer to the values to be "returned" or identified by the predicate.
  - Within the predicate, use the special variable `this` to refer to the `DubboCodecDecodeBody` method.
  - Use `Method.getParameter(int index)` to get the `i`-th index parameter. Indices are 0-based, so we want index 1 and index 2 here.
  - Use Quick Evaluation to run your predicate.

  Solution

      class DubboCodecDecodeBody extends Method {
        DubboCodecDecodeBody() {
          this.getDeclaringType().getASupertype*() instanceof DubboCodec and
          this.hasName("decodeBody")
        }

        Parameter getAnUntrustedParameter() { result = this.getParameter([1, 2]) }
      }
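For readers more familiar with Java than QL, the sketch below expresses roughly what `DubboCodecDecodeBody` and `getAnUntrustedParameter()` select, using Java reflection. It is an analogy under stated assumptions: `Codec` and `DemoCodec` are hypothetical stand-ins and not Dubbo's real `Codec2` or its implementations, and the parameter types are chosen for illustration only.

```java
import java.io.InputStream;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

class UntrustedParameterDemo {
    /** Hypothetical stand-in for a codec interface; not Dubbo's real `Codec2`. */
    interface Codec {}

    /** Hypothetical codec whose `decodeBody` receives attacker-influenced data. */
    static class DemoCodec implements Codec {
        Object decodeBody(String channel, InputStream is, byte[] header) {
            return null; // body omitted: only the signature matters here
        }
    }

    /**
     * For a type that is a subtype of `Codec`, returns the types of the
     * parameters at 0-based indices 1 and 2 of each `decodeBody` method.
     */
    static List<Class<?>> untrustedParameterTypes(Class<?> type) {
        List<Class<?>> result = new ArrayList<>();
        if (!Codec.class.isAssignableFrom(type)) {
            return result; // not a codec: nothing is selected
        }
        for (Method m : type.getDeclaredMethods()) {
            if (!m.getName().equals("decodeBody")) {
                continue;
            }
            Class<?>[] params = m.getParameterTypes();
            for (int index : new int[] {1, 2}) {
                if (index < params.length) {
                    result.add(params[index]);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(untrustedParameterTypes(DemoCodec.class));
    }
}
```

The QL expression `this.getParameter([1, 2])` plays the role of the inner loop: the set literal `[1, 2]` makes the predicate hold for either index, so it has one result per matching parameter.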
We have now identified (a) places in the program which receive untrusted data and (b) places in the program which potentially perform unsafe deserialization. We now want to tie these two together to ask: does the untrusted data ever flow to the potentially unsafe deserialization call?
In program analysis we call this a data flow problem. Data flow helps us answer questions like: does this expression ever hold a value that originates from a particular other place in the program?
We can visualize the data flow problem as one of finding paths through a directed graph, where the nodes of the graph are elements in the program, and the edges represent the flow of data between those elements. If a path exists, then the data flows between those two nodes.
Consider this example Java method:
int func(int tainted) {
int x = tainted;
if (someCondition) {
int y = x;
callFoo(y);
} else {
return x;
}
return -1;
}
The data flow graph for this method will look something like this:
This graph represents the flow of data from the tainted parameter. The nodes of the graph represent program elements that have a value, such as function parameters and expressions. The edges of this graph represent flow through these nodes.
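The graph view can be sketched in a few lines of Java. The node labels and edges below are one reading of the example method, an assumption and not output from CodeQL. Asking "does data flow from A to B?" then becomes a reachability search.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class DataFlowGraphDemo {
    /** Adjacency list: each entry maps a node to the nodes its value flows to. */
    static final Map<String, List<String>> EDGES = new HashMap<>();

    static {
        edge("tainted", "x");    // int x = tainted;
        edge("x", "y");          // int y = x;
        edge("y", "callFoo(y)"); // callFoo(y);
        edge("x", "return x");   // return x;
    }

    static void edge(String from, String to) {
        EDGES.computeIfAbsent(from, key -> new ArrayList<>()).add(to);
    }

    /** Breadth-first search: true if a path of edges leads from source to sink. */
    static boolean flowsTo(String source, String sink) {
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(source);
        seen.add(source);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            if (current.equals(sink)) {
                return true;
            }
            for (String next : EDGES.getOrDefault(current, Collections.emptyList())) {
                if (seen.add(next)) {
                    queue.add(next);
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(flowsTo("tainted", "callFoo(y)"));
        System.out.println(flowsTo("y", "return x"));
    }
}
```

Note that `return -1` has no incoming edge in this sketch, because the constant does not depend on `tainted`.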
CodeQL for Java provides data flow analysis as part of the standard library. You can import it using `semmle.code.java.dataflow.DataFlow` or `semmle.code.java.dataflow.TaintTracking`. The library models nodes using the `DataFlow::Node` CodeQL class. These nodes are separate and distinct from the AST (Abstract Syntax Tree, which represents the basic structure of the program) nodes, to allow for flexibility in how data flow is modeled.
There are a small number of data flow node types: expression nodes and parameter nodes are most common. We can use the `asExpr()` and `asParameter()` methods to convert a `DataFlow::Node` into the corresponding AST node.
In this section we will create a data flow query by populating this template:
/**
* @name Unsafe deserialization
* @kind problem
* @id java/unsafe-deserialization
*/
import java
import semmle.code.java.dataflow.TaintTracking
// TODO add previous class and predicate definitions here
class DubboUnsafeDeserializationConfig extends TaintTracking::Configuration {
DubboUnsafeDeserializationConfig() { this = "DubboUnsafeDeserializationConfig" }
override predicate isSource(DataFlow::Node source) {
exists(/** TODO fill me in **/ |
source.asParameter() = /** TODO fill me in **/
)
}
override predicate isSink(DataFlow::Node sink) {
/** TODO fill me in **/
}
override predicate isAdditionalTaintStep(DataFlow::Node n1, DataFlow::Node n2) {
exists(/** TODO fill me in **/ |
/** TODO fill me in **/
)
}
}
from DubboUnsafeDeserializationConfig config, DataFlow::Node source, DataFlow::Node sink
where config.hasFlow(source, sink)
select sink, "Unsafe deserialization"
- Complete the `isSource` predicate, using the logic you wrote for Section 2.

  Hint
  - Remember the `DubboCodecDecodeBody` class and `getAnUntrustedParameter` predicate you defined earlier.
  - Use `asParameter()` to convert a `DataFlow::Node` into a `Parameter`.
  - Use `exists` to declare new variables, and `=` to assert that two values are the same.

  Solution

      override predicate isSource(DataFlow::Node source) {
        exists(DubboCodecDecodeBody decodeBodyMethod |
          source.asParameter() = decodeBodyMethod.getAnUntrustedParameter()
        )
      }
- Complete the `isSink` predicate, using the logic you wrote for Section 1.

  Hint
  - Complete the same process as above.
  - Remember the `isDeserialized` predicate you defined earlier.
  - Use `asExpr()` to convert a `DataFlow::Node` into an `Expr`.

  Solution

      override predicate isSink(DataFlow::Node sink) {
        isDeserialized(sink.asExpr())
      }
- Teach CodeQL about extra data flow steps that it should follow. Complete the `isAdditionalTaintStep` predicate by modelling the `Serialization.deserialize()` method, which connects its second argument (index 1, since argument indices are 0-based) with the return value.

  Hint
  - As before, use `exists` to declare new variables, `asExpr()` to convert from `DataFlow::Node` to `Expr`, and `=` to assert equality.
  - `isAdditionalTaintStep` has two arguments: the node where data starts, and the node where data ends.

  Solution

      override predicate isAdditionalTaintStep(DataFlow::Node n1, DataFlow::Node n2) {
        exists(MethodAccess ma |
          ma.getMethod().getName() = "deserialize" and
          ma.getMethod().getDeclaringType().getName() = "Serialization" and
          ma.getArgument(1) = n1.asExpr() and
          ma = n2.asExpr()
        )
      }
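To see how the three predicates fit together, consider the following Java sketch of the pattern the query looks for. It is an assumption-laden illustration and not Dubbo's actual code: `Serialization`, `JavaSerialization` and `decodeBody` here are hypothetical stand-ins, and the JDK's `java.io.ObjectInput` plays the role of Dubbo's `ObjectInput`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

class TaintStepDemo {
    /** Hypothetical stand-in for a serialization abstraction layer. */
    interface Serialization {
        ObjectInput deserialize(String url, InputStream input) throws IOException;
    }

    static class JavaSerialization implements Serialization {
        @Override
        public ObjectInput deserialize(String url, InputStream input) throws IOException {
            // The returned reader wraps `input` (argument index 1),
            // so whatever `input` carries is carried by the result too.
            return new ObjectInputStream(input);
        }
    }

    /** Hypothetical decodeBody-style method: `is` and `header` are untrusted. */
    static Object decodeBody(String channel, InputStream is, byte[] header)
            throws IOException, ClassNotFoundException {
        Serialization serialization = new JavaSerialization();
        ObjectInput in = serialization.deserialize("demo://url", is); // additional step
        return in.readObject(); // sink: `in` is the qualifier of a read call
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject("payload");
        }
        InputStream is = new ByteArrayInputStream(bytes.toByteArray());
        System.out.println(decodeBody("channel", is, new byte[0]));
    }
}
```

In this sketch the source is the parameter `is`, the additional taint step links the argument at index 1 of `deserialize` to the value of the call, and the sink is the qualifier `in` of `readObject()`.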
You can now run the completed query. You should find exactly eleven results, which will include the original CVE-2020-11995 but also new variants that were reported by our security researchers!
For some results, it is easy to verify that the result is correct, because the source and the sink may be in the same method. However, for many data flow problems this is not the case.
We can update the query so that it reports not only the sink, but also the source and the path from that source. To do this, we convert the query to a path problem query. There are five parts we will need to change:
- Convert the `@kind` from `problem` to `path-problem`. This tells the CodeQL toolchain to interpret the results of this query as path results.
- Add a new import `DataFlow::PathGraph`, which will report the path data alongside the query results.
- Change `source` and `sink` variables from `DataFlow::Node` to `DataFlow::PathNode`, to ensure that the nodes retain path information.
- Use `hasFlowPath` instead of `hasFlow`.
- Change the `select` clause to report the `source` and `sink` as the second and third columns. The toolchain combines this data with the path information from `PathGraph` to build the paths.
- Convert your previous query to a path-problem query. Run the query to see the paths in the results view.

  Solution

      /**
       * @name Unsafe deserialization
       * @kind path-problem
       * @id java/unsafe-deserialization
       */

      import java
      import semmle.code.java.dataflow.TaintTracking
      import DataFlow::PathGraph

      predicate isDeserialized(Expr qualifier) {
        exists(MethodAccess read, Method method |
          read.getMethod() = method and
          method.getName().matches("read%") and
          method.getDeclaringType().getASourceSupertype*().hasQualifiedName("org.apache.dubbo.common.serialize", "ObjectInput") and
          qualifier = read.getQualifier()
        )
      }

      /** The interface `org.apache.dubbo.remoting.Codec2`. */
      class DubboCodec extends RefType {
        DubboCodec() {
          this.hasQualifiedName("org.apache.dubbo.remoting", "Codec2")
        }
      }

      /** A `decodeBody` method on a subtype of `org.apache.dubbo.remoting.Codec2`. */
      class DubboCodecDecodeBody extends Method {
        DubboCodecDecodeBody() {
          this.getDeclaringType().getASupertype*() instanceof DubboCodec and
          this.hasName("decodeBody")
        }

        Parameter getAnUntrustedParameter() { result = this.getParameter([1, 2]) }
      }

      class DubboUnsafeDeserializationConfig extends TaintTracking::Configuration {
        DubboUnsafeDeserializationConfig() { this = "DubboUnsafeDeserializationConfig" }

        override predicate isSource(DataFlow::Node source) {
          exists(DubboCodecDecodeBody decodeBodyMethod |
            source.asParameter() = decodeBodyMethod.getAnUntrustedParameter()
          )
        }

        override predicate isSink(DataFlow::Node sink) {
          isDeserialized(sink.asExpr())
        }

        override predicate isAdditionalTaintStep(DataFlow::Node n1, DataFlow::Node n2) {
          exists(MethodAccess ma |
            ma.getMethod().getName() = "deserialize" and
            ma.getMethod().getDeclaringType().getName() = "Serialization" and
            ma.getArgument(1) = n1.asExpr() and
            ma = n2.asExpr()
          )
        }
      }

      from DubboUnsafeDeserializationConfig config, DataFlow::PathNode source, DataFlow::PathNode sink
      where config.hasFlowPath(source, sink)
      select sink, source, sink, "Unsafe deserialization"
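The difference between a problem query and a path-problem query resembles the difference between asking whether a node is reachable and asking for the route taken. The Java sketch below illustrates the second question on a tiny hand-written graph. The node labels are hypothetical, and this is not how CodeQL computes paths internally.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Set;

class PathDemo {
    /** Hypothetical flow graph: untrusted stream, deserialize call, reader variable. */
    static Map<String, List<String>> demoGraph() {
        Map<String, List<String>> edges = new HashMap<>();
        edges.put("is", Arrays.asList("deserialize(url, is)")); // additional taint step
        edges.put("deserialize(url, is)", Arrays.asList("in")); // assignment to `in`
        edges.put("header", Arrays.asList("header.length"));    // unrelated flow
        return edges;
    }

    /** Breadth-first search that remembers each node's predecessor. */
    static List<String> findPath(Map<String, List<String>> edges, String source, String sink) {
        Map<String, String> parent = new HashMap<>();
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(source);
        seen.add(source);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            if (current.equals(sink)) {
                // Walk the predecessor links back to the source.
                LinkedList<String> path = new LinkedList<>();
                for (String node = sink; node != null; node = parent.get(node)) {
                    path.addFirst(node);
                }
                return path;
            }
            for (String next : edges.getOrDefault(current, Collections.emptyList())) {
                if (seen.add(next)) {
                    parent.put(next, current);
                    queue.add(next);
                }
            }
        }
        return Collections.emptyList(); // no flow from source to sink
    }

    public static void main(String[] args) {
        System.out.println(findPath(demoGraph(), "is", "in"));
        System.out.println(findPath(demoGraph(), "header", "in"));
    }
}
```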
For more information on how the vulnerability was identified, read the blog post on the original problem.
- CodeQL overview
- CodeQL for Java
- Analyzing data flow in Java
- Using the CodeQL extension for VS Code
- Try out the Capture-the-Flag challenges on the GitHub Security Lab website!
- Read about more vulnerabilities found using CodeQL on the GitHub Security Lab research blog.
- Explore the open-source CodeQL queries and libraries, and learn how to contribute a new query.
- Configure CodeQL code scanning in your open-source repository.