This fork of phpjoern fixes the import of the AST files (nodes.csv and rels.csv) into recent versions of Neo4j (major version 4).
The tool is otherwise unchanged from Malte Skoruppa's original version. It only adds one new file, the Neo4j4Exporter, which formats the CSV files so that Neo4j 4.x can read and import them correctly.
Follow the original installation instructions, run the parser with
./php2ast -f neo4j4 <phpsource>
and then import the result into Neo4j using the neo4j-admin import tool.
You can also import the CPG edges if you are using the discontinued version of joern available at https://github.com/octopus-platform/joern
Please note that joern can only read the original format of the nodes.csv and rels.csv files output by phpjoern.
In short, you will have to:
- generate nodes and rels in the original format and feed them to joern. Output => cpg_edges.csv.
- generate nodes and rels in the new neo4j4 format with ./php2ast -f neo4j4 <phpsource>. Output => nodes.csv, rels.csv.
- import the output files of the previous steps into Neo4j:
$ ./neo4j-admin import --nodes="nodes.csv" --relationships="rels.csv" --relationships="../import/cpg_edges.csv" --processors=<PROCESSORS> --high-io=true --max-memory=1G --delimiter="," --array-delimiter="TAB" --id-type="INTEGER"
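The three steps above can be chained in a small wrapper. The following is only a sketch under several assumptions that are not part of this repository: the run helper, the DRY_RUN switch, and the JOERN_HOME and NEO4J_ADMIN variables are hypothetical, and it assumes that php2ast and phpast2cpg write their output into the current working directory. The import options are copied from the command above; file paths may need adjusting to where your Neo4j installation expects them.

```shell
#!/usr/bin/env bash
# Sketch only: chains the three steps described above.
# Hypothetical names (not provided by phpjoern): run, DRY_RUN,
# JOERN_HOME, NEO4J_ADMIN, neo4j4_pipeline.

# Print a command, then execute it unless DRY_RUN=1.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

# Usage: neo4j4_pipeline <phpsource> [processors]
neo4j4_pipeline() {
    local src=$1
    local procs=${2:-1}
    local phpjoern=${PHPJOERN_HOME:-.}            # checkout of this repository
    local joern=${JOERN_HOME:-../joern}           # assumption: checkout of joern
    local neo4j_admin=${NEO4J_ADMIN:-neo4j-admin} # assumption: path to the executable

    # 1. AST in the original format, then let joern compute cpg_edges.csv.
    run "$phpjoern/php2ast" "$src" || return 1
    run "$joern/phpast2cpg" nodes.csv rels.csv || return 1

    # 2. Regenerate nodes.csv and rels.csv in the neo4j4 format
    #    (assumed to overwrite the files from step 1).
    run "$phpjoern/php2ast" -f neo4j4 "$src" || return 1

    # 3. Import nodes, AST edges and CPG edges into Neo4j 4.x.
    run "$neo4j_admin" import \
        --nodes=nodes.csv \
        --relationships=rels.csv \
        --relationships=cpg_edges.csv \
        --processors="$procs" \
        --high-io=true --max-memory=1G \
        --delimiter=, --array-delimiter=TAB --id-type=INTEGER
}
```

For example, DRY_RUN=1 neo4j4_pipeline path/to/app 4 prints the four commands without running them, which is a cheap way to check the paths before a long import.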
Please note that this project is no longer being maintained. It is only kept here for historical purposes.
This is the phpjoern utility for Joern. It uses the php-ast extension to generate ASTs from PHP projects and exports these to CSV files suitable to be parsed by Joern.
More information on Joern and PHP may be found in our paper Efficient and Flexible Discovery of PHP Application Vulnerabilities published at EuroS&P 2017.
First off, you need a working installation of PHP 7. Next, you need to set up the php-ast extension, available at:
https://github.com/nikic/php-ast
Essentially, clone the repository, then compile and install the extension as follows:
git clone https://github.com/nikic/php-ast
cd php-ast
git checkout 701e853
phpize
./configure
make
sudo make install
Lastly, add the line extension=ast.so to your php.ini file.
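To confirm that the extension is actually loaded, one can inspect the module list printed by php -m. The helper below is a hypothetical convenience, not part of phpjoern; it assumes the extension registers under the module name ast.

```shell
# Hypothetical helper: succeeds if the module list on stdin (one name per
# line, as printed by `php -m`) contains a line that is exactly "ast".
has_ast_module() {
    grep -qx 'ast'
}

# Usage, assuming php on your PATH is the PHP 7 binary:
#   php -m | has_ast_module && echo "php-ast is loaded"
```

Matching the whole line avoids false positives from other modules whose names merely contain the letters "ast".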
The parser is implemented in PHP and makes use of the php-ast extension. A simple Bash wrapper script in the repository's root directory called php2ast serves as an entry point. It takes the path to a PHP file or to a directory as an argument. If the provided argument is a directory, the parser will recursively search for all PHP files in that directory and generate an AST for each of them.
Before executing the script, the environment variable $PHP7 should be set to the location of the php executable of PHP 7. If no such variable is set, the location /usr/bin/php will be used by default.
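The fallback described here corresponds to the usual shell default-value expansion. The function below is a hypothetical illustration of that behaviour, not the actual content of php2ast. Note that the ${PHP7:-...} form also falls back when the variable is set but empty.

```shell
# Hypothetical illustration: print the PHP binary that would be used,
# i.e. $PHP7 if it is set and non-empty, otherwise /usr/bin/php.
resolve_php() {
    printf '%s\n' "${PHP7:-/usr/bin/php}"
}

# Usage:
#   export PHP7=/opt/php7/bin/php   # example path, adjust to your system
#   "$(resolve_php)" --version
```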
Example usage:
./php2ast somefile.php
./php2ast somedirectory/
Either of these calls will generate two CSV files, nodes.csv and rels.csv, representing the nodes of the generated AST(s) and their relationships, respectively. In addition, directory and file nodes are also created and connected to the individual AST root nodes to reflect a scanned directory's structure and obtain a single large tree.
By default, the CSV files are written in the format required by the batch-import tool for Neo4j (see below), available at:
https://github.com/jexp/batch-import
Other output formats are supported, such as Neo4j's own CSV format and GraphML. For details, see
./php2ast --help
Note, however, that Joern currently only supports the default format as an input format. In addition, Joern outputs code property graph edges only in this same format, although additional output modules should be easy to implement.
The CSV files generated in the previous step can now be passed to Joern. Joern will read these files, analyze the ASTs, generate control flow and program dependence edges for them, and output the calculated edges in another CSV file. First off, obtain Joern here:
https://github.com/octopus-platform/joern
Essentially, clone the repository and build the project:
git clone https://github.com/octopus-platform/joern
cd joern
gradle build
In Joern's root directory, there is a small Bash wrapper script called phpast2cpg that serves as an entry point for generating code property graphs for PHP. It takes two arguments: the nodes file and the edges file generated in the previous step, in that order. Use it as follows:
./phpast2cpg nodes.csv rels.csv
Joern will then output a file cpg_edges.csv, representing the calculated control flow and program dependence edges.
You should now have three CSV files, named nodes.csv, rels.csv and cpg_edges.csv by default. These files can be used to create a Neo4j database using the tool batch-import.
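Before starting a potentially long import, it can be worth checking that all three files are present. The following is a minimal sketch; the function name check_csvs is hypothetical, and it only tests that each file exists and is non-empty, not that its content is valid.

```shell
# Hypothetical helper: return 0 only if every file given as an argument
# exists and is non-empty; report the first offending file on stderr.
check_csvs() {
    local f
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            return 1
        fi
    done
}

# Usage:
#   check_csvs nodes.csv rels.csv cpg_edges.csv && echo "ready to import"
```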
It is easiest to download a precompiled batch-import for the particular Neo4j version you intend to use. For instance, for Neo4j 2.1:
mkdir batch-import
cd batch-import
curl -O https://dl.dropboxusercontent.com/u/14493611/batch_importer_21.zip
unzip batch_importer_21.zip
In the following, let $JEXP_HOME be the absolute path to the newly created directory batch-import/, and $PHPJOERN_HOME the absolute path to your installation of the present repository.
To import the generated CSV files into a Joern Neo4J database, simply use the following:
java -classpath "$JEXP_HOME/lib/*" -Dfile.encoding=UTF-8 org.neo4j.batchimport.Importer $PHPJOERN_HOME/conf/batch.properties graph.db nodes.csv rels.csv,cpg_edges.csv
The performance you experience will mainly depend on the heap size that you allocate. You should edit the file $PHPJOERN_HOME/conf/batch.properties accordingly.
The batch.properties file that comes with phpjoern is optimized for heap sizes larger than 4 GB, which you should allocate explicitly, e.g.:
HEAP=6G
java -classpath "$JEXP_HOME/lib/*" -Xmx$HEAP -Xms$HEAP -Dfile.encoding=UTF-8 org.neo4j.batchimport.Importer conf/batch.properties graph.db nodes.csv rels.csv
Once the import is finished, you will have a directory graph.db suitable for Neo4j.
You may now point your Neo4j installation to that database and start your analysis.
For further discussion, refer to http://joern.readthedocs.io.