
PDA Files reverse engineering

Guillaume Prevost edited this page Nov 18, 2013 · 5 revisions

To extract metadata directly from the proprietary SoftMax Pro format, the format had to be analysed and reverse engineered to determine how to read PDA files and where the metadata is located in them.

The PDA files are binary files. To analyse them, a typical file was opened in a hexadecimal editor that can define grammars and map them onto binary files ("Synalyze It!").

To help decipher the content, the PDA file was also opened in SoftMax Pro and exported in both XML and Tab Separated Values (text) formats, in order to know which values to look for. These different versions of the file, which expose the same data as the unknown PDA format, were used as a "Rosetta Stone".

This wiki page presents:

  • the details of the PDA file structure, from the global level down to each of the identified data structures.
  • examples of PDA files and the analysis of their structures.

PDA files structure

The PDA main structure is composed of one header and N datasets, as follows:

Header

The header is composed of:

  • an initial number coded on 2 bytes (example: '\x00\x04')
  • the version number of the SoftMax software that created the file, terminated by a '\x00' byte
  • a string of the form ' ##BLOCK= N ' where N is the number of datasets following the header in the file
  • a sequence that has not been analysed yet, which ends with the byte sequence "\x48\x00\x00\x00\x48\x00\x00\x00", marking the end of the header.
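
As an illustration, the header fields listed above could be located as follows. This is only a sketch: the names `parse_header`, `HEADER_END` and `BLOCK_RE` are hypothetical and are not taken from the filter, and it assumes that the first occurrence of the end marker in the file is the end of the header. The unanalysed bytes are not interpreted.

```python
import re

# End-of-header marker described above.
HEADER_END = b"\x48\x00\x00\x00\x48\x00\x00\x00"
# The ' ##BLOCK= N ' field; whitespace around N is tolerated.
BLOCK_RE = re.compile(rb"##BLOCK=\s*(\d+)")


def parse_header(data: bytes):
    """Return (dataset_count, offset_of_first_byte_after_header)."""
    end = data.find(HEADER_END)
    if end == -1:
        raise ValueError("end-of-header marker not found")
    match = BLOCK_RE.search(data[:end])
    if match is None:
        raise ValueError("##BLOCK= field not found in header")
    return int(match.group(1)), end + len(HEADER_END)
```

The returned offset is where the first CSExperimentSection structure is expected to start.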

Structures

Each structure starts with a number encoded on 1 byte, defining the length of the structure name. For example, a template group structure is preceded by the hexadecimal byte "0B" (= 11 in decimal), which is the length of the structure name "CSTmplGroup" that follows.

Most of the structures have a title that comes right after the structure name. This title is defined as a zero-terminated string. As an example, here is the beginning of an experiment section structure as string and in hexadecimal:

" CSExperimentSectionExperiment#1 "
13  43 53 45 78 70 65 72 69 6D 65 6E 74 53 65 63 74 69 6F 6E     45 78 70 65 72 69 6D 65 6E 74 23 31 00

13 (= 19 in decimal) defines the length of the structure name "CSExperimentSection", and "Experiment#1" followed by a "00" byte is the title of the experiment section.
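
The layout above (a 1-byte length, the structure name, then a zero-terminated title) can be sketched in Python as shown below. The helper names `read_exact`, `read_cstring` and `read_structure_start` are hypothetical, and the latin-1 decoding of the title is an assumption, since the string encoding has not been identified.

```python
import io


def read_exact(f, size):
    """Read exactly `size` bytes or raise EOFError."""
    data = f.read(size)
    if len(data) != size:
        raise EOFError("unexpected end of file")
    return data


def read_cstring(f):
    """Read a zero-terminated string; the terminator is consumed."""
    chars = bytearray()
    while (byte := read_exact(f, 1)) != b"\x00":
        chars += byte
    return chars.decode("latin-1")  # assumption: encoding is not documented


def read_structure_start(f, has_title=True):
    """Read the length-prefixed structure name and, optionally, its title."""
    name_length = read_exact(f, 1)[0]
    name = read_exact(f, name_length).decode("ascii")
    title = read_cstring(f) if has_title else None
    return name, title


# The experiment section bytes shown above.
example = bytes.fromhex(
    "13 43 53 45 78 70 65 72 69 6D 65 6E 74 53 65 63 74 69 6F 6E"
    "45 78 70 65 72 69 6D 65 6E 74 23 31 00"
)
print(read_structure_start(io.BytesIO(example)))
# ('CSExperimentSection', 'Experiment#1')
```

The `has_title` flag covers the structures that carry a name but no title.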

Sections

In the PDA format, sections are structures that enclose other structures. The root sections are the Experiment sections (Datasets).

Experiment Section (Dataset)

Datasets are represented by the "CSExperimentSection" structure. They can be composed of several different sections and structures:

  • Definition of Template groups (CSTmplGroup structures)
  • Definition of template samples (CSTmplSample structures)
  • Analysis notes (CSAnalysisSection structures)
  • Wells (number N followed by N CSWell structures)
  • Plate section (CSPlateSection)
  • Group section (CSGroupSection)
  • Graph section (CSGraphSection)

These structures can potentially come in any order and be present more than once within a dataset (See examples).

Plate Section (CSPlateSection)

A plate section is composed of:

  • Plate data (CSPlateData)
  • Plate descriptor (CSPlateDescriptor)
  • Actual data, as a list of N Flex sites (CSFlexSite), where N is the number of rows multiplied by the number of columns read
  • Description of the calculations done on the read data (CSCalcPlateBody)
  • Morph plate table (CSMorphPlateTable)

Template Group (CSTmplGroup)

Template Groups are always composed of:

  • the structure name and the template group title string (see structures)
  • 8 bytes
  • the template group descriptor unit string, ended by the '\x00' delimiter
  • 4 bytes
  • the template group descriptor title string, ended by the '\x00' delimiter
  • 21 bytes

These components are always found in this order.

In the filter, the function readTmplGroup(self, f) is used to read a template group.
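
For illustration, a standalone reader following the component list above might look like this. It is a sketch and not the filter's readTmplGroup: the function and helper names are hypothetical, the skipped byte runs are not interpreted, and latin-1 decoding is an assumption.

```python
def read_exact(f, size):
    """Read exactly `size` bytes or raise EOFError."""
    data = f.read(size)
    if len(data) != size:
        raise EOFError("unexpected end of file")
    return data


def read_cstring(f):
    """Read a zero-terminated string; the terminator is consumed."""
    chars = bytearray()
    while (byte := read_exact(f, 1)) != b"\x00":
        chars += byte
    return chars.decode("latin-1")  # assumption: encoding is not documented


def read_tmpl_group(f):
    """Read one CSTmplGroup structure from a binary file object."""
    name = read_exact(f, read_exact(f, 1)[0]).decode("ascii")
    title = read_cstring(f)
    read_exact(f, 8)                    # 8 bytes, meaning unknown
    unit = read_cstring(f)              # descriptor unit string
    read_exact(f, 4)                    # 4 bytes, meaning unknown
    descriptor_title = read_cstring(f)  # descriptor title string
    read_exact(f, 21)                   # 21 bytes, meaning unknown
    return {
        "name": name,
        "title": title,
        "unit": unit,
        "descriptor_title": descriptor_title,
    }
```

A template sample can be read the same way: name and title, then 24 skipped bytes.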

Template Sample (CSTmplSample)

Template Samples are always composed of:

  • the structure name and the template sample title string (see structures)
  • 24 bytes

These components are always found in this order.

In the filter, the function readTmplSample(self, f) is used to read a template sample.

Analysis Notes (CSAnalysisSection)

Analysis Notes are always composed of:

  • the structure name and the analysis section title string (see structures)
  • 28 bytes
  • a 4-byte number (integer) defining the length of the string content that follows
  • the string content of the analysis section, of length defined by the previous number
  • a sequence of bytes terminated by 32 consecutive '\xFF' bytes.

These components are always found in this order.

In the filter, the function readAnalysisSection(self, f) is used to read an analysis section.
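
The following sketch shows one way to read the length-prefixed content and then skip to the end of the structure. It is not the filter's readAnalysisSection. Two points are assumptions: the 4-byte length is taken to be a little-endian unsigned integer (the byte order is not stated here), and the structure is taken to end at the first run of 32 '\xFF' bytes. All function names are hypothetical.

```python
import struct

END_MARKER = b"\xff" * 32


def read_exact(f, size):
    """Read exactly `size` bytes or raise EOFError."""
    data = f.read(size)
    if len(data) != size:
        raise EOFError("unexpected end of file")
    return data


def read_cstring(f):
    """Read a zero-terminated string; the terminator is consumed."""
    chars = bytearray()
    while (byte := read_exact(f, 1)) != b"\x00":
        chars += byte
    return chars.decode("latin-1")  # assumption: encoding is not documented


def read_analysis_section(f):
    """Read one CSAnalysisSection structure from a binary file object."""
    name = read_exact(f, read_exact(f, 1)[0]).decode("ascii")
    title = read_cstring(f)
    read_exact(f, 28)                                    # 28 bytes, unknown
    (length,) = struct.unpack("<I", read_exact(f, 4))    # assumed little-endian
    content = read_exact(f, length).decode("latin-1")
    # Skip the trailing bytes up to and including the 32 x 0xFF marker.
    tail = bytearray()
    while not tail.endswith(END_MARKER):
        tail += read_exact(f, 1)
    return {"name": name, "title": title, "content": content}
```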

Wells (CSWell)

The Well structures contain information about the micro-wells.

The list of micro-wells is preceded by an integer coded on 4 bytes defining the number of Well structures that follow in the file.

Each Well structure is composed of:

  • the structure name and the analysis section title string (see structures)
  • a 2-byte integer representing the number of the row where the micro-well is located on the grid
  • a 2-byte integer representing the number of the column where the micro-well is located on the grid
  • 3 * 2-byte numbers (what they represent is unknown)
  • a string ended by the '\x00' delimiter, representing the name of the plate on which the micro-well is located
  • 4 bytes

These components are always found in this order.

In the filter, the function readWell(self, f) is used to read a well, and readWells(self, f) to read the number of wells and each of the wells that follow.
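
A sketch of the count-prefixed list of wells described above. It is not the filter's code: the names are hypothetical, and the integers are assumed to be little-endian and unsigned, which is not confirmed here.

```python
import struct


def read_exact(f, size):
    """Read exactly `size` bytes or raise EOFError."""
    data = f.read(size)
    if len(data) != size:
        raise EOFError("unexpected end of file")
    return data


def read_cstring(f):
    """Read a zero-terminated string; the terminator is consumed."""
    chars = bytearray()
    while (byte := read_exact(f, 1)) != b"\x00":
        chars += byte
    return chars.decode("latin-1")  # assumption: encoding is not documented


def read_well(f):
    """Read one CSWell structure."""
    name = read_exact(f, read_exact(f, 1)[0]).decode("ascii")
    title = read_cstring(f)
    # Row, column, then three 2-byte numbers of unknown meaning.
    row, column, *unknown = struct.unpack("<5H", read_exact(f, 10))
    plate_name = read_cstring(f)
    read_exact(f, 4)  # 4 bytes, meaning unknown
    return {"name": name, "title": title, "row": row,
            "column": column, "plate_name": plate_name}


def read_wells(f):
    """Read the 4-byte well count, then that many CSWell structures."""
    (count,) = struct.unpack("<I", read_exact(f, 4))
    return [read_well(f) for _ in range(count)]
```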

Plate Data (CSPlateData)

Plate Data are always composed of:

  • the structure name and the analysis section title string (see structures)
  • a 2-byte integer representing the first column read in this plate data
  • a 2-byte integer representing the number of columns read in this plate data
  • a 4-byte integer representing the number of reads
  • a 4-byte integer representing the number of wavelengths used for this plate data
  • for each wavelength:
    • a 4-byte integer representing the value of the wavelength
    • 1 byte
    The values of the wavelengths are concatenated in a string separated by spaces (e.g. '520 520').
  • an 8-byte double representing the read duration (in seconds)
  • an 8-byte double representing the interval duration (in seconds)
  • for each wavelength:
    • a 4-byte integer representing the value of the excitation wavelength
    • 4 bytes
    The values of the excitation wavelengths are concatenated in a string separated by spaces (e.g. '340 380').
  • 659 bytes
  • for each wavelength:
    • a 4-byte integer representing the 'R' component of the Trans value
    • a 4-byte integer representing the '@' component of the Trans value
    • an 8-byte double representing the 'V' component of the Trans value (in seconds)
    • a 4-byte integer representing the 'H' component of the Trans value
    The four components described above are concatenated as follows: 'TransN: H=Wµ, R=X, V=Yµ, @Z' where 'N' is the iteration number, 'W' is the H component, 'X' is the R component, 'Y' is the V component and 'Z' is the @ component. The 'Trans' values are joined with a '. ' separator (e.g. 'Trans1: H=80µ, R=4, V=20.0µ, @15. Trans2: H=100µ, R=4, V=25.0µ, @80').
  • 75 bytes

These components are always found in this order.

In the filter, the function readPlateDate(self, f) is used to read the plate data.
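
The sequence above, including the construction of the wavelength and 'Trans' strings, can be sketched as follows. This is an illustration, not the filter's function: the names are hypothetical, integers are assumed to be little-endian and unsigned, doubles are assumed to be IEEE 754 little-endian, and the skipped byte runs are not interpreted.

```python
import struct


def read_exact(f, size):
    """Read exactly `size` bytes or raise EOFError."""
    data = f.read(size)
    if len(data) != size:
        raise EOFError("unexpected end of file")
    return data


def unpack(f, fmt):
    """Read and unpack the bytes described by a struct format string."""
    return struct.unpack(fmt, read_exact(f, struct.calcsize(fmt)))


def read_cstring(f):
    """Read a zero-terminated string; the terminator is consumed."""
    chars = bytearray()
    while (byte := read_exact(f, 1)) != b"\x00":
        chars += byte
    return chars.decode("latin-1")  # assumption: encoding is not documented


def read_plate_data(f):
    """Read one CSPlateData structure from a binary file object."""
    name = read_exact(f, read_exact(f, 1)[0]).decode("ascii")
    title = read_cstring(f)
    first_column, column_count = unpack(f, "<HH")
    read_count, wavelength_count = unpack(f, "<II")
    wavelengths = []
    for _ in range(wavelength_count):
        wavelengths.append(unpack(f, "<I")[0])
        read_exact(f, 1)  # 1 byte, meaning unknown
    read_duration, interval = unpack(f, "<dd")
    excitations = []
    for _ in range(wavelength_count):
        excitations.append(unpack(f, "<I")[0])
        read_exact(f, 4)  # 4 bytes, meaning unknown
    read_exact(f, 659)
    trans = []
    for n in range(1, wavelength_count + 1):
        r, at, v, h = unpack(f, "<IIdI")  # stored order: R, @, V, H
        trans.append(f"Trans{n}: H={h}µ, R={r}, V={v}µ, @{at}")
    read_exact(f, 75)
    return {
        "name": name, "title": title,
        "first_column": first_column, "column_count": column_count,
        "read_count": read_count,
        "wavelengths": " ".join(str(w) for w in wavelengths),
        "read_duration": read_duration, "interval": interval,
        "excitation_wavelengths": " ".join(str(w) for w in excitations),
        "trans": ". ".join(trans),
    }
```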

Plate descriptor (CSPlateDescriptor)

Plate Descriptors are always composed of:

  • the structure name (see structures)
  • 1 byte
  • a 4-byte integer representing the number of plates
  • for each plate:
    • 4 bytes
    • a 4-byte float representing the temperature of the plate
  • 27 bytes

In the filter, the function readPlateDescriptor(self, f) is used to read a plate descriptor.
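
As a sketch of the layout above (not the filter's readPlateDescriptor), the per-plate temperatures could be read like this. The names are hypothetical, and the integer and float are assumed to be little-endian, with the float in IEEE 754 single precision.

```python
import struct


def read_exact(f, size):
    """Read exactly `size` bytes or raise EOFError."""
    data = f.read(size)
    if len(data) != size:
        raise EOFError("unexpected end of file")
    return data


def read_plate_descriptor(f):
    """Read one CSPlateDescriptor structure (name only, no title)."""
    name = read_exact(f, read_exact(f, 1)[0]).decode("ascii")
    read_exact(f, 1)  # 1 byte, meaning unknown
    (plate_count,) = struct.unpack("<I", read_exact(f, 4))
    temperatures = []
    for _ in range(plate_count):
        read_exact(f, 4)  # 4 bytes, meaning unknown
        (temperature,) = struct.unpack("<f", read_exact(f, 4))
        temperatures.append(temperature)
    read_exact(f, 27)  # 27 bytes, meaning unknown
    return {"name": name, "temperatures": temperatures}
```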

Flex Sites - actual data (CSFlexSite)

The Flex Site structure contains the actual data for a well or a cuvette. There is one Flex Site for each well or cuvette read; their number is the number of columns read multiplied by the number of rows.

Each Flex Site structure is composed of:

  • the structure name (see structures)
  • a 4-byte integer representing the number of data chunks in the structure
  • a 4-byte integer representing the number of reads
  • a 4-byte integer representing the id of the well or cuvette
  • a 4-byte integer representing the length of each data chunk in the structure
  • for each data chunk (up to the number of data chunks read above)
    • N bytes of data (with N = the length of the data chunks read above)

These components are always found in this order.

The list of Flex Site structures is terminated by a single '\x00' byte.

In the filter, the function readFlexSite(self, f) is used to read a single flex site structure, and readFlexSites(self, f, numberOfColumns) to read the list of flex sites structures.
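
The chunked layout and the terminating byte can be sketched as below. This is not the filter's code: the names are hypothetical, the caller is assumed to pass the expected number of sites (rows multiplied by columns read), and the four integers are assumed to be little-endian and unsigned. The chunk contents are returned as raw bytes because their encoding is not described here.

```python
import struct


def read_exact(f, size):
    """Read exactly `size` bytes or raise EOFError."""
    data = f.read(size)
    if len(data) != size:
        raise EOFError("unexpected end of file")
    return data


def read_flex_site(f):
    """Read one CSFlexSite structure (name only, no title)."""
    name = read_exact(f, read_exact(f, 1)[0]).decode("ascii")
    chunk_count, read_count, site_id, chunk_length = struct.unpack(
        "<4I", read_exact(f, 16)
    )
    chunks = [read_exact(f, chunk_length) for _ in range(chunk_count)]
    return {"name": name, "read_count": read_count,
            "id": site_id, "chunks": chunks}


def read_flex_sites(f, site_count):
    """Read `site_count` flex sites, then the single 0x00 terminator."""
    sites = [read_flex_site(f) for _ in range(site_count)]
    if read_exact(f, 1) != b"\x00":
        raise ValueError("flex site list is not terminated by 0x00")
    return sites
```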

Plate Body Calculations (CSCalcPlateBody)

This structure represents the plate body calculations and information about the instrument.

The plate body calculations structure is composed of:

  • the structure name (see structures)
  • 23 bytes
  • the wavelength variable name string, ended by the '\x00' delimiter
  • the wavelength combination formula string, ended by the '\x00' delimiter. This formula is used on the raw data in order to reduce the amount of data points.
  • the formula string, ended by the '\x00' delimiter
  • 175 bytes
  • a string whose purpose is unknown, ended by the '\x00' delimiter. So far it has always had the value "Unknown"
  • the instrument information string, ended by the '\x00' delimiter

In the filter, the function readCalcPlateBody(self, f) is used to read the plate body calculation.
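
A sketch of reading the strings listed above, skipping the two unanalysed byte runs. It is not the filter's readCalcPlateBody; the function name and dictionary keys are hypothetical, and latin-1 decoding is an assumption.

```python
def read_exact(f, size):
    """Read exactly `size` bytes or raise EOFError."""
    data = f.read(size)
    if len(data) != size:
        raise EOFError("unexpected end of file")
    return data


def read_cstring(f):
    """Read a zero-terminated string; the terminator is consumed."""
    chars = bytearray()
    while (byte := read_exact(f, 1)) != b"\x00":
        chars += byte
    return chars.decode("latin-1")  # assumption: encoding is not documented


def read_calc_plate_body(f):
    """Read one CSCalcPlateBody structure (name only, no title)."""
    name = read_exact(f, read_exact(f, 1)[0]).decode("ascii")
    read_exact(f, 23)                   # 23 bytes, meaning unknown
    wavelength_variable = read_cstring(f)
    wavelength_formula = read_cstring(f)
    formula = read_cstring(f)
    read_exact(f, 175)                  # 175 bytes, meaning unknown
    unknown = read_cstring(f)           # observed value so far: "Unknown"
    instrument_info = read_cstring(f)
    return {
        "name": name,
        "wavelength_variable": wavelength_variable,
        "wavelength_formula": wavelength_formula,
        "formula": formula,
        "unknown": unknown,
        "instrument_info": instrument_info,
    }
```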

Morph Plate Table (CSMorphPlateTable)

The morph plate table is composed of:

  • the structure name (see structures)
  • 77 bytes. Since this structure doesn't contain any relevant meta-data, it hasn't been analysed in detail.

In the filter, the function readMorphPlateTable(self, f) is used to read the morph plate table.

Group (CSGroupSection)

This structure has not yet been analysed: it doesn't seem to contain relevant meta-data to extract.

Graph (CSGraphSection)

This structure has not yet been analysed: it doesn't seem to contain relevant meta-data to extract.

Examples of PDA files structure:

Example with 1 Dataset

See PDA file '050511V1 Pmutants rep1.pda'

The following is an example of the typical structure with 1 dataset:

  • Header
  • CSExperimentSection
    • CSTmplGroup
    • CSTmplSample
    • CSTmplGroup
    • CSTmplSample
    • CSWell * [number of wells]
    • CSAnalysisSection
    • CSPlateSection
      • CSPlateData
      • CSPlateDescriptor
      • CSFlexSite * [rows * columns read]
      • CSCalcPlateBody
      • CSMorphPlateTable

Example with 2 datasets

See PDA file 'BGD131010 3759 and 3720.pda'

  • Header
  • CSExperimentSection
    • CSTmplGroup
    • CSTmplSample
    • CSTmplGroup
    • CSTmplSample
    • CSAnalysisSection
    • CSAnalysisSection
  • CSExperimentSection
    • CSTmplGroup
    • CSTmplGroup
    • CSTmplSample
    • CSWell * [number of wells]
    • CSTmplGroup
    • CSTmplGroup
    • CSTmplGroup
    • CSTmplGroup
    • CSAnalysisSection
    • CSPlateSection
      • CSPlateData
      • CSPlateDescriptor
      • CSFlexSite * [rows * columns read]
      • CSCalcPlateBody
      • CSMorphPlateTable
    • CSGroupSection
    • CSGroupSection
    • CSGroupSection
    • CSGroupSection
    • CSGraphSection