text review wip
gabrimaine committed Dec 19, 2024
1 parent 681e98e commit 84e9e9f
## Introduction
In this note, we report the results of the comparison between DP0.2 catalogs produced at FrDF and the reference catalog produced at IDF.

In the context of the Data Preview 0.2 (DP0.2), the Data Release Production pipelines have been executed on the DC-2 simulated dataset (generated by {cite:t}`2021ApJS..253...31L`).
This dataset includes 20,000 simulated exposures, representing 300 square degrees of Rubin Observatory images with a typical depth equivalent to five years of observations.
DP0.2 was run at the Interim Data Facility, and the full exercise was independently replicated at FrDF (CC-IN2P3), as described in {cite:t}`LeBoulch.2024`.

In this note, we will begin by describing the catalogs and explaining how we selected the data.
Then, we report the analysis performed on each table, focusing on two main objectives: assessing how the sources' positions in the sky match, and comparing the sources' fluxes (when applicable).

The data and notebooks used are available on [CC-IN2P3 GitLab](https://gitlab.in2p3.fr/gabriele.mainetti/dp02_analysis).

## The catalogs in Qserv

The catalogs have been ingested in Qserv {cite}`Wang:2011:QDS:2063348.2063364` production instance at FrDF:
1. `dp02_dc2_catalogs_frdf`, produced at CC-IN2P3 (hereafter the FrDF catalog)
2. `dp02_dc2_catalogs`, produced at IDF (hereafter the IDF catalog)

For the FrDF catalog, two tables are missing (`TruthSummary` and `MatchesTruth`) because those tables require post-processing before they can be ingested into Qserv.


In the following image, you can see the number of lines per table in the FrDF and IDF catalogs. `CcdVisit` and `Visit`, produced by pipeline Step 7, have been produced twice at FrDF. For our purposes, the FrDF data have been filtered to remove the duplicate rows. This problem needs to be addressed: **we have to be able to flag the invalid tables to detect them before the ingestion process**.

```{figure} ./images/table_qserv.png
Number of lines per table in the DP0.2 FrDF and IDF catalogs.
```

```{rst-class} technote-wide-content
```
It is not possible to compare the full catalogs, so for the analysis reported here, we used subsamples of both catalogs, selected using a spatial query such as this:
```sql

SELECT <column1>, <column2>, ..., <columnN> FROM <table> WHERE scisql_s2PtInCircle(<ra>, <decl>, 60.0, -30.0, 0.5) = 1 LIMIT 5000000
```
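The spatial cut itself is simple geometry; the test that `scisql_s2PtInCircle` performs can be sketched in pure Python as a haversine great-circle check (a hypothetical re-implementation for illustration, not Qserv's actual code):

```python
import math

def point_in_circle(ra, dec, ra_c, dec_c, radius_deg):
    """True if (ra, dec) lies within radius_deg of the centre (ra_c, dec_c).

    All angles are in degrees; the separation is the great-circle
    (haversine) distance on the sphere.
    """
    phi1, phi2 = math.radians(dec), math.radians(dec_c)
    dlam = math.radians(ra_c - ra)
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    sep_deg = math.degrees(2 * math.asin(math.sqrt(a)))
    return sep_deg <= radius_deg

# A source 1 degree east in RA at Dec -30 is ~0.87 degrees away on the
# sphere, so it falls outside the 0.5-degree query circle.
```

The same test, evaluated server side by Qserv, is what keeps the retrieved subsamples manageable.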

We limited the number of retrieved lines to 5 million, but for tables with a large number of lines (sources), we also reduced the query radius: a radius of 0.5 degrees is too large, and the number of sources in the area defined by the circle largely exceeds the limits we imposed on the number of lines.
For this reason, the catalogs retrieved are not comparable because the sources in the tables do not cover the same region, as shown in the following image for the `ForcedSource` table retrieved with a radius of 0.5 degrees. In the image, the objects extracted from FrDF are shown in black, and the objects extracted from the IDF are shown in red.



```{figure} ./images/forced_source.png
Example of source extraction not covering the same region.
```
To reduce the data size, we also retrieved a subsample of columns (ra, dec, and fluxes). We retrieved all columns only for a few small tables.

The fluxes have been converted to AB magnitudes using the UDF SQL function `scisql_nanojanskyToAbMag` [^1] integrated in Qserv.
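The conversion follows from the AB zero point of 3631 Jy; a minimal Python equivalent of the UDF (our own sketch, not the sciSQL source) is:

```python
import math

def nanojansky_to_ab_mag(flux_njy):
    """Convert a flux in nanojanskys to an AB magnitude.

    AB magnitude is defined as m = -2.5 log10(f / 3631 Jy);
    with f expressed in nJy this reduces to m = -2.5 log10(f) + 31.4.
    """
    if flux_njy <= 0:
        raise ValueError("flux must be positive to have a defined magnitude")
    return -2.5 * math.log10(flux_njy) + 31.4

# e.g. a 1000 nJy source has m_AB = 23.9
```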

All the queries used for each table are reported in [query notebook](https://gitlab.in2p3.fr/gabriele.mainetti/dp02_analysis/-/blob/main/notebooks/query.ipynb?ref_type=heads).

The analysis has been performed offline: all the tables have been retrieved once and stored locally as FITS files (available in the ['fits' directory](https://gitlab.in2p3.fr/gabriele.mainetti/dp02_analysis/-/tree/main/fits?ref_type=heads) in the Gitlab repository).

For each table, a file called `<df>_<table>.fits` has been generated, and a new column (`DF`) has been added to each table, allowing easy identification of the data origin during analysis.

Topcat [^2] has been used to quickly validate the retrieved datasets and to filter out duplicate lines in the `Visit` and `CcdVisit` tables.
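The same duplicate filtering could be done programmatically; a minimal sketch, assuming the visit identifier uniquely labels a valid row:

```python
def drop_duplicate_rows(rows, key):
    """Keep the first occurrence of each key value, preserving row order."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique

# Hypothetical Visit rows: Step 7 ran twice, so visit 101 appears twice
visits = [{'visit': 101, 'band': 'r'},
          {'visit': 102, 'band': 'i'},
          {'visit': 101, 'band': 'r'}]
# drop_duplicate_rows(visits, 'visit') keeps only the first two rows
```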

## Comparison


For the analysis, there is an [interactive notebook](https://gitlab.in2p3.fr/gabriele.mainetti/dp02_analysis/-/blob/main/notebooks/interactive_notebook.ipynb?ref_type=heads) that allows table selection and the type of plot to generate.
For each table, we also created a notebook (available in the [notebook directory](https://gitlab.in2p3.fr/gabriele.mainetti/dp02_analysis/-/tree/main/notebooks?ref_type=heads)), and we generated an interactive HTML file with all the plots (coordinates and magnitudes), which is available in the [HTML directory](https://gitlab.in2p3.fr/gabriele.mainetti/dp02_analysis/-/tree/main/html?ref_type=heads).

We performed a match between the catalogs to make correct correspondences between the rows and to avoid odd results (i.e., comparison between sources in different regions of the sky).
For this, we used the Astropy [^3] module when the number of rows in the retrieved tables was the same (in this case, we reordered the tables to ensure matching rows). When the number of rows in the tables was not the same, we used Topcat STILTS [^4] functions (as implemented in `pystilts`). We used STILTS because it allows the "symmetric match," i.e., it allows only one match per source; with Astropy, this is not possible, and you can get multiple matches that could lead to wrong results.
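The idea of a symmetric match can be illustrated with a toy mutual-nearest-neighbour matcher in NumPy (a conceptual sketch in flat 2-D coordinates, not the algorithm STILTS actually implements):

```python
import numpy as np

def symmetric_match(xy1, xy2, max_sep):
    """Mutual nearest-neighbour match between two small 2-D point sets.

    Returns index pairs (i, j) such that xy2[j] is the nearest neighbour
    of xy1[i] AND xy1[i] is the nearest neighbour of xy2[j], within
    max_sep. Brute force: real tools use sky geometry and indexing.
    """
    d = np.linalg.norm(xy1[:, None, :] - xy2[None, :, :], axis=2)
    nearest12 = d.argmin(axis=1)  # best catalog-2 match for each row of 1
    nearest21 = d.argmin(axis=0)  # best catalog-1 match for each row of 2
    return [(i, j) for i, j in enumerate(nearest12)
            if nearest21[j] == i and d[i, j] <= max_sep]

# Two cat2 points sit near cat1's first source, but only one can claim it
cat1 = np.array([[0.0, 0.0], [1.0, 0.0]])
cat2 = np.array([[0.1, 0.0], [0.2, 0.0], [1.05, 0.0]])
```

Note how the second point of `cat2` is left unmatched: each source appears in at most one pair, which is exactly what plain nearest-neighbour matching does not guarantee.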

However, even with these conditions, matching in the case of the `ForcedSource` table can be very difficult, if not impossible.

To better understand this point, see the following table and figures showing the data retrieved from three Qserv tables.

```{figure} ./images/table_numb_sources.png
Number of sources extracted from the Object, Source, and ForcedSource tables.
```

Comparison of the radii used for source extraction in the Object (green), Source




In `ForcedSource` we have four times more entries than in the `Object` table, in a region 100 times smaller.

Looking at the source density, we can see how difficult it can be for a matching algorithm to find the correct match.
The following figures show the number of sources per "pixel" for the different tables. The pixel size used is 7.2 × 7.2 arcseconds.
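These maps amount to binning the retrieved coordinates into 7.2-arcsecond pixels; a sketch of how such a density map can be built (our own illustration, ignoring the cos(Dec) compression of RA at this small scale):

```python
import numpy as np

def density_map(ra, dec, pixel_arcsec=7.2):
    """Count sources in square RA/Dec bins of the given angular size."""
    pix = pixel_arcsec / 3600.0  # pixel size in degrees
    ra_edges = np.arange(ra.min(), ra.max() + pix, pix)
    dec_edges = np.arange(dec.min(), dec.max() + pix, pix)
    counts, _, _ = np.histogram2d(ra, dec, bins=[ra_edges, dec_edges])
    return counts

# Synthetic example: 5000 sources crammed into a ~36 x 36 arcsec patch
rng = np.random.default_rng(0)
ra = 60.0 + rng.uniform(0, 0.01, 5000)
dec = -30.0 + rng.uniform(0, 0.01, 5000)
counts = density_map(ra, dec)
# counts.max() gives the population of the most crowded pixel
```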

```{figure} ./images/object_pixel.png
Source density in the Object table (in a 7.2 × 7.2 arcsecond 'pixel').
```

Source density in the Source table (in a 7.2 × 7.2 arcsecond 'pixel').

Source density in the ForcedSource table (in a 7.2 × 7.2 arcsecond 'pixel').

You can see the number of sources per pixel on the scale to the right. **ForcedSource** has an incredibly high number of sources per pixel, in some cases, more than three thousand.
It's clear that a matching algorithm using a separation of 1 arcsecond as a parameter to match two points cannot be 100% reliable under these circumstances.
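A back-of-the-envelope estimate makes the point. With roughly 3,000 sources in a 7.2 × 7.2 arcsecond pixel, the expected number of neighbours within a 1-arcsecond match radius is:

```python
import math

sources_per_pixel = 3000.0
pixel_area = 7.2 * 7.2                    # arcsec^2
density = sources_per_pixel / pixel_area  # ~58 sources per arcsec^2
match_area = math.pi * 1.0 ** 2           # area of a 1-arcsec match circle
expected_neighbours = density * match_area
# ~182 sources expected within 1 arcsec of any given position, so a
# nearest-neighbour match at this radius cannot single out the true one.
```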

If we exclude the `ForcedSource` table, the matching algorithm worked as expected for the other tables.

### Source positions analysis

To analyze how well the positions of the objects in the sky match, we compared RA and Dec, and we also analyzed the sky separation (i.e., the great-circle distance) estimated using Astropy.


```python

from astropy.coordinates import SkyCoord
import astropy.units as u

# Positions of the matched sources from the two catalogs
c1 = SkyCoord(df[ra_1]*u.deg, df[decl_1]*u.deg, frame='icrs')
c2 = SkyCoord(df[ra_2]*u.deg, df[decl_2]*u.deg, frame='icrs')
# Great-circle separation between the matched positions, in degrees
sep = c1.separation(c2).degree
```

An example of the distribution of the sky separation is visible in the next figure.

### Fluxes (magnitudes) analysis

For the flux comparison, we converted nJy to AB magnitudes. For each table, we selected the flux columns and converted them to AB magnitudes using the UDF `scisql_nanojanskyToAbMag` available in Qserv.

Then, for each magnitude, we plot the histogram and the box plot of the distribution for each catalog, as shown in the next figure.


```{figure} ./images/diasource_tot.png
```

We also plot the histogram and box plot of the distribution of the magnitude differences, i.e., the differences calculated per matching source.
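The per-source differences and the quantities summarized by the box plots can be computed directly; a sketch with hypothetical matched magnitudes (the real tables are far larger):

```python
import numpy as np

def boxplot_stats(values):
    """Quantities drawn in a standard (Tukey) box plot."""
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    lo = values[values >= q1 - 1.5 * iqr].min()  # lower whisker
    hi = values[values <= q3 + 1.5 * iqr].max()  # upper whisker
    return {'q1': q1, 'median': median, 'q3': q3,
            'whisker_low': lo, 'whisker_high': hi}

# Hypothetical AB magnitudes for four matched sources
mag_frdf = np.array([22.10, 23.05, 21.80, 24.00])
mag_idf = np.array([22.11, 23.05, 21.79, 24.02])
diff = mag_frdf - mag_idf  # one difference per matched source
stats = boxplot_stats(diff)
```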

```{figure} ./images/diasource_tot_diff.png
```

For reference, the following image explains the meaning of a box plot.


```{figure} ./images/boxplot.png
Explanation of a box plot.
```

For these tables, there are no fluxes to analyse and, as expected, there is a pe

```{bibliography}
```


[^1]: https://smonkewitz.github.io/scisql/
[^2]: https://www.star.bris.ac.uk/~mbt/topcat/
[^3]: https://www.astropy.org/
[^4]: https://www.star.bris.ac.uk/~mbt/stilts/

local.bib:

@article{LeBoulch.2024,
pages = "04049",
year = "2024"
}
@misc{scisql,
title = "{sciSQL 0.3}: Science Tools for MySQL",
howpublished = {\url{https://smonkewitz.github.io/scisql/}}
}
