Survival rate of prostate cancer in a South American Oncology Institute

Prostate cancer is the most common malignant tumor in men and the vast majority of cases occur after the age of 65 years and rarely in those younger than 45 years.

The survival rate is the percentage of people who are still alive a given period of time (typically 5 or 10 years) after being diagnosed with a disease. In general, prostate cancer survival rates are very good when the disease is diagnosed in early stages, reaching 100%, 98% and 93% at 5, 10 and 15 years, respectively. This rate drops to 30% at 5 years if the diagnosis is made in advanced stages. The survival rate is therefore directly related to the clinical stage of the disease at diagnosis.
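As a rough illustration of the definition above, the sketch below computes a naive survival rate from a series of observed survival times. The function name `naive_survival_rate` and the toy values are assumptions made for illustration only. This simple proportion ignores censoring (patients lost to follow-up), which a fuller analysis would usually handle with an estimator such as Kaplan-Meier.

```python
import pandas as pd

def naive_survival_rate(years_survived: pd.Series, horizon: float) -> float:
    """Percentage of patients observed to survive at least `horizon` years."""
    return float((years_survived >= horizon).mean() * 100)

# Hypothetical survival times in years (not taken from the dataset).
toy_years = pd.Series([1.2, 6.0, 12.5, 4.9, 15.0])

rate_5y = naive_survival_rate(toy_years, 5)    # 3 of the 5 toy patients
rate_10y = naive_survival_rate(toy_years, 10)  # 2 of the 5 toy patients
print(rate_5y, rate_10y)
```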

The data for this small review were obtained from the medical records of 1639 patients treated in the period 2000-2010 at an oncological institution in South America specialized in cancer treatment. Because patient data are confidential, the records have been anonymized.

The aim of this small review is to determine the survival rate at this oncological institute for patients diagnosed in the period indicated above.

Exploratory Data Analysis

from dateutil.relativedelta import relativedelta
import matplotlib.pyplot as plt
from datetime import datetime
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib
import math
import re

1.- Read data

df_ca=pd.read_excel("patient_survival_ca_prostate_00-10.xlsx", 
                    index_col=None,
                    na_values= np.nan)

To anonymize the database, the column containing the medical record number of each patient is removed. A separate DataFrame is also created that maps each medical record number to a row index, so that each record can still be traced later.

df_final=df_ca.sort_values(["F.Diagnóstico"],ascending=True).reset_index()
df_HCL=df_final[["index","Num.HCL"]]
df_final.drop(["Num.HCL","Num_HCL"], axis=1,inplace=True)
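The same separation of identifiers can be tried on a small toy table. This is only a sketch: the record numbers and dates below are invented, and the names `toy`, `toy_lookup` and `toy_anon` are hypothetical.

```python
import pandas as pd

# Hypothetical records; "Num.HCL" stands in for the medical record number.
toy = pd.DataFrame({
    "Num.HCL": ["A-17", "B-03", "C-88"],
    "F.Diagnóstico": ["2001/05/02", "2000/01/04", "2003/11/30"],
})

# Sort by diagnosis date; reset_index() keeps the old row label in "index".
toy_sorted = toy.sort_values("F.Diagnóstico").reset_index()

# Lookup table, stored separately, so a row can be traced back if needed.
toy_lookup = toy_sorted[["index", "Num.HCL"]].copy()

# The analysis table no longer carries the identifier.
toy_anon = toy_sorted.drop(columns=["Num.HCL"])
print(toy_anon)
```

Keeping the lookup table apart from the analysis table means the anonymized data can be shared without the identifier column.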

2.- Data exploration

pd.set_option('display.max_columns', 15)
pd.set_option("max_colwidth", 18)
pd.set_option('display.max_rows', 22)
df_final.head()
| | index | Edad Diag. | Cod Diag | Diagnóstico | Cod Sitio | Sitio | SEER Estadio | ... | Fec.Reg.Seguim. | Estado Vital | Clase Caso | Clase Caso.1 | F. Pase Control | E.Tratam. | Estado Tratamiento |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 59 | 81403 | Adenocarcinoma... | C619 | Glandula prost... | 9 | ... | 2017/01/27 | Vivo | 42 | Dx fuera y sol... | 2000/12/15 | 0 | Caso completo |
| 1 | 1 | 61 | 81403 | Adenocarcinoma... | C619 | Glandula prost... | 7 | ... | 2002/06/11 | Muerto | 32 | Dx y TODO el T... | 2001/04/20 | 0 | Caso completo |
| 2 | 2 | 74 | 81403 | Adenocarcinoma... | C619 | Glandula prost... | 7 | ... | 2012/03/07 | Muerto | 14 | Dx y TODO el T... | 2001/04/09 | 0 | Caso completo |
| 3 | 3 | 59 | 81403 | Adenocarcinoma... | C619 | Glandula prost... | 3 | ... | 2013/08/15 | Vivo | 14 | Dx y TODO el T... | 2000/05/08 | 0 | Caso completo |
| 4 | 4 | 62 | 81403 | Adenocarcinoma... | C619 | Glandula prost... | 2 | ... | 2013/06/04 | Muerto | 14 | Dx y TODO el T... | 2000/03/09 | 0 | Caso completo |

5 rows × 52 columns

df_final.tail()
| | index | Edad Diag. | Cod Diag | Diagnóstico | Cod Sitio | Sitio | SEER Estadio | ... | Fec.Reg.Seguim. | Estado Vital | Clase Caso | Clase Caso.1 | F. Pase Control | E.Tratam. | Estado Tratamiento |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1634 | 1634 | 80 | 81403 | Adenocarcinoma... | C619 | Glandula prost... | 9 | ... | 2018/05/21 | Vivo | 22 | Dx fuera y TOD... | 2011/04/06 | 0 | Caso completo |
| 1635 | 1635 | 71 | 81403 | Adenocarcinoma... | C619 | Glandula prost... | 1 | ... | 2018/05/28 | Vivo | 22 | Dx fuera y TOD... | 2011/03/23 | 0 | Caso completo |
| 1636 | 1636 | 70 | 81403 | Adenocarcinoma... | C619 | Glandula prost... | 7 | ... | 2012/07/19 | Muerto | 14 | Dx y TODO el T... | 2011/03/01 | 0 | Caso completo |
| 1637 | 1637 | 61 | 81403 | Adenocarcinoma... | C619 | Glandula prost... | 9 | ... | 2018/05/16 | Vivo | 32 | Dx y TODO el T... | 2010/12/23 | 0 | Caso completo |
| 1638 | 1638 | 81 | 81403 | Adenocarcinoma... | C619 | Glandula prost... | 7 | ... | 2018/05/24 | Muerto | 22 | Dx fuera y TOD... | 2011/03/04 | 0 | Caso completo |

5 rows × 52 columns

df_final.describe()
| | index | Edad Diag. | Cod Diag | SEER Estadio | TIPO CX Sitio Primario | Ciclos o Gray Recib.1 | Tipo Rec.1 | Tipo Rec.2 | Clase Caso | E.Tratam. |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 1639.000000 | 1639.000000 | 1639.000000 | 1639.000000 | 267.000000 | 326.000000 | 1639.000000 | 1639.000000 | 1639.000000 | 1639.000000 |
| mean | 819.000000 | 69.472849 | 81474.109823 | 5.082367 | 48.696629 | 68.644172 | 62.370958 | 838.461257 | 22.038438 | 0.768761 |
| std | 473.282861 | 8.768998 | 674.772300 | 3.323969 | 10.529305 | 8.998257 | 81.952525 | 199.876402 | 9.756766 | 19.205771 |
| min | 0.000000 | 35.000000 | 80003.000000 | 0.000000 | 5.000000 | 12.000000 | 0.000000 | 10.000000 | 0.000000 | 0.000000 |
| 25% | 409.500000 | 63.000000 | 81403.000000 | 2.000000 | 50.000000 | 70.000000 | 10.000000 | 888.000000 | 14.000000 | 0.000000 |
| 50% | 819.000000 | 70.000000 | 81403.000000 | 7.000000 | 50.000000 | 70.000000 | 70.000000 | 888.000000 | 22.000000 | 0.000000 |
| 75% | 1228.500000 | 76.000000 | 81403.000000 | 9.000000 | 50.000000 | 70.000000 | 99.000000 | 888.000000 | 22.000000 | 0.000000 |
| max | 1638.000000 | 96.000000 | 96803.000000 | 9.000000 | 99.000000 | 120.000000 | 888.000000 | 888.000000 | 99.000000 | 777.000000 |
df_final.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1639 entries, 0 to 1638
Data columns (total 52 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   index                       1639 non-null   int64  
 1   Edad Diag.                  1639 non-null   int64  
 2   Cod Diag                    1639 non-null   int64  
 3   Diagnóstico                 1639 non-null   object 
 4   Cod Sitio                   1639 non-null   object 
 5   Sitio                       1639 non-null   object 
 6   SEER Estadio                1639 non-null   int64  
 7   SEER Estadio.1              1639 non-null   object 
 8   TNM Estadio RH              1639 non-null   object 
 9   Otra Extensión              1639 non-null   object 
 10  F.Diagnóstico               1639 non-null   object 
 11  Tto.afuera                  282 non-null    object 
 12  F.Tto.afuera                342 non-null    object 
 13  Ttto no curativos           419 non-null    object 
 14  F.Ttto no curat             421 non-null    object 
 15  Fecha 1erTto. - Razón       806 non-null    object 
 16  F.Aband.Tto                 217 non-null    object 
 17  Fecha CIRUGIA               277 non-null    object 
 18  Razón PARA NO CX            277 non-null    object 
 19  TIPO CX Sitio Primario      267 non-null    float64
 20  fecha SIN TTO CLINICO       993 non-null    object 
 21  razon PARA SIN TTO CLINICO  993 non-null    object 
 22  Tipo RADIOTERAPIA           328 non-null    object 
 23  Fecha RT                    340 non-null    object 
 24  Razón PARA NO RT            340 non-null    object 
 25  Ciclos o Gray  Recib.1      326 non-null    float64
 26  Fecha HORMONOTERAPIA        129 non-null    object 
 27  Razón PARA NO HT            129 non-null    object 
 28  Fecha ORQUIECTOMIA          235 non-null    object 
 29  Razón PARA NO ORQUIECTOMIA  235 non-null    object 
 30  Tipo OTROS TTOS             7 non-null      object 
 31  Fecha OTROS TTOS            7 non-null      object 
 32  Razón PARA NO OTROS TTOS    7 non-null      object 
 33  Metast.A Rec.               1639 non-null   object 
 34  Metast.B Rec.               1639 non-null   object 
 35  Metast.C Rec.               1639 non-null   object 
 36  Tipo Rec.1                  1639 non-null   int64  
 37  Tipo Rec.2                  1639 non-null   int64  
 38  F.Recurrencia               1182 non-null   object 
 39  F.Tto Recurrencia           418 non-null    object 
 40  Tipos Tto Recurrencia       976 non-null    object 
 41  Fecha Defun.                871 non-null    object 
 42  C.Defun.                    1639 non-null   object 
 43  Causa Defunción             870 non-null    object 
 44  Fec.Ult.Contacto            1639 non-null   object 
 45  Fec.Reg.Seguim.             1639 non-null   object 
 46  Estado Vital                1639 non-null   object 
 47  Clase Caso                  1639 non-null   int64  
 48  Clase Caso.1                1639 non-null   object 
 49  F. Pase Control             1624 non-null   object 
 50  E.Tratam.                   1639 non-null   int64  
 51  Estado Tratamiento          1639 non-null   object 
dtypes: float64(2), int64(8), object(42)
memory usage: 666.0+ KB
df_final.isnull().sum()[df_final.isnull().sum() !=0]
Tto.afuera               1357
F.Tto.afuera             1297
Ttto no curativos        1220
F.Ttto no curat          1218
Fecha 1erTto. - Razón     833
                         ... 
F.Tto Recurrencia        1221
Tipos Tto Recurrencia     663
Fecha Defun.              768
Causa Defunción           769
F. Pase Control            15
Length: 28, dtype: int64
df_final.groupby(['Diagnóstico']).count()\
                           .assign(Count=lambda dataset:dataset['Edad Diag.'],
                                   Percentage=lambda dataset:dataset['Edad Diag.']*100/dataset['Edad Diag.'].sum(),
                                  )[["Count","Percentage"]].sort_values("Count",ascending=False)
| Diagnóstico | Count | Percentage |
|---|---|---|
| Adenocarcinoma SAI | 1591 | 97.071385 |
| Carcinoma de cel.acinosas | 25 | 1.525320 |
| Neo. maligna | 7 | 0.427090 |
| Adenocar.tubular | 2 | 0.122026 |
| Ca.de cel.transicionales SAI | 2 | 0.122026 |
| Carcinoma SAI | 2 | 0.122026 |
| Carcinoma indiferenciado SAI | 2 | 0.122026 |
| Carcinoma neuroendocrino SAI | 2 | 0.122026 |
| Adenocar. mucinoso | 1 | 0.061013 |
| Adenocarcinoma .de cel. claras SAI | 1 | 0.061013 |
| Carcinoma de cel.pequenas SAI Neuroendocrino | 1 | 0.061013 |
| Carcinoma in situ SAI | 1 | 0.061013 |
| Leiomiosarcoma SAI | 1 | 0.061013 |
| Linfoma maglino cels B grandes difuso SAI | 1 | 0.061013 |

3.- Data cleaning

Because the goal is a review of survival rates, columns that are not needed for that purpose are dropped.

delete_keys=['Cod Diag','Cod Sitio','Sitio','F.Ttto no curat',
             'Razón PARA NO CX', 'TIPO CX Sitio Primario',
             'fecha SIN TTO CLINICO','Razón PARA NO RT',
             'Razón PARA NO HT','Tipo OTROS TTOS','Razón PARA NO ORQUIECTOMIA',
             'Fecha OTROS TTOS', 'Razón PARA NO OTROS TTOS',
             'Metast.B Rec.', 'Metast.C Rec.','C.Defun.',
             'Clase Caso','Clase Caso.1','Num_HCL','Otra Extensión']

target_keys=[item for item in df_final.keys() if item not in delete_keys]
df_final=df_final[target_keys]
df_final.head()
| | index | Edad Diag. | Diagnóstico | SEER Estadio | SEER Estadio.1 | TNM Estadio RH | F.Diagnóstico | ... | Causa Defunción | Fec.Ult.Contacto | Fec.Reg.Seguim. | Estado Vital | F. Pase Control | E.Tratam. | Estado Tratamiento |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 59 | Adenocarcinoma... | 9 | No estadificad... | T777; N777; M7... | 2000/01/02 | ... | NaN | 2014/06/16 | 2017/01/27 | Vivo | 2000/12/15 | 0 | Caso completo |
| 1 | 1 | 61 | Adenocarcinoma... | 7 | Metastasis dis... | T777; N777; M7... | 2000/01/04 | ... | Tumor maligno ... | 2001/04/20 | 2002/06/11 | Muerto | 2001/04/20 | 0 | Caso completo |
| 2 | 2 | 74 | Adenocarcinoma... | 7 | Metastasis dis... | TX; NX; MX; EIV | 2000/01/05 | ... | Enfermedades d... | 2010/07/15 | 2012/03/07 | Muerto | 2001/04/09 | 0 | Caso completo |
| 3 | 3 | 59 | Adenocarcinoma... | 3 | Regional A Los... | T777; N777; M7... | 2000/01/05 | ... | NaN | 2011/08/09 | 2013/08/15 | Vivo | 2000/05/08 | 0 | Caso completo |
| 4 | 4 | 62 | Adenocarcinoma... | 2 | Regional Por E... | TX; NX; MX; EIV | 2000/01/13 | ... | Sintomas/ sign... | 2002/12/12 | 2013/06/04 | Muerto | 2000/03/09 | 0 | Caso completo |

5 rows × 33 columns

Next, the columns whose names contain spaces, periods, or accented characters, which make them awkward to manipulate, are renamed.

df_final=df_final.rename(columns={'Edad Diag.':'Edad_diag',
                                  'SEER Estadio':'Cod_SEER_Estadio',
                                  'SEER Estadio.1':'SEER_Estadio',
                                  'Diagnóstico': 'Diagnostico',
                                  'TNM Estadio RH':'TNM_Estadio_RH',
                                  'F.Diagnóstico':'Fecha_Diag',
                                  'Tto.afuera':'Tto_afuera',
                                  'F.Tto.afuera':'Fecha_Tto_afuera',
                                  'Ttto no curativos':'Ttto_paliativo',
                                  'Fecha 1erTto. - Razón':'Tto_1',
                                  'F.Aband.Tto':'Fecha_Aband_Tto',
                                  'Fecha CIRUGIA':'Fecha_CX',
                                  'razon PARA SIN TTO CLINICO':'Muere_antes_Tto',
                                  'Tipo RADIOTERAPIA':'Radioterapia',
                                  'Fecha RT':'Fecha_RT',
                                  'Ciclos o Gray  Recib.1':'Dosis_Recib',
                                  'Fecha HORMONOTERAPIA':'Fecha_HT',
                                  'Fecha ORQUIECTOMIA':'Fecha_orquiectomia',
                                  'Metast.A Rec.':'Metastasis', 
                                  'Tipo Rec.1':'Tipo_Rec_1',
                                  'Tipo Rec.2':'Tipo_Rec_2',
                                  'F.Recurrencia':'Fecha_Rec',
                                  'F.Tto Recurrencia':'Fecha_Tto_Rec',
                                  'Tipos Tto Recurrencia':'Tipos_Tto_Rec',
                                  'Fecha Defun.':'Fecha_Defun',
                                  'Causa Defunción':'Causa_Defuncion',
                                  'Fec.Ult.Contacto':'Fecha_Ult_Contacto',
                                  'Fec.Reg.Seguim.':'Fecha_Reg_Seguim',
                                  'Estado Vital':'Estado_Vital',
                                  'F. Pase Control':'Fecha_Pase_Control',
                                  'E.Tratam.':'Cod_Estado_Tratam',
                                  'Estado Tratamiento':'Estado_Tratamiento'})

So that string operations can be applied to every row inside DataFrame.assign(), missing ("na") values are replaced with an empty string ("") using DataFrame.fillna().

df_final=df_final.fillna("")

Continuing with the cleaning, we notice that one variable ("TNM_Estadio_RH") holds 4 variables in a single string. This string is split with str.split() using ";" as the separator, and each value is stored in a new variable (T, N, M and E). Finally, the column "TNM_Estadio_RH" is deleted.

dp=df_final.assign(T=lambda dataset:dataset["TNM_Estadio_RH"]\
                                 .apply(lambda row:row.split(sep=';')[0]\
                                 .split(sep='T')[1]),
                   N=lambda dataset:dataset["TNM_Estadio_RH"]\
                                 .apply(lambda row:row.split(sep=';')[1]\
                                 .split(sep='N')[1]),
                   M=lambda dataset:dataset["TNM_Estadio_RH"]\
                                 .apply(lambda row:row.split(sep=';')[2]\
                                 .split(sep='M')[1]),
                   E=lambda dataset:dataset["TNM_Estadio_RH"]\
                                 .apply(lambda row:row.split(sep=';')[3]\
                                 .split(sep='E')[1])
                  ).drop(["TNM_Estadio_RH"], axis=1)
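As an alternative sketch of the same split, a single regular expression with named groups can pull the four components out in one pass. The pattern and the name `tnm_parts` are assumptions for illustration; the first toy string appears in the table above and the second is invented.

```python
import pandas as pd

toy_tnm = pd.Series(["TX; NX; MX; EIV", "T2; N0; M0; EII"])

# One named group per component; each group stops at the next ";".
pattern = r"T(?P<T>[^;]*);\s*N(?P<N>[^;]*);\s*M(?P<M>[^;]*);\s*E(?P<E>.*)"

# str.extract returns a DataFrame whose columns are the group names.
tnm_parts = toy_tnm.str.extract(pattern)
print(tnm_parts)
```

Rows that do not match the pattern come back as NaN instead of raising an IndexError, which can make malformed strings easier to spot.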

The variables "Tto_afuera" and "Tipos_Tto_Rec" also each contain several values. These are converted into dummy variables that represent the presence or absence of each treatment.

dp=dp.assign(CX_fuera=lambda dataset:dataset["Tto_afuera"]\
                                     .apply(lambda row:1 if re.search('CX',row)\
                                                       else 0),
             HT_fuera=lambda dataset:dataset["Tto_afuera"]\
                                     .apply(lambda row:1 if re.search('HT',row)\
                                                       else 0),
             QT_fuera=lambda dataset:dataset["Tto_afuera"]\
                                     .apply(lambda row:1 if re.search('QT',row)\
                                                       else 0),
             OT_fuera=lambda dataset:dataset["Tto_afuera"]\
                                     .apply(lambda row:1 if re.search('OT',row)\
                                                       else 0),
             CX_Rec=lambda dataset:dataset["Tipos_Tto_Rec"]\
                                 .apply(lambda row:1 if re.search('CX',row)\
                                                       else 0),
             HT_Rec=lambda dataset:dataset["Tipos_Tto_Rec"]\
                                 .apply(lambda row:1 if re.search('HT',row)\
                                                       else 0),
             QT_Rec=lambda dataset:dataset["Tipos_Tto_Rec"]\
                                 .apply(lambda row:1 if re.search('QT',row)\
                                                       else 0),
             RT_Rec=lambda dataset:dataset["Tipos_Tto_Rec"]\
                                 .apply(lambda row:1 if re.search('RT',row)\
                                                       else 0)
                  ).drop(["Tipos_Tto_Rec"], axis=1)
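The dummy-variable idea can also be sketched with the vectorized `Series.str.contains`, which avoids a Python-level `apply` for each code. The toy strings below are assumptions about how several treatment codes might appear together in one cell; the real column format may differ.

```python
import pandas as pd

# Hypothetical cell contents; "" stands for a missing value after fillna("").
toy_tto = pd.Series(["CX", "CX HT", "", "QT"])
codes = ["CX", "HT", "QT"]

# One 0/1 column per treatment code, named like the columns created above.
toy_dummies = pd.DataFrame(
    {f"{code}_fuera": toy_tto.str.contains(code, regex=False).astype(int)
     for code in codes}
)
print(toy_dummies)
```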

The variable "Tto_1" contains the values of two different variables separated by a hyphen: the treatment start date followed by the type of treatment. The date is stored in a new variable, "Fecha_Tto_1", and "Tto_1" keeps only the type of treatment.

dp=dp.assign(Fecha_Tto_1=lambda dataset:dataset["Tto_1"]\
                                        .apply(lambda row:row.split(sep='-')[0]),
             Tto_1=lambda dataset:dataset["Tto_1"]\
                                        .apply(lambda row: re.search(r'[a-zA-Z0]+$',row)[0] 
                                                           if re.search(r'[a-zA-Z0]+$',row)
                                                           else row))
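A minimal sketch of this date/type split on toy data follows. It assumes the raw cell looks like `YYYY/MM/DD-CODE`, which is a guess at the format and not something confirmed by the data shown here.

```python
import pandas as pd

# Hypothetical raw values; "" stands for a missing value after fillna("").
toy_raw = pd.Series(["2010/09/13-CX", "2005/01/20-HT", ""])

# Split once on the first hyphen: column 0 is the date, column 1 the type.
parts = toy_raw.str.split("-", n=1, expand=True)

toy_fecha = pd.to_datetime(parts[0].str.strip(), format="%Y/%m/%d", errors="coerce")
toy_tipo = parts[1].fillna("").str.strip()
print(toy_fecha.tolist(), toy_tipo.tolist())
```

Limiting the split to the first hyphen (`n=1`) keeps any later hyphen inside the treatment label intact.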

Since only the date is recorded for patients who underwent a given procedure or treatment, a new variable is created that indicates the presence or absence of that procedure or treatment. Invalid values must be considered:

  • "" : Empty string (absence of information)
  • 777 : No information available
  • 888 : Not applicable
dp=dp.assign(Aband_Tto=lambda dataset:dataset["Fecha_Aband_Tto"]\
                                            .apply(lambda row: 1 if row!=""and row!="888" and row!="777"\
                                                                else 0),
             Hormonoterapia=lambda dataset:dataset["Fecha_HT"]\
                                           .apply(lambda row: 1 if row!=""and row!="888" and row!="777"\
                                                                 else 0),
             Orquiectomia=lambda dataset:dataset["Fecha_orquiectomia"]\
                                           .apply(lambda row: 1 if row!=""and row!="888" and row!="777"\
                                                                 else 0),
              Recurrecia=lambda dataset:dataset["Fecha_Rec"]\
                                            .apply(lambda row: 1 if row!=""and row!="888" and row!="777"\
                                                                  else 0)
            )
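Because the same three invalid values are tested for every column, the check can be factored into one small helper. This is a sketch on toy data; `INVALID` and `has_event` are hypothetical names.

```python
import pandas as pd

# Values that mean "no usable date": empty string, 777 and 888.
INVALID = {"", "777", "888"}

def has_event(dates: pd.Series) -> pd.Series:
    """Return 1 where a usable date string is present, 0 for invalid codes."""
    return (~dates.isin(INVALID)).astype(int)

toy_dates = pd.Series(["2004/03/01", "", "888", "777", "2009/12/24"])
toy_flags = has_event(toy_dates)
print(toy_flags.tolist())
```

With a helper like this, each flag column could be built as `has_event(dataset["Fecha_HT"])`, so the list of invalid codes lives in a single place.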

The variable "Muere_antes_Tto" contains several codes that represent different events. We only need to know whether the patient died before treatment was started, so the column is recoded as 1 for the code 'ST02' and 0 for any other value.

dp=dp.assign(Muere_antes_Tto=lambda dataset:dataset["Muere_antes_Tto"]\
                                            .apply(lambda row:1 if row=='ST02'\
                                                                  else 0))

A binary value representing the presence or absence of treatment is used to determine whether the patient received a certain treatment. This type of value is also used to determine whether or not the cancer has spread in the patient. Invalid values must be considered:

  • "" : Empty string (absence of information)
  • 777 : No information available
  • 888 : Not applicable
dp=dp.assign(Ttto_paliativo=lambda dataset:dataset["Ttto_paliativo"]\
                                 .apply(lambda row:1 if row!=""and row!="888" and row!="777"\
                                                       else 0),
             Radioterapia=lambda dataset:dataset["Radioterapia"]\
                                 .apply(lambda row:1 if re.search(r'^RT',row)\
                                                       else 0),
             Metastasis=lambda dataset:dataset["Metastasis"]\
                                 .apply(lambda row:1 if row!="" and row!="888" and row!="777"\
                                                       else 0)
                  )

To follow the SEER classification of patient stage, the stage labels are replaced with the categories listed, together with their respective codes, in the Summary Stage 2018 (SS2018) table for prostate cancer on the official website.

dp=dp.assign(SEER_Estadio=lambda dataset:dataset["SEER_Estadio"].replace(
  {
    "No estadificado/ desconoce/ no especificado":"Desconocido",
    "Metastasis distante / enferm. sistematica":"Distante",
    "Regional Por Extension Directa":"Regional sólo por extensión directa",    
    "Regional A Los Ganglios Linfaticos":"Regional solo por Ganglios Linfaticos",
    'Regional NEO':'Regional (2 y 3 )'
  }
  ))

Likewise, to follow the SEER stage coding, codes that do not belong to it are replaced with the corresponding value from Summary Stage 2018: Prostate.

dp["Cod_SEER_Estadio"].unique()
array([9, 7, 3, 2, 1, 0, 4, 5])
dp[~dp["Cod_SEER_Estadio"].isin([9,7,4,3,2,1,0])]
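| | index | Edad_diag | Diagnostico | Cod_SEER_Estadio | SEER_Estadio | Fecha_Diag | Tto_afuera | ... | QT_Rec | RT_Rec | Fecha_Tto_1 | Aband_Tto | Hormonoterapia | Orquiectomia | Recurrecia |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1501 | 1501 | 52 | Adenocarcinoma... | 5 | Regional (2 y 3 ) | 2010/04/20 | CX | ... | 0 | 0 | 2010/09/13 | 0 | 0 | 0 | 1 |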

1 rows × 48 columns

All values other than 9, 7, 4, 3, 2, 1 and 0 must be replaced with their corresponding value from the page mentioned above.

dp=dp.assign(Cod_SEER_Estadio=lambda dataset:dataset["Cod_SEER_Estadio"].replace(
              {
                5:4
              })
             ) 
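A small check of this kind can confirm that no unexpected codes remain after the replacement. This is a sketch on toy data, using the list of valid codes given in the text above; the variable names are hypothetical.

```python
import pandas as pd

VALID_CODES = [0, 1, 2, 3, 4, 7, 9]

# Hypothetical codes, including one value (5) outside the valid set.
toy_codes = pd.Series([9, 7, 5, 2, 0])

# Codes that are present in the data but not part of the valid set.
unexpected = toy_codes[~toy_codes.isin(VALID_CODES)].unique().tolist()

# Apply the same 5 -> 4 mapping used above and re-check.
toy_fixed = toy_codes.replace({5: 4})
all_valid = bool(toy_fixed.isin(VALID_CODES).all())
print(unexpected, all_valid)
```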

The "Causa_Defuncion" variable should keep the categories for causes of death in which the cancer played a role, so the categories that are not relevant are replaced by "Otros".

dp2=dp.assign(Causa_Defuncion=lambda dataset:dataset["Causa_Defuncion"].replace(
  {
    "Sintomas/ signos y hallazgos ":"Otros",
    "Tumor maligno de Organos digestivos":"Otros",    
    "Enfermedades del sistema circulatorio":"Otros",
    "Enfermedades del aparato digestivo":"Otros",             
    "Enfermedades endocrinas/ ":"Otros",      
    "Tumores malignos tejido linfatico/ de ":"Tumores malignos tejido linfatico",      
    "Tumor maligno de Piel"  :"Otros", 
    "Tumor maligno de Labio/ cavidad ":"Otros",             
    "sin dato":np.nan,                                   
    "Tumores malignos de sitios mal ":"Otros",             
    "Enfermedades de la sangre y organos " :"Otros",      
    "Tumores malignos (primarios) de "  :"Otros",           
    "Causas extremas de morbilidad y de " :"Otros",        
    "Tumor maligno de Ojo/ encefalo y " :"Otros",          
    "Tumor maligno de Tejidos " :"Otros"    
  }
  ))
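An alternative to listing every label to replace is to list only the labels to keep and send everything else to "Otros". The sketch below shows that idea on toy data. The `KEEP` list is illustrative and is not the author's full set of retained categories.

```python
import pandas as pd

# Illustrative whitelist; adjust to the categories that should survive.
KEEP = ["Tumor maligno de Organos genitales", "Tumor maligno de Vias urinarias"]

toy_cause = pd.Series([
    "Tumor maligno de Organos genitales",
    "Enfermedades del sistema circulatorio",
    "",
    "Tumor maligno de Piel",
])

# Keep whitelisted labels and blanks; every other label becomes "Otros".
keep_mask = toy_cause.isin(KEEP) | (toy_cause == "")
toy_collapsed = toy_cause.where(keep_mask, "Otros")
print(toy_collapsed.tolist())
```

A whitelist does not depend on the exact spelling of the truncated labels, at the cost of silently collapsing any new category that appears.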

The categories in the column "Causa_Defuncion" are checked using Series.value_counts().

dp2["Causa_Defuncion"].value_counts()
                                       769
Tumor maligno de Organos genitales     582
Otros                                  262
Tumor maligno de Vias urinarias          9
Enfermedades del sistema                 7
Tumores malignos tejido linfatico        5
Enfermedades del aparato                 3
Name: Causa_Defuncion, dtype: int64

Now, we check the data type of each variable using DataFrame.info().

dp.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1639 entries, 0 to 1638
Data columns (total 48 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   index               1639 non-null   int64 
 1   Edad_diag           1639 non-null   int64 
 2   Diagnostico         1639 non-null   object
 3   Cod_SEER_Estadio    1639 non-null   int64 
 4   SEER_Estadio        1639 non-null   object
 5   Fecha_Diag          1639 non-null   object
 6   Tto_afuera          1639 non-null   object
 7   Fecha_Tto_afuera    1639 non-null   object
 8   Ttto_paliativo      1639 non-null   int64 
 9   Tto_1               1639 non-null   object
 10  Fecha_Aband_Tto     1639 non-null   object
 11  Fecha_CX            1639 non-null   object
 12  Muere_antes_Tto     1639 non-null   int64 
 13  Radioterapia        1639 non-null   int64 
 14  Fecha_RT            1639 non-null   object
 15  Dosis_Recib         1639 non-null   object
 16  Fecha_HT            1639 non-null   object
 17  Fecha_orquiectomia  1639 non-null   object
 18  Metastasis          1639 non-null   int64 
 19  Tipo_Rec_1          1639 non-null   int64 
 20  Tipo_Rec_2          1639 non-null   int64 
 21  Fecha_Rec           1639 non-null   object
 22  Fecha_Tto_Rec       1639 non-null   object
 23  Fecha_Defun         1639 non-null   object
 24  Causa_Defuncion     1639 non-null   object
 25  Fecha_Ult_Contacto  1639 non-null   object
 26  Fecha_Reg_Seguim    1639 non-null   object
 27  Estado_Vital        1639 non-null   object
 28  Fecha_Pase_Control  1639 non-null   object
 29  Cod_Estado_Tratam   1639 non-null   int64 
 30  Estado_Tratamiento  1639 non-null   object
 31  T                   1639 non-null   object
 32  N                   1639 non-null   object
 33  M                   1639 non-null   object
 34  E                   1639 non-null   object
 35  CX_fuera            1639 non-null   int64 
 36  HT_fuera            1639 non-null   int64 
 37  QT_fuera            1639 non-null   int64 
 38  OT_fuera            1639 non-null   int64 
 39  CX_Rec              1639 non-null   int64 
 40  HT_Rec              1639 non-null   int64 
 41  QT_Rec              1639 non-null   int64 
 42  RT_Rec              1639 non-null   int64 
 43  Fecha_Tto_1         1639 non-null   object
 44  Aband_Tto           1639 non-null   int64 
 45  Hormonoterapia      1639 non-null   int64 
 46  Orquiectomia        1639 non-null   int64 
 47  Recurrecia          1639 non-null   int64 
dtypes: int64(22), object(26)
memory usage: 614.8+ KB

Each variable representing a date must be converted to the "datetime64[ns]" dtype.

dp2=dp2.assign(Fecha_Diag=lambda dataset:dataset["Fecha_Diag"].astype("datetime64[ns]"),
              Fecha_Tto_afuera=lambda dataset:dataset["Fecha_Tto_afuera"]\
                                                         .astype("datetime64[ns]"),
              Fecha_Tto_1=lambda dataset:dataset["Fecha_Tto_1"]\
                                                         .astype("datetime64[ns]"),
              Fecha_Aband_Tto=lambda dataset:dataset["Fecha_Aband_Tto"]\
                                                         .astype("datetime64[ns]"),
              Fecha_CX=lambda dataset:dataset["Fecha_CX"].astype("datetime64[ns]"),
              Fecha_RT=lambda dataset:dataset["Fecha_RT"].astype("datetime64[ns]"),
              Fecha_HT=lambda dataset:dataset["Fecha_HT"].astype("datetime64[ns]"),
              Fecha_orquiectomia=lambda dataset:dataset["Fecha_orquiectomia"]\
                                                          .astype("datetime64[ns]"),
              Fecha_Rec=lambda dataset:dataset["Fecha_Rec"].astype("datetime64[ns]"),
              Fecha_Tto_Rec=lambda dataset:dataset["Fecha_Tto_Rec"]\
                                                          .astype("datetime64[ns]"),
              Fecha_Defun=lambda dataset:dataset["Fecha_Defun"]\
                                                          .astype("datetime64[ns]"),
              Fecha_Ult_Contacto=lambda dataset:dataset["Fecha_Ult_Contacto"]\
                                                          .astype("datetime64[ns]"),
              Fecha_Reg_Seguim=lambda dataset:dataset["Fecha_Reg_Seguim"]\
                                                          .astype("datetime64[ns]"),
              Fecha_Pase_Control=lambda dataset:dataset["Fecha_Pase_Control"]\
                                                          .astype("datetime64[ns]"),
               )
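The same conversion can be written as a loop over the date columns with `pd.to_datetime`. This is a sketch on toy data and assumes every date column starts with "Fecha_" and uses the `YYYY/MM/DD` format seen in the tables above.

```python
import pandas as pd

# Hypothetical frame; "" stands for a missing date after fillna("").
toy = pd.DataFrame({
    "Fecha_Diag": ["2000/01/02", "2000/01/04"],
    "Fecha_Defun": ["", "2002/06/11"],
})

# Select the date columns by their common prefix.
date_cols = [c for c in toy.columns if c.startswith("Fecha_")]

for col in date_cols:
    # errors="coerce" turns empty or malformed strings into NaT.
    toy[col] = pd.to_datetime(toy[col], format="%Y/%m/%d", errors="coerce")

print(toy.dtypes)
```

Note that `errors="coerce"` also hides malformed dates by turning them into NaT, so it is worth comparing null counts before and after the conversion.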

For this type of review, only the stages are needed and not their subclassifications. A search is made for all the categories present in column "E", which contains the stages of each patient.

dp2["E"].value_counts()
IV      628
777     350
II      268
III     202
99      100
I        84
IIIB      4
88        2
IC        1
Name: E, dtype: int64

Only stages I, II, III and IV are retained, and their subclassifications are attached to the main branch of stages.

dp2 = dp2.replace('IC', 'I')
dp2 = dp2.replace('IIIB', 'III')
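If more subclassifications appeared, the main stage could be extracted with one regular expression instead of one replace per value. The sketch below uses toy values taken from the counts above; the name `main_stage` is hypothetical.

```python
import pandas as pd

toy_stage = pd.Series(["IV", "IIIB", "IC", "777", "II"])

# Longer numerals come first so "IV" and "III" are not cut down to "I".
main_stage = toy_stage.str.extract(r"^(IV|III|II|I)", expand=False)
print(main_stage.tolist())
```

Values that do not start with a Roman numeral, such as the code 777, come back as NaN.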

Finally, we replace the missing data with np.nan and rearrange the columns. The following invalid values must be considered:

  • "" : Empty string (absence of information)
  • 777 : No information available
  • 888 : Not applicable
  • 99 : No data
  • 88 : No data
dp2 = dp2.replace(['888','777','88','99',''], np.nan)
dp2 = dp2[[ "index","Edad_diag", "Diagnostico", "Cod_SEER_Estadio", "SEER_Estadio",
           "Fecha_Diag", "T", "N", "M", "E", "CX_fuera", "HT_fuera", 
           "QT_fuera", "OT_fuera","Fecha_Tto_afuera", "Ttto_paliativo",
           "Tto_1", "Muere_antes_Tto","Fecha_Tto_1","Aband_Tto", 
           "Fecha_Aband_Tto","Fecha_CX", "Hormonoterapia","Fecha_HT",
           "Orquiectomia","Fecha_orquiectomia","Radioterapia","Dosis_Recib",
           "Fecha_RT","Recurrecia","Tipo_Rec_1", "Tipo_Rec_2","Fecha_Rec",
           "Metastasis","CX_Rec", "HT_Rec", "QT_Rec", "RT_Rec", 
           "Fecha_Tto_Rec","Fecha_Pase_Control","Fecha_Reg_Seguim",
           "Fecha_Ult_Contacto","Cod_Estado_Tratam","Estado_Tratamiento",
           "Estado_Vital","Fecha_Defun","Causa_Defuncion"]]
dp2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1639 entries, 0 to 1638
Data columns (total 47 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   index               1639 non-null   int64         
 1   Edad_diag           1639 non-null   int64         
 2   Diagnostico         1639 non-null   object        
 3   Cod_SEER_Estadio    1639 non-null   int64         
 4   SEER_Estadio        1639 non-null   object        
 5   Fecha_Diag          1639 non-null   datetime64[ns]
 6   T                   1176 non-null   object        
 7   N                   1229 non-null   object        
 8   M                   1273 non-null   object        
 9   E                   1187 non-null   object        
 10  CX_fuera            1639 non-null   int64         
 11  HT_fuera            1639 non-null   int64         
 12  QT_fuera            1639 non-null   int64         
 13  OT_fuera            1639 non-null   int64         
 14  Fecha_Tto_afuera    342 non-null    datetime64[ns]
 15  Ttto_paliativo      1639 non-null   int64         
 16  Tto_1               806 non-null    object        
 17  Muere_antes_Tto     1639 non-null   int64         
 18  Fecha_Tto_1         806 non-null    datetime64[ns]
 19  Aband_Tto           1639 non-null   int64         
 20  Fecha_Aband_Tto     217 non-null    datetime64[ns]
 21  Fecha_CX            277 non-null    datetime64[ns]
 22  Hormonoterapia      1639 non-null   int64         
 23  Fecha_HT            129 non-null    datetime64[ns]
 24  Orquiectomia        1639 non-null   int64         
 25  Fecha_orquiectomia  235 non-null    datetime64[ns]
 26  Radioterapia        1639 non-null   int64         
 27  Dosis_Recib         326 non-null    float64       
 28  Fecha_RT            340 non-null    datetime64[ns]
 29  Recurrecia          1639 non-null   int64         
 30  Tipo_Rec_1          1639 non-null   int64         
 31  Tipo_Rec_2          1639 non-null   int64         
 32  Fecha_Rec           1182 non-null   datetime64[ns]
 33  Metastasis          1639 non-null   int64         
 34  CX_Rec              1639 non-null   int64         
 35  HT_Rec              1639 non-null   int64         
 36  QT_Rec              1639 non-null   int64         
 37  RT_Rec              1639 non-null   int64         
 38  Fecha_Tto_Rec       418 non-null    datetime64[ns]
 39  Fecha_Pase_Control  1624 non-null   datetime64[ns]
 40  Fecha_Reg_Seguim    1639 non-null   datetime64[ns]
 41  Fecha_Ult_Contacto  1639 non-null   datetime64[ns]
 42  Cod_Estado_Tratam   1639 non-null   int64         
 43  Estado_Tratamiento  1639 non-null   object        
 44  Estado_Vital        1639 non-null   object        
 45  Fecha_Defun         871 non-null    datetime64[ns]
 46  Causa_Defuncion     868 non-null    object        
dtypes: datetime64[ns](14), float64(1), int64(22), object(10)
memory usage: 601.9+ KB

Modification

New variables are created using the clean dataset in order to generate graphs to better visualize the contained data.

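Three new variables are created in the dataset:

  • tiempo_defuncion: years from diagnosis to death.
  • tiempo_vivo: years from diagnosis to the cut-off date (2020-12-31), computed only for patients who are still alive.
  • tiempo_recurr: years from diagnosis to recurrence.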
tiempo_final="2020-12-31"
tiempo_final=datetime.strptime(tiempo_final, '%Y-%m-%d')

dp2=dp2.assign(tiempo_defuncion=lambda dataset:round((dataset["Fecha_Defun"]-dataset["Fecha_Diag"]).dt.days / 365,1),
               tiempo_vivo=lambda dataset:round((tiempo_final-dataset["Fecha_Diag"][dataset["Estado_Vital"]=="Vivo"]).dt.days / 365,1),
               tiempo_recurr=lambda dataset:round((dataset["Fecha_Rec"]-dataset["Fecha_Diag"]).dt.days / 365,1))
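
The elapsed-time calculation above can be tried on a couple of toy records. This is only a sketch: the dates are invented, the helper `years_between` and its `days_per_year` parameter are hypothetical names, and a missing death date is assumed to mean that the patient is alive (the notebook itself uses `Estado_Vital`).

```python
import pandas as pd

# Toy records (hypothetical values, not taken from the patient database).
toy = pd.DataFrame({
    "Fecha_Diag": pd.to_datetime(["2005-01-01", "2008-06-15"]),
    "Fecha_Defun": pd.to_datetime(["2010-01-01", None]),  # None -> NaT (no death recorded)
})
cutoff = pd.Timestamp("2020-12-31")  # same cut-off date as tiempo_final

def years_between(start, end, days_per_year=365):
    """Elapsed time in years, rounded to one decimal; missing dates give NaN."""
    return ((end - start).dt.days / days_per_year).round(1)

toy["tiempo_defuncion"] = years_between(toy["Fecha_Diag"], toy["Fecha_Defun"])
# Follow-up up to the cut-off, kept only where no death date is recorded.
toy["tiempo_vivo"] = years_between(toy["Fecha_Diag"], cutoff).where(toy["Fecha_Defun"].isna())
print(toy[["tiempo_defuncion", "tiempo_vivo"]])
```

Dividing by 365 mirrors the notebook; passing `days_per_year=365.25` would account for leap years, which changes the result only slightly at one decimal of precision.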

A query is made to look for negative values in the new variable "tiempo_defuncion" since there should not be any value less than zero.

dp2[["Fecha_Diag","Fecha_Defun"]][dp2["tiempo_defuncion"]<0]
      Fecha_Diag  Fecha_Defun
493   2004-01-01   2001-02-11
1046  2007-10-09   2006-03-17

Two negative values are found: in both records the death date precedes the diagnosis date. Since they are only two records out of the whole dataset, they are dropped.

dp2.drop(dp2[["Fecha_Diag","Fecha_Defun"]][dp2["tiempo_defuncion"]<0].index,inplace=True)
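
The same consistency check can be sketched by comparing the two date columns directly with a boolean mask. The first toy row reuses the dates of record 493 shown above; the other rows and the names `inconsistent` and `clean` are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Fecha_Diag": pd.to_datetime(["2004-01-01", "2006-05-10", "2007-10-09"]),
    "Fecha_Defun": pd.to_datetime(["2001-02-11", None, "2009-03-17"]),
})

# A death date earlier than the diagnosis date is inconsistent.
# Comparisons involving NaT evaluate to False, so living patients are kept.
inconsistent = toy["Fecha_Defun"] < toy["Fecha_Diag"]

# Keep the consistent rows in a new frame instead of dropping in place.
clean = toy.loc[~inconsistent].copy()
print(clean)
```

Returning a filtered copy leaves the original frame untouched, which makes it easier to inspect what was removed.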

A query is made to look for negative values in the new variable "tiempo_vivo" since there should not be any value less than zero.

dp2[["Fecha_Diag","Fecha_Defun"]][dp2["tiempo_vivo"]<0]
Fecha_Diag Fecha_Defun

Since no negative value is found, the dataset is preserved. At this point the dataset can be considered clean.

4.- Data visualization

sns.set_theme(style="darkgrid")
g=sns.displot(data=dp2,x="Edad_diag",
            bins=30,kde=True,
            kde_kws={'bw_adjust':0.9,'bw_method':'scott'},
            stat='density', height=6,aspect=1.5).set(ylabel=None).set(title='Age distribution plot')

plt.axvline(dp2["Edad_diag"].mean(),c="red", ls='--',lw=2.5);
g.set(ylabel='Density', xlabel='Age at diagnosis');

png

g = sns.catplot(x="E",
                y="Edad_diag",
                data=dp2.sort_values("E",ascending=True),
                kind="box",
                palette="Paired_r",
                saturation=0.7,
                height=5, aspect=1.5).set(title='Boxplot age grouped by stages')

g.set(xlabel='Stage', ylabel='Age at diagnosis');

png

g=sns.displot(data=dp2,x="tiempo_vivo",
            bins=25,kde=True,
            kde_kws={'bw_adjust':0.9,'bw_method':'scott'},
            stat='density', height=6,aspect=1.5).set(ylabel=None).set(title='Time alive distribution plot')

plt.axvline(dp2["tiempo_vivo"].mean(),c="red", ls='--',lw=2.5);
g.set(xlabel='Time alive', ylabel='Density');

png

g=sns.displot(data=dp2,x="tiempo_defuncion",
            bins=40,kde=True,
            kde_kws={'bw_adjust':0.9,'bw_method':'scott'},
            stat='density', height=6,aspect=1.5).set(ylabel=None).set(title='Time of death distribution plot')

plt.axvline(dp2["tiempo_defuncion"].mean(),c="red", ls='--',lw=2.5);
g.set(xlabel='Time of death', ylabel='Density');

png

plt.figure(figsize = (8,6))
g=sns.countplot(x='E',
                data=dp2.sort_values("E",ascending=True),
                saturation=1)

g.set(ylim=(0, 800))
g.set(title='Countplot grouped by stages')
g.set(xlabel='Stages', ylabel='Count');
for p in g.patches:
    g.annotate('{:.2f}%'.format(p.get_height()*100/dp2['E'].count()), (p.get_x()+0.25, p.get_height()+10))

png

plt.figure(figsize = (8,6))
g=sns.countplot(x='E',
                data=dp2.sort_values("E",ascending=True),
                hue="Estado_Vital",hue_order=["Muerto","Vivo"],
                saturation=1)
g.set(ylim=(0, 600))
g.set(title='Countplot grouped by stages')
g.set(xlabel='Stages', ylabel='Count');
plt.legend(loc='upper left')
for p in g.patches:
    g.annotate('{:.2f}%'.format(p.get_height()*100/dp2['E'].count()), (p.get_x()+.05, p.get_height()+10))

png

g=sns.displot(data=dp2[dp2["E"]=="I"],
              x="tiempo_defuncion",
              bins=10,kde=True,
              kde_kws={'bw_adjust':0.9,'bw_method':'scott'},
              stat='density', height=6,aspect=1.5).set(ylabel=None)

g.set(title='Time of death distribution plot: Stage I')
g.set(xlabel='Time of death Stage I', ylabel='Density')
plt.axvline(dp2[dp2["E"]=="I"]["tiempo_defuncion"].mean(),c="red", ls='--',lw=2.5);

png

plt.figure(figsize = (7,5))
g=sns.countplot(x='E',
                data=dp2[dp2["E"]=="I"],
                hue="Estado_Vital",hue_order=["Muerto","Vivo"],
                saturation=1)

g.set(ylim=(0, 80))
plt.legend(loc='upper left')
g.set(title='Countplot Stage I')
g.set(xlabel='Stage I', ylabel='Count')

for p in g.patches:
    g.annotate('{:.2f}%'.format(p.get_height()*100/dp2[dp2["E"]=="I"]["index"].count()), (p.get_x()+.15, p.get_height()+2))

png

g=sns.displot(data=dp2[dp2["E"]=="II"],
              x="tiempo_defuncion",
              bins=10,kde=True,
              kde_kws={'bw_adjust':0.9,'bw_method':'scott'},
              stat='density', height=6,aspect=1.5).set(ylabel=None)

g.set(title='Time of death distribution plot: Stage II')
g.set(xlabel='Time of death Stage II', ylabel='Density')
plt.axvline(dp2[dp2["E"]=="II"]["tiempo_defuncion"].mean(),c="red", ls='--',lw=2.5);

png

plt.figure(figsize = (7,5))
g=sns.countplot(x='E',
                data=dp2[dp2["E"]=="II"],
                hue="Estado_Vital",hue_order=["Muerto","Vivo"],
                saturation=1)
g.set(ylim=(0, 210))
plt.legend(loc='upper left')
g.set(title='Countplot Stage II')
g.set(xlabel='Stage II', ylabel='Count')
for p in g.patches:
    g.annotate('{:.2f}%'.format(p.get_height()*100/dp2[dp2["E"]=="II"]["index"].count()), (p.get_x()+.15, p.get_height()+2))

png

g=sns.displot(data=dp2[dp2["E"]=="III"],
              x="tiempo_defuncion",
              bins=10,kde=True,
              kde_kws={'bw_adjust':0.9,'bw_method':'scott'},
              stat='density', height=6,aspect=1.5).set(ylabel=None)

g.set(title='Time of death distribution plot: Stage III')
g.set(xlabel='Time of death Stage III', ylabel='Density')
plt.axvline(dp2[dp2["E"]=="III"]["tiempo_defuncion"].mean(),c="red", ls='--',lw=2.5);

png

plt.figure(figsize = (7,5))
g=sns.countplot(x='E',
                data=dp2[dp2["E"]=="III"],
                hue="Estado_Vital",hue_order=["Muerto","Vivo"],
                saturation=1)
g.set(ylim=(0, 150))
plt.legend(loc='upper left')
g.set(title='Countplot Stage III')
g.set(xlabel='Stage III', ylabel='Count')
for p in g.patches:
    g.annotate('{:.2f}%'.format(p.get_height()*100/dp2[dp2["E"]=="III"]["index"].count()), (p.get_x()+.15, p.get_height()+2))

png

g=sns.displot(data=dp2[dp2["E"]=="IV"],
              x="tiempo_defuncion",
              bins=10,kde=True,
              kde_kws={'bw_adjust':0.9,'bw_method':'scott'},
              stat='density', height=6,aspect=1.5).set(ylabel=None)

g.set(title='Time of death distribution plot: Stage IV')
g.set(xlabel='Time of death Stage IV', ylabel='Density')
plt.axvline(dp2[dp2["E"]=="IV"]["tiempo_defuncion"].mean(),c="red", ls='--',lw=2.5);

png

plt.figure(figsize = (7,5))
g=sns.countplot(x='E',
                data=dp2[dp2["E"]=="IV"],
                hue="Estado_Vital",hue_order=["Muerto","Vivo"],
                saturation=1)
g.set(ylim=(0, 520))
plt.legend(loc='upper right')
g.set(title='Countplot Stage IV')
g.set(xlabel='Stage IV', ylabel='Count')
for p in g.patches:
    g.annotate('{:.2f}%'.format(p.get_height()*100/dp2[dp2["E"]=="IV"]["index"].count()), (p.get_x()+.15, p.get_height()+2))

png
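
The share of living and dead patients within each stage, which the count plots above annotate bar by bar, can also be tabulated in one step. The sketch below uses invented toy data and `pd.crosstab` with `normalize="index"`, so each row (stage) sums to 100%; it is not the computation used for the figures.

```python
import pandas as pd

# Toy data (hypothetical): 4 stage I patients and 5 stage IV patients.
toy = pd.DataFrame({
    "E": ["I", "I", "I", "I", "IV", "IV", "IV", "IV", "IV"],
    "Estado_Vital": ["Vivo", "Vivo", "Vivo", "Muerto",
                     "Vivo", "Muerto", "Muerto", "Muerto", "Muerto"],
})

# Row-wise percentages: vital status distribution within each stage.
pct = pd.crosstab(toy["E"], toy["Estado_Vital"], normalize="index") * 100
print(pct.round(2))
```

Applied to `dp2`, the same call would give per-stage percentages directly, without dividing by the count of each stage inside the annotation loop.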

Radiotherapy

g=sns.displot(data=dp2[(dp2["Radioterapia"]==1)],
              x="tiempo_defuncion",
            bins=25,kde=True,
            kde_kws={'bw_adjust':0.9,'bw_method':'scott'},
            stat='density', height=6,aspect=1.5).set(ylabel=None)

g.set(title='Time of death distribution plot: Radiotherapy patients')
g.set(xlabel='Time of death', ylabel='Density')
plt.axvline(dp2[dp2["Radioterapia"]==1]["tiempo_defuncion"].mean(),c="red", ls='--',lw=2.5);

png

g=sns.displot(data=dp2[dp2["Radioterapia"]==0],x="tiempo_defuncion",
            bins=25,kde=True,
            kde_kws={'bw_adjust':0.9,'bw_method':'scott'},
            stat='density', height=6,aspect=1.5).set(ylabel=None)

g.set(title='Time of death distribution plot: Patients without radiotherapy')
g.set(xlabel='Time of death', ylabel='Density')
plt.axvline(dp2[dp2["Radioterapia"]==0]["tiempo_defuncion"].mean(),c="red", ls='--',lw=2.5);

png

Survival

from lifelines import KaplanMeierFitter
from lifelines.utils import survival_events_from_table
from lifelines.utils import survival_table_from_events
df_survival=dp2.assign(R=dp2["Radioterapia"],
                       # Caution: `row!=np.nan` is always True (NaN never compares equal to anything),
                       # so C ends up equal to 1 for every row.
                       C=lambda dataset:dataset["Causa_Defuncion"].apply(lambda row:1 if row=="Otros" or row!=np.nan else 0),
                       # S: event indicator (1 = dead, 0 = alive, NaN = unknown vital status)
                       S=lambda dataset:dataset["Estado_Vital"].apply(lambda row:1 if row=="Muerto" 
                                                                                   else 0 if row=="Vivo" 
                                                                                   else np.nan))
# T: follow-up time in years (time to death when available, otherwise time alive up to the cut-off date)
df_survival=df_survival.assign(T=lambda dataset:dataset["tiempo_defuncion"].fillna(dataset["tiempo_vivo"]))
df_survival=df_survival.assign(T=round(df_survival["T"],0))
df_survival=df_survival[["T","S","C","E","R"]].dropna(subset=["E"])
df_survival.set_index("T",inplace=True,drop=False)
time, event, weight = survival_events_from_table(df_survival,
                                                 observed_deaths_col="S",
                                                 censored_col="C")
table=survival_table_from_events(df_survival["T"],
                                 df_survival["S"])
print(table.head())
          removed  observed  censored  entrance  at_risk
event_at                                                
0.0            57        57         0      1186     1186
1.0           101       101         0         0     1129
2.0           109       109         0         0     1028
3.0            57        57         0         0      919
4.0            58        58         0         0      862
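
To make the link between this table and a survival curve explicit, the Kaplan-Meier (product-limit) estimate can be computed by hand from the `at_risk` and `observed` columns: at each time the running survival is multiplied by the fraction of at-risk patients who did not die. The function `kaplan_meier` below is a hypothetical helper written for illustration, not part of lifelines, and it only covers the five rows printed above.

```python
def kaplan_meier(at_risk, observed):
    """Product-limit estimate: S(t_k) = prod over i <= k of (1 - d_i / n_i)."""
    survival, s = [], 1.0
    for n_i, d_i in zip(at_risk, observed):
        s *= 1.0 - d_i / n_i  # fraction of at-risk patients surviving this step
        survival.append(s)
    return survival

# First five rows of the table above (years 0 to 4 after diagnosis).
at_risk = [1186, 1129, 1028, 919, 862]
observed = [57, 101, 109, 57, 58]

surv = kaplan_meier(at_risk, observed)
for year, s in enumerate(surv):
    print(year, round(s, 3))
```

Because no patient is censored in these first rows, each estimate equals the number still at risk after that year divided by the initial 1186; for year 0 this is 1129/1186, about 0.952.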
kmf=KaplanMeierFitter()
plt.figure(figsize = (8,6))
timelines=range(0,25,5)
kmf.fit(time,event,label="Survival Curve", timeline=timelines)
fig=kmf.plot_survival_function(show_censors=False)
fig.set(ylim=(0, 1.1),xlim=(0, 23.0))
fig.set_title('Survival Curve of all patients')

i=0
for item in kmf.survival_function_["Survival Curve"]:
    if item>.5:
        fig.annotate(str(round(item,2)),xy=(i+2,round(item,2)+.05))
    else:
        fig.annotate(str(round(item,2)),xy=(i+0.8,round(item,2)))
    i+=5

png

kmf.survival_function_
          Survival Curve
timeline
0.0             0.968818
5.0             0.739353
10.0            0.603386
15.0            0.539839
20.0            0.490894
plt.figure(figsize = (8,6))
kmf.fit(df_survival[df_survival["E"]=="I"]["T"],
        df_survival[df_survival["E"]=="I"]["S"],
        label="Stage I")
df_s1=kmf.survival_function_
ax=kmf.plot()
kmf.fit(df_survival[df_survival["E"]=="II"]["T"],
        df_survival[df_survival["E"]=="II"]["S"],
        label="Stage II")
df_s2=kmf.survival_function_
ax=kmf.plot(ax=ax)
kmf.fit(df_survival[df_survival["E"]=="III"]["T"],
        df_survival[df_survival["E"]=="III"]["S"],
        label="Stage III")
df_s3=kmf.survival_function_
ax=kmf.plot(ax=ax)
kmf.fit(df_survival[df_survival["E"]=="IV"]["T"],
        df_survival[df_survival["E"]=="IV"]["S"],
        label="Stage IV")
df_s4=kmf.survival_function_
ax=kmf.plot(ax=ax)
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5));
ax.set_title('Survival Curve grouped by stages')
Text(0.5, 1.0, 'Survival Curve grouped by stages')

png

df_stages=pd.merge(df_s1,df_s2,on='timeline')
df_stages=pd.merge(df_stages,df_s3,on='timeline')
df_stages=pd.merge(df_stages,df_s4,on='timeline')
df_stages
           Stage I  Stage II  Stage III  Stage IV
timeline
0.0       0.976190  0.988806   0.995146  0.918790
2.0       0.940476  0.962687   0.946602  0.616242
3.0       0.916667  0.951493   0.912621  0.544586
5.0       0.892857  0.884328   0.859223  0.436306
7.0       0.845238  0.835821   0.810680  0.332803
8.0       0.821429  0.809701   0.766990  0.305732
10.0      0.809524  0.779851   0.718447  0.272293
11.0      0.796032  0.763519   0.707394  0.250969
12.0      0.796032  0.735240   0.686988  0.235160
13.0      0.773288  0.723476   0.669596  0.232735
14.0      0.773288  0.716586   0.647996  0.232735
15.0      0.773288  0.708712   0.647996  0.225989
16.0      0.773288  0.708712   0.647996  0.225989
17.0      0.773288  0.708712   0.647996  0.220195
18.0      0.773288  0.708712   0.647996  0.220195
20.0      0.773288  0.708712   0.647996  0.186319
plt.figure(figsize = (8,6))
kmf.fit(df_survival[df_survival["R"]==1]["T"],df_survival[df_survival["R"]==1]["S"],label="With radiotherapy")
df_R1=kmf.survival_function_
ax=kmf.plot()
kmf.fit(df_survival[df_survival["R"]==0]["T"],df_survival[df_survival["R"]==0]["S"],label="Without radiotherapy")
df_R0=kmf.survival_function_
ax=kmf.plot(ax=ax)
ax.set_title('Survival Curve patients with/without radiotherapy');

png

df_R=pd.merge(df_R1,df_R0,on='timeline')
df_R
          With radiotherapy  Without radiotherapy
timeline
0.0                0.996016              0.940107
1.0                0.988048              0.834225
2.0                0.952191              0.727273
3.0                0.920319              0.674866
4.0                0.904382              0.617112
5.0                0.876494              0.580749
6.0                0.860558              0.527273
7.0                0.828685              0.495187
8.0                0.800797              0.465241
9.0                0.792829              0.445989
10.0               0.772908              0.429947
11.0               0.749767              0.412975
12.0               0.727218              0.395601
13.0               0.711579              0.389089
14.0               0.701963              0.385255
15.0               0.701963              0.378293
16.0               0.701963              0.378293
17.0               0.701963              0.374225
18.0               0.701963              0.374225
19.0               0.701963              0.374225
20.0               0.701963              0.338585

Recurrence

plt.figure(figsize = (8,6))
g=sns.countplot(x='Recurrecia',
                data=dp2,
                saturation=1)
g.set(ylim=(0, 1300))

labels = (["No","Yes"])
g.set_xticklabels(labels)
g.set(xlabel='Recurrence', ylabel='Count')
g.set(title='Countplot Recurrence II')
for p in g.patches:
    g.annotate('{:.2f} [%]'.format(p.get_height()*100/dp2['Edad_diag'].count()), (p.get_x()+0.32, p.get_height()+50))

png

df_recurr=dp2.assign(C=lambda dataset:dataset["Aband_Tto"],
                     S=lambda dataset:dataset["Recurrecia"])
df_recurr=df_recurr.assign(T=round(df_recurr["tiempo_recurr"],0))
df_recurr=df_recurr[["T","S","C","E"]].dropna(subset=["E"])
df_recurr.replace(np.nan,0,inplace=True)
df_recurr.set_index("T",inplace=True,drop=False)
time, event, weight = survival_events_from_table(df_recurr,
                                                 observed_deaths_col="S",
                                                 censored_col="C")
table=survival_table_from_events(df_recurr["T"],
                                 df_recurr["S"])
print(table.head())
          removed  observed  censored  entrance  at_risk
event_at                                                
0.0           380        50       330      1186     1186
1.0            95        95         0         0      806
2.0            90        90         0         0      711
3.0            44        44         0         0      621
4.0            47        47         0         0      577
kmf2=KaplanMeierFitter()
plt.figure(figsize = (8,6))
timelines=range(0,25,5)
kmf2.fit(time,event,label="Recurrence Curve", timeline=timelines)
fig=kmf2.plot_survival_function(show_censors=False)
fig.set(ylim=(0, 1.1),xlim=(0, 23.0))
fig.set_title('Recurrence Curve of all patients')

i=0
for item in kmf2.survival_function_["Recurrence Curve"]:
    if item>.03:
        fig.annotate(str(round(item,2)),xy=(i+2,round(item,2)+.05))
    else:
        fig.annotate(str(round(item,2)),xy=(i+0.8,round(item,2)))
    i+=5

png

kmf2.survival_function_
          Recurrence Curve
timeline
0.0               0.949495
5.0               0.623724
10.0              0.259126
15.0              0.026448
20.0              0.011652
plt.figure(figsize = (8,6))
kmf2.fit(df_recurr[df_recurr["E"]=="I"]["T"],df_recurr[df_recurr["E"]=="I"]["S"],label="Stage I")
df2_s1=kmf2.survival_function_
ax=kmf2.plot()
kmf2.fit(df_recurr[df_recurr["E"]=="II"]["T"],df_recurr[df_recurr["E"]=="II"]["S"],label="Stage II")
df2_s2=kmf2.survival_function_
ax=kmf2.plot(ax=ax)
kmf2.fit(df_recurr[df_recurr["E"]=="III"]["T"],df_recurr[df_recurr["E"]=="III"]["S"],label="Stage III")
df2_s3=kmf2.survival_function_
ax=kmf2.plot(ax=ax)
kmf2.fit(df_recurr[df_recurr["E"]=="IV"]["T"],df_recurr[df_recurr["E"]=="IV"]["S"],label="Stage IV")
df2_s4=kmf2.survival_function_
ax=kmf2.plot(ax=ax)
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5));
ax.set_title('Recurrence Curve grouped by stages')
Text(0.5, 1.0, 'Recurrence Curve grouped by stages')

png

df_stages_recurr=pd.merge(df2_s1,df2_s2,on='timeline')
df_stages_recurr=pd.merge(df_stages_recurr,df2_s3,on='timeline')
df_stages_recurr=pd.merge(df_stages_recurr,df2_s4,on='timeline')
df_stages_recurr
           Stage I  Stage II  Stage III  Stage IV
timeline
0.0       0.988095  0.981343   0.980583  0.936306
1.0       0.948037  0.922006   0.927096  0.750109
2.0       0.867921  0.858105   0.867667  0.590511
3.0       0.841216  0.812461   0.814181  0.529332
4.0       0.801158  0.789639   0.742866  0.457513
5.0       0.747748  0.766817   0.736923  0.409634
6.0       0.734395  0.730302   0.713151  0.345795
7.0       0.680985  0.689222   0.653722  0.303235
8.0       0.574163  0.534033   0.493263  0.234076
9.0       0.413932  0.410795   0.380347  0.170237
10.0      0.320463  0.333200   0.231774  0.122358
11.0      0.253700  0.287556   0.184231  0.093099
12.0      0.200290  0.205397   0.136687  0.069159
13.0      0.120174  0.146060   0.077258  0.045219
14.0      0.066763  0.073030   0.041600  0.018620
15.0      0.000000  0.004564   0.011886  0.010640

Conclusions

  • The average age at diagnosis of prostate cancer in the sample was around 70 years.

  • Patients with stage IV disease have a median age at diagnosis of 72 years, while patients with stages I, II and III have a median age at diagnosis between 65 and 70 years.

  • Among the patients in the sample who have died, the mean time from diagnosis to death is around 5 years.

  • Among the patients still alive at the end of 2020, the median time since diagnosis is around 14-15 years.

  • More than 50% of the patients in the sample have been diagnosed with stage IV.

  • From the sample of patients grouped by clinical stage, the following can be observed:

    • Stage I: 78.57% of the patients were alive when the database was generated in 2020.
    • Stage II: 72.76% of the patients were alive when the database was generated in 2020.
    • Stage III: 67.48% of the patients were alive when the database was generated in 2020.
    • Stage IV: 22.93% of the patients were alive when the database was generated in 2020.
  • Among patients who have died, those who received radiotherapy lived longer on average after diagnosis than those who did not.

  • About the overall survival of the patients in the sample:

    • The estimated probability of surviving beyond time 0 (the first months after diagnosis, since times are rounded to whole years) is 0.97.
    • The estimated probability of surviving beyond 5 years after diagnosis is 0.74.
    • The estimated probability of surviving beyond 10 years after diagnosis is 0.60.
    • The estimated probability of surviving beyond 15 years after diagnosis is 0.54.
    • The estimated probability of surviving beyond 20 years after diagnosis is 0.49.
  • About the survival grouped by stage of the patients in the sample:

    • The 5-year survival probability of a patient presenting with stage I is 0.89, while that of a patient presenting with stage IV is 0.44.
    • The 10-year survival probability of a patient presenting with stage I is 0.81, while that of a patient presenting with stage IV is 0.27.
    • The 15-year survival probability of a patient presenting with stage I is 0.77, while that of a patient presenting with stage IV is 0.23.
    • The 20-year survival probability of a patient presenting with stage I is 0.77, while that of a patient presenting with stage IV is 0.19.
  • The probability of survival of a patient with stage I, II or III is much higher than that of a patient with stage IV.

  • The more advanced the stage at diagnosis, the lower the probability of survival, with a sharp drop at stage IV.

  • About the survival of the sample grouped by whether patients received radiotherapy:

    • The 5-year survival probability of a patient who received radiotherapy is 0.88, while that of a patient who did not receive it is 0.58.
    • The 10-year survival probability of a patient who received radiotherapy is 0.77, while that of a patient who did not receive it is 0.43.
    • The 15-year survival probability of a patient who received radiotherapy is 0.70, while that of a patient who did not receive it is 0.38.
    • The 20-year survival probability of a patient who received radiotherapy is 0.70, while that of a patient who did not receive it is 0.34.
  • At every time point, the survival probability of patients who received radiotherapy is markedly higher than that of patients who did not receive this treatment.

  • About the recurrence of patients:

    • 78.08% of patients presented a recurrence of cancer during the study period.
    • The probability that the disease has not recurred at time 0 (the first months after diagnosis) is 0.95.
    • The probability that the disease does not recur in the first 5 years is 0.62.
    • The probability that the disease does not recur in the first 10 years is 0.26.
    • The probability that the disease does not recur in the first 15 years is 0.03.
    • The probability that the disease does not recur in the first 20 years is 0.01.
  • The probability of non-recurrence in patients with stage IV is much lower than in those with stages I, II and III. After 10 years the probabilities for all stages decrease together, with stages I and II being very similar.
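
The figures quoted in these conclusions are read off the fitted curves, which are step functions: the value at a given time is the last estimate at or before that time. A minimal sketch of that lookup follows, using the overall survival and recurrence values printed earlier. The helper `survival_at` is a hypothetical name, and the 5-year grid is the one used for the overall curves, so values between grid points are only as fine as that grid.

```python
from bisect import bisect_right

def survival_at(timeline, survival, t):
    """Value of a right-continuous step function at time t."""
    k = bisect_right(timeline, t) - 1  # index of the last grid point <= t
    if k < 0:
        return 1.0  # before the first time point nobody has had the event
    return survival[k]

timeline = [0.0, 5.0, 10.0, 15.0, 20.0]
overall = [0.968818, 0.739353, 0.603386, 0.539839, 0.490894]
no_recurrence = [0.949495, 0.623724, 0.259126, 0.026448, 0.011652]

s5 = survival_at(timeline, overall, 5)        # estimate at 5 years
s7 = survival_at(timeline, overall, 7.5)      # still the value reached at year 5 on this grid
recur_by_10 = 1 - survival_at(timeline, no_recurrence, 10)  # cumulative recurrence by year 10

print(s5, s7, round(recur_by_10, 2))
```

Taking `1 - S(t)` on the recurrence curve turns the probability of remaining recurrence-free into the estimated probability of having had a recurrence by time `t`.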