A medical centre offers various services that require customers to book appointments in advance, including medical appointments, small surgeries, blood tests and so on. Most customers attend their appointments on time, but about 20% of patients miss theirs. A no-show is costly to the medical centre because it blocks diary slots that other patients might have used, and it means that the patient who does not show up misses an opportunity to look after their health. The medical centre wants to reduce the impact of no-shows by calling or texting, in advance, the patients who are likely to miss their appointment, either to remind them of it or to offer to reschedule.
- Data
- Basic Data Information
- Data Cleansing
- Exploratory Data Analysis
- Creating a Model for Appointments No Show
The dataset is available on Kaggle: medical-appointment
The dataframe is composed of 110,527 medical appointments and 14 features.
For more information on the scholarship, please refer to this Wikipedia page
- PatientId: patient unique ID
- AppointmentID: appointment unique ID
- Gender: Male or Female
- ScheduledDay: the day the patient called or registered the appointment, which is before the appointment
- AppointmentDay: the day of the actual appointment, when the patient has to visit the doctor
- Age: the patient's age
- Neighbourhood: where the appointment takes place
- Scholarship: True or False
- Hipertension: True or False
- Diabetes: True or False
- Alcoholism: True or False
- Handcap: True or False
- SMS_received: True or False
- No-show: True or False
This section provides basic information about the data.
In the table above, the first number, the count, shows how many rows have non-missing values. In this instance, we have no missing values.
The second value is the mean, that is, the average: patients in df are on average 37 years old. Below that, std is the standard deviation, which measures how spread out the values are; in other words, it tells how close to the mean the data points are.
The Age column has a minimum of -1, which is erroneous data. Likewise, the maximum age of 115 years seems very high, as Brazil's life expectancy for 2020 is 77 years (please see here). We will deal with these errors in the next section.
The Handcap column should be binary (True or False), but it has a maximum value of 4. This will need to be investigated.
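The checks described above can be sketched as follows. This is a minimal illustration on a made-up stand-in for df (the `toy` frame and its values are assumptions, not the real data); only the column names Age and Handcap come from the dataset.

```python
import pandas as pd

# Made-up stand-in for df; the values are illustrative only.
toy = pd.DataFrame({
    "Age": [-1, 0, 25, 37, 62, 115],
    "Handcap": [0, 0, 1, 0, 4, 2],
})

# describe() reports count, mean, std, min, quartiles and max for each numeric column.
summary = toy.describe()

# Rows with an impossible (negative) age.
negative_age = toy[toy["Age"] < 0]

# Frequency of each Handcap value, to see whether the column is really binary.
handcap_counts = toy["Handcap"].value_counts().sort_index()

print(summary)
print(negative_age)
print(handcap_counts)
```

Looking at the value counts rather than only the maximum shows how many rows carry each non-binary value, which helps decide how to treat them.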
In this section, we want to amend some columns in df, fixing data types, misspellings and erroneous data:
- PatientId is currently a float; it will be converted into an integer
- ScheduledDay and AppointmentDay are currently objects; they will be converted into datetime
- AppointmentDay's time will be dropped (as it is set to 00:00:00)
- Misspelled column names will be renamed
- Erroneous data in the Age column will be deleted
Also, looking at the distribution of the Age feature, most of the patients are between 18 and 55 years old. The patients who are 115 years old are outliers; we will therefore drop these rows, as well as the row with an age of -1.
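The cleansing steps listed above could look roughly like this. It is a sketch on a made-up frame, not the notebook's actual code: the `raw` values, the timestamp format and the new column names (Hypertension, Handicap) are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up rows mimicking the raw columns; values and timestamp format are assumed.
raw = pd.DataFrame({
    "PatientId": [1001.0, 1002.0, 1003.0],
    "ScheduledDay": ["2016-04-29T18:38:08Z", "2016-04-27T08:36:51Z", "2016-04-26T10:00:00Z"],
    "AppointmentDay": ["2016-04-29T00:00:00Z"] * 3,
    "Age": [62, -1, 115],
    "Hipertension": [1, 0, 0],
    "Handcap": [0, 0, 1],
})

clean = raw.copy()

# Float IDs become integers.
clean["PatientId"] = clean["PatientId"].astype("int64")

# Object (string) columns become datetimes.
clean["ScheduledDay"] = pd.to_datetime(clean["ScheduledDay"])

# normalize() resets the time of day to midnight, leaving only the date part meaningful.
clean["AppointmentDay"] = pd.to_datetime(clean["AppointmentDay"]).dt.normalize()

# Rename the misspelled columns (target names are an assumption).
clean = clean.rename(columns={"Hipertension": "Hypertension", "Handcap": "Handicap"})

# Keep ages from 0 to 114, which drops both the -1 row and the 115-year-old rows.
clean = clean[clean["Age"].between(0, 114)].reset_index(drop=True)

print(clean.dtypes)
print(clean)
```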
Missed appointments account for 20% of the total appointments in the dataset.
The dataset does not have duplicated appointments but has 48,228 patients that can be considered as returning/known patients.
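As a rough sketch of how such figures can be derived, the example below computes a no-show rate, checks for duplicated appointment IDs, and counts repeat patients on a made-up frame. The `toy` data is an assumption; note that counting duplicated PatientId rows and counting distinct patients with more than one appointment give different numbers, so it is worth being explicit about which one is meant.

```python
import pandas as pd

# Made-up appointments; True in "No-show" means the appointment was missed.
toy = pd.DataFrame({
    "PatientId": [1, 1, 2, 3, 3, 3],
    "AppointmentID": [10, 11, 12, 13, 14, 15],
    "No-show": [False, True, False, False, False, True],
})

# The mean of a boolean column is the share of True values.
no_show_rate = toy["No-show"].mean()

# Appointment IDs that appear more than once.
duplicate_appointments = toy["AppointmentID"].duplicated().sum()

# Rows whose PatientId has already been seen (repeat appointments).
repeat_rows = toy["PatientId"].duplicated().sum()

# Distinct patients with more than one appointment.
returning_patients = (toy["PatientId"].value_counts() > 1).sum()

print(no_show_rate, duplicate_appointments, repeat_rows, returning_patients)
```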
The patients who seem most likely to miss their appointments are between 10 and 35 years old.
From the table above, we can clearly see that 'Female' patients have more appointments than 'Male' patients; they also have about double the number of missed appointments. However, the percentage of missed appointments is almost the same for both genders (about 20%). Therefore, gender does not seem to be an important feature.
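The difference between absolute counts and rates can be illustrated with a small grouped summary. The frame below is made up so that one group has twice the appointments and twice the missed appointments of the other, yet the same rate; it is not the real data.

```python
import pandas as pd

# Made-up data: 10 appointments for "F" (2 missed), 5 for "M" (1 missed).
toy = pd.DataFrame({
    "Gender": ["F"] * 10 + ["M"] * 5,
    "No-show": [True] * 2 + [False] * 8 + [True] + [False] * 4,
})

# Count, number missed, and share missed per gender.
by_gender = toy.groupby("Gender")["No-show"].agg(
    appointments="count",
    missed="sum",
    rate="mean",
)

print(by_gender)
```

Comparing the `rate` column rather than the `missed` column avoids mistaking a larger group for a riskier one.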
Bookings are made between 6am and 8pm. Three bookings were taken at 9pm; although these are shown as outliers on the plot above, they will not be dropped, as they could be the result of emergencies.
Morning bookings seem less likely to be missed than afternoon and evening ones.
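One way to compare morning, afternoon and evening bookings is to extract the booking hour and bucket it. The sketch below uses made-up timestamps, and the period boundaries (before 12:00, 12:00 to 16:59, 17:00 onwards) are assumptions, since the text does not state where the cut-offs lie.

```python
import pandas as pd

# Made-up bookings; True in "No-show" means the appointment was missed.
toy = pd.DataFrame({
    "ScheduledDay": pd.to_datetime([
        "2016-04-29 07:15:00",
        "2016-04-29 13:40:00",
        "2016-04-29 18:05:00",
        "2016-04-29 21:00:00",
    ]),
    "No-show": [False, True, True, False],
})

# Hour of the day at which the booking was made.
toy["booking_hour"] = toy["ScheduledDay"].dt.hour

# Left-closed bins: [0, 12) morning, [12, 17) afternoon, [17, 24) evening (assumed cut-offs).
toy["booking_period"] = pd.cut(
    toy["booking_hour"],
    bins=[0, 12, 17, 24],
    right=False,
    labels=["morning", "afternoon", "evening"],
)

# Share of missed appointments per booking period.
rate_by_period = toy.groupby("booking_period", observed=True)["No-show"].mean()

print(rate_by_period)
```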
Most of the bookings are made at the beginning of the week, which may be explained by the medical centre apparently not being open over the weekend. Although Saturdays have the smallest no-show rate, they represent too small a proportion of the data to be significant. Overall, the day of the week does not seem to be an important feature.
The first graph follows the same pattern as the section 5.6 graph, "Appointment Show/No Show by Booking Day of the Week". This may be because patients book and have their appointment on the same day. The percentage of show vs no show is roughly the same across the week.
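The weekday of both the booking and the appointment can be derived from the datetime columns. The following is a minimal sketch on made-up dates; the derived column names (`booking_weekday`, `appointment_weekday`) are hypothetical.

```python
import pandas as pd

# Made-up rows; True in "No-show" means the appointment was missed.
toy = pd.DataFrame({
    "ScheduledDay": pd.to_datetime(["2016-04-27 08:36:51", "2016-04-29 18:38:08"]),
    "AppointmentDay": pd.to_datetime(["2016-04-29", "2016-04-30"]),
    "No-show": [True, False],
})

# day_name() returns the English weekday name for each timestamp.
toy["booking_weekday"] = toy["ScheduledDay"].dt.day_name()
toy["appointment_weekday"] = toy["AppointmentDay"].dt.day_name()

# Number of appointments and share missed per appointment weekday.
by_weekday = toy.groupby("appointment_weekday")["No-show"].agg(
    appointments="count",
    rate="mean",
)

print(toy[["booking_weekday", "appointment_weekday"]])
print(by_weekday)
```

Keeping the appointment count next to the rate makes it easier to spot days, such as Saturdays, whose rate rests on very few rows.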
Most of the appointments are taken a month in advance. The graph above highlights erroneous data and outliers. The negative values will be assigned to an 'unknown' waiting time category, while the outliers will be kept, as some medical appointments (like small surgeries) can be booked up to six months ahead.
The graph above suggests that the longer the waiting time between the booking and the appointment, the more likely the appointment is to be missed. This feature seems to be an important one, as it shows a clear distinction between show and no show depending on how many days ahead of the appointment patients have booked.
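A waiting time feature of this kind might be built as below. This is a sketch under stated assumptions: the dates are made up, and the category boundaries and labels are illustrative choices, as the text only specifies that negative values go to an 'unknown' category.

```python
import pandas as pd

# Made-up bookings for a single appointment day; the last one is booked after the appointment.
toy = pd.DataFrame({
    "ScheduledDay": pd.to_datetime([
        "2016-04-29 08:00", "2016-04-20 10:30", "2016-03-01 15:00", "2016-05-03 09:00",
    ]),
    "AppointmentDay": pd.to_datetime(["2016-04-29"] * 4),
})

# Whole days between the booking date (time stripped) and the appointment date.
toy["waiting_days"] = (
    toy["AppointmentDay"] - toy["ScheduledDay"].dt.normalize()
).dt.days

# Right-closed bins: (-1, 0], (0, 7], (7, 30], (30, inf). Negative values fall outside and become NaN.
waiting_category = pd.cut(
    toy["waiting_days"],
    bins=[-1, 0, 7, 30, float("inf")],
    labels=["same day", "within a week", "within a month", "over a month"],
)

# Erroneous negative waiting times are labelled 'unknown' instead of being dropped.
toy["waiting_category"] = waiting_category.cat.add_categories("unknown").fillna("unknown")

print(toy[["waiting_days", "waiting_category"]])
```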
Most of the neighbourhoods have a no-show rate of about 20%. The significant drops and peaks are due to the poor representation of particular neighbourhoods in the dataset rather than to a real effect. Therefore, this feature does not seem to be important for the no-show prediction.
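To see why sparsely represented neighbourhoods produce extreme rates, it helps to keep the appointment count next to the rate. The sketch below uses made-up neighbourhood names and an arbitrary minimum of 5 appointments; both are assumptions for illustration.

```python
import pandas as pd

# Made-up data: A and B have 10 appointments each, C has a single (missed) one.
toy = pd.DataFrame({
    "Neighbourhood": ["A"] * 10 + ["B"] * 10 + ["C"],
    "No-show": [True] * 2 + [False] * 8 + [True] * 2 + [False] * 8 + [True],
})

# Appointment count and share missed per neighbourhood.
stats = toy.groupby("Neighbourhood")["No-show"].agg(appointments="count", rate="mean")

# Keep only neighbourhoods with enough appointments for the rate to mean something.
min_appointments = 5  # arbitrary threshold, chosen for this example
well_represented = stats[stats["appointments"] >= min_appointments]

print(stats)
print(well_represented)
```

Here neighbourhood C shows a 100% no-show rate, but only because it has one appointment.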
The graphs above show that 80% of the patients who do not have a scholarship attended their appointment, while 75% of the patients with a scholarship attended. This feature could be helpful in determining the no show.
The patients suffering from hypertension tend to attend their appointments more often than those who do not have this condition. However, appointments with hypertension patients represent a small pool in our dataset, just under 20% of the total appointments. This feature could be helpful in determining the no show.
The patients suffering from diabetes tend to attend their appointments more often than those who do not have this condition. This feature may not be helpful in determining the no show.
Patients who suffer from alcoholism represent only 3% of all the appointments.
Looking at the graph above, there does not seem to be a difference between the patients suffering from alcoholism and the rest of the dataset. This feature may not be helpful in determining the no show.
Patients suffering from a handicap represent 2% of the total appointments.
Looking at the graph above, there does not seem to be a difference between the patients suffering from a handicap and the rest of the dataset. This feature may not be helpful in determining the no show.
The graphs above do not show the expected results: 38% of the appointments for patients who received the SMS were missed, compared with 20% of the appointments for patients who did not receive an SMS. This feature seems to be important in determining appointment no-shows.
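The per-feature comparisons above (scholarship, hypertension, diabetes, alcoholism, handicap, SMS) all follow the same pattern, so they can be sketched with one small helper. The function name `no_show_rate_by` is hypothetical and the data is made up; only the column names come from the dataset.

```python
import pandas as pd


def no_show_rate_by(frame, column, target="No-show"):
    """Return the appointment count and share of missed appointments per value of `column`."""
    return frame.groupby(column)[target].agg(appointments="count", rate="mean")


# Made-up data; True in "No-show" means the appointment was missed.
toy = pd.DataFrame({
    "Scholarship": [0, 0, 0, 0, 0, 1, 1, 1, 1, 0],
    "SMS_received": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    "No-show": [True, False, False, False, False, True, True, False, False, False],
})

# One summary table per binary feature.
rates = {col: no_show_rate_by(toy, col) for col in ["Scholarship", "SMS_received"]}

for col, table in rates.items():
    print(col)
    print(table)
```

The `appointments` column also gives the size of each group, which matters for features such as alcoholism or handicap that cover only a few percent of the appointments.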
We are using the gradient boosting classifier to predict which customers are going to miss their appointment. First, we created a for loop to test different n_estimators values, with loss set to ‘deviance’, which refers to logistic regression for classification.
The best score occurs at n_estimators = 1200, therefore we are choosing it as our parameter.
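A search loop of this kind might look like the sketch below. It is not the notebook's code: the data is synthetic (generated with `make_classification`), the candidate values are kept small so the example runs quickly, and the loss is left at its default, which is the logistic loss that older scikit-learn releases called 'deviance' and newer ones call 'log_loss'.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data standing in for the prepared features (roughly 80% / 20%).
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit one model per candidate value and record its accuracy on the held-out split.
scores = {}
for n in [25, 50, 100]:
    model = GradientBoostingClassifier(n_estimators=n, random_state=0)
    model.fit(X_train, y_train)
    scores[n] = model.score(X_test, y_test)

# Candidate with the highest held-out accuracy.
best_n = max(scores, key=scores.get)

print(scores)
print(best_n)
```

Scoring on data that was not used for fitting is what makes the comparison between candidate values meaningful.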
Once we have identified the optimal n_estimators, we can fit the final model using it.
The model has 80% accuracy.
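The final fit and its accuracy can be sketched as follows, again on synthetic data with a small n_estimators so that it runs quickly. Because about 80% of the appointments in the dataset are attended, the sketch also computes the accuracy of a majority-class baseline (scikit-learn's `DummyClassifier`) for comparison; that baseline is an addition for context, not part of the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data standing in for the prepared features (roughly 80% / 20%).
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Final model fitted with the chosen number of estimators (100 here for speed).
final_model = GradientBoostingClassifier(n_estimators=100, random_state=0)
final_model.fit(X_train, y_train)
model_accuracy = accuracy_score(y_test, final_model.predict(X_test))

# Baseline that always predicts the most frequent class seen in training.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_accuracy = accuracy_score(y_test, baseline.predict(X_test))

print(model_accuracy, baseline_accuracy)
```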