Medical Appointment No-Shows

A medical centre offers various services that require customers to book appointments in advance, including medical appointments, small surgeries and blood tests. Most customers attend their appointments on time, but about 20% of patients miss theirs. A no-show is costly to the medical centre because it blocks diary slots that other patients might have used, and it also means that the patient who does not show up misses an opportunity to look after their health. The medical centre wants to reduce the impact of no-shows by calling or texting, in advance, patients who are likely to miss their appointment, either to remind them of it or to offer to reschedule.

Table of Contents

Data
Basic Data Information
Data Cleansing
Exploratory Data Analysis
Creating a Model for Appointments No Show

1. Data

The dataset is available on Kaggle: medical-appointment

The dataframe is composed of 110,527 medical appointments and 14 features

For more information on the scholarship, please refer to this Wikipedia page

Data Dictionary

PatientId: patient unique ID
AppointmentID: appointment unique ID
Gender: Male or Female
ScheduledDay: the day the patient called or registered the appointment; this is expected to be before the appointment
AppointmentDay: the day of the actual appointment, when the patient has to visit the doctor
Age: the age of the patient
Neighbourhood: where the appointment takes place
Scholarship: True or False
Hipertension: True or False
Diabetes: True or False
Alcoholism: True or False
Handcap: True or False
SMS_received: True or False
No-show: True or False

2. Basic Data Information

This section provides basic information about the data.

In the table above, the first number, the count, shows how many rows have non-missing values. In this instance, we have no missing values.

The second value is the mean, which is the average. Patients in df are on average 37 years old. Below that, std is the standard deviation, which measures how spread out the values are; in other words, it tells how close to the mean the data points are.

The column Age has a minimum of -1, which is erroneous data. Likewise, the maximum age is 115 years old, which seems very high given that Brazil's life expectancy for 2020 is 77 years (please see here). We will deal with these errors in the next section.

The column Handcap should be binary (True or False) but it has a max value of 4. This will need to be investigated.
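The checks described in this section can be sketched with pandas as follows. This is a minimal illustration on a toy dataframe (the values are invented stand-ins, not the Kaggle data); only the column names Age and Handcap come from the data dictionary.

```python
import pandas as pd

# Toy stand-in for the appointments dataframe (illustrative values only).
df = pd.DataFrame({
    "Age": [-1, 5, 37, 62, 115],
    "Handcap": [0, 0, 1, 4, 0],
})

# Summary statistics: count, mean, std, min, quartiles and max per column.
summary = df.describe()
print(summary)

# Number of missing values per column.
missing = df.isnull().sum()
print(missing)

# A binary column should only hold 0 and 1; list any other value found.
unexpected_handcap = df.loc[~df["Handcap"].isin([0, 1]), "Handcap"].tolist()
print(unexpected_handcap)
```

Listing the unexpected values first makes it easier to decide whether they are data-entry errors or an undocumented coding scheme before changing anything.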

3. Data Cleansing

In this section, we want to amend some columns in df, fixing data types, misspellings and erroneous data:

PatientId is currently a float; it will be converted into an integer
ScheduledDay and AppointmentDay are currently objects; they will be converted into datetime
AppointmentDay's time will be dropped (as it is set as 00:00:00)
Misspelled columns are going to be renamed
Erroneous data from the Age column will be deleted

Also, looking at the distribution of the Age feature, most of the patients are between 18 and 55 years old. The patients who are 115 years old are outliers; we will therefore drop these rows as well as the rows with an age of -1.
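The cleansing steps listed in this section could look roughly like the sketch below. It runs on a three-row toy dataframe with invented values, and the corrected column names (Hypertension, Handicap) are an assumption based on the headings used later in this README.

```python
import pandas as pd

# Toy rows mimicking the raw column types (values are illustrative only).
df = pd.DataFrame({
    "PatientId": [29872499824296.0, 558997776694438.0, 4262962299951.0],
    "ScheduledDay": ["2016-04-29T18:38:08Z", "2016-04-29T16:08:27Z",
                     "2016-04-27T08:36:51Z"],
    "AppointmentDay": ["2016-04-29T00:00:00Z"] * 3,
    "Age": [62, -1, 115],
    "Hipertension": [1, 0, 0],
    "Handcap": [0, 0, 1],
})

# Float identifier -> integer.
df["PatientId"] = df["PatientId"].astype("int64")

# Object (string) columns -> datetime.
df["ScheduledDay"] = pd.to_datetime(df["ScheduledDay"])
# The time part of AppointmentDay is always 00:00:00, so keep the date only.
df["AppointmentDay"] = pd.to_datetime(df["AppointmentDay"]).dt.date

# Rename misspelled columns (target names are an assumption).
df = df.rename(columns={"Hipertension": "Hypertension", "Handcap": "Handicap"})

# Drop the erroneous age (-1) and the 115-year-old outliers.
df = df[df["Age"].between(0, 114)]
print(df)
```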

4. Exploratory Data Analysis

Overview of No-Show

Missed appointments account for 20% of the total appointments in the dataset.
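A share like this can be computed directly from the target column. The sketch below uses five invented rows and assumes the No-show column is coded as the strings 'Yes' and 'No'; if it is stored as booleans, the comparison can be dropped.

```python
import pandas as pd

# Toy target column (illustrative values; 'Yes' means the appointment was missed).
df = pd.DataFrame({"No-show": ["No", "No", "Yes", "No", "No"]})

# The mean of a boolean series is the fraction of True values.
no_show_rate = (df["No-show"] == "Yes").mean()
print(f"No-show rate: {no_show_rate:.0%}")

# Same information as a full breakdown of both classes.
breakdown = df["No-show"].value_counts(normalize=True)
print(breakdown)
```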

Finding Duplicates

The dataset does not have duplicated appointments but has 48,228 patients that can be considered as returning/known patients.
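The README does not say how the returning-patient figure was counted, so the sketch below shows two plausible counts on a toy dataframe: the number of appointments whose PatientId has already appeared, and the number of distinct patients with more than one appointment. The two can differ substantially, so it is worth stating which one is reported.

```python
import pandas as pd

# Toy data: patients 1 and 3 come back, every appointment ID is unique.
df = pd.DataFrame({
    "PatientId": [1, 1, 2, 3, 3, 3, 4],
    "AppointmentID": [10, 11, 12, 13, 14, 15, 16],
})

# Appointments recorded more than once.
duplicated_appointments = int(df["AppointmentID"].duplicated().sum())

# Appointments made by a patient already seen earlier in the data.
repeat_appointments = int(df["PatientId"].duplicated().sum())

# Distinct patients with more than one appointment.
visits_per_patient = df["PatientId"].value_counts()
returning_patients = int((visits_per_patient > 1).sum())

print(duplicated_appointments, repeat_appointments, returning_patients)
```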

Age

The patients who seem most likely not to show up for their appointments are between 10 and 35 years old.

Gender

From the table above, we can clearly see that 'Female' patients have more appointments than 'Male' patients; they also have about double the number of missed appointments. However, the percentage of missed appointments is almost the same for both genders (about 20%). Therefore, gender does not seem to be an important feature.
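The difference between raw counts and rates can be illustrated with a cross-tabulation. The toy data below is invented so that one group has twice as many appointments and twice as many missed ones, yet the same no-show rate; the 'Yes'/'No' coding of No-show is an assumption.

```python
import pandas as pd

# Toy data: 10 appointments for F (2 missed), 5 for M (1 missed).
df = pd.DataFrame({
    "Gender": ["F"] * 10 + ["M"] * 5,
    "No-show": ["Yes"] * 2 + ["No"] * 8 + ["Yes"] * 1 + ["No"] * 4,
})

# Absolute number of attended / missed appointments per gender.
counts = pd.crosstab(df["Gender"], df["No-show"])
print(counts)

# Row-normalised version: share of attended / missed within each gender.
rates = pd.crosstab(df["Gender"], df["No-show"], normalize="index")
print(rates)
```

Normalising by row is what makes the two groups comparable despite their different sizes.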

Scheduled Day

Looking at the time of the booking

Bookings are made between 6am and 8pm. Three bookings were taken at 9pm; although these are shown as outliers on the plot above, they will not be dropped as they could be the result of emergencies.

Morning bookings seem less likely to be missed than afternoon and evening ones.

Looking at the day of booking

Most of the bookings are made at the beginning of the week; this may be because the medical centre does not seem to be open over the weekend. Although Saturdays have the smallest no-show rate, they also represent too small a proportion of the data to be significant. Overall, the day of the week does not seem to be an important feature.
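The booking hour and booking weekday used in this section can be derived from ScheduledDay with the pandas datetime accessor. The sketch below uses four invented bookings; the new column names (booking_hour, booking_weekday) are hypothetical, and No-show is assumed to be coded 'Yes'/'No'.

```python
import pandas as pd

# Toy bookings (illustrative timestamps only).
df = pd.DataFrame({
    "ScheduledDay": pd.to_datetime([
        "2016-04-29 08:15:00",
        "2016-04-29 08:40:00",
        "2016-04-29 15:05:00",
        "2016-04-27 15:30:00",
    ]),
    "No-show": ["No", "No", "Yes", "No"],
})

missed = df["No-show"] == "Yes"

# Hour of the day and weekday name at which the booking was made.
df["booking_hour"] = df["ScheduledDay"].dt.hour
df["booking_weekday"] = df["ScheduledDay"].dt.day_name()

# No-show rate for each booking hour and each booking weekday.
rate_by_hour = missed.groupby(df["booking_hour"]).mean()
rate_by_weekday = missed.groupby(df["booking_weekday"]).mean()

# Number of bookings per weekday, useful to spot poorly represented days.
bookings_per_weekday = df["booking_weekday"].value_counts()
print(rate_by_hour, rate_by_weekday, bookings_per_weekday, sep="\n\n")
```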

Appointment Date

The first graph follows the same pattern as the section 5.6 graph "Appointment Show/No Show by Booking Day of the Week". This may be because patients book and have their appointment on the same day. The percentage of show vs no-show is roughly the same across the week.

Waiting Time Between Booking and Medical Appointment

Most of the appointments are booked a month in advance. The graph above highlights erroneous data and outliers. The negative values will be placed in an 'unknown' waiting time category, while the outliers will be kept, as some medical appointments (like small surgeries) can be booked up to six months ahead.

The graph above suggests that the longer the waiting time between the booking and the appointment, the more likely the appointment is to be missed. This feature seems to be an important one, as it shows a clear distinction between show and no-show depending on how many days ahead of the appointment patients have booked.
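One way to build the waiting time feature described here is sketched below on three invented rows. The column names waiting_days and waiting_category, the bin edges and the bin labels are all assumptions chosen for illustration; only the 'unknown' category for negative values comes from the text.

```python
import pandas as pd

# Toy rows: same-day booking, 30 days ahead, and an erroneous negative gap.
df = pd.DataFrame({
    "ScheduledDay": pd.to_datetime([
        "2016-04-29 18:38:08", "2016-04-01 09:00:00", "2016-05-10 10:00:00",
    ]),
    "AppointmentDay": pd.to_datetime(["2016-04-29", "2016-05-01", "2016-05-09"]),
})

# Whole days between the booking date and the appointment date.
booking_date = df["ScheduledDay"].dt.normalize()
df["waiting_days"] = (df["AppointmentDay"] - booking_date).dt.days

# Group waiting times into bands (edges and labels are illustrative).
bins = [-1, 0, 7, 30, float("inf")]
labels = ["same day", "1-7 days", "8-30 days", "over 30 days"]
df["waiting_category"] = pd.cut(
    df["waiting_days"], bins=bins, labels=labels
).astype(object)

# Negative waiting times are erroneous: label them 'unknown' instead of dropping.
df.loc[df["waiting_days"] < 0, "waiting_category"] = "unknown"
print(df[["waiting_days", "waiting_category"]])
```

Normalising ScheduledDay to midnight before subtracting avoids same-day bookings appearing as a waiting time of -1 day because of the booking's time component.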

Neighbourhood

Most of the neighbourhoods have a no-show rate of about 20%; the significant drops and peaks come from the poor representation of a particular neighbourhood in the dataset rather than from a meaningful effect. Therefore, this feature does not seem to be important for the no-show prediction.

Scholarship

The graphs above show that 80% of the patients who do not have a scholarship attended their appointment, while 75% of the patients with a scholarship attended. This feature could be helpful in determining the no-show.

Hypertension

The patients suffering from hypertension tend to attend their appointments more often than those who do not have this condition. However, appointments with hypertension patients represent a small pool in our dataset, just under 20% of the total appointments. This feature could be helpful in determining the no-show.

Diabetes

The patients suffering from diabetes tend to attend their appointments more often than those who do not have this condition. This feature may not be helpful in determining the no-show.

Alcoholism

Patients who suffer from alcoholism represent only 3% of all the appointments.

Looking at the graph above, there does not seem to be a difference between the patients suffering from alcoholism and the rest of the dataset. This feature may not be helpful in determining the no-show.

Handicap

Patients suffering from a handicap represent 2% of the total appointments.

Looking at the graph above, there does not seem to be a difference between the patients suffering from a handicap and the rest of the dataset. This feature may not be helpful in determining the no-show.

SMS Received

The graphs above do not show the expected results: 38% of appointments for patients who received an SMS were missed, compared with 20% of appointments for patients who did not receive one. This feature seems to be important in determining appointment no-shows.
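The per-feature comparisons in this section (Scholarship, Hipertension, Diabetes, Alcoholism, Handcap, SMS_received) all follow the same pattern: how much of the data each value represents, and how often appointments are missed for that value. A small helper can capture this; no_show_summary is a hypothetical name, the toy data is invented, and No-show is assumed to be coded 'Yes'/'No'.

```python
import pandas as pd

def no_show_summary(df, feature, target="No-show"):
    """Per value of `feature`: share of all appointments and no-show rate."""
    missed = df[target] == "Yes"
    summary = pd.DataFrame({
        "share_of_appointments": df[feature].value_counts(normalize=True),
        "no_show_rate": missed.groupby(df[feature]).mean(),
    })
    return summary.sort_index()

# Toy data: 6 appointments without an SMS (1 missed), 4 with an SMS (2 missed).
df = pd.DataFrame({
    "SMS_received": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    "No-show": ["No", "No", "No", "No", "No", "Yes", "No", "No", "Yes", "Yes"],
})

summary = no_show_summary(df, "SMS_received")
print(summary)
```

Reporting the share next to the rate helps to judge whether a difference rests on enough appointments to be meaningful.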

5. Creating a Model for Appointments No Show

We use a gradient boosting classifier to predict which customers are going to miss their appointment. First, we created a for loop to test different values of n_estimators, with loss set to 'deviance', which refers to logistic regression for classification.

The best score occurs at n_estimators = 1200, therefore we are choosing it as our parameter.
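A loop of this kind could be sketched as below with scikit-learn. It runs on synthetic data from make_classification rather than the appointments, and the candidate values are kept small so it finishes quickly; both are assumptions for illustration. The loss argument is left at its default, which is the logistic loss for classification (recent scikit-learn releases name it 'log_loss' instead of 'deviance').

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the prepared features and target.
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Candidate numbers of boosting stages (illustrative values).
candidates = [10, 50, 100]
scores = {}
for n in candidates:
    model = GradientBoostingClassifier(n_estimators=n, random_state=0)
    model.fit(X_train, y_train)
    # score() returns the mean accuracy on the held-out data.
    scores[n] = model.score(X_test, y_test)

best_n = max(scores, key=scores.get)
print(scores, best_n)
```

Scoring on data held out from training is what makes the comparison between candidates meaningful; scoring on the training set would always favour the largest value.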

Final Model

Once we have identified the optimal n_estimators, we can fit the final model using it.

The model has 80% accuracy.
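Fitting and scoring a final model might look like the sketch below, again on synthetic data with a small n_estimators so that it runs quickly. The majority-class baseline is an addition for context and not part of the original analysis: when about 80% of appointments are attended, always predicting "show" already reaches roughly 80% accuracy, so the two figures are worth reading side by side.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly 80% of samples in the majority class.
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Final model with the chosen number of estimators (100 is illustrative).
final_model = GradientBoostingClassifier(n_estimators=100, random_state=0)
final_model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, final_model.predict(X_test))

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_accuracy = accuracy_score(y_test, baseline.predict(X_test))

print(f"model: {accuracy:.2f}, baseline: {baseline_accuracy:.2f}")
```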
