In any EDA or visualization work, you have likely encountered the following scenario:
All these visuals do the same thing at their essence: they represent a given measure with the intensity of a color.
Most tools take care of mapping values to colors for you. However, have you ever wondered what the best way is to create breaks for a continuous variable? In other words, if the default setting is not ideal and nothing interesting shows up in the visual, could it be due to a poor choice of breaks, and how can you fix that?
In this Git repo, I want to discuss several common ways to create color breaks for a given continuous measure.
- mean - sd: you can create breaks based on the mean and 1, 2, or more standard deviations away from the mean. While it works well for a normally distributed measure, it is not ideal for a measure with extreme values or clustered values.
- jenks: Jenks (natural breaks) is another popular way to create color breaks, based on a clustering algorithm. However, you have to choose k (the number of groups), and that choice can change how the visual looks.
- kmeans: similar to Jenks, we can also use k-means to create breaks (groups) for coloring a continuous variable.
- quantile: quantile breaks are a great choice if you want each group to contain approximately the same number of points.
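
To make the methods above concrete, here is a minimal base-R sketch of three of them (mean - sd, quantile, and k-means). It is not the repo's `color_analysis` function: the data `x`, the group count `k`, and all variable names are made up for illustration. Jenks is left out because it needs an add-on package; one option is `classIntervals()` with `style = "jenks"` from the classInt package, assuming you have it installed.

```r
# Hypothetical data: a right-skewed measure, for illustration only.
set.seed(42)
x <- rexp(200, rate = 0.1)
k <- 5  # number of color groups (an assumed choice)

# mean - sd: breaks at the mean and +/- 1 and 2 standard deviations,
# clipped to the data range so every value falls into some interval.
m <- mean(x)
s <- sd(x)
inner <- m + c(-2, -1, 0, 1, 2) * s
inner <- inner[inner > min(x) & inner < max(x)]
sd_breaks <- c(min(x), inner, max(x))

# quantile: k groups with roughly equal counts.
# unique() guards against duplicated breaks when the data has many ties.
q_breaks <- unique(quantile(x, probs = seq(0, 1, length.out = k + 1),
                            names = FALSE))

# kmeans: cluster the 1-D values, then put approximate breaks midway
# between adjacent sorted cluster centers.
km <- kmeans(x, centers = k, nstart = 10)
centers <- sort(as.numeric(km$centers))
km_breaks <- c(min(x), (head(centers, -1) + tail(centers, -1)) / 2, max(x))

# Any set of breaks can be turned into color groups with cut().
groups <- cut(x, breaks = q_breaks, include.lowest = TRUE)
pal <- colorRampPalette(c("lightyellow", "darkred"))(nlevels(groups))
point_colors <- pal[as.integer(groups)]
table(groups)
```

Swapping `q_breaks` for `sd_breaks` or `km_breaks` in the `cut()` call lets you compare how each method groups the same data. On skewed data like this, the mean - sd breaks tend to leave the lower intervals crowded, which is the weakness noted in the list above.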
In summary, no method is perfect in all cases. You should try different methods with different numbers of groups and breaks to find the best choice for the given problem. With that motivation, I include an R script that generates the different approaches in one shot using the defined function `color_analysis`. The output looks like the example below.