
Process Map Visualization of event analysis in R


Process Map


pmap generates a process map from an event log, customized to the user's preferences.

Installation

An older version of pmap is available on CRAN. If you prefer that version, install it with:

install.packages("pmap")

However, CRAN policy discourages submitting a package more than once a month, so the GitHub repo is the primary release channel and the package is submitted to CRAN only when possible. As a result, the version on CRAN can be somewhat outdated.

To install the latest version, install pmap directly from GitHub:

devtools::install_github("twang2218/pmap")

Each release is tagged in git, so you can also install a specific version by passing its tag as ref:

devtools::install_github("twang2218/pmap", ref = "v0.6.0")
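Since the CRAN and GitHub versions can differ, it can help to check which version actually ended up installed. The following is a minimal sketch using only base R; installed_pmap_version() is a hypothetical helper name introduced here, not part of pmap.

```r
# Hypothetical helper (not part of pmap): report the installed pmap version,
# or NA if the package is not installed.
installed_pmap_version <- function() {
  if (requireNamespace("pmap", quietly = TRUE)) {
    utils::packageVersion("pmap")
  } else {
    NA
  }
}

print(installed_pmap_version())
```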

Usage

This is a demonstration of how to use pmap to create a process map from an event log. The sepsis dataset from the eventdataR package is used throughout.

Data preparation

As in any data analysis task, the first and most important step is preparing the data.

Before the actual preparation steps, we should agree on the terminology used later. There are four main terms: Case, Activity, Category and Event. The relationships between them are shown in the following graph.

relationship between Case, Activity, Category and Event

An eventlog is a collection of Events. Each row in the eventlog represents one Event, and each Event has several attributes, including:

  • when - timestamp;
  • who - case_id;
  • what - activity and category;

Therefore pmap requires three mandatory fields and one optional field in the given eventlog data frame:

  • timestamp: The time at which each event occurred. The data type should be POSIXct. If the column is character, the package will attempt to convert it to POSIXct, but this is only a convenience; it is better to make sure the column already has the correct type.
  • case_id: The Case ID in the process paths. It is used to calculate activity frequency or process performance.
  • activity: Activity name.
  • category (optional since v0.4.0): Used to color groups of activities differently for better visualization. For example, marketing activities with different purposes can each be shown in their own color. If category is missing, the activity name is used as the category for coloring by default.

category was previously called event_type and was required before v0.3.2. It has been optional since v0.4.0.
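As an illustration of the required shape, here is a minimal sketch that builds a toy event log in base R and converts character timestamps to POSIXct explicitly. The rows, activity names and the UTC time zone are made up for this example and are not taken from the sepsis dataset.

```r
# Hypothetical toy event log; all values are made up for illustration.
eventlog <- data.frame(
  timestamp = c("2017-07-01 09:00:00", "2017-07-01 09:30:00",
                "2017-07-01 10:00:00", "2017-07-02 14:00:00",
                "2017-07-02 14:45:00"),
  case_id   = c("c1", "c1", "c1", "c2", "c2"),
  activity  = c("Visit", "Add to cart", "Checkout", "Visit", "Checkout"),
  stringsAsFactors = FALSE
)

# Convert the character timestamps explicitly instead of relying on
# automatic conversion. The time zone is an assumption for this example.
eventlog$timestamp <- as.POSIXct(eventlog$timestamp,
                                 format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Sanity checks on the three mandatory fields
stopifnot(
  all(c("timestamp", "case_id", "activity") %in% names(eventlog)),
  inherits(eventlog$timestamp, "POSIXct"),
  !anyNA(eventlog)
)
```

Checking the column type up front makes a badly parsed timestamp fail early rather than surface later as a confusing map.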

Now, let's do the data preparation.

library(eventdataR)
library(dplyr)
library(pmap)

# Prepare the event log data frame
eventlog <- eventdataR::sepsis %>%
    select(timestamp, case_id, activity) %>%
    na.omit()

Check eventlog data frame structure.

> head(eventlog)
# A tibble: 6 x 3
  timestamp           case_id activity
  <dttm>              <chr>   <chr>
1 2014-10-22 11:15:41 A       ER Registration
2 2014-10-22 11:27:00 A       Leucocytes
3 2014-10-22 11:27:00 A       CRP
4 2014-10-22 11:27:00 A       LacticAcid
5 2014-10-22 11:33:37 A       ER Triage
6 2014-10-22 11:34:00 A       ER Sepsis Triage
> str(eventlog)
eventlog [15,190 × 3] (S3: eventlog/tbl_df/tbl/data.frame)
 $ timestamp: POSIXct[1:15190], format: "2014-10-22 11:15:41" ...
 $ case_id  : chr [1:15190] "A" "A" "A" "A" ...
 $ activity : Factor w/ 16 levels "Admission IC",..: 4 10 3 9 6 5 8 7 2 3 ...
 - attr(*, "case_id")= chr "case_id"
 - attr(*, "activity_id")= chr "activity"
 - attr(*, "activity_instance_id")= chr "activity_instance_id"
 - attr(*, "lifecycle_id")= chr "lifecycle"
 - attr(*, "resource_id")= chr "resource"
 - attr(*, "timestamp")= chr "timestamp"
 - attr(*, "na.action")= 'omit' Named int [1:24] 442 443 444 445 446 447 448 449 450 451 ...
  ..- attr(*, "names")= chr [1:24] "442" "443" "444" "445" ...

Create a process map

You can create a process map directly from the eventlog with a single call to create_pmap() and then render it with render_pmap():

# Create process map
p <- create_pmap(eventlog)
# Render the process map
render_pmap(p)

The result is shown in the Viewer pane if you're using RStudio, or in a new browser window if you're running the code from a terminal.

process map without prune

Prune the process map

As you can see, the result above is a bit messy. We can prune the low-volume edges to simplify the process map, which makes the common paths in the process easier to find.

p %>% prune_edges(0.5) %>% render_pmap()

process map with pruned edges

That's better, but we can improve it further by also pruning the less important nodes.

p %>% prune_nodes(0.5) %>% prune_edges(0.5) %>% render_pmap()

Or, if you want a more interactive approach, you can start a Shiny app with sliders for pruning the nodes and/or edges by a given percentage. Be careful: the more edges and nodes there are, the slower it will be. Let's keep 50% of the nodes and 50% of the edges in our example:

render_pmap_shiny(p, nodes_prune_percentage = 0.5, edges_prune_percentage = 0.5)

cleaner process map
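To make the idea of pruning by volume concrete, here is a minimal base R sketch that counts transitions between consecutive activities in a toy log and drops the lowest-volume half of the edges. It is only an illustration: the toy data, the prune_lowest() helper and the rule of dropping floor(n * fraction) edges are assumptions made here, and pmap's own pruning logic may rank or round differently.

```r
# Toy log, assumed to be already ordered by time within each case.
events <- data.frame(
  case_id  = c("A", "A", "A", "B", "B", "B", "C", "C"),
  activity = c("Triage", "CRP", "Release",
               "Triage", "CRP", "Release",
               "Triage", "Release"),
  stringsAsFactors = FALSE
)

# Pair each event with the next one and keep pairs within the same case.
n <- nrow(events)
same_case <- events$case_id[-1] == events$case_id[-n]
edges <- data.frame(
  from   = events$activity[-n][same_case],
  to     = events$activity[-1][same_case],
  amount = 1L,
  stringsAsFactors = FALSE
)

# Edge volume: number of transitions per (from, to) pair.
edge_counts <- aggregate(amount ~ from + to, data = edges, FUN = sum)

# Hypothetical helper: drop the given fraction of edges, lowest volume first.
prune_lowest <- function(edge_counts, fraction) {
  n_drop <- floor(nrow(edge_counts) * fraction)
  ranked <- edge_counts[order(edge_counts$amount), , drop = FALSE]
  if (n_drop == 0) return(ranked)
  ranked[-seq_len(n_drop), , drop = FALSE]
}

kept <- prune_lowest(edge_counts, 0.5)
print(kept)
```

In this toy log the single Triage to Release transition is the one removed, leaving the two edges that two cases each passed through.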

Expand the loop

The process map above already yields a valuable insight: the loop between CRP and Leucocytes. Is it there because a small group of cases went through these two steps many times, or because most cases went through the loop just a few times? To answer the question, we can expand the loop by making repeated activities distinct.

p <- create_pmap(eventlog, distinct_repeated_activities = TRUE)

This way, each activity name gets the occurrence number of that activity within the path appended, so an activity that occurs several times in a path gets a different name each time and therefore a different node in the final map. The newly generated process map will be much more complex than before, so we need to prune it further.

p %>% prune_nodes(0.5) %>% prune_edges(0.8) %>% render_pmap()

process map with distinct repeated activities

It's interesting to see that there is little connection between the first-time CRP (1) and Leucocytes (1); the back and forth happens after Admission NC.
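The renaming idea can be sketched in base R as follows. This is not pmap's implementation, only an illustration of numbering each activity by its occurrence within a case; the toy rows and the "name (n)" label format (modelled on the node names mentioned above) are assumptions.

```r
# Toy log, assumed to be already ordered by time within each case.
events <- data.frame(
  case_id  = c("A", "A", "A", "A", "B", "B"),
  activity = c("CRP", "Leucocytes", "CRP", "Leucocytes", "CRP", "CRP"),
  stringsAsFactors = FALSE
)

# For every (case, activity) group, number the occurrences 1, 2, 3, ...
occurrence <- ave(seq_along(events$activity),
                  events$case_id, events$activity,
                  FUN = seq_along)

# Append the occurrence number so repeated activities become distinct names.
events$activity_distinct <- paste0(events$activity, " (", occurrence, ")")

print(events)
```

After this step the second CRP in case A is a different node from the first, which is what turns a loop into a longer, unrolled path.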

Time is the key

Is this back-and-forth loop caused by some kind of regular check after Admission NC? We can't tell from the previous process map, because we don't know how much time elapsed between the activities. By default, the edge label is the number of cases that moved between the two connected activities. We can change it to the duration between those activities to understand the timing of the process.

As multiple cases go through each path, we need to decide how to summarize the duration, for example as:

  • the maximum duration
  • the minimum duration
  • the mean duration
  • the median duration

We can specify the kind of duration by passing an edge_label argument to create_pmap().

p <- create_pmap(eventlog, edge_label = "mean_duration")
p %>% prune_nodes(0.5) %>% prune_edges(0.8) %>% render_pmap()

Alternatively, the label can be changed after the process map has been created by using the adjust_edge_label() function.

p <- adjust_edge_label(p, label = "mean_duration")
render_pmap(p)

process map with distinct repeated activities with mean duration

With the durations shown on the edges, the question is settled immediately: it's not a loop. Leucocytes almost always occurs immediately after CRP, but not the other way round. This suggests that CRP and Leucocytes occur together in the same order, possibly as part of one blood test panel, and that the patients are tested regularly after Admission NC.

We can also see that patients are normally released 2 days after the CRP and Leucocytes tests. Possibly the test results came back fine and the patients were released after 2 days of observation without further issues.

All of this can be read directly from the process map.
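To make the four duration summaries concrete, here is a minimal base R sketch that computes the time between consecutive events within each case and summarizes it for one edge. The toy timestamps are made up, and this is an illustration of the idea rather than pmap's internal calculation.

```r
# Toy log: three cases, each going from CRP to Release.
events <- data.frame(
  case_id   = c("A", "A", "B", "B", "C", "C"),
  activity  = rep(c("CRP", "Release"), 3),
  timestamp = as.POSIXct(c("2014-10-01 08:00:00", "2014-10-03 08:00:00",
                           "2014-10-05 09:00:00", "2014-10-06 09:00:00",
                           "2014-10-10 10:00:00", "2014-10-13 10:00:00"),
                         tz = "UTC"),
  stringsAsFactors = FALSE
)
events <- events[order(events$case_id, events$timestamp), ]

# Pair each event with the next one and keep pairs within the same case.
n <- nrow(events)
same_case <- events$case_id[-1] == events$case_id[-n]
gap_days <- as.numeric(difftime(events$timestamp[-1],
                                events$timestamp[-n],
                                units = "days"))
transitions <- data.frame(
  from = events$activity[-n][same_case],
  to   = events$activity[-1][same_case],
  days = gap_days[same_case],
  stringsAsFactors = FALSE
)

# Summarize the CRP -> Release edge in the four ways listed above.
durations <- transitions$days[transitions$from == "CRP" &
                              transitions$to == "Release"]
summary_by_kind <- c(max    = max(durations),
                     min    = min(durations),
                     mean   = mean(durations),
                     median = median(durations))
print(summary_by_kind)
```

Here the three cases take 2, 1 and 3 days, so the edge would be labelled 3, 1, 2 or 2 days depending on the summary chosen. On real data the mean and median can differ considerably when a few cases take much longer than the rest.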

Persist the result

If you're happy with the result, you can save the process map to a PDF or another file format by replacing render_pmap() with render_pmap_file().

p <- create_pmap(eventlog, edge_label = "mean_duration")
p %>% prune_nodes(0.5) %>% prune_edges(0.8) %>% render_pmap_file("sepsis_process_map.pdf")