Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Develop a function to split text exceeding 200 characters #19

Open
parikp06 opened this issue Nov 2, 2023 · 16 comments
Open

Develop a function to split text exceeding 200 characters #19

parikp06 opened this issue Nov 2, 2023 · 16 comments
Assignees
Labels
enhancement New feature or request

Comments

@parikp06
Copy link
Collaborator

parikp06 commented Nov 2, 2023

Feature Idea

Purpose

Split text exceeding 200 characters into multiple variables.
SDTMIG v3.3 | CDISC

Functionality

Based on the current requirement for the SAS v5 Transport file format, it is not possible to store the long text strings exceeding 200 characters using only one variable. Therefore, the SDTMIG has defined conventions for storing long text string using multiple variables. Follow the steps mentioned below to create a function to split long text into multiple variables.

  1. Split the first 200 characters of text into split_var1 and each additional 200 characters of text into split_varN

  2. When splitting a text string into several variables, the text should be split between words to improve readability, so it does not break a word.

  3. Create a variable split_count to hold the number of new variables created by the split.

  4. Default length to split on would be 200. However, please allow this value can be passed via split_length parameter to any other length.

Relevant Input

A character variable to be split into multiple variables.

An integer parameter split_length which would be set to 200 by default.

Relevant Output

One or more variables after splitting: split_var1 – split_varN.

Variable to hold the number of splits: split_count

Reproducible Example/Pseudo Code

split_vars(var_to_split,split_length=200)

@parikp06 parikp06 added the enhancement New feature or request label Nov 2, 2023
@parikp06 parikp06 changed the title Feature Request: <Develop a function to split text exceeding 200 characters> Develop a function to split text exceeding 200 characters Nov 2, 2023
@kamilsi
Copy link
Collaborator

kamilsi commented Nov 3, 2023

Acknowledging the historical context of SAS V5 transport file restrictions, I'd like to point out that these limitations don't apply in R, offering us a chance to opt-out. However, my experience with submissions is limited, and I'm seeking clarity: are we considering this feature to meet current regulatory expectations, or is it more about maintaining legacy compatibility?

@parikp06
Copy link
Collaborator Author

parikp06 commented Nov 3, 2023

Agree that SAS V5 limitations do not apply to R dataframes. However, we still want to have this feature to be compliant with SDTMIG rules. I am also lacking submission experience with R packages, so looking for more feedback.

@pendingintent
Copy link

The submission SAS files still have the V5 limitations, which include no string can exceed 200 characters in length. Even if the xpt files are generated using R, the SASV5 limitations still apply unfortunately.

@pendingintent
Copy link

I had a look at this in SQL and if we consider a text field mapping to DSTERM for example where the field can be a maximum of 600 characters and we cannot split words, the formula for calculating the max number of required output variables is:
(# required target variables) = CEILING(length source string / 200) + 1

The first (up to) 200 characters maps to DSTERM, with the remainder of the source string mapping to the QNAM variables DSTERM1, DSTERM2, DSTERM3.

I have used the SQL code:
dsterm := upper(substr(input_string, 1, instr(substr(input_string, 1, 200), ' ', -1, 1 )));

This will select up to 200 characters without splitting words and insert into the DSTERM DS variable. Similar statements can then be used to select the other strings for each of the subsequent supplemental QVAL values.

@kamilsi
Copy link
Collaborator

kamilsi commented Nov 3, 2023

Hi @parikp06 and @pendingintent,

I appreciate the clarification on the compliance with SDTMIG rules and the current submission requirements. Given this context, I'm ready to take on the development of this feature. I'll await any additional feedback or a decision on moving forward with this approach.

@MaguluriVenkata
Copy link

  • it's not clear where SPLITVARN will be stored in the specs? what's the plan?

@parikp06
Copy link
Collaborator Author

parikp06 commented Nov 7, 2023

This function will be called within a dataframe with input variable to be split. Upon successful execution, it should generate SplitVarN variables along with SplitCount to the output dataframe.

@MaguluriVenkata
Copy link

So it does not push SPLITVARn into SUPP domains?

@parikp06
Copy link
Collaborator Author

parikp06 commented Nov 8, 2023

no, it'll just create required list of variables. SplitCount will help in determining how many SUPP variables to be populated.

@rammprasad
Copy link
Collaborator

Keep the 200 character limit as parameterized and keep the default value as 200.

@galachad
Copy link
Member

galachad commented Nov 22, 2023

Hi @madhan0923 and @GomathiVallinayagam, are you working on this issue, or can I take it over?

@GomathiVallinayagam
Copy link
Collaborator

Yes @galachad , we are working on the issue.

@edgar-manukyan
Copy link
Collaborator

Adam recently did a very nice refactor of a similar function in Roche roak package. I would encourage collaborating with him, at least to get his ideas of function design 🙏

@madhan0923
Copy link
Collaborator

madhan0923 commented Nov 24, 2023

sure @edgar-manukyan, we are about to complete the draft. with Adam's help we can release the 1st version for QC

@ramiromagno
Copy link
Collaborator

ramiromagno commented Feb 10, 2024

Hi Madhan (@madhan0923):

I just had a look at your code in PR #32. I have a few suggestions/remarks.

I think your implementation of split_var() can be significantly simplified if you use either R base strwrap() or stringr::str_wrap().

In addition, it might be a good idea to also implement a reverse function, e.g. str_unwrap() that combines again the strings, something akin to paste(x, collapse = " "). The reason for this suggestion is that by having this reverse function we can set an expectation on what other downstream software should be getting when re-assembling the text.

From reading SDTMIG v3.4. (page 55), I think three questions remain:

  • What qualifies as an inter-word char, i.e. where can a split happen?
  • What happens if the text is a 200 char string (no inter-word chars). Functions such as strwrap() won't split if there are no inter-word chars, which is not acceptable according to the IG. I understand this is an unlikely corner case but still...
  • Should the inter-word chars be preserved in the output? I guess that if spaces are the inter-word chars, then probably not, as you do not want to leave a trailing space at the end of each string... But if the inter-word is another char, e.g. ";", I see that in your test code for split_var() you are indeed preserving the ";":
    > split_var(x)
    [[1]]
    [1] "ASCORBIC ACID;BETACAROTENE;BIOTIN;CALCIUM;CHLORIDE;CHROMIUM;COPPER;FOLIC ACID;IODINE;LYCOPENE;MAGNESIUM;MANGANESE;MOLYBDENUM;NICKEL;NICOTINIC ACID;PANTOTHENIC ACID;PHOSPHORUS;POTASSIUM;"
    
    [[2]]
    [1] "PYRIDOXINE HYDROCHLORIDE;RIBOFLAVIN;SELENIUM;SILICON;THIAMINE;VANADIUM;VITAMIN B12 NOS;VITAMIN D NOS;VITAMIN E NOS;"
    If you do not preserve the splitting chars, then how should downstream software be able to reconstruct the text...?

It might help taking a look at some code I've written in the past that uses R matrices to represent protein alignments, where I developed a similar function to reflow the alignment to a different width: https://github.com/maialab/agvgd/blob/master/R/split_alignment_by_lines.R

@Longfei2
Copy link

Hello everyone,
I have developed an R function for my company and be used in work successfully.
Hope it helpful for oak

#' @param data A data frame or tibble.
#' @param target_var A character variable in input data set that need to be split, like \code{product_term}.
#' if the name of \code{target_var} is same as \code{outvar_name}, then the name of \code{target_var} is appended with a prefix, \code{(.)}.
#' @param outvar_name A string which is the prefix of similar variable names, like \code{"CMDECOD"}.
#' @param sep A character vector that is separator delimiting collapsed values.
#' 
#' @return A tibble including a series of variable calling \code{outvar_name}, which is invisible object.

dso_str_split <- function(data, target_var, outvar_name, sep = ";") {
  # Remove a series of variable named outvar_name with numeric suffix
  .to_keep <- names(data)[!grepl(paste0(outvar_name, "([0-9]+)$"), names(data))]
  .new_data <- data %>%
    dplyr::mutate(dso_unique_id = if (nrow(.) == 0L) {
      numeric(0L)
    } else {
      dplyr::row_number()
    }) %>%
    dplyr::select(dplyr::all_of(.to_keep), dso_unique_id)
  
  .need_split_var <- rlang::enquo(target_var)
  if (is.na(sep)) sep <- "\\;"
  
  # dso_check_strs(outvar_name, sep)
  
  # separate string
  .split_data <- .new_data %>%
    dplyr::mutate(dso_split_var = !!.need_split_var) %>%
    tidyr::separate_rows(dso_split_var, sep = sep) %>%
    dplyr::group_by(dso_unique_id, !!!rlang::syms(.to_keep)) %>%
    dplyr::mutate(
      dso_var1 = if (is.null(dso_cumsum(nchar(dso_split_var), x1 > 200L))) {
        numeric(0L)
      } else {
        dso_cumsum(nchar(dso_split_var), x1 > 200L)
      },
      dso_var2 = ifelse(nchar(dso_split_var) == dso_var1, TRUE, FALSE),
      dso_subid = dplyr::coalesce(cumsum(dso_var2), 1L),
      dso_text = dplyr::coalesce(dso_split_var, "")
    ) %>%
    dplyr::ungroup() %>%
    dplyr::group_by(dso_unique_id, !!!rlang::syms(.to_keep), dso_subid) %>%
    dplyr::summarise(dso_text = paste(dso_text, collapse = sep)) %>%
    dplyr::ungroup() %>%
    dplyr::mutate_at("dso_text", ~ dplyr::if_else(stringr::str_squish(.) == "", NA_character_, .)) %>%
    # Avoid intermediate separator causing redundant symbols at the end
    dplyr::mutate(dso_text = gsub(paste0("(", sep, "+)$"), "", dso_text)) %>%
    tidyr::pivot_wider(names_from = dso_subid, values_from = dso_text, names_prefix = "dso_text") %>%
    dplyr::select(-dso_unique_id)
  
  # if the name of target_var and .out_var is same, then rename target_var to paste(., target_var)
  
  if (identical(rlang::as_label(.need_split_var), outvar_name)) {
    names(.split_data)[names(.split_data) == outvar_name] <- paste0(".", outvar_name)
  }
  
  # modify the name of series variable calling outvar_name
  .text_name <- names(dplyr::select(.data = .split_data, dplyr::starts_with("dso_text")))
  
  if (length(.text_name) == 1L) {
    names(.split_data)[names(.split_data) %in% .text_name] <- outvar_name
  } else {
    names(.split_data)[names(.split_data) %in% .text_name] <- c(outvar_name, paste0(outvar_name, 1L:(length(.text_name) - 1)))
  }
  
  invisible(.split_data)
}

dso_cumsum <- function(num, cond) {
  .cond <- rlang::enexpr(cond)
  Reduce(function(x1, x2) {
    x1 <- sum(x1, (x2 + 1L), na.rm = TRUE)
    if (eval(.cond)) {
      x1 <- 0L
      x1 <- sum(x1, x2, na.rm = TRUE)
    }
    return(x1)
  }, num, accumulate = TRUE)
}

library(dplyr)

test_df <- data.frame(
  id = 1:2,
  text = c(
    paste(strrep("a", 150), ";", strrep("a", 50), strrep("a", 130), sep = ";"),
    paste(strrep("a", 150), strrep("a", 30), strrep("a", 150), strrep("a", 80), sep = ";")
  )
)


new_test_df <- test_df %>% dso_str_split(text, "text", "\\;")

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
Status: Sprint Backlog
Development

No branches or pull requests