Develop a function to split text exceeding 200 characters #19

parikp06 · 2023-11-02T16:47:32Z

Feature Idea

Purpose

Split text exceeding 200 characters into multiple variables.
SDTMIG v3.3 | CDISC

Functionality

The SAS v5 Transport file format cannot store a text string longer than 200 characters in a single variable. The SDTMIG therefore defines conventions for storing long text strings across multiple variables. The steps below describe a function that splits long text into multiple variables.

Split the first 200 characters of text into split_var1 and each additional 200 characters of text into split_varN.
When splitting a text string into several variables, split between words so that no word is broken, which improves readability.
Create a variable split_count to hold the number of new variables created by the split.
The default split length is 200, but a different length can be passed via the split_length parameter.

Relevant Input

A character variable to be split into multiple variables.

An integer parameter split_length which would be set to 200 by default.

Relevant Output

One or more variables after splitting: split_var1 – split_varN.

Variable to hold the number of splits: split_count

Reproducible Example/Pseudo Code

split_vars(var_to_split,split_length=200)
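
The pseudo code above could be fleshed out roughly as follows. This is a minimal sketch in base R, not the package implementation: the helper `split_text()` is a hypothetical name, spaces are assumed to be the only inter-word characters, and words longer than `split_length` are assumed to be hard-cut.

```r
# Hypothetical helper: greedily pack whole words into chunks of at most
# `split_length` characters. Assumes a single space separates words.
split_text <- function(x, split_length = 200L) {
  if (is.na(x) || !nzchar(x)) return(character(0))
  words <- strsplit(x, " +")[[1]]
  words <- words[nzchar(words)]
  chunks <- character(0)
  current <- ""
  for (w in words) {
    # Assumption: a word longer than split_length is cut mid-word.
    while (nchar(w) > split_length) {
      if (nzchar(current)) {
        chunks <- c(chunks, current)
        current <- ""
      }
      chunks <- c(chunks, substr(w, 1L, split_length))
      w <- substr(w, split_length + 1L, nchar(w))
    }
    candidate <- if (nzchar(current)) paste(current, w) else w
    if (nchar(candidate) <= split_length) {
      current <- candidate
    } else {
      chunks <- c(chunks, current)
      current <- w
    }
  }
  if (nzchar(current)) chunks <- c(chunks, current)
  chunks
}

# Sketch of split_vars(): returns split_var1..split_varN plus split_count.
split_vars <- function(var_to_split, split_length = 200L) {
  pieces <- lapply(var_to_split, split_text, split_length = split_length)
  split_count <- lengths(pieces)
  n_vars <- max(split_count, 1L)
  out <- lapply(seq_len(n_vars), function(i) {
    vapply(
      pieces,
      function(p) if (length(p) >= i) p[i] else NA_character_,
      character(1)
    )
  })
  names(out) <- paste0("split_var", seq_len(n_vars))
  out <- as.data.frame(out, stringsAsFactors = FALSE)
  out$split_count <- split_count
  out
}

# A small split_length keeps the example readable.
res <- split_vars(c("one two three four", "short", NA), split_length = 9)
hard_cut <- split_text(strrep("x", 25), split_length = 10)
```

Rows that need fewer pieces than the longest row get `NA` in the unused `split_varN` columns, and `split_count` records how many pieces each row actually used.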

kamilsi · 2023-11-03T13:17:23Z

Acknowledging the historical context of SAS V5 transport file restrictions, I'd like to point out that these limitations don't apply in R, which gives us a chance to opt out. However, my experience with submissions is limited, and I'm seeking clarity: are we considering this feature to meet current regulatory expectations, or is it more about maintaining legacy compatibility?

parikp06 · 2023-11-03T13:41:21Z

Agreed that SAS V5 limitations do not apply to R data frames. However, we still want this feature in order to be compliant with SDTMIG rules. I also lack submission experience with R packages, so I am looking for more feedback.

pendingintent · 2023-11-03T14:05:46Z

The submission SAS files still have the V5 limitations, including the rule that no string can exceed 200 characters in length. Even if the xpt files are generated using R, the SAS V5 limitations unfortunately still apply.

pendingintent · 2023-11-03T14:15:01Z

I had a look at this in SQL. Consider, for example, a text field mapping to DSTERM, where the field can hold a maximum of 600 characters and words cannot be split. The formula for calculating the maximum number of required output variables is:
(# required target variables) = CEILING(length of source string / 200) + 1

The first (up to) 200 characters map to DSTERM, with the remainder of the source string mapping to the QNAM variables DSTERM1, DSTERM2, DSTERM3.
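
As a worked example of the formula, the sketch below evaluates it for the 600-character case and builds the matching variable names. The function names `max_target_vars()` and `target_names()` are hypothetical. The formula itself is taken as given from the comment above and has not been independently verified as a strict upper bound.

```r
# Hypothetical helper: the formula from the comment, as R arithmetic.
max_target_vars <- function(source_length, split_length = 200L) {
  ceiling(source_length / split_length) + 1
}

# Hypothetical helper: parent variable first, then numbered QNAM names.
target_names <- function(base, n) {
  paste0(base, c("", seq_len(n - 1L)))
}

n_vars <- max_target_vars(600)          # ceiling(600 / 200) + 1 = 4
var_names <- target_names("DSTERM", n_vars)
```

For a 600-character source this gives four target variables: DSTERM plus DSTERM1 to DSTERM3, matching the names listed above.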

I have used the SQL code:
```sql
dsterm := upper(substr(input_string, 1, instr(substr(input_string, 1, 200), ' ', -1, 1)));
```

This selects up to 200 characters without splitting words and inserts them into the DS variable DSTERM. Similar statements can then be used to select the remaining substrings for each of the subsequent supplemental QVAL values.
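
For comparison, a rough R analogue of the SQL statement might look like the sketch below. The function name `first_chunk()` is hypothetical, and it deliberately differs from the SQL in three assumed ways: a string that already fits is returned whole, a string with no space in the first `split_length` characters is hard-cut, and the trailing space is trimmed.

```r
# Hypothetical R analogue of the SQL above: upper-case the leading part of
# the string, cut at the last space within the first `split_length` chars.
first_chunk <- function(input_string, split_length = 200L) {
  # Assumption: a string that already fits is returned whole.
  if (nchar(input_string) <= split_length) return(toupper(input_string))
  head_part <- substr(input_string, 1L, split_length)
  spaces <- gregexpr(" ", head_part, fixed = TRUE)[[1]]
  # gregexpr() returns -1 when there is no match; fall back to a hard cut.
  cut_at <- if (spaces[1] == -1L) split_length else max(spaces)
  toupper(trimws(substr(input_string, 1L, cut_at), which = "right"))
}

a <- first_chunk("abc def ghi", split_length = 9)   # cut at the last space
b <- first_chunk("short", split_length = 9)         # already fits
d <- first_chunk("abcdefghijkl", split_length = 5)  # no space: hard cut
```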

kamilsi · 2023-11-03T14:16:16Z

Hi @parikp06 and @pendingintent,

I appreciate the clarification on the compliance with SDTMIG rules and the current submission requirements. Given this context, I'm ready to take on the development of this feature. I'll await any additional feedback or a decision on moving forward with this approach.

MaguluriVenkata · 2023-11-07T18:56:46Z

It's not clear where SPLITVARN will be stored in the specs. What's the plan?

parikp06 · 2023-11-07T19:25:41Z

This function will be called on a data frame that contains the input variable to be split. Upon successful execution, it should add the SplitVarN variables, along with SplitCount, to the output data frame.

MaguluriVenkata · 2023-11-08T16:45:12Z

So it does not push SPLITVARn into SUPP domains?

parikp06 · 2023-11-08T17:02:15Z

No, it'll just create the required list of variables. SplitCount will help determine how many SUPP variables need to be populated.

rammprasad · 2023-11-08T17:46:29Z

Keep the 200-character limit parameterized, with 200 as the default value.

galachad · 2023-11-22T17:54:01Z

Hi @madhan0923 and @GomathiVallinayagam, are you working on this issue, or can I take it over?

GomathiVallinayagam · 2023-11-22T17:56:25Z

Yes @galachad, we are working on the issue.

edgar-manukyan · 2023-11-22T19:43:00Z

Adam recently did a very nice refactor of a similar function in Roche roak package. I would encourage collaborating with him, at least to get his ideas of function design 🙏

madhan0923 · 2023-11-24T12:14:53Z

Sure @edgar-manukyan, we are about to complete the draft. With Adam's help we can release the first version for QC.

ramiromagno · 2024-02-10T05:59:12Z

Hi Madhan (@madhan0923):

I just had a look at your code in PR #32. I have a few suggestions/remarks.

I think your implementation of split_var() can be significantly simplified if you use either base R strwrap() or stringr::str_wrap().
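
To illustrate the suggestion, here is a minimal sketch built on base R `strwrap()`. The wrapper name `wrap_split()` is hypothetical. The `+ 1` rests on my understanding that `strwrap()` keeps each line shorter than `width`, which is worth checking against the R documentation before relying on it.

```r
# Hypothetical wrapper around base R strwrap().
# Assumption: strwrap() returns lines strictly shorter than `width`,
# so width = split_length + 1 allows chunks of up to split_length chars.
wrap_split <- function(x, split_length = 200L) {
  strwrap(x, width = split_length + 1L)
}

x <- "aaaaa bbbbb ccccc ddddd"
chunks <- wrap_split(x, split_length = 13)
```

Here the text is split between words into "aaaaa bbbbb" and "ccccc ddddd". Note that `strwrap()` discards the whitespace it splits on.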

In addition, it might be a good idea to also implement a reverse function, e.g. str_unwrap(), that recombines the strings, something akin to paste(x, collapse = " "). The reason for this suggestion is that having this reverse function sets an expectation of what downstream software should get when re-assembling the text.
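
A reverse function along those lines could be sketched as below. `str_unwrap()` is the name proposed in the comment, not an existing function, and the round trip shown is only expected to hold when the original text uses single spaces between words and has no leading or trailing whitespace.

```r
# Sketch of the proposed reverse function: rejoin chunks with one separator.
str_unwrap <- function(chunks, sep = " ") {
  paste(chunks, collapse = sep)
}

original <- "aaaaa bbbbb ccccc ddddd"
chunks <- strwrap(original, width = 14)  # split between words
rebuilt <- str_unwrap(chunks)            # rejoin with single spaces
round_trip_ok <- identical(rebuilt, original)
```

Because `strwrap()` normalises whitespace, text containing repeated spaces, tabs or newlines would not be rebuilt exactly, which is one reason to agree on the expected round-trip behaviour up front.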

From reading SDTMIG v3.4 (page 55), I think three questions remain:

What qualifies as an inter-word char, i.e. where can a split happen?
What happens if the text is a 200 char string (no inter-word chars)? Functions such as strwrap() won't split if there are no inter-word chars, which is not acceptable according to the IG. I understand this is an unlikely corner case, but still...
Should the inter-word chars be preserved in the output? I guess that if spaces are the inter-word chars, then probably not, as you do not want to leave a trailing space at the end of each string... But if the inter-word is another char, e.g. ";", I see that in your test code for split_var() you are indeed preserving the ";":
```
> split_var(x)
[[1]]
[1] "ASCORBIC ACID;BETACAROTENE;BIOTIN;CALCIUM;CHLORIDE;CHROMIUM;COPPER;FOLIC ACID;IODINE;LYCOPENE;MAGNESIUM;MANGANESE;MOLYBDENUM;NICKEL;NICOTINIC ACID;PANTOTHENIC ACID;PHOSPHORUS;POTASSIUM;"

[[2]]
[1] "PYRIDOXINE HYDROCHLORIDE;RIBOFLAVIN;SELENIUM;SILICON;THIAMINE;VANADIUM;VITAMIN B12 NOS;VITAMIN D NOS;VITAMIN E NOS;"
```
If you do not preserve the splitting chars, then how should downstream software be able to reconstruct the text...?
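
One possible way to address the last two questions is sketched below. It is an assumption, not the behaviour of split_var() in the PR: the hypothetical `split_keep_sep()` keeps each ";" attached to the token before it, so the pieces can be rejoined with `paste0(..., collapse = "")`, and it hard-cuts any token longer than `split_length` when there is no break point.

```r
# Hypothetical sketch: split on ";" while keeping the ";" with the
# preceding token, so the original text can be rebuilt exactly.
split_keep_sep <- function(x, split_length = 200L) {
  tokens <- regmatches(x, gregexpr("[^;]*;|[^;]+$", x, perl = TRUE))[[1]]
  chunks <- character(0)
  current <- ""
  for (tok in tokens) {
    # Assumption: with no usable break point, cut at split_length.
    while (nchar(tok) > split_length) {
      if (nzchar(current)) {
        chunks <- c(chunks, current)
        current <- ""
      }
      chunks <- c(chunks, substr(tok, 1L, split_length))
      tok <- substr(tok, split_length + 1L, nchar(tok))
    }
    if (nchar(current) + nchar(tok) <= split_length) {
      current <- paste0(current, tok)
    } else {
      chunks <- c(chunks, current)
      current <- tok
    }
  }
  if (nzchar(current)) chunks <- c(chunks, current)
  chunks
}

x <- "ASCORBIC ACID;BIOTIN;CALCIUM;ZINC"
chunks <- split_keep_sep(x, split_length = 20)
rebuilt <- paste0(chunks, collapse = "")
no_breaks <- split_keep_sep(strrep("X", 45), split_length = 20)
```

Keeping the separator makes reconstruction a plain concatenation, at the cost of a trailing ";" on some pieces, as in the split_var() output shown above.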

It might help to take a look at some code I've written in the past that uses R matrices to represent protein alignments, where I developed a similar function to reflow the alignment to a different width: https://github.com/maialab/agvgd/blob/master/R/split_alignment_by_lines.R

Longfei2 · 2024-03-20T02:18:28Z

Hello everyone,
I have developed an R function for my company, and it has been used successfully in our work.
I hope it is helpful for oak.

```r
#' @param data A data frame or tibble.
#' @param target_var A character variable in input data set that need to be split, like \code{product_term}.
#' if the name of \code{target_var} is same as \code{outvar_name}, then the name of \code{target_var} is appended with a prefix, \code{(.)}.
#' @param outvar_name A string which is the prefix of similar variable names, like \code{"CMDECOD"}.
#' @param sep A character vector that is separator delimiting collapsed values.
#' 
#' @return A tibble including a series of variable calling \code{outvar_name}, which is invisible object.
dso_str_split <- function(data, target_var, outvar_name, sep = ";") {
  # Remove a series of variable named outvar_name with numeric suffix
  .to_keep <- names(data)[!grepl(paste0(outvar_name, "([0-9]+)$"), names(data))]
  .new_data <- data %>%
    dplyr::mutate(dso_unique_id = if (nrow(.) == 0L) {
      numeric(0L)
    } else {
      dplyr::row_number()
    }) %>%
    dplyr::select(dplyr::all_of(.to_keep), dso_unique_id)
  
  .need_split_var <- rlang::enquo(target_var)
  # sep is also used literally in paste(collapse = sep), so keep it unescaped
  if (is.na(sep)) sep <- ";"
  
  # dso_check_strs(outvar_name, sep)
  
  # separate string
  .split_data <- .new_data %>%
    dplyr::mutate(dso_split_var = !!.need_split_var) %>%
    tidyr::separate_rows(dso_split_var, sep = sep) %>%
    dplyr::group_by(dso_unique_id, !!!rlang::syms(.to_keep)) %>%
    dplyr::mutate(
      # running length of the current piece; resets once it exceeds 200
      dso_var1 = if (is.null(dso_cumsum(nchar(dso_split_var), x1 > 200L))) {
        numeric(0L)
      } else {
        dso_cumsum(nchar(dso_split_var), x1 > 200L)
      },
      # TRUE where a new piece starts
      dso_var2 = ifelse(nchar(dso_split_var) == dso_var1, TRUE, FALSE),
      dso_subid = dplyr::coalesce(cumsum(dso_var2), 1L),
      dso_text = dplyr::coalesce(dso_split_var, "")
    ) %>%
    dplyr::ungroup() %>%
    dplyr::group_by(dso_unique_id, !!!rlang::syms(.to_keep), dso_subid) %>%
    dplyr::summarise(dso_text = paste(dso_text, collapse = sep)) %>%
    dplyr::ungroup() %>%
    dplyr::mutate_at("dso_text", ~ dplyr::if_else(stringr::str_squish(.) == "", NA_character_, .)) %>%
    # Avoid intermediate separator causing redundant symbols at the end
    dplyr::mutate(dso_text = gsub(paste0("(", sep, "+)$"), "", dso_text)) %>%
    tidyr::pivot_wider(names_from = dso_subid, values_from = dso_text, names_prefix = "dso_text") %>%
    dplyr::select(-dso_unique_id)
  
  # if the name of target_var and .out_var is same, then rename target_var to paste(., target_var)
  
  if (identical(rlang::as_label(.need_split_var), outvar_name)) {
    names(.split_data)[names(.split_data) == outvar_name] <- paste0(".", outvar_name)
  }
  
  # modify the name of series variable calling outvar_name
  .text_name <- names(dplyr::select(.data = .split_data, dplyr::starts_with("dso_text")))
  
  if (length(.text_name) == 1L) {
    names(.split_data)[names(.split_data) %in% .text_name] <- outvar_name
  } else {
    names(.split_data)[names(.split_data) %in% .text_name] <- c(outvar_name, paste0(outvar_name, 1L:(length(.text_name) - 1)))
  }
  
  invisible(.split_data)
}

# Cumulative sum of `num` (plus 1 per separator) that restarts whenever
# `cond`, an expression on the running total x1, becomes TRUE.
dso_cumsum <- function(num, cond) {
  .cond <- rlang::enexpr(cond)
  Reduce(function(x1, x2) {
    x1 <- sum(x1, (x2 + 1L), na.rm = TRUE)
    if (eval(.cond)) {
      x1 <- 0L
      x1 <- sum(x1, x2, na.rm = TRUE)
    }
    return(x1)
  }, num, accumulate = TRUE)
}

library(dplyr)

test_df <- data.frame(
  id = 1:2,
  text = c(
    paste(strrep("a", 150), ";", strrep("a", 50), strrep("a", 130), sep = ";"),
    paste(strrep("a", 150), strrep("a", 30), strrep("a", 150), strrep("a", 80), sep = ";")
  )
)

# Pass the plain separator: an escaped "\\;" would be pasted back literally
new_test_df <- test_df %>% dso_str_split(text, "text", ";")
```

parikp06 added the enhancement New feature or request label Nov 2, 2023

parikp06 changed the title ~~Feature Request: <Develop a function to split text exceeding 200 characters>~~ Develop a function to split text exceeding 200 characters Nov 2, 2023

GomathiVallinayagam assigned GomathiVallinayagam and madhan0923 Nov 6, 2023
