selector

Select variables by characteristic (char) or regular expression (regex)

Lifecycle: experimental

The goal of selector is to provide methods for selecting variables of interest in ways that base Stata cannot. For those who find Stata's glob patterns limiting, selector offers selection by regex pattern. For those who use Survey Solutions, selector enables variable selection based on questionnaire metadata (e.g., question type).

Installation

Development version

To get a bug fix, try a release before SSC publication, or test bleeding-edge features, you can install code from other branches of the repository. The current pre-SSC release is version 0.5.

To install the version in a particular branch:

* set tag to be the name of the target branch
* for example, the development branch, which contains code for the next release
local tag "dev"
* similarly, version 0.5, which contains the code for the current, pre-SSC release
* download the code from that GitHub branch
* install the package
net install selector, ///
  from("https://raw.githubusercontent.com/lsms-worldbank/selector/`tag'/src") replace
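
After installing, it can help to confirm what Stata actually picked up. The lines below are a minimal sketch using base Stata commands only; the second one assumes the package ships an ado-file named selector.ado, which the command list suggests but this README does not state.

```stata
* list the installed package and the files that came with it
ado describe selector

* show which file Stata will run for the package command
* (assumption: the package installs a file named selector.ado)
which selector
```

If `which` reports that the command is not found, the install most likely pointed at a branch or tag that does not exist.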

Previous versions

If you need to install a previously released version of selector, you can use the following method. This can be useful, for example, during reproducibility verifications. To install the version in a particular release, set the local tag to the target release in this code:

* set the tag to the name of the target release
* for example v1.0, say, if the current version were v2.0
local tag "v1.0"
* download the code from that GitHub release
* install the package
net install selector, ///
  from("https://raw.githubusercontent.com/lsms-worldbank/selector/`tag'/src") replace

Commands

Command              Description
selector             Package command with utilities for the rest of the package
sel_matches_regex    Get variables that match a regular expression
sel_add_metadata     Apply SuSo metadata to current data
sel_remove_metadata  Clean up metadata only needed during cleaning
sel_char             Select variables based on char value
sel_vars             List variables with matching characteristics in Survey Solutions’ Designer

Usage

selector currently provides two means of selecting variables:

  1. Regular expressions
  2. Survey Solutions questionnaire metadata (and other chars)

Regular expressions

By default, Stata allows users to select variables by either specifying a variable range (e.g., var1 - var5) or a variable name (glob) pattern (e.g., var*).

However, there is no straightforward way to specify a list of variables that match a regular expression--a pattern specification that is typically more precise than either of the foregoing Stata options. The sel_matches_regex command fills that gap in functionality.

In particular, this command aims to meet a few needs:
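
To make the gap concrete, here is a rough sketch of the workaround in base Stata: loop over every variable and test its name with regexm(). This only illustrates the manual approach; it is not a description of how sel_matches_regex is implemented, and the local name matched is made up for the example.

```stata
* start from an empty data set and create a few example variables
clear
gen housing_unit = .
gen s01q01_quantity = .
gen s01q01_unit = .

* collect the names of variables that match the pattern
local matched ""
foreach v of varlist _all {
    * regexm() returns 1 when the name matches the regular expression
    if regexm("`v'", "_unit$") {
        local matched "`matched' `v'"
    }
}

* trim extra spaces and show the result
local matched : list retokenize matched
display "`matched'"
```

The loop works, but it has to be rewritten or wrapped in a program for every new pattern, which is the repetition a dedicated command removes.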

Select variables more precisely

* create sets of variables
gen housing_unit = .
gen s01q01_quantity = .
gen s01q01_unit = .
gen s01q02_quantity = .
gen s01q02_unit = .
gen s01q03_quantity = .
gen s01q03_unit = .
gen s01q04_quantity = .
gen s01q04_unit = .

* select variables that end in _unit
sel_matches_regex "_unit$"

* select variables that end in _unit for questions 02 and 03
sel_matches_regex "0[23]_unit$"
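
The selected variables are returned in r(varlist), as the examples further down in this README rely on, so they can be stored and passed to other commands. A minimal sketch, assuming selector is installed; the local name unit_vars is only illustrative.

```stata
* create a few example variables
clear
gen housing_unit = .
gen s01q01_quantity = .
gen s01q01_unit = .

* select variables that end in _unit and keep the result
sel_matches_regex "_unit$"
local unit_vars "`r(varlist)'"

* use the stored list with any command that takes a varlist
describe `unit_vars'
```

Note that, read as a regular expression, `_unit$` also matches housing_unit; a tighter pattern such as the `0[23]_unit$` example above avoids that.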

Identify variables that fail to follow a pattern

* create a set of variables that mostly follow a pattern
* importantly, some don't
gen s01q01 = .
gen s01q02 = .
gen s01_q03 = .
gen s01q04 = .
gen S01q04 = .
gen s01q05a = .
gen s01q05_unit = .

* identify variables that do NOT follow the pattern
sel_matches_regex "s01q0[0-9][a-z]*$", negate

* assert that there are no variables that fail to follow the pattern,
* preventing variable naming problems, say, in disseminated data
local pattern_for_data "s01q0[0-9][a-z]*$"
qui: sel_matches_regex "`pattern_for_data'", negate
* store the non-conforming variables and count them
local not_follow "`r(varlist)'"
local n_not_follow : list sizeof not_follow
* assert returns a non-zero code in _rc when the count is not 0
capture assert `n_not_follow' == 0
if _rc != 0 {
    di as error "Some variables do not follow the desired pattern (`pattern_for_data')"
    di as text "`not_follow'"
}
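
One detail worth checking when writing such a pattern is anchoring. The pattern above ends with `$` but does not start with `^`, so it can match in the middle of a longer name. The sketch below uses base Stata's regexm() to show the difference; it assumes sel_matches_regex matches names the same way regexm() does, which this README does not confirm, and old_s01q01 is a hypothetical variable name.

```stata
* the pattern used above, stored once for reuse
local pattern "s01q0[0-9][a-z]*$"

* a conforming name matches (displays 1)
display regexm("s01q01", "`pattern'")

* a stray underscore breaks the match (displays 0)
display regexm("s01_q03", "`pattern'")

* without ^, a name with an unexpected prefix still matches (displays 1)
display regexm("old_s01q01", "`pattern'")

* anchoring the start with ^ rejects the prefixed name (displays 0)
display regexm("old_s01q01", "^`pattern'")
```

If prefixed names should count as violations, adding `^` to the start of the pattern is the safer choice.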

Survey Solutions metadata (and other chars)

Survey Solutions metadata

The workflow involves the following steps:

Get the Survey Solutions questionnaire in JSON format

The short answer: download it from your Survey Solutions server. See here for more details.

Create a questionnaire metadata data set from the JSON file

The short answer: use the susometa R package to transform the questionnaire metadata from JSON to a data frame, and save that data frame as a .dta file for selector to use. See here for more details.

Add the questionnaire metadata to the survey microdata

For selector to use Survey Solutions' questionnaire metadata, it must be added to the data set in memory. The sel_add_metadata command does exactly that: it injects metadata so that other selector commands can use this information.

* add Survey Solutions questionnaire metadata
sel_add_metadata using "path/to/your/metadata.dta"

Select based on metadata

Once metadata have been added to microdata, selector commands can select variables based on their characteristics in the Survey Solutions questionnaire that generated them.

For example, one can select by individual characteristics like question type:

* select by question type

* numeric
* any type of numeric
sel_vars is_numeric
* any with decimals
sel_vars is_decimal

* multi-select
* any type of multi-select
sel_vars is_multi_select
* yes/no
sel_vars is_multi_yn
* checkboxes
sel_vars is_multi_checkbox
* with answer order recorded
sel_vars is_multi_ordered

Alternatively, one can combine multiple selectors, since the outputs of one command--that is, the variables with a certain characteristic--can be passed as input into another command--that is, the variables to consider for another characteristic.

* combine selectors

* first, select multi-select
sel_vars is_multi_select
local multi_select "`r(varlist)'"
* then, select linked questions among them
sel_vars is_linked, varlist(`multi_select')

Remove metadata

Once metadata are no longer needed--for example, as data files are prepared for publication--they can be removed with sel_remove_metadata.

* remove metadata (e.g., before saving data for dissemination)
sel_remove_metadata

Other arbitrary chars

The selector package uses Stata chars to select variables. For Survey Solutions users, selector provides a dedicated command for accessing particular chars corresponding to Survey Solutions questionnaire metadata (i.e., sel_vars and its subcommands like is_numeric, is_multi_select, is_linked, etc.).

For those interested in using different chars, selector provides a general-purpose selector to query and select on the basis of user-provided chars: sel_char.

For example:

* use the automobile data set
sysuse auto, clear

* add a currency unit char to the price variable
char price[currency] "USD"

* create another price variable and attach a currency unit char
gen  price_eur = price * .9
char price_eur[currency] "EUR"

* select those variables whose currency unit is USD
sel_char "currency USD"
return list
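
For readers less familiar with chars, the sketch below shows the same lookup done by hand in base Stata, reading each variable's currency char with the `: char` extended macro function. It is only meant to show what selecting on a char involves, not how sel_char works internally; the local names cur and usd_vars are made up for the example.

```stata
* set up the same data and chars as above
sysuse auto, clear
char price[currency] "USD"
gen price_eur = price * .9
char price_eur[currency] "EUR"

* collect the variables whose currency char equals USD
local usd_vars ""
foreach v of varlist _all {
    * read the char; the result is empty when the variable has no such char
    local cur : char `v'[currency]
    if "`cur'" == "USD" {
        local usd_vars "`usd_vars' `v'"
    }
}

* trim extra spaces and show the result
local usd_vars : list retokenize usd_vars
display "`usd_vars'"
```

Here only price carries the value USD, so it is the only name collected.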

Learn more

To learn more about the package:

  • Consult the reference documentation
  • Read how-to articles

Contact

LSMS Team, World Bank lsms@worldbank.org