Package 'prepdat' reference manual

Title:	Preparing Experimental Data for Statistical Analysis
Description:	Prepares data for statistical analysis (e.g., analysis of variance ;ANOVA) by enabling the user to easily and quickly merge (using the file_merge() function) raw data files into one merged table and then aggregate the merged table (using the prep() function) into a finalized table while keeping track and summarizing every step of the preparation. The finalized table contains several possibilities for dependent measures of the dependent variable. Most suitable when measuring variables in an interval or ratio scale (e.g., reaction-times) and/or discrete values such as accuracy. Main functions included are file_merge() and prep(). The file_merge() function vertically merges individual data files (in a long format) in which each line is a single observation to one single dataset. The prep() function aggregates the single dataset according to any combination of grouping variables (i.e., between-subjects and within-subjects independent variables, respectively), and returns a data frame with a number of dependent measures for further analysis for each cell according to the combination of provided grouping variables. Dependent measures for each cell include among others means before and after rejecting all values according to a flexible standard deviation criteria, number of rejected values according to the flexible standard deviation criteria, proportions of rejected values according to the flexible standard deviation criteria, number of values before rejection, means after rejecting values according to procedures described in Van Selst & Jolicoeur (1994; suitable when measuring reaction-times), standard deviations, medians, means according to any percentile (e.g., 0.05, 0.25, 0.75, 0.95) and harmonic means. The data frame prep() returns can also be exported as a txt file to be used for statistical analysis in other statistical programs.
Authors:	Ayala S. Allon [aut, cre], Roy Luria [aut], James Grange [ctb], Nachshon Meiran [ctb]
Maintainer:	Ayala S. Allon <[email protected]>
License:	GPL-3
Version:	1.0.8
Built:	2025-02-08 04:37:41 UTC
Source:	https://github.com/ayalaallon/prepdat

Vertically Merge Files in a Directory into a Single Large Dataset

Description

Vertically concatenates files containing data tables in a long format into a single large dataset. In order for the function to work, all files you wish to merge should be in the same format (either txt or csv). This function is very useful for concatenating raw data files of individual subjects in an experiment (in which each line corresponds to a single observation in the experiment) to one raw data file that includes all subjects.

Usage

file_merge(
           folder_path = NULL
           , has_header = TRUE
           , new_header = c()
           , raw_file_name = NULL
           , raw_file_extension = NULL
           , file_name = "dataset.txt"
           , save_table = TRUE
           , dir_save_table = NULL
           , notification = TRUE
          )
file_merge(
           folder_path = NULL
           , has_header = TRUE
           , new_header = c()
           , raw_file_name = NULL
           , raw_file_extension = NULL
           , file_name = "dataset.txt"
           , save_table = TRUE
           , dir_save_table = NULL
           , notification = TRUE
          )

Arguments

`folder_path`	A string with the path of the folder in which files to be merged are searched. Search is recursive (i.e., can search also in subdirectories). `folder_path` must be provided. Default is `NULL`.
`has_header`	Logical. If `TRUE`, the function takes the first line of the first file found as the header of the merged table. Default is `TRUE`.
`new_header`	String vector with names for columns of the merged table. Default is `c()`. If used, `new_header` should be the same length as the number of columns in the merged table.
`raw_file_name`	A string with the name of the files to be searched and then merged. File extension should NOT be included here (see `raw_file_extension`). `raw_file_name` must be provided. Default is `NULL`.
`raw_file_extension`	A string with the format of the files (i.e., `csv` or `txt`) to be merged. `raw_file_extension` must be provided. Default is `NULL`.
`file_name`	A string with the name of the file of the merged table the function creates in case `save_table` is `TRUE`. Extension of the the file can be txt or csv and should be included. Default is `"dataset.txt"`.
`save_table`	Logical. If `TRUE`, saves the merged table. Default is `TRUE`.
`dir_save_table`	A string with the path of the folder in which the merged table is saved in case `save_table` is `TRUE`. Default is the path provided in `folder_path`.
`notification`	Logical. If `TRUE`, prints messages about the progress of the function. Default is `TRUE`.

Value

The merged table

Finalized Table `prepdat::prep()` returns for `stroopdata` According to the Example in `prepdat::prep()`.

Description

A data frame containing dependent measures prep for each id calculated according to grouping variables: block and target_type. prep() aggregates the columns for the dependent measures by first dividing them to the levels of the first independent variable in wthin vars, and then within each level prep() divides the columns according to the next variable in within_vars and so forth. Thus, for each dependent measure in this example there are four columns according to the order they where entered in within_vars argument in prep. For this data frame this argument was within_vars = c("block", "target_type").

Usage

data(finalized_stroopdata)
data(finalized_stroopdata)

Format

A data frame with 15 rows and 98 columns.

Details

The complete list of names of the dependent measures is:

mdvc: mean dvc.

sdvc: SD for dvc.

meddvc: median dvc.

tdvc: mean dvc after rejecting observations above standard deviation criteria specified in sd_criterion.

ntr: number of observations rejected for each standard deviation criterion specified in sd_criterion.

ndvc: number of observations before rejection.

ptr: proportion of observations rejected for each standard deviation criterion specified in sd_criterion.

rminv: harmonic mean of dvc.

prt: dvc according to each of the percentiles specified in percentiles.

mdvd: mean dvd.

merr: mean error.

nrmc: mean dvc according to non-recursive procedure with moving criterion.

nnrmc: number of observations rejected for dvc according to non-recursive procedure with moving criterion.

pnrmc: percent of observations rejected for dvc according to non-recursive procedure with moving criterion.

tnrmc: total number of observations upon which the non-recursive procedure with moving criterion was applied.

mrmc: mean dvc according to modified-recursive procedure with moving criterion.

nmrmc: number of observations rejected for dvc according to modified-recursive procedure with moving criterion.

pmrmc: percent of observations rejected for dvc according to modified-recursive procedure with moving criterion.

tmrmc: total number of observations upon which the modified-recursive procedure with moving criterion was applied.

hrmc: mean dvc according to hybrid-recursive procedure with moving criterion.

nhrmc: number of observations rejected for dvc according to hybrid-recursive procedure with moving criterion.

thrmc: total number of observations upon which the hybrid-recursive procedure with moving criterion was applied.

Examples

data(finalized_stroopdata)
head(finalized_stroopdata)
data(finalized_stroopdata)
head(finalized_stroopdata)

Hybrid-recursive Outlier Removal Procedure with Moving Criterion

Description

Hybrid-recursive outlier removal procedure with moving criterion according to Van Selst & Jolicoeur (1994).

Usage

hybrid_recursive_mc(exp_cell)
hybrid_recursive_mc(exp_cell)

Arguments

exp_cell

Numeric vector on which the outlier removal method takes place. If experimental cell has 4 trials or less it will result in NA.

Value

A vector with the mean of exp_cell after removing outliers, percent of trials removed, and total number of trials in exp_cell before outlier removal.

References

Grange, J.A. (2015). trimr: An implementation of common response time trimming methods. R Package Version 1.0.0. https://cran.r-project.org/package=trimr

Van Selst, M., & Jolicoeur, P. (1994). A solution to the effect of sample size on outlier elimination. The quarterly journal of experimental psychology, 47(3), 631-650.

Modified-recursive Outlier Removal Procedure with Moving Criterion

Description

Modified-recursive outlier removal procedure with moving criterion according to Van Selst & Jolicoeur (1994).

Usage

modified_recursive_mc(exp_cell)
modified_recursive_mc(exp_cell)

Arguments

exp_cell

Numeric vector on which the outlier removal method takes place. If experimental cell has 4 trials or less it will result in NA.

Value

A vector with the mean of exp_cell after removing outliers, percent of trials removed, number of trials removed in the procedure,and total number of trials in exp_cell before outlier removal.

References

Grange, J.A. (2015). trimr: An implementation of common response time trimming methods. R Package Version 1.0.0. https://cran.r-project.org/package=trimr

Van Selst, M., & Jolicoeur, P. (1994). A solution to the effect of sample size on outlier elimination. The quarterly journal of experimental psychology, 47(3), 631-650.

Non-recursive Outlier Removal Procedure with Moving Criterion

Description

Non-recursive outlier removal procedure with moving criterion according to Van Selst & Jolicoeur (1994).

Usage

non_recursive_mc(exp_cell)
non_recursive_mc(exp_cell)

Arguments

exp_cell

Numeric vector on which the outlier removal method takes place. If experimental cell has 4 trials or less it will result in NA.

Value

A vector with the mean of exp_cell after removing outliers, percent of trials removed, number of trials removed in the procedure,and total number of trials in exp_cell before outlier removal.

References

Grange, J.A. (2015). trimr: An implementation of common response time trimming methods. R Package Version 1.0.0. https://cran.r-project.org/package=trimr

Van Selst, M., & Jolicoeur, P. (1994). A solution to the effect of sample size on outlier elimination. The quarterly journal of experimental psychology, 47(3), 631-650.

Creates One Finalized Table Ready for Statistical Analysis

Description

prep() aggregates a single dataset in a long format according to any number of grouping variables. This makes prep() suitable for aggregating data from various types of experimental designs such as between-subjects, within-subjects (i.e., repeated measures), and mixed designs (i.e., experimental designs that include both between- and within- subjects independent variables). prep() returns a data frame with a number of dependent measures for further analysis for each aggregated cell (i.e., experimental cell) according to the provided grouping variables (i.e., independent variables). Dependent measures for each experimental cell include among others means before and after rejecting observations according to a flexible standard deviation criteria, number of rejected observations according to the flexible standard deviation criteria, proportions of rejected observations according to the flexible standard deviation criteria, number of observations before rejection, means after rejecting observations according to procedures described in Van Selst & Jolicoeur (1994; suitable when measuring reaction-times), standard deviations, medians, means according to any percentile (e.g., 0.05, 0.25, 0.75, 0.95) and harmonic means. The data frame prep() returns can also be exported as a txt or csv file to be used for statistical analysis in other statistical programs.

Usage

prep(
   dataset = NULL
   , file_name = NULL
   , file_path = NULL
   , id = NULL
   , within_vars = c()
   , between_vars = c()
   , dvc = NULL
   , dvd = NULL
   , keep_trials = NULL
   , drop_vars = c()
   , keep_trials_dvc = NULL
   , keep_trials_dvd = NULL
   , id_properties = c()
   , sd_criterion = c(1, 1.5, 2)
   , percentiles = c(0.05, 0.25, 0.75, 0.95)
   , outlier_removal = NULL
   , keep_trials_outlier = NULL
   , decimal_places = 4
   , notification = TRUE
   , dm = c()
   , save_results = TRUE
   , results_name = "results.txt"
   , results_path = NULL
   , save_summary = TRUE
)
prep(
   dataset = NULL
   , file_name = NULL
   , file_path = NULL
   , id = NULL
   , within_vars = c()
   , between_vars = c()
   , dvc = NULL
   , dvd = NULL
   , keep_trials = NULL
   , drop_vars = c()
   , keep_trials_dvc = NULL
   , keep_trials_dvd = NULL
   , id_properties = c()
   , sd_criterion = c(1, 1.5, 2)
   , percentiles = c(0.05, 0.25, 0.75, 0.95)
   , outlier_removal = NULL
   , keep_trials_outlier = NULL
   , decimal_places = 4
   , notification = TRUE
   , dm = c()
   , save_results = TRUE
   , results_name = "results.txt"
   , results_path = NULL
   , save_summary = TRUE
)

Arguments

`dataset`	Name of the data frame in R that contains the long format table after merging the individual data files using `file_merge()`. Either `dataset` or `file_name` must be provided. Default is `NULL`.
`file_name`	A string with the name of a txt or csv file (including the file extension, e.g. `"my_data.txt"`) with the merged table in case the user already merged the individual data files. Either `dataset` or `file_name` must be provided. Default is `NULL`.
`file_path`	A string with the path of the folder in which `file_name` is located. If `file_name` was used, then `file_path` must be provided. Default is `NULL`.
`id`	A string with the name of the column in `file_name` or in `dataset` that contains the variable specifying the case identifier (i.e., the variable upon which the measurement took place; e.g., `"subject_number"`). This should be a unique value per case. Values in this column must be numeric. Argument must be provided. Default is `NULL`.
`within_vars`	String vector with names of grouping variables in `file_name` or in `dataset` that contain independent variables manipulated (or observed) within-ids (i.e., within-subjects, repeated measures). Single or multiple values must be specified as a string (e.g., `c("SOA", "condition")`) according to the hierarchical order you wish. Note that the order of the names in `within_vars()` is important because `prep()` aggregates the data for the dependent measures by first dividing them to the levels of the first grouping variable in `witin_vars()`, and then within each of those levels `prep()` divides the data according to the next variable in `within_vars()` and so forth. Values in these columns must be numeric. Either `within_vars` or `between_vars` (or both) arguments must be provided. Default is `c()`.
`between_vars`	String vector with names of grouping variables in `file_name` or in `dataset` that contain independent variables manipulated (or observed) between-ids (i.e., between-subjects). Single or multiple values must be specified as a string (e.g., `c("order")`). Order of the names in `between_vars()` does not matter. Values in this column must be numeric. Either `between_vars` or `within_vars` (or both) arguments must be provided. Default is `c()`.
`dvc`	A string with the name of the column in `file_name` or in `dataset` that contains the dependent variable (e.g., "rt" for reaction-time as a dependent variable). Values in this column must be in an interval or ratio scale. Either `dvc` or `dvd` (or both) arguments must be provided. Default is `NULL`.
`dvd`	A string with the name of the column in `file_name` or in `dataset` that contains the dependent variable (e.g., `"ac"` for accuracy as a dependent variable). Values in this column must be numeric and discrete (e.g., 0 and 1). Either `dvc` or `dvd` (or both) arguments must be provided. Default is `NULL`.
`keep_trials`	A string. Allows deleting unnecessary observations and keeping necessary observations in `file_name` or in `dataset` according to logical conditions specified as a string. For example, if the dataset contains practice trials for each subject, these trials should not be included in the aggregation. The user should remove these trials by specifying how they were coded in the raw data (i.e., data before aggregation). For example, if practice trials are the ones for which the "block" column in the raw data tables equals to zero, the `keep_trials` argument should be `"raw_data$block !== 0"`. `raw_data` is the internal object in `prep()` representing the merged table. All logical conditions in `keep_trials` should be put in the same string and be concatenated by `&` or `\|`. Logical conditions for this argument can relate to different columns in the merged table. Note that all further arguments of `prep()` will relate to the remaining observations in the merged table. Default is `NULL`.
`drop_vars`	String vector with names of columns to delete in `file_name` or in `dataset`. Single or multiple values must be specified as a string (e.g., `c("font_size")`). Order of the names in `drop_vars` does not matter. Note that all further arguments of `prep()` will relate to the remaining variables in the merged table. Default is `c()`.
`keep_trials_dvc`	A string. Allows deleting unnecessary observations and keeping necessary observations in `file_name` or in `dataset` for calculations and aggregation of the dependent variable in `dvc` according to logical conditions specified as a string. Logical conditions should be specified as a string as in the `keep_trials` argument (e.g., `"raw_data$rt > 100 & raw_data$rt < 3000 & raw_dada$ac == 1"`). All dependent measures for `dvc` except for those specified in `outlier_removal` will be calculated on the remaining observations. Defalut is `NULL`.
`keep_trials_dvd`	A string. Allows deleting unnecessary observations and keeping necessary observations in `file_name` or in `dataset` for calculations and aggregation of the dependent variable in `dvd` according to logical conditions specified as a string. Logical conditions should be specified as a string as in the `keep_trials` argument (e.g., `raw_data$rt > 100 & raw_data$rt < 3000`). All dependent measures for `dvd` (i.e., `"mdvd"` and `"merr"`) will be calculated on the remaining observations. Default is `NULL`.
`id_properties`	String vector with names of columns in `dataset` or in `file_name` that describe the ids (e.g., subjects) in the data and were not manipulated within-or between-ids. For example, in case the user logged for each observation and for each id in an experiment also the age and the gender of the subject, this argument will be `c("age", "gender")`. Order of the names in `id_properties` does not matter. Single or multiple values must be specified as a string. Values in these columns must be numeric. Default is `c()`.
`sd_criterion`	Numeric vector specifying a number of standard deviation criteria for which `prep()` will calculate the mean `dvc` for each cell in the finalized table after rejecting observations that did not meet the criterion (e.g., rejecting observations that were more than 2 standard deviations above or below the mean of that cell). Values in this vector must be numeric. Default is `c(1, 1.5, 2)`.
`percentiles`	Numeric vector containing wanted percentiles for `dvc`. Values in this vector must be decimal numbers between 0 to 1. Percentiles are calculated according to `type = 7` (see `quantile` for more information). Default is `c(0.05, 0.25, 0.75, 0.95)`.
`outlier_removal`	Numeric. Specifies which outlier removal procedure with moving criterion to calculate for `dvc` according to procedures described by Van Selst & Jolicoeur (1994). If `1` then non-recursive procedure is calculated, if `2` then modified recursive procedure is calculated, if `3` then hybrid recursive procedure is calculated. Moving criterion is according to Table 4 in Van Selst & Jolicoeur (1994). If experimental cell has 4 trials or less it will result in `NA`. Default is `NULL`.
`keep_trials_outlier`	A string. Allows deleting unnecessary observations and keeping necessary observations in `file_name` or in `dataset` for calculations and aggregation of the outlier removal procedures by Van Selst & Jolicoeur (1994). Logical conditions should be specified as a string as in the `keep_trials` argument (e.g., `"raw_data$ac == 1"`). `outlier_removal` procedure will be calculated on the remaining observations. Defalut is `NULL`.
`decimal_places`	Numeric. Specifies number of decimals to be written in `results_name` for each value of the dependent measures for `dvc`. Value must be numeric. Default is `4`.
`notification`	Logical. If `TRUE`, prints messages about the progress of the function. Default is `TRUE`.
`dm`	String vector with names of dependent measures the function returns. If empty (i.e., `c()`) the function returns a data frame with all possible dependent measures in `prep()`. Values in this vector must be strings from the following list: "mdvc", "sdvc", "meddvc", "tdvc", "ntr", "ndvc", "ptr", "prt", "rminv", "mdvd", "merr". Default is `c()`. See Value section below for more details.
`save_results`	Logical. If TRUE, the function creates a txt file containing the returned data frame. Default is `TRUE`.
`results_name`	A string with the name of the file `prep` returns in case `save_results` is `TRUE`. Extension of the file can be txt or csv and should be included. Default is `"results.txt"`.
`results_path`	A string with the path of the folder in which `results_name` will be saved. Default is the path provided in `file_path`. In case no path was provided in `file_path`, `results_path` must be provided.
`save_summary`	Logical. if `TRUE`, creates a summary file in the same format as `results_name`. Default is `TRUE`.

Value

A data frame with dependent measures for the dependent variables in dvc and dvd by id and grouping variables.

The first column in the finalized table is the id column. In case id_properties was used, the next columns will be the value of each id_properties for each id.

If between_vars was used then the next column{}s will be the value of each beween_vars for each id.

The next columns of the finalized table contain the dependent measures according to the design specified. If within_vars was used, then the data for each dependent measure was first divided according to the levels of the first grouping variable in witin_vars(), and then within each of those levels prep() divided the data according to the next variable in within_vars() and so forth. The dependent measures in the finalized table are: