--- title: "Description of groupdata2" author: - "Ludvig Renbo Olsen" date: "`r Sys.Date()`" abstract: | `groupdata2` is a set of methods for easy grouping, windowing, folding, partitioning, splitting and balancing of data. Create balanced folds for cross-validation or divide a time series into windows. Balance group sizes with up- and downsampling or collapse existing groups to fewer, balanced groups. This vignette contains descriptions of functions and methods, along with simple examples of usage. For a gentler introduction to groupdata2, please see [Introduction to groupdata2](introduction_to_groupdata2.html)     Contact author at r-pkgs@ludvigolsen.dk     ----- output: rmarkdown::html_vignette: css: - !expr system.file("rmarkdown/templates/html_vignette/resources/vignette.css", package = "rmarkdown") - styles.css fig_width: 6 fig_height: 4 toc: yes number_sections: no rmarkdown::pdf_document: highlight: tango number_sections: yes toc: yes toc_depth: 4 vignette: > %\VignetteIndexEntry{Description of groupdata2} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.align='center', dpi = 92, fig.retina = 2 ) options(tibble.print_min = 4L, tibble.print_max = 4L) ``` ## Installing groupdata2 You can either install the CRAN version or the GitHub development version. ### CRAN version ```{r eval=FALSE} # Uncomment: # install.packages("groupdata2") ``` ### GitHub development version ```{r eval=FALSE} # Uncomment: # install.packages("devtools") # devtools::install_github("LudvigOlsen/groupdata2") ``` ## Attach packages ```{r error=FALSE, message=FALSE, warning=FALSE} # Attaching groupdata2 library(groupdata2) # Attaching other packages used in this vignette library(dplyr) library(tidyr) require(ggplot2, quietly = TRUE) # Attach if installed library(knitr) # We will also be using plyr a few times, but we don't attach this # because of possible conflicts with dplyr. Instead we use its functions # like so: plyr::count() ``` # General information `groupdata2` is a set of functions and methods for easy grouping, windowing, folding, partitioning, splitting and balancing of data. There are 7 main functions: ### group_factor() Returns a factor with group numbers, e.g. `c(1, 1, 1, 2, 2, 2, 3, 3, 3)`. This can be used to subset, aggregate, group_by, etc. ### group() Returns the given data as a `data frame` with an added grouping factor made with `group_factor()`. The `data frame` is grouped by the grouping factor for easy use with `%>%` pipelines. ### splt() Splits the given data into the specified groups made with `group_factor()` and returns them in a `list`. ### fold() Creates (optionally) balanced folds for use in cross-validation. Balance folds on one categorical variable and/or one numerical variable. Ensure that all datapoints sharing an ID is in the same fold. Can create multiple unique fold columns at once, e.g. for repeated cross-validation. ### partition() Creates (optionally) balanced partitions (e.g. training/test sets). Balance partitions on one categorical variable and/or one numerical variable. Make sure that all datapoints sharing an ID is in the same partition. ### collapse_groups() Collapses existing groups into a smaller set of groups with (optional) categorical, numerical, ID, and/or size balancing. ### balance() Uses up- or downsampling to fix the size of all groups to the min, max, mean, or median group size or to a specific number of rows. Balancing can also happen on the ID level, e.g. 
## Groups, windows or folds?

When working with time series we would often refer to the kind of groups made by `group_factor()`, `group()` and `splt()` as **windows**. In this vignette, these will be referred to as groups.

`fold()` creates balanced groups for cross-validation by using `group()`. These are referred to as **folds**.

`partition()` creates balanced groups of (potentially) different sizes, by using `group()`. These are referred to as **partitions**.

## Introduction vignettes

[Cross-validation with groupdata2](cross-validation_with_groupdata2.html)
In this vignette, we go through the basics of cross-validation, such as creating balanced train/test sets with `partition()` and balanced folds with `fold()`. We also write up a simple cross-validation function and compare multiple linear regression models.

[Time series with groupdata2](time_series_with_groupdata2.html)
In this vignette, we divide up a time series into groups (windows) and subgroups using `group()` with the `greedy` and `staircase` methods. We do some basic descriptive stats of each group and use them to reduce the data size.

[Automatic groups with groupdata2](automatic_groups_with_groupdata2.html)
In this vignette, we will use the `l_starts` method with `group()` to allow transferring of information from one dataset to another. We will use the automatic grouping function that finds group starts all by itself.

## Use of kable()

In the examples, we will be using `knitr::kable()` to visualize some of the data such as data frames. You do not need to use `kable()` when using the functions in `groupdata2`.

# Grouping Methods

There are currently 10 methods for grouping the data.

It is possible to create groups based on *number of groups* (default), *group size*, *list of group sizes*, *list of group start positions*, *distance between members*, *step size* or *prime number to start at*.

These can be passed as whole number(s) or percentage(s), while method `l_starts` also accepts `'auto'`.

Here, we will take a look at the different methods.

## Method: 'greedy'

`greedy` uses group **size** for dividing up the data.

*Greedy* refers to each group grabbing as many elements as possible (up to the specified size), meaning that there might be fewer elements available to the last group, but that all groups other than the last are guaranteed to have the specified size.

&nbsp;

**Example**

> We have a vector with 57 values. We want to have group sizes of 10.
>
> The greedy splitter will return groups with this many values in them:
>
> 10, 10, 10, 10, 10, 7

&nbsp;

By setting `force_equal = TRUE`, we discard the last group if it contains fewer values than the other groups.

&nbsp;

**Example**

> We have a vector with 57 values. We want to have group sizes of 10.
>
> The greedy splitter with force_equal set to TRUE
> will return groups with this many values in them:
>
> 10, 10, 10, 10, 10
>
> meaning that 7 values have been discarded.

&nbsp;
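The greedy examples above can be sketched in code like this (the sizes from the examples are shown as a comment):

```{r}
# Sketch: 'greedy' grouping of 57 values with a group size of 10
groups <- group_factor(c(1:57), 10, method = 'greedy')
table(groups)  # Expected sizes: 10, 10, 10, 10, 10, 7
```

&nbsp;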
## Method: 'n_dist' (Default)

`n_dist` uses a specified number of groups to divide up the data. First, it creates equal groups as large as possible. Then, it distributes any excess data points across the groups.

&nbsp;

**Example**

> We have a vector with 57 values. We want to get back 5 groups.
>
> 'n_dist' with default settings would return groups with this many values in them:
>
> 11, 11, 12, 11, 12

&nbsp;

By setting `force_equal = TRUE`, `n_dist` will create the largest possible, equally sized groups by discarding excess data elements.

&nbsp;

**Example**

> 'n_dist' with **force_equal** set to TRUE would return groups with this many values in them:
>
> 11, 11, 11, 11, 11
>
> meaning that 2 values have been discarded.

&nbsp;

## Method: 'n_fill'

`n_fill` uses a specified number of groups to divide up the data. It creates equal groups as large as possible and places any excess data points in the first groups. By setting `descending = TRUE`, it would fill from the last groups instead.

&nbsp;

**Example**

> We have a vector with 57 values. We want to get back 5 groups.
>
> 'n_fill' with default settings would return groups with this many values in them:
>
> 12, 12, 11, 11, 11

&nbsp;

By setting `force_equal = TRUE`, `n_fill` will create the largest possible, equally sized groups by discarding excess data elements.

&nbsp;

**Example**

> 'n_fill' with **force_equal** set to TRUE would return groups with this many values in them:
>
> 11, 11, 11, 11, 11
>
> meaning that 2 values have been discarded.

&nbsp;

## Method: 'n_last'

`n_last` uses a specified number of groups to divide up the data. With *default settings*, it tries to make the groups as equally sized as possible, but notice that the last group might contain fewer or more elements, if the length of the data is not divisible by the number of groups. All groups, except the last, are guaranteed to contain the same number of elements.

&nbsp;

**Example**

> We have a vector with 57 values. We want to get back 5 groups.
>
> 'n_last' with default settings would return groups with this many values in them:
>
> 11, 11, 11, 11, 13

&nbsp;

By setting `force_equal = TRUE`, `n_last` will create the largest possible, equally sized groups by discarding excess data elements.

&nbsp;

**Example**

> 'n_last' with **force_equal** set to TRUE would return groups with this many values in them:
>
> 11, 11, 11, 11, 11
>
> meaning that 2 values have been discarded.

&nbsp;

Notice that `n_last` will always return the given number of groups. It will never return a group with zero elements. For some situations that means that the last group will contain a lot of elements. Asked to divide a vector with 57 elements into 20 groups, the first 19 groups will contain 2 elements each, while the last group will contain 19 elements. Had we instead asked it to divide the vector into 19 groups, we would have had 3 elements in all groups.

&nbsp;

## Method: 'n_rand'

`n_rand` uses a specified number of groups to divide up the data. It creates equal groups as large as possible and places any excess data points randomly in the groups (max. one extra element per group).

&nbsp;

**Example**

> We have a vector with 57 values. We want to get back 5 groups.
>
> 'n_rand' with default settings **could** return groups with this many values in them:
>
> 12, 11, 11, 11, 12

&nbsp;

By setting `force_equal = TRUE`, `n_rand` will create the largest possible, equally sized groups by discarding excess data elements.

&nbsp;

**Example**

> 'n_rand' with **force_equal** set to TRUE would return groups with this many values in them:
>
> 11, 11, 11, 11, 11
>
> meaning that 2 values have been discarded.

&nbsp;

## Method: 'l_sizes'

`l_sizes` divides up the data by a list of group sizes. Excess data points are placed in an extra group at the end. `n` is a `list`/`vector` of group sizes.

&nbsp;

**Example**

> We have a vector with 57 values. We want to get back 3 groups containing
> 20%, 30% and 50% of the data points.
>
> 'l_sizes' with n = c(0.2, 0.3) would return groups with this many values in them:
>
> 11, 17, 29

&nbsp;

By setting `force_equal = TRUE`, `l_sizes` discards any excess elements.

&nbsp;

**Example**

> 'l_sizes' with n = c(0.2, 0.3) and **force_equal** set to TRUE would return groups with this many values in them:
>
> 11, 17
>
> meaning that 29 values have been discarded.

&nbsp;
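A quick sketch of the `l_sizes` example above (expected sizes as a comment):

```{r}
# Sketch: 20% and 30% groups; the remaining 50% becomes an extra group at the end
groups <- group_factor(c(1:57), c(0.2, 0.3), method = 'l_sizes')
table(groups)  # Expected sizes: 11, 17, 29
```

&nbsp;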
## Method: 'l_starts'

`l_starts` starts new groups at specified values of a vector.

`n` is a `list` of starting positions. Skip values by `c(value, skip_to_number)` where `skip_to_number` is the nth appearance of the value in the `vector`. Groups automatically start from the first data point.

If passing `n = 'auto'` the starting positions are automatically found with `find_starts()`.

If `data` is a `data frame`, `starts_col` must be set to indicate the column to match starts. Set `starts_col` to `'index'` or `'.index'` for matching with row names. `'index'` first looks for a column named `'index'` in `data`, while `'.index'` completely ignores any potential column in `data` named `'.index'`.

&nbsp;

**Example**

> We have a vector with the values 1 to 57. We want to get back groups starting at
> specific values in the vector.
>
> 'l_starts' with n = c(1, 3, 7, 25, 50) would return groups with this many values in them:
>
> 2, 4, 18, 25, 8

&nbsp;

`force_equal` does not have any effect with method `l_starts`.

&nbsp;

### Skipping

Groups can start at the nth appearance of a value by using `c(value, skip_to_number)`.

&nbsp;

**Example**

> We have a vector with the values c("a", "e", "o", "a", "e", "o") and want to start
> groups at the first "a", the first following "e" and the second following "o".
>
> 'l_starts' with n = list("a", "e", c("o", 2)) would return groups with this many values in them:
>
> 1, 4, 1

&nbsp;

### Automatically find group starts

Using the `find_starts()` function, `l_starts` is capable of finding the beginning of groups automatically. A group start is a value which differs from the previous value.

&nbsp;

**Example**

> We have a vector with the values c("a", "a", "o", "o", "o", "a", "a") and want to
> automatically discover groups of data and group them.
>
> 'l_starts' with n = 'auto' would return groups with this many values in them:
>
> 2, 3, 2

&nbsp;

### find_starts()

`find_starts()` finds group starts in a given vector. A group start is a value which differs from the previous value. Setting `return_index = TRUE` returns the indices of the group starts.

&nbsp;

**Example**

> We have a vector with the values c("a", "a", "o", "o", "o", "a", "a") and want to
> automatically discover group starts.
>
> find_starts() would return these group starts:
>
> "a", "o", "a"

&nbsp;

### find_missing_starts()

`find_missing_starts()` tells you the values and (optionally) skip-to numbers that would be recursively removed when using the `l_starts` method with the `remove_missing_starts` argument set to `TRUE`.

Set `return_skip_numbers = FALSE` to get only the missing values without the skip-to numbers.

&nbsp;

**Example**

> We have a vector with the values c("a", "a", "o", "o", "o", "a", "a") and
> a vector of starting positions c("a", "d", "o", "p", "a").
>
> find_missing_starts() would return this list of values and skip_to numbers:
>
> list(c("d", 1), c("p", 1))

&nbsp;
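Here is a small sketch gathering two of the `l_starts` examples above in runnable form (expected results as comments):

```{r}
# Sketch: skipping to the second "o" with c("o", 2)
v <- c("a", "e", "o", "a", "e", "o")
groups <- group_factor(v, list("a", "e", c("o", 2)), method = 'l_starts')
table(groups)  # Expected sizes: 1, 4, 1

# Sketch: finding group starts automatically
find_starts(c("a", "a", "o", "o", "o", "a", "a"))  # Expected: "a", "o", "a"
```

&nbsp;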
## Method: 'every'

`every` greedily combines every `n`th data point.

&nbsp;

**Example**

> We have a vector with 57 values. We want to get back 5 groups.
>
> 'every' with default settings would return groups with this many values in them:
>
> 12, 12, 11, 11, 11

&nbsp;

By setting `force_equal = TRUE`, `every` discards the last `length %% n` data points such that all groups have the same size.

&nbsp;

**Example**

> 'every' with **force_equal** set to TRUE would return groups with this many values in them:
>
> 11, 11, 11, 11, 11
>
> meaning that 2 values have been discarded.

&nbsp;

## Method: 'staircase'

`staircase` uses a step size to divide up the data. For each group, the group size will be the step size multiplied by the group index.

&nbsp;

**Example**

> We have a vector with 57 values. We specify a step size of 5.
>
> 'staircase' with default settings would return groups with this many values in them:
>
> 5, 10, 15, 20, 7

&nbsp;

By setting `force_equal = TRUE`, `staircase` will discard the last group if it does not contain the expected number of values (step size multiplied by group index).

&nbsp;

**Example**

> 'staircase' with **force_equal** set to TRUE would return groups with this many values in them:
>
> 5, 10, 15, 20
>
> meaning that 7 values have been discarded.

&nbsp;

### Find remainder - %staircase%

When using the `staircase` method, the last group might not have the size of the second last group + step size. Use `%staircase%` to find the remainder.

If the last group has the size of the second last group + step size, `%staircase%` will return `0`.

&nbsp;

**Example**

> %staircase% on a vector with size 57 and a step size of 5 would look like this:
>
> 57 %staircase% 5
>
> and return:
>
> 7
>
> meaning that the last group would contain 7 values.

## Method: 'primes'

`primes` creates groups with sizes of prime numbers in a staircasing design. `n` is the prime number to start at (size of first group).

Prime numbers are generated with the [`numbers` package](https://cran.r-project.org/package=numbers) by Hans Werner Borchers.

&nbsp;

**Example**

> We have a vector with 57 values. We specify n (start at) as 5.
>
> 'primes' with default settings would return groups with this many values in them:
>
> 5, 7, 11, 13, 17, 4

&nbsp;

By setting `force_equal = TRUE`, `primes` will discard the last group if it does not contain the expected number of values.

&nbsp;

**Example**

> 'primes' with **force_equal** set to TRUE would return groups with this many values in them:
>
> 5, 7, 11, 13, 17
>
> meaning that 4 values have been discarded.

&nbsp;

### Find remainder - %primes%

When using the `primes` method, the last group might not have the size of the associated prime number, if there are not enough elements. Use `%primes%` to find the remainder.

Returns `0` if the last group has the size of the associated prime number.

&nbsp;

**Example**

> %primes% on a vector with size 57 and n (start at) as 5 would look like this:
>
> 57 %primes% 5
>
> and return:
>
> 4
>
> meaning that the last group would contain 4 values.
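The two remainder operators can be sketched directly (expected results as comments):

```{r}
# Sketch: the size of the (potentially incomplete) last group
57 %staircase% 5  # Expected: 7
57 %primes% 5     # Expected: 4
```

&nbsp;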
# Balancing ID Methods

There are currently 4 methods for balancing on the ID level in `balance()`.

## ID method: 'n_ids'

Balances on the ID level only. It makes sure there is the same number of IDs in each category. This might lead to a different number of rows between categories.

## ID method: 'n_rows_c'

Attempts to level the number of rows per category, while only removing/adding entire IDs. This is done in 2 steps:

1. If a category needs to add all its rows one or more times, the data is repeated.
2. Iteratively, the ID with the number of rows closest to the lacking/excessive number of rows is added/removed. This happens until adding/removing the closest ID would lead to a size further from the target size than the current size. If multiple IDs are closest, one is randomly sampled.

## ID method: 'distributed'

Distributes the lacking/excess rows equally between the IDs. If the number to distribute cannot be divided equally between the IDs, some IDs will have 1 row more/less than the others.

## ID method: 'nested'

Balances the IDs within their categories, meaning that all IDs in a category will have the same number of rows.

# Arguments

## Grouping arguments

These are the arguments for `group_factor()`, `group()`, `splt()`, `fold()`, and `partition()`.

### data

Type: `data frame` or `vector`

The data to process.

&nbsp;

Used in: `group_factor()`, `group()`, `splt()`, `fold()`, `partition()`

### n

Type: `integer`, `numeric`, `character`, or `list`

`n` represents either number of groups (default), group size, list of group sizes, list of group starts, step size or prime number to start at, *depending on which method is specified*.

`n` can be given as **whole number(s)** (n > 1) or as **percentage(s)** (0 < n < 1).

Method `l_starts` allows `n = 'auto'`.

&nbsp;

Used in: `group_factor()`, `group()`, `splt()`

### method

Type: `character`

Choose which method to use when dividing up the data.

Available methods: `greedy`, `n_dist`, `n_fill`, `n_last`, `n_rand`, `l_sizes`, `l_starts`, `every`, `staircase`, or `primes`

&nbsp;

Used in: `group_factor()`, `group()`, `splt()`, `fold()`

### starts_col

Type: `character`

Name of the column with values to match in method `l_starts` when `data` is a `data frame`. Pass `'index'` or `'.index'` to use row names. `'index'` first looks for a column named `'index'` in `data`, while `'.index'` completely ignores any potential column in `data` named `'.index'`.

&nbsp;

Used in: `group_factor()`, `group()`, `splt()`

### force_equal

Type: `logical` (`TRUE` or `FALSE`)

If you need groups with the exact same size, set `force_equal = TRUE`. The implementation differs between the methods. Read more in their sections above.

**Be aware** that this setting discards excess datapoints!

&nbsp;

Used in: `group_factor()`, `group()`, `splt()`, `partition()`

### allow_zero

Type: `logical` (`TRUE` or `FALSE`)

If you set `n` to `0`, you get an error. If you don't want this behavior, you can set `allow_zero = TRUE`, and (depending on the function) you will get the following output:

`group_factor()` will return the factor with `NA`s instead of numbers. It will be the same length as expected.

`group()` will return the expected data frame with `NA`s instead of a grouping factor.

`splt()` will return the given data (`data frame` or `vector`) in the same `list` format as if it had been split.

&nbsp;

Used in: `group_factor()`, `group()`, `splt()`

### descending

Type: `logical` (`TRUE` or `FALSE`)

In methods like `n_fill` where it makes sense to change the direction of the method, you can use this argument. In `n_fill` it fills up the excess data points starting from the last group instead of the first.

NB. Only some of the methods can use this argument.

&nbsp;

Used in: `group_factor()`, `group()`, `splt()`

### randomize

Type: `logical` (`TRUE` or `FALSE`)

After creating the grouping factor using the chosen method, it is possible to randomly reorganize it before returning it. Notice that this **applies to all the functions that allow for the argument**, as `group()` and `splt()` use the grouping factor!

&nbsp;

Used in: `group_factor()`, `group()`, `splt()`

N.B. `fold()` and `partition()` always use some randomization.

### col_name

Type: `character`

Name of the added grouping factor column. Allows multiple grouping factors in a `data frame`.

&nbsp;

Used in: `group()`

### remove_missing_starts

Type: `logical` (`TRUE` or `FALSE`)

Recursively remove elements from the list of starts that are not found. For method `l_starts` only.

&nbsp;

Used in: `group_factor()`, `group()`, `splt()`
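A small sketch of `remove_missing_starts`, assuming a starts list where "d" does not occur in the data; per the description above, it should be removed from the starts before grouping:

```{r}
# Sketch: "d" is not found in the data, so it is recursively removed
# from the starts before grouping (per the documented behavior)
v <- c("a", "a", "o", "o", "o", "a", "a")
group_factor(v, list("a", "d", "o"), method = 'l_starts',
             remove_missing_starts = TRUE)
# Expected: groups starting at the first "a" and the first "o"
```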
### k

Type: `integer` or `numeric`

`k` represents either number of folds (default), fold size, or step size, *depending on which method is specified*.

`k` can be given as a **whole number** (k > 1) or as a **percentage** (0 < k < 1).

&nbsp;

Used in: `fold()`

### p

Type: `integer` or `numeric`

Size(s) of partition(s). Passed as a `vector` if specifying multiple partitions.

`p` can be given as **whole number(s)** (p > 1) or as **percentage(s)** (0 < p < 1).

&nbsp;

Used in: `partition()`

### cat_col

Type: categorical `vector` or `factor` (passed as column name)

Categorical variable to balance between the groups.

E.g. when predicting a binary variable ('a' or 'b'), we usually want both classes represented in every fold and partition.

&nbsp;

**N.B.** If also passing `id_col`, `cat_col` should be constant within IDs. E.g. a participant must have the same diagnosis ('a' or 'b') throughout the dataset. Otherwise, the participant might be placed in multiple folds.

&nbsp;

Used in: `fold()`, `partition()`

### num_col

Type: numerical `vector` (passed as column name)

Numerical variable to balance between groups.

&nbsp;

**N.B.** When used with `id_col`, values for each ID are aggregated using `id_aggregation_fn` before being balanced.

**N.B.** When passing `num_col`, the `method` argument is not used.

&nbsp;

Used in: `fold()`, `partition()`

### id_col

Type: `factor` (passed as column name)

`factor` with IDs. This will be used to keep all rows with an ID in the same group (if possible).

E.g. if we have measured a participant multiple times and want to see the effect of time, we want to have all observations of this participant in the same fold/partition.

&nbsp;

Used in: `fold()`, `partition()`

### id_aggregation_fn

Type: `function`

Function for aggregating values in `num_col` for each ID, before balancing by `num_col`.

**N.B.** Only used when `num_col` and `id_col` are both specified.

&nbsp;

Used in: `fold()`, `partition()`

### extreme_pairing_levels

Type: `integer` or `numeric`

How many levels of extreme pairing to do when balancing groups by `num_col`.

**Extreme pairing**: Rows/pairs are ordered as smallest, largest, second smallest, second largest, etc. If `extreme_pairing_levels > 1`, this is done "recursively" on the extreme pairs.

**N.B.** Values greater than `1` work best with large datasets. Always check if an increase actually makes the groups more balanced. There are examples of how to do this, and more detailed descriptions of the implementations, in the functions' help files (`?fold` and `?partition`).

&nbsp;

Used in: `fold()`, `partition()`

### num_fold_cols

Type: `integer` or `numeric`

Number of fold columns to create. This is useful for repeated cross-validation.

If `num_fold_cols > 1`, columns will be named `".folds_1"`, `".folds_2"`, etc. Otherwise simply `".folds"`.

**N.B.** If `unique_fold_cols_only` is `TRUE`, we can end up with fewer columns than specified, see `max_iters`.

**N.B.** If `data` has existing fold columns, see `handle_existing_fold_cols`.

&nbsp;

Used in: `fold()`

### unique_fold_cols_only

Type: `logical` (`TRUE` or `FALSE`)

Check if the fold columns are identical and keep only the unique columns.

**N.B.** As the column comparisons can be time consuming, we can run this part in parallel. See `parallel`.

**N.B.** We can end up with fewer columns than specified in `num_fold_cols`, see `max_iters`.

**N.B.** Only used when `num_fold_cols > 1` or `data` has existing fold columns.

&nbsp;

Used in: `fold()`
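As a sketch (not run here; a data frame `df` with a `diagnosis` column is assumed), requesting several unique fold columns for repeated cross-validation could look like this:

```{r eval=FALSE}
# Sketch (not run): 5 unique fold columns for repeated cross-validation
# Assumes a data frame `df` with a categorical 'diagnosis' column
df_folded <- fold(df, k = 3, cat_col = "diagnosis",
                  num_fold_cols = 5, unique_fold_cols_only = TRUE)
# With num_fold_cols > 1, the columns are named .folds_1, .folds_2, ...
```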
### max_iters

Type: `integer` or `numeric`

Maximum number of attempts at reaching `num_fold_cols` **unique** fold columns.

When only keeping the unique fold columns, we risk having fewer columns than expected. Hence, we repeatedly create the missing columns and remove those that are not unique. This is done until we have `num_fold_cols` unique fold columns, or we have attempted `max_iters` times.

In some cases, it is not possible to create `num_fold_cols` unique combinations of the dataset, e.g. when specifying `cat_col`, `id_col` and `num_col`. `max_iters` specifies when to stop trying.

**N.B.** We can end up with fewer columns than specified in `num_fold_cols`.

**N.B.** Only used when `num_fold_cols > 1`.

&nbsp;

Used in: `fold()`

### handle_existing_fold_cols

Type: `character`

How to handle existing fold columns. Either `"keep_warn"`, `"keep"`, or `"remove"`.

To add extra fold columns, use `"keep"` or `"keep_warn"`. Note that existing fold columns might be renamed.

To replace the existing fold columns, use `"remove"`.

&nbsp;

Used in: `fold()`

### parallel

Type: `logical` (`TRUE` or `FALSE`)

Whether to parallelize the fold column comparisons, when `unique_fold_cols_only` is `TRUE`.

**N.B.** Requires a registered parallel backend, e.g. with `doParallel::registerDoParallel()`.

&nbsp;

Used in: `fold()`

### list_out

Type: `logical` (`TRUE` or `FALSE`)

Return a `list` of partitions (`TRUE`) or a `grouped data frame` (`FALSE`).

&nbsp;

Used in: `partition()`

&nbsp;

## Balancing arguments

These are the arguments for `balance()`, `upsample()`, and `downsample()`.

### data

Type: `data frame`

The data to process.

&nbsp;

Used in: `balance()`, `upsample()`, `downsample()`

### size

Type: `character`, `numeric`

Size to balance the categories to. Either a specific number, given as a whole number, or one of the following strings: `"min"`, `"max"`, `"mean"`, `"median"`.

&nbsp;

Used in: `balance()`, `upsample()`, `downsample()`

### cat_col

Type: categorical `vector` or `factor` (passed as column name)

Categorical variable to balance the sample size by.

&nbsp;

Used in: `balance()`, `upsample()`, `downsample()`

### id_col

Type: `factor` (passed as column name)

`factor` with IDs. IDs are considered entities, e.g. allowing us to add or remove all rows for an ID. How this is used is up to the `id_method`.

E.g. if we have measured a participant multiple times and want to make sure that we keep all these measurements together, we would either add or remove all measurements for the participant as a whole.

&nbsp;

Used in: `balance()`, `upsample()`, `downsample()`

### id_method

Type: `character`

Method for balancing the IDs.

Available ID methods are: `n_ids`, `n_rows_c`, `distributed`, or `nested`.

&nbsp;

Used in: `balance()`, `upsample()`, `downsample()`

### mark_new_rows

Type: `logical`

Whether to add a column with `1`s for added rows, and `0`s for the original rows.

&nbsp;

Used in: `balance()`, `upsample()`, `downsample()`

### new_rows_col_name

Type: `character`

Name of the column marking new rows.

&nbsp;

Used in: `balance()`, `upsample()`, `downsample()`

&nbsp;
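Pulling these arguments together, a sketch of a typical `balance()` call could look like this (not run here; a data frame `df` with `diagnosis` and `participant` columns is assumed):

```{r eval=FALSE}
# Sketch (not run): balance 'diagnosis' to the median category size,
# adding/removing entire participants via the 'n_rows_c' ID method
df_balanced <- balance(df, size = "median", cat_col = "diagnosis",
                       id_col = "participant", id_method = "n_rows_c",
                       mark_new_rows = TRUE)
```

&nbsp;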
# Using Functions

We will be using the method `n_dist` on a `data frame` to showcase the functions. Afterwards, we will use and compare the methods.

Note that you can also use `vector`s as data with all the functions.

See [Attach Packages] for the packages needed to run the examples.

## group_factor()

1. We create a `data frame`

```{r}
df <- data.frame(
  "x" = c(1:12),
  "species" = factor(rep(c('cat', 'pig', 'human'), 4)),
  "age" = sample(c(1:100), 12)
)
```

2. Using `group_factor()`

```{r}
groups <- group_factor(df, 5, method = 'n_dist')
groups

df$groups <- groups

df %>% kable(align = 'c')
```

3. We could get the mean age of each group

```{r}
df %>%
  group_by(groups) %>%
  summarize(mean_age = mean(age)) %>%
  kable(align = 'c')
```

### force_equal

Getting an equal number of elements per group with `group_factor()`.

Notice that we discard the excess values so all groups contain the same number of elements. Since the grouping factor is shorter than the data frame, we can't combine them as they are. A way to do so would be to shorten the data frame to the same length as the grouping factor.

1. We create a `data frame`

```{r}
df <- data.frame(
  "x" = c(1:12),
  "species" = factor(rep(c('cat', 'pig', 'human'), 4)),
  "age" = sample(c(1:100), 12)
)
```

2. Using `group_factor()` with `force_equal`

```{r}
groups <- group_factor(df, 5, method = 'n_dist', force_equal = TRUE)
groups

plyr::count(groups) %>%
  rename(group = x, size = freq) %>%
  kable(align = 'c')
```

3. Combining `data frame` and grouping factor

First we make the `data frame` the same size as the grouping factor. Then we add the grouping factor to the `data frame`.

```{r}
df <- head(df, length(groups)) %>%
  mutate(group = groups)

df %>% kable(align = 'c')
```

&nbsp;

## group()

1. We create a `data frame`

```{r}
df <- data.frame(
  "x" = c(1:12),
  "species" = factor(rep(c('cat', 'pig', 'human'), 4)),
  "age" = sample(c(1:100), 12)
)
```

2. Using `group()`

```{r}
df_grouped <- group(df, 5, method = 'n_dist')

df_grouped %>% kable(align = 'c')
```

2.2 Using `group()` in a `magrittr` pipeline to get the mean age

```{r}
df_means <- df %>%
  group(5, method = 'n_dist') %>%
  summarise(mean_age = mean(age))

df_means %>% kable(align = 'c')
```

### force_equal

Getting an equal number of elements per group with `group()`.

Notice that we discard the excess rows/elements so all groups contain the same number of elements.

1. We create a `data frame`

```{r}
df <- data.frame(
  "x" = c(1:12),
  "species" = factor(rep(c('cat', 'pig', 'human'), 4)),
  "age" = sample(c(1:100), 12)
)
```

2. Using `group()` with `force_equal`

```{r}
df_grouped <- df %>%
  group(5, method = 'n_dist', force_equal = TRUE)

df_grouped %>% kable(align = 'c')
```

&nbsp;

## splt()

1. We create a `data frame`

```{r}
df <- data.frame(
  "x" = c(1:12),
  "species" = factor(rep(c('cat', 'pig', 'human'), 4)),
  "age" = sample(c(1:100), 12)
)
```

2. Using `splt()`

```{r}
df_list <- splt(df, 5, method = 'n_dist')

df_list %>% kable(align = 'c')
```

3. Let's see the format of the list created by `splt()` without using `kable()` to visualize it. `splt()` uses `base::split()` to split the data by the grouping factor.

```{r}
v <- c(1:6)

splt(v, 3, method = 'n_dist')
```

### force_equal

Getting an equal number of elements per group with `splt()`.

Notice that we discard the excess rows/elements so all groups contain the same number of elements.

1. We create a `data frame`

```{r}
df <- data.frame(
  "x" = c(1:12),
  "species" = factor(rep(c('cat', 'pig', 'human'), 4)),
  "age" = sample(c(1:100), 12)
)
```

2. Using `splt()` with `force_equal`

```{r}
df_list <- splt(df, 5, method = 'n_dist', force_equal = TRUE)

df_list %>% kable(align = 'c')
```

&nbsp;
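As noted above, `splt()` wraps `base::split()` around a grouping factor from `group_factor()`. A minimal sketch of the equivalent manual split:

```{r}
# Sketch: the same split done manually with base::split()
v <- c(1:6)
split(v, group_factor(v, 3, method = 'n_dist'))
```

&nbsp;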
## fold()

1. We create a `data frame`

```{r}
df <- data.frame(
  "participant" = factor(rep(c('1', '2', '3', '4', '5', '6'), 3)),
  "age" = rep(sample(c(1:100), 6), 3),
  "diagnosis" = factor(rep(c('a', 'b', 'a', 'a', 'b', 'b'), 3)),
  "score" = sample(c(1:100), 3 * 6)
)

df <- df %>% arrange(participant)

# Remove index
rownames(df) <- NULL

# Add session info
df$session <- rep(c('1', '2', '3'), 6)

kable(df, align = 'c')
```

2. Using `fold()` without balancing

```{r}
df_folded <- fold(df, 3, method = 'n_dist')

# Order by folds
df_folded <- df_folded %>% arrange(.folds)

kable(df_folded, align = 'c')
```

3. Using `fold()` with balancing but without `id_col`

```{r}
df_folded <- fold(df, 3, cat_col = 'diagnosis', method = 'n_dist')

# Order by folds
df_folded <- df_folded %>% arrange(.folds)

kable(df_folded, align = 'c')
```

Let's count how many of each diagnosis there are in each group.

```{r}
df_folded %>%
  count(.folds, diagnosis) %>%
  kable(align = 'c')
```

4. Using `fold()` with `id_col` but without balancing

```{r}
df_folded <- fold(df, 3, id_col = 'participant', method = 'n_dist')

# Order by folds
df_folded <- df_folded %>% arrange(.folds)

# Remove index (Looks prettier in the table!)
rownames(df_folded) <- NULL

kable(df_folded, align = 'c')
```

Let's see how participants were distributed in the groups.

```{r}
df_folded %>%
  count(.folds, participant) %>%
  kable(align = 'c')
```

5. Using `fold()` with balancing and with `id_col`

`fold()` first divides up the data frame by `cat_col` and then creates `k` folds within each diagnosis. As there are only 3 participants per diagnosis, we can create at most 3 folds in this scenario.

```{r}
df_folded <- fold(df, 3, cat_col = 'diagnosis',
                  id_col = 'participant', method = 'n_dist')

# Order by folds
df_folded <- df_folded %>% arrange(.folds)

kable(df_folded, align = 'c')
```

Let's count how many of each diagnosis there are in each group and find which participants are in which groups.

```{r}
df_folded %>%
  count(.folds, diagnosis, participant) %>%
  kable(align = 'c')
```

&nbsp;

## partition()

1. We create a `data frame`

```{r}
df <- data.frame(
  "participant" = factor(rep(c('1', '2', '3', '4', '5', '6'), 3)),
  "age" = rep(sample(c(1:100), 6), 3),
  "diagnosis" = factor(rep(c('a', 'b', 'a', 'a', 'b', 'b'), 3)),
  "score" = sample(c(1:100), 3 * 6)
)

df <- df %>% arrange(participant)

# Remove index
rownames(df) <- NULL

# Add session info
df$session <- rep(c('1', '2', '3'), 6)

kable(df, align = 'c')
```

2. Using `partition()` without balancing

```{r}
df_partitioned <- partition(df, 0.3, list_out = FALSE)

# Order by partitions
df_partitioned <- df_partitioned %>% arrange(.partitions)

# Partition Sizes
df_partitioned %>%
  count(.partitions) %>%
  kable(align = 'c')

kable(df_partitioned, align = 'c')
```

3. Using `partition()` with balancing but without `id_col`

```{r}
df_partitioned <- partition(df, 0.3, cat_col = 'diagnosis', list_out = FALSE)

# Order by partitions
df_partitioned <- df_partitioned %>% arrange(.partitions)

kable(df_partitioned, align = 'c')
```

Let's count how many of each diagnosis there are in each partition.

```{r}
df_partitioned %>%
  count(.partitions, diagnosis) %>%
  kable(align = 'c')
```

4. Using `partition()` with `id_col` but without balancing

```{r}
df_partitioned <- partition(df, 0.5, id_col = 'participant', list_out = FALSE)

# Order by partitions
df_partitioned <- df_partitioned %>% arrange(.partitions)

kable(df_partitioned, align = 'c')
```

Let's see how participants were distributed in the partitions.
```{r}
df_partitioned %>%
  count(.partitions, participant) %>%
  kable(align = 'c')
```

5. Using `partition()` with `cat_col` and with `id_col`

`partition()` first divides up the data frame by `cat_col` and then creates the partitions within each diagnosis.

```{r}
df_partitioned <- partition(
  data = df,
  p = 0.5,
  cat_col = 'diagnosis',
  id_col = 'participant',
  list_out = FALSE
)

# Order by partitions
df_partitioned <- df_partitioned %>% arrange(.partitions)

kable(df_partitioned, align = 'c')
```

Let's count how many of each diagnosis there are in each group and find which participants are in which partitions.

```{r}
df_partitioned %>%
  count(.partitions, diagnosis, participant) %>%
  kable(align = 'c')
```

&nbsp;

## balance()

1. We create a `data frame`

```{r echo=FALSE}
set.seed(2)
```

```{r}
df <- data.frame(
  "participant" = factor(rep(c('1', '2', '3', '4', '5', '6'), 3)),
  "age" = rep(sample(c(1:100), 6), 3),
  "diagnosis" = factor(rep(c('a', 'b', 'a', 'a', 'b', 'b'), 3)),
  "score" = sample(c(1:100), 3 * 6)
)

df <- df %>% arrange(participant)

# Add session info
df$session <- rep(c('1', '2', '3'), 6)

# Sample dataset to get imbalances
df <- df %>%
  sample_frac(0.7) %>%
  arrange(participant)

# Remove index
rownames(df) <- NULL

# Counts
df %>%
  count(diagnosis, participant) %>%
  kable(align = 'c')

df %>%
  count(diagnosis) %>%
  kable(align = 'c')

kable(df, align = 'c')
```

2. Using `balance()` for downsampling without `id_col`

```{r}
df_balanced <- balance(df, "min", cat_col = "diagnosis") %>%
  arrange(diagnosis, participant)

# Counts
df_balanced %>%
  count(diagnosis) %>%
  kable(align = 'c')

kable(df_balanced, align = 'c')
```

3. Using `balance()` for downsampling with `id_col` and `id_method` `n_rows_c`

```{r}
df_balanced <- balance(df, "min", cat_col = "diagnosis",
                       id_col = "participant", id_method = "n_rows_c") %>%
  arrange(diagnosis, participant)

# Counts
df_balanced %>%
  count(diagnosis) %>%
  kable(align = 'c')

kable(df_balanced, align = 'c')
```

Let's count how many rows each participant has in each diagnosis.

```{r}
df_balanced %>%
  count(diagnosis, participant) %>%
  kable(align = 'c')
```

4. Using `balance()` for upsampling without `id_col`

```{r}
df_balanced <- balance(df, "max", cat_col = "diagnosis") %>%
  arrange(diagnosis, participant)

# Counts
df_balanced %>%
  count(diagnosis) %>%
  kable(align = 'c')

kable(df_balanced, align = 'c')
```

5. Using `balance()` for upsampling with `id_col` and `id_method` `n_rows_c`

```{r}
df_balanced <- balance(df, "max", cat_col = "diagnosis",
                       id_col = "participant", id_method = "n_rows_c") %>%
  arrange(diagnosis, participant)

# Counts
df_balanced %>%
  count(diagnosis) %>%
  kable(align = 'c')

kable(df_balanced, align = 'c')
```

Let's count how many rows each participant has in each diagnosis.

```{r}
df_balanced %>%
  count(diagnosis, participant) %>%
  kable(align = 'c')
```

6. Using `balance()` for downsampling to a specific size without `id_col`

```{r}
df_balanced <- balance(df, 3, cat_col = "diagnosis") %>%
  arrange(diagnosis, participant)

# Counts
df_balanced %>%
  count(diagnosis) %>%
  kable(align = 'c')

kable(df_balanced, align = 'c')
```

Let's count how many rows each participant has in each diagnosis.

```{r}
df_balanced %>%
  count(diagnosis, participant) %>%
  kable(align = 'c')
```
7. Using `balance()` for downsampling to a specific size with `id_col` and `id_method` `n_rows_c`

```{r}
df_balanced <- balance(df, 3, cat_col = "diagnosis",
                       id_col = "participant", id_method = "n_rows_c") %>%
  arrange(diagnosis, participant)

# Counts
df_balanced %>%
  count(diagnosis) %>%
  kable(align = 'c')

kable(df_balanced, align = 'c')
```

Let's count how many rows each participant has in each diagnosis.

```{r}
df_balanced %>%
  count(diagnosis, participant) %>%
  kable(align = 'c')
```

&nbsp;

## Extra arguments showcase

### randomize

1. We create a `data frame`

```{r}
df <- data.frame(
  "x" = c(1:12),
  "species" = factor(rep(c('cat', 'pig', 'human'), 4)),
  "age" = sample(c(1:100), 12)
)
```

2. We use `group_factor()` with `randomize` set to `TRUE`

```{r}
groups <- group_factor(df, 5, method = 'n_dist', randomize = TRUE)
groups
```

3. We use `splt()` with `randomize` set to `TRUE`

Notice that the index has been shuffled but the group sizes are the same as before!

```{r}
df_list <- splt(df, 5, method = 'n_dist', randomize = TRUE)

df_list %>% kable(align = 'c')
```

# Examples of method differences

In this section, we will take a look at the outputs we get from the different methods.

## n_ methods

### Vector with 57 elements divided into 6 groups

Below you'll see a data frame with counts of group elements when dividing up the same data with the different `n_*` methods. The *forced_equal* column is simply one of the methods (`n_last`) with `force_equal` set to `TRUE`.

*forced_equal*: Since this setting exists to make all groups the same size, it is no surprise that all the groups have the same size.

*n_dist*: Compared to *forced_equal*, we see the 3 datapoints that *forced_equal* had discarded. These are distributed across the groups (in this example, groups 2, 4 and 6).

*n_fill*: The 3 extra datapoints are located at the first 3 groups. Had we set `descending = TRUE`, it would have been the last 3 groups instead.

*n_last*: We see that `n_last` creates equal group sizes in all but the last group. This means that the last group can sometimes have a group size which is very large or small compared to the other groups. Here it is a third larger than the other groups.

*n_rand*: The extra datapoints are placed randomly, and so we would see the extra datapoints located at different groups if we ran the script again. *Unless we use* `set.seed()` *before running the function.*

```{r echo=FALSE, eval=requireNamespace("ggplot2")}
#
# Examples to show the difference between methods.
# This could be made interactive! That way, you could test what happens
# in different situations by simply moving a slider!
#
vec <- c(1:57)
n <- 6

if (exists('n_meth_v57n6')) {
  rm(n_meth_v57n6)
}

for (meth in c('n_dist', 'n_fill', 'n_last', 'n_rand')) {
  data_temp <- data.frame(plyr::count(group_factor(vec, n, method = meth)))
  names(data_temp)[names(data_temp) == "freq"] <- meth

  if (exists('n_meth_v57n6')) {
    n_meth_v57n6 <- cbind(n_meth_v57n6, data_temp)
  } else {
    n_meth_v57n6 <- data_temp
  }
}

forced_equal <- plyr::count(group_factor(vec, n, method = 'n_last', force_equal = TRUE))

n_meth_v57n6$forced_equal <- forced_equal$freq
n_meth_v57n6 <- n_meth_v57n6[, !duplicated(colnames(n_meth_v57n6))]

# gather() data frame for plotting
data_plot <- n_meth_v57n6 %>%
  gather(method, group_size, -1)

upper_limit <- max(data_plot$group_size) + 1
lower_limit <- min(data_plot$group_size) - 1

v57n6_plot <- ggplot(data_plot, aes(x, group_size))

## Output

# Data frame
n_meth_v57n6

# Plot
v57n6_plot +
  geom_point() +
  scale_y_continuous(limit = c(lower_limit, upper_limit),
                     breaks = round(seq(lower_limit, upper_limit, by = 2), 1)) +
  facet_wrap('method', ncol = 1) +
  labs(x = 'group', y = 'group size',
       title = 'Distribution of Elements in groups') +
  theme_bw() +
  theme(axis.text.y = element_text(size = 9),
        axis.text.x = element_text(size = 9))
```

&nbsp;

### Vector with 117 elements divided into 11 groups

Here is another example.

```{r echo=FALSE, eval=requireNamespace("ggplot2")}
vec <- c(1:117)
n <- 11

if (exists('n_meth_v117n11')) {
  rm(n_meth_v117n11)
}

for (meth in c('n_dist', 'n_fill', 'n_last', 'n_rand')) {
  data_temp <- data.frame(plyr::count(group_factor(vec, n, method = meth)))
  names(data_temp)[names(data_temp) == "freq"] <- meth

  if (exists('n_meth_v117n11')) {
    n_meth_v117n11 <- cbind(n_meth_v117n11, data_temp)
  } else {
    n_meth_v117n11 <- data_temp
  }
}

forced_equal <- plyr::count(group_factor(vec, n, method = 'n_last', force_equal = TRUE))

n_meth_v117n11$forced_equal <- forced_equal$freq
n_meth_v117n11 <- n_meth_v117n11[, !duplicated(colnames(n_meth_v117n11))]

# gather() data frame for plotting
data_plot <- n_meth_v117n11 %>%
  gather(method, group_size, -1)

v117n11_plot <- ggplot(data_plot, aes(x, group_size))

upper_limit <- max(data_plot$group_size) + 1
lower_limit <- min(data_plot$group_size) - 1

## Output

# Data frame
n_meth_v117n11

# Plot
v117n11_plot +
  geom_point() +
  scale_y_continuous(limit = c(lower_limit, upper_limit),
                     breaks = round(seq(lower_limit, upper_limit, by = 2), 1)) +
  facet_wrap('method', ncol = 1) +
  labs(x = 'group', y = 'group size',
       title = 'Distribution of Elements in groups') +
  theme_bw() +
  theme(axis.text.y = element_text(size = 9),
        axis.text.x = element_text(size = 9))
```

&nbsp;
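If you only want the size counts from such comparisons, without the plotting code, a minimal sketch could look like this (the `n_rand` column varies between runs):

```{r}
# Sketch: group sizes per n_ method for 57 values in 6 groups
sapply(c('n_dist', 'n_fill', 'n_last', 'n_rand'),
       function(meth) as.vector(table(group_factor(c(1:57), 6, method = meth))))
```

&nbsp;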
## Greedy

### Vector with 100 elements with sizes of 8, 15, 20

Below you will see the group sizes when using the method `greedy` and asking for group sizes of 8, 15, and 20.

What should become clear is that only the last group can have a different group size than what we asked for. This is important if, say, you want to split a time series into groups of 100 elements, but its length is not divisible by 100. Then you could use `force_equal` to remove the excess elements, if you need equal groups.

With a size of 8, we get 13 groups. The last group (13) only contains 4 elements, but all the other groups contain 8 elements as specified.

With a size of 15, we get 7 groups. The last group (7) contains only 10 elements, but all the other groups contain 15 elements as specified.

With a size of 20, we get 5 groups. As 20 divides the 100 elements evenly, the last group also contains 20 elements, and so we have equal groups.

```{r echo=FALSE, eval=requireNamespace("ggplot2")}
vec <- c(1:100)

if (exists('greedy_data')) {
  rm(greedy_data)
}

for (n in c(8, 15, 20)) {
  group_sizes <- plyr::count(group_factor(vec, n, method = 'greedy'))
  data_temp <- data.frame(group_sizes, 'Size' = factor(n))

  if (exists('greedy_data')) {
    greedy_data <- rbind(greedy_data, data_temp)
  } else {
    greedy_data <- data_temp
  }
}

greedy_plot <- ggplot(greedy_data, aes(x, freq, color = Size))

greedy_plot +
  geom_point() +
  labs(x = 'group', y = 'group size',
       title = 'Greedy Distribution of Elements in groups',
       color = 'Size') +
  theme_bw() +
  theme(plot.margin = unit(c(1, 1, 1, 1), "cm")) +
  theme(axis.text.y = element_text(size = 9),
        axis.text.x = element_text(size = 9))
```

&nbsp;

## Staircasing

### Vector with 1000 elements with step sizes of 2, 5, 11

Below you'll see a plot with the group sizes at each group when using step sizes 2, 5, and 11.

At a step size of 2 elements, it simply increases by 2 for each group, until the last group (32) where it runs out of elements. Had we set `force_equal = TRUE`, this last group would have been discarded, because of the lack of elements.

At a step size of 5 elements, it increases by 5 every time. Because of this it runs out of elements faster. Again we see that the last group (20) has fewer elements.

At a step size of 11 elements, it increases by 11 every time. It seems that the last group is not too small, but it can be hard to see on this scale. Actually, the last group is 1 element short of being complete, and so it would have been discarded if `force_equal` was set to `TRUE`.

&nbsp;

```{r echo=FALSE, eval=requireNamespace("ggplot2")}
vec <- c(1:1000)

if (exists('staircase_data')) {
  rm(staircase_data)
}

for (n in c(2, 5, 11)) {
  group_sizes <- plyr::count(group_factor(vec, n, method = 'staircase'))
  data_temp <- data.frame(group_sizes, 'step_size' = factor(n))

  if (exists('staircase_data')) {
    staircase_data <- rbind(staircase_data, data_temp)
  } else {
    staircase_data <- data_temp
  }
}

staircase_plot <- ggplot(staircase_data, aes(x, freq, color = step_size))

staircase_plot +
  geom_point() +
  labs(x = 'group', y = 'group size',
       title = 'Staircasing Distribution of Elements in groups',
       color = 'Step Size') +
  theme_bw() +
  theme(axis.text.y = element_text(size = 9),
        axis.text.x = element_text(size = 7))
```

&nbsp;

Below we will take a quick look at the cumulative sum of group elements to get an idea of what is going on under the hood. Remember that the split vector had 1000 elements? That is why they all stop at 1000 on the y-axis. There are simply no more elements left!

```{r echo=FALSE, eval=requireNamespace("ggplot2")}
staircase_data <- staircase_data %>%
  group_by(step_size) %>%
  mutate(cumsum = cumsum(freq))

staircase_cumulative_plot <- ggplot(staircase_data, aes(x, cumsum, color = step_size))

staircase_cumulative_plot +
  geom_point() +
  labs(x = 'group', y = 'Cumulative sum of group sizes',
       title = 'Staircasing Cumulative Sum of group Sizes',
       color = 'Step Size') +
  theme_bw() +
  theme(axis.text.y = element_text(size = 9),
        axis.text.x = element_text(size = 7))
```

## Primes

### Vector with 1000 elements with n (start at) as 2, 5, 11

Below you'll see a plot with the group sizes at each group when starting from the prime numbers 2, 5, and 11.

&nbsp;
```{r echo=FALSE, eval=requireNamespace("ggplot2")}
vec <- c(1:1000)

if (exists('primes_data')) {
  rm(primes_data)
}

for (n in c(2, 5, 11)) {
  group_sizes <- plyr::count(group_factor(vec, n, method = 'primes'))
  data_temp <- data.frame(group_sizes, 'start_at' = factor(n))

  if (exists('primes_data')) {
    primes_data <- rbind(primes_data, data_temp)
  } else {
    primes_data <- data_temp
  }
}

primes_plot <- ggplot(primes_data, aes(x, freq, color = start_at))

primes_plot +
  geom_point() +
  labs(x = 'group', y = 'group size',
       title = 'Prime numbers method - Elements per group',
       color = 'Start at') +
  theme_bw() +
  theme(axis.text.y = element_text(size = 9),
        axis.text.x = element_text(size = 7))
```

&nbsp;

Below we will take a quick look at the cumulative sum of group elements to get an idea of what is going on under the hood. Because the split vector had 1000 elements, it stops at 1000 on the y-axis. There are simply no more elements left!

```{r echo=FALSE, eval=requireNamespace("ggplot2")}
primes_data <- primes_data %>%
  group_by(start_at) %>%
  mutate(cumsum = cumsum(freq))

primes_cumulative_plot <- ggplot(primes_data, aes(x, cumsum, color = start_at))

primes_cumulative_plot +
  geom_point() +
  labs(x = 'group', y = 'Cumulative sum of group sizes',
       title = 'Primes Cumulative Sum of group Sizes',
       color = 'Start At') +
  theme_bw() +
  theme(axis.text.y = element_text(size = 9),
        axis.text.x = element_text(size = 7))
```

# The End

You have reached the end! Now celebrate by taking the week off, splitting data and laughing!