Title: Creating Groups from Data
Description: Methods for dividing data into groups. Create balanced partitions and cross-validation folds. Perform time series windowing and general grouping and splitting of data. Balance existing groups with up- and downsampling or collapse them to fewer groups.
Authors: Ludvig Renbo Olsen [aut, cre] (<https://orcid.org/0009-0006-6798-7454>, @ludvigolsen)
Maintainer: Ludvig Renbo Olsen <[email protected]>
License: MIT + file LICENSE
Version: 2.0.4
Built: 2024-11-27 04:56:12 UTC
Source: https://github.com/ludvigolsen/groupdata2
Methods for dividing data into groups. Create balanced partitions and cross-validation folds. Perform time series windowing and general grouping and splitting of data. Balance existing groups with up- and downsampling.
The groupdata2 package provides six main functions: group(), group_factor(), splt(), partition(), fold(), and balance().
Create groups from your data.
Divides data into groups by a wide range of methods. Creates a grouping factor with 1s for group 1, 2s for group 2, etc. Returns a data.frame grouped by the grouping factor for easy use in magrittr pipelines.
Go to group()
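As a quick illustration, a minimal sketch (assuming the default ".groups" column name and the default "n_dist" method):

# Divide 9 rows into 3 groups and use the grouping in a pipeline
library(groupdata2)
library(dplyr)

df <- data.frame(x = 1:9)

df %>%
  group(n = 3) %>%               # adds a ".groups" grouping factor
  summarise(mean_x = mean(x))    # one row per group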
Create grouping factor for subsetting your data.
Divides data into groups by a wide range of methods. Creates and returns a grouping factor with 1s for group 1, 2s for group 2, etc.
Go to group_factor()
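A minimal sketch of the difference from group(): only the factor is returned.

# Get just the grouping factor for 9 elements
library(groupdata2)

group_factor(1:9, n = 3)   # e.g. 1 1 1 2 2 2 3 3 3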
Split data by a wide range of methods.
Divides data into groups by a wide range of methods. Splits data by these groups.
Go to splt()
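A minimal sketch; splt() returns a list with one element per group:

# Split a vector into 3 groups
library(groupdata2)

splt(1:9, n = 3)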
Create balanced partitions (e.g. training/test sets).
Splits data into partitions. Balances a given categorical variable between partitions and keeps (if possible) all data points with a shared ID (e.g. participant_id) in the same partition.
Go to partition()
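A minimal sketch, assuming the default list output (where the first element is the p-sized partition):

# 20/80 partition balanced by `diagnosis`,
# keeping all rows of a participant in one partition
library(groupdata2)

df <- data.frame(
  participant = factor(rep(1:10, each = 2)),
  diagnosis   = factor(rep(c("a", "b"), each = 10)),
  score       = rnorm(20)
)

parts <- partition(df, p = 0.2, cat_col = "diagnosis", id_col = "participant")
test  <- parts[[1]]   # ~20% of the data
train <- parts[[2]]   # the rest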
Create balanced folds for cross-validation.
Divides data into groups (folds) by a wide range of methods. Balances a given categorical variable between folds and keeps (if possible) all data points with the same ID (e.g. participant_id) in the same fold.
Go to fold()
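A minimal sketch (not taken from the package's own examples):

# 5 folds balanced by `diagnosis`; adds a ".folds" column
library(groupdata2)

df <- data.frame(
  diagnosis = factor(rep(c("a", "b"), each = 25)),
  score     = rnorm(50)
)

df <- fold(df, k = 5, cat_col = "diagnosis")
table(df$.folds, df$diagnosis)   # both classes in every fold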
Balance the sizes of your groups with up- and downsampling.
Uses up- and/or downsampling to fix the group sizes to the min, max, mean, or median group size or to a specific number of rows. Has a set of methods for balancing on ID level.
Go to balance()
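A minimal sketch (not taken from the package's own examples):

# Downsample both diagnosis groups to the smallest group's size
library(groupdata2)

df <- data.frame(
  diagnosis = factor(c(rep("a", 10), rep("b", 4))),
  score     = rnorm(14)
)

balanced <- balance(df, size = "min", cat_col = "diagnosis")
table(balanced$diagnosis)   # 4 rows of each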
Ludvig Renbo Olsen, [email protected]
Useful links:
Report bugs at https://github.com/ludvigolsen/groupdata2/issues
When using the "primes" method, the last group might not have the size of the associated prime number if there are not enough elements left. Use %primes% to find this remainder.
size %primes% start_at
size — Size to group. (Integer)
start_at — Prime to start at. (Integer)
Remainder (Integer). Returns 0 if the last group has the size of the associated prime number.
Ludvig Renbo Olsen, [email protected]
Other staircase tools: %staircase%(), group(), group_factor()
Other remainder tools: %staircase%()
# Attach packages
library(groupdata2)

100 %primes% 2
When using the "staircase" method, the last group might not have the size of the second last group + step size. Use %staircase% to find this remainder.
size %staircase% step_size
size — Size to staircase. (Integer)
step_size — Step size. (Integer)
Remainder (Integer). Returns 0 if the last group has the size of the second last group + step size.
Ludvig Renbo Olsen, [email protected]
Other staircase tools: %primes%(), group(), group_factor()
Other remainder tools: %primes%()
# Attach packages
library(groupdata2)

100 %staircase% 2

# Finding remainder with value 0
size = 150
for (step_size in c(1:30)){
  if(size %staircase% step_size == 0){
    print(step_size)
  }
}
Checks whether two grouping factors contain the same groups, looking only at the group members, allowing for different group names / identifiers.
all_groups_identical(x, y)
x, y — Two grouping factors to compare. N.B. Both are converted to character vectors.
Both factors are sorted by `x`. A grouping factor is created with new groups starting at the values in `y` which differ from the previous row (i.e. group() with method = "l_starts" and n = "auto"). A similar grouping factor is created for `x`, to have group identifiers range from 1 to the number of groups. The two generated grouping factors are tested for equality.

Whether all groups in `x` are the same in `y`, memberwise. (logical)
Ludvig Renbo Olsen, [email protected]
Other grouping functions: collapse_groups(), collapse_groups_by, fold(), group(), group_factor(), partition(), splt()
# Attach groupdata2
library(groupdata2)

# Same groups, different identifiers
x1 <- c(1, 1, 2, 2, 3, 3)
x2 <- c(2, 2, 1, 1, 4, 4)
all_groups_identical(x1, x2) # TRUE

# Same groups, different identifier types
x1 <- c(1, 1, 2, 2, 3, 3)
x2 <- c("a", "a", "b", "b", "c", "c")
all_groups_identical(x1, x2) # TRUE

# Not same groups
# Note that all groups must be the same to return TRUE
x1 <- c(1, 1, 2, 2, 3, 3)
x2 <- c(1, 2, 2, 3, 3, 3)
all_groups_identical(x1, x2) # FALSE

# Different number of groups
x1 <- c(1, 1, 2, 2, 3, 3)
x2 <- c(1, 1, 1, 2, 2, 2)
all_groups_identical(x1, x2) # FALSE
Uses up- and/or downsampling to fix the group sizes to the min, max, mean, or median group size or to a specific number of rows. Has a range of methods for balancing on ID level.
balance(
  data,
  size,
  cat_col,
  id_col = NULL,
  id_method = "n_ids",
  mark_new_rows = FALSE,
  new_rows_col_name = ".new_row"
)
data — data.frame.
size — Size to fix group sizes to. Can be a specific number, given as a whole number, or one of the following strings: "min", "max", "mean", "median".
  number: Fix each group to have the size of the specified number of rows. Uses downsampling for groups with too many rows and upsampling for groups with too few rows.
  min: Fix each group to have the size of the smallest group in the dataset. Uses downsampling on all groups that have too many rows.
  max: Fix each group to have the size of the largest group in the dataset. Uses upsampling on all groups that have too few rows.
  mean: Fix each group to have the mean group size in the dataset. The mean is rounded. Uses downsampling for groups with too many rows and upsampling for groups with too few rows.
  median: Fix each group to have the median group size in the dataset. The median is rounded. Uses downsampling for groups with too many rows and upsampling for groups with too few rows.
cat_col — Name of categorical variable to balance by. (Character)
id_col — Name of factor with IDs. (Character) IDs are considered entities, e.g. allowing us to add or remove all rows for an ID. How this is used is up to the `id_method`. E.g. if we have measured a participant multiple times and want to make sure that we keep all these measurements, we would either remove/add all of the participant's measurements or keep all of them. N.B. When `data` is a grouped data.frame (see dplyr::group_by()), IDs that appear in multiple groupings are considered separate entities within those groupings.
id_method — Method for balancing the IDs. (Character)
  n_ids (default): Balances on ID level only. It makes sure there are the same number of IDs for each category. This might lead to a different number of rows between categories.
  n_rows_c: Attempts to level the number of rows per category, while only removing/adding entire IDs. This is done in 2 steps.
  distributed: Distributes the lacking/excess rows equally between the IDs. If the number to distribute cannot be equally divided, some IDs will have 1 row more/less than the others.
  nested: Calls balance() on each category with the IDs as the categories. I.e. if size is "min", IDs will have the size of the smallest ID in their category.
mark_new_rows — Add column with 1s for added rows and 0s for original rows. (Logical)
new_rows_col_name — Name of column marking new rows. Defaults to ".new_row".
Without `id_col`: Upsampling is done with replacement for added rows, while the original data remains intact. Downsampling is done without replacement, meaning that rows are not duplicated but only removed.

With `id_col`: See `id_method` description.

data.frame with added and/or deleted rows. Ordered by potential grouping variables, `cat_col` and (potentially) `id_col`.
Ludvig Renbo Olsen, [email protected]
Other sampling functions: downsample(), upsample()
# Attach packages
library(groupdata2)

# Create data frame
df <- data.frame(
  "participant" = factor(c(1, 1, 2, 3, 3, 3, 3, 4, 4, 5, 5, 5, 5)),
  "diagnosis" = factor(c(0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0)),
  "trial" = c(1, 2, 1, 1, 2, 3, 4, 1, 2, 1, 2, 3, 4),
  "score" = sample(c(1:100), 13)
)

# Using balance() with specific number of rows
balance(df, 3, cat_col = "diagnosis")

# Using balance() with min
balance(df, "min", cat_col = "diagnosis")

# Using balance() with max
balance(df, "max", cat_col = "diagnosis")

# Using balance() with id_method "n_ids"
# With column specifying added rows
balance(df, "max",
  cat_col = "diagnosis",
  id_col = "participant",
  id_method = "n_ids",
  mark_new_rows = TRUE
)

# Using balance() with id_method "n_rows_c"
# With column specifying added rows
balance(df, "max",
  cat_col = "diagnosis",
  id_col = "participant",
  id_method = "n_rows_c",
  mark_new_rows = TRUE
)

# Using balance() with id_method "distributed"
# With column specifying added rows
balance(df, "max",
  cat_col = "diagnosis",
  id_col = "participant",
  id_method = "distributed",
  mark_new_rows = TRUE
)

# Using balance() with id_method "nested"
# With column specifying added rows
balance(df, "max",
  cat_col = "diagnosis",
  id_col = "participant",
  id_method = "nested",
  mark_new_rows = TRUE
)
Collapses a set of groups into a smaller set of groups.

Attempts to balance the new groups by specified numerical columns, categorical columns, level counts in ID columns, and/or the number of rows (size).

Note: The more of these you balance at a time, the less balanced each of them may become. While, on average, the balancing works better than without, this is not guaranteed on every run. Enabling `auto_tune` can yield a much better overall balance than without in most contexts. This generates a larger set of group columns using all combinations of the balancing columns and selects the most balanced group column(s). This is slower and we recommend enabling parallelization (see `parallel`).

While this balancing algorithm will not be optimal in all cases, it allows balancing a large number of columns at once. Especially with auto-tuning enabled, this can be very powerful.

Tip: Check the balances of the new groups with summarize_balances() and ranked_balances().

Note: The categorical and ID balancing algorithms are different from those in fold() and partition().
collapse_groups(
  data,
  n,
  group_cols,
  cat_cols = NULL,
  cat_levels = NULL,
  num_cols = NULL,
  id_cols = NULL,
  balance_size = TRUE,
  auto_tune = FALSE,
  weights = NULL,
  method = "balance",
  group_aggregation_fn = mean,
  num_new_group_cols = 1,
  unique_new_group_cols_only = TRUE,
  max_iters = 5,
  extreme_pairing_levels = 1,
  combine_method = "avg_standardized",
  col_name = ".coll_groups",
  parallel = FALSE,
  verbose = TRUE
)
data — data.frame.
n — Number of new groups.
group_cols — Names of factors in `data` for identifying the existing groups. Multiple names are treated as in dplyr::group_by() (i.e. a hierarchy of groups). Note: Do not confuse these group columns with potential columns that `data` is grouped by.
cat_cols — Names of categorical columns to balance the average frequency of one or more levels of.
cat_levels — Names of the levels in the `cat_cols` columns to balance the average frequencies of. Can include weights for prioritizing specific levels; the weights are automatically scaled to sum to 1.
num_cols — Names of numerical columns to balance between groups.
id_cols — Names of factor columns with IDs to balance the counts of between groups. E.g. useful to get a similar number of participants in each group.
balance_size — Whether to balance the size of the collapsed groups. (logical)
auto_tune — Whether to create a larger set of collapsed group columns from all combinations of the balancing dimensions and select the overall most balanced group column(s). This tends to create much more balanced collapsed group columns, but can be slow, which is why we recommend enabling parallelization (see `parallel`).
weights — Named vector with weights for the balancing dimensions. The weights are automatically scaled to sum to 1. Dimensions that are not given a weight are automatically given the weight 1 (before scaling). E.g. c("size" = 1, "cat" = 1, "num" = 2, "id" = 1).
method — After calculating a combined balancing column from each of the balancing columns (see Details), one of "balance" (default), "ascending", or "descending" for creating the groups from it (see Details).
group_aggregation_fn — Function for aggregating values in the `num_cols` columns for each group in `group_cols`. Default is mean. When using sum, consider whether you also want to balance the size of the groups. N.B. Only used when `num_cols` is specified.
num_new_group_cols — Number of group columns to create. When > 1, group column names are the concatenation of `col_name` and their index (e.g. ".coll_groups_1"). N.B. When `unique_new_group_cols_only` is TRUE, we can end up with fewer columns than specified (see `max_iters`).
unique_new_group_cols_only — Whether to only return unique new group columns. As the number of column comparisons can be quite time consuming, we recommend enabling parallelization. See `parallel`. N.B. We can end up with fewer columns than specified in `num_new_group_cols` (see `max_iters`). N.B. Only used when `num_new_group_cols` > 1.
max_iters — Maximum number of attempts at reaching `num_new_group_cols` unique new group columns. When only keeping unique new group columns, we risk having fewer columns than expected. Hence, we repeatedly create the missing columns and remove those that are not unique. This is done until we have `num_new_group_cols` unique columns or `max_iters` attempts have been made. In some cases, it is not possible to create `num_new_group_cols` unique group columns; increasing `max_iters` then only increases the run time. N.B. Only used when `unique_new_group_cols_only` is TRUE.
extreme_pairing_levels — How many levels of extreme pairing to do when balancing the groups by the combined balancing column (see Details). Extreme pairing: Rows/pairs are ordered as smallest, largest, second smallest, second largest, etc. If `extreme_pairing_levels` > 1, this is done "recursively" on the pairs. N.B. Larger values work best with large datasets. If set too high, the result might not be stochastic. Always check if an increase actually makes the groups more balanced.
combine_method — Method to combine the balancing columns by. One of "avg_standardized" (default) or "avg_min_max_scaled". For each balancing column (see Details), the columns are standardized or MinMax-scaled before being averaged (weighted by `weights`).
col_name — Name of the new group column. When creating multiple new group columns (`num_new_group_cols` > 1), their names are the concatenation of `col_name` and their index (e.g. ".coll_groups_1").
parallel — Whether to parallelize the group column comparisons when `unique_new_group_cols_only` is TRUE. Especially highly recommended when `auto_tune` is enabled. Requires a registered parallel backend, like doParallel::registerDoParallel.
verbose — Whether to print information about the process. May make the function slightly slower. N.B. Currently only used during auto-tuning.
The goal of collapse_groups() is to combine existing groups to a lower number of groups while (optionally) balancing one or more numeric, categorical and/or ID columns, along with the group size.

For each of these columns (and size), we calculate a normalized, numeric "balancing column" that, when balanced between the groups, leads to its original column being balanced as well. To balance multiple columns at once, we combine their balancing columns with weighted averaging (see `combine_method` and `weights`) to a single combined balancing column.

Finally, we create groups where this combined balancing column is balanced between the groups, using the numerical balancing in fold().

This strategy is not guaranteed to produce balanced groups in all contexts, e.g. when the balancing columns cancel out. To increase the probability of balanced groups, we can produce multiple group columns with all combinations of the balancing columns and select the overall most balanced group column(s). We refer to this as auto-tuning (see `auto_tune`).

We find the overall most balanced group column by ranking the across-group standard deviations for each of the balancing columns, as found with summarize_balances().
Example of finding the overall most balanced group column(s): Given a group column with the following average age per group: c(16, 18, 25, 21), the standard deviation hereof (3.92) is a measure of how balanced the age column is. Another group column can thus have a lower/higher standard deviation and be considered more/less balanced.

We find the rankings of these standard deviations for all the balancing columns and average them (again weighted by `weights`). We select the group column(s) with the, on average, highest rank (i.e. lowest standard deviations).
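To make the ranking concrete, here is a minimal sketch in base R with made-up standard deviations (rank 1 = lowest SD = most balanced; this is an illustration, not the package's internal code):

# Hypothetical across-group standard deviations for two
# candidate group columns, on two balancing dimensions
sds <- data.frame(
  col   = c("g1", "g2"),
  age   = c(3.92, 0.82),
  score = c(2.16, 1.30)
)

# Rank each dimension (1 = most balanced), then average the ranks
avg_rank <- rowMeans(cbind(rank(sds$age), rank(sds$score)))
sds[order(avg_rank), ]   # most balanced candidate first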
We highly recommend using summarize_balances() and ranked_balances() to check how balanced the created groups are on the various dimensions. When applying ranked_balances() to the output of summarize_balances(), we get a data.frame with the standard deviations for each balancing dimension (lower means more balanced), ordered by the average rank (see Examples).
The following describes the creation of the balancing columns for each of the supported column types:

For each column in `cat_cols`:
1. Count each level within each group. This creates a data.frame with one count column per level, with one row per group.
2. Standardize the count columns.
3. Average the standardized counts rowwise to create one combined column representing the balance of the levels for each group. When cat_levels contains weights for each of the levels, we apply weighted averaging.

Example: Consider a factor column with the levels c("A", "B", "C"). We count each level per group, normalize the counts and combine them with weighted averaging:
Group | A   | B   | C   |    | nA    | nB    | nC    |    | Combined
1     | 5   | 57  | 1   | -> | 0.24  | 0.55  | -0.77 | -> | 0.007
2     | 7   | 69  | 2   |    | 0.93  | 0.64  | -0.77 |    | 0.267
3     | 2   | 34  | 14  |    | -1.42 | 0.29  | 1.34  |    | 0.07
4     | 5   | 0   | 4   |    | 0.24  | -1.48 | 0.19  |    | -0.35
...   | ... | ... | ... |    | ...   | ...   | ...   |    | ...
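A rough sketch of these three steps, assuming plain scale() standardization (illustrative only; the internal implementation may differ):

# Toy data: one existing group column and one categorical column
set.seed(1)
d <- data.frame(
  group  = rep(1:4, times = c(10, 12, 9, 11)),
  answer = sample(c("A", "B", "C"), 42, replace = TRUE)
)

# 1) Count each level per group
counts <- as.data.frame.matrix(table(d$group, d$answer))

# 2) Standardize the count columns and 3) average rowwise
combined <- rowMeans(scale(counts))
combined   # one balancing value per group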
For each column in `id_cols`: Count the unique IDs (levels) within each group. (Note: The same ID can be counted in multiple groups.)

For each column in `num_cols`: Aggregate the numeric columns by group using the `group_aggregation_fn`.

For the size: Count the number of rows per group.

Combining the balancing columns: Apply standardization or MinMax scaling to each of the balancing columns (see `combine_method`), then perform weighted averaging to get a single combined balancing column (see `weights`).

Example: We apply standardization and perform weighted averaging:
Group | Size | Num | Cat   | ID  |    | nSize | nNum  | nCat  | nID   |    | Combined
1     | 34   | 1.3 | 0.007 | 3   | -> | -0.33 | -0.82 | 0.03  | -0.46 | -> | -0.395
2     | 23   | 4.6 | 0.267 | 4   |    | -1.12 | 0.34  | 1.04  | 0.0   |    | 0.065
3     | 56   | 7.2 | 0.07  | 7   |    | 1.27  | 1.26  | 0.28  | 1.39  |    | 1.05
4     | 41   | 1.4 | -0.35 | 2   |    | 0.18  | -0.79 | -1.35 | -0.93 |    | -0.723
...   | ...  | ... | ...   | ... |    | ...   | ...   | ...   | ...   |    | ...
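A corresponding sketch of the combination step, with standardization and hypothetical weights (again illustrative only):

# Per-group balancing columns (toy values)
bal <- data.frame(
  size = c(34, 23, 56, 41),
  num  = c(1.3, 4.6, 7.2, 1.4),
  id   = c(3, 4, 7, 2)
)

# Standardize each column, then average with weights scaled to sum to 1
w <- c(size = 1, num = 2, id = 1)
w <- w / sum(w)
combined <- as.vector(scale(bal) %*% w)
combined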
Finally, we get to the group creation. There are three methods for creating groups based on the combined balancing column: "balance" (default), "ascending", and "descending".

When `method` is "balance": To create groups that are balanced by the combined balancing column, we use the numerical balancing in fold(). The following describes the numerical balancing in broad terms:
The following describes the numerical balancing in broad terms:
Rows are shuffled. Note that this will only affect rows with the same value in the combined balancing column.
Extreme pairing 1: Rows are ordered as smallest, largest, second smallest, second largest, etc.
Each small+large pair get an extreme-group identifier. (See rearrr::pair_extremes()
)
If `extreme_pairing_levels` > 1
: These extreme-group identifiers are reordered as smallest,
largest, second smallest, second largest, etc., by the sum
of the combined balancing column in the represented rows.
These pairs (of pairs) get a new set of extreme-group identifiers, and the process is repeated
`extreme_pairing_levels`-2
times. Note that the extreme-group identifiers at the last level will represent
2^`extreme_pairing_levels`
rows, why you should be careful when choosing a larger setting.
The extreme-group identifiers from the last pairing are randomly divided into the final groups and these final identifiers are transferred to the original rows.
N.B. When doing extreme pairing of an unequal number of rows, the row with the smallest value is placed in a group by itself, and the order is instead: (smallest), (second smallest, largest), (third smallest, second largest), etc.
A similar approach with extreme triplets (i.e. smallest, closest to median, largest,
second smallest, second closest to median, second largest, etc.) may also be utilized in some scenarios.
(See rearrr::triplet_extremes()
)
Example: We order the data.frame by smallest "Num" value, largest "Num" value, second smallest, and so on. We could further (when `extreme_pairing_levels` > 1) find the sum of "Num" for each pair and perform extreme pairing on the pairs. Finally, we group the data.frame (a small base-R sketch of this ordering follows the table):
Group | Num    |    | Group | Num    | Pair |    | New group
1     | -0.395 | -> | 5     | -1.23  | 1    | -> | 3
2     | 0.065  |    | 3     | 1.05   | 1    |    | 3
3     | 1.05   |    | 4     | -0.723 | 2    |    | 1
4     | -0.723 |    | 2     | 0.065  | 2    |    | 1
5     | -1.23  |    | 1     | -0.395 | 3    |    | 2
6     | -0.15  |    | 6     | -0.15  | 3    |    | 2
...   | ...    |    | ...   | ...    | ...  |    | ...
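A base-R sketch of one level of extreme pairing on the "Num" values above (even number of rows; the package itself uses rearrr::pair_extremes() for this):

# Pair smallest with largest, second smallest with second largest, etc.
x <- c(-0.395, 0.065, 1.05, -0.723, -1.23, -0.15)
o <- order(x)                    # indices from smallest to largest
n <- length(x)
pair_id <- integer(n)
pair_id[o] <- pmin(seq_len(n), rev(seq_len(n)))   # 1, 2, 3, 3, 2, 1
data.frame(x, pair_id)[order(pair_id, x), ]       # matches the Pair column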
When `method` is "ascending" or "descending": These methods order the data by the combined balancing column and create groups such that the sums get increasingly larger ("ascending") or smaller ("descending"). This will in turn lead to a pattern of increasing/decreasing sums in the balancing columns (e.g. increasing/decreasing counts of the categorical levels, counts of IDs, number of rows and sums of numeric columns).
data.frame with one or more new grouping factors.
Ludvig Renbo Olsen, [email protected]
fold() for creating balanced folds/groups.

partition() for creating balanced partitions.

Other grouping functions: all_groups_identical(), collapse_groups_by, fold(), group(), group_factor(), partition(), splt()
# Attach packages
library(groupdata2)
library(dplyr)

# Set seed
if (requireNamespace("xpectr", quietly = TRUE)){
  xpectr::set_test_seed(42)
}

# Create data frame
df <- data.frame(
  "participant" = factor(rep(1:20, 3)),
  "age" = rep(sample(c(1:100), 20), 3),
  "answer" = factor(sample(c("a", "b", "c", "d"), 60, replace = TRUE)),
  "score" = sample(c(1:100), 20 * 3)
)
df <- df %>% dplyr::arrange(participant)
df$session <- rep(c("1", "2", "3"), 20)

# Sample rows to get unequal sizes per participant
df <- dplyr::sample_n(df, size = 53)

# Create the initial groups (to be collapsed)
df <- fold(
  data = df,
  k = 8,
  method = "n_dist",
  id_col = "participant"
)

# Ungroup the data frame
# Otherwise `collapse_groups()` would be
# applied to each fold separately!
df <- dplyr::ungroup(df)

# NOTE: Make sure to check the examples with `auto_tune`
# in the end, as this is where the magic lies

# Collapse to 3 groups with size balancing
# Creates new `.coll_groups` column
df_coll <- collapse_groups(
  data = df,
  n = 3,
  group_cols = ".folds",
  balance_size = TRUE # enabled by default
)

# Check balances
(coll_summary <- summarize_balances(
  data = df_coll,
  group_cols = ".coll_groups",
  cat_cols = 'answer',
  num_cols = c('score', 'age'),
  id_cols = 'participant'
))

# Get ranked balances
# NOTE: When we only have a single new group column
# we don't get ranks - but this is good to use
# when comparing multiple group columns!
# The scores are standard deviations across groups
ranked_balances(coll_summary)

# Collapse to 3 groups with size + *categorical* balancing
# We create 2 new `.coll_groups_1/2` columns
df_coll <- collapse_groups(
  data = df,
  n = 3,
  group_cols = ".folds",
  cat_cols = "answer",
  balance_size = TRUE,
  num_new_group_cols = 2
)

# Check balances
# To simplify the output, we only find the
# balance of the `answer` column
(coll_summary <- summarize_balances(
  data = df_coll,
  group_cols = paste0(".coll_groups_", 1:2),
  cat_cols = 'answer'
))

# Get ranked balances
# All scores are standard deviations across groups or (average) ranks
# Rows are ranked by most to least balanced
# (i.e. lowest average SD rank)
ranked_balances(coll_summary)

# Collapse to 3 groups with size + categorical + *numerical* balancing
# We create 2 new `.coll_groups_1/2` columns
df_coll <- collapse_groups(
  data = df,
  n = 3,
  group_cols = ".folds",
  cat_cols = "answer",
  num_cols = "score",
  balance_size = TRUE,
  num_new_group_cols = 2
)

# Check balances
(coll_summary <- summarize_balances(
  data = df_coll,
  group_cols = paste0(".coll_groups_", 1:2),
  cat_cols = 'answer',
  num_cols = 'score'
))

# Get ranked balances
# All scores are standard deviations across groups or (average) ranks
ranked_balances(coll_summary)

# Collapse to 3 groups with size and *ID* balancing
# We create 2 new `.coll_groups_1/2` columns
df_coll <- collapse_groups(
  data = df,
  n = 3,
  group_cols = ".folds",
  id_cols = "participant",
  balance_size = TRUE,
  num_new_group_cols = 2
)

# Check balances
# To simplify the output, we only find the
# balance of the `participant` column
(coll_summary <- summarize_balances(
  data = df_coll,
  group_cols = paste0(".coll_groups_", 1:2),
  id_cols = 'participant'
))

# Get ranked balances
# All scores are standard deviations across groups or (average) ranks
ranked_balances(coll_summary)

###################
#### Auto-tune ####

# As you might have seen, the balancing does not always
# perform as optimal as we might want or need

# To get a better balance, we can enable `auto_tune`
# which will create a larger set of collapsings
# and select the most balanced new group columns

# While it is not required, we recommend
# enabling parallelization

## Not run: 
# Uncomment for parallelization
# library(doParallel)
# doParallel::registerDoParallel(7) # use 7 cores

# Collapse to 3 groups with lots of balancing
# We enable `auto_tune` to get a more balanced set of columns
# We create 10 new `.coll_groups_1/2/...` columns
df_coll <- collapse_groups(
  data = df,
  n = 3,
  group_cols = ".folds",
  cat_cols = "answer",
  num_cols = "score",
  id_cols = "participant",
  balance_size = TRUE,
  num_new_group_cols = 10,
  auto_tune = TRUE,
  parallel = FALSE # Set to TRUE for parallelization!
)

# Check balances
# To simplify the output, we only find the
# balance of the `participant` column
(coll_summary <- summarize_balances(
  data = df_coll,
  group_cols = paste0(".coll_groups_", 1:10),
  cat_cols = "answer",
  num_cols = "score",
  id_cols = 'participant'
))

# Get ranked balances
# All scores are standard deviations across groups or (average) ranks
ranked_balances(coll_summary)

# Now we can choose the .coll_groups_* column(s)
# that we favor the balance of
# and move on with our lives!

## End(Not run)
Collapses a set of groups into a smaller set of groups.

Balance the new groups by:
- The number of rows with collapse_groups_by_size()
- Numerical columns with collapse_groups_by_numeric()
- One or more levels of categorical columns with collapse_groups_by_levels()
- Level counts in ID columns with collapse_groups_by_ids()
- Any combination of these with collapse_groups()

These functions wrap collapse_groups() to provide a simpler interface. To balance more than one of the attributes at a time and/or create multiple new unique grouping columns at once, use collapse_groups() directly.
While, on average, the balancing works better than without, this is not guaranteed on every run. `auto_tune` (enabled by default) can yield a much better overall balance than without in most contexts. This generates a larger set of group columns using all combinations of the balancing columns and selects the most balanced group column(s). This is slower and can be sped up by enabling parallelization (see `parallel`).

Tip: When speed is more important than balancing, disable `auto_tune`.
Tip: Check the balances of the new groups with summarize_balances() and ranked_balances().

Note: The categorical and ID balancing algorithms are different from those in fold() and partition().
collapse_groups_by_size(
  data,
  n,
  group_cols,
  auto_tune = TRUE,
  method = "balance",
  col_name = ".coll_groups",
  parallel = FALSE,
  verbose = FALSE
)

collapse_groups_by_numeric(
  data,
  n,
  group_cols,
  num_cols,
  balance_size = FALSE,
  auto_tune = TRUE,
  method = "balance",
  group_aggregation_fn = mean,
  col_name = ".coll_groups",
  parallel = FALSE,
  verbose = FALSE
)

collapse_groups_by_levels(
  data,
  n,
  group_cols,
  cat_cols,
  cat_levels = NULL,
  balance_size = FALSE,
  auto_tune = TRUE,
  method = "balance",
  col_name = ".coll_groups",
  parallel = FALSE,
  verbose = FALSE
)

collapse_groups_by_ids(
  data,
  n,
  group_cols,
  id_cols,
  balance_size = FALSE,
  auto_tune = TRUE,
  method = "balance",
  col_name = ".coll_groups",
  parallel = FALSE,
  verbose = FALSE
)
data — data.frame.
n — Number of new groups.
group_cols — Names of factors in `data` for identifying the existing groups. Multiple names are treated as in dplyr::group_by() (i.e. a hierarchy of groups). Note: Do not confuse these group columns with potential columns that `data` is grouped by.
auto_tune — Whether to create a larger set of collapsed group columns from all combinations of the balancing dimensions and select the overall most balanced group column(s). This tends to create much more balanced collapsed group columns, but can be slow, which is why we recommend enabling parallelization (see `parallel`).
method — One of "balance" (default), "ascending", or "descending" (see collapse_groups()).
col_name — Name of the new group column. When creating multiple new group columns, their names are the concatenation of `col_name` and their index (e.g. ".coll_groups_1").
parallel — Whether to parallelize the group column comparisons when `auto_tune` is enabled. Requires a registered parallel backend, like doParallel::registerDoParallel.
verbose — Whether to print information about the process. May make the function slightly slower. N.B. Currently only used during auto-tuning.
num_cols — Names of numerical columns to balance between groups.
balance_size — Whether to balance the size of the collapsed groups. (logical)
group_aggregation_fn — Function for aggregating values in the `num_cols` columns for each group in `group_cols`. Default is mean. When using sum, consider whether you also want to balance the size of the groups. N.B. Only used when `num_cols` is specified.
cat_cols — Names of categorical columns to balance the average frequency of one or more levels of.
cat_levels — Names of the levels in the `cat_cols` columns to balance the average frequencies of. Can include weights for prioritizing specific levels; the weights are automatically scaled to sum to 1.
id_cols — Names of factor columns with IDs to balance the counts of between groups. E.g. useful to get a similar number of participants in each group.
See details in collapse_groups().

`data` with a new grouping factor column.
Ludvig Renbo Olsen, [email protected]
Other grouping functions: all_groups_identical(), collapse_groups(), fold(), group(), group_factor(), partition(), splt()
# Attach packages
library(groupdata2)
library(dplyr)

# Set seed
if (requireNamespace("xpectr", quietly = TRUE)){
  xpectr::set_test_seed(42)
}

# Create data frame
df <- data.frame(
  "participant" = factor(rep(1:20, 3)),
  "age" = rep(sample(c(1:100), 20), 3),
  "answer" = factor(sample(c("a", "b", "c", "d"), 60, replace = TRUE)),
  "score" = sample(c(1:100), 20 * 3)
)
df <- df %>% dplyr::arrange(participant)
df$session <- rep(c("1", "2", "3"), 20)

# Sample rows to get unequal sizes per participant
df <- dplyr::sample_n(df, size = 53)

# Create the initial groups (to be collapsed)
df <- fold(
  data = df,
  k = 8,
  method = "n_dist",
  id_col = "participant"
)

# Ungroup the data frame
# Otherwise `collapse_groups*()` would be
# applied to each fold separately!
df <- dplyr::ungroup(df)

# When `auto_tune` is enabled for larger datasets
# we recommend enabling parallelization
# This can be done with:
# library(doParallel)
# doParallel::registerDoParallel(7) # use 7 cores

## Not run: 
# Collapse to 3 groups with size balancing
# Creates new `.coll_groups` column
df_coll <- collapse_groups_by_size(
  data = df,
  n = 3,
  group_cols = ".folds"
)

# Check balances
(coll_summary <- summarize_balances(
  data = df_coll,
  group_cols = ".coll_groups"
))

# Get ranked balances
# This is most useful when having created multiple
# new group columns with `collapse_groups()`
# The scores are standard deviations across groups
ranked_balances(coll_summary)

# Collapse to 3 groups with *categorical* balancing
df_coll <- collapse_groups_by_levels(
  data = df,
  n = 3,
  group_cols = ".folds",
  cat_cols = "answer"
)

# Check balances
(coll_summary <- summarize_balances(
  data = df_coll,
  group_cols = ".coll_groups",
  cat_cols = 'answer'
))

# Collapse to 3 groups with *numerical* balancing
# Also balance size to get similar sums
# as well as means
df_coll <- collapse_groups_by_numeric(
  data = df,
  n = 3,
  group_cols = ".folds",
  num_cols = "score",
  balance_size = TRUE
)

# Check balances
(coll_summary <- summarize_balances(
  data = df_coll,
  group_cols = ".coll_groups",
  num_cols = 'score'
))

# Collapse to 3 groups with *ID* balancing
# This should give us a similar number of IDs per group
df_coll <- collapse_groups_by_ids(
  data = df,
  n = 3,
  group_cols = ".folds",
  id_cols = "participant"
)

# Check balances
(coll_summary <- summarize_balances(
  data = df_coll,
  group_cols = ".coll_groups",
  id_cols = 'participant'
))

# Collapse to 3 groups with balancing of ALL attributes
# We create 5 new grouping factors and compare them
# The latter is in-general a good strategy even if you
# only need a single collapsed grouping factor
# as you can choose your preferred balances
# based on the summary

# NOTE: This is slow (up to a few minutes)
# consider enabling parallelization
df_coll <- collapse_groups(
  data = df,
  n = 3,
  num_new_group_cols = 5,
  group_cols = ".folds",
  cat_cols = "answer",
  num_cols = 'score',
  id_cols = "participant",
  auto_tune = TRUE # Disabled by default in `collapse_groups()`
  # parallel = TRUE # Add comma above and uncomment
)

# Check balances
(coll_summary <- summarize_balances(
  data = df_coll,
  group_cols = paste0(".coll_groups_", 1:5),
  cat_cols = "answer",
  num_cols = 'score',
  id_cols = 'participant'
))

# Compare the new grouping columns
# The lowest across-group standard deviation
# is the most balanced
ranked_balances(coll_summary)

## End(Not run)
Finds values, or indices of values, that differ from the previous value by some threshold(s).

Operates with both a positive and a negative threshold. Depending on `direction`, it checks if the difference to the previous value is:
- greater than or equal to the positive threshold.
- less than or equal to the negative threshold.
differs_from_previous(
  data,
  col = NULL,
  threshold = NULL,
  direction = "both",
  return_index = FALSE,
  include_first = FALSE,
  handle_na = "ignore",
  factor_conversion_warning = TRUE
)
data — data.frame or vector. N.B. If checking a factor, it is converted to a character vector. N.B. If `data` is a grouped data.frame, the function is applied group-wise.
col — Name of column to find values that differ in. Used when `data` is a data.frame. (Character)
threshold — Threshold to check the difference to the previous value against.
  NULL: Checks if the value is different from the previous value. Ignores `direction`. N.B. Works for both numeric and character vectors.
  Numeric scalar: Positive number. Negative threshold is the negated number. N.B. Only works for numeric vectors.
  Numeric vector with length 2: Given as c(negative threshold, positive threshold). Negative threshold must be a negative number and positive threshold must be a positive number. N.B. Only works for numeric vectors.
direction —
  both: Checks whether the difference to the previous value is greater than or equal to the positive threshold, or less than or equal to the negative threshold.
  positive: Checks whether the difference to the previous value is greater than or equal to the positive threshold.
  negative: Checks whether the difference to the previous value is less than or equal to the negative threshold.
return_index — Return indices of values that differ. (Logical)
include_first — Whether to include the first element of the vector in the output. (Logical)
handle_na — How to handle NAs in the column.
  "ignore": Removes the NAs before finding the differing values.
  "as_element": Treats all NAs as the string "NA".
  Numeric scalar: A numeric value to replace NAs with.
factor_conversion_warning — Whether to throw a warning when converting a factor to a character vector. (Logical)
vector with either the differing values or the indices of the differing values.

N.B. If `data` is a grouped data.frame, the output is a list of vectors with the differing values. The names are based on the group indices (see dplyr::group_indices()).
Ludvig Renbo Olsen, [email protected]
Other l_starts tools: find_missing_starts(), find_starts(), group(), group_factor()
# Attach packages
library(groupdata2)

# Create a data frame
df <- data.frame(
  "a" = factor(c("a", "a", "b", "b", "c", "c")),
  "n" = c(1, 3, 6, 2, 2, 4)
)

# Get differing values in column 'a' with no threshold.
# This will simply check, if it is different to the previous value or not.
differs_from_previous(df, col = "a")

# Get indices of differing values in column 'a' with no threshold.
differs_from_previous(df, col = "a", return_index = TRUE)

# Get values, that are 2 or more greater than the previous value
differs_from_previous(df, col = "n", threshold = 2, direction = "positive")

# Get values, that are 4 or more less than the previous value
differs_from_previous(df, col = "n", threshold = 4, direction = "negative")

# Get values, that are either 2 or more greater than the previous value
# or 4 or more less than the previous value
differs_from_previous(df, col = "n", threshold = c(-4, 2), direction = "both")
Uses random downsampling to fix the group sizes to the smallest group in the data.frame.

Wraps balance().
downsample(data, cat_col, id_col = NULL, id_method = "n_ids")
data — data.frame.
cat_col — Name of categorical variable to balance by. (Character)
id_col — Name of factor with IDs. (Character) IDs are considered entities, e.g. allowing us to add or remove all rows for an ID. How this is used is up to the `id_method`. E.g. if we have measured a participant multiple times and want to make sure that we keep all these measurements, we would either remove/add all of the participant's measurements or keep all of them. N.B. When `data` is a grouped data.frame (see dplyr::group_by()), IDs that appear in multiple groupings are considered separate entities within those groupings.
id_method — Method for balancing the IDs. (Character)
  n_ids (default): Balances on ID level only. It makes sure there are the same number of IDs for each category. This might lead to a different number of rows between categories.
  n_rows_c: Attempts to level the number of rows per category, while only removing/adding entire IDs. This is done in 2 steps.
  distributed: Distributes the lacking/excess rows equally between the IDs. If the number to distribute cannot be equally divided, some IDs will have 1 row more/less than the others.
  nested: Calls balance() on each category with the IDs as the categories. I.e. IDs will have the size of the smallest ID in their category.
Without `id_col`: Downsampling is done without replacement, meaning that rows are not duplicated but only removed.

With `id_col`: See `id_method` description.

data.frame with some rows removed. Ordered by potential grouping variables, `cat_col` and (potentially) `id_col`.
Ludvig Renbo Olsen, [email protected]
Other sampling functions: balance(), upsample()
# Attach packages
library(groupdata2)

# Create data frame
df <- data.frame(
  "participant" = factor(c(1, 1, 2, 3, 3, 3, 3, 4, 4, 5, 5, 5, 5)),
  "diagnosis" = factor(c(0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0)),
  "trial" = c(1, 2, 1, 1, 2, 3, 4, 1, 2, 1, 2, 3, 4),
  "score" = sample(c(1:100), 13)
)

# Using downsample()
downsample(df, cat_col = "diagnosis")

# Using downsample() with id_method "n_ids"
downsample(df,
  cat_col = "diagnosis",
  id_col = "participant",
  id_method = "n_ids"
)

# Using downsample() with id_method "n_rows_c"
downsample(df,
  cat_col = "diagnosis",
  id_col = "participant",
  id_method = "n_rows_c"
)

# Using downsample() with id_method "distributed"
downsample(df,
  cat_col = "diagnosis",
  id_col = "participant",
  id_method = "distributed"
)

# Using downsample() with id_method "nested"
downsample(df,
  cat_col = "diagnosis",
  id_col = "participant",
  id_method = "nested"
)
Tells you which values and (optionally) skip-to-numbers are recursively removed when using the "l_starts" method with `remove_missing_starts` set to TRUE.
find_missing_starts(data, n, starts_col = NULL, return_skip_numbers = TRUE)
data — data.frame or vector. N.B. If `data` is a grouped data.frame, the function is applied group-wise.
n — List of starting positions. Skip values by c(value, skip_to_number). See group() for details.
starts_col — Name of column with values to match when `data` is a data.frame. (Character)
return_skip_numbers — Return skip-to-numbers along with values. (Logical)
List of start values and skip-to-numbers, or a vector with the start values. Returns NULL if no values were found.

N.B. If `data` is a grouped data.frame, the function is applied group-wise and the output is a list of either vectors or lists. The names are based on the group indices (see dplyr::group_indices()).
Ludvig Renbo Olsen, [email protected]
Other l_starts tools: differs_from_previous(), find_starts(), group(), group_factor()
# Attach packages
library(groupdata2)

# Create a data frame
df <- data.frame(
  "a" = c("a", "a", "b", "b", "c", "c"),
  stringsAsFactors = FALSE
)

# Create list of starts
starts <- c("a", "e", "b", "d", "c")

# Find missing starts with skip_to numbers
find_missing_starts(df, starts, starts_col = "a")

# Find missing starts without skip_to numbers
find_missing_starts(df, starts,
  starts_col = "a",
  return_skip_numbers = FALSE
)
Finds values or indices of values that are not the same as the previous value. E.g. to use with the "l_starts" method.

Wraps differs_from_previous().
find_starts(
  data,
  col = NULL,
  return_index = FALSE,
  handle_na = "ignore",
  factor_conversion_warning = TRUE
)
data — data.frame or vector. N.B. If checking a factor, it is converted to a character vector. N.B. If `data` is a grouped data.frame, the function is applied group-wise.
col — Name of column to find starts in. Used when `data` is a data.frame. (Character)
return_index — Whether to return indices of starts. (Logical)
handle_na — How to handle NAs in the column.
  "ignore": Removes the NAs before finding the start values.
  "as_element": Treats all NAs as the string "NA".
  Numeric scalar: A numeric value to replace NAs with.
factor_conversion_warning — Whether to throw a warning when converting a factor to a character vector. (Logical)
vector with either the start values or the indices of the start values.

N.B. If `data` is a grouped data.frame, the output is a list of vectors. The names are based on the group indices (see dplyr::group_indices()).
Ludvig Renbo Olsen, [email protected]
Other l_starts tools: differs_from_previous(), find_missing_starts(), group(), group_factor()
# Attach packages
library(groupdata2)

# Create a data frame
df <- data.frame(
  "a" = c("a", "a", "b", "b", "c", "c"),
  stringsAsFactors = FALSE
)

# Get start values for new groups in column 'a'
find_starts(df, col = "a")

# Get indices of start values for new groups
# in column 'a'
find_starts(df,
  col = "a",
  return_index = TRUE
)

## Use found starts with l_starts method
# Notice: This is equivalent to n = 'auto'
# with l_starts method

# Get start values for new groups in column 'a'
starts <- find_starts(df, col = "a")

# Use starts in group() with 'l_starts' method
group(df,
  n = starts,
  method = "l_starts",
  starts_col = "a"
)

# Similar but with indices instead of values

# Get indices of start values for new groups
# in column 'a'
starts_ind <- find_starts(df,
  col = "a",
  return_index = TRUE
)

# Use starts in group() with 'l_starts' method
group(df,
  n = starts_ind,
  method = "l_starts",
  starts_col = "index"
)
Divides data into groups by a wide range of methods. Balances a given categorical variable and/or numerical variable between folds and keeps (if possible) all data points with a shared ID (e.g. participant_id) in the same fold. Can create multiple unique fold columns for repeated cross-validation.
fold(
  data,
  k = 5,
  cat_col = NULL,
  num_col = NULL,
  id_col = NULL,
  method = "n_dist",
  id_aggregation_fn = sum,
  extreme_pairing_levels = 1,
  num_fold_cols = 1,
  unique_fold_cols_only = TRUE,
  max_iters = 5,
  use_of_triplets = "fill",
  handle_existing_fold_cols = "keep_warn",
  parallel = FALSE
)
data |
`data.frame`. Can be grouped, in which case the function is applied group-wise. |
k |
Depends on `method`. Number of folds (default) or fold size, with more options depending on `method` (see `method`). When `num_fold_cols > 1`, `k` can be a `vector` with one `k` per fold column (see the examples). Given as whole number or percentage (`0 < k < 1`). |
cat_col |
Name of categorical variable to balance between folds. E.g. when predicting a binary variable (a or b), we usually want both classes represented in every fold. N.B. If also passing an `id_col`, `cat_col` should be constant within each ID. |
num_col |
Name of numerical variable to balance between folds. N.B. When used with `id_col`, values in `num_col` are aggregated for each ID with `id_aggregation_fn` before the balancing. N.B. When passing `num_col`, the `method` parameter is ignored. |
id_col |
Name of factor with IDs. This will be used to keep all rows that share an ID in the same fold (if possible). E.g. If we have measured a participant multiple times and want to see the effect of time, we want to have all observations of this participant in the same fold. |
method |
Notice: the example group sizes below are based on a `vector` with 57 elements.

n_dist (default): Divides the data into a specified number of groups and distributes excess data points across groups (e.g. 11, 11, 12, 11, 12, with `k = 5`).

n_fill: Divides the data into a specified number of groups and fills up groups with excess data points from the beginning (e.g. 12, 12, 11, 11, 11, with `k = 5`).

n_last: Divides the data into a specified number of groups. It finds the most equal group sizes possible, using all data points. Only the last group is able to differ in size (e.g. 11, 11, 11, 11, 13, with `k = 5`).

n_rand: Divides the data into a specified number of groups. Excess data points are placed randomly in groups (only 1 per group).

greedy: Divides up the data greedily given a specified group size (e.g. 10, 10, 10, 10, 10, 7, with a group size of 10).

staircase: Uses step size to divide up the data. Group size increases with 1 step for every group, until there is no more data (e.g. 5, 10, 15, 20, 7, with a step size of 5).
|
id_aggregation_fn |
Function for aggregating values in `num_col` for each ID, before balancing `num_col`. N.B. Only used when `num_col` and `id_col` are both specified. |
extreme_pairing_levels |
How many levels of extreme pairing to do when balancing folds by a numerical column (i.e. when `num_col` is specified).

Extreme pairing: Rows/pairs are ordered as smallest, largest, second smallest, second largest, etc. If `extreme_pairing_levels` > 1, this is done with the pairs (of pairs) as well, ordered by their sums in `num_col`.

N.B. Larger values work best with large datasets. If set too high, the result might not be stochastic. Always check if an increase actually makes the folds more balanced. See example. |
num_fold_cols |
Number of fold columns to create. Useful for repeated cross-validation. If `num_fold_cols > 1`, the columns are named ".folds_1", ".folds_2", etc. Otherwise simply ".folds". N.B. If `unique_fold_cols_only` is `TRUE`, we can end up with fewer columns than specified (see `max_iters`). N.B. If `data` has existing fold columns, see `handle_existing_fold_cols`. |
unique_fold_cols_only |
Check if fold columns are identical and keep only unique columns. As the number of column comparisons can be time consuming, we can run this part in parallel. See `parallel`.

N.B. We can end up with fewer columns than specified in `num_fold_cols` (see `max_iters`).

N.B. Only used when `num_fold_cols > 1` or when `data` has existing fold columns. |
max_iters |
Maximum number of attempts at reaching `num_fold_cols` unique fold columns.

When only keeping unique fold columns, we risk having fewer columns than expected. Hence, we repeatedly create the missing columns and remove those that are not unique. This is done until we have `num_fold_cols` unique fold columns or until `max_iters` attempts have been made.

In some cases, it is not possible to create `num_fold_cols` unique combinations of the dataset at all, in which case fewer columns are returned.

N.B. Only used when `num_fold_cols > 1`. |
use_of_triplets |
When to use extreme triplet grouping in the numerical balancing (i.e. when `num_col` is specified).

fill (default): When extreme pairing cannot create enough unique fold columns, use extreme triplet grouping to create additional unique fold columns.

instead: Use extreme triplet grouping instead of extreme pairing. For some datasets, grouping in triplets gives better balancing than grouping in pairs. This can be worth exploring when numerical balancing is important. Tip: Compare the balances with `summarize_balances()` and `ranked_balances()`.

never: Never use extreme triplet grouping.

Extreme triplet grouping: Similar to extreme pairing (see `extreme_pairing_levels`), but groups the elements in triplets of the smallest, the closest to the median, and the largest elements. For some datasets, this can give more balanced groups than extreme pairing, but on average, extreme pairing works better. Due to the grouping into triplets instead of pairs, they tend to create different groupings though, so when creating many fold columns and extreme pairing cannot create enough unique fold columns, we can create the remaining (or at least some additional number) with extreme triplet grouping. Extreme triplet grouping is implemented in `rearrr::triplet_extremes()`.
|
handle_existing_fold_cols |
How to handle existing fold columns. Either "keep_warn" (default), "keep", or "remove".

To add extra fold columns, use "keep" or "keep_warn".

To replace the existing fold columns, use "remove". |
parallel |
Whether to parallelize the fold column comparisons, when `unique_fold_cols_only` is `TRUE`.

Requires a registered parallel backend. Like `doParallel::registerDoParallel`. |
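For instance, a backend can be registered with the `doParallel` package before enabling `parallel` (a minimal sketch; the data frame `df` and the argument values are placeholders):

# Register a parallel backend with 2 workers, then parallelize
# the comparisons of the generated fold columns
library(groupdata2)
library(doParallel)
doParallel::registerDoParallel(2)
df_folded <- fold(
  data = df,
  k = 3,
  num_fold_cols = 10,
  unique_fold_cols_only = TRUE,
  parallel = TRUE
)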
With `cat_col`:
`data` is subset by `cat_col`. Subsets are grouped and merged.

With `id_col`:
Groups are created from unique IDs.

With `num_col`:
Rows are shuffled. Note that this will only affect rows with the same value in `num_col`.
Extreme pairing 1: Rows are ordered as smallest, largest, second smallest, second largest, etc. Each pair gets a group identifier (see `rearrr::pair_extremes()`).
If `extreme_pairing_levels` > 1: These group identifiers are reordered as smallest, largest, second smallest, second largest, etc., by the sum of `num_col` in the represented rows. These pairs (of pairs) get a new set of group identifiers, and the process is repeated `extreme_pairing_levels` - 2 times. Note that the group identifiers at the last level will represent 2^`extreme_pairing_levels` rows, which is why you should be careful when choosing that setting.
The group identifiers from the last pairing are folded (randomly divided into groups), and the fold identifiers are transferred to the original rows.

N.B. When doing extreme pairing of an unequal number of rows, the row with the smallest value is placed in a group by itself, and the order is instead: smallest, second smallest, largest, third smallest, second largest, etc.

N.B. When `num_fold_cols` > 1 and fewer than `num_fold_cols` fold columns have been created after `max_iters` attempts, we try with extreme triplets instead (see `rearrr::triplet_extremes()`). These group the elements as smallest, closest to the median, largest, second smallest, second closest to the median, second largest, etc. We can also choose to never/only use extreme triplets via `use_of_triplets`.
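To make the pairing order concrete, here is a minimal base-R sketch of the first pairing level (illustration only - `fold()` uses `rearrr::pair_extremes()` internally; the sketch assumes an even number of rows):

# Order values as smallest, largest, second smallest, second largest,
# etc., and give each consecutive pair a group identifier
vals <- c(5, 1, 9, 3, 7, 2)
s <- sort(vals)                    # 1 2 3 5 7 9
half <- length(s) / 2              # assumes an even length
pairing_order <- as.vector(rbind(s[1:half], rev(s)[1:half]))
pairing_order                      # 1 9 2 7 3 5
pair_ids <- rep(1:half, each = 2)  # 1 1 2 2 3 3
# With `extreme_pairing_levels` > 1, the same reordering would be
# applied to the pair sums before folding the final identifiers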
With `cat_col` and `id_col`:
`data` is subset by `cat_col`. Groups are created from unique IDs in each subset. Subsets are merged.

With `cat_col` and `num_col`:
`data` is subset by `cat_col`. Subsets are grouped by `num_col`. Subsets are merged such that the largest group (by sum of `num_col`) from the first category is merged with the smallest group from the second category, etc.

With `num_col` and `id_col`:
Values in `num_col` are aggregated for each ID, using `id_aggregation_fn`. The IDs are grouped, using the aggregated values as "num_col". The groups of the IDs are transferred to the rows.

With `cat_col`, `num_col`, and `id_col`:
Values in `num_col` are aggregated for each ID, using `id_aggregation_fn`. IDs are subset by `cat_col`. The IDs in each subset are grouped, using the aggregated values as "num_col". The subsets are merged such that the largest group (by sum of the aggregated values) from the first category is merged with the smallest group from the second category, etc. The groups of the IDs are transferred to the rows.
`data.frame` with the grouping factor for subsetting in cross-validation.
Ludvig Renbo Olsen, [email protected]
`partition()` for balanced partitions.

Other grouping functions: `all_groups_identical()`, `collapse_groups()`, `collapse_groups_by`, `group()`, `group_factor()`, `partition()`, `splt()`
# Attach packages
library(groupdata2)
library(dplyr)

# Create data frame
df <- data.frame(
  "participant" = factor(rep(c("1", "2", "3", "4", "5", "6"), 3)),
  "age" = rep(sample(c(1:100), 6), 3),
  "diagnosis" = factor(rep(c("a", "b", "a", "a", "b", "b"), 3)),
  "score" = sample(c(1:100), 3 * 6)
)
df <- df %>% arrange(participant)
df$session <- rep(c("1", "2", "3"), 6)

# Using fold()

## Without balancing
df_folded <- fold(data = df, k = 3, method = "n_dist")

## With cat_col
df_folded <- fold(
  data = df,
  k = 3,
  cat_col = "diagnosis",
  method = "n_dist"
)

## With id_col
df_folded <- fold(
  data = df,
  k = 3,
  id_col = "participant",
  method = "n_dist"
)

## With num_col
# Note: 'method' would not be used in this case
df_folded <- fold(data = df, k = 3, num_col = "score")

## With cat_col and id_col
df_folded <- fold(
  data = df,
  k = 3,
  cat_col = "diagnosis",
  id_col = "participant",
  method = "n_dist"
)

## With cat_col, id_col and num_col
df_folded <- fold(
  data = df,
  k = 3,
  cat_col = "diagnosis",
  id_col = "participant",
  num_col = "score"
)

# Order by folds
df_folded <- df_folded %>% arrange(.folds)

## Multiple fold columns
# Useful for repeated cross-validation
# Note: Consider running in parallel
df_folded <- fold(
  data = df,
  k = 3,
  cat_col = "diagnosis",
  id_col = "participant",
  num_fold_cols = 5,
  unique_fold_cols_only = TRUE,
  max_iters = 4
)

# Different `k` per fold column
# Note: `length(k) == num_fold_cols`
df_folded <- fold(
  data = df,
  k = c(2, 3),
  cat_col = "diagnosis",
  id_col = "participant",
  num_fold_cols = 2,
  unique_fold_cols_only = TRUE,
  max_iters = 4
)

# Check the generated columns with `summarize_group_cols()`
summarize_group_cols(
  data = df_folded,
  group_cols = paste0('.folds_', 1:2)
)

## Check if additional `extreme_pairing_levels`
## improve the numerical balance
set.seed(2) # try with seed 1 as well
df_folded_1 <- fold(
  data = df,
  k = 3,
  num_col = "score",
  extreme_pairing_levels = 1
)
df_folded_1 %>%
  dplyr::ungroup() %>%
  summarize_balances(group_cols = '.folds', num_cols = 'score')

set.seed(2) # try with seed 1 as well
df_folded_2 <- fold(
  data = df,
  k = 3,
  num_col = "score",
  extreme_pairing_levels = 2
)
df_folded_2 %>%
  dplyr::ungroup() %>%
  summarize_balances(group_cols = '.folds', num_cols = 'score')

# We can directly compare how balanced the 'score' is
# in the two fold columns using a combination of
# `summarize_balances()` and `ranked_balances()`
# We see that the second fold column (made with `extreme_pairing_levels = 2`)
# has a lower standard deviation of its mean scores - meaning that they
# are more similar and thus more balanced
df_folded_1$.folds_2 <- df_folded_2$.folds
df_folded_1 %>%
  dplyr::ungroup() %>%
  summarize_balances(group_cols = c('.folds', '.folds_2'), num_cols = 'score') %>%
  ranked_balances()
Divides data into groups by a wide range of methods. Creates a grouping factor with `1`s for group 1, `2`s for group 2, etc. Returns a `data.frame` grouped by the grouping factor for easy use in `magrittr` `%>%` pipelines.

By default*, the data points in a group are connected sequentially (e.g. `c(1, 1, 2, 2, 3, 3)`) and splitting is done from top to bottom. *Except in the `"every"` method.
There are five types of grouping methods:

The `"n_*"` methods split the data into a given number of groups. They differ in how they handle excess data points.

The `"greedy"` method uses a group size to split the data into groups, greedily grabbing `n` data points from the top. The last group may thus differ in size (e.g. `c(1, 1, 2, 2, 3)`).

The `"l_*"` methods use a list of either starting points (`"l_starts"`) or group sizes (`"l_sizes"`). The `"l_starts"` method can also auto-detect group starts (when a value differs from the previous value).

The `"every"` method puts every `n`th data point into the same group (e.g. `c(1, 2, 3, 1, 2, 3)`).

The step methods `"staircase"` and `"primes"` increase the group size by a step for each group.

Note: To create groups balanced by a categorical and/or numerical variable, see the `fold()` and `partition()` functions.
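A quick way to compare the method families is to inspect the grouping factors they produce for a small vector. A minimal sketch (the comments describe the expected group sizes under the method definitions above):

library(groupdata2)
v <- c(1:7)
group_factor(v, n = 3, method = "n_dist")     # three groups of near-equal size
group_factor(v, n = 3, method = "greedy")     # groups of 3; the last group is smaller
group_factor(v, n = 3, method = "every")      # every 3rd data point shares a group
group_factor(v, n = 2, method = "staircase")  # group sizes 2, 4, and the remaining 1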
group(
  data,
  n,
  method = "n_dist",
  starts_col = NULL,
  force_equal = FALSE,
  allow_zero = FALSE,
  return_factor = FALSE,
  descending = FALSE,
  randomize = FALSE,
  col_name = ".groups",
  remove_missing_starts = FALSE
)
data |
`data.frame` or `vector`. When a grouped `data.frame`, the function is applied group-wise. |
n |
Depends on `method`. Number of groups (default), group size, list of group sizes, list of group starts, number of data points between group members, step size, or prime number to start at. See `method`.

Passed as whole number(s) and/or percentage(s) (`0 < n < 1`) and/or character(s). Method `"l_starts"` can also take `'auto'`. |
method |
Note: the example group sizes below are based on a `vector` with 57 elements.

greedy: Divides up the data greedily given a specified group size (e.g. 10, 10, 10, 10, 10, 7, with a group size of 10).

n_dist (default): Divides the data into a specified number of groups and distributes excess data points across groups (e.g. 11, 11, 12, 11, 12, with `n = 5`).

n_fill: Divides the data into a specified number of groups and fills up groups with excess data points from the beginning (e.g. 12, 12, 11, 11, 11, with `n = 5`).

n_last: Divides the data into a specified number of groups. It finds the most equal group sizes possible, using all data points. Only the last group is able to differ in size (e.g. 11, 11, 11, 11, 13, with `n = 5`).

n_rand: Divides the data into a specified number of groups. Excess data points are placed randomly in groups (max. 1 per group).

l_sizes: Divides up the data by a `list` of group sizes. Excess data points are placed in an extra group at the end (e.g. 11, 17, 29, with `n = list(0.2, 0.3)`).

l_starts: Starts new groups at specified values in the `starts_col` vector. `n` is a `list` of starting positions. To skip to a later appearance of a value, use `c(value, skip_to_number)` (see the examples). If passing `n = 'auto'`, the starting positions are detected automatically, such that a group starts whenever a value differs from the previous value (see `find_starts()`).

every: Combines every `n`th data point into a group (e.g. 19, 19, 19, with `n = 3`).

staircase: Uses step size to divide up the data. Group size increases with 1 step for every group, until there is no more data (e.g. 5, 10, 15, 20, 7, with a step size of 5).

primes: Uses prime numbers as group sizes. Group size increases to the next prime number until there is no more data (e.g. 5, 7, 11, 13, 17, 4, when starting at 5).
|
starts_col |
Name of column with values to match in method `"l_starts"` when `data` is a `data.frame`. Pass `'index'` to use row names. (Character) |
force_equal |
Create equal groups by discarding excess data points. Implementation varies between methods. (Logical) |
allow_zero |
Whether `n` can be passed as `0`. (Logical) |
return_factor |
Only return the grouping factor. (Logical) |
descending |
Change the direction of the method. (Not fully implemented) (Logical) |
randomize |
Randomize the grouping factor. (Logical) |
col_name |
Name of the added grouping factor. |
remove_missing_starts |
Recursively remove elements from the list of starts that are not found. For method `"l_starts"` only. (Logical) |
`data.frame` grouped by existing grouping variables and the new grouping factor.
Ludvig Renbo Olsen, [email protected]
Other grouping functions: `all_groups_identical()`, `collapse_groups()`, `collapse_groups_by`, `fold()`, `group_factor()`, `partition()`, `splt()`

Other staircase tools: `%primes%()`, `%staircase%()`, `group_factor()`

Other l_starts tools: `differs_from_previous()`, `find_missing_starts()`, `find_starts()`, `group_factor()`
# Attach packages
library(groupdata2)
library(dplyr)

# Create data frame
df <- data.frame(
  "x" = c(1:12),
  "species" = factor(rep(c("cat", "pig", "human"), 4)),
  "age" = sample(c(1:100), 12)
)

# Using group()
df_grouped <- group(df, n = 5, method = "n_dist")

# Using group() in pipeline to get mean age
df_means <- df %>%
  group(n = 5, method = "n_dist") %>%
  dplyr::summarise(mean_age = mean(age))

# Using group() with `l_sizes`
df_grouped <- group(
  data = df,
  n = list(0.2, 0.3),
  method = "l_sizes"
)

# Using group() with `l_starts`
# `c('pig', 2)` skips to the second appearance of
# 'pig' after the first appearance of 'cat'
df_grouped <- group(
  data = df,
  n = list("cat", c("pig", 2), "human"),
  method = "l_starts",
  starts_col = "species"
)
Divides data into groups by a wide range of methods. Creates and returns a grouping `factor` with `1`s for group 1, `2`s for group 2, etc.

By default*, the data points in a group are connected sequentially (e.g. `c(1, 1, 2, 2, 3, 3)`) and splitting is done from top to bottom. *Except in the `"every"` method.

There are five types of grouping methods:

The `"n_*"` methods split the data into a given number of groups. They differ in how they handle excess data points.

The `"greedy"` method uses a group size to split the data into groups, greedily grabbing `n` data points from the top. The last group may thus differ in size (e.g. `c(1, 1, 2, 2, 3)`).

The `"l_*"` methods use a list of either starting points (`"l_starts"`) or group sizes (`"l_sizes"`). The `"l_starts"` method can also auto-detect group starts (when a value differs from the previous value).

The `"every"` method puts every `n`th data point into the same group (e.g. `c(1, 2, 3, 1, 2, 3)`).

The step methods `"staircase"` and `"primes"` increase the group size by a step for each group.

Note: To create groups balanced by a categorical and/or numerical variable, see the `fold()` and `partition()` functions.
group_factor(
  data,
  n,
  method = "n_dist",
  starts_col = NULL,
  force_equal = FALSE,
  allow_zero = FALSE,
  descending = FALSE,
  randomize = FALSE,
  remove_missing_starts = FALSE
)
data |
`data.frame` or `vector`. When a grouped `data.frame`, the function is applied group-wise. |
n |
Depends on `method`. Number of groups (default), group size, list of group sizes, list of group starts, number of data points between group members, step size, or prime number to start at. See `method`.

Passed as whole number(s) and/or percentage(s) (`0 < n < 1`) and/or character(s). Method `"l_starts"` can also take `'auto'`. |
method |
Note: the example group sizes below are based on a `vector` with 57 elements.

greedy: Divides up the data greedily given a specified group size (e.g. 10, 10, 10, 10, 10, 7, with a group size of 10).

n_dist (default): Divides the data into a specified number of groups and distributes excess data points across groups (e.g. 11, 11, 12, 11, 12, with `n = 5`).

n_fill: Divides the data into a specified number of groups and fills up groups with excess data points from the beginning (e.g. 12, 12, 11, 11, 11, with `n = 5`).

n_last: Divides the data into a specified number of groups. It finds the most equal group sizes possible, using all data points. Only the last group is able to differ in size (e.g. 11, 11, 11, 11, 13, with `n = 5`).

n_rand: Divides the data into a specified number of groups. Excess data points are placed randomly in groups (max. 1 per group).

l_sizes: Divides up the data by a `list` of group sizes. Excess data points are placed in an extra group at the end (e.g. 11, 17, 29, with `n = list(0.2, 0.3)`).

l_starts: Starts new groups at specified values in the `starts_col` vector. `n` is a `list` of starting positions. To skip to a later appearance of a value, use `c(value, skip_to_number)` (see the examples). If passing `n = 'auto'`, the starting positions are detected automatically, such that a group starts whenever a value differs from the previous value (see `find_starts()`).

every: Combines every `n`th data point into a group (e.g. 19, 19, 19, with `n = 3`).

staircase: Uses step size to divide up the data. Group size increases with 1 step for every group, until there is no more data (e.g. 5, 10, 15, 20, 7, with a step size of 5).

primes: Uses prime numbers as group sizes. Group size increases to the next prime number until there is no more data (e.g. 5, 7, 11, 13, 17, 4, when starting at 5).
|
starts_col |
Name of column with values to match in method `"l_starts"` when `data` is a `data.frame`. Pass `'index'` to use row names. (Character) |
force_equal |
Create equal groups by discarding excess data points. Implementation varies between methods. (Logical) |
allow_zero |
Whether `n` can be passed as `0`. (Logical) |
descending |
Change the direction of the method. (Not fully implemented) (Logical) |
randomize |
Randomize the grouping factor. (Logical) |
remove_missing_starts |
Recursively remove elements from the list of starts that are not found. For method `"l_starts"` only. (Logical) |
Grouping `factor` with `1`s for group 1, `2`s for group 2, etc.

N.B. If `data` is a grouped `data.frame`, the output is a `data.frame` with the existing groupings and the generated grouping factor. The row order from `data` is maintained.
Ludvig Renbo Olsen, [email protected]
Other grouping functions: `all_groups_identical()`, `collapse_groups()`, `collapse_groups_by`, `fold()`, `group()`, `partition()`, `splt()`

Other staircase tools: `%primes%()`, `%staircase%()`, `group()`

Other l_starts tools: `differs_from_previous()`, `find_missing_starts()`, `find_starts()`, `group()`
# Attach packages
library(groupdata2)
library(dplyr)

# Create a data frame
df <- data.frame(
  "x" = c(1:12),
  "species" = factor(rep(c("cat", "pig", "human"), 4)),
  "age" = sample(c(1:100), 12)
)

# Using group_factor() with n_dist
groups <- group_factor(df, 5, method = "n_dist")
df$groups <- groups

# Using group_factor() with greedy
groups <- group_factor(df, 5, method = "greedy")
df$groups <- groups

# Using group_factor() with l_sizes
groups <- group_factor(df, list(0.2, 0.3), method = "l_sizes")
df$groups <- groups

# Using group_factor() with l_starts
groups <- group_factor(
  df,
  list("cat", c("pig", 2), "human"),
  method = "l_starts",
  starts_col = "species"
)
df$groups <- groups
Splits data into partitions. Balances a given categorical variable and/or numerical variable between partitions and keeps (if possible) all data points with a shared ID (e.g. participant_id) in the same partition.
partition(
  data,
  p = 0.2,
  cat_col = NULL,
  num_col = NULL,
  id_col = NULL,
  id_aggregation_fn = sum,
  extreme_pairing_levels = 1,
  force_equal = FALSE,
  list_out = TRUE
)
data |
`data.frame`. Can be grouped, in which case the function is applied group-wise. |
p |
List or `vector` of partition sizes. Given as whole number(s) and/or percentage(s) (`0 < p < 1`).

E.g. `p = c(0.2, 0.3)` creates two partitions with 20% and 30% of the rows, with the remaining rows in a final partition. |
cat_col |
Name of categorical variable to balance between partitions. E.g. when training and testing a model for predicting a binary variable (a or b), we usually want both classes represented in both the training set and the test set. N.B. If also passing an `id_col`, `cat_col` should be constant within each ID. |
num_col |
Name of numerical variable to balance between partitions. N.B. When used with `id_col`, values in `num_col` are aggregated for each ID with `id_aggregation_fn` before the balancing. |
id_col |
Name of factor with IDs. Used to keep all rows that share an ID in the same partition (if possible). E.g. If we have measured a participant multiple times and want to see the effect of time, we want to have all observations of this participant in the same partition. |
id_aggregation_fn |
Function for aggregating values in `num_col` for each ID, before balancing `num_col`. N.B. Only used when `num_col` and `id_col` are both specified. |
extreme_pairing_levels |
How many levels of extreme pairing to do when balancing partitions by a numerical column (i.e. when `num_col` is specified).

Extreme pairing: Rows/pairs are ordered as smallest, largest, second smallest, second largest, etc. If `extreme_pairing_levels` > 1, this is done with the pairs (of pairs) as well, ordered by their sums in `num_col`.

N.B. Larger values work best with large datasets. If set too high, the result might not be stochastic. Always check if an increase actually makes the partitions more balanced. See the examples. |
force_equal |
Whether to discard excess data. (Logical) |
list_out |
Whether to return partitions in a `list`. (Logical)

N.B. When `data` is a grouped `data.frame`, the output is always a `data.frame` with a grouping factor. |
With `cat_col`:
`data` is subset by `cat_col`. Subsets are partitioned and merged.

With `id_col`:
Partitions are created from unique IDs.

With `num_col`:
Rows are shuffled. Note that this will only affect rows with the same value in `num_col`.
Extreme pairing 1: Rows are ordered as smallest, largest, second smallest, second largest, etc. Each pair gets a group identifier.
If `extreme_pairing_levels` > 1: The group identifiers are reordered as smallest, largest, second smallest, second largest, etc., by the sum of `num_col` in the represented rows. These pairs (of pairs) get a new set of group identifiers, and the process is repeated `extreme_pairing_levels` - 2 times. Note that the group identifiers at the last level will represent 2^`extreme_pairing_levels` rows, which is why you should be careful when choosing that setting.
The final group identifiers are shuffled, and their order is applied to the full dataset. The ordered dataset is split by the sizes in `p`.

N.B. When doing extreme pairing of an unequal number of rows, the row with the largest value is placed in a group by itself, and the order is instead: smallest, second largest, second smallest, third largest, ..., largest.

With `cat_col` and `id_col`:
`data` is subset by `cat_col`. Partitions are created from unique IDs in each subset. Subsets are merged.

With `cat_col` and `num_col`:
`data` is subset by `cat_col`. Subsets are partitioned by `num_col`. Subsets are merged.

With `num_col` and `id_col`:
Values in `num_col` are aggregated for each ID, using `id_aggregation_fn`. The IDs are partitioned, using the aggregated values as "num_col". The partition identifiers are transferred to the rows of the IDs.

With `cat_col`, `num_col`, and `id_col`:
Values in `num_col` are aggregated for each ID, using `id_aggregation_fn`. IDs are subset by `cat_col`. The IDs in each subset are partitioned, using the aggregated values as "num_col". The partition identifiers are transferred to the rows of the IDs.
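As a minimal sketch of the resulting partition sizes (assuming the default `list_out = TRUE`, no balancing columns, and that remaining rows form a final partition when `force_equal = FALSE`):

library(groupdata2)
set.seed(1)
df_small <- data.frame("x" = c(1:10))
partitions <- partition(df_small, p = c(0.2, 0.3))
# Partition sizes: 20% and 30% of the rows,
# plus the remaining rows in a final partition
sapply(partitions, nrow)  # 2 3 5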
If `list_out` is `TRUE`: A `list` of partitions, where each partition is a `data.frame`.

If `list_out` is `FALSE`: A `data.frame` with a grouping factor for subsetting.

N.B. When `data` is a grouped `data.frame`, the output is always a `data.frame` with a grouping factor.
Ludvig Renbo Olsen, [email protected]
Other grouping functions: `all_groups_identical()`, `collapse_groups()`, `collapse_groups_by`, `fold()`, `group()`, `group_factor()`, `splt()`
# Attach packages
library(groupdata2)
library(dplyr)

# Create data frame
df <- data.frame(
  "participant" = factor(rep(c("1", "2", "3", "4", "5", "6"), 3)),
  "age" = rep(sample(c(1:100), 6), 3),
  "diagnosis" = factor(rep(c("a", "b", "a", "a", "b", "b"), 3)),
  "score" = sample(c(1:100), 3 * 6)
)
df <- df %>% arrange(participant)
df$session <- rep(c("1", "2", "3"), 6)

# Using partition()

# Without balancing
partitions <- partition(data = df, p = c(0.2, 0.3))

# With cat_col
partitions <- partition(data = df, p = 0.5, cat_col = "diagnosis")

# With id_col
partitions <- partition(data = df, p = 0.5, id_col = "participant")

# With num_col
partitions <- partition(data = df, p = 0.5, num_col = "score")

# With cat_col and id_col
partitions <- partition(
  data = df,
  p = 0.5,
  cat_col = "diagnosis",
  id_col = "participant"
)

# With cat_col, num_col and id_col
partitions <- partition(
  data = df,
  p = 0.5,
  cat_col = "diagnosis",
  num_col = "score",
  id_col = "participant"
)

# Return data frame with grouping factor
# with list_out = FALSE
partitions <- partition(df, c(0.5), list_out = FALSE)

# Check if additional extreme_pairing_levels
# improve the numerical balance
set.seed(2) # try with seed 1 as well
partitions_1 <- partition(
  data = df,
  p = 0.5,
  num_col = "score",
  extreme_pairing_levels = 1,
  list_out = FALSE
)
partitions_1 %>%
  dplyr::group_by(.partitions) %>%
  dplyr::summarise(
    sum_score = sum(score),
    mean_score = mean(score)
  )

set.seed(2) # try with seed 1 as well
partitions_2 <- partition(
  data = df,
  p = 0.5,
  num_col = "score",
  extreme_pairing_levels = 2,
  list_out = FALSE
)
partitions_2 %>%
  dplyr::group_by(.partitions) %>%
  dplyr::summarise(
    sum_score = sum(score),
    mean_score = mean(score)
  )
Extract the standard deviations (default) from the "Summary" `data.frame` in the output of `summarize_balances()`, ordered by the `SD_rank` column.

See examples of usage in `summarize_balances()`.
ranked_balances(summary, measure = "SD")
summary |
The "Summary" `data.frame` from the output of `summarize_balances()`. Can also be the direct output `list` of `summarize_balances()`, in which case the "Summary" `data.frame` is extracted. |
measure |
The measure to extract rows for. One of: "mean", "median", "SD", "IQR", "min", and "max".

The most meaningful measures to consider as metrics of balance are `SD` and `IQR`.

NOTE: Ranks are of standard deviations and not affected by this argument. |
The rows in `summary` where the measure matches the `measure` argument (`"SD"` by default), ordered by the `SD_rank` column.
Ludvig Renbo Olsen, [email protected]
Other summarization functions: `summarize_balances()`, `summarize_group_cols()`
Divides data into groups by a wide range of methods. Splits data by these groups.
splt(
  data,
  n,
  method = "n_dist",
  starts_col = NULL,
  force_equal = FALSE,
  allow_zero = FALSE,
  descending = FALSE,
  randomize = FALSE,
  remove_missing_starts = FALSE
)
data |
`data.frame` or `vector`. When a grouped `data.frame`, the function is applied group-wise. |
n |
Depends on `method`. Number of groups (default), group size, list of group sizes, list of group starts, number of data points between group members, step size, or prime number to start at. See `method`.

Passed as whole number(s) and/or percentage(s) (`0 < n < 1`) and/or character(s). Method `"l_starts"` can also take `'auto'`. |
method |
Note: the example group sizes below are based on a `vector` with 57 elements.

greedy: Divides up the data greedily given a specified group size (e.g. 10, 10, 10, 10, 10, 7, with a group size of 10).

n_dist (default): Divides the data into a specified number of groups and distributes excess data points across groups (e.g. 11, 11, 12, 11, 12, with `n = 5`).

n_fill: Divides the data into a specified number of groups and fills up groups with excess data points from the beginning (e.g. 12, 12, 11, 11, 11, with `n = 5`).

n_last: Divides the data into a specified number of groups. It finds the most equal group sizes possible, using all data points. Only the last group is able to differ in size (e.g. 11, 11, 11, 11, 13, with `n = 5`).

n_rand: Divides the data into a specified number of groups. Excess data points are placed randomly in groups (max. 1 per group).

l_sizes: Divides up the data by a `list` of group sizes. Excess data points are placed in an extra group at the end (e.g. 11, 17, 29, with `n = list(0.2, 0.3)`).

l_starts: Starts new groups at specified values in the `starts_col` vector. `n` is a `list` of starting positions. To skip to a later appearance of a value, use `c(value, skip_to_number)` (see `group()` for examples). If passing `n = 'auto'`, the starting positions are detected automatically, such that a group starts whenever a value differs from the previous value (see `find_starts()`).

every: Combines every `n`th data point into a group (e.g. 19, 19, 19, with `n = 3`).

staircase: Uses step size to divide up the data. Group size increases with 1 step for every group, until there is no more data (e.g. 5, 10, 15, 20, 7, with a step size of 5).

primes: Uses prime numbers as group sizes. Group size increases to the next prime number until there is no more data (e.g. 5, 7, 11, 13, 17, 4, when starting at 5).
|
starts_col |
Name of column with values to match in method `"l_starts"` when `data` is a `data.frame`. Pass `'index'` to use row names. (Character) |
force_equal |
Create equal groups by discarding excess data points. Implementation varies between methods. (Logical) |
allow_zero |
Whether `n` can be passed as `0`. (Logical) |
descending |
Change the direction of the method. (Not fully implemented) (Logical) |
randomize |
Randomize the grouping factor. (Logical) |
remove_missing_starts |
Recursively remove elements from the list of starts that are not found. For method `"l_starts"` only. (Logical) |
`list` of the split `data`.

N.B. If `data` is a grouped `data.frame`, there's an outer `list` for each group. The names are based on the group indices (see `dplyr::group_indices()`).
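For instance, splitting a small vector should give one `list` element per group (a minimal sketch):

library(groupdata2)
v_list <- splt(c(1:7), n = 3, method = "n_dist")
length(v_list)  # 3 - one element per group
v_list[[1]]     # the first chunk of the vector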
Ludvig Renbo Olsen, [email protected]
Other grouping functions: `all_groups_identical()`, `collapse_groups()`, `collapse_groups_by`, `fold()`, `group()`, `group_factor()`, `partition()`
# Attach packages
library(groupdata2)
library(dplyr)

# Create data frame
df <- data.frame(
  "x" = c(1:12),
  "species" = factor(rep(c("cat", "pig", "human"), 4)),
  "age" = sample(c(1:100), 12)
)

# Using splt()
df_list <- splt(df, 5, method = "n_dist")
Summarize the balances of numeric, categorical, and ID columns in and between the groups of one or more group columns.

This tool allows you to quickly and thoroughly assess the balance of different columns between groups. This is for instance useful after creating groups with `fold()`, `partition()`, or `collapse_groups()` to check how well they did and to compare multiple groupings.

The output contains:

`Groups`: a summary per group (per grouping column).

`Summary`: statistical descriptors of the group summaries.

`Normalized Summary`: statistical descriptors of a set of "normalized" group summaries. (Disabled by default)

When comparing how balanced the grouping columns are, we can use the standard deviations of the group summary columns. The lower a standard deviation is, the more similar the groups are in that column. To quickly extract these standard deviations, ordered by an aggregated rank, use `ranked_balances()` on the "Summary" `data.frame` in the output.
summarize_balances(
  data,
  group_cols,
  cat_cols = NULL,
  num_cols = NULL,
  id_cols = NULL,
  summarize_size = TRUE,
  include_normalized = FALSE,
  rank_weights = NULL,
  cat_levels_rank_weights = NULL,
  num_normalize_fn = function(x) {
    rearrr::min_max_scale(
      x,
      old_min = quantile(x, 0.025),
      old_max = quantile(x, 0.975),
      new_min = 0,
      new_max = 1
    )
  }
)
data |
`data.frame` with group columns to summarize. Can be grouped (see `dplyr::group_by()`), in which case the function is applied group-wise. |
group_cols |
Names of columns with group identifiers to summarize columns in `data` by. |
cat_cols |
Names of categorical columns to summarize. Each categorical level is counted per group. To distinguish between levels with the same name from different `cat_cols`, the count columns are named after both the column and the level.

Normalization when `include_normalized` is enabled: the level counts are normalized with `log(1 + count)`. |
num_cols |
Names of numerical columns to summarize. For each column, the `mean` and `sum` are calculated per group.

Normalization when `include_normalized` is enabled: the columns are normalized with `num_normalize_fn` before summarization. |
id_cols |
Names of `factor` columns with IDs to summarize. The number of unique IDs is counted per group.

Normalization when `include_normalized` is enabled: the counts of unique IDs are normalized with `log(1 + count)`. |
summarize_size |
Whether to summarize the number of rows per group. |
include_normalized |
Whether to calculate and include the normalized summary in the output. |
rank_weights |
A named `vector` with weights for averaging the ranks of the summarized columns, one weight per summarized column. When summarizing size (see `summarize_size`), its weight can be given with the name "size". E.g. `c("size" = 1, "a_cat_col" = 2)`, where "a_cat_col" stands in for the name of a summarized column. |
cat_levels_rank_weights |
Weights for averaging the ranks of the categorical levels in `cat_cols`, given as a named weight per level. |
num_normalize_fn |
Function for normalizing the `num_cols` columns before summarization. Only used when `include_normalized` is enabled. |
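For instance, a plain min-max scaling could be passed instead of the default (a minimal sketch; `df_folded` is assumed to hold a `.folds` column and a `score` column as in the examples below):

# Hypothetical alternative normalization: scale to the full [0, 1]
# range (assumes the column is not constant)
summ <- summarize_balances(
  data = dplyr::ungroup(df_folded),
  group_cols = ".folds",
  num_cols = "score",
  include_normalized = TRUE,
  num_normalize_fn = function(x) (x - min(x)) / (max(x) - min(x))
)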
`list` with two/three `data.frame`s:

Groups: A summary per group.
`cat_cols`: Each level has its own column with the count of the level per group.
`num_cols`: The `mean` and `sum` per group.
`id_cols`: The count of unique IDs per group.

Summary: Statistical descriptors of the columns in `Groups`. Contains the `mean`, `median`, standard deviation (`SD`), interquartile range (`IQR`), `min`, and `max` measures. Especially the standard deviation and IQR measures can tell us how balanced the groups are. When comparing multiple `group_cols`, the group column with the lowest `SD` and `IQR` can be considered the most balanced.

Normalized Summary: (Disabled by default) Same statistical descriptors as in `Summary`, but for a "normalized" version of the group summaries. The motivation is that these normalized measures can more easily be compared or combined into a single "balance score".

First, we normalize each balance column:

`cat_cols`: The level counts in the original group summaries are normalized with `log(1 + count)`. This eases comparison of the statistical descriptors (especially standard deviations) of levels with very different count scales.

`num_cols`: The numeric columns are normalized prior to summarization by group, using the `num_normalize_fn` function. By default this applies MinMax scaling to columns such that ~95% of the values are expected to be in the `[0, 1]` range.

`id_cols`: The counts of unique IDs in the original group summaries are normalized with `log(1 + count)`.

Then, the normalized group summaries are described with the same `mean`, `median`, standard deviation (`SD`), interquartile range (`IQR`), `min`, and `max` measures.
Ludvig Renbo Olsen, [email protected]
Other summarization functions: `ranked_balances()`, `summarize_group_cols()`
# Attach packages
library(groupdata2)
library(dplyr)

set.seed(1)

# Create data frame
df <- data.frame(
  "participant" = factor(rep(c("1", "2", "3", "4", "5", "6"), 3)),
  "age" = rep(sample(c(1:100), 6), 3),
  "diagnosis" = factor(rep(c("a", "b", "a", "a", "b", "b"), 3)),
  "score" = sample(c(1:100), 3 * 6)
)
df <- df %>% arrange(participant)
df$session <- rep(c("1", "2", "3"), 6)

# Using fold()

## Without balancing
set.seed(1)
df_folded <- fold(data = df, k = 3)

# Check the balances of the various columns
# As we have not used balancing in `fold()`,
# we should not expect it to be amazingly balanced
df_folded %>%
  dplyr::ungroup() %>%
  summarize_balances(
    group_cols = ".folds",
    num_cols = c("score", "age"),
    cat_cols = "diagnosis",
    id_cols = "participant"
  )

## With balancing
set.seed(1)
df_folded <- fold(
  data = df,
  k = 3,
  cat_col = "diagnosis",
  num_col = 'score',
  id_col = 'participant'
)

# Now the balance should be better,
# although it may be difficult to get a good balance
# of the 'score' column when also balancing on 'diagnosis'
# and keeping all rows per participant in the same fold
df_folded %>%
  dplyr::ungroup() %>%
  summarize_balances(
    group_cols = ".folds",
    num_cols = c("score", "age"),
    cat_cols = "diagnosis",
    id_cols = "participant"
  )

# Comparing multiple grouping columns
# Create 3 fold columns that only balance "score"
set.seed(1)
df_folded <- fold(
  data = df,
  k = 3,
  num_fold_cols = 3,
  num_col = 'score'
)

# Summarize all three grouping cols at once
(summ <- df_folded %>%
  dplyr::ungroup() %>%
  summarize_balances(
    group_cols = paste0(".folds_", 1:3),
    num_cols = c("score")
  )
)

# Extract the across-group standard deviations
# The group column with the lowest standard deviation(s)
# is the most balanced group column
summ %>% ranked_balances()
Get the following summary statistics for each group column:
Number of groups
Mean, median, std., IQR, min, and max number of rows per group.
The output can be given in either long (default) or wide format.
summarize_group_cols(data, group_cols, long = TRUE)
data |
`data.frame` with the group columns (`factor`s) to summarize. |
group_cols |
Names of columns to summarize. These columns must be factors in `data`. |
long |
Whether the output should be in long or wide format. |
Data frame (`tibble`) with summary statistics for each column in `group_cols`.
Ludvig Renbo Olsen, [email protected]
Other summarization functions: `ranked_balances()`, `summarize_balances()`
# Attach packages
library(groupdata2)

# Create data frame
df <- data.frame(
  "some_var" = runif(25),
  "grp_1" = factor(sample(1:5, size = 25, replace = TRUE)),
  "grp_2" = factor(sample(1:8, size = 25, replace = TRUE)),
  "grp_3" = factor(sample(LETTERS[1:3], size = 25, replace = TRUE)),
  "grp_4" = factor(sample(LETTERS[1:12], size = 25, replace = TRUE))
)

# Summarize the group columns (long format)
summarize_group_cols(
  data = df,
  group_cols = paste0("grp_", 1:4),
  long = TRUE
)

# Summarize the group columns (wide format)
summarize_group_cols(
  data = df,
  group_cols = paste0("grp_", 1:4),
  long = FALSE
)
Uses random upsampling to fix the group sizes to the largest group in the data frame. Wraps `balance()`.
upsample(
  data,
  cat_col,
  id_col = NULL,
  id_method = "n_ids",
  mark_new_rows = FALSE,
  new_rows_col_name = ".new_row"
)
data |
`data.frame`. Can be grouped, in which case the function is applied group-wise. |
cat_col |
Name of categorical variable to balance by. (Character) |
id_col |
Name of factor with IDs. (Character) IDs are considered entities, e.g. allowing us to add or remove all rows for an ID. How this is used is up to the `id_method`.

E.g. if we have measured a participant multiple times and want to make sure that we keep all these measurements, we would either remove/add all measurements for the participant or leave in all measurements for the participant. |
id_method |
Method for balancing the IDs. (Character)

n_ids (default): Balances on ID level only. It makes sure there are the same number of IDs for each category. This might lead to a different number of rows between categories.

n_rows_c: Attempts to level the number of rows per category, while only removing/adding entire IDs. This is done in 2 steps (see `balance()` for the full procedure).

distributed: Distributes the lacking/excess rows equally between the IDs. If the number to distribute can not be equally divided, some IDs will have 1 row more/less than the others.

nested: Calls `balance()` on each category separately with the `"n_rows_c"` ID method (see `balance()` for details). |
mark_new_rows |
Add column with `1`s for added rows and `0`s for the original rows. (Logical) |
new_rows_col_name |
Name of column marking new rows. Defaults to `".new_row"`. |
Without `id_col`: Upsampling is done with replacement for added rows, while the original data remains intact.

With `id_col`: See the `id_method` description.
`data.frame` with added rows. Ordered by potential grouping variables, `cat_col` and (potentially) `id_col`.
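As a minimal sketch of the effect on the class counts (using the `df` from the examples below):

# Class counts before and after upsampling
table(df$diagnosis)                                 # 0: 10, 1: 3
df_balanced <- upsample(df, cat_col = "diagnosis")
table(df_balanced$diagnosis)                        # both classes at 10 rows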
Ludvig Renbo Olsen, [email protected]
Other sampling functions: `balance()`, `downsample()`
# Attach packages
library(groupdata2)

# Create data frame
df <- data.frame(
  "participant" = factor(c(1, 1, 2, 3, 3, 3, 3, 4, 4, 5, 5, 5, 5)),
  "diagnosis" = factor(c(0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0)),
  "trial" = c(1, 2, 1, 1, 2, 3, 4, 1, 2, 1, 2, 3, 4),
  "score" = sample(c(1:100), 13)
)

# Using upsample()
upsample(df, cat_col = "diagnosis")

# Using upsample() with id_method "n_ids"
# With column specifying added rows
upsample(df,
  cat_col = "diagnosis",
  id_col = "participant",
  id_method = "n_ids",
  mark_new_rows = TRUE
)

# Using upsample() with id_method "n_rows_c"
# With column specifying added rows
upsample(df,
  cat_col = "diagnosis",
  id_col = "participant",
  id_method = "n_rows_c",
  mark_new_rows = TRUE
)

# Using upsample() with id_method "distributed"
# With column specifying added rows
upsample(df,
  cat_col = "diagnosis",
  id_col = "participant",
  id_method = "distributed",
  mark_new_rows = TRUE
)

# Using upsample() with id_method "nested"
# With column specifying added rows
upsample(df,
  cat_col = "diagnosis",
  id_col = "participant",
  id_method = "nested",
  mark_new_rows = TRUE
)