% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/boxcox_survey.R
\name{boxcox_survey}
\alias{boxcox_survey}
\title{Select the best Box-Cox lambda parameter for a variable}
\usage{
boxcox_survey(
  input.data,
  row.subset = NULL,
  variable,
  id,
  repeat.obs,
  lambda.start = 0,
  lambda.increment = 0.01,
  num.lambdas = 101,
  covariates = NULL,
  weight = NULL,
  do.winsorization = FALSE,
  print.winsorization = TRUE,
  is.episodic = FALSE,
  do.influential = FALSE,
  print.influential = TRUE,
  iqr.multiple = 3,
  influential.alpha = 2.342729e-06,
  multiple.test = "none"
)
}
\arguments{
\item{input.data}{A data frame.}

\item{row.subset}{Logical vector of length \code{nrow(input.data)}
indicating which rows of \code{input.data} to use for selecting the best lambda.}

\item{variable}{Variable to transform.}

\item{id}{Variable that identifies each subject. Required only for
supplemental reports.}

\item{repeat.obs}{Variable that distinguishes repeat observations for each
subject. Required only for supplemental reports.}

\item{lambda.start}{Minimum lambda value in the search grid.}

\item{lambda.increment}{Spacing between lambda values in the search grid.}

\item{num.lambdas}{Number of lambda values in the search grid.}

\item{covariates}{Vector of covariates used to select the best lambda.}

\item{weight}{Variable with weighting for each subject.}

\item{do.winsorization}{Generate suggested Winsorization report? (default =
\code{FALSE})}

\item{print.winsorization}{Print suggested Winsorization report to the
console? (default = \code{TRUE})}

\item{is.episodic}{Is the variable episodic? Episodic variables have a
substantial number of zero observations due to not being continuously
observed. Required only for suggested Winsorization report. (default =
\code{FALSE})}

\item{do.influential}{Generate influential subject report? (default =
\code{FALSE})}

\item{print.influential}{Print influential subject report to the console?
(default = \code{TRUE})}

\item{iqr.multiple}{Multiple of the interquartile range of the Box-Cox
transformed variable. This sets the distance away that an observation must
be from the 25th or 75th percentiles to be considered an outlier. Has no
effect if the suggested Winsorization report is not generated. (default =
\code{3})}

\item{influential.alpha}{The F-test p-value threshold that a subject must be
under to be considered influential. Has no effect if the influential
subject report is not generated. See the "Influential subjects" section for
details. (default = \code{0.000002342729})}

\item{multiple.test}{The type of multiple testing correction used to adjust
\code{influential.alpha}. The options are the same as for
\code{\link[stats:p.adjust]{stats::p.adjust()}}. Has no effect if the
influential subject report is not generated. (default = \code{"none"})}
}
\value{
A data frame with the following columns:
\itemize{
\item variable: Name of the variable that was transformed.
\item tran_lambda: The lambda value whose Box-Cox transformation of the variable most closely resembles a normal distribution.

The following attribute is present if \code{do.winsorization} is \code{TRUE}:
\item winsorization.report: A data frame of outlier observations:
\itemize{
\item \code{id}: The unique identifier for each subject.
\item \code{repeat.obs}: Distinguishes repeated observations for the same subject.
\item \code{variable}: The value of the outlier on the original scale.
\item \code{variable}.winsorized: The suggested value to Winsorize the outlier value to.
}

The following attribute is present if \code{do.influential} is \code{TRUE}:
\item influential.subject.report: A data frame of subjects with influential within-subject variances:
\itemize{
\item \code{id}: The unique identifier for each subject.
\item p: The p-value of the F-test that identified the subject's variance as influential.
\item \code{variable}.1 - \code{variable}.k: One column for each of the k unique values of \code{repeat.obs} containing values of \code{variable} for each observation.
}
}
}
\description{
Searches a specified grid of lambda values for the Box-Cox
transformation that makes the variable most similar to a normal
distribution. This function can also produce supplemental reports:
\itemize{
\item Suggested Winsorization: a report of outlier observations and suggested Winsorized values
\item Influential subjects: a report of subjects that have extreme variance among repeated observations
}
}
\section{Lambda search}{


The best lambda value is the one whose transformation minimizes the sum of
squared errors (SSE) between the actual 1st to 99th percentiles of the
transformed variable and the corresponding expected percentiles of a normal
distribution. Using the 1st to 99th percentiles excludes extreme values and
makes the selection of lambda less susceptible to outliers.
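
For reference, the textbook Box-Cox transformation of a positive value
\eqn{x} is given below. This is the standard definition rather than a
statement of the internal implementation, and the symbol \eqn{g} is
introduced here only for illustration:

\deqn{g(x; \lambda) = \frac{x^{\lambda} - 1}{\lambda} \quad (\lambda \neq 0), \qquad g(x; 0) = \log(x)}{g(x; lambda) = (x^lambda - 1) / lambda for lambda != 0, and g(x; 0) = log(x)}

Assuming the grid is built as \code{lambda.start} plus successive multiples
of \code{lambda.increment}, the default arguments search lambda values from
0 to 1 in steps of 0.01.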
}

\section{Suggested Winsorization}{


Outlier detection is done on the Box-Cox transformed scale using the
selected lambda value to ensure that the data is as close to normal as
possible. Outliers are observations that fall more than a specified
multiple (default: 3) of the interquartile range below the 25th percentile
or above the 75th percentile.
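
Written out, with \eqn{Q_{1}}{Q1} and \eqn{Q_{3}}{Q3} denoting the 25th and
75th percentiles of the transformed variable and \eqn{k} denoting
\code{iqr.multiple} (notation introduced here for illustration), a
transformed observation \eqn{y} is treated as an outlier when

\deqn{y < Q_{1} - k (Q_{3} - Q_{1}) \quad \mathrm{or} \quad y > Q_{3} + k (Q_{3} - Q_{1})}{y < Q1 - k * (Q3 - Q1) or y > Q3 + k * (Q3 - Q1)}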
}

\section{Influential subjects}{


Detection of influential subjects is done on the Box-Cox transformed scale
using the selected lambda value to ensure that the data is as close to
normal as possible. Influential subjects are found by performing an F-test
on the variance of each subject's observations against the mean of the
variances of the other subjects' observations. When each subject has 2
observations, the default alpha for the F-test corresponds to flagging
subjects whose difference between observations lies more than 3 times the
interquartile range of the differences below the 25th percentile or above
the 75th percentile of the differences. Multiple testing correction
(e.g., Bonferroni, Benjamini-Hochberg) is also available.
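
As a rough check of the default alpha, assume the differences are
approximately normal and the number of subjects is large. The 75th
percentile of a standard normal distribution is about 0.67449 and its
interquartile range is about 1.34898, so the upper cutoff sits about
0.67449 + 3 * 1.34898 = 4.72143 standard deviations from the center. The
two-sided tail probability beyond that cutoff is

\deqn{\alpha = 2 \, (1 - \Phi(4.72143)) \approx 2.34 \times 10^{-6}}{alpha = 2 * (1 - Phi(4.72143)) ~= 2.34e-6}

where \eqn{\Phi}{Phi} is the standard normal distribution function. This
agrees with the default value of \code{influential.alpha}.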
}

\examples{
#subset NHANES data
nhanes.subset <- nhcvd[nhcvd$SDMVSTRA \%in\% c(48, 60, 72),]

#daily variable
boxcox.sodium <- boxcox_survey(input.data=nhanes.subset,
                               row.subset=(nhanes.subset$DAY == 1),
                               variable="TSODI",
                               id="SEQN",
                               repeat.obs="DAY",
                               weight="WTDRD1",
                               do.winsorization=TRUE,
                               iqr.multiple=2,
                               do.influential=TRUE,
                               influential.alpha=0.005)
boxcox.sodium

#episodic variable
boxcox.g.whole <- boxcox_survey(input.data=nhanes.subset,
                                row.subset=(nhanes.subset$DAY == 1),
                                variable="G_WHOLE",
                                is.episodic=TRUE,
                                id="SEQN",
                                repeat.obs="DAY",
                                weight="WTDRD1",
                                do.winsorization=TRUE,
                                iqr.multiple=2,
                                do.influential=TRUE,
                                influential.alpha=0.005)
boxcox.g.whole
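
#illustrative sketch of the lambda grid search (base R only)
#This is NOT the package's internal code. It is an unweighted, covariate-free
#approximation of the search described in the "Lambda search" section, shown
#only to make the idea concrete. The helper name sketch_lambda_search() and
#the simulated data are hypothetical. It also assumes that the "expected"
#normal percentiles are obtained by a least-squares fit of location and scale.
sketch_lambda_search <- function(x, lambdas = seq(0, 1, by = 0.01)) {
  probs <- seq(0.01, 0.99, by = 0.01)  #1st to 99th percentiles
  z <- qnorm(probs)                    #standard normal percentiles
  sse <- vapply(lambdas, function(lambda) {
    #textbook Box-Cox transformation
    y <- if (lambda == 0) log(x) else (x^lambda - 1) / lambda
    q <- quantile(y, probs = probs, names = FALSE)
    #SSE between observed percentiles and the best-fitting normal percentiles
    sum(resid(lm(q ~ z))^2)
  }, numeric(1))
  lambdas[which.min(sse)]
}

set.seed(1)
sketch.lambda <- sketch_lambda_search(rlnorm(500))
sketch.lambda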
}
