Regression Calibration with the NCI Method

Amer Moosa

1/17/2024

The output of the NCI method can be used to perform statistical procedures for measurement error correction such as regression calibration.

This vignette assumes some familiarity with the NCI method and the ncimultivar package. A more in-depth overview of the basic NCI method procedures can be found in the daily nutrient vignette.

This example will demonstrate the workflow of a basic regression calibration analysis:

  1. Fit a measurement error model and simulate usual intakes using the NCI method
  2. Fit models of outcome variables using the true intake distribution as the predictor
  3. Calculate confidence intervals for the model parameters

The standard errors for the model parameters will be calculated using balanced repeated replication (BRR).

While this example demonstrates regression using base R, it is recommended to use a dedicated function for survey data such as from the ‘survey’ R package.

library(ncimultivar)

The dataset used for this analysis is derived from the 2005-2010 NHANES data. A subset of six strata (SDMVSTRA) will be used to reduce computation time and allow this example to run in real time.

The nutrient being analyzed is potassium (TPOTA).

The covariates being examined are smoking status (SMK_REC), age (RIDAGEYR), and sex (RIAGENDR). The two nuisance covariates are whether the recall was on a weekend (Weekend) and and whether the recall is on day 2 (Day2).

The WTDRD1 variable is the base weighting for each observation.

The outcome variables are systolic blood pressure (BPSY_AVG) and hypertension (HTN_BIN). A linear model will be fit for systolic blood pressure and a logistic model will be fit for hypertension.

Subjects with missing values are removed, and categorical variables are transformed into binary indicators.

#subset data
input.dataset <- nhcvd[nhcvd$SDMVSTRA %in% c(48, 54, 60, 66, 72, 78),]

#Define indicator for Day 2
input.dataset$Day2 <- (input.dataset$DAY == 2)

#remove subjects that are missing any covariates, variables, or outcomes
missing.covariates <- is.na(input.dataset$SMK_REC) | is.na(input.dataset$RIDAGEYR) | is.na(input.dataset$RIAGENDR) |
                      is.na(input.dataset$Weekend) | is.na(input.dataset$Day2)
missing.variables <- is.na(input.dataset$TPOTA)
missing.outcomes <- is.na(input.dataset$BPSY_AVG) | is.na(input.dataset$HTN_BIN)

input.dataset <- input.dataset[!missing.covariates & !missing.variables & !missing.outcomes,]

#break down smoking status into binary indicators
input.dataset$Current.Smoker <- as.numeric(input.dataset$SMK_REC == 1)
input.dataset$Former.Smoker <- as.numeric(input.dataset$SMK_REC == 2)
input.dataset$Never.Smoker <- as.numeric(input.dataset$SMK_REC == 3)

#rename potassium variable for readability
input.dataset$Potassium <- input.dataset$TPOTA

BRR weights will be added to the dataset.

#generate BRR weights
fay.factor <- 0.7

input.dataset <- brr_weights(input.data=input.dataset,
                             id="SEQN",
                             strata="SDMVSTRA",
                             psu="SDMVPSU",
                             cell="PSCELL",
                             weight="WTDRD1",
                             fay.factor=fay.factor)

The potassium variable can now be transformed and standardized for use in the MCMC algorithm.

#Winsorize extreme values of potassium intake
wins.potassium <- boxcox_survey(input.data=input.dataset,
                                row.subset=(input.dataset$Day2 == 0),
                                variable="Potassium",
                                weight="RepWt_0",
                                do.winsorization=TRUE,
                                id="SEQN",
                                repeat.obs="DAY")
#> Outliers and Suggested Winsorized values for Potassium
#>   SEQN DAY Potassium Potassium.winsorized
#>  57963   2         0             42.45263

input.dataset$Potassium <- pmax(input.dataset$Potassium, 42.45263)

#run Box-Cox survey and create Box-Cox lambda data using the first recall
boxcox.lambda.data <- boxcox_survey(input.data=input.dataset,
                                    row.subset=(input.dataset$Day2 == 0),
                                    variable="Potassium",
                                    covariates=c("Current.Smoker", "Former.Smoker", "RIDAGEYR", "RIAGENDR", "Weekend"),
                                    weight="RepWt_0")

#Calculate minimum amount of potassium intake in the first recall
minimum.amount.data <- calculate_minimum_amount(input.data=input.dataset,
                                                row.subset=(input.dataset$Day2 == 0),
                                                daily.variables="Potassium")

#Run MCMC pre-preprocessor
pre.mcmc.data <- nci_multivar_preprocessor(input.data=input.dataset,
                                           daily.variables="Potassium",
                                           continuous.covariates="RIDAGEYR",
                                           boxcox.lambda.data=boxcox.lambda.data,
                                           minimum.amount.data=minimum.amount.data)

The MCMC measurement error model can now be fit for all BRR replicates. The outcome variable is not involved in the measurement error model.

To perform regression calibration, draws of the U matrix conditional on the mean MCMC parameters must be taken. This is accomplished using the num.post parameter in the nci_multivar_mcmc() function. Each conditional U matrix draw will be used in place of a simulated U matrix in nci_multivar_distrib() to simulate a usual intake for each subject. To ensure that enough data is generated to produce good estimates of the model parameters, 500 conditional U matrices will be drawn in order to simulate 500 usual intakes for each subject.

num.brr <- 8

mcmc.brr <- vector(mode="list", length=num.brr + 1)

for(brr.rep in 0:num.brr) {
  
  print(paste0("Starting Iteration ", brr.rep))
  
  mcmc.brr[[brr.rep + 1]] <- nci_multivar_mcmc(pre.mcmc.data=pre.mcmc.data,
                                               id="SEQN",
                                               repeat.obs="DAY",
                                               weight=paste0("RepWt_", brr.rep),
                                               daily.variables="Potassium",
                                               default.covariates=c("Current.Smoker", "Former.Smoker", "std.RIDAGEYR", "RIAGENDR", "Day2", "Weekend"),
                                               num.mcmc.iterations=3000,
                                               num.burn=1000,
                                               num.post=500,
                                               num.thin=2,
                                               mcmc.seed=(9999 + brr.rep))
}
#> [1] "Starting Iteration 0"
#> [1] "Starting Iteration 1"
#> [1] "Starting Iteration 2"
#> [1] "Starting Iteration 3"
#> [1] "Starting Iteration 4"
#> [1] "Starting Iteration 5"
#> [1] "Starting Iteration 6"
#> [1] "Starting Iteration 7"
#> [1] "Starting Iteration 8"

A dataset with simulated usual intakes for each subject can now be created using nci_multivar_distrib(). This dataset represents the conditional expectation of usual intake given the observed data for each subject to be used in the regression calibration procedure. It is not a prediction or calculation of the true usual intake for individuals.

The population-based dataset is created the same way as for the daily nutrient analysis.

#get first instance of each subject
mcmc.input.data <- pre.mcmc.data$mcmc.input
distrib.pop <- mcmc.input.data[!duplicated(mcmc.input.data$SEQN),]
num.subjects <- nrow(distrib.pop)

#Set Day 2 to zero to factor out the effect of Day 2 recalls
distrib.pop$Day2 <- 0

#create repeats of each subject for weekday and weekend consumption
distrib.pop <- distrib.pop[rep(seq_len(num.subjects), each=2),]
distrib.pop$Weekend <- rep(c(0,1), num.subjects)
distrib.pop$Weekend.Weight <- rep(c(4,3), num.subjects)

Each subject will have 500 simulated usual intakes from the 500 conditional U matrix draws taken from the output of nci_multivar_mcmc(). In order to tell nci_multivar_distrib() to use the conditional U matrix draws instead of simulating U matrices, the use.mcmc.u.matrices parameter must be set to TRUE.

The additional.output parameter is used to include the outcome variable and covariates in the distribution dataset to use in the regression. The variables supplied in additional.output will be passed through from the population base dataset.

To perform regression calibration, the 500 simulated intakes for each subject need to be averaged. The mean simulated intakes are then used as a covariate in the regression models. The effect per milligram of potassium is very small, so the potassium usual intake will be converted to grams so that the coefficient is more readable.

The coefficients of the models are saved for each BRR replicate.

summary.brr <- vector(mode="list", length=num.brr + 1)

for(brr.rep in 0:num.brr) {
  
  print(paste0("Starting Iteration ", brr.rep))

  #simulate usual intakes for each subject using MCMC U matrices
  distrib.data <- nci_multivar_distrib(multivar.mcmc.model=mcmc.brr[[brr.rep + 1]],
                                       distrib.population=distrib.pop,
                                       id="SEQN",
                                       weight=paste0("RepWt_", brr.rep),
                                       nuisance.weight="Weekend.Weight",
                                       additional.output=c("Current.Smoker", "Former.Smoker", "RIDAGEYR", "RIAGENDR", "BPSY_AVG", "HTN_BIN"),
                                       use.mcmc.u.matrices=TRUE,
                                       distrib.seed=99999+brr.rep)

  #take average usual intake per subject
  regression.data <- aggregate(distrib.data[,"usual.intake.Potassium",drop=FALSE], 
                               by=distrib.data[,c("SEQN", paste0("RepWt_", brr.rep), "Current.Smoker", "Former.Smoker", "RIDAGEYR", "RIAGENDR", 
                                                  "BPSY_AVG", "HTN_BIN")], 
                               mean)
  
  #scale down simulated potassium intake by 1000 to show the effect per 1,000 mg of potassium
  regression.data$usual.intake.Potassium <- regression.data$usual.intake.Potassium/1000
  
  #rescale weights so that models can converge
  regression.data[,paste0("RepWt_", brr.rep)] <- regression.data[,paste0("RepWt_", brr.rep)]/mean(regression.data[,paste0("RepWt_", brr.rep)])
  
  #fit linear model of systolic blood pressure against the average simulated potassium intakes
  bp.model <- lm(BPSY_AVG ~ usual.intake.Potassium + Current.Smoker + Former.Smoker + RIDAGEYR + RIAGENDR, 
                 data=regression.data, 
                 weights=regression.data[,paste0("RepWt_", brr.rep)])
  
  bp.parameters <- summary_coefficients(model=bp.model)
  
  #fit logistic model of diagnosed hypertension against the average simulated potassium intakes
  #family is specified as quasibinomial to handle non-integer weights
  htn.model <- glm(HTN_BIN ~ usual.intake.Potassium + Current.Smoker + Former.Smoker + RIDAGEYR + RIAGENDR,
                   data=regression.data, 
                   weights=regression.data[,paste0("RepWt_", brr.rep)],
                   family=quasibinomial(link="logit"))
  
  htn.parameters <- summary_coefficients(model=htn.model)
  
  summary.brr[[brr.rep + 1]] <- rbind(bp.parameters, 
                                      htn.parameters)
}
#> [1] "Starting Iteration 0"
#> [1] "Starting Iteration 1"
#> [1] "Starting Iteration 2"
#> [1] "Starting Iteration 3"
#> [1] "Starting Iteration 4"
#> [1] "Starting Iteration 5"
#> [1] "Starting Iteration 6"
#> [1] "Starting Iteration 7"
#> [1] "Starting Iteration 8"

The BRR standard error and confidence intervals can be calculated as usual. The BRR replicate weights in this dataset used a Fay factor of 0.7, so this must be accounted for in calculating the variance. Other datasets may use different Fay factors when generating BRR replicate weights, so it is important to verify it before calculating variances.

#calculate degrees of freedom
df <- length(unique(input.dataset$SDMVSTRA))

#extract point estimate and BRR replicates
point <- summary.brr[[1]]$value
reps <- vapply(summary.brr[-1], function(brr.i) brr.i$value, point)

#calculate BRR standard error
std.error <- sqrt(rowSums((reps - point)^2)/(num.brr*fay.factor^2))

#95% confidence intervals
confidence.lower <- point + qt(0.025, df)*std.error
confidence.upper <- point + qt(0.975, df)*std.error

#create summary report
summary.report <- data.frame(summary.brr[[1]], 
                             std.error=std.error,
                             confidence.lower=confidence.lower,
                             confidence.upper=confidence.upper)

summary.report
#>    population variable                              statistic        value
#> 1         All BPSY_AVG            Coefficient for (Intercept) 113.05108719
#> 2         All BPSY_AVG Coefficient for usual.intake.Potassium  -2.35735666
#> 3         All BPSY_AVG         Coefficient for Current.Smoker  -0.59002768
#> 4         All BPSY_AVG          Coefficient for Former.Smoker   0.28047403
#> 5         All BPSY_AVG               Coefficient for RIDAGEYR   0.35397544
#> 6         All BPSY_AVG               Coefficient for RIAGENDR  -6.51452720
#> 7         All  HTN_BIN            Coefficient for (Intercept)  -2.88269021
#> 8         All  HTN_BIN Coefficient for usual.intake.Potassium  -0.22628765
#> 9         All  HTN_BIN         Coefficient for Current.Smoker   0.60912325
#> 10        All  HTN_BIN          Coefficient for Former.Smoker   0.24351942
#> 11        All  HTN_BIN               Coefficient for RIDAGEYR   0.04354076
#> 12        All  HTN_BIN               Coefficient for RIAGENDR  -0.36279833
#>      std.error confidence.lower confidence.upper
#> 1  4.732255653     101.47167475     124.63049963
#> 2  1.221397873      -5.34600959       0.63129627
#> 3  0.990866015      -3.01458947       1.83453412
#> 4  0.539864757      -1.04052744       1.60147550
#> 5  0.016996676       0.31238607       0.39556481
#> 6  1.117872687      -9.24986313      -3.77919127
#> 7  0.465241367      -4.02109483      -1.74428560
#> 8  0.131607955      -0.54832072       0.09574542
#> 9  0.231068624       0.04371870       1.17452781
#> 10 0.230022847      -0.31932621       0.80636505
#> 11 0.004818188       0.03175108       0.05533045
#> 12 0.218058347      -0.89636788       0.17077122