Announcing a major update to merTools

merTools is an R package designed to make working with multilevel models from lme4, particularly large models with many random effects, fast and easy. With merTools you can generate prediction intervals that incorporate various components of uncertainty (fixed effects, random effects, and model uncertainty); you can get the expected rank of individual random effect levels (a combination of the magnitude and precision of the estimate); and you can interactively explore the substantive effects of variables in the model using a Shiny application.

Recently, we've updated the package to significantly improve performance and accuracy. You can get it on CRAN now. 

Below are the release notes. To learn more, check out the package development on GitHub. You can also read a previous blog entry discussing the package and its uses.

merTools 0.3.0

  • Improve handling of formulas. If the original merMod has functions specified in the formula, the draw and wiggle functions will check for this and attempt to respect these variable transformations. Where this is not possible a warning will be issued. Most common transformations are respected as long as the original variable is passed untransformed to the model.
  • Change the calculation of the residual variance. Previously, residual variance was used to inflate both the variance around the fixed parameters and the variance around the predicted values themselves. This was incorrect and resulted in overly conservative estimates. Now the residual variance is appropriately used only around the final predictions.
  • New option for predictInterval that allows the user to return the full interval, the fixed component, the random component, or the fixed and each random component separately for each observation
  • Fixed a bug with slope+intercept random terms that caused a miscalculation of the random component
  • Add comparison to rstanarm to the Vignette
  • Make expectedRank output more tidy-like and allow the function to calculate expected rank for all terms at once
    • Note, this breaks the API by changing the names of the columns in the output of this function
  • Remove tests that test for timing to avoid issues with R-devel JIT compiler
  • Remove plyr and replace with dplyr
  • Fix issue #62 varList will now throw an error if == is used instead of =
  • Fix issue #54 predictInterval did not include random effects in calculations when newdata had more than 1000 rows and/or the user specified parallel=TRUE. Note: the fix was to disable the .paropts option for predictInterval ... the user can still specify it for temporary backward compatibility but this should be either removed or fixed in the permanent solution.
  • Fix issue #53 about problems with predictInterval when only specific levels of a grouping factor are in newdata with the colon specification of interactions
  • Fix issue #52 where the ICC was calculated incorrectly ... we just needed to square the standard deviations that we pulled
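
The ICC fix amounts to using variances rather than standard deviations in the ratio. A minimal sketch of the corrected calculation, using lme4's bundled sleepstudy data rather than anything from the changelog:

```r
library(lme4)

# Random-intercept model: subjects as the grouping factor
fm <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)

# as.data.frame(VarCorr()) returns both sdcor (SDs) and vcov (variances);
# the ICC must use the variances, i.e. the squared standard deviations
vc <- as.data.frame(VarCorr(fm))
icc <- vc$vcov[vc$grp == "Subject"] / sum(vc$vcov)
icc
```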

Introducing the R Data Science Livestream

Have you ever watched a livestream? Have you ever wondered what the actual minute-to-minute of doing data science looks like? Do you wonder if other R users have the same frustrations as you? If yes -- then read on!

I'm off on a new professional adventure where I am doing public-facing work for the first time in years. While working at home the other day, I thought that recording myself working with my webcam would be a great way to keep myself on task and document my decisions. Then I thought, why stop there -- why not livestream my work?

And thus, the R Data Science Livestream was born. The idea is that every day for an hour or two I will livestream myself doing some data science tasks related to my current project -- which is to analyze dozens of years of FBI Uniform Crime reports (read more). I haven't done much R coding in the last 4 months, so it's also a good way to shake off the rust of being out of the game for so long.

So if you are at all interested or curious why someone would do this, check out the landing page I put up to document the project and if you are really curious, maybe even tune in or watch the archives on YouTube!


Explore multilevel models faster with the new merTools R package

Update 1: Package is now available on CRAN

Update 2: Development version also supports models from the blme package.

By far the most popular posts on this blog are my two tutorials on how to fit and use multilevel models in R. Since publishing those tutorials I have received numerous questions, comments, and hits to this blog from readers looking for more information about multilevel models in R. Since my day job involves fitting and exploring multilevel models, as well as explaining them to a non-technical audience, I began working with my colleague Carl Frederick on an R package to make these tasks easier. Today, I'm happy to announce that our solution, merTools, is now available. Below, I reproduce the package README file, but you can find out more on GitHub. There are two extensive vignettes that describe how to make use of the package, as well as a Shiny app that allows interactive model exploration. The package should be available on CRAN within the next few days.

Working with generalized linear mixed models (GLMM) and linear mixed models (LMM) has become increasingly easy with advances in the lme4 package. As we have found ourselves using these models more and more within our work, we, the authors, have developed a set of tools for simplifying and speeding up common tasks for interacting with merMod objects from lme4. This package provides those tools.


# development version
library(devtools)
install_github("jknowles/merTools")

# CRAN version -- coming soon
install.packages("merTools")

Shiny App and Demo

The easiest way to demo the features of the package is to use the bundled Shiny application, which launches a number of the metrics here to aid in exploring the model. To do this:

m1 <- lmer(y ~ service + lectage + studage + (1|d) + (1|s), data=InstEval)
shinyMer(m1, simData = InstEval[1:100, ]) # just try the first 100 rows of data

On the first tab, the function presents the prediction intervals for the data selected by the user, which are calculated using the predictInterval function within the package. This function calculates prediction intervals quickly by sampling from the simulated distributions of the fixed effect and random effect terms and combining these simulated estimates to produce a distribution of predictions for each observation. This allows prediction intervals to be generated for very large models where the use of bootMer would not be computationally feasible.

On the next tab the distribution of the fixed effect and group-level effects is depicted on confidence interval plots. These are useful for diagnostics and provide a way to inspect the relative magnitudes of various parameters. This tab makes use of four related functions in merTools: FEsim, plotFEsim, REsim and plotREsim which are available to be used on their own as well.

On the third tab are some convenient ways to show the influence or magnitude of effects by leveraging the power of predictInterval. For each case (up to 12) in the selected data, the user can view the impact of changing either one of the fixed effects or one of the grouping-level terms. Using the REimpact function, each case is simulated with the model's prediction as if all else were held equal, but the observation were moved through the distribution of the fixed effect or the random effect term. This is plotted on the scale of the dependent variable, which allows the user to compare the magnitude of effects across variables, and also between models on the same data.


Standard prediction looks like so:

predict(m1, newdata = InstEval[1:10, ])
#>        1        2        3        4        5        6        7        8
#> 3.146336 3.165211 3.398499 3.114248 3.320686 3.252670 4.180896 3.845218
#>        9       10
#> 3.779336 3.331012

With predictInterval we obtain predictions that are more like the standard objects produced by lm and glm:

#predictInterval(m1, newdata = InstEval[1:10, ]) # all other parameters are optional
predictInterval(m1, newdata = InstEval[1:10, ], n.sims = 500, level = 0.9,
                stat = 'median')
#>         fit      lwr      upr
#> 1  3.074148 1.112255 4.903116
#> 2  3.243587 1.271725 5.200187
#> 3  3.529055 1.409372 5.304214
#> 4  3.072788 1.079944 5.142912
#> 5  3.395598 1.268169 5.327549
#> 6  3.262092 1.333713 5.304931
#> 7  4.215371 2.136654 6.078790
#> 8  3.816399 1.860071 5.769248
#> 9  3.811090 1.697161 5.775237
#> 10 3.337685 1.417322 5.341484

Note that predictInterval is slower because it is computing simulations. It can also return all of the simulated yhat values as an attribute of the prediction object itself.

predictInterval uses the sim function from the arm package heavily to draw the distributions of the parameters of the model. It then combines these simulated values to create a distribution of the yhat for each observation.
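
The mechanics can be sketched directly with arm::sim. This is an illustration of the idea, not merTools' exact internals, and it uses lme4's bundled sleepstudy data rather than the InstEval model above:

```r
library(lme4)
library(arm)

fm <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)

# Draw 500 simulated parameter sets, as predictInterval does internally
sims <- arm::sim(fm, n.sims = 500)

# Combine each simulated fixed-effect vector with the model matrix to get
# a distribution of the fixed-effect component of yhat for every observation
X <- getME(fm, "X")
yhat_fixed <- X %*% t(sims@fixef)

# An interval for the first observation's fixed-effect component
quantile(yhat_fixed[1, ], probs = c(0.05, 0.5, 0.95))
```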


merTools also provides functionality for inspecting merMod objects visually. The easiest of these is getting the posterior distributions of both fixed and random effect parameters.

feSims <- FEsim(m1, n.sims = 100)
head(feSims)
#>          term        mean      median         sd
#> 1 (Intercept)  3.22673524  3.22793168 0.01798444
#> 2    service1 -0.07331857 -0.07482390 0.01304097
#> 3   lectage.L -0.18419526 -0.18451731 0.01726253
#> 4   lectage.Q  0.02287717  0.02187172 0.01328641
#> 5   lectage.C -0.02282755 -0.02117014 0.01324410
#> 6   lectage^4 -0.01940499 -0.02041036 0.01196718

And we can also plot this:

plotFEsim(FEsim(m1, n.sims = 100), level = 0.9, stat = 'median', intercept = FALSE)

We can also quickly make caterpillar plots for the random-effect terms:

reSims <- REsim(m1, n.sims = 100)
head(reSims)
#>   groupFctr groupID        term        mean      median        sd
#> 1         s       1 (Intercept)  0.15317316  0.11665654 0.3255914
#> 2         s       2 (Intercept) -0.08744824 -0.03964493 0.2940082
#> 3         s       3 (Intercept)  0.29063126  0.30065450 0.2882751
#> 4         s       4 (Intercept)  0.26176515  0.26428522 0.2972536
#> 5         s       5 (Intercept)  0.06069458  0.06518977 0.3105805
#> 6         s       6 (Intercept)  0.08055309  0.05872426 0.2182059
plotREsim(REsim(m1, n.sims = 100), stat = 'median', sd = TRUE)

Note that plotREsim highlights group levels that have a simulated distribution that does not overlap 0 – these appear darker. The lighter bars represent grouping levels that are not distinguishable from 0 in the data.

Sometimes the random effects can be hard to interpret, and not all of them are meaningfully different from zero. To help with this, merTools provides the expectedRank function, which provides the percentile ranks for the observed groups in the random effect distribution, taking into account both the magnitude and uncertainty of the estimated effect for each group.

ranks <- expectedRank(m1, groupFctr = "d")
head(ranks)
#>      d (Intercept) (Intercept)_var       ER pctER
#> 1 1866   1.2553613     0.012755634 1123.806   100
#> 2 1258   1.1674852     0.034291228 1115.766    99
#> 3  240   1.0933372     0.008761218 1115.090    99
#> 4   79   1.0998653     0.023095979 1112.315    99
#> 5  676   1.0169070     0.026562174 1101.553    98
#> 6   66   0.9568607     0.008602823 1098.049    97

Effect Simulation

It can still be difficult to interpret the results of LMMs and GLMMs, especially the relative influence of varying parameters on the predicted outcome. This is where the REimpact and wiggle functions in merTools can be handy.

impSim <- REimpact(m1, InstEval[7, ], groupFctr = "d", breaks = 5,
                   n.sims = 300, level = 0.9)
impSim
#>   case bin   AvgFit     AvgFitSE nobs
#> 1    1   1 2.787033 2.801368e-04  193
#> 2    1   2 3.260565 5.389196e-05  240
#> 3    1   3 3.561137 5.976653e-05  254
#> 4    1   4 3.840941 6.266748e-05  265
#> 5    1   5 4.235376 1.881360e-04  176

The result of REimpact shows the change in the yhat as the case we supplied to newdata is moved from the first to the fifth quintile in terms of the magnitude of the group factor coefficient. We can see here that the individual professor effect has a strong impact on the outcome variable. This can be shown graphically as well:

ggplot(impSim, aes(x = factor(bin), y = AvgFit, ymin = AvgFit - 1.96*AvgFitSE,
                   ymax = AvgFit + 1.96*AvgFitSE)) +
  geom_pointrange() + theme_bw() + labs(x = "Bin of `d` term", y = "Predicted Fit")

Here the standard error is a bit different – it is the weighted standard error of the mean effect within the bin. It does not take into account the variability within the effects of each observation in the bin – accounting for this variation will be a future addition to merTools.

Version 0.9.0 of eeptools released!

A long overdue overhaul of my eeptools package for R was released to CRAN today and should be showing up in the mirrors soon. The release notes for this version are extensive as this represents a modernization of the package infrastructure and the reimagining of many of the utility functions contained in the package. From the release notes:

This is a major update including removing little used functions and renaming and restructuring functions.

New Functionality

  • A new package vignette is now included
  • nth_max function for finding the nth highest value in a vector
  • retained_calc now accepts user specified values for sid and grade
  • destring function deprecated and renamed to makenum to better reflect the use of the function
  • crosstabs function exported to allow the user to generate the data behind crosstabplot but not draw the plot


Deprecated

  • dropbox_source deprecated, use the rdrop2 package
  • plotForWord function deprecated in favor of packages like knitr and rmarkdown
  • mapmerge2 has been deprecated in favor of a tested mapmerge
  • mosaictabs.labels has been deprecated in favor of crosstabplot

Bug Fixes

  • nsims in gelmansim was renamed to n.sims to align with the arm package
  • Fixed bug in retained_calc where user specified sid resulted in wrong ids being returned
  • Inserted a meaningful error in age_calc when the enddate is before the date of birth
  • Fixed issue with age_calc which lead to wrong fraction of age during leap years
  • lag_data can now do leads and lags and includes proper error messages
  • Fixed major bugs in statamode, including a faulty default for method and returning objects of the wrong class
  • Added unit tests and continuous integration support for better package updating
  • Fixed the behavior of max_mis in cases when it is passed an empty vector or a vector of NA
  • Made the leading_zero function robust to negative values
  • Added NA handling options to cutoff and thresh
  • Codebase is now tested with lintr to improve readability
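
As an aside, the idea behind the new nth_max function mentioned above can be sketched in base R. The nth_max_sketch function here is a hypothetical stand-in, not eeptools' actual implementation:

```r
# Hypothetical sketch: find the nth highest value in a vector
nth_max_sketch <- function(x, n = 1) {
  sort(x, decreasing = TRUE)[n]
}

nth_max_sketch(c(10, 3, 7, 1), n = 2)  # returns 7
```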

Announcing the caretEnsemble R package

Last week version 1.0 of the caretEnsemble package was released to CRAN. I co-authored this package with Zach Mayer, who had the original idea of allowing for ensembles of train objects in the caret package. The package is designed to make it easy for the user to optimally combine models of various types to produce a meta-model with better fit than the sub-models.

From the vignette:

"caretEnsemble has 3 primary functions: caretList, caretEnsemble and caretStack. caretList is used to build lists of caret models on the same training data, with the same re-sampling parameters. caretEnsemble and caretStack are used to create ensemble models from such lists of caret models. caretEnsemble uses greedy optimization to create a simple linear blend of models and caretStack uses a caret model to combine the outputs from several component caret models."
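
A minimal sketch of that workflow, where the model choices, data, and control settings are illustrative rather than taken from the vignette:

```r
library(caret)
library(caretEnsemble)

# caretList: fit several caret models on the same data with shared resampling
ctrl <- trainControl(method = "cv", number = 5, savePredictions = "final")
models <- caretList(Sepal.Length ~ ., data = iris, trControl = ctrl,
                    methodList = c("lm", "rpart"))

# caretEnsemble: a simple linear blend of the component models
ens <- caretEnsemble(models)

# caretStack: a caret model (here a glm) trained on the components' predictions
stk <- caretStack(models, method = "glm")
```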

I am excited about this package because the ensembling features in caretEnsemble are used to provide additional predictive power in the Wisconsin Dropout Early Warning System (DEWS). I've written about this system before, but it is a large-scale machine learning system used to provide schools with a prediction on the likely graduation of their middle grade students. It is easy to implement and provides additional predictive power for the cost of some CPU cycles. 

Additionally, Zach and I have worked hard to make ensembling models *easy*. For example, you can automatically build lists of models -- a library of models -- for ensembling using the caretList function. This caretList can then be used directly in either the caretEnsemble or caretStack mode, depending on how you want to combine the predictions from the submodels. These new caret objects also come with their own S3 methods (with more being added in future releases) to allow you to interact with them and explore the results of ensembling -- including summary, print, plot, and variable importance calculations. They also include the all-important predict method, allowing you to generate predictions for use elsewhere.

Zach has written a great vignette that should give you a feel for how caretEnsemble works. And, we are actively improving caretEnsemble over on GitHub. Drop by and let us know if you find a bug, have a feature request, or want to let us know how it is working for you!

Mixed Effects Tutorial 2: Fun with merMod Objects

Update: Since this post was released I have co-authored an R package to make some of the items in this post easier to do. This package is called merTools and is available on CRAN and on GitHub. To read more about it, read my new post here and check out the package on GitHub.


First of all, be warned: the terminology surrounding multilevel models is vastly inconsistent. For example, multilevel models themselves may be referred to as hierarchical linear models, random effects models, multilevel models, random intercept models, random slope models, or pooling models. Depending on the discipline, software used, and academic literature, many of these terms may refer to the same general modeling strategy. In this tutorial I will attempt to provide a user guide to multilevel modeling by demonstrating how to fit multilevel models in R and by attempting to connect the model fitting procedure to the terminology commonly used regarding these models.

We will cover the following topics:

  • The structure and methods of merMod objects
  • Extracting random effects of merMod objects
  • Plotting and interpreting merMod objects

If you haven't already, make sure you head over to the Getting Started With Multilevel Models tutorial in order to ensure you have set up your environment correctly and installed all the necessary packages. The tl;dr is that you will need:

  • A current version of R (2.15 or greater)
  • The lme4 package (install.packages("lme4"))

Read in the data

Multilevel models are appropriate for a particular kind of data structure where units are nested within groups (generally 5+ groups) and where we want to model the group structure of the data. For our introductory example we will start with a simple example from the lme4 documentation and explain what the model is doing. We will use data from Jon Starkweather at the University of North Texas. Visit the excellent tutorial available here for more.

library(lme4) # load library
library(arm) # convenience functions for regression in R
lmm.data <- read.table("", header = TRUE, sep = ",",
                       na.strings = "NA", dec = ".", strip.white = TRUE)
head(lmm.data)
##   id    extro     open    agree    social class school
## 1  1 63.69356 43.43306 38.02668  75.05811     d     IV
## 2  2 69.48244 46.86979 31.48957  98.12560     a     VI
## 3  3 79.74006 32.27013 40.20866 116.33897     d     VI
## 4  4 62.96674 44.40790 30.50866  90.46888     c     IV
## 5  5 64.24582 36.86337 37.43949  98.51873     d     IV
## 6  6 50.97107 46.25627 38.83196  75.21992     d      I

Here we have data on the extroversion of subjects nested within classes and within schools.

Let's understand the structure of the data a bit before we begin:

str(lmm.data)
## 'data.frame':    1200 obs. of  7 variables:
##  $ id    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ extro : num  63.7 69.5 79.7 63 64.2 ...
##  $ open  : num  43.4 46.9 32.3 44.4 36.9 ...
##  $ agree : num  38 31.5 40.2 30.5 37.4 ...
##  $ social: num  75.1 98.1 116.3 90.5 98.5 ...
##  $ class : Factor w/ 4 levels "a","b","c","d": 4 1 4 3 4 4 4 4 1 2 ...
##  $ school: Factor w/ 6 levels "I","II","III",..: 4 6 6 4 4 1 3 4 3 1 ...

Here we see we have two possible grouping variables -- class and school. Let's explore them a bit further:

table(lmm.data$class)
##   a   b   c   d 
## 300 300 300 300
table(lmm.data$school)
##   I  II III  IV   V  VI 
## 200 200 200 200 200 200
table(lmm.data$class, lmm.data$school)
##      I II III IV  V VI
##   a 50 50  50 50 50 50
##   b 50 50  50 50 50 50
##   c 50 50  50 50 50 50
##   d 50 50  50 50 50 50

This is a perfectly balanced dataset. In all likelihood you aren't working with a perfectly balanced dataset, but we'll explore the implications of that in the future. For now, let's plot the data a bit. Using the excellent xyplot function in the lattice package, we can explore the relationship between schools and classes across our variables.

xyplot(extro ~ open + social + agree | class, data = lmm.data, 
                 auto.key = list(x = .85, y = .035, corner = c(0, 0)), 
       layout = c(4,1), main = "Extroversion by Class")

Here we see that within classes there are clear stratifications and we also see that the social variable is strongly distinct from the open and agree variables. We also see that class a and class d have significantly more spread in their lowest and highest bands respectively. Let's next plot the data by school.

xyplot(extro ~ open + social + agree | school, data = lmm.data, 
                 auto.key = list(x = .85, y = .035, corner = c(0, 0)), 
       layout = c(3, 2), main = "Extroversion by School")

By school we see that students are tightly grouped, but that school I and school VI show substantially more dispersion than the other schools. The same pattern among our predictors holds between schools as it did between classes. Let's put it all together:

xyplot(extro ~ open + social + agree | school + class, data = lmm.data, 
                 auto.key = list(x = .85, y = .035, corner = c(0, 0)), 
       main = "Extroversion by School and Class")

Here we can see that school and class seem to closely differentiate the relationship between our predictors and extroversion.

Exploring the Internals of a merMod Object

In the last tutorial we fit a series of random intercept models to our nested data. We will examine the lmerMod object produced when we fit this model in much more depth in order to understand how to work with mixed effect models in R. We start by fitting the basic example below, grouped by school:

MLexamp1 <- lmer(extro ~ open + agree + social + (1|school), data = lmm.data)
class(MLexamp1)
## [1] "lmerMod"
## attr(,"package")
## [1] "lme4"

First, we see that MLexamp1 is now an R object of the class lmerMod. This lmerMod object is an S4 class, and to explore its structure, we use slotNames:

slotNames(MLexamp1)
##  [1] "resp"    "Gp"      "call"    "frame"   "flist"   "cnms"    "lower"   "theta"   "beta"   
## [10] "u"       "devcomp" "pp"      "optinfo"

Within the lmerMod object we see a number of objects that we may wish to explore. To look at any of these, we can simply type MLexamp1@ and then the slot name itself. For example:

MLexamp1@call # returns the model call
## lmer(formula = extro ~ open + agree + social + (1 | school), 
##     data =
MLexamp1@beta # returns the betas
## [1] 59.116514199  0.009750941  0.027788360 -0.002151446
class(MLexamp1@frame) # returns the class for the frame slot
## [1] "data.frame"
head(MLexamp1@frame) # returns the model frame
##      extro     open    agree    social school
## 1 63.69356 43.43306 38.02668  75.05811     IV
## 2 69.48244 46.86979 31.48957  98.12560     VI
## 3 79.74006 32.27013 40.20866 116.33897     VI
## 4 62.96674 44.40790 30.50866  90.46888     IV
## 5 64.24582 36.86337 37.43949  98.51873     IV
## 6 50.97107 46.25627 38.83196  75.21992      I

The merMod object has a number of methods available -- too many to enumerate here. But we will go over some of the more common ones in the list below:

methods(class = "merMod")
##  [1] anova          as.function    coef           confint        cooks.distance deviance      
##  [7] df.residual    display        drop1          extractAIC     extractDIC     family        
## [13] fitted         fixef          formula        getL           getME          hatvalues     
## [19] influence      isGLMM         isLMM          isNLMM         isREML         logLik        
## [25] mcsamp         model.frame    model.matrix   ngrps          nobs           plot          
## [31] predict        print          profile        qqmath         ranef          refit         
## [37] refitML        rePCA          residuals      rstudent       se.coef        show          
## [43] sigma.hat      sigma          sim            simulate       standardize    summary       
## [49] terms          update         VarCorr        vcov           weights       
## see '?methods' for accessing help and source code

A common need is to extract the fixed effects from a merMod object. fixef extracts a named numeric vector of the fixed effects, which is handy.

fixef(MLexamp1)
##  (Intercept)         open        agree       social 
## 59.116514199  0.009750941  0.027788360 -0.002151446

If you want to get a sense of the p-values or statistical significance of these parameters, first consult the lme4 help by running ?mcmcsamp for a rundown of various ways of doing this. One convenient way built into the package is:

confint(MLexamp1, level = 0.99)
##                   0.5 %      99.5 %
## .sig01       4.91840325 23.88757695
## .sigma       2.53286648  2.81455985
## (Intercept) 46.27750884 71.95609747
## open        -0.02464506  0.04414924
## agree       -0.01163700  0.06721354
## social      -0.01492690  0.01062510

From here we can see first that our fixed effect parameters overlap 0, indicating no evidence of an effect. We can also see that .sig01, which is our estimate of the variability of the random effect, is very large and very widely defined. This indicates we may have a lack of precision between our groups -- either because the group effect is small, we have too few groups to get a more precise estimate, we have too few units within each group, or a combination of all of the above.

Another common need is to extract the residual standard error, which is necessary for calculating effect sizes. To get the residual standard error:

sigma(MLexamp1)
## [1] 2.670886

For example, it is common practice in education research to standardize fixed effects into "effect sizes" by dividing the fixed effect parameters by the residual standard error, which can be accomplished in lme4 easily:

fixef(MLexamp1) / sigma(MLexamp1)
##   (Intercept)          open         agree        social 
## 22.1336707437  0.0036508262  0.0104041726 -0.0008055176

From this, we can see that our predictors of openness, agreeableness and social are virtually useless in predicting extroversion -- as our plots showed. Let's turn our attention to the random effects next.

Explore Group Variation and Random Effects

In all likelihood you fit a mixed-effect model because you are directly interested in the group-level variation in your model. It is not immediately clear how to explore this group-level variation from the results of summary.merMod. What we get from this output is the variance and the standard deviation of the group effect, but we do not get effects for individual groups. This is where the ranef function comes in handy.

ranef(MLexamp1)
## $school
##     (Intercept)
## I    -14.090991
## II    -6.183368
## III   -1.970700
## IV     1.965938
## V      6.330710
## VI    13.948412
## with conditional variances for "school"

Running the ranef function gives us the intercepts for each school, but not much additional information -- for example the precision of these estimates. To do that, we need some additional commands:

re1 <- ranef(MLexamp1, condVar=TRUE) # save the ranef.mer object
class(re1)
## [1] "ranef.mer"
attr(re1[[1]], which = "postVar")
## , , 1
##           [,1]
## [1,] 0.0356549
## , , 2
##           [,1]
## [1,] 0.0356549
## , , 3
##           [,1]
## [1,] 0.0356549
## , , 4
##           [,1]
## [1,] 0.0356549
## , , 5
##           [,1]
## [1,] 0.0356549
## , , 6
##           [,1]
## [1,] 0.0356549

The ranef.mer object is a list containing a data.frame for each grouping factor. The data.frame contains the random effects for each group (here we only have an intercept for each school). When we ask lme4 for the conditional variance of the random effects, it is stored in an attribute of those data.frames as a list of variance-covariance matrices.
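
To illustrate, the conditional variances stored in that attribute can be converted into approximate standard errors for each group effect. A sketch using lme4's bundled sleepstudy data (the same steps apply to a model like MLexamp1):

```r
library(lme4)

fm <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)
re <- ranef(fm, condVar = TRUE)

# postVar is a 1 x 1 x n_groups array for a random-intercept model
pv <- attr(re[["Subject"]], "postVar")
se <- sqrt(pv[1, 1, ])  # approximate SE of each subject's intercept

head(data.frame(subject = rownames(re[["Subject"]]),
                effect = re[["Subject"]][, 1], se = se))
```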

This structure is indeed complicated, but it is powerful as it allows for nested, grouped, and cross-level random effects. Also, the creators of lme4 have provided users with some simple shortcuts to get what they are really interested in out of a ranef.mer object.

re1 <- ranef(MLexamp1, condVar=TRUE, whichel = "school")
print(re1)
## $school
##     (Intercept)
## I    -14.090991
## II    -6.183368
## III   -1.970700
## IV     1.965938
## V      6.330710
## VI    13.948412
## with conditional variances for "school"
dotplot(re1) # dotplot method from the lattice package
## $school

This graphic shows a dotplot of the random effect terms, also known as a caterpillar plot. Here you can clearly see the effects of each school on extroversion as well as their standard errors to help identify how distinct the random effects are from one another. Interpreting random effects is notably tricky, but for assistance I would recommend looking at a few of these resources:

Using Simulation and Plots to Explore Random Effects

A common econometric approach is to create what are known as empirical Bayes estimates of the group-level terms. Unfortunately there is not much agreement about what constitutes a proper standard error for the random effect terms or even how to consistently define empirical Bayes estimates.[^1] However, in R there are a few additional ways to get estimates of the random effects that provide the user with information about the relative sizes of the effects for each unit and the precision in that estimate. To do this, we use the sim function in the arm package.[^2]

# A function to extract simulated estimates of random effect paramaters from 
# lme4 objects using the sim function in arm
# whichel = the character for the name of the grouping effect to extract estimates for 
# nsims = the number of simulations to pass to sim
# x = model object
REsim <- function(x, whichel=NULL, nsims){
  mysim <- sim(x, n.sims = nsims)
  if(is.null(whichel)){
    dat <- plyr::adply(mysim@ranef[[1]], c(2, 3), plyr::each(c(mean, median, sd)))
    warning("Only returning 1st random effect because whichel not specified")
  } else{
    dat <- plyr::adply(mysim@ranef[[whichel]], c(2, 3), plyr::each(c(mean, median, sd)))
  }
  return(dat)
}

REsim(MLexamp1, whichel = "school", nsims = 1000)
##    X1          X2       mean     median       sd
## 1   I (Intercept) -14.133834 -14.157333 3.871655
## 2  II (Intercept)  -6.230981  -6.258464 3.873329
## 3 III (Intercept)  -2.011822  -2.034863 3.872661
## 4  IV (Intercept)   1.915608   1.911129 3.866870
## 5   V (Intercept)   6.282461   6.299311 3.867046
## 6  VI (Intercept)  13.901818  13.878827 3.873950

The REsim function returns, for each school, the level name (X1), the estimate name (X2), the mean of the estimated values, the median, and the standard deviation of the estimates.

Another convenience function can help us plot these results to see how they compare to the results of dotplot:

# Dat = results of REsim
# scale = factor to multiply sd by
# var = character of "mean" or "median"
# sd = character of "sd"
plotREsim <- function(dat, scale, var, sd){
  dat[, sd] <- dat[, sd] * scale
  dat[, "ymax"] <- dat[, var] + dat[, sd] 
  dat[, "ymin"] <- dat[, var] - dat[, sd] 
  dat[order(dat[, var]), "id"] <- c(1:nrow(dat))
  ggplot(dat, aes_string(x = "id", y = var, ymax = "ymax", 
                         ymin = "ymin")) + 
    geom_pointrange() + theme_dpi() + 
    labs(x = "Group", y = "Effect Range", title = "Effect Ranges") + 
    theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), 
          axis.text.x = element_blank(), axis.ticks.x = element_blank()) + 
    geom_hline(yintercept = 0, color = I("red"), size = I(1.1))
}

plotREsim(REsim(MLexamp1, whichel = "school", nsims = 1000), scale = 1.2, 
          var = "mean", sd = "sd")

This presents a more conservative view of the variation between random effect components. Depending on how your data was collected and your research question, alternative ways of estimating these effect sizes are possible. However, proceed with caution.[^3]

Another approach recommended by the authors of lme4 involves the RLRsim package. Using this package we can test whether or not inclusion of the random effects improves model fit and we can evaluate the p-value of additional random effect terms using a likelihood ratio test based on simulation.[^4]

library(RLRsim)
m0 <- lm(extro ~ agree + open + social, data = lmm.data) # fit the null model
exactLRT(m = MLexamp1, m0 = m0)
##     simulated finite sample distribution of LRT. (p-value based on 10000 simulated values)
## data:  
## LRT = 2957.7, p-value < 2.2e-16

Here exactLRT issues a warning because we originally fit the model with REML instead of full maximum likelihood. Fortunately, the refitML function in lme4 allows us to easily refit our model using full maximum likelihood and conduct the exact test.

mA <- refitML(MLexamp1)
exactLRT(m = mA, m0 = m0)
##     simulated finite sample distribution of LRT. (p-value based on 10000 simulated values)
## data:  
## LRT = 2957.8, p-value < 2.2e-16

Here we can see that the inclusion of our grouping variable is significant, even though the effect of each individual group may be substantively small and/or imprecisely measured. This is important in understanding the correct specification of the model. Our next tutorial will cover specification tests like this in more detail.

Why do Random Effects Matter?

How do we interpret the substantive impact of our random effects? This is often critical in observational work that uses a multilevel structure to understand the impact the grouping can have on the individual observation. To do this we select 12 random cases and then simulate their predicted value of extro if they were placed in each of the 6 schools. Note that this is a very simple simulation, using just the mean of the fixed effects and the conditional mode of the random effect, and not replicating or sampling to get a sense of the variability. That will be left as an exercise to the reader and/or a future tutorial!

# Simulate
# Let's create 12 cases of students
simX <- sample(lmm.data$id, 12) # sample some rows
simX <- lmm.data[lmm.data$id %in% simX, c(3:5)] # get their data
# add an intercept
simX[, "Intercept"] <- 1
simX <- simX[, c(4, 1:3)] # reorder
simRE <- REsim(MLexamp1, whichel = "school", nsims = 1000) # simulate random effects
simX$case <- row.names(simX) # create a case ID
# expand a grid of case IDs by schools
simDat <- expand.grid(case = row.names(simX), school = levels(lmm.data$school))
simDat <- merge(simX, simDat) # merge in the data
# Create the fixed effect predictor
simDat[, "fepred"] <- (simDat[, 2] * fixef(MLexamp1)[1]) + (simDat[, 3] * fixef(MLexamp1)[2]) +
          (simDat[, 4] * fixef(MLexamp1)[3]) + (simDat[, 5] * fixef(MLexamp1)[4])
# Add the school effects
simDat <- merge(simDat, simRE[, c(1, 3)], by.x = "school", by.y="X1")
simDat$yhat <- simDat$fepred + simDat$mean # add the school specific intercept
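The column-by-column arithmetic used to build fepred can be written more compactly as a matrix product. A sketch, assuming the predictor columns of simDat are named Intercept, open, agree, and social and line up with the order of fixef(MLexamp1):

```r
# Equivalent to the fepred calculation above, done as one matrix multiply
X <- as.matrix(simDat[, c("Intercept", "open", "agree", "social")])
simDat$fepred <- as.vector(X %*% fixef(MLexamp1))
```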

Now that we have set up a simulated dataframe, let's plot it, first by case:

qplot(school, yhat, data = simDat) + facet_wrap(~case) + theme_dpi()

This plot shows us that within each plot, representing a case, there is tremendous variation by school. So, moving each student into a different school has large effects on the extroversion score. But, does each case vary at each school?

qplot(case, yhat, data = simDat) + facet_wrap(~school) + theme_dpi() + 
  theme(axis.text.x = element_blank())

Here we can clearly see that within each school the cases are relatively the same, indicating that the group effect is larger than the individual effects.

These plots are useful in demonstrating the relative importance of group and individual effects in a substantive fashion. Even more can be done to make the graphs more informative, such as adding references to the total variability of the outcome and looking at the distance that moving groups moves each observation from its true value.


lme4 provides a very powerful object-oriented toolset for dealing with mixed effect models in R. Understanding model fit and confidence intervals of lme4 objects requires some diligent research and the use of a variety of functions and extensions of lme4 itself. In our next tutorial we will explore how to identify a proper specification of a random-effect model and Bayesian extensions of the lme4 framework for difficult to specify models. We will also explore the generalized linear model framework and the glmer function for generalized linear modeling with multi-levels.


## R version 3.5.3 (2019-03-11)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
## Matrix products: default
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## other attached packages:
##  [1] RLRsim_3.1-3    eeptools_1.2.2  ggplot2_3.1.1   plyr_1.8.4      lattice_0.20-38 arm_1.10-1     
##  [7] MASS_7.3-51.4   lme4_1.1-21     Matrix_1.2-17   knitr_1.22     
## loaded via a namespace (and not attached):
##  [1] zoo_1.8-5         tidyselect_0.2.5  xfun_0.6          purrr_0.3.2       splines_3.5.3    
##  [6] colorspace_1.4-1  htmltools_0.3.6   yaml_2.2.0        mgcv_1.8-28       rlang_0.3.4      
## [11] nloptr_1.2.1      pillar_1.3.1      foreign_0.8-71    glue_1.3.1        withr_2.1.2      
## [16] sp_1.3-1          stringr_1.4.0     munsell_0.5.0     gtable_0.3.0      coda_0.19-2      
## [21] evaluate_0.13     labeling_0.3      maptools_0.9-5    lmtest_0.9-37     vcd_1.4-4        
## [26] Rcpp_1.0.1        scales_1.0.0      abind_1.4-5       digest_0.6.18     stringi_1.4.3    
## [31] dplyr_0.8.0.1     grid_3.5.3        tools_3.5.3       magrittr_1.5      lazyeval_0.2.2   
## [36] tibble_2.1.1      crayon_1.3.4      pkgconfig_2.0.2   data.table_1.12.2 assertthat_0.2.1 
## [41] minqa_1.2.4       rmarkdown_1.12    R6_2.4.0          boot_1.3-22       nlme_3.1-139     
## [46] compiler_3.5.3

[^1]: See the message from lme4 co-author Doug Bates on this subject.

[^2]: Andrew Gelman and Yu-Sung Su (2014). arm: Data Analysis Using Regression and Multilevel/Hierarchical Models. R package version 1.7-03.

[^3]: WikiDot FAQ from the R Mixed Models Mailing List.

[^4]: There is also an extensive series of references available in the References section of the help by running ?exactLRT and ?exactRLRT.

eeptools 0.3 Released!

Version 0.3 of my R package of miscellaneous code has been released, this time with substantial contributions from Jason Becker via GitHub. Progress continues toward the ultimate goal for eeptools: to "make it easier for administrators at state and local education agencies to analyze and visualize their data on student, school, and district performance by putting simple wrappers around a number of R functions to make many common tasks simpler and lower the barrier to entry to statistical analysis.

The goal is not to invent new functionality for R, but instead to lower the barrier of entry to doing common and routine data manipulation, visualization, and analysis tasks with education data. By collaborating with other users of education data we can build transparent, efficient, reproducible, and easy to use functions for analysts.

Find out more, and check out the development of this tool."

Suggestions and pull requests are very welcome on GitHub

See the NEWS file in the repository for full updates:

eeptools 0.3


* unit tests for decomma, gelmansim, and statamode using `testthat` package

* statamode updated to work with data.table

* age_calc function from Jason Becker given new precision option

* moves_calc function from Jason Becker

* gelmansim function to do post-estimation prediction on new data from model objects using functionality in the `arm` package

* lag_data function to create groupwise nested lags quickly

eeptools 0.2


* new functions for building maps with shapefiles including mapmerge to merge a dataframe and a shapefile, and ggmapmerge to convert this to a document for making a map in ggplot2

* statamode updated to allow for multiple methods for handling multiple modes

* remove_stars deleted and replaced with remove_char to allow for users to specify an arbitrary character string to be removed

* add plotForWord function to export plots in a Windows MetaFile for inclusion in Microsoft Office documents

* add age_calc function to allow calculating the age of a vector of birthdates relative to the current date

* fix typos in documentation

* fix startup message behavior

* remove dependencies of the package dramatically so loading is faster and more lightweight


My current R projects at the end of 2013

I have been meaning to write a long post explaining each of the major R projects I am currently working on, but I can't seem to find the time to get to them. Instead, I give you a short summary of a few R projects/experiments/ideas I am kicking around. Any comments, feedback, suggestions are welcome -- consider it a window into my current goals for working with R and the challenges I am using it to meet in my professional life.

datasynthR - This is an R package to allow easy user creation of simulated data of relative complexity. Key features include the ability to specify the distribution and the correlation of and among numeric variables in a dataframe, the ability to build categorical variables that are correlated with one another or numeric variables, introduction of random and not-at-random missingness, and utility functions to check for such missingness. The goal is to allow users to quickly build more robust tests for the performance of statistical modeling techniques. See the GitHub repo for more information. I hope to release this to CRAN in early 2014. 

maintainR - Between work, home, and various configurations of servers and virtual machines, I find that I have R installed in an awful lot of places on an awful lot of architectures. I can't always keep straight which packages are installed where, where the site library is on each machine, and what version of R I am running (have I gone through the tedious Windows upgrade process on this machine?). maintainR is my solution, offering some ability to standardize installed libraries and create a common library across installations. Needs a lot of work; see the GitHub repo here.

eeptools 0.3 - After some inattention, I am hard at work improving my eeptools package and releasing version 0.3 to CRAN. New features in the release include better compatibility with data.table and contributed age_calc and moves_calc functions from Jason Becker. There are also some package development best practices introduced, including unit tests and some git tagging and milestone tracking. See the GitHub repo for details.

EWStools - This is a fledgling R package I am working on that applies what I have learned in developing a Dropout Early Warning System (DEWS) for the state of Wisconsin to create a more flexible and generalized predictive modeling framework for educational outcomes. Mostly wrappers for various caret functions, it codifies some practices I have used to help me evaluate model fit and hopefully makes the concept of predictive modeling more approachable. 

R modeling tutorials - I have begun compiling course notes from my courses in structural equation modeling and mixed-effect modeling into discrete tutorials for R. This started out because I was afraid I might forget some of the material before I needed to use it again, but it has also proven helpful to others. The initial posts have been well received and I hope to wrap these tutorials up into some GitHub repositories and also fold them into the RBootcamp materials eventually. 

Bayesian modeling in R tutorials - As I work through the newest edition of Bayesian Data Analysis (BDA), I want to implement many of the examples of the book in R. I thought, why not do this in the open so others can benefit from it! These tutorials will follow the format laid out in the modeling tutorials. I hope to work on these throughout 2014. 

Applied modeling for social scientists talk - Not much to say about this yet, except it is an extension and a deepening of the most recent presentations I have given over on the presentations page

Data visualization with R discussion - There have been a lot of great new developments in R including new interactive tools like ggvis, JavaScript connections through rCharts, and a new general web-friendliness to R visualization. These developments make me think I need to update my talk on data visualization to be much more Web focused. 

Getting Started with Mixed Effect Models in R

Update: Since this post was released I have co-authored an R package to make some of the items in this post easier to do. This package is called merTools and is available on CRAN and on GitHub. To read more about it, read my new post here and check out the package on GitHub.


Analysts dealing with grouped data and complex hierarchical structures (measurements nested within participants, counties nested within states, students nested within classrooms) often find themselves in need of modeling tools that reflect this structure of their data. In R there are two predominant ways to fit multilevel models that account for such structure in the data. These tutorials will show the user how to use both the lme4 package in R to fit linear and nonlinear mixed effect models, and to use rstan to fit fully Bayesian multilevel models. The focus here will be on how to fit the models in R and not the theory behind the models. For background on multilevel modeling, see the references. [1]

This tutorial will cover getting set up and running a few basic models using lme4 in R. Future tutorials will cover:

  • constructing varying intercept, varying slope, and varying slope and intercept models in R
  • generating predictions and interpreting parameters from mixed-effect models
  • generalized and non-linear multilevel models
  • fully Bayesian multilevel models fit with rstan or other MCMC methods

Setting up your enviRonment

Getting started with multilevel modeling in R is simple. lme4 is the canonical package for implementing multilevel models in R, though there are a number of packages that depend on and enhance its feature set, including Bayesian extensions. lme4 has been recently rewritten to improve speed and to incorporate a C++ codebase, and as such the features of the package are somewhat in flux. Be sure to update the package frequently.

To install lme4, we just run:

# Main version
install.packages("lme4")

# Or to install the dev version
library(devtools)
install_github("lme4", "lme4")

Read in the data

Multilevel models are appropriate for a particular kind of data structure where units are nested within groups (generally 5+ groups) and where we want to model the group structure of the data. For our introductory example we will start with a simple example from the lme4 documentation and explain what the model is doing. We will use data from Jon Starkweather at the University of North Texas. Visit the excellent tutorial available here for more.

library(lme4) # load library
library(arm) # convenience functions for regression in R

lmm.data <- read.table("", header = TRUE, sep = ",",
                       na.strings = "NA", dec = ".", strip.white = TRUE)
head(lmm.data)
##   id    extro     open    agree    social class school
## 1  1 63.69356 43.43306 38.02668  75.05811     d     IV
## 2  2 69.48244 46.86979 31.48957  98.12560     a     VI
## 3  3 79.74006 32.27013 40.20866 116.33897     d     VI
## 4  4 62.96674 44.40790 30.50866  90.46888     c     IV
## 5  5 64.24582 36.86337 37.43949  98.51873     d     IV
## 6  6 50.97107 46.25627 38.83196  75.21992     d      I

Here we have data on the extroversion of subjects nested within classes and within schools.

Fit the Non-Multilevel Models

Let's start by fitting a simple OLS regression of measures of openness, agreeableness, and socialability on extroversion. Here we use the display function in the excellent arm package for abbreviated output. Other options include stargazer for LaTeX typeset tables, xtable, or the ascii package for more flexible plain text output options.

OLSexamp <- lm(extro ~ open + agree + social, data = lmm.data)
display(OLSexamp)
## lm(formula = extro ~ open + agree + social, data = lmm.data)
##             coef.est coef.se
## (Intercept) 57.84     3.15  
## open         0.02     0.05  
## agree        0.03     0.05  
## social       0.01     0.02  
## ---
## n = 1200, k = 4
## residual sd = 9.34, R-Squared = 0.00

So far this model does not fit very well at all. The R model interface is quite a simple one, with the dependent variable specified first, followed by the ~ symbol. The right-hand side predictor variables are each named, and addition signs indicate that they are modeled as additive effects. Finally, we specify the dataframe on which to fit the model. Here we use the lm function to perform OLS regression, but there are many other options in R.
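The fitted lm object can also be interrogated directly with R's standard extractor functions, for example:

```r
coef(OLSexamp)     # point estimates of the coefficients
confint(OLSexamp)  # 95% confidence intervals
AIC(OLSexamp)      # information criterion, comparable across models
```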

If we want to extract measures such as the AIC, we may prefer to fit a generalized linear model with glm which produces a model fit through maximum likelihood estimation. Note that the model formula specification is the same.

MLexamp <- glm(extro ~ open + agree + social, data = lmm.data)
display(MLexamp)
AIC(MLexamp)
## glm(formula = extro ~ open + agree + social, data = lmm.data)
##             coef.est coef.se
## (Intercept) 57.84     3.15  
## open         0.02     0.05  
## agree        0.03     0.05  
## social       0.01     0.02  
## ---
##   n = 1200, k = 4
##   residual deviance = 104378.2, null deviance = 104432.7 (difference = 54.5)
##   overdispersion parameter = 87.3
##   residual sd is sqrt(overdispersion) = 9.34
## [1] 8774.291

This results in a poor model fit. Let's look at a simple varying intercept model now.

Fit a varying intercept model

Depending on disciplinary norms, our next step might be to fit a varying intercept model using a grouping variable such as school or classes. Using the glm function and the familiar formula interface, such a fit is easy:

MLexamp.2 <- glm(extro ~ open + agree + social + class, data = lmm.data)
display(MLexamp.2)
AIC(MLexamp.2)
## glm(formula = extro ~ open + agree + social + class, data = lmm.data)
##             coef.est coef.se
## (Intercept) 56.05     3.09  
## open         0.03     0.05  
## agree       -0.01     0.05  
## social       0.01     0.02  
## classb       2.06     0.75  
## classc       3.70     0.75  
## classd       5.67     0.75  
## ---
##   n = 1200, k = 7
##   residual deviance = 99187.7, null deviance = 104432.7 (difference = 5245.0)
##   overdispersion parameter = 83.1
##   residual sd is sqrt(overdispersion) = 9.12
## [1] 8719.083
anova(MLexamp, MLexamp.2, test="F")
## Analysis of Deviance Table
## Model 1: extro ~ open + agree + social
## Model 2: extro ~ open + agree + social + class
##   Resid. Df Resid. Dev Df Deviance     F   Pr(>F)    
## 1      1196     104378                               
## 2      1193      99188  3   5190.5 20.81 3.82e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This is often called a fixed-effects specification. It is simply the case of fitting a separate dummy variable as a predictor for each class. We can see this does not provide much additional model fit. Let's see if school performs any better.
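Under the hood this dummy-variable specification just expands the class factor into indicator columns in the design matrix, with the first level absorbed into the intercept. We can inspect that expansion directly (the dataframe name lmm.data is assumed here):

```r
# Indicator columns created for the class factor; classa is the reference level
head(model.matrix(~ class, data = lmm.data))
```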

MLexamp.3 <- glm(extro ~ open + agree + social + school, data = lmm.data)
display(MLexamp.3)
AIC(MLexamp.3)
## glm(formula = extro ~ open + agree + social + school, data = lmm.data)
##             coef.est coef.se
## (Intercept) 45.02     0.92  
## open         0.01     0.01  
## agree        0.03     0.02  
## social       0.00     0.00  
## schoolII     7.91     0.27  
## schoolIII   12.12     0.27  
## schoolIV    16.06     0.27  
## schoolV     20.43     0.27  
## schoolVI    28.05     0.27  
## ---
##   n = 1200, k = 9
##   residual deviance = 8496.2, null deviance = 104432.7 (difference = 95936.5)
##   overdispersion parameter = 7.1
##   residual sd is sqrt(overdispersion) = 2.67
## [1] 5774.203
anova(MLexamp, MLexamp.3, test="F")
## Analysis of Deviance Table
## Model 1: extro ~ open + agree + social
## Model 2: extro ~ open + agree + social + school
##   Resid. Df Resid. Dev Df Deviance      F    Pr(>F)    
## 1      1196     104378                                 
## 2      1191       8496  5    95882 2688.2 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The school effect greatly improves our model fit. However, how do we interpret these effects?

table(lmm.data$school, lmm.data$class)
##        a  b  c  d
##   I   50 50 50 50
##   II  50 50 50 50
##   III 50 50 50 50
##   IV  50 50 50 50
##   V   50 50 50 50
##   VI  50 50 50 50

Here we can see we have a perfectly balanced design with fifty observations in each combination of class and school (if only data were always so nice!).

Let's try to model each of these unique cells. To do this, we fit a model and use the : operator to specify the interaction between school and class.

MLexamp.4 <- glm(extro ~ open + agree + social + school:class, data = lmm.data)
display(MLexamp.4)
AIC(MLexamp.4)
## glm(formula = extro ~ open + agree + social + school:class, data = lmm.data)
##                  coef.est coef.se
## (Intercept)       80.36     0.37 
## open               0.01     0.00 
## agree             -0.01     0.01 
## social             0.00     0.00 
## schoolI:classa   -40.39     0.20 
## schoolII:classa  -28.15     0.20 
## schoolIII:classa -23.58     0.20 
## schoolIV:classa  -19.76     0.20 
## schoolV:classa   -15.50     0.20 
## schoolVI:classa  -10.46     0.20 
## schoolI:classb   -34.60     0.20 
## schoolII:classb  -26.76     0.20 
## schoolIII:classb -22.59     0.20 
## schoolIV:classb  -18.71     0.20 
## schoolV:classb   -14.31     0.20 
## schoolVI:classb   -8.54     0.20 
## schoolI:classc   -31.86     0.20 
## schoolII:classc  -25.64     0.20 
## schoolIII:classc -21.58     0.20 
## schoolIV:classc  -17.58     0.20 
## schoolV:classc   -13.38     0.20 
## schoolVI:classc   -5.58     0.20 
## schoolI:classd   -30.00     0.20 
## schoolII:classd  -24.57     0.20 
## schoolIII:classd -20.64     0.20 
## schoolIV:classd  -16.60     0.20 
## schoolV:classd   -12.04     0.20 
## ---
##   n = 1200, k = 27
##   residual deviance = 1135.9, null deviance = 104432.7 (difference = 103296.8)
##   overdispersion parameter = 1.0
##   residual sd is sqrt(overdispersion) = 0.98
## [1] 3395.573

This is very useful, but what if we want to understand both the effect of the school and the effect of the class, as well as the effect of the schools and classes? Unfortunately, this is not easily done with the standard glm.

MLexamp.5 <- glm(extro ~ open + agree + social + school*class - 1, data = lmm.data)
display(MLexamp.5)
AIC(MLexamp.5)
## glm(formula = extro ~ open + agree + social + school * class - 
##     1, data = lmm.data)
##                  coef.est coef.se
## open              0.01     0.00  
## agree            -0.01     0.01  
## social            0.00     0.00  
## schoolI          39.96     0.36  
## schoolII         52.21     0.36  
## schoolIII        56.78     0.36  
## schoolIV         60.60     0.36  
## schoolV          64.86     0.36  
## schoolVI         69.90     0.36  
## classb            5.79     0.20  
## classc            8.53     0.20  
## classd           10.39     0.20  
## schoolII:classb  -4.40     0.28  
## schoolIII:classb -4.80     0.28  
## schoolIV:classb  -4.74     0.28  
## schoolV:classb   -4.60     0.28  
## schoolVI:classb  -3.87     0.28  
## schoolII:classc  -6.02     0.28  
## schoolIII:classc -6.54     0.28  
## schoolIV:classc  -6.36     0.28  
## schoolV:classc   -6.41     0.28  
## schoolVI:classc  -3.65     0.28  
## schoolII:classd  -6.81     0.28  
## schoolIII:classd -7.45     0.28  
## schoolIV:classd  -7.24     0.28  
## schoolV:classd   -6.93     0.28  
## schoolVI:classd   0.06     0.28  
## ---
##   n = 1200, k = 27
##   residual deviance = 1135.9, null deviance = 4463029.9 (difference = 4461894.0)
##   overdispersion parameter = 1.0
##   residual sd is sqrt(overdispersion) = 0.98
## [1] 3395.573

Exploring Random Slopes

Another alternative is to fit a separate model for each of the school and class combinations. If we believe the relationship between our variables may be highly dependent on the school and class combination, we can simply fit a series of models and explore the parameter variation among them:


modellist <- dlply(lmm.data, .(school, class), function(x) 
                              glm(extro ~ open + agree + social, data = x))
display(modellist[[1]])
display(modellist[[2]])
## glm(formula = extro ~ open + agree + social, data = x)
##             coef.est coef.se
## (Intercept) 35.87     5.90  
## open         0.05     0.09  
## agree        0.02     0.10  
## social       0.01     0.03  
## ---
##   n = 50, k = 4
##   residual deviance = 500.2, null deviance = 506.2 (difference = 5.9)
##   overdispersion parameter = 10.9
##   residual sd is sqrt(overdispersion) = 3.30
## glm(formula = extro ~ open + agree + social, data = x)
##             coef.est coef.se
## (Intercept) 47.96     2.16  
## open        -0.01     0.03  
## agree       -0.03     0.03  
## social      -0.01     0.01  
## ---
##   n = 50, k = 4
##   residual deviance = 47.9, null deviance = 49.1 (difference = 1.2)
##   overdispersion parameter = 1.0
##   residual sd is sqrt(overdispersion) = 1.02

We will discuss this strategy in more depth in future tutorials, including how to perform inference on the list of models produced by this command.

Fit a varying intercept model with lmer

Enter lme4. While all of the above techniques are valid approaches to this problem, they are not necessarily the best approach when we are interested explicitly in variation among and by groups. This is where a mixed-effect modeling framework is useful. Now we use the lmer function with the familiar formula interface, but now group level variables are specified using a special syntax: (1|school) tells lmer to fit a linear model with a varying-intercept group effect using the variable school.

MLexamp.6 <- lmer(extro ~ open + agree + social + (1|school), data = lmm.data)
display(MLexamp.6)
## lmer(formula = extro ~ open + agree + social + (1 | school), 
##     data = lmm.data)
##             coef.est coef.se
## (Intercept) 59.12     4.10  
## open         0.01     0.01  
## agree        0.03     0.02  
## social       0.00     0.00  
## Error terms:
##  Groups   Name        Std.Dev.
##  school   (Intercept) 9.79    
##  Residual             2.67    
## ---
## number of obs: 1200, groups: school, 6
## AIC = 5836.1, DIC = 5788.9
## deviance = 5806.5
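The components of this fit can also be extracted programmatically:

```r
fixef(MLexamp.6)         # fixed effect estimates
ranef(MLexamp.6)$school  # school-level intercept deviations
VarCorr(MLexamp.6)       # variance components table
coef(MLexamp.6)$school   # fixed and random parts combined, per school
```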

We can fit multiple group effects with multiple group effect terms.

MLexamp.7 <- lmer(extro ~ open + agree + social + (1|school) + (1|class), data = lmm.data)
display(MLexamp.7)
## lmer(formula = extro ~ open + agree + social + (1 | school) + 
##     (1 | class), data = lmm.data)
##             coef.est coef.se
## (Intercept) 60.20     4.21  
## open         0.01     0.01  
## agree       -0.01     0.01  
## social       0.00     0.00  
## Error terms:
##  Groups   Name        Std.Dev.
##  school   (Intercept) 9.79    
##  class    (Intercept) 2.41    
##  Residual             1.67    
## ---
## number of obs: 1200, groups: school, 6; class, 4
## AIC = 4737.9, DIC = 4683.3
## deviance = 4703.6

And finally, we can fit nested group effect terms through the following syntax:

MLexamp.8 <- lmer(extro ~ open + agree + social + (1|school/class), data = lmm.data)
display(MLexamp.8)
## lmer(formula = extro ~ open + agree + social + (1 | school/class), 
##     data = lmm.data)
##             coef.est coef.se
## (Intercept) 60.24     4.01  
## open         0.01     0.00  
## agree       -0.01     0.01  
## social       0.00     0.00  
## Error terms:
##  Groups       Name        Std.Dev.
##  class:school (Intercept) 2.86    
##  school       (Intercept) 9.69    
##  Residual                 0.98    
## ---
## number of obs: 1200, groups: class:school, 24; school, 6
## AIC = 3568.6, DIC = 3507.6
## deviance = 3531.1

Here the (1|school/class) says that we want to fit a mixed effect term for varying intercepts (the 1) by school, and for classes nested within schools.
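Because the class labels repeat across schools, this nested shorthand is equivalent to specifying a school intercept plus a school-by-class interaction intercept explicitly. A sketch of the expanded form (the model name MLexamp.8b and the dataframe name lmm.data are assumed):

```r
# Equivalent to the (1 | school/class) specification above
MLexamp.8b <- lmer(extro ~ open + agree + social +
                     (1 | school) + (1 | school:class), data = lmm.data)
```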

Fit a varying slope model with lmer

But what if we want to explore the effect of different student-level indicators as they vary across classrooms? Instead of fitting unique models by school (or school/class) we can fit a varying slope model. Here we modify our random effect term to include variables before the grouping terms: (1 + open|school/class) tells R to fit a varying slope and varying intercept model for schools and classes nested within schools, and to allow the slope of the open variable to vary by school.

MLexamp.9 <- lmer(extro ~ open + agree + social + (1 + open|school/class), data = lmm.data)
display(MLexamp.9)
## lmer(formula = extro ~ open + agree + social + (1 + open | school/class), 
##     data = lmm.data)
##             coef.est coef.se
## (Intercept) 60.26     3.46  
## open         0.01     0.01  
## agree       -0.01     0.01  
## social       0.00     0.00  
## Error terms:
##  Groups       Name        Std.Dev. Corr 
##  class:school (Intercept) 2.61          
##               open        0.01     1.00 
##  school       (Intercept) 8.33          
##               open        0.00     1.00 
##  Residual                 0.98          
## ---
## number of obs: 1200, groups: class:school, 24; school, 6
## AIC = 3574.9, DIC = 3505.6
## deviance = 3529.3


Fitting mixed effect models and exploring group level variation is very easy within the R language and ecosystem. In future tutorials we will explore comparing across models, doing inference with mixed-effect models, and creating graphical representations of mixed effect models to understand their effects.


## R version 3.5.3 (2019-03-11)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
## Matrix products: default
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## other attached packages:
## [1] plyr_1.8.4    arm_1.10-1    MASS_7.3-51.4 lme4_1.1-21   Matrix_1.2-17 knitr_1.22   
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.1      lattice_0.20-38 digest_0.6.18   grid_3.5.3      nlme_3.1-139    magrittr_1.5   
##  [7] coda_0.19-2     evaluate_0.13   stringi_1.4.3   minqa_1.2.4     nloptr_1.2.1    boot_1.3-22    
## [13] rmarkdown_1.12  splines_3.5.3   tools_3.5.3     stringr_1.4.0   abind_1.4-5     xfun_0.6       
## [19] yaml_2.2.0      compiler_3.5.3  htmltools_0.3.6

[1] Examples include Gelman and Hill, Gelman et al. 2013, etc.

Update: Like this? Then head over to the second part -- using merMod objects in R

Latent Variable Analysis with R: Getting Setup with lavaan

Getting Started with Structural Equation Modeling Part 1

Getting Started with Structural Equation Modeling: Part 1


For the analyst familiar with linear regression, fitting structural equation models can at first feel strange. In the R environment, fitting structural equation models involves learning new modeling syntax, new plotting syntax, and often a new data input method. However, after a quick reorientation to these differences, fitting structural equation models can become a powerful tool in the analyst's toolkit.

This tutorial will cover getting set up and running a few basic models using lavaan in R. Future tutorials will cover:

  • constructing latent variables
  • comparing alternate models
  • multi-group analysis on larger datasets.

Setting up your enviRonment

Getting started using structural equation modeling (SEM) in R can be daunting. There are lots of different packages for implementing SEM in R and there are different features of SEM that a user might be interested in implementing. A few packages you might come across can be found on the CRAN Psychometrics Task View.

For those who want to just dive in the lavaan package seems to offer the most comprehensive feature set for most SEM users and has a well thought out and easy to learn syntax for describing SEM models. To install lavaan, we just run:

# Main version
install.packages("lavaan")

# Or to install the dev version
library(devtools)
install_github("lavaan", "yrosseel")

Read in the data

Once we load up the lavaan package, we need to read in the dataset. lavaan accepts two different types of data, either a standard R dataframe, or a variance-covariance matrix. Since the latter is unfamiliar to us coming from the standard lm linear modeling framework in R, we'll start with reading in the simplest variance-covariance matrix possible and running a path analysis model.

mat1 <- matrix(c(1, 0, 0, 0.6, 1, 0, 0.33, 0.63, 1), 3, 3, byrow = TRUE)

colnames(mat1) <- rownames(mat1) <- c("ILL", "IMM", "DEP")

myN <- 500
mat1
##      ILL  IMM DEP
## ILL 1.00 0.00   0
## IMM 0.60 1.00   0
## DEP 0.33 0.63   1
# Note that we only input the lower triangle of the matrix. This is
# sufficient, though we could put the whole matrix in if we like

Now we have a variance-covariance matrix in our environment named mat1 and a variable myN corresponding to the number of observations in our dataset. Alternatively, we could provide R with the full dataset and it can derive mat1 and myN itself.
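To illustrate that alternative, here is a minimal sketch of deriving those two inputs from raw data. The simulated data frame here is a stand-in of my own, not data from this post:

```r
# Hypothetical raw data standing in for a real dataset
set.seed(1234)
simDat <- data.frame(ILL = rnorm(500), IMM = rnorm(500), DEP = rnorm(500))

# lavaan can take the data frame directly via the data argument,
# or we can derive the covariance matrix and sample size ourselves
covMat <- cov(simDat)
nObs <- nrow(simDat)
```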

With this data we can construct two possible models:

  1. Depression (DEP) influences Immune System (IMM) influences Illness (ILL)
  2. IMM influences ILL influences DEP

Using SEM we can evaluate which model best explains the covariances we observe in our data above. Fitting models in lavaan is a two step process. First, we create a text string that serves as the lavaan model and follows the lavaan model syntax. Next, we give lavaan the instructions on how to fit this model to the data using either the cfa, lavaan, or sem functions. Here we will use the sem function. Other functions will be covered in a future post.

# Specify the model

mod1 <- "ILL ~ IMM \n        IMM ~ DEP"

# Give lavaan the command to fit the model
mod1fit <- sem(mod1, sample.cov = mat1, sample.nobs = 500)

# Specify model 2

mod2 <- "DEP ~ ILL\n        ILL ~ IMM"

mod2fit <- sem(mod2, sample.cov = mat1, sample.nobs = 500)

Now we have two objects stored in our environment for each model. We have the model string and the modelfit object. The model fit objects (mod1fit and mod2fit) are lavaan class objects. These are S4 objects with many supported methods, including the summary method which provides a lot of useful output:

# Summarize the model fit
summary(mod1fit)
## lavaan (0.5-14) converged normally after  12 iterations
##   Number of observations                           500
##   Estimator                                         ML
##   Minimum Function Test Statistic                2.994
##   Degrees of freedom                                 1
##   P-value (Chi-square)                           0.084
## Parameter estimates:
##   Information                                 Expected
##   Standard Errors                             Standard
##                    Estimate  Std.err  Z-value  P(>|z|)
## Regressions:
##   ILL ~
##     IMM               0.600    0.036   16.771    0.000
##   IMM ~
##     DEP               0.630    0.035   18.140    0.000
## Variances:
##     ILL               0.639    0.040
##     IMM               0.602    0.038

summary(mod2fit)
## lavaan (0.5-14) converged normally after  11 iterations
##   Number of observations                           500
##   Estimator                                         ML
##   Minimum Function Test Statistic              198.180
##   Degrees of freedom                                 1
##   P-value (Chi-square)                           0.000
## Parameter estimates:
##   Information                                 Expected
##   Standard Errors                             Standard
##                    Estimate  Std.err  Z-value  P(>|z|)
## Regressions:
##   DEP ~
##     ILL               0.330    0.042    7.817    0.000
##   ILL ~
##     IMM               0.600    0.036   16.771    0.000
## Variances:
##     DEP               0.889    0.056
##     ILL               0.639    0.040

One of the best ways to understand an SEM model is to inspect the model visually using a path diagram. Thanks to the semPlot package, this is easy to do in R.2 First, install semPlot:

# Official version
install.packages("semPlot")

# Or to install the dev version
library(devtools)
install_github("semPlot", "SachaEpskamp")

Next we load the library and make some path diagrams.

library(semPlot)
semPaths(mod1fit, what = "est", layout = "tree", title = TRUE, style = "LISREL")

[Figure: path diagram of mod1fit]

semPaths(mod2fit, what = "est", layout = "tree", title = TRUE, style = "LISREL")

[Figure: path diagram of mod2fit]

These two simple path models look great. But which is better? We can run a simple chi-square test on the lavaan objects mod1fit and mod2fit.

anova(mod1fit, mod2fit)
## Chi Square Difference Test
##         Df  AIC  BIC  Chisq Chisq diff Df diff Pr(>Chisq)    
## mod1fit  1 3786 3803   2.99                                  
## mod2fit  1 3981 3998 198.18        195       0     <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see very clearly that Model 1, with its much smaller chi-square statistic and lower AIC, is the preferred model. Even so, let's look at some properties of model 2 that we can access through the lavaan object with convenience functions.

# Goodness of fit measures
fitMeasures(mod2fit)
##              fmin             chisq                df            pvalue 
##             0.198           198.180             1.000             0.000 
##    baseline.chisq       baseline.df   baseline.pvalue               cfi 
##           478.973             3.000             0.000             0.586 
##               tli              nnfi               rfi               nfi 
##            -0.243            -0.243             1.000             0.586 
##              pnfi               ifi               rni              logl 
##             0.195             0.587             0.586         -1986.510 
## unrestricted.logl              npar               aic               bic 
##         -1887.420             4.000          3981.020          3997.878 
##            ntotal              bic2             rmsea    rmsea.ci.lower 
##           500.000          3985.182             0.628             0.556 
##    rmsea.ci.upper      rmsea.pvalue               rmr        rmr_nomean 
##             0.703             0.000             0.176             0.176 
##              srmr       srmr_nomean             cn_05             cn_01 
##             0.176             0.176            10.692            17.740 
##               gfi              agfi              pgfi               mfi 
##             0.821            -0.075             0.137             0.821 
##              ecvi 
##             0.412

# Estimates of the model parameters
parameterEstimates(mod2fit, ci = TRUE, = "norm")
##   lhs op rhs   est    se      z pvalue ci.lower ci.upper
## 1 DEP  ~ ILL 0.330 0.042  7.817      0    0.247    0.413
## 2 ILL  ~ IMM 0.600 0.036 16.771      0    0.530    0.670
## 3 DEP ~~ DEP 0.889 0.056 15.811      0    0.779    1.000
## 4 ILL ~~ ILL 0.639 0.040 15.811      0    0.560    0.718
## 5 IMM ~~ IMM 0.998 0.000     NA     NA    0.998    0.998

# Modification indices
modindices(mod2fit, standardized = TRUE)
##    lhs op rhs    mi    epc sepc.all sepc.nox
## 1  DEP ~~ DEP   0.0  0.000   0.000    0.000    0.000
## 2  DEP ~~ ILL 163.6 -0.719  -0.719   -0.720   -0.720
## 3  DEP ~~ IMM 163.6  0.674   0.674    0.675    0.674
## 4  ILL ~~ ILL   0.0  0.000   0.000    0.000    0.000
## 5  ILL ~~ IMM    NA     NA      NA       NA       NA
## 6  IMM ~~ IMM   0.0  0.000   0.000    0.000    0.000
## 7  DEP  ~ ILL   0.0  0.000   0.000    0.000    0.000
## 8  DEP  ~ IMM 163.6  0.675   0.675    0.675    0.676
## 9  ILL  ~ DEP 163.6 -0.808  -0.808   -0.808   -0.808
## 10 ILL  ~ IMM   0.0  0.000   0.000    0.000    0.000
## 11 IMM  ~ DEP 143.8  0.666   0.666    0.666    0.666
## 12 IMM  ~ ILL   0.0  0.000   0.000    0.000    0.000

That's it: from inputting a variance-covariance matrix to fitting a model, drawing a path diagram, comparing alternate models, and finally inspecting the parameters of the preferred model. lavaan is an amazing project which adds great capabilities to R. These will be explored in future posts.


citation(package = "lavaan")
## To cite lavaan in publications use:
##   Yves Rosseel (2012). lavaan: An R Package for Structural
##   Equation Modeling. Journal of Statistical Software, 48(2), 1-36.
##   URL
## A BibTeX entry for LaTeX users is
##   @Article{,
##     title = {{lavaan}: An {R} Package for Structural Equation Modeling},
##     author = {Yves Rosseel},
##     journal = {Journal of Statistical Software},
##     year = {2012},
##     volume = {48},
##     number = {2},
##     pages = {1--36},
##     url = {},
##   }
citation(package = "semPlot")
## To cite package 'semPlot' in publications use:
##   Sacha Epskamp (2013). semPlot: Path diagrams and visual analysis
##   of various SEM packages' output. R package version 0.3.3.
## A BibTeX entry for LaTeX users is
##   @Manual{,
##     title = {semPlot: Path diagrams and visual analysis of various SEM packages' output},
##     author = {Sacha Epskamp},
##     year = {2013},
##     note = {R package version 0.3.3},
##     url = {},
##   }
## ATTENTION: This citation information has been auto-generated from
## the package DESCRIPTION file and may need manual editing, see
## 'help("citation")'.
## R version 3.0.1 (2013-05-16)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## other attached packages:
## [1] semPlot_0.3.3  lavaan_0.5-14  quadprog_1.5-5 pbivnorm_0.5-1
## [5] mnormt_1.4-5   boot_1.3-9     MASS_7.3-28    knitr_1.4.1   
## loaded via a namespace (and not attached):
##  [1] car_2.0-18       cluster_1.14.4   colorspace_1.2-2 corpcor_1.6.6   
##  [5] digest_0.6.3     ellipse_0.3-8    evaluate_0.4.7   formatR_0.9     
##  [9] grid_3.0.1       Hmisc_3.12-2     igraph_0.6.5-2   jpeg_0.1-6      
## [13] lattice_0.20-23  lisrelToR_0.1.4  plyr_1.8         png_0.1-6       
## [17] psych_1.3.2      qgraph_1.2.3     rockchalk_1.8.0  rpart_4.1-2     
## [21] sem_3.1-3        stats4_3.0.1     stringr_0.6.2    tools_3.0.1     
## [25] XML_3.98-1.1

Writing a Minimal Working Example (MWE) in R

How to Ask for Help using R


The key to getting good help with an R problem is to provide a minimal working reproducible example (MWRE). Making an MWRE is really easy with R, and it will help ensure that those helping you can identify the source of the error and, ideally, send corrected code back to you rather than leaving you hunting for code that works. To have an MWRE you need the following items:

  • a minimal dataset that produces the error
  • the minimal runnable code necessary to reproduce the error, run on the dataset provided
  • the necessary information on the used packages, R version, and system
  • a seed value, if random properties are part of the code
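The last item matters because, without a seed, "random" code produces different output for everyone who runs it. A minimal illustration:

```r
# Setting a seed makes random draws reproducible for anyone running your example
set.seed(42)
x <- sample(1:100, 5)

set.seed(42)
y <- sample(1:100, 5)

identical(x, y)  # TRUE: same seed, same draws
```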

Let's look at the tools available in R to help us create each of these components quickly and easily.

Producing a Minimal Dataset

There are three distinct options here:

  1. Use a built in R dataset
  2. Create a new vector / data.frame from scratch
  3. Output the data you are currently working on in a shareable way

Let's look at each of these in turn and see the tools R has to help us do this.

Built in Datasets

There are a few canonical built-in R datasets that are really attractive for use in help requests.

  • mtcars
  • diamonds (from ggplot2)
  • iris

To see all the available datasets in R, simply type: data(). To load any of these datasets, simply use the following:

head(mtcars)  # to look at the data
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

This option works great for a problem where you know you are having trouble with a command in R. It is not a great option if you are having trouble understanding why a command you are familiar with won't work on your data.

Note that for education data that is fairly “realistic”, there are built in simulated datasets in the eeptools package, created by Jared Knowles.

 [1] "X"           "school"      "stuid"       "grade"       "schid"      
 [6] "dist"        "white"       "black"       "hisp"        "indian"     
[11] "asian"       "econ"        "female"      "ell"         "disab"      
[16] "sch_fay"     "dist_fay"    "luck"        "ability"     "measerr"    
[21] "teachq"      "year"        "attday"      "schoolscore" "district"   
[26] "schoolhigh"  "schoolavg"   "schoollow"   "readSS"      "mathSS"     
[31] "proflvl"     "race"       

Create Your Own Data

Inputting data into R and sharing it back out with others is really easy. Part of the power of R is the ability to create diverse data structures very easily. Let's create a simulated data frame of student test scores and demographics.

Data <- data.frame(id = seq(1, 1000), gender = sample(c("male", "female"), 1000, 
    replace = TRUE), mathSS = rnorm(1000, mean = 400, sd = 60), readSS = rnorm(1000, 
    mean = 370, sd = 58.3), race = sample(c("H", "B", "W", "I", "A"), 1000, 
    replace = TRUE))

head(Data)
  id gender mathSS readSS race
1  1 female  396.6  349.2    H
2  2   male  369.5  330.7    W
3  3 female  423.3  354.3    B
4  4   male  348.7  333.1    W
5  5   male  299.7  353.4    H
6  6 female  338.0  422.1    I

And, just like that, we have simulated student data. This is a great way to evaluate problems with plotting data or with large datasets, since we can ask R to generate a random dataset that is incredibly large if necessary. However, let's look at the relationship among our variables using a quick plot:

library(ggplot2)
qplot(mathSS, readSS, data = Data, color = race) + theme_bw()

[Figure: scatterplot of mathSS against readSS, colored by race]

It looks like race is pretty evenly distributed and there is no relationship among mathSS and readSS. For some applications this data is sufficient, but for others we may wish for data that is more realistic.
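One way to get more realistic simulated data is to build the correlation in directly. Here is a sketch using MASS::mvrnorm; the target correlation of 0.7 is my own illustrative choice, not a value from this post:

```r
library(MASS)  # for mvrnorm

set.seed(1)
# Draw math and reading scores with a built-in correlation of about 0.7
Sigma <- matrix(c(60^2, 0.7 * 60 * 58.3,
                  0.7 * 60 * 58.3, 58.3^2), nrow = 2)
scores <- mvrnorm(1000, mu = c(400, 370), Sigma = Sigma)
Data2 <- data.frame(mathSS = scores[, 1], readSS = scores[, 2])

cor(Data2$mathSS, Data2$readSS)  # close to 0.7 by construction
```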


table(Data$race)

  A   B   H   I   W 
192 195 202 203 208 
cor(Data$mathSS, Data$readSS)
[1] -0.01236

Output Your Current Data

Sometimes you just want to show others the data you are using and see why the problem won't work. The best practice here is to make a subset of the data you are working on, and then output it using the dput command.

dput(head(stulevel, 5))
structure(list(X = c(44L, 53L, 116L, 244L, 274L), school = c(1L, 
1L, 1L, 1L, 1L), stuid = c(149995L, 13495L, 106495L, 45205L, 
142705L), grade = c(3L, 3L, 3L, 3L, 3L), schid = c(495L, 495L, 
495L, 205L, 205L), dist = c(105L, 45L, 45L, 15L, 75L), white = c(0L, 
0L, 0L, 0L, 0L), black = c(1L, 1L, 1L, 1L, 1L), hisp = c(0L, 
0L, 0L, 0L, 0L), indian = c(0L, 0L, 0L, 0L, 0L), asian = c(0L, 
0L, 0L, 0L, 0L), econ = c(0L, 1L, 1L, 1L, 1L), female = c(0L, 
0L, 0L, 0L, 0L), ell = c(0L, 0L, 0L, 0L, 0L), disab = c(0L, 0L, 
0L, 0L, 0L), sch_fay = c(0L, 0L, 0L, 0L, 0L), dist_fay = c(0L, 
0L, 0L, 0L, 0L), luck = c(0L, 1L, 0L, 1L, 0L), ability = c(87.8540493076978, 
97.7875614875502, 104.493033823157, 111.671512686787, 81.9253913501755
), measerr = c(11.1332639734731, 6.8223938284885, -7.85615858883968, 
-17.5741522573204, 52.9833376218976), teachq = c(39.0902471213577, 
0.0984819168655733, 39.5388526976972, 24.1161227728637, 56.6806130368238
), year = c(2000L, 2000L, 2000L, 2000L, 2000L), attday = c(180L, 
180L, 160L, 168L, 156L), schoolscore = c(29.2242722609726, 55.9632592971956, 
55.9632592971956, 55.9632592971956, 55.9632592971956), district = c(3L, 
3L, 3L, 3L, 3L), schoolhigh = c(0L, 0L, 0L, 0L, 0L), schoolavg = c(1L, 
1L, 1L, 1L, 1L), schoollow = c(0L, 0L, 0L, 0L, 0L), readSS = c(357.286464546893, 
263.904581222636, 369.672179143784, 346.595665384202, 373.125445669888
), mathSS = c(387.280282915207, 302.572371332695, 365.461432571883, 
344.496386434725, 441.15810279391), proflvl = structure(c(2L, 
3L, 2L, 2L, 2L), .Label = c("advanced", "basic", "below basic", 
"proficient"), class = "factor"), race = structure(c(2L, 2L, 
2L, 2L, 2L), .Label = c("A", "B", "H", "I", "W"), class = "factor")), .Names = c("X", 
"school", "stuid", "grade", "schid", "dist", "white", "black", 
"hisp", "indian", "asian", "econ", "female", "ell", "disab", 
"sch_fay", "dist_fay", "luck", "ability", "measerr", "teachq", 
"year", "attday", "schoolscore", "district", "schoolhigh", "schoolavg", 
"schoollow", "readSS", "mathSS", "proflvl", "race"), row.names = c(NA, 
5L), class = "data.frame")

The resulting code can be copied and pasted into an R terminal and will rebuild the dataset exactly as described. Note that in the example above it would have been better to first cut out all the variables unnecessary for my problem before running dput. The goal is to make available only the data needed to reproduce your problem.
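For example, a sketch of trimming before dput, using a stand-in data frame (since real student-level data should never be shared):

```r
# Stand-in for a large working dataset (illustrative column names)
fakeStu <- data.frame(stuid = 1:10, grade = 3,
                      readSS = rnorm(10), mathSS = rnorm(10),
                      extra1 = runif(10), extra2 = runif(10))

# Keep only the variables needed to reproduce the problem, then dput a few rows
keepVars <- c("stuid", "readSS", "mathSS")
dput(head(fakeStu[, keepVars], 5))
```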

Also, note, that we never send student level data from LDS over e-mail as this is unsecure. For work on student level data, it is better to either simulate the data or to use the built in simulated data from the eeptools package to run your examples.

Anonymizing Your Data

It may also be the case that you want to dput your data, but you want to keep the contents of your data anonymous. A Google search came up with a decent looking function to carry this out:

anonym <- function(df) {
    # Swap the real column names for letters
    if (length(df) > 26) {
        LETTERS <- replicate(floor(length(df)/26), {
            LETTERS <- c(LETTERS, paste(LETTERS, LETTERS, sep = ""))
        })
    }
    names(df) <- paste(LETTERS[1:length(df)])
    # Anonymize the contents of each column <- function(df) { <- function(i) {
            if (class(df[, i]) == "factor" | class(df[, i]) == "character") {
                column <- paste(names(df)[i], as.numeric(as.factor(df[, i])), 
                  sep = ".")
            } else if (is.numeric(df[, i])) {
                column <- df[, i]/mean(df[, i], na.rm = TRUE)
            } else {
                column <- df[, i]
            }
            column
        }
        DF <- data.frame(sapply(seq_along(df),
        names(DF) <- names(df)
        DF
    }
    df <-
    return(df)
}

test <- anonym(stulevel)
head(test[, c(2:6, 28:32)])
                    B                 C                 D
1 0.00217632592657076  1.51160611230132 0.551020408163265
2 0.00217632592657076 0.135998696526593 0.551020408163265
3 0.00217632592657076  1.07322572705443 0.551020408163265
4 0.00217632592657076 0.455562880806568 0.551020408163265
5 0.00217632592657076  1.43813960635994 0.551020408163265
6 0.00217632592657076 0.151115261535106 0.551020408163265
                  E                 F BB                CC
1   1.3475499092559  2.01923076923077  0 0.720073808281278
2   1.3475499092559 0.865384615384615  0 0.531872308862454
3   1.3475499092559 0.865384615384615  0 0.745035931291952
4 0.558076225045372 0.288461538461538  0 0.698527611516136
5 0.558076225045372  1.44230769230769  0 0.751995631770993
6   1.3475499092559  2.01923076923077  0 0.880245964840198
                 DD   EE   FF
1 0.801153708902007 EE.2 FF.2
2 0.625921298341795 EE.3 FF.2
3 0.756017786295901 EE.2 FF.2
4 0.712648099763826 EE.2 FF.2
5 0.912608944625505 EE.2 FF.2
6 0.958626895492888 EE.4 FF.2

That looks pretty generic and anonymized to me!


  • Most of these solutions do not include missing data (NAs), which is often the source of problems in R. That limits their usefulness.
  • So, always check for NA values.
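A quick NA audit before sharing is cheap insurance. A minimal sketch:

```r
# Toy data with some missing values
df <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))

anyNA(df)           # does the data contain any NAs at all?
colSums(  # how many NAs in each column?
```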

Creating the Example

Once we have our minimal dataset, we need to reproduce our error on that dataset. This part is critical. If the error goes away when you apply your code to the minimal dataset, then it will be very hard for others to diagnose the problem remotely, and it might be time to get some “at your desk” help.

Let's look at an example where we have an error aggregating data. Let's assume I am creating a new data frame for my example, and trying to aggregate that data by race.

Data <- data.frame(id = seq(1, 1000), gender = sample(c("male", "female"), 1000, 
    replace = TRUE), mathSS = rnorm(1000, mean = 400, sd = 60), readSS = rnorm(1000, 
    mean = 370, sd = 58.3), race = sample(c("H", "B", "W", "I", "A"), 1000, 
    replace = TRUE))

myAgg <- Data[, list(meanM = mean(mathSS)), by = race]
Error: unused argument(s) (by = race)
Error: object 'myAgg' not found

Why do I get an error? Well, if you sent the above code to someone, they could quickly evaluate it and spot the mistake, provided they knew you were attempting to use the data.table package.

Data <- data.frame(id = seq(1, 1000), gender = sample(c("male", "female"), 1000, 
    replace = TRUE), mathSS = rnorm(1000, mean = 400, sd = 60), readSS = rnorm(1000, 
    mean = 370, sd = 58.3), race = sample(c("H", "B", "W", "I", "A"), 1000, 
    replace = TRUE))

library(data.table)
Data <- data.table(Data)
myAgg <- Data[, list(meanM = mean(mathSS)), by = race]
myAgg
   race meanM
1:    H 398.6
2:    B 405.1
3:    A 397.8
4:    W 395.7
5:    I 399.1

Session Info

However, they might not know this, so we need to provide one final piece of information: the sessionInfo for our R session. To diagnose the error it is necessary to know what system you are running on, what packages are loaded in your workspace, and what version of R and of each package you are using.

Thankfully, R makes this incredibly easy. Just tack on the output from the sessionInfo() function. This is easy enough to copy and paste or include in a knitr document.

R version 2.15.2 (2012-10-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)

[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.8.8 eeptools_0.2     ggplot2_0.9.3.1  knitr_1.2       

loaded via a namespace (and not attached):
 [1] colorspace_1.2-2   dichromat_2.0-0    digest_0.6.3      
 [4] evaluate_0.4.3     formatR_0.7        grid_2.15.2       
 [7] gtable_0.1.2       labeling_0.1       MASS_7.3-23       
[10] munsell_0.4        plyr_1.8           proto_0.3-10      
[13] RColorBrewer_1.0-5 reshape2_1.2.2     scales_0.2.3      
[16] stringr_0.6.2      tools_2.15.2      


For more information, visit:

Get the source code for this blogpost in a Gist here:

Announcing eeptools 0.2

My R package eeptools has reached version 0.2. As with the last release, this is still a preliminary release, which means that functionality is not complete, function names and code behavior may still change from version to version, and I am still looking for suggestions and ideas to improve the package -- check it out on GitHub. Here are the changes in the new version:

eeptools 0.2


  • New functions for building maps with shapefiles, including mapmerge to merge a dataframe and a shapefile, and ggmapmerge to convert this to a document for making a map in ggplot2
  • statamode updated to allow for multiple methods for handling multiple modes
  • remove_stars deleted and replaced with remove_char to allow for users to specify an arbitrary character string to be removed
  • New plotForWord function to export plots in a Windows MetaFile for inclusion in Microsoft Office documents
  • New age_calc function to allow calculating the age of a vector of birthdates relative to the current date
  • Fix typos in documentation
  • Fix startup message behavior
  • Remove dependencies of the package dramatically so loading is faster and more lightweight

R Bootcamp Materials!

Learn about ColoRs in R!

Analyze model results with custom functions.

Good and Bad Graphics

To train new employees at the Wisconsin Department of Public Instruction, I have developed a 2-3 day series of training modules on how to get work done in R. These modules cover everything from setting up and installing R and RStudio to creating reproducible analyses using the knitr package. There are also some experimental modules for introductions to basic computer programming, and a refresher course on statistics. I hope to improve both of these over time. 

I am happy to announce that all of these materials are available online, for free.

The bootcamp covers the following topics:

  1. Introduction to R : History of R, R as a programming language, and features of R.
  2. Getting Data In : How to import data into R, manipulate, and manage multiple data objects. 
  3. Sorting and Reshaping Data :  Long to wide, wide to long, and everything in between!
  4. Cleaning Education Data : Includes material from the Strategic Data Project about how to implement common business rules in processing administrative data. 
  5. Regression and Basic Analytics in R : Using school mean test scores to do OLS regression and regression diagnostics -- a real world example. 
  6. Visualizing Data : Harness the power of R's data visualization packages to make compelling and informative visualizations.
  7. Exporting Your Work : Learn the knitr package, and how to export graphics, and create PDF reports.
  8. Advanced Topics : A potpourri of advanced features in R (by request)
  9. A Statistics Refresher : With interactive examples using shiny 
  10. Programming Principles : Tips and pointers about writing code. (Needs work)

The best part is, all of the materials are available online and free of charge! (Check out the R Bootcamp page). They are constantly evolving. We have done two R Bootcamps so far, and hope to do more. Each time the materials get a little better.

For those not ready for a full 2 to 3 day training, together with a colleague at work, Justin Meyer, we have created a 2-3 hour introduction that is also available on the webpage. 

And, of course, all the materials are online on GitHub. Look for future blog posts on tips for running an R bootcamp and some practical advice. For now, enjoy the materials, and feel free to leave a comment here for feedback, fork the GitHub repo, make a pull request, or take and adopt the materials however you see fit! One parting piece of advice though -- don't wait until day two for the data visualization module -- give them the ggplot2 goodness ASAP. 

Data Visualization for Education

Recently I was invited to give a talk to two cohorts of Strategic Data Project fellows. I was asked to speak about using data visualization to help inform decision-making of policy makers. At the same time, the group had a lot of variation in their interest and prior experience with data visualization. For my talk I decided to try to fit a little bit of everything into a 60 minute discussion:  

  • Principles and why they matter
  • Best practices and routines to follow
  • Branding and when style is allowed to trump function
  • Visualizing very large datasets

The talk is very specific around education data, but education data has a lot in common with datasets from any number of other fields, and so the talk might be useful to others interested in learning more about data visualization, education data, policy making, and/or all of the above!


Additionally, this was my first attempt at making a slideshow using the slidify package for R. My previous workflow involved using knitr and pandoc to make slideshows, for example, for the R Bootcamp. However, I wanted to see if slidify provided an upgrade. This package has many fantastic features that I loved, including the ability to publish the slides directly to GitHub, which is how I distributed them. This is great, but for projects I don't want in public, distributing the slides offline is a little less straightforward. 

The big advantage for me with slidify is the style flexibility. The pandoc HTML5 slides are a little vanilla in their style and somehow seem more difficult to customize. slidify seems to have found a way to make this much more direct, which I appreciate.

I think learning both has been helpful. I hope to post more soon on pieces of CSS and HTML that make web-based slides superior to their PowerPoint ancestors, and streamlines the styling process.

Shiny Apps

I have been working on developing some interactive tools for demonstrating statistical principles. The advent of the shiny framework for R from the RStudio team has made this very easy to do. These demos are still in the development stage, but they are designed to help refresh statistical knowledge for applied folks now working more and more with data in their jobs.

Below is a link to each demo online, where you can use it, along with the R code so you can run the demo locally on your machine or clone it and modify it on your own.


Shiny is great!


Coin Flipping

To run this Gist locally, if you have R, just run:


Picking Lotto Numbers

There's not a lot you can do to increase your odds of winning the lottery tonight. With the PowerBall at $500 million though, a lot of otherwise rational folks might be tempted into playing. For those of you newly tempted, it is important to remember an interesting fact about the lottery--if two players hit the jackpot simultaneously, they split the money. So if a $500 million jackpot is just large enough for your expected value to be greater than the cost of a ticket, the $250 million after a single split is not!

While you can't increase your odds of winning with the numbers you pick, you can decrease your chances of sharing the winnings. This is because of two simple facts:

  1. Lottery numbers are drawn randomly
  2. Human players of the lottery choose their numbers non-randomly

Remember, the PowerBall consists of 5 numbers chosen from 1:59 (without replacement) and 1 number (the PowerBall) chosen from 1:39.

Statistics about lottery numbers chosen by players are hard to come by, but we can use a little common sense to help us understand how people might choose numbers. For example, people who choose their own numbers overwhelmingly use personally significant numbers, like dates and times (aka integers less than 31). Since the 5 white balls include 28 numbers that are not in that range, it might make sense to draw disproportionately from these numbers. Again, this does nothing to hurt our chances of winning--it just decreases our likelihood of choosing a number that is also chosen by another player.

Additionally, by letting our own random number generator pick the numbers, we remove the human element--which has been documented to be decidedly non-random--and thus increase our chances of winning independently of any other player. 

So, if you are looking to play the lottery tonight, I present a simple R script to help you pick your winning numbers! And don't worry, few enough people will use it that it should have a near-zero effect on your chances of splitting the winnings with a fellow R aficionado! (Bonus problem: calculate how many users of the script it would take for this to be untrue!)
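As a sketch of what such a picker might look like, here is a minimal version that oversamples the numbers people rarely choose. The weighting scheme is my own illustration, not the author's script:

```r
# Pick PowerBall numbers, oversampling the numbers people rarely choose (> 31)
pick_powerball <- function() {
  weights <- ifelse(1:59 > 31, 2, 1)              # arbitrary illustrative weights
  white <- sort(sample(1:59, 5, prob = weights))  # 5 white balls, no replacement
  red <- sample(1:39, 1)                          # the PowerBall number
  list(white = white, powerball = red)
}

pick_powerball()
```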

eeptools 0.1 Available on CRAN Now!

eeptools 0.1 is available now on CRAN! You can install it by simply typing:

install.packages("eeptools")

in your R console now. The package provides a number of built-in datasets for folks in education beginning to learn R, custom themes for the popular ggplot2 plotting package, and convenience functions for calculating modes and maximums on incomplete vectors, among other utilities.

The package will be updated regularly with some new and exciting features built for teaching and using R in a state education agency.

Many thanks to the support and help from my colleagues at the Department of Public Instruction for feedback and ideas!

Converting a Markdown File to PDF Using Pandoc

Working with knitr and markdown is a great way to share quick reports with colleagues, but in cases where IE8 is still the dominant browser, shipping an HTML file with embedded graphics is a non-starter. IE8 does not support the Data URI format used to embed images directly in the HTML file if those files are greater than 32kb. This means that you can't easily share a graphic-heavy report as an HTML file with colleagues (and really, what statistical report isn't figure heavy?).

So the next best thing is to ship a PDF. One way to do this would be to print the HTML file to PDF from a browser that can display it. In this case, the resulting file is generally quite ugly, the images are often distorted, and the header and footer are problematic. Another way is to rewrite your report in Markdown more friendly for conversion into LaTeX and then to PDF. Neither of these is fun, neither is efficient, and neither looks ideal.

Luckily, I found a great way to use pandoc to convert the HTML report into a good looking PDF without resorting to rewriting the report in LaTeX and reknitting. This means you can get the power of Markdown with the portability of PDF for long form documents and one-off data reports. All you need is a handy little script to do the translating from format to format.

You can read the StackOverflow discussion here:

My interpretation/use of this is below:

# Define your report
RMDFILE <- "myreport"
# Knit the Rmd to an md file
require(knitr); require(markdown)
knit(paste0(RMDFILE, ".rmd"), paste0(RMDFILE, ".md"))
# Convert the md file to HTML
markdownToHTML(paste0(RMDFILE, ".md"), paste0(RMDFILE, ".html"), options = c("use_xhtml"))
# Convert the HTML file to PDF with pandoc
system(paste0("pandoc -s ", RMDFILE, ".html -o ", RMDFILE, ".pdf"))

A Choropleth Map of Free and Reduced Price Lunch in R

Charles Blow has an excellent op-ed in the New York Times about public education this week. The most important point he makes is that the defunding of public education is coming at precisely the time when American school children are most vulnerable:

Not only is our education system being starved of investment, but many of our children are literally too hungry to learn.

This sums up the problem, but it doesn't show the extent of it. One way to show the truth of this point and how dramatically things have changed is to look at state reported proportions of students eligible for Free and Reduced Price Lunch meals. FRL eligibility is set at roughly 160% of the poverty line and students qualifying for these meals are the students Blow refers to in his piece. 

For my work in Wisconsin I have spent a lot of time looking at the impact of increasing numbers of students FRL eligible in the state on education policy. Mapping this data can give a sense of the breadth and depth of the problem of children living in poverty, particularly in recent years. 

Luckily, the state of Wisconsin makes data on the proportions of FRL students in every school district available in an easy-to-download format. Additionally, a shapefile of the state of Wisconsin school districts is easy enough to find here.

Using a GitHub repository (FRLmap), I have wrapped up the data necessary, the scripts necessary, and some code for downloading the shapefile to make the map below. 

Then all we need is R and a few trusty packages to draw the map. I won't explain all the code here (you can snag it from the GIST embedded below or the repo) but I did want to explain a couple of pieces of the code that are important. 

First, downloading the shapefile from a web source is not straightforward in R.

download.file("", destfile=paste(getwd(),"/shapefile/publicshapefileUHS.shp",sep=""),mode="wb")

Here we need the "mode='wb'" argument to tell R to download and save the file as a binary file instead of as a table. Otherwise, it will download the file with the proper extension, but the file won't behave as designed or have the properties necessary to be read back in properly. Next, we have a crucial piece of code to merge the shapefile data with the data frame we downloaded.

d <- distall@data
## Drop extraneous data, then merge and restore the original order ##
d$sortvar <- seq_len(nrow(d))   # "sortvar" is added here to preserve polygon order
d <- merge(d, y, all.x = TRUE)  # assumes d and y share a district ID column
d <- d[order(d$sortvar), ]
row.names(d) <- row.names(distall@data)  # spCbind requires matching row names
distall <- spCbind(distall, d)  # maptools::spCbind

This little block of code is useful for merging shapefile data with data frames. @data indicates we want to create an object that pulls the data out of our shapefile. We then merge our data of interest (y) with the new data frame created from the shapefile (d). After we do this, we re-sort the data and call spCbind to bring it back into the shapefile data structure.

Finally, to draw the plot, we use fortify from the ggplot2 package to create a polygon object we can draw with ggplot2.
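The fortify-then-plot step looks roughly like this. To keep the sketch self-contained, a toy data frame stands in for the output of `fortify()` on the shapefile (the column names long, lat, and group match what fortify produces; the `frl` column and the color choices are my own assumptions, not the post's actual code):

```r
library(ggplot2)

# Toy stand-in for fortify() output: two square "districts", each
# with a hypothetical FRL proportion attached to every vertex
dist.df <- data.frame(
  long  = c(0, 1, 1, 0,  1, 2, 2, 1),
  lat   = c(0, 0, 1, 1,  0, 0, 1, 1),
  group = rep(c("a", "b"), each = 4),
  frl   = rep(c(0.25, 0.60), each = 4)
)

# Draw each district as a filled polygon, shaded by FRL proportion
p <- ggplot(dist.df, aes(long, lat, group = group, fill = frl)) +
  geom_polygon(color = "white") +
  coord_equal() +
  scale_fill_gradient(low = "khaki1", high = "red4", name = "FRL rate")
```

With the real shapefile, `fortify(distall)` produces the long/lat/group columns and the district data is merged on afterward, but the plotting call is the same shape.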

The result is a striking graphic depicting the socioeconomic changes in Wisconsin over the last decade.

FRL Proportions from 2001-2012

Scoping functions in R

I want to test embedding source code in the blog by using the handy Gist tool provided by GitHub. These two R functions are a good opportunity to test out embedding a Gist on the website. The functions allow for threshold testing within a vector in R, or over the rows or columns of a data frame, as shown at the end of the code. They are not complex, and probably not as efficient as they could be, but they are an example of writing readable, well-documented code, and they may be of use to others. Eventually they will be incorporated into the LDS_TOOLS package, which is currently under development on GitHub.
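The idea can be sketched in a few lines. This is an illustration of the general technique, not the Gist's actual functions, and the name `thresh` and the at-or-above convention are my own assumptions:

```r
# Proportion of values at or above a cutoff, ignoring NAs by default
thresh <- function(x, cutoff, na.rm = TRUE) {
  mean(x >= cutoff, na.rm = na.rm)
}

df <- data.frame(a = c(1, 5, 9), b = c(2, 6, 10))

thresh(df$a, 5)                    # a single vector: 2 of 3 values >= 5
apply(df, 1, thresh, cutoff = 5)   # the same test over each row
apply(df, 2, thresh, cutoff = 5)   # ... or over each column
```

Wrapping the scalar version in `apply()` is what extends the vector test to rows or columns of a data frame.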