The lack of publicly available, contextualized information on your local police results in a paradox of democratic accountability. All levels of government rely extensively on data analytics to carry out work on the public’s behalf. I should know; I’ve consulted on data analytics for many of them. But all too often this work is done without incorporating the perspectives of the whole community, which too often results in tragic consequences for low-income residents and people of color, as Virginia Eubanks and Cathy O’Neil have shown.
merTools is an R package that is designed to make working with multilevel models from lme4, particularly large models with many random effects, fast and easy. With merTools you can generate prediction intervals that incorporate various components of uncertainty (fixed effect, random effect, and model uncertainty); you can get the expected rank of individual random effect levels (a combination of the magnitude and precision of the estimate); and you can interactively explore the substantive effect of variables in the model using a Shiny application!
Recently, we've updated the package to significantly improve performance and accuracy. You can get it on CRAN now.
- Improve handling of formulas. If the original `merMod` has functions specified in the formula, the `wiggle` functions will check for this and attempt to respect those variable transformations. Where this is not possible, a warning will be issued. Most common transformations are respected as long as the original variable is passed untransformed to the model.
- Change the calculations of the residual variance. Previously, residual variance was used to inflate both the variance around the fixed parameters and the variance around the predicted values themselves. This was incorrect and resulted in overly conservative estimates. Now the residual variance is appropriately used only around the final predictions.
- New option for `predictInterval` that allows the user to return the full interval, the fixed component, the random component, or the fixed and each random component separately for each observation
- Fixed a bug with slope+intercept random terms that caused a miscalculation of the random component
- Add a comparison to `rstanarm` to the vignette
- Refactor `expectedRank` to produce more `tidy`-like output and allow the function to calculate expected rank for all terms at once
- Note: this breaks the API by changing the names of the columns in the output of this function
- Remove tests that test for timing to avoid issues with R-devel JIT compiler
- Remove the dependency on `plyr` and replace it with `dplyr`
- Fix issue #62: `varList` will now throw an error if `==` is used instead of `=`
- Fix issue #54: `predictInterval` did not include random effects in calculations when `newdata` had more than 1,000 rows and/or the user specified `parallel = TRUE`. Note: the fix was to disable the parallel functionality in `predictInterval`; the user can still specify it for temporary backward compatibility, but this should be either removed or fixed in the permanent solution.
- Fix issue #53 about problems with `predictInterval` when only specific levels of a grouping factor are in `newdata` with the colon specification of interactions
- Fix issue #52: wrong ICC calculations; the standard deviations pulled from the model simply needed to be squared
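The issue #52 fix is easy to illustrate: the intraclass correlation coefficient (ICC) is a ratio of variances, but lme4 reports standard deviations, so those must be squared before forming the ratio. A minimal sketch with made-up values (not merTools' actual code):

```r
# ICC = between-group variance / (between-group variance + residual variance).
# lme4 reports standard deviations, so square them before forming the ratio.
icc <- function(sd_group, sd_resid) {
  var_group <- sd_group^2
  var_resid <- sd_resid^2
  var_group / (var_group + var_resid)
}

icc(sd_group = 2, sd_resid = 4)   # 4 / (4 + 16) = 0.2
# Using the SDs unsquared would give 2 / 6 = 0.33 -- the bug being fixed
```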
Have you ever watched a livestream? Have you ever wondered what the actual minute-to-minute work of doing data science looks like? Do you wonder if other R users have the same frustrations as you? If yes -- then read on!
I'm off on a new professional adventure where I am doing public-facing work for the first time in years. While working at home the other day, I thought recording myself working with my webcam would be a great way to keep myself on task and document my decisions. Then I thought, why stop there -- why not livestream my work?
And thus, the R Data Science Livestream was born. The idea is that every day for an hour or two I will livestream myself doing some data science tasks related to my current project -- which is to analyze dozens of years of FBI Uniform Crime reports (read more). I haven't done much R coding in the last 4 months, so it's also a good way to shake off the rust of being out of the game for so long.
So if you are at all interested or curious why someone would do this, check out the landing page I put up to document the project and if you are really curious, maybe even tune in or watch the archives on YouTube!
Update 1: Package is now available on CRAN.
Update 2: Development version also supports models from the blme package.
By far the most popular posts on this blog are my two tutorials on how to fit and use multilevel models in R. Since publishing those tutorials I have received numerous questions, comments, and hits to this blog from people looking for more information about multilevel models in R. Since my day job involves fitting and exploring multilevel models, as well as explaining them to a non-technical audience, I began working with my colleague Carl Frederick on an R package to make these tasks easier. Today, I'm happy to announce that our solution, merTools, is now available. Below, I reproduce the package README file, but you can find out more on GitHub. There are two extensive vignettes that describe how to make use of the package, as well as a shiny app that allows interactive model exploration. The package should be available on CRAN within the next few days.
Working with generalized linear mixed models (GLMM) and linear mixed models (LMM) has become increasingly easy with advances in the `lme4` package. As we have found ourselves using these models more and more within our work, we, the authors, have developed a set of tools for simplifying and speeding up common tasks for interacting with `merMod` objects from `lme4`. This package provides those tools.
# development version
library(devtools)
install_github("jknowles/merTools")

# CRAN version -- coming soon
install.packages("merTools")
Shiny App and Demo
The easiest way to demo the features of the package is to use the bundled Shiny application, which launches a number of the metrics here to aid in exploring the model. To do this:
devtools::install_github("jknowles/merTools")
library(merTools)
m1 <- lmer(y ~ service + lectage + studage + (1|d) + (1|s), data = InstEval)
shinyMer(m1, simData = InstEval[1:100, ])  # just try the first 100 rows of data
On the first tab, the function presents the prediction intervals for the data selected by the user, calculated using the `predictInterval` function within the package. This function calculates prediction intervals quickly by sampling from the simulated distribution of the fixed effect and random effect terms and combining these simulated estimates to produce a distribution of predictions for each observation. This allows prediction intervals to be generated for very large models where the use of `bootMer` would not be computationally feasible.
On the next tab, the distribution of the fixed effect and group-level effects is depicted on confidence interval plots. These are useful for diagnostics and provide a way to inspect the relative magnitudes of various parameters. This tab makes use of four related functions in `merTools` (`FEsim`, `plotFEsim`, `REsim`, and `plotREsim`), which are available to be used on their own as well.
On the third tab are some convenient ways to show the influence or magnitude of effects by leveraging the power of `predictInterval`. For each case, up to 12, in the selected data type, the user can view the impact of changing either one of the fixed effect variables or one of the grouping-level terms. Using the `REimpact` function, each case is simulated with the model's prediction as if all else were held equal but the observation were moved through the distribution of the fixed effect or the random effect term. This is plotted on the scale of the dependent variable, which allows the user to compare the magnitude of effects across variables, and also between models on the same data.
Standard prediction looks like this:
predict(m1, newdata = InstEval[1:10, ])
#>        1        2        3        4        5        6        7        8
#> 3.146336 3.165211 3.398499 3.114248 3.320686 3.252670 4.180896 3.845218
#>        9       10
#> 3.779336 3.331012
With `predictInterval` we obtain predictions that are more like the standard objects produced by `lm` and `glm`:
# predictInterval(m1, newdata = InstEval[1:10, ])  # all other parameters are optional
predictInterval(m1, newdata = InstEval[1:10, ], n.sims = 500,
                level = 0.9, stat = 'median')
#>         fit      lwr      upr
#> 1  3.074148 1.112255 4.903116
#> 2  3.243587 1.271725 5.200187
#> 3  3.529055 1.409372 5.304214
#> 4  3.072788 1.079944 5.142912
#> 5  3.395598 1.268169 5.327549
#> 6  3.262092 1.333713 5.304931
#> 7  4.215371 2.136654 6.078790
#> 8  3.816399 1.860071 5.769248
#> 9  3.811090 1.697161 5.775237
#> 10 3.337685 1.417322 5.341484
`predictInterval` is slower because it is computing simulations. It can also return all of the simulated `yhat` values as an attribute of the predict object itself.
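For example, assuming the current `predictInterval` interface (the `returnSims` argument and the `"sim.results"` attribute name are taken from the package documentation and worth verifying with `?predictInterval`):

```r
# Hypothetical usage following the merTools docs: keep the simulated yhat draws
preds <- predictInterval(m1, newdata = InstEval[1:10, ],
                         n.sims = 500, returnSims = TRUE)
sims <- attr(preds, "sim.results")  # one row per observation, one column per simulation
dim(sims)
```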
`predictInterval` relies heavily on the `sim` function from the `arm` package to draw the distributions of the parameters of the model. It then combines these simulated values to create a distribution of the `yhat` for each observation.
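That mechanism can be sketched in base R for the fixed-effect part alone: draw coefficient vectors from a multivariate normal centered on the estimates, then form a distribution of predictions for each row. This toy uses `lm` and `MASS::mvrnorm` to illustrate the idea; it is not merTools' actual code.

```r
set.seed(101)
# Toy data and a plain linear model standing in for the fixed-effect part
df <- data.frame(x = runif(50), z = runif(50))
df$y <- 1 + 2 * df$x - df$z + rnorm(50, sd = 0.5)
fit <- lm(y ~ x + z, data = df)

n.sims <- 1000
# Draw simulated coefficient vectors from MVN(beta_hat, Vcov(beta_hat))
beta_sims <- MASS::mvrnorm(n.sims, coef(fit), vcov(fit))
X <- model.matrix(~ x + z, data = df[1:5, ])
yhat_sims <- X %*% t(beta_sims)  # 5 observations x 1000 simulated predictions

# Simulation-based interval for each observation
t(apply(yhat_sims, 1, quantile, probs = c(0.05, 0.5, 0.95)))
```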
`merTools` also provides functionality for inspecting `merMod` objects visually. The easiest is getting the posterior distributions of both fixed and random effect parameters.
feSims <- FEsim(m1, n.sims = 100)
head(feSims)
#>          term        mean      median         sd
#> 1 (Intercept)  3.22673524  3.22793168 0.01798444
#> 2    service1 -0.07331857 -0.07482390 0.01304097
#> 3   lectage.L -0.18419526 -0.18451731 0.01726253
#> 4   lectage.Q  0.02287717  0.02187172 0.01328641
#> 5   lectage.C -0.02282755 -0.02117014 0.01324410
#> 6   lectage^4 -0.01940499 -0.02041036 0.01196718
And we can also plot this:
plotFEsim(FEsim(m1, n.sims = 100), level = 0.9, stat = 'median', intercept = FALSE)
We can also quickly make caterpillar plots for the random-effect terms:
reSims <- REsim(m1, n.sims = 100)
head(reSims)
#>   groupFctr groupID        term        mean      median        sd
#> 1         s       1 (Intercept)  0.15317316  0.11665654 0.3255914
#> 2         s       2 (Intercept) -0.08744824 -0.03964493 0.2940082
#> 3         s       3 (Intercept)  0.29063126  0.30065450 0.2882751
#> 4         s       4 (Intercept)  0.26176515  0.26428522 0.2972536
#> 5         s       5 (Intercept)  0.06069458  0.06518977 0.3105805
#> 6         s       6 (Intercept)  0.08055309  0.05872426 0.2182059
plotREsim(REsim(m1, n.sims = 100), stat = 'median', sd = TRUE)
plotREsim highlights group levels that have a simulated distribution that does not overlap 0 – these appear darker. The lighter bars represent grouping levels that are not distinguishable from 0 in the data.
Sometimes the random effects can be hard to interpret, and not all of them are meaningfully different from zero. To help with this, `merTools` provides the `expectedRank` function, which provides the percentile ranks for the observed groups in the random effect distribution, taking into account both the magnitude and uncertainty of the estimated effect for each group.
ranks <- expectedRank(m1, groupFctr = "d")
head(ranks)
#>      d (Intercept) (Intercept)_var       ER pctER
#> 1 1866   1.2553613     0.012755634 1123.806   100
#> 2 1258   1.1674852     0.034291228 1115.766    99
#> 3  240   1.0933372     0.008761218 1115.090    99
#> 4   79   1.0998653     0.023095979 1112.315    99
#> 5  676   1.0169070     0.026562174 1101.553    98
#> 6   66   0.9568607     0.008602823 1098.049    97
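The `ER` column above can be understood with a toy calculation: treat each group effect as a normal distribution with its estimated mean and SD, and take a group's expected rank as one plus the sum of the probabilities that it exceeds each of the other groups. This simplified sketch (with made-up estimates) is not the exact merTools implementation:

```r
# Toy estimates: three groups with posterior means and standard deviations
means <- c(a = -0.5, b = 0.1, c = 0.6)
sds   <- c(a = 0.2,  b = 0.4, c = 0.1)

expected_rank <- sapply(seq_along(means), function(i) {
  others <- setdiff(seq_along(means), i)
  # P(group i > group j) for independent normals: pnorm of standardized difference
  1 + sum(pnorm((means[i] - means[others]) /
                sqrt(sds[i]^2 + sds[others]^2)))
})
names(expected_rank) <- names(means)
round(expected_rank, 2)
# A large but imprecise effect ranks lower than a similar, precisely estimated one
```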
It can still be difficult to interpret the results of LMM and GLMM models, especially the relative influence of varying parameters on the predicted outcome. This is where the `REimpact` and `wiggle` functions in `merTools` can be handy.
impSim <- REimpact(m1, InstEval[7, ], groupFctr = "d", breaks = 5,
                   n.sims = 300, level = 0.9)
impSim
#>   case bin   AvgFit     AvgFitSE nobs
#> 1    1   1 2.787033 2.801368e-04  193
#> 2    1   2 3.260565 5.389196e-05  240
#> 3    1   3 3.561137 5.976653e-05  254
#> 4    1   4 3.840941 6.266748e-05  265
#> 5    1   5 4.235376 1.881360e-04  176
The result of `REimpact` shows the change in the `yhat` as the case we supplied to `newdata` is moved from the first to the fifth quintile in terms of the magnitude of the group factor coefficient. We can see here that the individual professor effect has a strong impact on the outcome variable. This can be shown graphically as well:
library(ggplot2)
ggplot(impSim, aes(x = factor(bin), y = AvgFit,
                   ymin = AvgFit - 1.96 * AvgFitSE,
                   ymax = AvgFit + 1.96 * AvgFitSE)) +
  geom_pointrange() +
  theme_bw() +
  labs(x = "Bin of `d` term", y = "Predicted Fit")
Here the standard error is a bit different – it is the weighted standard error of the mean effect within the bin. It does not take into account the variability within the effects of each observation in the bin – accounting for this variation will be a future addition to `merTools`.
A long overdue overhaul of my eeptools package for R was released to CRAN today and should be showing up in the mirrors soon. The release notes for this version are extensive as this represents a modernization of the package infrastructure and the reimagining of many of the utility functions contained in the package. From the release notes:
This is a major update including removing little used functions and renaming and restructuring functions.
- A new package vignette is now included
- New `nth_max` function for finding the `n`th highest value in a vector
- `retained_calc` now accepts user-specified values for `sid` and `grade`
- `destring` function deprecated and renamed to `makenum` to better reflect the use of the function
- `crosstabs` function exported to allow the user to generate the data behind `crosstabplot` but not draw the plot
- `dropbox_source` deprecated; use the `repmis` package instead
- `plotForWord` function deprecated in favor of packages like `knitr`
- `mapmerge2` has been deprecated in favor of a tested `mapmerge`
- `mosaictabs.labels` has been deprecated in favor of `crosstabplot`
- `gelmansim` was renamed, with its simulations parameter now `n.sims`, to align with the `arm` package
- Fixed bug in `retained_calc` where user-specified `sid` resulted in wrong ids being returned
- Inserted a meaningful error in `age_calc` when the enddate is before the date of birth
- Fixed issue with `age_calc` which led to the wrong fraction of age during leap years
- `lag_data` now can do leads and lags and includes proper error messages
- Fix major bugs for `statamode`, including a faulty default method and returning objects of the wrong class
- Add unit tests and continuous integration support for better package updating
- Fix behavior of `max_mis` in cases when it is passed an empty vector or a vector of `NA`
- `leading_zero` function made robust to negative values
- added NA handling options to
- Codebase is now tested with `lintr` to improve readability
Last week version 1.0 of the caretEnsemble package was released to CRAN. I have co-authored this package with Zach Mayer, who had the original idea of allowing for ensembles of train objects in the caret package. The package is designed to make it easy for the user to optimally combine models of various types to produce a meta-model with superior fit to that of the sub-models.
"caretEnsemble has 3 primary functions: caretList, caretEnsemble and caretStack. caretList is used to build lists of caret models on the same training data, with the same re-sampling parameters. caretEnsemble andcaretStack are used to create ensemble models from such lists of caret models. caretEnsemble uses greedy optimization to create a simple linear blend of models and caretStack uses a caret model to combine the outputs from several component caret models."
I am excited about this package because the ensembling features in caretEnsemble are used to provide additional predictive power in the Wisconsin Dropout Early Warning System (DEWS). I've written about this system before, but it is a large-scale machine learning system used to provide schools with a prediction on the likely graduation of their middle grade students. It is easy to implement and provides additional predictive power for the cost of some CPU cycles.
Additionally, Zach and I have worked hard to make ensembling models *easy*. For example, you can automatically build lists of models -- a library of models -- for ensembling using the caretList function. This caretList can then be used directly with either caretEnsemble or caretStack, depending on how you want to combine the predictions from the submodels. These new caret objects also come with their own S3 methods (with more to be added in future releases) to allow you to interact with them and explore the results of ensembling -- including summary, print, plot, and variable importance calculations. They also include the all-important predict method, allowing you to generate predictions for use elsewhere.
Zach has written a great vignette that should give you a feel for how caretEnsemble works. And, we are actively improving caretEnsemble over on GitHub. Drop by and let us know if you find a bug, have a feature request, or want to let us know how it is working for you!
Recently the NPR program Marketplace did a story about the rise of the use of dropout early warning systems in public schools that you can read or listen to online. I was lucky enough to be interviewed for the piece because of the role I have played in creating the Wisconsin Dropout Early Warning System. Marketplace did a great job explaining the nuances of how these systems fit into the ways schools and districts work. I wanted to use this as an opportunity just write a few thoughts about early warning systems based on my work in this area.
Not discussed in the story was the wonkier but important question of how these predictions are obtained. While much academic research discusses the merits of various models in terms of their ability to correctly identify students, there is not as much work discussing the choice of which system to use in practice. By its nature, the problem of identifying dropouts early presents a fundamental trade-off between simplicity and accuracy. When deploying an EWS to educators in the field, then, analysts should focus not on how accurate a model is, but on whether it is accurate enough to be useful and actionable. Unfortunately, most of the research literature on early warning systems focuses on the accuracy of a specific model and not on the question of sufficient accuracy.
Part of the reason for this focus is that each model tended to have its own definition of accuracy. A welcome and recent shift in the field to using ROC curves to measure the trade-off between false-positives and false-negatives now allows for these discussions of simple vs. complex to use a common and robust accuracy metric. (Hat tip to Alex Bowers for working to provide these metrics for dozens of published early warning indicators.) For example, a recent report by the Chicago Consortium on School Research (CCSR) demonstrates how simple indicators such as grade 8 GPA and attendance can be used to accurately project whether a student will be on-track in grade 9 or not. Using ROC curves, the CCSR can demonstrate on a common scale how accurate these indicators are relative to other more complex indicators and make a compelling case that in Chicago Public Schools these indicators are sufficiently accurate to merit use.
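An ROC curve for a single indicator is straightforward to compute by hand: sweep a threshold over the indicator and record the true-positive and false-positive rates at each cut. A self-contained base R sketch with made-up data (the GPA/dropout numbers below are invented for illustration):

```r
set.seed(42)
# Made-up data: 1 = dropout; lower GPA tends to accompany dropout
dropout <- rbinom(200, 1, 0.2)
gpa     <- rnorm(200, mean = ifelse(dropout == 1, 2.0, 3.0), sd = 0.6)

# Flag students at or below each threshold and compute TPR/FPR
thresholds <- sort(unique(gpa))
roc <- t(sapply(thresholds, function(th) {
  flagged <- gpa <= th
  c(fpr = mean(flagged[dropout == 0]),   # false-positive rate
    tpr = mean(flagged[dropout == 1]))   # true-positive rate (sensitivity)
}))

# Area under the curve via the trapezoid rule
auc <- sum(diff(roc[, "fpr"]) *
           (head(roc[, "tpr"], -1) + tail(roc[, "tpr"], -1)) / 2)
auc  # closer to 1 means better separation of dropouts from graduates
```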
However, in many cases these simple approaches will not be sufficiently accurate to merit use in decision making in schools. Many middle school indicators in the published literature have true dropout identification rates that are quite low, and false-positive rates that are quite high (Bowers, Sprott and Taff 2013). Furthermore, local conditions may mean that a linkage between GPA and dropout that holds in Chicago Public Schools is not nearly as predictive in another context. Additionally, though not empirically testable in most cases, many EWS indicator systems simply serve to provide a numeric account of information that is apparent to schools in other ways -- that is, the indicators selected identify only "obvious" cases of students at risk of dropping out. In this case the overhead of collecting data and conducting identification using the model does not generate a payoff of new actionable information with which to intervene.
More complex models have begun to see use, perhaps in part to respond to the challenge of providing value beyond simple checklist indicators. Unlike checklist or indicator systems, machine learning approaches determine the risk factors empirically from historical data. Instead of asserting that an attendance rate above 95% is necessary to be on track to graduate, a machine learning algorithm identifies the attendance rate cutoff that best predicts successful graduation. Better still, the algorithm can do this while jointly considering several other factors simultaneously. This is the approach I have previously written about taking in Wisconsin, and it has also been developed in Montgomery County Public Schools by Data Science for Social Good fellows.
In fact, the machine learning model is much more flexible than a checklist approach. Once you have moved away from the desire to provide simple indicators that can be applied by users on the fly, and are willing to deliver analytics much like another piece of data, the sky is the limit. Perhaps the biggest advantage to users is that machine learning approaches allow analysts to help schools understand the degree of student risk. Instead of providing a simple yes or no indicator, these approaches can assign probabilities to student completion, allowing the school to use this information to decide on the appropriate level of response.
This concept of degree is important because not all dropouts are simply the lowest performing students in their respective classes. While low performing students do represent a majority of dropouts in many schools, these students are often already identified and being served because of their low-performance. A true early warning system, then, should seek to identify both students who are already identified by schools and those students who are likely non-completers, but who may not already be receiving intervention services. To live up to their name, early warning systems should identify students earlier than after they have started showing acute signs of low performance or disengagement in school. This is where the most value can be delivered to schools.
Despite the improvements possible with a machine learning approach, a lot of work remains to be done. One issue raised in the Marketplace story is understanding how schools put this information to work. An EWS alone will not improve outcomes for students -- it only gives schools more time to make changes. There has not been much research on how schools use information like an early warning system to make decisions about students. There needs to be more work done to understand how schools as organizations respond to analytics like early warning indicators. What are their misconceptions? How do they work together? What are the barriers to trusting these more complex calculations and the data that underlie them?
The drawback of the machine learning approach, as the authors of the CCSR report note, is that the results are not intuitive to school staff and this makes the resulting intervention strategy seem less clear. This trade-off strikes at the heart of the changing ways in which data analysis is assisting humans in making decisions. The lack of transparency in the approach must be balanced by an effort on the part of the analysts providing the prediction to communicate the results. Communication can make the results easier to interpret, can build trust in the underlying data, and build capacity within organizations to create the feedback loops necessary to sustain the system. Analysts must actively seek out feedback on the performance of the model, learn where users are struggling to understand it, and where users are finding it clash with their own observations. This is a critical piece in ensuring that the trade-off in complexity does not undermine the usefulness of the entire system.
EWS work represents just the beginning for meaningful analytics to replace the deluge of data in K-12 schools. Schools don't need more data; they need actionable information that reduces the time taken away from instruction and student services. Analysts don't need more student data; they need meaningful feedback loops with the educators who are tasked with interpreting these analyses and applying the interventions to drive real change. As more work is done to integrate machine learning and eased data collection into the school system, much more work must be done to understand the interface between school organizations, individual educators, and analytics. Analysts and educators must work together to continually refine what information schools and teachers need to be successful and how best to deliver that information, in an easy-to-use fashion, at the right time.
Read about the machine learning approach applied in Montgomery County Public Schools.
Learn about the ROC metric and how various early warning indicators have performed relative to one another in this paper by Bowers, Sprott, and Taff.
Learn about the Wisconsin DEWS machine learning system and how it was developed.
Read the comparison of many early warning indicators and their performance within Chicago Public Schools.
Really excited to launch my new website - DATA-COPE, a place for education data analysts to share ideas, learn about the latest tools and policies affecting their work, and keep a pulse on education analytics and the role it plays in improving education outcomes. The group is a loosely organized affiliation of state and local education analysts in the United States, as well as external researchers at research organizations that provide support to such agencies. The group's aim is to better learn from one another, share resources, and stay abreast of any policy or technology developments that may significantly impact our shared work.
My first major post on the website covers selecting an analytics platform and software suite to best meet the needs of your agency. Spoiler alert, I'm a big fan of R!
For the past two years I have been working on the Wisconsin Dropout Early Warning System, a predictive model of on time high school graduation for students in grades 6-9 in Wisconsin. The goal of this project is to help schools and educators have an early indication of the likely graduation of each of their students, early enough to allow time for individualized intervention. The result is that nearly 225,000 students receive an individualized prediction at the start and end of the school year. The workflow for the system is mapped out in the diagram below:
The system is moving into its second year of use this fall, and I recently completed a research paper describing the predictive analytic approach taken within DEWS. The paper is intended to serve as a description of, and guide to, the decisions made in developing an automated prediction system using administrative data. It covers both the data preparation and model building process as well as a review of the results. A preview is shown below, demonstrating how the EWS models trained in Wisconsin compare to the accuracy reported in the research literature -- represented by the points on the graph. Accuracy is measured using the ROC curve. The article is now available via figshare.
The colored lines represent different types of ensembled statistical models and their accuracy across various thresholds of their predicted probabilities. The points represent the accuracy of comparable models in the research literature using reported accuracy from a paper by Alex Bowers:
Bowers, A.J., Sprott, R., & Taff, S.A. (2013). Do we Know Who Will Drop Out? A Review of the Predictors of Dropping out of High School: Precision, Sensitivity and Specificity. The High School Journal, 96(2), 77-100. doi:10.1353/hsj.2013.0000. This article serves as good background and grounds the benchmarking of the models built in Wisconsin, and it can do the same for others benchmarking their own models.
The state of Wisconsin has one of the highest four year graduation rates in the nation, but deep disparities among student subgroups remain. To address this the state has created the Wisconsin Dropout Early Warning System (DEWS), a predictive model of student dropout risk for students in grades six through nine. The Wisconsin DEWS is in use statewide and currently provides predictions on the likelihood of graduation for over 225,000 students. DEWS represents a novel statistical learning based approach to the challenge of assessing the risk of non-graduation for students and provides highly accurate predictions for students in the middle grades without expanding beyond mandated administrative data collections.
Similar dropout early warning systems are in place in many jurisdictions across the country. Prior research has shown that in many cases the indicators used by such systems do a poor job of balancing the trade-off between correct classification of likely dropouts and false alarms (Bowers et al., 2013). Building on this work, DEWS uses the receiver operating characteristic (ROC) metric to identify the best possible set of statistical models for making predictions about individual students.
This paper describes the DEWS approach and the software behind it, which leverages the open source statistical language R (R Core Team, 2013). As a result, DEWS is a flexible series of software modules that can adapt to new data, new algorithms, and new outcome variables, not only to predict dropout but also to impute key predictors. The design and implementation of each of these modules is described in detail, as is the open-source R package, EWStools, that serves as the core of DEWS (Knowles, 2014).
The code that powers the EWS is an open source R extension of the caret package which is available on GitHub: EWStools on GitHub
Fun with merMod Objects
Jared E. Knowles
Saturday, May 17, 2014
First of all, be warned: the terminology surrounding multilevel models is vastly inconsistent. For example, multilevel models may be referred to as hierarchical linear models, random effects models, multilevel models, random intercept models, random slope models, or pooling models. Depending on the discipline, the software used, and the academic literature, many of these terms may refer to the same general modeling strategy. In this tutorial I will attempt to provide a user guide to multilevel modeling by demonstrating how to fit multilevel models in R and by connecting the model fitting procedure to the terminology commonly used regarding these models.
We will cover the following topics:
- The structure and methods of `merMod` objects
- Extracting random effects of `merMod` objects
- Plotting and interpreting `merMod` objects
If you haven’t already, make sure you head over to the Getting Started With Multilevel Models tutorial in order to ensure you have set up your environment correctly and installed all the necessary packages. The tl;dr is that you will need:
- A current version of R (2.15 or greater)
Read in the data
Multilevel models are appropriate for a particular kind of data structure where units are nested within groups (generally 5+ groups) and where we want to model the group structure of the data. We will use data from Jon Starkweather at the University of North Texas on student social attributes within schools and classes. Visit the excellent tutorial available here for more.
library(lme4)  # load library
library(arm)   # convenience functions for regression in R
lmm.data <- read.table("http://www.unt.edu/rss/class/Jon/R_SC/Module9/lmm.data.txt",
                       header = TRUE, sep = ",", na.strings = "NA",
                       dec = ".", strip.white = TRUE)
# summary(lmm.data)
head(lmm.data)
##   id extro  open agree social class school
## 1  1 63.69 43.43 38.03  75.06     d     IV
## 2  2 69.48 46.87 31.49  98.13     a     VI
## 3  3 79.74 32.27 40.21 116.34     d     VI
## 4  4 62.97 44.41 30.51  90.47     c     IV
## 5  5 64.25 36.86 37.44  98.52     d     IV
## 6  6 50.97 46.26 38.83  75.22     d      I
Here we have data on the extroversion of subjects nested within classes and within schools.
Let’s understand the structure of the data a bit before we begin:
## 'data.frame':    1200 obs. of  7 variables:
##  $ id    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ extro : num  63.7 69.5 79.7 63 64.2 ...
##  $ open  : num  43.4 46.9 32.3 44.4 36.9 ...
##  $ agree : num  38 31.5 40.2 30.5 37.4 ...
##  $ social: num  75.1 98.1 116.3 90.5 98.5 ...
##  $ class : Factor w/ 4 levels "a","b","c","d": 4 1 4 3 4 4 4 4 1 2 ...
##  $ school: Factor w/ 6 levels "I","II","III",..: 4 6 6 4 4 1 3 4 3 1 ...
Here we see we have two possible grouping variables -- `class` and `school`. Let’s explore them a bit further:
##
##   a   b   c   d
## 300 300 300 300

##
##   I  II III  IV   V  VI
## 200 200 200 200 200 200

##
##      I II III IV  V VI
##   a 50 50  50 50 50 50
##   b 50 50  50 50 50 50
##   c 50 50  50 50 50 50
##   d 50 50  50 50 50 50
This is a perfectly balanced dataset. In all likelihood you aren’t working with a perfectly balanced dataset, but we’ll explore the implications of that in the future. For now, let’s plot the data a bit. Using the excellent `xyplot` function in the `lattice` package, we can explore the relationship between schools and classes across our variables.
require(lattice)
xyplot(extro ~ open + social + agree | class, data = lmm.data,
       auto.key = list(x = .85, y = .035, corner = c(0, 0)),
       layout = c(4, 1), main = "Extroversion by Class")