R Bootcamp Materials!

To train new employees at the Wisconsin Department of Public Instruction, I have developed a 2-3 day series of training modules on how to get work done in R. These modules cover everything from setting up and installing R and RStudio to creating reproducible analyses using the knitr package. There are also some experimental modules for introductions to basic computer programming, and a refresher course on statistics. I hope to improve both of these over time. 

I am happy to announce that all of these materials are available online, for free.

​The bootcamp covers the following topics:

  1. Introduction to R: History of R, R as a programming language, and features of R.
  2. Getting Data In: How to import data into R, manipulate, and manage multiple data objects.
  3. Sorting and Reshaping Data: Long to wide, wide to long, and everything in between!
  4. Cleaning Education Data: Includes material from the Strategic Data Project about how to implement common business rules in processing administrative data.
  5. Regression and Basic Analytics in R: Using school mean test scores to do OLS regression and regression diagnostics -- a real world example.
  6. Visualizing Data: Harness the power of R's data visualization packages to make compelling and informative visualizations.
  7. Exporting Your Work: Learn the knitr package, how to export graphics, and how to create PDF reports.
  8. Advanced Topics: A potpourri of advanced features in R (by request)
  9. A Statistics Refresher: With interactive examples using shiny
  10. Programming Principles: Tips and pointers about writing code. (Needs work)

The best part is, all of the materials are available online and free of charge! (Check out the R Bootcamp page). They are constantly evolving. We have done two R Bootcamps so far, and hope to do more. Each time the materials get a little better. ​

For those not ready for a full 2 to 3 day training, a colleague at work (Justin Meyer of RProgramming.net) and I have created a 2-3 hour introduction that is also available on the webpage.

And, of course, all the materials are online on GitHub. Look for future blog posts with tips and practical advice for running an R bootcamp. For now, enjoy the materials, and feel free to leave feedback in a comment here, fork the GitHub repo, make a pull request, or take and adapt the materials however you see fit! One parting piece of advice though -- don't wait until day two for the data visualization module -- give them the ggplot2 goodness ASAP.

R Bootcamp @ Sector67 in Madison

I am pleased to announce that together with Justin Meyer (also from the Wisconsin Department of Public Instruction) I will be presenting a two hour version of the R Bootcamp. Sector67 is a collaborative maker/hacker space in Madison, and is a great venue for sharing knowledge with others. 

We're trying out a 2 hour overview session to see what might be a good way to get people introduced to R and excited about it at the same time.

Participants can get the workshop data and code right here:

https://dl.dropbox.com/u/1811289/RBootcampPreviewS67.zip

Data Visualization for Education

Recently I was invited to give a talk to two cohorts of Strategic Data Project fellows. I was asked to speak about using data visualization to help inform decision-making of policy makers. At the same time, the group had a lot of variation in their interest and prior experience with data visualization. For my talk I decided to try to fit a little bit of everything into a 60 minute discussion:  

  • Principles and why they matter
  • Best practices and routines to follow
  • Branding and when style is allowed to trump function
  • Visualizing very large datasets

The talk is very specific around education data, but education data has a lot in common with datasets from any number of other fields, and so the talk might be useful to others interested in learning more about data visualization, education data, policy making, and/or all of the above!

VIEW THE SLIDES HERE

Additionally, this was my first attempt at making a slideshow using the slidify package for R. My previous workflow involved using knitr and pandoc to make slideshows, for example, for the R Bootcamp. However, I wanted to see if slidify provided an upgrade. This package has many fantastic features that I loved, including the ability to publish the slides directly to GitHub, which is how I distributed them. This is great, but for projects I don't want in public, distributing the slides offline is a little less straightforward. 

The big advantage for me with slidify is the style flexibility. The pandoc HTML5 slides are a little vanilla in their style and somehow seem more difficult to customize. slidify seems to have found a way to make this much more direct, which I appreciate.

I think learning both has been helpful. I hope to post more soon on the pieces of CSS and HTML that make web-based slides superior to their PowerPoint ancestors and that streamline the styling process.

Shiny Apps

I have been working on developing some interactive tools for demonstrating statistical principles. The advent of the shiny framework for R from the RStudio team has made this very easy to do. These demos are still in the development stage, but they are designed to help refresh statistical knowledge for applied folks now working more and more with data in their jobs.

Below you will find, for each demo, a link to the live version online, a link to the code so you can clone and modify it, and the R code to run the demo locally on your machine.

Demos:

Shiny is great!

Coin Flipping

http://glimmer.rstudio.com/wisconsindpi/coin/

https://gist.github.com/4484930

To run this Gist locally, if you have R, just run:

library(shiny)
runGist('4484930')

The other demos run locally the same way. Launch one at a time, since each call blocks until the app is closed:

library(shiny)
runGist('4484951')
runGist('4484910')
runGist('4484886')
runGist('4484868')
runGist('4484845')
runGist('4484754')
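
For readers who want the statistical idea without launching an app, here is a minimal sketch in plain R. It assumes the coin flipping demo illustrates how the share of heads settles toward the true probability as flips accumulate; that is a guess about the demo's content, not a copy of its code, and the function name `simulate_flips` is made up for illustration.

```r
# Hypothetical helper: simulate coin flips and track the running share of heads.
simulate_flips <- function(n_flips, p_heads = 0.5, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)  # optional seed for reproducibility
  flips <- rbinom(n_flips, size = 1, prob = p_heads)  # 1 = heads, 0 = tails
  data.frame(
    flip = seq_len(n_flips),
    heads = flips,
    running_share = cumsum(flips) / seq_len(n_flips)
  )
}

sim <- simulate_flips(1000, seed = 42)

# The running share is noisy early on and steadier after many flips
plot(sim$flip, sim$running_share, type = "l",
     xlab = "Number of flips", ylab = "Share of heads")
abline(h = 0.5, lty = 2)
```

Changing `p_heads` gives a biased coin, which is a quick way to see how many flips it takes before the bias becomes visible.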

Picking Lotto Numbers

There's not a lot you can do to increase your odds of winning the lottery tonight. With the PowerBall at $500 million though, a lot of otherwise rational folks might be tempted into playing. For those of you newly tempted, it is important to remember an interesting fact about the lottery--if two players hit the jackpot simultaneously, they split the money. So if a $500 million jackpot is just large enough for your expected value to be greater than the cost of a ticket, the $250 million left after splitting it once is not!

While you can't increase your odds of winning with the numbers you pick, you can decrease your chances of sharing the winnings. This is because of two simple facts:

  1. Lottery numbers are drawn randomly
  2. Human players of the lottery choose their numbers non-randomly

Remember, the PowerBall consists of 5 numbers chosen from 1:59 (without replacement) and 1 number (the PowerBall) chosen from 1:39.
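
To put rough numbers on the splitting argument, here is a back-of-the-envelope sketch using the draw format just described. The $2 ticket price is an assumption, and the calculation ignores taxes, smaller prizes, and the lump-sum discount, so treat it as an illustration and not a real valuation.

```r
# Number of equally likely tickets: 5 white balls from 59, 1 PowerBall from 39
n_white <- choose(59, 5)
n_combos <- n_white * 39

ticket_price <- 2  # assumed ticket price in dollars
jackpot <- 500e6

# Expected jackpot winnings per ticket, alone and when split two ways
ev_alone <- jackpot / n_combos
ev_split <- (jackpot / 2) / n_combos

round(c(alone = ev_alone, split = ev_split), 2)
ev_alone > ticket_price  # TRUE under these assumptions
ev_split > ticket_price  # FALSE under these assumptions
```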

Statistics about lottery numbers chosen by players are hard to come by, but we can use a little common sense to help us understand how people might choose numbers. For example, people who choose their own numbers overwhelmingly use personally significant numbers, like dates and times (i.e., integers from 1 to 31). Since the white balls include 28 numbers outside that range (32 through 59), it might make sense to draw disproportionately from these numbers. Again, this does nothing to hurt our chances of winning--it just decreases our likelihood of choosing a number that is also chosen by another player.

Additionally, by letting our own random number generator pick the numbers, we remove the human element--which has been documented to be decidedly non-random--and thus increase our chances of winning independently of any other player. 

So, if you are looking to play the lottery tonight, I present a simple R script to help you pick your winning numbers! And don't worry, few enough people will use it that it should have a near-zero effect on your chances of splitting the winnings with a fellow R aficionado! (Bonus problem: calculate how many users of the script it would take for this to be untrue!)
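
The original script is not reproduced here, so what follows is a minimal sketch of the idea, not the author's code. The function name `pick_powerball` and the `high_weight` parameter are made up for illustration; the weight simply makes white balls above 31 more likely to be picked.

```r
# Hypothetical sketch: pick 5 white balls from 1:59 and 1 PowerBall from 1:39,
# favoring white balls that cannot be calendar dates (32 through 59).
pick_powerball <- function(high_weight = 2, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)  # optional seed for reproducibility
  white <- 1:59
  # Weight 1 for numbers 1-31, high_weight for numbers 32-59
  weights <- ifelse(white > 31, high_weight, 1)
  list(
    white = sort(sample(white, size = 5, replace = FALSE, prob = weights)),
    powerball = sample(1:39, size = 1)
  )
}

pick <- pick_powerball(seed = 2012)
pick
```

Because the lottery's own draw is uniform, the weighting changes nothing about the odds of winning; it only tilts the ticket toward numbers that date-pickers cannot choose.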

eeptools 0.1 Available on CRAN Now!

eeptools 0.1 is available now on CRAN! You can install it by simply typing:

install.packages('eeptools')

in your R console now. The package includes a number of built-in datasets for folks in education beginning to learn R, custom themes for the popular ggplot2 plotting package, and convenience functions for calculating modes and maximums on incomplete vectors, among other utilities.
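
To make "incomplete vectors" concrete, here is a minimal base R sketch of what such helpers might look like. The names `max_na` and `mode_na` are hypothetical and are not the eeptools API; check the package documentation for the functions it actually exports.

```r
# Hypothetical helper: maximum that ignores NAs and returns NA
# (instead of -Inf with a warning) when every value is missing.
max_na <- function(x) {
  if (all(is.na(x))) return(NA)
  max(x, na.rm = TRUE)
}

# Hypothetical helper: most frequent non-missing value, returned as a
# character string. Ties go to the value that sorts first in table().
mode_na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(NA_character_)
  counts <- table(x)
  names(counts)[which.max(counts)]
}

max_na(c(3, NA, 7))
max_na(c(NA, NA))
mode_na(c("a", "b", "b", NA))
```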

The package will be updated regularly with some new and exciting features built for teaching and using R in a state education agency.

Many thanks to the support and help from my colleagues at the Department of Public Instruction for feedback and ideas!

Converting a Markdown File to PDF Using Pandoc

Working with knitr and markdown is a great way to share quick reports with colleagues, but in cases where IE8 is still the dominant browser, shipping an HTML file with embedded graphics is a non-starter. IE8 does not support the Data URI format used to embed images directly in the HTML file if those files are greater than 32kb (http://en.wikipedia.org/wiki/Data_URI_scheme). This means that you can't easily share a graphic heavy report as an HTML file with colleagues (and really, what statistical report isn't figure heavy?).

So the next best thing is to ship a PDF. One way to do this is to open the HTML file in a browser that can display it and print it to PDF. The resulting file is generally quite ugly: the images are often distorted, and the header and footer are problematic. Another way is to rewrite your report in Markdown that is friendlier to conversion into LaTeX and then to PDF. Neither of these is fun, neither is efficient, and neither looks ideal.

Luckily, I found a great way to use pandoc to convert the HTML report into a good looking PDF without resorting to rewriting the report in LaTeX and reknitting. This means you can get the power of Markdown with the portability of PDF for long form documents and one-off data reports. All you need is a handy little script to do the translating from format to format.

You can read the StackOverflow discussion here: http://stackoverflow.com/questions/11025123/how-to-convert-r-markdown-to-pdf

My interpretation/use of this is below:

# Knit the Rmd file to a Markdown file (writes myreport.md)
library(knitr)
library(markdown)
knit("myreport.Rmd")
# Convert the Markdown file to HTML
markdownToHTML("myreport.md", "myreport.html", options = c("use_xhtml"))
# Convert the HTML file to PDF (requires pandoc and a LaTeX installation)
system("pandoc -s myreport.html -o myreport.pdf")

On School Boards and Policy Shocks

The dissertation process has many steps. The prospectus or proposal is one of the last. A while ago I was lucky enough to have my dissertation proposal defense and pass! My project seeks to understand the linkage between political activity at the state level and voter and candidate participation at the local level. To evaluate this, I take the case of Wisconsin--an extreme example of domain specific policy activity--and see if the events of the last two years in Wisconsin, particularly around education reform, drove more candidates and more voters to participate in school board elections statewide.

There are over 14,000 school boards in the United States, and they are responsible for expenditures equal to those of the US Department of Defense annually. However, little is known about the democratic process by which individuals on school boards come into office. Some work has been done on large urban school boards, but this work has largely concerned itself with either the question of mayoral control, or of district wide vs. regional board electoral districts. The broader question of whether or not school boards are democratic institutions that respond to community pressures and have meaningful participation has only been studied intermittently since the 1970s. Worse, the political dynamics between state and federal policymaking and local participation in school board elections have received little or no attention over this same period, despite a large increase in both state and federal involvement.

​Classic model of dissatisfaction theory as presented by Wu 1995

The dissatisfaction model is a nice theoretical model, but it leaves something to be desired in terms of generating predictions and allowing us to understand the school board as part of a democratic system. It describes only the actions of the board, not the decisions of voters to vote or the congruence between voter beliefs and voter turnout.

In my research I found a political science dissertation out of Stanford that helped with this.​ Wu 1995 proposed a much more fully developed game-theoretic model of the interaction between voters, board members, and policy. The model is depicted below. 

Wu 1995 Model of Political Game for School Boards​

This is a classic game-theoretic model, simplified here, explaining the conditions under which various actors undertake certain actions. The most important feature is that voters decide whether to vote based on policy decisions, and board members base policy decisions on the likelihood of voters turning out and defeating them. This paints a much more comprehensive picture than dissatisfaction theory, but builds on that theory nicely. Wu's work also dovetails well with the work of other scholars trying to incorporate public choice models (Rada 1987, 1988).

These innovations are necessary to help understand the Wisconsin political context. The main puzzle of school board elections is whether or not they retain features of a democratic entity. Rational choice theory does a lot to help move the discussion away from school boards having to have high turnout elections to be considered democratic--indeed, if the school board is passing policy in line with voters' preferences, there is no need for voters to vote. However, no serious empirical tests of the models above have been conducted to understand their predictive power and how closely they reflect reality.

The timeline below gives a picture of political activity in education at the state level in Wisconsin. This historic and unprecedented political and policy activity focused very closely around issues related to education--school budget cuts, reduction in collective bargaining rights for public workers, etc.-- allows a test of the democratic linkages between state and local policy. There is no doubt that state level politics in Wisconsin have never seen a more active electorate. ​

​Wisconsin Political Timeline 2011-Present

If state and local politics are linked in their activity levels, then we should see corresponding increases in the activity levels of citizens participating in local elections. In fact, in policy areas of high contestation at the state level, we would expect strong increases in political activity at the local level as well. The figure below attempts to understand how state and local politics might be linked. 

Knowles Model of Wisconsin School Boards​

The essential belief is that the number of challengers in an election for school board is influenced by the congruence of voter preferences vis a vis the state policy changes, the overall support for the controversial reforms of Governor Walker, and the strength of independent interest groups--particularly teachers' unions--in the school district. All of these determine the policy decisions of school boards both in making budget reductions in response to the fiscal tightening in the new state budget, and in decisions to suspend or extend union contracts in response to the new local authority given to districts in bargaining with public employees. 

This is in turn linked to voter participation. Voters will only turn out in elections that present serious choices of candidates, but they may not uniformly turn out in these cases. Finally, the results of the election have an impact on the makeup of the board and possibly on the policy direction of the board, depending on the number of incumbents defeated. The table below summarizes the expected relationships between dependent and independent variables in the study.

| Dependent Variable | Independent Variable | Expected Relationship |
| Candidate Participation | Unity on State Policy | Negative |
| Candidate Participation | Prior Challenger Emergence | Null |
| Voter Participation | Prior Turnout | Positive |
| Voter Participation | Number of Challengers | Positive |
| Voter Participation | Policy Divergence | Positive |
| Union Policy | Union Strength | Policy Resistance |
| Union Policy | Walker Support | Policy Support |
| Budget Policy | Budgetary Health (up) | Fewer Cuts |
| Budget Policy | Walker Support | Greater Cuts |

A few rows need to be explained. In the first row, the more unified a community is in support of Governor Walker's policies, the less likely there will be emergence of a higher than usual number of candidates because of policy stability in the district. Prior challenger emergence, then, should yield little predictive effect because of the changed policy environment.

​For voter participation, policy divergence is a critical variable. Here, policy divergence means the split among voters in their support of the policies enacted at the state level over the last two years. The wider the divergence, the more likely the election will be contested and voters have reason to mobilize and participate.

​For union policy, the stronger the union in a district--not surprisingly--the more likely the school board should be to resist the policies at the state level, if possible. This will be counteracted by the strength of support for the governor among the electorate.

Finally, for budget policy, the healthier the budget of a district, the fewer cuts in the budget should be experienced. Net of that, districts with high support for the Governor should experience greater cuts. ​

Conclusion

That, in a nutshell, is the dissertation proposal. You can read more about it by reading the official abstract submitted to my political science department, or the full proposal, both available here.​ I encourage you to do so, and to check back here as the project progresses!

A Choropleth Map of Free and Reduced Price Lunch in R

Charles Blow has an excellent op-ed in the New York Times about public education this week. The most important point he makes is that the defunding of public education is coming at precisely the time when American school children are most vulnerable:

Not only is our education system being starved of investment, but many of our children are literally too hungry to learn.

This sums up the problem, but it doesn't show the extent of it. One way to show the truth of this point and how dramatically things have changed is to look at state reported proportions of students eligible for Free and Reduced Price Lunch (FRL) meals. Eligibility is set at 130% of the poverty line for free meals and 185% for reduced price meals, and students qualifying for these meals are the students Blow refers to in his piece.

For my work in Wisconsin I have spent a lot of time looking at the impact that the increasing number of FRL-eligible students in the state has on education policy. Mapping this data can give a sense of the breadth and depth of the problem of children living in poverty, particularly in recent years.

Luckily, the state of Wisconsin makes data on the proportions of FRL students in every school district available in an easy to download format. Additionally, a shapefile of the state of Wisconsin school districts is easy enough to find here.

Using a GitHub repository (FRLmap), I have wrapped up the data necessary, the scripts necessary, and some code for downloading the shapefile to make the map below. 

Then all we need is R and a few trusty packages to draw the map. I won't explain all the code here (you can snag it from the Gist embedded below or the repo), but I did want to explain a couple of pieces of the code that are important.

First, downloading the shapefile from a web source is not straightforward in R.

download.file("http://dl.dropbox.com/u/1811289/shapefile/publicshapefileUHS.shp", destfile=paste(getwd(),"/shapefile/publicshapefileUHS.shp",sep=""),mode="wb")

Here we need the mode="wb" argument to tell R to download and save the file in binary mode instead of text mode. Otherwise, R will save a file with the proper extension, but the file may be corrupted and cannot be read back in properly. Next, we have a crucial piece of code to merge the shapefile data with the data frame we downloaded.
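
One practical wrinkle before the merge: a shapefile is really a set of files, and readers generally need the .shx and .dbf companions alongside the .shp. The sketch below loops the same binary download over all three. It assumes the companion files sit next to the .shp file at the same base URL, which the post does not confirm, and the helper names are made up for illustration.

```r
# Hypothetical helper: build the URLs for the files that make up one shapefile.
shapefile_urls <- function(base_url, exts = c("shp", "shx", "dbf")) {
  paste0(base_url, ".", exts)
}

# Hypothetical helper: download each component in binary mode.
download_shapefile <- function(base_url, dest_dir = "shapefile") {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  urls <- shapefile_urls(base_url)
  dests <- file.path(dest_dir, basename(urls))
  for (i in seq_along(urls)) {
    download.file(urls[i], destfile = dests[i], mode = "wb")
  }
  invisible(dests)
}

base_url <- "http://dl.dropbox.com/u/1811289/shapefile/publicshapefileUHS"
shapefile_urls(base_url)
# download_shapefile(base_url)  # uncomment to fetch the files
```

The merge step mentioned above comes next.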

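# Merge the FRL data onto the shapefile's attribute table
library(maptools)  # provides spCbind
y <- frl_w
d <- distall@data
d$sort <- 1:nrow(d)  # remember the original polygon order
y <- y[order(y$district_number), ]
di <- merge(y, d, by.x = "district_number", by.y = "CODE", all.y = TRUE)
di <- di[order(di$sort), ]  # merge() reorders rows, so restore polygon order
# Row names must match the polygon IDs, which start at 0 here
di$sort2 <- di$sort - 1
row.names(di) <- di$sort2
distall2 <- spCbind(distall, di)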

This little block of code is useful for merging shapefile data with data frames. The @data slot holds the shapefile's attribute table, which we copy into its own data frame (d). We then merge our data of interest (y) with that data frame. Because merge reorders the rows, we re-sort the result into the original polygon order and call spCbind to bring it back into the shapefile data structure.
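
The re-sorting step is easy to miss, so here is a self-contained sketch of the same pattern on toy data frames. The district codes and values are made up, and no spatial packages are needed to run it.

```r
# Toy attribute table standing in for distall@data (made-up values)
d <- data.frame(CODE = c(30, 10, 20), NAME = c("C", "A", "B"))
d$sort <- seq_len(nrow(d))  # remember the original row order

# Toy data of interest; district 30 is deliberately missing
y <- data.frame(district_number = c(10, 20), frl = c(0.25, 0.40))

# all.y = TRUE keeps every row of d, filling unmatched values with NA
di <- merge(y, d, by.x = "district_number", by.y = "CODE", all.y = TRUE)

# merge() returns rows sorted by the key, so restore the original order
di <- di[order(di$sort), ]
di
```

Without the final re-sort, the rows would line up with the wrong polygons once the data is bound back onto the shapefile.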

Finally, to draw the plot, we use fortify from the ggplot2 package to convert the shapefile object into a data frame of polygon coordinates that we can draw with ggplot2.
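
As a rough illustration of the drawing step, the sketch below builds a tiny data frame by hand in the long/lat/group layout that fortify produces and fills each polygon by a value. The two square "districts" and their FRL proportions are made up; the real map uses the fortified Wisconsin shapefile instead.

```r
library(ggplot2)

# Two made-up square "districts" in the long/lat/group layout;
# frl is a hypothetical FRL proportion repeated for each vertex.
map_df <- data.frame(
  long  = c(0, 1, 1, 0,  1, 2, 2, 1),
  lat   = c(0, 0, 1, 1,  0, 0, 1, 1),
  group = rep(c("d1", "d2"), each = 4),
  frl   = rep(c(0.20, 0.55), each = 4)
)

# One polygon per group, shaded by the FRL proportion
p <- ggplot(map_df, aes(x = long, y = lat, group = group, fill = frl)) +
  geom_polygon(colour = "white") +
  scale_fill_gradient(low = "lightyellow", high = "darkred",
                      name = "FRL proportion") +
  coord_equal()
p
```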

The result is a striking graphic depicting the socioeconomic changes in Wisconsin over the last decade.

​FRL Proportions from 2001-2012

Announcing eeptools

The LDS_TOOLS package for R has been replaced by the new eeptools package.

I am renaming the package for a number of reasons. First, LDS is a technical industry term in education that stands for Longitudinal Data Systems, which is not commonly known or understood. Second, LDS_TOOLS is not a valid package name on CRAN, as package names can only contain letters, numbers, and dots (whoops!). Third, the new name, eeptools, is catchier. eeptools stands for Education Evaluation and Policy tools, which fits the broader vision for the package--to build convenience functions that speed up the work of people in the education space.
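
The naming rule is easy to check before you get attached to a name. The sketch below encodes the rule from the Writing R Extensions manual (only ASCII letters, numbers, and dots; at least two characters; starts with a letter; does not end in a dot) as a regular expression. The function name `is_valid_pkg_name` is made up for illustration.

```r
# Hypothetical helper: TRUE when a string satisfies the R package naming rule
is_valid_pkg_name <- function(x) {
  # First character a letter, last character a letter or number,
  # only letters, numbers, and dots in between
  grepl("^[A-Za-z][A-Za-z0-9.]*[A-Za-z0-9]$", x)
}

is_valid_pkg_name(c("LDS_TOOLS", "eeptools"))
```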

Currently there are only a handful of functions and some synthetic student data in the package. You can check it out online at GitHub: https://github.com/jknowles/eeptools

The good news is that using the devtools package you can build and install it today. Simply type:

library(devtools)
install_github('jknowles/eeptools')

Sourcing an R Script from Dropbox

While working on my R bootcamp materials, I thought it would be handy to get the bootcamp computers set up by sourcing an R script that installs all the necessary non-core packages. The problem? How to deploy this script efficiently. A quick method would be to simply source the script from a public Dropbox share. But, as the BioBucket points out, it isn't that straightforward!

So I created a simple little function called source_dropbox to handle this work. It is hosted as a Gist on GitHub for others to use.
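
The Gist itself is not reproduced here. As a rough sketch of one way such a helper could work (not the author's source_dropbox), the hypothetical function below downloads the script to a temporary file and then sources it, so that any URL download.file can reach will do. The function name `source_remote` and the example URL are placeholders.

```r
# Hypothetical helper: download an R script to a temporary file, then source it.
source_remote <- function(url, ...) {
  tmp <- tempfile(fileext = ".R")
  on.exit(unlink(tmp))  # clean up the temporary copy afterwards
  download.file(url, destfile = tmp, mode = "wb", quiet = TRUE)
  source(tmp, ...)      # extra arguments are passed on to source()
}

# Example call with a placeholder URL (not a real script):
# source_remote("https://example.com/bootcamp_setup.R")
```

Downloading first sidesteps the case where source() cannot read the URL directly, at the cost of one extra file write.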

 

Next up, sourcing a Gist!

R Tutorials and Learning Materials

We are getting ready to host an R bootcamp this summer at work and I am looking at building on materials that already exist. I just wanted to list a few here while I figure out the best ways to incorporate them. 

Video Tutorials:

This is a fairly new resource, but now contains over 60 instructional videos that are two minutes or less for how to do basic tasks in R. An excellent reference and place to start.

http://www.twotorials.com/

R Tutorials:

Pairach Piboonrungroj has put together an excellent compilation of R introductory tutorials from universities around the world. Lots of great material.

http://pairach.com/2012/02/26/r-tutorials-from-universities-around-the-world/

R Tips and Tricks:

This is a great resource for how to do some basic tasks that can be incredibly frustrating in R.

http://pj.freefaculty.org/R/Rtips.html

R Compared to SPSS/SAS:

For users coming from other statistical packages, this reference is a great pointer in the right direction for how to do familiar things in R.

http://r4stats.com/add-ons

Stay tuned for the further development of my R tutorial for education data--developing on GitHub:

https://github.com/jknowles/r_tutorial_ed

All Your Source Code Are Belong to... Nature?

The journal Nature put out an interesting op-ed recently discussing the need to make source code available for scientific articles that require statistical computation to produce their results.

The article hits on a point that is absolutely critical--statistical computing is difficult. Honest mistakes get made. A lot. The peer review process catches theoretical flaws, omitted bibliographic references, and some problems with the methods, based on the amount of detail provided in the article itself. But all of those flaws could be absent and an article could still be fatally flawed and draw completely false conclusions, simply due to an error in the code, and it would still be published if that code was never reviewed or made public.

A big concern here is transparency, as the authors state so well:

Our view is that we have reached the point that, with some exceptions, anything less than release of actual source code is an indefensible approach for any scientific results that depend on computation, because not releasing such code raises needless, and needlessly confusing, roadblocks to reproducibility.

And of course, R and Sweave are mentioned as an elegant solution to this problem:

There are a number of tools that enable code, data and the text of the article that depends on them to be packaged up. Two examples here are Sweave associated with the programming language R and the text-processing systems LaTeX and LyX, and GenePattern-Word RRS, a system specific to genomic research. Sweave allows text documents, figures, experimental data and computer programs to be combined in such a way that, for example, a change in a data file will result in the regeneration of all the research outputs.

Technology has changed the tools necessary to ensure rigor and replicability in science, but not the principle behind it. It is great to see a journal such as Nature making the case for this level of scrutiny to be applied to the computational routines used to derive results.

RStudio, BibTex, and LaTeX

I was having a little trouble getting RStudio to process BibTeX entries and compile a LaTeX file. Bumping around on the great RStudio help forum, I found this entry, which pointed me in the right direction. I needed to set system environment variables in R to let RStudio and R know where to look for the proper BibTeX style, LaTeX input, and BIB input files. I have put up a little Gist on GitHub that shows how I got it done, since I had never messed with R environment variables before.
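
The Gist is not reproduced here, so the following is only a sketch of the general approach using Sys.setenv. TEXINPUTS, BSTINPUTS, and BIBINPUTS are the standard search-path variables for LaTeX inputs, BibTeX styles, and .bib files; the folder locations are hypothetical, and the trailing separator assumes a TeX distribution with kpathsea-style search paths.

```r
# Hypothetical folders; replace with wherever your files actually live
tex_dir <- file.path(path.expand("~"), "texmf", "tex")
bst_dir <- file.path(path.expand("~"), "texmf", "bst")
bib_dir <- file.path(path.expand("~"), "texmf", "bib")

# Search the working directory, then the custom folder; the trailing
# separator asks TeX to keep searching its default locations as well
search_path <- function(dir) {
  paste(".", dir, "", sep = .Platform$path.sep)
}

Sys.setenv(
  TEXINPUTS = search_path(tex_dir),
  BSTINPUTS = search_path(bst_dir),
  BIBINPUTS = search_path(bib_dir)
)

# Confirm what R (and processes it launches) will now see
Sys.getenv(c("TEXINPUTS", "BSTINPUTS", "BIBINPUTS"))
```

Settings made this way last only for the current R session; putting the same variables in an .Renviron file makes them persistent.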

 

 

Scoping functions in R

I want to test embedding source code in the blog by using the handy Gist tool provided by GitHub. These two R functions are a good opportunity to test out embedding a Gist on the website. These functions allow for threshold testing within a vector in R, or over the rows or columns of a data frame, as shown at the end of the code. They are not complex, and probably not as efficient as they could be, but they are an example of writing readable and well-documented code. And they may be of use to others. Eventually these will be incorporated into the LDS_TOOLS package, which you can find on GitHub currently under development.
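
The embedded Gist did not survive here, so the sketch below only illustrates the general pattern described: one small function applied to a vector, then to the columns and rows of a data frame. The function `share_above` and the toy scores are hypothetical and are not the two functions from the Gist.

```r
# Hypothetical helper: share of non-missing values at or above a cutoff
share_above <- function(x, cutoff) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(NA_real_)
  mean(x >= cutoff)
}

# Made-up test scores with one missing value
scores <- data.frame(math = c(55, 72, NA, 90), reading = c(40, 65, 80, 85))

share_above(scores$math, 70)                # a single vector
sapply(scores, share_above, cutoff = 70)    # over columns
apply(scores, 1, share_above, cutoff = 70)  # over rows
```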

Software for Surviving Graduate School Part 1

After introducing a colleague to the wonders of Dropbox today (more on that later) I realized that it might be useful to put out a list of software that is of use to graduate students. I often find that many of the software products I find indispensable are virtually unknown to many of my fellow graduate students. Certainly this is no fault of theirs--I can understand how not being a tech geek (like I am) may mean not hearing about such software, or not having time to evaluate its helpfulness.

The first part of this series focuses on general use software that just makes managing multiple tasks and projects like research papers, class notes, and thesis readings really easy. I tried to choose software that is intuitive, efficient, and accessible on a wide variety of machines. All of the software products are products I use on a regular basis or have used extensively at some point (the latter are categorized as 'notable' instead of essential).

The software is organized into three categories based loosely on how often I find myself using it: essential, useful, and notable. I also present some alternatives I have come across that duplicate the function of the software I choose so you can explore other options. While some of these products may seem obvious, I try to note the features that may not be obvious that graduate students should find particularly attractive.

(Sidenote: I am a Windows user though most of the software I mention here is both free and available on Mac and PC (and often Linux). If the software is not Mac compatible I will try to suggest a Mac and Linux alternative.)

Essential:


Dropbox


What is it?: Dropbox is a drop dead simple tool for backing up and synchronizing files across multiple computers. Dropbox sets up a new folder on your computer that is constantly synchronized (quietly in the background) with the Dropbox webserver. Every file in that folder is automatically uploaded whenever changes are made to it. Any computer you own that you link to your Dropbox account will also synchronize with this server ensuring that any changes you make to files in your Dropbox folder propagate nearly instantaneously across all machines connected.

Why do I need it?: If you are tired of using a USB stick to constantly synchronize files between two or more machines and share files with friends and colleagues (and who isn't?) then Dropbox is for you.

Pros: Simple backup. Easy sharing. Tons of control. 2GB of storage for free (more than enough for a semester's worth of readings/projects/data). Minimal system footprint (read: small install and no drag on your system). Makes file sharing extremely simple--you can e-mail someone a link to a file that will always be the most up to date and current version of that file (great for sharing dissertation chapters with your faculty adviser).

Cons: Requires a connection to the internet. Accessing Dropbox folder via web-interface is not as slick as using it on your machine. No intermediate paid storage option (the first tier plan is $5.99 for 50GB).

Price and Availability: Free for up to 2GB of storage and unlimited synchronized machines. Works on Windows, Linux, and Mac.

Alternatives: Paid alternatives like Mozy and Carbonite offer more comprehensive backup and online storage, but not the ease of access and simple synchronization across machines that Dropbox provides.

Gmail


What is it?: Gmail is Google's take on e-mail. It is a web-based e-mail service like Hotmail and Yahoo! Mail that is accessed through a browser.

Why do I need it?: Simply put, Gmail is very simple. It can manage multiple e-mail accounts easily. It is powered by Google search, making it easy to find what you are looking for, and it has an incredibly flexible sorting system, making it easy to manage your workflow by sorting e-mail into folders as it arrives (think sorting all TA related e-mail into one folder and course related e-mails into their own separate folder). It also sports a great task list that allows you to create multiple to-do lists to keep track of multiple projects.

Pros: Huge storage space for holding your thousands of attachments. Large attachment limits for sending big files. Easily searchable. Quick to access from your browser. Works on any computer anywhere (whether in the lab, in the office, or on the bus from your iPhone). Compatible with popular mail clients (Outlook, Thunderbird, etc.) if you like keeping a local copy of e-mails just in case.

Cons: If Google experiences a server outage, so too might your e-mail capabilities. Not accessible offline when flying to a conference. You won't know what to do with all the time it saves you.

Price and Availability: Free and available from any computer and operating system that is capable of running a modern browser.

Alternatives: As mentioned, Hotmail (or Windows Live! Mail) and Yahoo! Mail. Both are serviceable alternatives but lack the sorting features mentioned above.

Google Reader


What is it?: An online RSS aggregator and reader. (Note, if you don't know what RSS is, check out Google's quick guide to Reader for an explanation.)

Why do I need it?: RSS feeds allow you to monitor multiple webpages simultaneously from one single place. Just about any site you visit regularly has RSS feeds allowing you to quickly and easily scan for new content and either read the whole article or flip quickly through headlines.

Pros: Quick, easy, and accessible from any internet-connected computer anywhere. Saves you lots of browsing to different sites by aggregating content. Fewer ads. Tons of tools for sharing content with friends via e-mail, Facebook, Twitter and more.

Cons: Not intuitive to set up (but it does offer pre-packaged groups of feeds for selection and features a great suggestion mechanism). Not accessible offline when you are on the bus on your way to class.

Price and Availability: Free and available from any computer and OS capable of running a modern browser.

Alternatives: Firefox plugins, Outlook, and several desktop clients for RSS feeds.

Google Calendar


What is it?: A web-based calendar tracking system from Google.

Why do I need it?: To keep track of your schedule and the schedules of others. It is a great way to keep track of the different events in your life all in one easy-to-use place, and it integrates well with Gmail.

Pros: Easily accessible and always up to date anywhere you access it from. Can coordinate multiple calendars and event types. Multiple notification methods keep you from forgetting an event. Easily shared with friends and family. New improvements added all the time. Keeps track of tasks from Gmail.

Cons: Multiple calendar system is not always intuitive. Interface is not as graphically impressive as some alternatives.

Price and Availability: Free and available from any computer and OS capable of running a modern browser.

Alternatives: iCal, Outlook, Zoho Planner.

CCleaner


What is it?: A small utility for deleting unnecessary temporary files on your computer that collect over time.

Why do I need it?: To keep your computer running lean and clean and prevent it from getting bogged down with useless temporary files.

Pros: Small, light, fast, and easy to use. Keeps your computer in tip top shape.

Cons: Deleting browser cache may mean you have to re-login to all the websites you normally visit. Telling CCleaner which cookies to keep to avoid this nuisance is not a straightforward process.

Price and Availability: Free. Only available for Windows PCs.

Alternatives: None.

FoxIt Reader


What is it?: A lightweight PDF reader that serves as an alternative to Adobe Reader.

Why do I need it?: In addition to being a lightweight alternative to Adobe that loads more quickly, it also allows you to annotate PDF documents for free. This can save you the hassle of printing out all of those e-reserves and let you take notes directly on the documents, including highlighting, underlining, and commenting all within the document itself. It also features a tabbed browsing interface for viewing multiple documents simultaneously.

Pros: Tabbed interface keeps clutter off the taskbar when viewing multiple documents. Small and fast to load. Commenting is free and included in the software. Can be run from a USB key if not installed on the machines in your office.

Cons: Free version occasionally asks a user to upgrade to the Pro version. If you underline or highlight in the free version, a small watermark is placed on the pages when printing.

Price and Availability: Free and available for both Windows and Linux (though with less functionality on Linux machines).

Alternatives: Adobe Acrobat Reader.

Microsoft Office 2007


What is it?: Microsoft's ubiquitous office suite.

Why do I need it?: This is hands down the best version of Office so far. While free alternatives may do 90-95% of the things that Microsoft's Office can do, the parts they are missing are key. Microsoft Word and Excel are professional quality document creation programs. In a world where reading, writing, and editing research papers is the main craft, you simply cannot use anything less than professional grade. If you haven't used 2007, the new ribbon interface may throw you for a loop, but once you embrace it you will discover capabilities in this Microsoft product you never knew existed.

Pros: 100% compatible with the documents your colleagues are all using. Easy to output to multiple formats (PDF, docx, rtf, doc, etc). Can be integrated with Office Live for online backup of your files. Industry standard software.

Cons: It is not free. It can be a bit of a resource hog especially if you only need to make minor edits to a document. Track changes is not an ideal editing system.

Price and Availability: Available for Windows (Mac users get Office 2008). Students with a valid .edu address can get Microsoft Office 2007 Ultimate for only $59.95.

Alternatives: OpenOffice, GoogleDocs, Zoho, and Writeboard.

Useful:


Evernote


What is it?: Evernote's goal is to be a more efficient version of your brain. It is a web-based storage and organization system for keeping track of anything you want to remember in virtually any format you need to remember it in. Evernote allows you to drop any document/picture/note into it, tag it for organization, and store it to look at later.

Why do I need it?: Evernote is a quick and easy way to keep track of research ideas, data sources, and bibliography entries that you don't want to forget on the go. By tagging items you upload into Evernote you can easily keep track of why you stored them there and deal with them later. The range of uses for Evernote is virtually infinite, but two of my personal favorites are uploading pictures of books I want to buy or read (from my mobile phone) while at a bookstore or using Evernote to convert a handwritten note (again uploaded via a picture from my phone) into editable text.

Pros: Accessible from anywhere using your cell phone or a PC with an internet connection. Virtually unlimited possible uses for organizing. Free. Updated frequently. Loads of tips and tricks made freely available by its active fanbase. Super secure.

Cons: Like any organization system it takes some getting used to.

Price and Availability: Free and available from any computer (via a desktop application or the web browser) or any mobile phone (through an application or via SMS).

Alternatives: Microsoft Office OneNote, Zoho Notebook, various Firefox extensions including Zotero.

Google Docs


What is it?: Google's online answer to the Microsoft Office suite including word processor, spreadsheet, and presentation building utilities.

Why do I need it?: For collaborating on documents with multiple authors, Google Docs is a huge step up from spawning several clones of a file and trying to coordinate them by e-mailing them back and forth. Also, as a quick and easy way to edit files on a machine that is not your own, Google Docs is more than capable of quickly editing a file and doing 75-80% of the tasks possible in Microsoft Word. It also allows you to upload a large number of files and store them online as another source of backup for your important documents.

Pros: Easy to use. Integrates beautifully with Gmail, allowing you to click on an attachment to an e-mail, edit it directly in Google Docs, and e-mail it right back without ever leaving your browser. Lots of room to share and store documents online. Offline access possible with Google Gears. No setup required: you can start using Google Docs the minute you set up a Google Account.

Cons: Not so great with very large documents. Formatting and printing are not as straightforward as in a desktop office suite. Cannot batch upload files, meaning if you want to migrate to Google Docs you'll have to upload a few files at a time.

Price and Availability: Free and available from any computer and OS capable of running a modern browser.

Alternatives: Microsoft Office 2007, OpenOffice.org, iWork, Zoho, Microsoft Office Web Applications.

Google Groups


What is it?: An online tool for collaboration on documents, sharing content, and coordinating activities and events. It includes an ability to create custom pages, set up an e-mail listserv, share files, and share news quickly and easily.

Why do I need it?: Have you ever had a group project where everything the group needed to read, create, and revise was all in one easy-to-access place? Didn't think so. Google Groups allows you to create just such a place so quickly and easily it is feasible to set up a group for even a short-term project. Just a few clicks and the group is set up; then you can e-mail invites to everyone on the project and get to work. The e-mail list feature alone is worth the price of admission (free, by the way) because it allows you to be sure everyone in the group is on the same page and avoid miscommunication ("You brought the CD right? No? I must have forgotten to CC you on that e-mail"). The ability to create ad hoc mailing lists outside of the constraints of the university makes your life a thousand times easier and your group more efficient.

Pros: Easy to set up. E-mail lists are a snap. Pages can be added for all group members to see and edit. Lots of control over user permissions, allowing you to share your work without being afraid of someone changing it and/or destroying it.

Cons: Not completely intuitive to some users at first. Relatively small online storage space for files. Cannot collaborate on uploaded files or simultaneously edit pages within the group.

Price and Availability: Free and available from any computer and OS capable of running a modern browser. Not available offline.

Alternatives: Yahoo! Groups is a serviceable alternative. For collaboration and group workflow see Google Docs, Google Wave, Zoho collaboration tools and the wikispaces option listed below.

Google Chrome


What is it?: The next generation browser from Google that is lightweight and incredibly fast.

Why do I need it?: If you use some of the online web applications mentioned in this post you will see a noticeable increase in speed and rendering fidelity. If you are working on a lightweight machine (like a netbook) you will also see a dramatic decrease in load times for your browser. It also allows you to create application shortcuts (on the desktop or in the start menu) to many web applications (like Facebook, Gmail, etc.), which allows you to load these pages more quickly and without the clutter of the rest of the browser interface.

Pros: Lightweight and incredibly fast. Clean interface that feels refreshingly simple. Just works.

Cons: Not compatible with some websites optimized for Internet Explorer (such as many of Microsoft's own web pages). No ad blocking plugins or plugins of any kind. Bookmarking system not as sophisticated as that found in Firefox.

Price and Availability: Free. Officially only available on Windows, but betas of versions for both Linux and Mac are also available.

Alternatives: Mozilla Firefox. Safari. Opera.

CutePDF


What is it?: CutePDF is a simple utility that allows you to turn any document into a PDF file that is easily readable on any machine.

Why do I need it?: CutePDF allows you to take any file on your computer that is normally printable (a Word document, Excel spreadsheet, web page, etc.) and instead turn it into a professional quality PDF. It installs a simple print driver on your machine that lets you "print" any page to a PDF file and then save that file and share it with others.

Pros: Works on any document type that can be printed on a computer. It is a tiny utility that stays out of your way until you need it.

Cons: Requires an install so not an option for a machine you don't own. Some web alternatives now exist that are accessible anywhere. Interface feels old and clunky.

Price and Availability: Free and available for both Windows and Mac.

Alternatives: PDF Converter Online.

Notable:


Google Wave


What is it?: Google bills it as the future of e-mail, but it is better to think of it as a tool for online collaboration both in real time and back and forth like e-mail. However, it perpetually defies description and you are probably better off just watching these videos.

Why do I need it?: For collaborating on a document or some other content with another author in real time. Google Wave gives you the ability to make simultaneous edits to a block of text, insert photos, and collaborate and share a variety of other content all within your browser. It is also highly flexible and expandable, allowing your "waves" to interact with a variety of web services you already know and love.

Pros: Real time communication. Eases collaboration. Looks incredibly cool.

Cons: You can't get it yet. It is still in the preview (alpha) stage of development and only a small group of users (just over 100,000) have access to it. The user interface may not be intuitive. Doesn't work well unless everyone on your team is familiar with it and knows how to use it.

Price and Availability: Free. Still in an invite-only preview phase of development that should expand within the next 6 months to open up more to the public.

Alternatives: wikis, Google Groups, Zoho Projects, e-mail.

WordPress


What is it?: An extensible professional personal blogging platform.

Why do I need it?: Blogs are not something all graduate students need by a long shot; however, there are some really powerful potential uses if the need strikes. Blogs serve as a great way to share ideas, files, photos, and other content in a central place. They help organize thoughts and coordinate team efforts. And they can serve as a great promotional tool when heading onto the academic job market.

Pros: Incredibly flexible. Tons of help available online for free. Powerful editing and content sharing tools. Expandable as your site grows to add multiple authors, multiple comment streams, and social media tie-ins.

Cons: Overwhelming at first. Best features require you to host the blog on your own hosting solution (costing money). Advanced editing and site design require some familiarity with HTML.

Price and Availability: Free including hosting at wordpress.com. For self-hosted sites you will need to have a web-host that allows FTP access.

Alternatives: Movable Type, Tumblr, Posterous, Blogger.

Digsby


What is it?: While it bills itself as an IM client, I like to think of Digsby as a real time communication dashboard. Digsby has the standard features of letting you connect to all of the standard IM clients simultaneously and chat with your friends. However, more interestingly, Digsby also integrates seamlessly with social media websites (like MySpace, Facebook, Twitter, and LinkedIn) as well as your POP3 compatible e-mail service (virtually any service).

Why do I need it?: Digsby is your one stop shop for real time communication. Whenever I have a wi-fi signal I leave Digsby open and set it up to notify me as e-mails, tweets, Facebook messages/wall posts, and IM messages come streaming in. I can choose to deal with any of these events immediately (and often from within Digsby), or I can choose to leave them for later. Digsby keeps you connected without forcing you to have several web browser tabs open, or a resource hogging IM client like MSN Live.

Pros: Another lightweight program. Twitter functionality is impressive and unparalleled in an IM client. Offers lots of connectivity for its users. It synchronizes your settings and IM credentials online across PCs, so once you set it up on one PC you simply have to log in to Digsby on another and it is pre-configured and ready to go!

Cons: Ad-supported installer can be a little confusing and trick you into installing a toolbar (but the software itself is ad-free). No Skype support.

Price and Availability: Free and available for Windows. Mac and Linux versions are in development.

Alternatives: Adium, iChat, Trillian, Pidgin.

Wikispaces


What is it?: Wikispaces is an easy hosted wiki solution for quickly creating a wiki to coordinate collaboration among several group members.

Why do I need it?: Like Google Groups, Wikispaces allows you to easily set up a single place for collaborators to share ideas and edit the same content. Wikispaces requires minimal start up costs and is easily accessed by everyone in your team no matter where they are. Great for a research team working together to coordinate data collection, preliminary and final results, as well as arrange meeting times and keep track of contact information.

Pros: Free. Easy to set up. Simple and intuitive interface. Lack of distracting design clutter.

Cons: Requires some planning to use efficiently. Small per-file size limit (10MB). Ad-supported. Requires registration at yet another website.

Price and Availability: Free and accessible from any web browser. Paid solutions provide valuable upgrades at reasonable prices.

Alternatives: TikiWiki (a self-hosted solution), TiddlyWiki, Zoho Wiki. And for a review of other alternatives check out this review of wiki solutions (from the School Library Journal).

Miscellaneous:


Mozilla Firefox Extensions


If you do choose to use some of these tools and access them from the Mozilla Firefox browser, I highly recommend a collection of Firefox plugins to make your experience faster, cleaner, and more enjoyable. Be sure to check them out here.

Conclusion:


I find myself using the above software all the time. The best part is that most of it is free and easy to find support for online. Sometimes the startup cost of trying new software can prohibit users from finding a digital solution to a workflow problem, and the software above is no exception. However, I have found that most of the software above is well worth the minimal cost it imposes in familiarizing yourself with it, and once you pick up most of these tools you won't ever put them down.

Look for some more software list posts in the future on general computer utilities (of interest not just to graduate students but all computer users), academic software, and self-marketing/web presence tools online.

For now, leave a comment or pass the list along to your friends.

Don't think I gave some software a fair shake? Disagree about my choices? Think I'm a Google fan-boy? Let me know in the comments.

Geography and Data

The Economist recently ran a fascinating article about the emergence of geographical databases and their uses for presenting and analyzing data.
All this has made it much easier to create maps that explain—at a glance—something that might otherwise require pages of tables or verbiage. “A percentage or a table is still abstract for people,” says Dan Newman of MAPLight.org, a group based in Berkeley, California that charts the links between politicians and money. “With maps, you can show people how an abstract concept connects to where they live.”

Political scientists are just starting to get with the program. James G. Gimpel at the University of Maryland is one of them. He describes one of his on-going research projects as follows (from his website):
Joined by my colleague Wendy Cho (University of Illinois, Urbana-Champaign), we are using new geographic analysis tools to investigate the spatial distribution of campaign contributors and volunteers to political parties and candidates.   We are leading a new investigation that will use flow maps to show patterns of cross-state voter mobility and the redistribution of the electorate through migration.   We are advancing the application of new methodologies for the exploration of a variety of political phenomenon.

This is close to the kind of work I see myself doing as well. I would like to learn as much as I can about geospatial analysis and GIS tools to be able to bring them to bear on questions of policy outcomes. I can often think of the small-scale story of how geography affects local policy outcomes--a rancher living a hundred miles from the county seat is unable to attend a meeting about the location of a county road and thus does not have his/her voice heard.[1. Interestingly enough, if we could reasonably calculate the costs associated with attending the meeting, we could estimate a threshold of benefits that must be reached before this rancher would attend the meeting and be heard.] But the real power here would be in aggregating up this information and seeing how it plays out in representation on state and national issues as well. Personally I would like to focus on state policies, specifically education policy, but I would like to see tools developed that help us answer local and national questions as well.

Another interest along these lines is the ways that technology is changing these geographic constraints. From virtual education programs for rural schools to open government initiatives that put government data on the web to be accessed anytime and anywhere, the role that geography plays is shifting fundamentally. As a self-professed geek I would certainly like to look at the ways that emerging technologies are reshaping the ways geography enables and constrains political participation and policy outcomes.

John Geraci has an interesting post over at O'Reilly Radar on this very subject:
What we really want (or what I really want anyway) is not simply government transparency, but an open civic system - a civic system that operates, and flourishes, as a fully open system, for whatever level we happen to be talking about - federal, state, city, neighborhood, whatever. And transparency is a big part of that open civic system, but it is still only one part.

Mr. Geraci has come up with an interesting way of organizing the different interactions necessary for truly open government as follows:

  • Government to Citizen (G2C)

  • Citizen to Government (C2G)

  • Citizen to Citizen (C2C)

  • Government to Government (G2G)


Obviously the G2C stuff is happening a lot now with the Obama administration, and social scientists have been using government information sources to help promote this openness for quite some time. Technology has really accelerated the opportunities here though, and it is something to keep an eye on. Further on in the post, John points out some emerging examples of the other types of data being used for open government (such as a project that allows city residents to report bad roads and lobby for them to be fixed).

If technologies like these continue to develop they may lower many of the main barriers to participation found in rural areas. Yet other costs will remain. Something a lot of advocates of open and transparent data sources seem to forget is that this transparency is only accessible to a small audience, namely the tech-savvy and well-connected advocates. True government transparency does not mean simply providing the data on a website in an open file format; it also means civic education to make citizens aware of the information available to them and to instill in them a sense of responsibility for the use of that data. Pretty soon information literacy will not simply be a skill for researchers, students, and technologists; it will be a fundamental part of civic engagement in much the same way literacy was a hundred years ago.

Yet, if we can learn anything from the "old" literacy, it is this: a literacy gap will develop along some lines[2. These could be racial, social, economic, geographic or some combination.] and, if unattended, that gap could lead to policy outcomes that disproportionately favor the literate population at the expense of those left behind.

Where are the advocates for Education 2.0? (Sounds like a topic for a future post...)

Footnotes: