How to organize your analyses with RStudio Projects
Or how to stay sane when working on big projects
Here is a post that I am sharing from my old blog to get this one started. Enjoy!
In this post I’ll go over a basic method for organizing your ecological data analysis projects in R. Why do this? Reproducing analyses is critical for good science. There is nothing worse than trying to re-run a script when you finally get comments back from your reviewers, only to find that your results are a bit different from before. What?! Speaking from personal experience, it’s taken days of blood, sweat, and tears to figure out what was different in the data, what code I was running in the wrong order, or that I was running the wrong code altogether! Start now and get in the habit of sticking to a system for organizing your R projects.
While there are many methods and variations on how to do this (see links at the end of the post), this post offers a short and simple overview of my own method so that you can get started ASAP. Those who follow me know that I am a big fan of getting right into the code and data, because that is the best way to learn. So let’s get to it.
1) Use RStudio for all your analyses. Some of you 1% hardcore coders might prefer the minimalist terminal-type interface included in the basic R download, but for everyone else, use RStudio. It’s a no-brainer. See my video tutorial here on how to install it.
2) Create a new project (File > New Project). The directory you set here will be the folder where you store your data, scripts, and other files related to your analysis.
3) Create the folder structure inside your project folder so that it looks like this:
- “data” is where you keep your data, split into two folders, “raw” and “processed”. This is self-explanatory: “raw” is where you save your data as you entered or downloaded it (usually an Excel spreadsheet file), and “processed” is where you save the CSV file ready for reading into R.
- “output” is where you save all the figures and tables that you generate with your R scripts.
- “scripts” is where you keep all the R code files.
- Finally, “temp” is not necessary, but I’ve found it very useful. It is a folder where I can save any temporary outputs or scripts that I want to test out or explore, but that I know should not get confused with the final output of my analyses.
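If you would rather build this folder structure from the R console than by hand, here is a minimal sketch. The function name create_project_dirs() is my own invention for illustration (it is not part of RStudio or base R); it only uses base R’s dir.create() and assumes the folder names described above.

```r
# Hypothetical helper (not an RStudio or base R function): create the
# folders described above inside a project directory.
create_project_dirs <- function(root = ".") {
  dirs <- file.path(
    root,
    c("data/raw", "data/processed", "output", "scripts", "temp")
  )
  # dir.create() takes one path at a time, so loop over the folders.
  # recursive = TRUE also creates the parent "data" folder;
  # showWarnings = FALSE keeps it quiet if a folder already exists.
  for (d in dirs) {
    dir.create(d, recursive = TRUE, showWarnings = FALSE)
  }
  invisible(dirs)
}
```

With the project open in RStudio, calling create_project_dirs() with no arguments should create the folders in the project folder, since that is the working directory.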
4) Create your R scripts. Unless your analysis is very simple and direct, you should be using multiple scripts (pretty much always the case when your project is large enough for an entire publication). Ideally, each script should be a set of code that you can run in one go. This is not always possible, but strive for that and use a separate script for each component of the analysis. I recommend you create the following scripts right away:
- Script for loading packages and custom functions
- Script for cleaning up and preparing the data for analysis
- Script for each analysis in the project. For example, one study might need a figure that presents two histograms for visualization purposes, plus a linear mixed effects regression to test your primary hypothesis. Each of those should have its own script.
- Name each script using this format: “##_name_v#”, where “##” indicates the order that the scripts should be run in, “name” is a descriptor, and “v#” indicates the version number. Sometimes you want to change a script but should keep the older version in case you mess something up. That’s where saving a new file with an updated version number makes sense. So, altogether your first set of scripts might look like this: 00_packages_v1.r, 01_dataclean_v1.r, 02_HistogramFigure_v1.r, 03_LMER_v1.r
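One payoff of this naming format is that R can work out the run order for you. The sketch below is just one way to do it, assuming every file follows the “##_name_v#.r” pattern exactly. The function latest_scripts() is a hypothetical helper of my own, not something from a package. It keeps only the highest version of each script and returns the file names in run order.

```r
# Hypothetical helper: from a vector of file names, keep the newest
# version of each numbered script and return them in run order.
# Assumes names follow the "##_name_v#.r" pattern described above.
latest_scripts <- function(files) {
  pattern <- "^([0-9]{2})_(.+)_v([0-9]+)\\.[rR]$"
  files <- files[grepl(pattern, files)]  # ignore anything else
  info <- data.frame(
    file    = files,
    step    = sub(pattern, "\\1", files),              # "00", "01", ...
    name    = sub(pattern, "\\2", files),              # descriptor
    version = as.integer(sub(pattern, "\\3", files)),  # number after "v"
    stringsAsFactors = FALSE
  )
  # Sort by step and name, newest version first, then keep the first
  # row for each step/name pair.
  info <- info[order(info$step, info$name, -info$version), ]
  info <- info[!duplicated(info[c("step", "name")]), ]
  info$file
}
```

To run everything in order, you could loop over latest_scripts(list.files("scripts")) and call source(file.path("scripts", f)) on each file name f.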
5) Start off each R script with a good description of the entire project and the particular scope of that script. The more comments the better, but more on script commenting in another post. Here’s an example:
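A header along these lines is one possibility. Every detail in it (project title, author, dates, input and output file names) is a made-up placeholder for illustration, not taken from a real project, so swap in your own.

```r
# ------------------------------------------------------------------
# Project: <project title>                        (placeholder)
# Script:  01_dataclean_v1.r
# Author:  <your name>                            (placeholder)
# Created: <date>                                 (placeholder)
#
# Purpose: Cleans the raw data and saves a CSV that is ready for
#          analysis.
# Input:   data/raw/<raw file>                    (placeholder)
# Output:  data/processed/<clean file>.csv        (placeholder)
# Run after: 00_packages_v1.r
# ------------------------------------------------------------------
```

The “Run after” line is my own addition; it just records the run order so you do not have to rely on the file name alone.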
That’s pretty much it! Each time you open the project in RStudio, the scripts you had open in your last session will reopen (this is RStudio’s default behavior). Just make sure to run the packages and dataclean scripts before the others. Because an RStudio Project sets the working directory to the project folder, there is no need to include a setwd() line. Just add “data/processed/” before your filename whenever reading in any data, or add “output/” or “temp/” whenever exporting something.
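Here is a small sketch of what those relative paths look like in practice. The two wrapper functions, read_processed() and save_output(), are hypothetical helpers I made up for illustration; they only wrap base R’s read.csv() and write.csv(), and they assume the working directory is the project folder.

```r
# Hypothetical helpers built on base R. Paths are relative to the
# project folder, so no setwd() call is needed.

# Read a CSV from data/processed/
read_processed <- function(filename) {
  read.csv(file.path("data", "processed", filename))
}

# Write a table to output/ as a CSV
save_output <- function(x, filename) {
  write.csv(x, file.path("output", filename), row.names = FALSE)
}
```

For example, read_processed("mydata.csv") would read “data/processed/mydata.csv”, where “mydata.csv” is a placeholder file name. You could write a matching helper for “temp/” in the same way.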
If you want some longer in-depth explanations on code management in R, check out these other excellent blog posts:
Also be sure to check out R-bloggers for other great tutorials on learning R.