How to use pipes to clean up your R code

A quick guide to one of R’s most important operators

I’ve talked a little bit about pipes (written as %>%) in a past blog post, but they’re so important in R that I thought they deserved their own post.

In this tutorial, I’ll explain what pipes are and when you can use them, and then I’ll demonstrate how useful they are for writing clean and neat R code.

Image saying 'Using pipes in R', showing pipes connecting a workflow from data to filtering to mutating to grouping to summarizing.

What is a pipe?

A pipe is a type of operator in R that comes with the magrittr package. It takes the output of one function and passes it as the first argument of the next function, allowing us to chain together several steps in R. Pipes help your code flow better, making it cleaner and easier to follow.
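To make that concrete, here is a minimal sketch using two base R functions, sqrt() and sum(). The variable names piped and nested are just mine for illustration, and I load magrittr directly only so the snippet stands on its own.

```r
# Load magrittr, the package that provides the %>% operator
library(magrittr)

# With pipes: the vector goes into sqrt(), and that result goes into sum()
piped <- c(1, 4, 9) %>% sqrt() %>% sum()

# Without pipes: the same steps, nested inside one another
nested <- sum(sqrt(c(1, 4, 9)))

piped
## [1] 6
identical(piped, nested)
## [1] TRUE
```

Both versions do the same thing; the piped one simply reads left to right, in the order the steps happen.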

The pipe shines when used in conjunction with the dplyr package and its functions such as filter, mutate, and summarise, as we often need to use these one after another to manipulate our data. Luckily, the pipe comes loaded with dplyr, so there’s no need to load the magrittr package unless you specifically need to use the other magrittr operators.

Image of a pipe in R with crossed out text below it reading 'This is not a pipe' in French, in reference to Magritte's painting called 'The Treachery of Images'. There is text on top that says 'This is a pipe.'

A quick demonstration on how to use pipes

Let’s see pipes in action. First, load the dplyr package and then load the classic iris data set that comes with R. If you don’t have dplyr installed yet, you’ll need to run install.packages("dplyr") before loading the package.

# Load dplyr
library(dplyr)
# Load data
data("iris")

# View data
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

These data describe several measurements for three plant species (Iris setosa, Iris versicolor, and Iris virginica). These measurements describe morphological differences among the three species in terms of sepal length and width and petal length and width, all in centimeters.

I want to keep only the largest plants in the data set, so let’s only include plants with Sepal.Length greater than 5 cm, and Petal.Length greater than 3 cm. I also want to create two columns called “Sepal.Area” and “Petal.Area”, equivalent to length x width (for an approximation of sepal/petal area). To do this, I’ll use the filter() and mutate() functions. Notice that I also hit “Enter” or “Return” to add a new line after every pipe to keep the code clean and keep each function on a separate line.

# Filter and mutate data
new_iris <- iris %>% 
  filter(Sepal.Length > 5 & Petal.Length > 3) %>%
  mutate(Sepal.Area = Sepal.Length * Sepal.Width,
         Petal.Area = Petal.Length * Petal.Width)

# View new data
head(new_iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species Sepal.Area
## 1          7.0         3.2          4.7         1.4 versicolor      22.40
## 2          6.4         3.2          4.5         1.5 versicolor      20.48
## 3          6.9         3.1          4.9         1.5 versicolor      21.39
## 4          5.5         2.3          4.0         1.3 versicolor      12.65
## 5          6.5         2.8          4.6         1.5 versicolor      18.20
## 6          5.7         2.8          4.5         1.3 versicolor      15.96
##   Petal.Area
## 1       6.58
## 2       6.75
## 3       7.35
## 4       5.20
## 5       6.90
## 6       5.85

Our data set looks good. You’ll see that my arguments in the filter() and mutate() functions are a bit different from usual. Normally, most of the dplyr functions are formatted like this: function(data, arguments).

Remember that a pipe takes the output of whatever comes before it and passes it as the first argument of the function that follows. Thus, the filter() function receives iris as its data argument, and then the mutate() function receives the output of filter(iris, Sepal.Length > 5 & Petal.Length > 3) as its data argument.

With pipes there was no need for me to write filter(iris, Sepal.Length > 5 & Petal.Length > 3), because that would be repetitive—I could just skip straight to the arguments and write filter(Sepal.Length > 5 & Petal.Length > 3).
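If you’d like to check this equivalence for yourself, a quick sketch like the one below should do it. The names piped and direct are just placeholders I made up for the comparison.

```r
library(dplyr)

# Piped version: iris is passed in as the first argument of filter()
piped <- iris %>%
  filter(Sepal.Length > 5 & Petal.Length > 3)

# Direct version: iris is written out as the first argument
direct <- filter(iris, Sepal.Length > 5 & Petal.Length > 3)

# Both calls return the same data frame
identical(piped, direct)
## [1] TRUE
```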

To summarize in plain English (each then in this sentence can be replaced with a pipe):

  • I wrote code starting with the iris data set, then filtered it by Sepal.Length and Petal.Length, then used mutate to create two new columns.

Without pipes, our sentence becomes longer:

  • I wrote code starting with the iris data set. I filtered the iris data set by Sepal.Length and Petal.Length. Using the filtered iris data, I used mutate to create two new columns.

And those are the essentials of using pipes!

Cleaning code with pipes

After that last example, you might be thinking, OK, that’s pretty cool. But can it really make that big of a difference for organizing my code? The answer is…yes! And I’ll quickly demonstrate why.

Example 1: Creating new variables for each step

Let’s filter and mutate our data like we did above, then group by species and summarize to find the average sepal and petal area within each species. Without pipes, our code might look like this:

filtered_iris <- filter(iris, Sepal.Length > 5 & Petal.Length > 3)
mutated_iris <- mutate(filtered_iris, 
                       Sepal.Area = Sepal.Length * Sepal.Width,
                       Petal.Area = Petal.Length * Petal.Width)
grouped_iris <- group_by(mutated_iris, Species)
summary_iris <- summarize(grouped_iris, 
                          avg.sepal.area = mean(Sepal.Area),
                          avg.petal.area = mean(Petal.Area))

# View result
summary_iris
## # A tibble: 2 × 3
##   Species    avg.sepal.area avg.petal.area
##   <fct>               <dbl>          <dbl>
## 1 versicolor           17.0           5.93
## 2 virginica            19.8          11.4

Whew. It can be a little exhausting to have to save each step as a new variable, and now our environment is cluttered with a bunch of intermediate variables. Aside from the clutter, this approach is also more error-prone: if you change something in an earlier step but forget to rerun that line, the later steps will quietly use the stale version. So let’s not do that.

Example 2: Nesting functions

Let’s try another method, where we nest each function inside the previous one.

summarize(group_by(mutate(filter(iris, 
                                 Sepal.Length > 5 & Petal.Length > 3), 
                          Sepal.Area = Sepal.Length * Sepal.Width,
                          Petal.Area = Petal.Length * Petal.Width), 
                   Species),
          avg.sepal.area = mean(Sepal.Area),
          avg.petal.area = mean(Petal.Area))
## # A tibble: 2 × 3
##   Species    avg.sepal.area avg.petal.area
##   <fct>               <dbl>          <dbl>
## 1 versicolor           17.0           5.93
## 2 virginica            19.8          11.4

That doesn’t really look much better. If all these nested functions are making your head spin, don’t worry, it’s doing that to me too. Code like this is a great way to spend hours searching for errors… only to realize you’re missing a parenthesis. 😖

Example 3: Pipes!

Let’s try it with pipes:

iris %>% 
  filter(Sepal.Length > 5 & Petal.Length > 3) %>%
  mutate(Sepal.Area = Sepal.Length * Sepal.Width,
         Petal.Area = Petal.Length * Petal.Width) %>%
  group_by(Species) %>%
  summarize(avg.sepal.area = mean(Sepal.Area),
            avg.petal.area = mean(Petal.Area))
## # A tibble: 2 × 3
##   Species    avg.sepal.area avg.petal.area
##   <fct>               <dbl>          <dbl>
## 1 versicolor           17.0           5.93
## 2 virginica            19.8          11.4

Now the flow of our code is much cleaner and clearer. Others will be able to follow our code much more easily, and there’s no need to create new variables each step of the way. Pipes take us smoothly from beginning to end.

This way of writing the code also lets us insert comments at each step so we can clearly document our process:

iris %>% 
  # first filter and keep only plants with sepals longer than 5 cm
  # and petals longer than 3 cm:
  filter(Sepal.Length > 5 & Petal.Length > 3) %>%
  # then approximate sepal and petal area by multiplying length and width:
  mutate(Sepal.Area = Sepal.Length * Sepal.Width,
         Petal.Area = Petal.Length * Petal.Width) %>%
  # after that group by species to summarize the mean 
  # sepal/petal area of each species:
  group_by(Species) %>%
  summarize(avg.sepal.area = mean(Sepal.Area),
            avg.petal.area = mean(Petal.Area))
## # A tibble: 2 × 3
##   Species    avg.sepal.area avg.petal.area
##   <fct>               <dbl>          <dbl>
## 1 versicolor           17.0           5.93
## 2 virginica            19.8          11.4

All that said, I’m not suggesting that your entire R analysis script fit inside one long set of pipes. Find what works best for you and your analyses in terms of splitting up your code into neat organized chunks that make sense.
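As one possible way to split things up (a sketch, not a rule), you could save the cleaned data once and then start a second, shorter pipe for the summary. The names large_iris and species_summary are hypothetical; pick whatever makes sense for your own analysis.

```r
library(dplyr)

# Chunk 1: data cleaning (large_iris is a made-up name for illustration)
large_iris <- iris %>%
  filter(Sepal.Length > 5 & Petal.Length > 3) %>%
  mutate(Sepal.Area = Sepal.Length * Sepal.Width,
         Petal.Area = Petal.Length * Petal.Width)

# Chunk 2: summary, starting from the saved result of chunk 1
species_summary <- large_iris %>%
  group_by(Species) %>%
  summarize(avg.sepal.area = mean(Sepal.Area),
            avg.petal.area = mean(Petal.Area))

species_summary
```

Saving only the one intermediate object you plan to reuse keeps the environment tidy while still giving you a natural checkpoint between cleaning and summarizing.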

We owe a big thank you to Stefan Milton Bache (@stefanbache on Twitter), creator of the magrittr package and the almighty pipe! Hope you found this tutorial helpful. Happy coding!

P.S. A highly relevant tweet explaining pipes… (from WeAreRLadies on Twitter) Image of text saying I woke up, then showered, then dressed, then glammed up, then showed up to work, with pipes instead of the word then



If you enjoyed this tutorial and want to learn more, you can check out Luka Negoita's full course on the complete basics of R for ecology here:

Also be sure to check out R-bloggers for other great tutorials on learning R.
