A Reproducible Pipeline for Processing the Global Dataset of Historical Yields (1981-2016) by Iizumi et al.

Author

Daniel Vartanian & Aline M. de Carvalho

Published

August 18, 2025

Overview

This report documents a reproducible R pipeline for processing the Global Dataset of Historical Yields (Iizumi, 2019; Iizumi & Sakai, 2020), which covers major crops from 1981 to 2016.

Data Availability

The processed data are available in GeoTIFF (.tif) format in a dedicated repository on the Open Science Framework (OSF). The files can also be accessed directly from R with the osfr package.
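As a rough sketch of the osfr route, the function below lists the files at the root of an OSF project and downloads them. The helper name `download_osf_files()` and the `node_id` argument are illustrative assumptions; the identifier of the actual repository is not given in this report and must be supplied by the reader.

```r
# Hypothetical helper: download every file stored at the root of an OSF
# project. `node_id` is a placeholder, not the real repository identifier.
download_osf_files <- function(node_id, path = tempdir()) {
  node <- osfr::osf_retrieve_node(node_id)

  # `n_max = Inf` lists all files instead of only the first 10.
  files <- osfr::osf_ls_files(node, n_max = Inf)

  # `conflicts = "skip"` leaves files that already exist locally untouched.
  osfr::osf_download(files, path = path, conflicts = "skip")
}

# Example call (requires internet access and a valid OSF identifier):
# download_osf_files("<osf-node-id>", path = "data")
```

The downloaded `.tif` files can then be opened with `terra::rast()`.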

Methods

Source of Data

The data used in this analysis come from the following source:

PANGAEA: A data publisher for earth and environmental sciences, which hosts the Global Dataset of Historical Yields (Iizumi, 2019; Iizumi & Sakai, 2020).

Data Munging

Data munging followed the data science workflow outlined by Wickham et al. (2023), as illustrated in Figure 1. All processing was done with the Quarto publishing system (Allaire et al., n.d.), the R programming language (R Core Team, n.d.), and several R packages.

Spatial data analysis was performed using the terra R package. For data manipulation and workflow, packages from the tidyverse and rOpenSci ecosystems, which adhere to the tidy tools manifesto (Wickham, 2023), were prioritized. All steps were designed to ensure transparency and reproducibility of results.

Figure 1: Data science workflow created by Wickham, Çetinkaya-Rundel, and Grolemund (2023).

Code Style

The Tidyverse code style guide and design principles were followed to ensure consistency and enhance readability.

Reproducibility

The pipeline is fully reproducible and can be run again at any time. See the README file in the code repository to learn how to run it.

Set the Environment

library(brandr)
library(fs)
library(geodata)
library(ggplot2)
library(groomr) # github.com/danielvartan/groomr
library(here)
library(httr2)
library(magrittr)
library(ncdf4)
library(orbis) # github.com/danielvartan/orbis
library(osfr)
library(readr)
library(stringr)
library(terra)
library(tidyterra)
library(zip)
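Before running the `library()` calls, it can help to check that every package is installed. The sketch below is one way to do this with base R. The helper name `missing_packages()` is illustrative, and the installation hint for the two GitHub-hosted packages assumes the remotes package, which this report does not mention.

```r
# Packages loaded by the pipeline (taken from the library() calls).
pipeline_packages <- c(
  "brandr", "fs", "geodata", "ggplot2", "groomr", "here", "httr2",
  "magrittr", "ncdf4", "orbis", "osfr", "readr", "stringr", "terra",
  "tidyterra", "zip"
)

# Hypothetical helper: return the packages that are not installed.
missing_packages <- function(pkgs) {
  is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!is_installed]
}

missing_packages(pipeline_packages)

# groomr and orbis are hosted on GitHub. One option (an assumption, not the
# authors' documented method) is:
# remotes::install_github("danielvartan/groomr")
# remotes::install_github("danielvartan/orbis")
```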

Set the Initial Variables

# Crop options.
crops <- c(
  "maize", "maize_major", "maize_second", "rice", "rice_major", "rice_second",
  "soybean", "wheat", "wheat_spring", "wheat_winter"
)

# Crop to process in this run (one of `crops`).
crop <- "rice"

raw_data_dir <- here("data-raw")

valid_data_dir <- here("data")

dirs <- c(raw_data_dir, valid_data_dir)

for (i in dirs) {
  if (!dir_exists(i)) {
    dir_create(i, recurse = TRUE)
  }
}

Download the Data

raw_file <- here(raw_data_dir, "raw.zip")

# Download the zipped dataset from PANGAEA and write it to `raw_file`.
paste0(
  "https://store.pangaea.de/Publications/",
  "IizumiT_2019/",
  "global-historical-yield_v1.2_v1.3_20190128.zip"
) |>
  request() |>
  req_progress() |>
  req_perform(path = raw_file)
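When the pipeline is rerun often, it may be worth skipping the download if the archive is already on disk. The sketch below is one possible wrapper, not part of the original pipeline. The helper name `download_if_missing()` and its `max_tries` argument are illustrative.

```r
# Hypothetical helper: download `url` to `path` unless the file already exists.
download_if_missing <- function(url, path, max_tries = 3) {
  if (file.exists(path)) {
    return(invisible(path))
  }

  # Retry transient failures up to `max_tries` times before giving up.
  httr2::request(url) |>
    httr2::req_retry(max_tries = max_tries) |>
    httr2::req_perform(path = path)

  invisible(path)
}
```

This only saves time if the archive is kept on disk; the pipeline as written deletes it after unzipping.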

Unzip the Data

raw_file |> unzip(exdir = raw_data_dir, overwrite = TRUE)

file_delete(raw_file)

Read the Data

dir <- here(raw_data_dir, crop)

files <- dir |> dir_ls(type = "file", regexp = "\\.nc4$")

data <- files |> rast()

Tidy the Data

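# Extract the year from the file name only, so that digits elsewhere in the
# path (e.g., in a parent folder name) cannot be matched by mistake.
years <- files |> path_file() |> str_extract("\\d{4}")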

names(data) <- years
varnames(data) <- rep(paste0(crop, "_yield"), nlyr(data))

data <- data |> shift_and_rotate(dx = 180)
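Layer names are taken from the first four-digit run found for each file. The toy example below, written in base R, shows why it is safer to match against the file name than against the full path. The paths and the `yield_<year>.nc4` naming pattern are assumptions made for illustration.

```r
# Hypothetical paths: the parent folder contains a four-digit number.
example_files <- c(
  "/home/user/project-2025/data-raw/rice/yield_1981.nc4",
  "/home/user/project-2025/data-raw/rice/yield_1982.nc4"
)

# Matching against the full path picks up the folder's digits.
regmatches(example_files, regexpr("[0-9]{4}", example_files))
#> [1] "2025" "2025"

# Matching against the file name alone returns the intended years.
example_names <- basename(example_files)
example_years <- regmatches(example_names, regexpr("[0-9]{4}", example_names))

example_years
#> [1] "1981" "1982"
```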

Data Dictionary

The data dictionary for the Global Dataset of Historical Yields is available here. For detailed information, see Iizumi (2019) and Iizumi & Sakai (2020).

Save the Data

valid_file <- here(valid_data_dir, paste0(crop, ".tif"))

data |> terra::writeRaster(valid_file, overwrite = TRUE)
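A quick way to check a saved file is to read it back and compare its structure with the object that was written. The sketch below does this with a small toy raster rather than the real data. The object names (`toy`, `toy_file`, `toy_back`) are illustrative.

```r
library(terra)

# Toy raster standing in for the processed data: two layers named by year.
toy <- rast(nrows = 2, ncols = 2, nlyrs = 2, vals = 1:8)
names(toy) <- c("1981", "1982")

toy_file <- tempfile(fileext = ".tif")
writeRaster(toy, toy_file, overwrite = TRUE)

# Read the file back and inspect the number of layers and their names.
toy_back <- rast(toy_file)

nlyr(toy_back)
names(toy_back)
```

The same check can be applied to `valid_file` by comparing `rast(valid_file)` with `data`.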

Visualize the Data

max_value <-
  data |>
  minmax() |>
  c() |>
  max(na.rm = TRUE) |>
  divide_by(5) |>
  ceiling() |>
  multiply_by(5)
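The chain above rounds the largest cell value up to the nearest multiple of 5, so that the color scale ends on a clean break. The same arithmetic in base R, with an illustrative helper name (`round_up_to()`) and made-up input values:

```r
# Hypothetical helper: round `x` up to the nearest multiple of `step`.
round_up_to <- function(x, step = 5) {
  ceiling(x / step) * step
}

round_up_to(12.3)
#> [1] 15

round_up_to(20)
#> [1] 20

# Breaks from 0 to the rounded maximum, in steps of 5.
seq(0, round_up_to(12.3), by = 5)
#> [1]  0  5 10 15
```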

world_shape <- world(path = raw_data_dir)

for (i in cut_vector(names(data), n = 9)) {
  i_data <- data |> select(all_of(i))

  i_plot <-
    ggplot() +
    geom_spatvector(data = world_shape) +
    geom_spatraster(data = i_data) +
    facet_wrap(~lyr, ncol = 2) +
    scale_fill_brand_b(
      direction = -1,
      limits = c(0, max_value),
      breaks = seq(0, max_value, 5) |> remove_caps()
    ) +
    labs(fill = "Yield (t/ha)")

  print(i_plot)
}
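The loop plots the layers in batches instead of all at once. `cut_vector()` is not defined in this report, and whether `n = 9` means nine elements per group or nine groups is not stated. The base R sketch below assumes the former and shows one way to split the 36 yearly layer names (1981 to 2016) into groups of nine.

```r
# Layer names as in the processed data: one per year from 1981 to 2016.
layer_names <- as.character(1981:2016)

# Split the names into consecutive groups of at most 9 elements.
chunks <- split(layer_names, ceiling(seq_along(layer_names) / 9))

lengths(chunks)
#> 1 2 3 4 
#> 9 9 9 9 
```

Under this assumption the loop would produce four figures, each with nine yearly panels.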

How to Cite

When using these data, you must also cite the original data sources.

To cite this work, please use the following format:

Vartanian, D., & Carvalho, A. M. (2025). A reproducible pipeline for processing the Global Dataset of Historical Yields (1981–2016) by Iizumi et al. [Computer software]. Sustentarea Research and Extension Group of the University of São Paulo. https://sustentarea.github.io/global-historical-yield

A BibTeX entry for LaTeX users is:

@misc{vartanian2025,
  title = {A reproducible pipeline for processing the Global Dataset of Historical Yields (1981–2016) by Iizumi et al.},
  author = {{Daniel Vartanian} and {Aline Martins de Carvalho}},
  year = {2025},
  address = {São Paulo},
  institution = {Sustentarea Research and Extension Group of the University of São Paulo},
  langid = {en},
  url = {https://sustentarea.github.io/global-historical-yield}
}

License

The original data sources may have their own license terms and conditions.

The code in this report is licensed under the GNU General Public License Version 3, while the report is available under the Creative Commons CC0 License.

Copyright (C) 2025 Daniel Vartanian

The code in this report is free software: you can redistribute it and/or
modify it under the terms of the GNU General Public License as published by the
Free Software Foundation, either version 3 of the License, or (at your option)
any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with
this program. If not, see <https://www.gnu.org/licenses/>.

Acknowledgments

This work is part of the Sustentarea Research and Extension Group project: Global syndemic: The impact of anthropogenic climate change on the health and nutrition of children under five years old attended by Brazil’s public health system (SUS).

This work was supported by the Department of Science and Technology of the Secretariat of Science, Technology, and Innovation and of the Health Economic-Industrial Complex (SECTICS) of the Ministry of Health of Brazil, and the National Council for Scientific and Technological Development (CNPq) (grant no. 444588/2023-0).

References

Allaire, J. J., Teague, C., Xie, Y., & Dervieux, C. (n.d.). Quarto [Computer software]. Zenodo. https://doi.org/10.5281/ZENODO.5960048

Iizumi, T. (2019). Global dataset of historical yields v1.2 and v1.3 aligned version [Data set]. PANGAEA. https://doi.org/10.1594/PANGAEA.909132

Iizumi, T., & Sakai, T. (2020). The global dataset of historical yields for major crops 1981–2016. Scientific Data, 7(1), 97. https://doi.org/10.1038/s41597-020-0433-7

R Core Team. (n.d.). R: A language and environment for statistical computing [Computer software]. R Foundation for Statistical Computing. https://www.R-project.org

Wickham, H. (2023). The tidy tools manifesto. Tidyverse. https://tidyverse.tidyverse.org/articles/manifesto.html

Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for data science: Import, tidy, transform, visualize, and model data (2nd ed.). O’Reilly Media. https://r4ds.hadley.nz