A Reproducible Pipeline for Processing DATASUS Annual Population Estimates by Municipality, Age, and Sex in Brazil (2000-2024)

Daniel Vartanian & Aline Martins de Carvalho

Author

Daniel Vartanian & Aline Martins de Carvalho

Published

2025-05-05

Overview

This report provides a reproducible pipeline for processing DATASUS annual population estimates by municipality, age, and sex in Brazil (2000-2024). The main goal is to provide an open and reliable workflow for processing these data, supporting research and informed public policy decisions.

Data Availability

The processed data are available in both csv and rds formats via a dedicated repository on the Open Science Framework (OSF), accessible here. A metadata file is included alongside the validated data. You can also access these files directly from R using the osfr package.
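As a rough sketch of the osfr route, the function below lists the files in a public OSF project and downloads those whose names match a pattern. The helper name `download_osf_files()`, its arguments, and the `"valid-2024"` pattern in the usage comment are illustrative assumptions. The OSF project identifier is not stated in this report, so it is left as a placeholder for the reader to supply.

```r
# Hypothetical helper: download files from a public OSF project.
# `osf_id` is the project's OSF identifier (not given in this report).
download_osf_files <- function(osf_id, pattern = NULL, dir = tempdir()) {
  osf_id |>
    osfr::osf_retrieve_node() |>
    osfr::osf_ls_files(pattern = pattern) |>
    osfr::osf_download(path = dir, conflicts = "overwrite")
}

# Usage (requires network access and the real project identifier):
# download_osf_files("<osf-id>", pattern = "valid-2024")
```

The function returns the table produced by `osfr::osf_download()`, which includes the local paths of the downloaded files.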

A backup copy of the raw data is also available on OSF. You can access it here.

Methods

Source of Data

The data used in this report were sourced from the Department of Informatics of the Brazilian Unified Health System (DATASUS) platform, which provides annual population estimates for Brazil by municipality, age, and sex for the period 2000–2024 (Comitê de Gestão de Indicadores et al., n.d.). These estimates are produced using data from the Brazilian Institute of Geography and Statistics (IBGE).

For technical information about the raw dataset, see the official technical note (in Portuguese).

Data Munging

The data munging followed the data science workflow outlined by Wickham et al. (2023), as illustrated in Figure 1. All processing was done using the Quarto publishing system (Allaire et al., n.d.), the R programming language (R Core Team, n.d.), and several R packages.

Priority was given to packages from the tidyverse and the rOpenSci peer-reviewed ecosystem, along with other R packages that adhere to the tidy tools manifesto (Wickham, 2023). Every step was designed to make the results transparent and reproducible.

Figure 1: Data science workflow created by Wickham, Çetinkaya-Rundel, and Grolemund.

Code Style

The Tidyverse code style guide and design principles were followed to ensure consistency and enhance readability.

Reproducibility

The pipeline is fully reproducible and can be run again at any time. See the README file in the code repository to learn how to run it.

Setting the Environment

library(brandr)
library(cli)
library(dplyr)
library(foreign)
library(fs)
library(ggplot2)
library(here)
library(httr2)
library(janitor)
library(labelled)
library(orbis) # github.com/danielvartan/orbis
library(osfr)
library(plotr) # github.com/danielvartan/plotr
library(readr)
library(utils)

Setting the Initial Variables

year <- 2024

Downloading the Data

# Create the data directory at the project root if it does not exist yet.
if (!dir.exists(here::here("data"))) dir.create(here::here("data"))

raw_file_pattern <- paste0("raw-", year)

file <- here::here("data", paste0(raw_file_pattern, ".zip"))

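# Build the download URL from the last two digits of the year
# (e.g., "24" for 2024) and save the response body directly to `file`.
paste0(
  "ftp.datasus.gov.br/dissemin/publicos/IBGE/POPSVS/POPSBR",
  year |> stringr::str_sub(start = -2),
  ".zip"
) |>
  httr2::request() |>
  httr2::req_progress() |>
  httr2::req_perform(path = file)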
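
The URL above ends in the last two digits of the year, so the same pattern should cover every year in the series. A minimal base R sketch of that mapping follows. `build_popsvs_url()` is a hypothetical helper, and it assumes that the archives for all years follow the same `POPSBR<yy>.zip` naming as the 2024 file.

```r
# Hypothetical helper: build the archive URL for one or more years.
build_popsvs_url <- function(year) {
  # Keep the last two digits of the year, zero-padded (2000 -> "00").
  suffix <- sprintf("%02d", as.integer(year) %% 100L)

  paste0(
    "ftp.datasus.gov.br/dissemin/publicos/IBGE/POPSVS/POPSBR",
    suffix,
    ".zip"
  )
}

build_popsvs_url(2024)
build_popsvs_url(2000:2002)
```

Because `sprintf()` is vectorized, passing a range of years returns one URL per year, which could then be iterated over to download the full series.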

Unzipping the Data

file <-
  file |>
  utils::unzip(exdir = here::here("data"), overwrite = TRUE)

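# Rename the extracted file after the year. Note that the file is in DBF
# format (it is read with `foreign::read.dbf()` below), despite the `.csv`
# extension used here.
file <-
  file |>
  fs::file_move(here::here("data", paste0(raw_file_pattern, ".csv")))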

fs::file_delete(here::here("data", paste0(raw_file_pattern, ".zip")))

Reading the Data

data <-
  file |>
  foreign::read.dbf() |>
  dplyr::as_tibble()

data |> dplyr::glimpse()
#> Rows: 902,340
#> Columns: 5
#> $ COD_MUN <fct> 1100015, 1100015, 1100015, 1100015, 1100015, 1100015, 11000…
#> $ ANO     <fct> 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024,…
#> $ SEXO    <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ IDADE   <fct> 000, 001, 002, 003, 004, 005, 006, 007, 008, 009, 010, 011,…
#> $ POP     <int> 158, 159, 161, 163, 168, 176, 176, 172, 174, 178, 176, 174,…
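
The identifier columns arrive as factors because `foreign::read.dbf()` converts character fields to factors unless `as.is = TRUE` is set. The sketch below illustrates this with a small toy table written to a temporary DBF file. The table is an illustration only and is not read from the DATASUS archive.

```r
# Toy table with the same column names as the raw file.
toy <- data.frame(
  COD_MUN = c("1100015", "1100015"),
  IDADE = c("000", "001"),
  POP = c(158L, 159L),
  stringsAsFactors = FALSE
)

path <- tempfile(fileext = ".dbf")
foreign::write.dbf(toy, path)

# Default behavior: character fields are returned as factors.
as_factors <- foreign::read.dbf(path)

# With `as.is = TRUE`, character fields stay as character vectors.
as_characters <- foreign::read.dbf(path, as.is = TRUE)

class(as_factors$IDADE)
class(as_characters$IDADE)
```

Either route works for this pipeline, since the tidying step converts these columns to integers anyway.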

Renaming the Data

data <-
  data |>
  janitor::clean_names() |>
  dplyr::rename(
    municipality_code = cod_mun,
    year = ano,
    sex = sexo,
    age = idade,
    n = pop
  )

data |> dplyr::glimpse()
#> Rows: 902,340
#> Columns: 5
#> $ municipality_code <fct> 1100015, 1100015, 1100015, 1100015, 1100015, 1100…
#> $ year              <fct> 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2…
#> $ sex               <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ age               <fct> 000, 001, 002, 003, 004, 005, 006, 007, 008, 009,…
#> $ n                 <int> 158, 159, 161, 163, 168, 176, 176, 172, 174, 178,…

Tidying the Data

data <-
  data |>
  dplyr::mutate(
    dplyr::across(
      .cols = where(is.factor),
      .fns = ~ .x |> as.character() |> as.integer()
    )
  ) |>
  dplyr::mutate(
    sex = factor(
      sex,
      levels = 1:2,
      labels = c("male", "female"),
      ordered = FALSE
    )
  )

data |> dplyr::glimpse()
#> Rows: 902,340
#> Columns: 5
#> $ municipality_code <int> 1100015, 1100015, 1100015, 1100015, 1100015, 1100…
#> $ year              <int> 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2…
#> $ sex               <fct> male, male, male, male, male, male, male, male, m…
#> $ age               <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
#> $ n                 <int> 158, 159, 161, 163, 168, 176, 176, 172, 174, 178,…
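
The intermediate `as.character()` call in the conversion above matters. Calling `as.integer()` directly on a factor returns its internal level codes rather than the numbers shown in its labels. A small sketch with made-up age labels:

```r
age <- factor(c("000", "001", "010"))

# Direct conversion returns the internal level codes, not the ages.
as.integer(age)
#> [1] 1 2 3

# Converting to character first recovers the values encoded in the labels.
as.integer(as.character(age))
#> [1]  0  1 10
```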

Adding State and Region Data

brazil_municipalities <- orbis::get_brazil_municipality(
  year = plotr:::get_closest_geobr_year(year, type = "municipality")
)
#> ! The closest map year to 2024 is 2022. Using year 2022 instead.

data <-
  data |>
  dplyr::left_join(
    brazil_municipalities,
    by = "municipality_code"
  ) |>
  dplyr::relocate(
    year,
    region_code,
    region,
    state_code,
    state,
    federal_unit,
    municipality_code,
    municipality
  )

data |> dplyr::glimpse()
#> Rows: 902,340
#> Columns: 11
#> $ year              <int> 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2…
#> $ region_code       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ region            <chr> "North", "North", "North", "North", "North", "Nor…
#> $ state_code        <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ state             <chr> "Rondônia", "Rondônia", "Rondônia", "Rondônia", "…
#> $ federal_unit      <chr> "RO", "RO", "RO", "RO", "RO", "RO", "RO", "RO", "…
#> $ municipality_code <int> 1100015, 1100015, 1100015, 1100015, 1100015, 1100…
#> $ municipality      <chr> "Alta Floresta D'Oeste", "Alta Floresta D'Oeste",…
#> $ sex               <fct> male, male, male, male, male, male, male, male, m…
#> $ age               <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
#> $ n                 <int> 158, 159, 161, 163, 168, 176, 176, 172, 174, 178,…
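
Because the municipality table comes from 2022 while the estimates are for 2024, it may be worth checking that every municipality code found a match in the join. The sketch below shows one way to do this with `dplyr::anti_join()` on toy tables. The object names and the unmatched code `9999999` are illustrative only and are not part of the original pipeline.

```r
estimates <- dplyr::tibble(
  municipality_code = c(1100015L, 1100023L, 9999999L),
  n = c(158L, 200L, 10L)
)

lookup <- dplyr::tibble(
  municipality_code = c(1100015L, 1100023L),
  state = c("Rondônia", "Rondônia")
)

# Rows of `estimates` with no counterpart in `lookup`.
unmatched <-
  estimates |>
  dplyr::anti_join(lookup, by = "municipality_code")

unmatched
```

An empty result means that every code was matched. Any rows returned would end up with missing region, state, and municipality names after a left join.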

Arranging the Data

data <-
  data |>
  dplyr::arrange(
    year,
    region_code,
    state_code,
    municipality_code,
    sex,
    age
  )

data |> dplyr::glimpse()
#> Rows: 902,340
#> Columns: 11
#> $ year              <int> 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2…
#> $ region_code       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ region            <chr> "North", "North", "North", "North", "North", "Nor…
#> $ state_code        <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ state             <chr> "Rondônia", "Rondônia", "Rondônia", "Rondônia", "…
#> $ federal_unit      <chr> "RO", "RO", "RO", "RO", "RO", "RO", "RO", "RO", "…
#> $ municipality_code <int> 1100015, 1100015, 1100015, 1100015, 1100015, 1100…
#> $ municipality      <chr> "Alta Floresta D'Oeste", "Alta Floresta D'Oeste",…
#> $ sex               <fct> male, male, male, male, male, male, male, male, m…
#> $ age               <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
#> $ n                 <int> 158, 159, 161, 163, 168, 176, 176, 172, 174, 178,…
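
A quick sanity check at this point is that each combination of year, municipality, sex, and age appears exactly once. The sketch below runs that check on a small toy table. It is an illustration and not a step of the original pipeline.

```r
toy <- dplyr::tibble(
  year = 2024L,
  municipality_code = rep(c(1100015L, 1100023L), each = 4),
  sex = factor(rep(rep(c("male", "female"), each = 2), times = 2)),
  age = rep(0:1, times = 4),
  n = c(158L, 159L, 150L, 152L, 700L, 710L, 690L, 695L)
)

# Count rows per key; any count above 1 signals a duplicate.
duplicates <-
  toy |>
  dplyr::count(year, municipality_code, sex, age, name = "rows") |>
  dplyr::filter(rows > 1)

nrow(duplicates)
#> [1] 0
```

As a side note, 902,340 equals 5,570 × 2 × 81, which would be consistent with a complete grid of 5,570 municipalities, two sexes, and 81 age values, although the report does not state this breakdown.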

Data Dictionary

metadata <-
  data |>
  labelled::set_variable_labels(
    .labels = list(
      year = "Year of the population estimate",
      region_code = "IBGE region code",
      region = "Region name",
      state_code = "IBGE state code",
      state = "State name",
      federal_unit = "Federal unit name",
      municipality_code = "IBGE municipality code",
      municipality = "Municipality name",
      sex = "Sex of the population",
      age = "Age of the population",
      n = "Population estimate"
    )
  ) |>
  labelled::generate_dictionary(details = "full") |>
  labelled::convert_list_columns_to_character()

metadata

data

Saving the Valid Data

Data

valid_file_pattern <- paste0("valid-", year)

data |>
  readr::write_csv(
    here::here("data", paste0(valid_file_pattern, ".csv"))
  )

data |>
  readr::write_rds(
    here::here("data", paste0(valid_file_pattern, ".rds"))
  )
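
The two formats behave differently when read back. An rds file preserves R column types such as factors, while a csv file stores plain text, so the `sex` factor comes back as a character column unless it is re-declared. A minimal sketch with a toy table and temporary files:

```r
toy <- dplyr::tibble(
  sex = factor(c("male", "female"), levels = c("male", "female")),
  n = c(158L, 150L)
)

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

readr::write_csv(toy, csv_path)
readr::write_rds(toy, rds_path)

from_csv <- readr::read_csv(csv_path, show_col_types = FALSE)
from_rds <- readr::read_rds(rds_path)

class(from_csv$sex)
#> [1] "character"
class(from_rds$sex)
#> [1] "factor"
```

When working from the csv file, the factor can be restored at read time by passing a column specification to `readr::read_csv()`, for example with `readr::col_factor()`.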

Metadata

metadata_file_pattern <- paste0("metadata-", year)

metadata |>
  readr::write_csv(
    here::here("data", paste0(metadata_file_pattern, ".csv"))
  )

metadata |>
  readr::write_rds(
    here::here("data", paste0(metadata_file_pattern, ".rds"))
  )

Visualizing the Data

brand_div_palette <- function(x) {
  brandr:::make_color_ramp(
    n_prop = x,
    colors = c(
      brandr::get_brand_color("dark-red"),
      # brandr::get_brand_color("white"),
      brandr::get_brand_color_mix(
        position = 950,
        color_1 = "dark-red",
        color_2 = "dark-red-triadic-blue",
        alpha = 0.5
      ),
      brandr::get_brand_color("dark-red-triadic-blue")
    )
  )
}

data |>
  dplyr::summarize(
    n = sum(n, na.rm = TRUE),
    .by = c("municipality_code")
  ) |>
  plotr:::plot_hist(
    col = "n",
    density_line_color = "red",
    x_label = "Population estimate",
    print = FALSE
  ) +
  ggplot2::labs(
    title = "Population Estimates by Municipality in Brazil",
    subtitle = paste0("Year: ", year),
    caption = "Source: DATASUS/IBGE"
  )
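
The `dplyr::summarize()` call above collapses the sex and age rows into a single total per municipality before plotting. A small sketch of that aggregation on toy values, assuming dplyr 1.1.0 or later for the `.by` argument:

```r
toy <- dplyr::tibble(
  municipality_code = c(1100015L, 1100015L, 1100023L, 1100023L),
  sex = factor(c("male", "female", "male", "female")),
  n = c(158L, 150L, 700L, 690L)
)

# One row per municipality, with `n` summed across the other columns.
totals <-
  toy |>
  dplyr::summarize(
    n = sum(n, na.rm = TRUE),
    .by = "municipality_code"
  )

totals
```

Adding more column names to `.by` (for instance `"sex"`) would keep those breakdowns in the result instead of summing over them.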

data |>
  dplyr::summarize(
    n = sum(n, na.rm = TRUE),
    .by = c("municipality_code")
  ) |>
  plotr:::plot_brazil_municipality(
    col_fill = "n",
    col_code = "municipality_code",
    year = plotr:::get_closest_geobr_year(year, type = "municipality"),
    comparable_areas = FALSE,
    reverse = FALSE,
    transform = "log10",
    palette = brand_div_palette,
    print = FALSE
  ) +
  ggplot2::labs(
    title = "Population Estimates by Municipality in Brazil",
    subtitle = paste0("Year: ", year),
    caption = "Source: DATASUS/IBGE"
  )
#> ! The closest map year to 2024 is 2022. Using year 2022 instead.
#> Scale on map varies by more than 10%, scale bar may be inaccurate

How to Cite

To cite this work, please use the following format:

Vartanian, D., & Carvalho, A. M. (2025). A reproducible pipeline for processing DATASUS annual population estimates by municipality, age, and sex in Brazil (2000-2024) [Report]. Sustentarea Research and Extension Group at the University of São Paulo. https://sustentarea.github.io/datasus-pop-estimates

A BibTeX entry for LaTeX users is

@techreport{vartanian2025,
  title = {A reproducible pipeline for processing DATASUS annual population estimates by municipality, age, and sex in Brazil (2000-2024)},
  author = {{Daniel Vartanian} and {Aline Martins de Carvalho}},
  year = {2025},
  address = {São Paulo},
  institution = {Sustentarea Research and Extension Group at the University of São Paulo},
  langid = {en},
  url = {https://sustentarea.github.io/datasus-pop-estimates}
}

License

This content is licensed under CC0 1.0 Universal, placing these materials in the public domain. You may freely copy, modify, distribute, and use this work, even for commercial purposes, without permission or attribution.

Acknowledgments

This work is part of the Sustentarea Research and Extension Group project: Global syndemic: The impact of anthropogenic climate change on the health and nutrition of children under five years old attended by Brazil’s public health system (SUS).

This work was supported by the Conselho Nacional de Desenvolvimento Científico e Tecnológico - Brazil (CNPq).

References

Allaire, J. J., Teague, C., Xie, Y., & Dervieux, C. (n.d.). Quarto [Computer software]. Zenodo. https://doi.org/10.5281/ZENODO.5960048

Comitê de Gestão de Indicadores, Rede Interagencial de Informações para a Saúde, Coordenação-Geral de Informações e Análises Epidemiológicas, Secretaria de Vigilância em Saúde e Ambiente, Ministério da Saúde, & Instituto Brasileiro de Geografia e Estatística. (n.d.). População residente – Estudo de estimativas populacionais por município, idade e sexo 2000-2024 – Brasil [Resident population – Study of population estimates by municipality, age, and sex, 2000–2024 – Brazil] [Data set]. DATASUS - Tabnet. Retrieved November 16, 2023, from http://tabnet.datasus.gov.br/cgi/deftohtm.exe?ibge/cnv/popsvs2024br.def

R Core Team. (n.d.). R: A language and environment for statistical computing [Computer software]. R Foundation for Statistical Computing. https://www.R-project.org

Wickham, H. (2023). The tidy tools manifesto. Tidyverse. https://tidyverse.tidyverse.org/articles/manifesto.html

Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for data science: Import, tidy, transform, visualize, and model data (2nd ed.). O’Reilly Media. https://r4ds.hadley.nz