A Reproducible Pipeline for Processing DATASUS Annual Population Estimates by Municipality, Age, and Sex in Brazil

Authors

Daniel Vartanian & Aline Martins de Carvalho

Published

December 15, 2025

Project Status: Active. The project has reached a stable, usable state and is being actively developed. Licenses: GPLv3 (code) and CC BY-NC-SA 4.0 (report).

Overview

This report provides a reproducible pipeline for processing DATASUS annual population estimates by municipality, age, and sex in Brazil.

For instructions on how to run the pipeline, see the repository README.

Problem

Accurate population estimates are fundamental for public health planning, resource allocation, and demographic research. The Department of Informatics of the Brazilian Unified Health System (DATASUS) publishes annual population estimates by municipality, age, and sex for Brazil. While these datasets are comprehensive, they require some processing to be directly usable for analysis in R. This pipeline provides a reproducible workflow to download and structure the data, enabling efficient and reliable downstream analyses.

Data Availability

The processed data are available in csv, rds, and parquet formats in a dedicated repository on the Open Science Framework (OSF). Each dataset is accompanied by a metadata file describing its structure and contents.

You can also retrieve these files directly from R using the osfr package.
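
A minimal sketch of that retrieval is shown below. The helper name `download_osf_files()` and the node identifier `"abcde"` are placeholders introduced for illustration, not the project's real OSF identifier; only `osf_retrieve_node()`, `osf_ls_files()`, and `osf_download()` come from osfr.

```r
# Hypothetical helper: download every file in an OSF node whose name
# matches a regular expression. `node_id` must be replaced with the
# identifier of the actual OSF repository.
download_osf_files <- function(
  node_id,
  pattern = "\\.parquet$",
  dir = tempdir()
) {
  if (!requireNamespace("osfr", quietly = TRUE)) {
    stop("Package 'osfr' is required for this helper.")
  }

  node <- osfr::osf_retrieve_node(node_id)

  # List all files in the node (the default is only the first 10).
  files <- osfr::osf_ls_files(node, n_max = Inf)

  # Keep only the files whose names match the regular expression.
  files <- files[grepl(pattern, files$name), ]

  osfr::osf_download(files, path = dir, conflicts = "overwrite")
}

# Usage (needs internet access and a valid node identifier):
# download_osf_files("abcde", pattern = "2025\\.parquet$")
```

The function only defines the steps; nothing is downloaded until it is called with a real identifier.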

Methods

Source of Data

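The data used in this report come from the DATASUS/IBGE study of population estimates by municipality, age, and sex (Comitê de Gestão de Indicadores et al., n.d.).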

Data Munging

The data munging follows the data science workflow outlined by Wickham et al. (2023), as illustrated in Figure 1. All processing was done with the Quarto publishing system (Allaire et al., n.d.), the R programming language (R Core Team, n.d.), and several R packages.

For data manipulation and workflow, priority was given to packages from the tidyverse, rOpenSci and r-spatial ecosystems, as well as other packages adhering to the tidy tools manifesto (Wickham, 2023).

Figure 1: Data science workflow created by Wickham, Çetinkaya-Rundel, and Grolemund.

Source: Reproduced from Wickham et al. (2023).

Code Style

The tidyverse's tidy tools manifesto (Wickham, 2023), code style guide (Wickham, n.d.-a), and design principles (Wickham, n.d.-b) were followed to keep the code consistent and readable.

Reproducibility

The pipeline is fully reproducible and can be run again at any time. To ensure consistent results, the renv package (Ushey & Wickham, n.d.) is used to manage and restore the R environment. See the README file in the code repository to learn how to run it.

Set Environment

Load Packages
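
The package list is not shown in this version of the report. Judging only from the functions called in the sections below, the pipeline appears to rely on at least the packages in the following sketch. The mapping from functions to packages is an inference: `read.dbf()` is assumed to come from foreign, `write_parquet()` could come from arrow or nanoparquet, and the package providing `closest_geobr_year()` is not identified here. The lockfile managed by renv in the code repository remains the authoritative list.

```r
# Packages inferred from the function calls in this report (an assumption,
# not the authors' official list).
inferred_packages <- c(
  "here", "fs", "stringr", "httr2", "foreign", "dplyr", "tidyr",
  "janitor", "labelled", "readr", "arrow", "ggplot2", "sf",
  "geobr", "ggspatial", "brandr"
)

# Check which of them can be loaded, without attaching anything.
is_installed <- vapply(
  inferred_packages,
  requireNamespace,
  logical(1),
  quietly = TRUE
)

# Packages that would still need to be installed.
missing_packages <- inferred_packages[!is_installed]
missing_packages
```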

Set Data Directories

raw_data_dir <- here("data-raw")
data_dir <- here("data")
for (i in c(raw_data_dir, data_dir)) {
  if (!dir_exists(i)) dir_create(i, recurse = TRUE)
}

Set Initial Variables

The year variable represents the year of the consolidated DATASUS dataset.

year <- 2025

Download Data

See the Source of Data section for more information.

Get Data

file <-
  "POPSBR" |>
  paste0(str_sub(year, 3, 4), ".zip")
"ftp.datasus.gov.br" |>
  path(
    "dissemin",
    "publicos",
    "IBGE",
    "POPSVS",
    file
  ) |>
  request() |>
  req_progress() |>
  req_perform(here(raw_data_dir, file))
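
The archive name above is the prefix "POPSBR" followed by the last two digits of the year. The sketch below reproduces that naming rule in base R so it can be checked without a network connection. The helper names are hypothetical, and writing the scheme out as "ftp://" is an assumption (the original code passes the host without a scheme).

```r
# Hypothetical helper: build the archive name for a four-digit year,
# following the "POPSBR" + two-digit-year pattern used above.
build_datasus_file_name <- function(year) {
  stopifnot(
    is.numeric(year),
    length(year) == 1,
    year >= 1000,
    year <= 9999
  )

  sprintf("POPSBR%02d.zip", as.integer(year) %% 100L)
}

# Hypothetical helper: build the full address of the archive.
# Assumption: the scheme is written out explicitly as "ftp://".
build_datasus_url <- function(year) {
  paste(
    "ftp://ftp.datasus.gov.br",
    "dissemin", "publicos", "IBGE", "POPSVS",
    build_datasus_file_name(year),
    sep = "/"
  )
}

build_datasus_file_name(2025)
#> [1] "POPSBR25.zip"

build_datasus_url(2025)
#> [1] "ftp://ftp.datasus.gov.br/dissemin/publicos/IBGE/POPSVS/POPSBR25.zip"
```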

Unzip Data

here(raw_data_dir, file) |>
  unzip(exdir = raw_data_dir)
file <-
  raw_data_dir |>
  dir_ls(
    type = "file",
    regexp = paste0(str_sub(year, 3, 4), "\\.dbf$")
  ) |>
  basename()

Delete Zip Files

raw_data_dir |>
  dir_ls(
    type = "file",
    regexp = "\\.zip$"
  ) |>
  file_delete()

Import Data

data <-
  raw_data_dir |>
  here(file) |>
  read.dbf() |>
  as_tibble()
data |> glimpse()
#> Rows: 902,502
#> Columns: 5
#> $ cod_mun <fct> 1100015, 1100023, 1100031, 1100049, 1100056, 1100064, 11000…
#> $ ano     <fct> 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2025,…
#> $ sexo    <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ idade   <fct> 000, 000, 000, 000, 000, 000, 000, 000, 000, 000, 000, 000,…
#> $ pop     <int> 148, 732, 31, 640, 112, 86, 48, 103, 221, 362, 368, 934, 24…

Tidy Data

Rename Columns

data <-
  data |>
  clean_names() |>
  rename(
    municipality_code = cod_mun,
    year = ano,
    sex = sexo,
    age = idade,
    population = pop
  )
data |> glimpse()
#> Rows: 902,502
#> Columns: 5
#> $ municipality_code <fct> 1100015, 1100023, 1100031, 1100049, 1100056, 1100…
#> $ year              <fct> 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2…
#> $ sex               <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ age               <fct> 000, 000, 000, 000, 000, 000, 000, 000, 000, 000,…
#> $ population        <int> 148, 732, 31, 640, 112, 86, 48, 103, 221, 362, 36…

Standardize Columns

data <-
  data |>
  mutate(
    across(
      .cols = where(is.factor),
      .fns = \(x) x |> as.character() |> as.integer()
    )
  ) |>
  mutate(
    sex = sex |>
      factor(
        levels = 1:2,
        labels = c("male", "female"),
        ordered = FALSE
      )
  )
data |> glimpse()
#> Rows: 902,502
#> Columns: 5
#> $ municipality_code <int> 1100015, 1100023, 1100031, 1100049, 1100056, 1100…
#> $ year              <int> 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2…
#> $ sex               <fct> male, male, male, male, male, male, male, male, m…
#> $ age               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ population        <int> 148, 732, 31, 640, 112, 86, 48, 103, 221, 362, 36…
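
The conversion above goes through `as.character()` before `as.integer()` for a reason: applied directly to a factor, `as.integer()` returns the internal level positions rather than the values shown as labels. A small illustration with made-up age codes:

```r
# Age codes as they arrive from the .dbf file: a factor with zero-padded labels.
age <- factor(c("000", "010", "080"))

# Direct conversion returns the positions of the levels, not the ages.
as.integer(age)
#> [1] 1 2 3

# Converting to character first recovers the intended values.
as.integer(as.character(age))
#> [1]  0 10 80
```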

Relocate Columns

data <- data |> relocate(year)
data |> glimpse()
#> Rows: 902,502
#> Columns: 5
#> $ year              <int> 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2…
#> $ municipality_code <int> 1100015, 1100023, 1100031, 1100049, 1100056, 1100…
#> $ sex               <fct> male, male, male, male, male, male, male, male, m…
#> $ age               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ population        <int> 148, 732, 31, 640, 112, 86, 48, 103, 221, 362, 36…

Arrange Data

data <-
  data |>
  arrange(
    year,
    municipality_code,
    sex,
    age
  )
data |> glimpse()
#> Rows: 902,502
#> Columns: 5
#> $ year              <int> 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2…
#> $ municipality_code <int> 1100015, 1100015, 1100015, 1100015, 1100015, 1100…
#> $ sex               <fct> male, male, male, male, male, male, male, male, m…
#> $ age               <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
#> $ population        <int> 148, 150, 154, 159, 163, 169, 177, 177, 172, 174,…

Create Data Dictionary

Prepare Metadata

metadata <-
  data |>
  `var_label<-`(
    list(
      year = "Year of the population estimate",
      municipality_code = "IBGE municipality code",
      sex = "Sex of the population",
      age = "Age of the population",
      population = "Population estimate"
    )
  ) |>
  generate_dictionary(details = "full") |>
  convert_list_columns_to_character()

Visualize Final Data

metadata |> glimpse()
#> Rows: 5
#> Columns: 14
#> $ pos           <int> 1, 2, 3, 4, 5
#> $ variable      <chr> "year", "municipality_code", "sex", "age", "populatio…
#> $ label         <chr> "Year of the population estimate", "IBGE municipality…
#> $ col_type      <chr> "int", "int", "fct", "int", "int"
#> $ missing       <int> 0, 0, 0, 0, 0
#> $ levels        <chr> "", "", "male; female", "", ""
#> $ value_labels  <chr> "", "", "", "", ""
#> $ class         <chr> "integer", "integer", "factor", "integer", "integer"
#> $ type          <chr> "integer", "integer", "integer", "integer", "integer"
#> $ na_values     <chr> "", "", "", "", ""
#> $ na_range      <chr> "", "", "", "", ""
#> $ n_na          <int> 0, 0, 0, 0, 0
#> $ unique_values <int> 1, 5571, 2, 81, 7982
#> $ range         <chr> "2025 - 2025", "1100015 - 5300108", "", "0 - 80", "1 …
metadata
data |> glimpse()
#> Rows: 902,502
#> Columns: 5
#> $ year              <int> 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2…
#> $ municipality_code <int> 1100015, 1100015, 1100015, 1100015, 1100015, 1100…
#> $ sex               <fct> male, male, male, male, male, male, male, male, m…
#> $ age               <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
#> $ population        <int> 148, 150, 154, 159, 163, 169, 177, 177, 172, 174,…
data
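
The counts in the data dictionary allow a quick consistency check: 5,571 municipality codes, 2 sexes, and 81 ages multiply to exactly the 902,502 rows reported, which is what one row per combination would produce. The sketch below does the arithmetic and shows one way such a check could be automated; `is_complete_grid()` is a hypothetical helper and the `toy` table is illustrative data, not part of the pipeline.

```r
# Counts reported in the data dictionary above.
n_municipalities <- 5571
n_sexes <- 2
n_ages <- 81

n_municipalities * n_sexes * n_ages
#> [1] 902502

# Hypothetical helper: TRUE when every combination of the key columns
# appears exactly once in the data frame.
is_complete_grid <- function(data, keys) {
  n_expected <- prod(
    vapply(data[keys], function(x) length(unique(x)), integer(1))
  )

  n_expected == nrow(data) && !anyDuplicated(data[keys])
}

# Toy table with 2 municipalities x 2 sexes x 3 ages = 12 rows.
toy <- expand.grid(
  municipality_code = c(1100015L, 1100023L),
  sex = factor(c("male", "female")),
  age = 0:2
)

is_complete_grid(toy, c("municipality_code", "sex", "age"))
#> [1] TRUE

# Dropping one row breaks the grid.
is_complete_grid(toy[-1, ], c("municipality_code", "sex", "age"))
#> [1] FALSE
```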

Save Data

The processed data are available in csv, rds and parquet formats through a dedicated repository on the Open Science Framework (OSF). See the Data Availability section for more information.

Write Data

valid_file_pattern <- year
data |>
  write_csv(
    here(data_dir, paste0(valid_file_pattern, ".csv"))
  )
data |>
  write_rds(
    here(data_dir, paste0(valid_file_pattern, ".rds"))
  )
data |>
  write_parquet(
    here(data_dir, paste0(valid_file_pattern, ".parquet"))
  )
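
The choice of format matters when the files are read back. As a rough illustration, the sketch below uses the base R counterparts of the writers above (`saveRDS()` and `write.csv()`) on a toy table with made-up values: the rds file restores the `sex` factor as written, while the csv file returns it as plain text that has to be converted again. It assumes R 4.0 or later, where `read.csv()` does not create factors by default.

```r
# Toy table mimicking the structure of the processed data.
toy <- data.frame(
  year = 2025L,
  municipality_code = c(1100015L, 1100015L),
  sex = factor(c("male", "female"), levels = c("male", "female")),
  age = 0L,
  population = c(148L, 140L)
)

rds_file <- tempfile(fileext = ".rds")
csv_file <- tempfile(fileext = ".csv")

saveRDS(toy, rds_file)
write.csv(toy, csv_file, row.names = FALSE)

from_rds <- readRDS(rds_file)
from_csv <- read.csv(csv_file)

# The rds file keeps the column classes.
class(from_rds$sex)
#> [1] "factor"

# The csv file stores text only, so the factor comes back as character.
class(from_csv$sex)
#> [1] "character"

# Restore the factor, with its level order, after reading the csv file.
from_csv$sex <- factor(from_csv$sex, levels = c("male", "female"))
```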

Write Metadata

metadata_file_pattern <-
  "metadata-" |>
  paste0(year)
metadata |>
  write_csv(
    here(data_dir, paste0(metadata_file_pattern, ".csv"))
  )
metadata |>
  write_rds(
    here(data_dir, paste0(metadata_file_pattern, ".rds"))
  )
metadata |>
  write_parquet(
    here(data_dir, paste0(metadata_file_pattern, ".parquet"))
  )

Visualize Data

Plot Histogram by Municipality

data |>
  summarize(
    population = sum(population, na.rm = TRUE),
    .by = "municipality_code"
  ) |>
  ggplot(aes(x = population)) +
  geom_histogram(
    aes(y = after_stat(density)),
    bins = 30,
    fill = get_brand_color("gray-d25"),
    color = get_brand_color("white")
  ) +
  geom_density(
    color = "red",
    linewidth = 1
  ) +
  scale_x_continuous(
    transform = "log10"
  ) +
  labs(
    title = "Population Estimates by Municipality in Brazil",
    subtitle = paste0("Year: ", year),
    x = "Population Estimate",
    y = "Density",
    caption = "Source: DATASUS/IBGE."
  )

Plot Map by Municipality

Set Shape

shape <-
  read_municipality(
    year = year |>
      closest_geobr_year(type = "municipality"),
    showProgress = FALSE
  ) |>
  st_transform(st_crs(4326))
#> ! The closest map year to 2025 is 2024. Using year 2024 instead.
#> Using year/date 2024

Prepare Plot Data

plot_data <-
  data |>
  summarize(
    population = sum(population, na.rm = TRUE),
    .by = "municipality_code"
  ) |>
  left_join(
    shape,
    by = join_by(municipality_code == code_muni)
  ) |>
  rename(geometry = geom) |>
  select(
    municipality_code,
    population,
    geometry
  ) |>
  drop_na(population)
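
Because the shapes come from the closest available map year rather than the data year, it can be worth checking that the municipality codes on both sides of the join agree. The sketch below uses toy vectors in place of the real tables; the code values and the assumption that the shape codes are stored as numeric are illustrative only.

```r
# Toy stand-ins for the codes in the population table and in the shapes.
population_codes <- c(1100015L, 1100023L, 1100031L)
shape_codes <- c(1100015, 1100023, 1100049) # assumed to be numeric

# Codes with population data but no matching shape. After a left join
# these rows would have no geometry.
setdiff(population_codes, shape_codes)
#> [1] 1100031

# Codes with a shape but no population data. A left join drops these.
setdiff(shape_codes, population_codes)
#> [1] 1100049
```

Note that `drop_na(population)` filters on the population column only, so rows without a matching shape would remain in the plot data.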

Plot Data

brand_div_palette <- function(x) {
  brandr:::make_color_ramp(
    n = x,
    colors = c(
      get_brand_color("dark-red"),
      # get_brand_color("white"),
      get_brand_color_mix(
        position = 950,
        color_1 = "dark-red",
        color_2 = "dark-red-triadic-blue",
        alpha = 0.5
      ),
      get_brand_color("dark-red-triadic-blue")
    )
  )
}
plot_data |>
  st_as_sf() |>
  ggplot(aes(fill = population)) +
  geom_sf(
    color = get_brand_color("gray"),
    linewidth = 0.05
  ) +
  scale_fill_binned(
    palette = brand_div_palette,
    na.value = get_brand_color("gray-d25"),
    transform = "log10"
  ) +
  annotation_scale(
    aes(),
    location = "br",
    style = "tick",
    height = unit(0.5, "lines")
  ) +
  annotation_north_arrow(
    location = "br",
    height = unit(2, "lines"),
    width = unit(2, "lines"),
    pad_x = unit(0.25, "lines"),
    pad_y = unit(1.25, "lines"),
    style = north_arrow_fancy_orienteering
  ) +
  labs(
    title = "Population Estimates by Municipality in Brazil",
    subtitle = paste0("Year: ", year),
    fill = NULL,
    caption = "Source: DATASUS/IBGE."
  )
#> Scale on map varies by more than 10%, scale bar may be inaccurate

Citation

When using this data, you must also cite the original data sources.

To cite this work, please use the following format:

Vartanian, D., & Carvalho, A. M. (2025). A reproducible pipeline for processing DATASUS annual population estimates by municipality, age, and sex in Brazil [Computer software]. Sustentarea Research and Extension Group, University of São Paulo. https://sustentarea.github.io/population-estimates

A BibLaTeX entry for LaTeX users is:

@software{vartanian2025,
  title = {A reproducible pipeline for processing DATASUS annual population estimates by municipality, age, and sex in Brazil},
  author = {{Daniel Vartanian} and {Aline Martins de Carvalho}},
  year = {2025},
  address = {São Paulo},
  institution = {Sustentarea Research and Extension Group, the University of São Paulo},
  langid = {en},
  url = {https://sustentarea.github.io/population-estimates}
}

License


The original data sources may be subject to their own licensing terms and conditions.

The code in this repository is licensed under the GNU General Public License Version 3, while the report is available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Copyright (C) 2025 Sustentarea Research and Extension Group

The code in this report is free software: you can redistribute it and/or
modify it under the terms of the GNU General Public License as published by the
Free Software Foundation, either version 3 of the License, or (at your option)
any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with
this program. If not, see <https://www.gnu.org/licenses/>.

Acknowledgments


This work is part of a research project by the Sustentarea Research and Extension Group of the University of São Paulo (USP) titled: Global syndemic: The impact of anthropogenic climate change on the health and nutrition of children under five years old attended by Brazil's public health system.

This work was supported by the Department of Science and Technology of the Secretariat of Science, Technology, and Innovation and of the Health Economic-Industrial Complex (SECTICS) of the Ministry of Health of Brazil, and the National Council for Scientific and Technological Development (CNPq) (grant no. 444588/2023-0).

References

Allaire, J. J., Teague, C., Xie, Y., & Dervieux, C. (n.d.). Quarto [Computer software]. Zenodo. https://doi.org/10.5281/ZENODO.5960048
Comitê de Gestão de Indicadores, Rede Interagencial de Informações para a Saúde, Coordenação-Geral de Informações e Análises Epidemiológicas, Secretaria de Vigilância em Saúde e Ambiente, Ministério da Saúde, & Instituto Brasileiro de Geografia e Estatística. (n.d.). População residente – Estudo de estimativas populacionais por município, idade e sexo 2000-2024 – Brasil [Resident population – Study of population estimates by municipality, age, and sex, 2000–2024 – Brazil] [Data set]. DATASUS - Tabnet. Retrieved November 16, 2023, from http://tabnet.datasus.gov.br/cgi/deftohtm.exe?ibge/cnv/popsvs2024br.def
R Core Team. (n.d.). R: A language and environment for statistical computing [Computer software]. R Foundation for Statistical Computing. https://www.R-project.org
Ushey, K., & Wickham, H. (n.d.). renv: Project environments [Computer software]. https://doi.org/10.32614/CRAN.package.renv
Wickham, H. (n.d.-a). The tidyverse style guide. Retrieved July 17, 2023, from https://style.tidyverse.org
Wickham, H. (n.d.-b). Tidy design principles. https://design.tidyverse.org
Wickham, H. (2023). The tidy tools manifesto. Tidyverse. https://tidyverse.tidyverse.org/articles/manifesto.html
Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for data science: Import, tidy, transform, visualize, and model data (2nd ed.). O’Reilly Media. https://r4ds.hadley.nz