A Reproducible Pipeline for Processing DATASUS Annual Population Estimates by Municipality, Age, and Sex in Brazil (2000-2024)
Overview
This report documents a reproducible pipeline for processing DATASUS annual population estimates by municipality, age, and sex in Brazil (2000-2024). Its main goal is to offer an open and reliable workflow for these data, supporting research and informed public policy decisions.
Data Availability
The processed data are available in both csv and rds formats via a dedicated repository on the Open Science Framework (OSF), accessible here. A metadata file is included alongside the validated data. You can also access these files directly from R using the osfr package.
A backup copy of the raw data is also available on OSF. You can access it here.
Methods
Source of Data
The data used in this report were sourced from the Department of Informatics of the Brazilian Unified Health System (DATASUS) platform, which provides annual population estimates for Brazil by municipality, age, and sex for the period 2000–2024 (Comitê de Gestão de Indicadores et al., n.d.). These estimates are produced using data from the Brazilian Institute of Geography and Statistics (IBGE).
For technical information about the raw dataset, see the official technical note (in Portuguese).
Data Munging
Data munging followed the data science workflow outlined by Wickham et al. (2023), as illustrated in Figure 1. All steps were carried out with the Quarto publishing system (Allaire et al., n.d.), the R programming language (R Core Team, n.d.), and several R packages.
Packages from the tidyverse and the rOpenSci peer-reviewed ecosystem, along with other R packages that adhere to the tidy tools manifesto (Wickham, 2023), were prioritized. Every step was designed to make the results transparent and reproducible.
Figure 1: Data science workflow. Source: Reproduced from Wickham et al. (2023).
Code Style
The Tidyverse code style guide and design principles were followed to ensure consistency and enhance readability.
Reproducibility
The pipeline is fully reproducible and can be run again at any time. See the README file in the code repository to learn how to run it.
Setting the Environment
Setting the Initial Variables
year <- 2024
Downloading the Data
# Create the data directory at the project root if it does not exist.
if (!dir.exists(here::here("data"))) dir.create(here::here("data"))

raw_file_pattern <- paste0("raw-", year)
file <- here::here("data", paste0(raw_file_pattern, ".zip"))

# Build the URL from the last two digits of the year and download the archive.
paste0(
  "ftp.datasus.gov.br/dissemin/publicos/IBGE/POPSVS/POPSBR",
  year |> stringr::str_sub(start = -2),
  ".zip"
) |>
  httr2::request() |>
  httr2::req_progress() |>
  httr2::req_perform(path = file)
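The download URL above depends only on the last two digits of the year. As a minimal sketch, assuming the same file naming pattern holds for every year in the 2000-2024 range (this is not confirmed above), a small helper could build the URL for any year. The function name `build_datasus_url` is hypothetical.

```r
# Hypothetical helper: builds the DATASUS download URL for a given year.
# Assumes the "POPSBR<yy>.zip" naming pattern shown above applies to all years.
build_datasus_url <- function(year) {
  stopifnot(is.numeric(year), length(year) == 1, year >= 2000, year <= 2024)

  paste0(
    "ftp.datasus.gov.br/dissemin/publicos/IBGE/POPSVS/POPSBR",
    sprintf("%02d", as.integer(year) %% 100L), # two-digit year, zero-padded
    ".zip"
  )
}

# One URL per year covered by the report.
urls <- vapply(2000:2024, build_datasus_url, character(1))
```

Using `sprintf("%02d", ...)` keeps the leading zero for the years 2000 to 2009, which a plain numeric conversion would drop.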
Unzipping the Data
fs::file_delete(here::here("data", paste0(raw_file_pattern, ".zip")))
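The extraction call itself does not appear above; only the clean-up of the archive is shown. A minimal sketch of that step, assuming base R's `utils::unzip()` is acceptable and that the archive should be extracted into the directory that holds it (both are assumptions, not the report's confirmed method), could look like this. The helper name `unzip_raw` is hypothetical.

```r
# Hypothetical helper: extracts a zip archive and returns the extracted paths.
unzip_raw <- function(zip_file, exdir = dirname(zip_file)) {
  if (!file.exists(zip_file)) {
    stop("Zip file not found: ", zip_file)
  }

  # With the default internal method, utils::unzip() returns (invisibly)
  # the paths of the files it extracted.
  extracted <- utils::unzip(zip_file, exdir = exdir, overwrite = TRUE)

  extracted
}
```

In the pipeline, such a call would have to run on the downloaded archive before the archive is deleted.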
Reading the Data
data |> dplyr::glimpse()
#> Rows: 902,340
#> Columns: 5
#> $ COD_MUN <fct> 1100015, 1100015, 1100015, 1100015, 1100015, 1100015, 11000…
#> $ ANO <fct> 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024,…
#> $ SEXO <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ IDADE <fct> 000, 001, 002, 003, 004, 005, 006, 007, 008, 009, 010, 011,…
#> $ POP <int> 158, 159, 161, 163, 168, 176, 176, 172, 174, 178, 176, 174,…
Renaming the Data
data <-
  data |>
  janitor::clean_names() |>
  dplyr::rename(
    municipality_code = cod_mun,
    year = ano,
    sex = sexo,
    age = idade,
    n = pop
  )
data |> dplyr::glimpse()
#> Rows: 902,340
#> Columns: 5
#> $ municipality_code <fct> 1100015, 1100015, 1100015, 1100015, 1100015, 1100…
#> $ year <fct> 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2…
#> $ sex <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ age <fct> 000, 001, 002, 003, 004, 005, 006, 007, 008, 009,…
#> $ n <int> 158, 159, 161, 163, 168, 176, 176, 172, 174, 178,…
Tidying the Data
data <-
  data |>
  dplyr::mutate(
    dplyr::across(
      .cols = where(is.factor),
      .fns = ~ .x |> as.character() |> as.integer()
    )
  ) |>
  dplyr::mutate(
    sex = factor(
      sex,
      levels = 1:2,
      labels = c("male", "female"),
      ordered = FALSE
    )
  )
data |> dplyr::glimpse()
#> Rows: 902,340
#> Columns: 5
#> $ municipality_code <int> 1100015, 1100015, 1100015, 1100015, 1100015, 1100…
#> $ year <int> 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2…
#> $ sex <fct> male, male, male, male, male, male, male, male, m…
#> $ age <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
#> $ n <int> 158, 159, 161, 163, 168, 176, 176, 172, 174, 178,…
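The tidying step converts each factor to character before converting it to integer. The short example below, using a made-up factor that mimics the zero-padded `IDADE` values, shows why the intermediate `as.character()` matters: calling `as.integer()` directly on a factor returns its internal level codes rather than the numbers printed as labels.

```r
# A factor whose labels are zero-padded ages, similar to the raw `IDADE` column.
age <- factor(c("010", "000", "001"))

# Direct conversion returns the internal level codes (3, 1, 2), not the ages.
codes <- as.integer(age)

# Converting to character first recovers the numeric values (10, 0, 1).
values <- as.integer(as.character(age))

print(codes)
print(values)
```

The level codes depend on the sorted order of the labels, so the direct conversion can look plausible on sorted data while still being wrong.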
Adding State and Region Data
brazil_municipalities <- orbis::get_brazil_municipality(
  year = plotr:::get_closest_geobr_year(year, type = "municipality")
)
#> ! The closest map year to 2024 is 2022. Using year 2022 instead.
data |> dplyr::glimpse()
#> Rows: 902,340
#> Columns: 11
#> $ year <int> 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2…
#> $ region_code <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ region <chr> "North", "North", "North", "North", "North", "Nor…
#> $ state_code <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ state <chr> "Rondônia", "Rondônia", "Rondônia", "Rondônia", "…
#> $ federal_unit <chr> "RO", "RO", "RO", "RO", "RO", "RO", "RO", "RO", "…
#> $ municipality_code <int> 1100015, 1100015, 1100015, 1100015, 1100015, 1100…
#> $ municipality <chr> "Alta Floresta D'Oeste", "Alta Floresta D'Oeste",…
#> $ sex <fct> male, male, male, male, male, male, male, male, m…
#> $ age <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
#> $ n <int> 158, 159, 161, 163, 168, 176, 176, 172, 174, 178,…
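The step that attaches the municipality lookup table to the estimates is not shown above. A minimal sketch with toy data, assuming the lookup table has one row per municipality and shares a `municipality_code` key with the estimates, could use a left join. The column names and the population values for the second municipality are illustrative only and do not represent the actual output of `orbis::get_brazil_municipality()`.

```r
# Toy estimates: a few rows keyed by IBGE municipality code.
estimates <- data.frame(
  municipality_code = c(1100015L, 1100015L, 3304557L),
  sex = factor(c("male", "female", "male"), levels = c("male", "female")),
  age = c(0L, 0L, 0L),
  n = c(158L, 150L, 40000L) # illustrative values
)

# Toy lookup table: one row per municipality (assumed structure).
lookup <- data.frame(
  municipality_code = c(1100015L, 3304557L),
  municipality = c("Alta Floresta D'Oeste", "Rio de Janeiro"),
  state_code = c(11L, 33L),
  federal_unit = c("RO", "RJ"),
  region = c("North", "Southeast")
)

# Left join: keep every estimate row and add the matching lookup columns.
joined <- merge(
  estimates,
  lookup,
  by = "municipality_code",
  all.x = TRUE,
  sort = FALSE
)

# A left join on a unique key must not add or drop rows,
# and no municipality code should be left unmatched.
stopifnot(nrow(joined) == nrow(estimates))
stopifnot(!anyNA(joined$municipality))
```

Checking the row count after the join is a cheap guard against duplicated keys in the lookup table, which would silently inflate the population totals.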
Arranging the Data
data <-
  data |>
  dplyr::arrange(
    year,
    region_code,
    state_code,
    municipality_code,
    sex,
    age
  )
data |> dplyr::glimpse()
#> Rows: 902,340
#> Columns: 11
#> $ year <int> 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2…
#> $ region_code <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ region <chr> "North", "North", "North", "North", "North", "Nor…
#> $ state_code <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ state <chr> "Rondônia", "Rondônia", "Rondônia", "Rondônia", "…
#> $ federal_unit <chr> "RO", "RO", "RO", "RO", "RO", "RO", "RO", "RO", "…
#> $ municipality_code <int> 1100015, 1100015, 1100015, 1100015, 1100015, 1100…
#> $ municipality <chr> "Alta Floresta D'Oeste", "Alta Floresta D'Oeste",…
#> $ sex <fct> male, male, male, male, male, male, male, male, m…
#> $ age <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
#> $ n <int> 158, 159, 161, 163, 168, 176, 176, 172, 174, 178,…
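The row count reported by `glimpse()` can serve as a quick completeness check. The figures below are assumptions rather than values stated in this report: 5,570 municipalities, 2 sexes, and 81 single-year age values (0 through 80). Under those assumptions, a complete municipality-by-sex-by-age grid has exactly the 902,340 rows shown above.

```r
n_municipalities <- 5570L # assumed number of municipalities
n_sexes <- 2L             # male and female, as in the `sex` factor
n_ages <- 81L             # assumed single-year ages, 0 through 80

# Number of rows in a complete grid for a single year.
expected_rows <- n_municipalities * n_sexes * n_ages

print(expected_rows)
```

If a future year yields a different row count, either one of these assumptions changed or some combinations are missing from the raw file.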
Data Dictionary
metadata <-
  data |>
  labelled::`var_label<-`(
    list(
      year = "Year of the population estimate",
      region_code = "IBGE region code",
      region = "Region name",
      state_code = "IBGE state code",
      state = "State name",
      federal_unit = "Federal unit name",
      municipality_code = "IBGE municipality code",
      municipality = "Municipality name",
      sex = "Sex of the population",
      age = "Age of the population",
      n = "Population estimate"
    )
  ) |>
  labelled::generate_dictionary(details = "full") |>
  labelled::convert_list_columns_to_character()
metadata
data
Saving the Valid Data
Data
valid_file_pattern <- paste0("valid-", year)
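The write calls are not shown above. A minimal sketch, assuming base R writers and a temporary directory in place of the project's `data` folder (both assumptions), that produces the csv and rds files mentioned under Data Availability could look like this. The toy data frame stands in for the validated data.

```r
year <- 2024
valid_file_pattern <- paste0("valid-", year)

# Toy stand-in for the validated data.
valid_data <- data.frame(
  year = 2024L,
  municipality_code = 1100015L,
  sex = factor("male", levels = c("male", "female")),
  age = 0:2,
  n = c(158L, 159L, 161L)
)

# A temporary directory is used here instead of the project's data folder.
out_dir <- tempdir()
csv_file <- file.path(out_dir, paste0(valid_file_pattern, ".csv"))
rds_file <- file.path(out_dir, paste0(valid_file_pattern, ".rds"))

utils::write.csv(valid_data, csv_file, row.names = FALSE)
saveRDS(valid_data, rds_file)

# The rds file preserves column types (such as the `sex` factor);
# the csv file stores plain text, so `sex` is read back as character.
from_rds <- readRDS(rds_file)
from_csv <- utils::read.csv(csv_file)
```

Shipping both formats covers two audiences: the rds file round-trips exactly for R users, while the csv file stays readable from any tool at the cost of losing type information.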
Metadata
metadata_file_pattern <- paste0("metadata-", year)
Visualizing the Data
brand_div_palette <- function(x) {
  brandr:::make_color_ramp(
    n_prop = x,
    colors = c(
      brandr::get_brand_color("dark-red"),
      # brandr::get_brand_color("white"),
      brandr::get_brand_color_mix(
        position = 950,
        color_1 = "dark-red",
        color_2 = "dark-red-triadic-blue",
        alpha = 0.5
      ),
      brandr::get_brand_color("dark-red-triadic-blue")
    )
  )
}
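The internals of `brandr:::make_color_ramp()` are not shown in this report. The sketch below only illustrates the general idea of a diverging palette function, one that maps proportions between 0 and 1 to colors interpolated across three anchor colors, using base R's `grDevices::colorRamp()`. The function name `div_palette` and the hex colors are hypothetical stand-ins for the brand colors.

```r
# Hypothetical palette function: maps proportions in [0, 1] to hex colors
# interpolated across three anchor colors (illustrative values only).
div_palette <- function(x, colors = c("#8B0000", "#6A5ACD", "#00008B")) {
  stopifnot(is.numeric(x), all(x >= 0 & x <= 1))

  # colorRamp() returns a function that maps [0, 1] to a matrix of RGB values.
  ramp <- grDevices::colorRamp(colors)
  rgb_values <- ramp(x)

  # Convert the RGB matrix (0-255 scale) to hex color strings.
  grDevices::rgb(rgb_values, maxColorValue = 255)
}

palette_colors <- div_palette(c(0, 0.5, 1))
```

A function with this shape can be handed to a plotting helper that calls it with the proportions it needs, which is how `brand_div_palette` is passed as the `palette` argument further on.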
data |>
  dplyr::summarize(
    n = sum(n, na.rm = TRUE),
    .by = c("municipality_code")
  ) |>
  plotr:::plot_hist(
    col = "n",
    density_line_color = "red",
    x_label = "Population estimate",
    print = FALSE
  ) +
  ggplot2::labs(
    title = "Population Estimates by Municipality in Brazil",
    subtitle = paste0("Year: ", year),
    caption = "Source: DATASUS/IBGE"
  )
data |>
  dplyr::summarize(
    n = sum(n, na.rm = TRUE),
    .by = c("municipality_code")
  ) |>
  plotr:::plot_brazil_municipality(
    col_fill = "n",
    col_code = "municipality_code",
    year = plotr:::get_closest_geobr_year(year, type = "municipality"),
    comparable_areas = FALSE,
    reverse = FALSE,
    transform = "log10",
    palette = brand_div_palette,
    print = FALSE
  ) +
  ggplot2::labs(
    title = "Population Estimates by Municipality in Brazil",
    subtitle = paste0("Year: ", year),
    caption = "Source: DATASUS/IBGE"
  )
#> ! The closest map year to 2024 is 2022. Using year 2022 instead.
#> Scale on map varies by more than 10%, scale bar may be inaccurate
How to Cite
To cite this work, please use the following format:
Vartanian, D., & Carvalho, A. M. (2025). A reproducible pipeline for processing DATASUS annual population estimates by municipality, age, and sex in Brazil (2000-2024) [Report]. Sustentarea Research and Extension Group at the University of São Paulo. https://sustentarea.github.io/datasus-pop-estimates
A BibTeX entry for LaTeX users is:
@techreport{vartanian2025,
  title = {A reproducible pipeline for processing DATASUS annual population estimates by municipality, age, and sex in Brazil (2000-2024)},
  author = {{Daniel Vartanian} and {Aline Martins de Carvalho}},
  year = {2025},
  address = {São Paulo},
  institution = {Sustentarea Research and Extension Group at the University of São Paulo},
  langid = {en},
  url = {https://sustentarea.github.io/datasus-pop-estimates}
}
License
This content is licensed under CC0 1.0 Universal, placing these materials in the public domain. You may freely copy, modify, distribute, and use this work, even for commercial purposes, without permission or attribution.
Acknowledgments
This work is part of the Sustentarea Research and Extension Group project: Global syndemic: The impact of anthropogenic climate change on the health and nutrition of children under five years old attended by Brazil’s public health system (SUS).
This work was supported by the Conselho Nacional de Desenvolvimento Científico e Tecnológico - Brazil (CNPq).