library(brandr)
library(cli)
library(dplyr)
library(foreign)
library(fs)
library(geobr)
library(ggplot2)
library(ggspatial)
library(here)
library(htmltools)
library(httr2)
library(janitor)
library(labelled)
library(nanoparquet)
library(orbis) # github.com/danielvartan/orbis
library(osfr)
library(readr)
library(sf)
library(stringr)
library(tidyr)
library(zip)
A Reproducible Pipeline for Processing DATASUS Annual Population Estimates by Municipality, Age, and Sex in Brazil
Overview
This report provides a reproducible pipeline for processing DATASUS annual population estimates by municipality, age, and sex in Brazil.
For instructions on how to run the pipeline, see the repository README.
Problem
Accurate population estimates are fundamental for public health planning, resource allocation, and demographic research. The Department of Informatics of the Brazilian Unified Health System (DATASUS) publishes annual population estimates by municipality, age, and sex for Brazil. While these datasets are comprehensive, they require some processing to be directly usable for analysis in R. This pipeline provides a reproducible workflow to download and structure the data, enabling efficient and reliable downstream analyses.
Data Availability
The processed data are available in csv, rds, and parquet formats via a dedicated repository on the Open Science Framework (OSF), accessible here. Each dataset is accompanied by a metadata file describing its structure and contents.
You can also retrieve these files directly from R using the osfr package.
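A minimal sketch of such a retrieval is shown below. The function name `download_osf_files` and its arguments are hypothetical, and the OSF project identifier is deliberately left as a parameter because it is not stated in this report. Note that `osf_ls_files()` filters by substring, not by regular expression.

```r
# Hypothetical helper: list and download files from an OSF project.
# `osf_id` must be the identifier of the OSF repository (not given here).
download_osf_files <- function(osf_id, dest_dir = tempdir(), pattern = "parquet") {
  node <- osfr::osf_retrieve_node(osf_id)

  # `pattern` is matched as a plain substring of the file name.
  files <- osfr::osf_ls_files(node, pattern = pattern, n_max = Inf)

  # Skip files that already exist locally instead of raising an error.
  osfr::osf_download(files, path = dest_dir, conflicts = "skip")
}
```

Calling `download_osf_files("<osf-id>")` with the real project identifier would place the matching files in a temporary directory.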
Methods
Source of Data
The data used in this report come from the following sources:
- Department of Informatics of the Brazilian Unified Health System (DATASUS):
  - Annual population estimates by municipality, age, and sex for Brazil (Comitê de Gestão de Indicadores et al., n.d.), the primary dataset for this pipeline.
Data Munging
Data munging follows the data science workflow outlined by Wickham et al. (2023), as illustrated in Figure 1. All processing was done with the Quarto publishing system (Allaire et al., n.d.), the R programming language (R Core Team, n.d.), and several R packages.
For data manipulation and workflow, priority was given to packages from the tidyverse, rOpenSci and r-spatial ecosystems, as well as other packages adhering to the tidy tools manifesto (Wickham, 2023).
Source: Reproduced from Wickham et al. (2023).
Code Style
The Tidyverse Tidy Tools Manifesto (Wickham, 2023), code style guide (Wickham, n.d.-a) and design principles (Wickham, n.d.-b) were followed to ensure consistency and enhance readability.
Reproducibility
The pipeline is fully reproducible and can be run again at any time. To ensure consistent results, the renv package (Ushey & Wickham, n.d.) is used to manage and restore the R environment. See the README file in the code repository to learn how to run it.
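As a rough sketch, restoring the environment with renv usually looks like the snippet below. It assumes the working directory is the project root and that the lockfile uses renv's default name, `renv.lock`; the repository README remains the authoritative reference.

```r
# Assumes the working directory is the project root, where renv keeps its
# lockfile under the default name `renv.lock`.
has_lockfile <- file.exists("renv.lock")

if (has_lockfile && requireNamespace("renv", quietly = TRUE)) {
  # Report differences between the lockfile and the installed packages.
  renv::status()

  # Install the package versions recorded in the lockfile.
  renv::restore(prompt = FALSE)
}
```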
Set Environment
Load Packages
Set Data Directories
for (i in c(raw_data_dir, data_dir)) {
  if (!dir_exists(i)) dir_create(i, recurse = TRUE)
}
Set Initial Variables
year <- 2025
Download Data
See the Source of Data section for more information.
Get Data
"ftp.datasus.gov.br" |>
  path(
    "dissemin",
    "publicos",
    "IBGE",
    "POPSVS",
    file
  ) |>
  request() |>
  req_progress() |>
  req_perform(here(raw_data_dir, file))
Unzip Data
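The unzip code is not reproduced in this section. A minimal sketch, assuming every zip archive in the raw data directory should be extracted in place, could look like this. The helper name `unzip_all` is hypothetical.

```r
library(fs)

# Hypothetical helper: extract every zip archive found in `dir` into `dir`.
unzip_all <- function(dir) {
  zip_files <- dir_ls(dir, type = "file", regexp = "\\.zip$")

  for (i in zip_files) {
    # Namespaced call to avoid confusion with utils::unzip().
    zip::unzip(i, exdir = dir)
  }

  invisible(zip_files)
}
```

In the pipeline this would be called as `unzip_all(raw_data_dir)` before the archives are deleted.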
Delete Zip Files
raw_data_dir |>
  dir_ls(
    type = "file",
    regexp = "\\.zip$"
  ) |>
  file_delete()
Import Data
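The import code is not reproduced in this section. The sketch below assumes the unzipped file is in DBF format, which is consistent with the foreign package being loaded and with the factor columns in the output that follows, but this is an assumption. A toy file stands in for the real one.

```r
library(foreign)

# Toy table standing in for the unzipped DATASUS file (assumed DBF format).
toy <- data.frame(
  COD_MUN = c("1100015", "1100023"),
  ANO = c("2025", "2025"),
  SEXO = c("1", "1"),
  IDADE = c("000", "000"),
  POP = c(148L, 732L),
  stringsAsFactors = FALSE
)

dbf_file <- tempfile(fileext = ".dbf")
write.dbf(toy, dbf_file)

# With `as.is = FALSE` (the default), character fields are read as factors,
# which matches the <fct> columns seen in the imported data.
data_sketch <- read.dbf(dbf_file, as.is = FALSE)
str(data_sketch)
```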
data |> glimpse()
#> Rows: 902,502
#> Columns: 5
#> $ cod_mun <fct> 1100015, 1100023, 1100031, 1100049, 1100056, 1100064, 11000…
#> $ ano <fct> 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2025,…
#> $ sexo <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ idade <fct> 000, 000, 000, 000, 000, 000, 000, 000, 000, 000, 000, 000,…
#> $ pop <int> 148, 732, 31, 640, 112, 86, 48, 103, 221, 362, 368, 934, 24…
Tidy Data
Rename Columns
data <-
  data |>
  clean_names() |>
  rename(
    municipality_code = cod_mun,
    year = ano,
    sex = sexo,
    age = idade,
    population = pop
  )
data |> glimpse()
#> Rows: 902,502
#> Columns: 5
#> $ municipality_code <fct> 1100015, 1100023, 1100031, 1100049, 1100056, 1100…
#> $ year <fct> 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2…
#> $ sex <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ age <fct> 000, 000, 000, 000, 000, 000, 000, 000, 000, 000,…
#> $ population <int> 148, 732, 31, 640, 112, 86, 48, 103, 221, 362, 36…
Standardize Columns
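This step converts factor columns to integers by going through character first. The toy example below (values chosen for illustration only) shows why: calling `as.integer()` directly on a factor returns the internal level codes, not the values printed on screen.

```r
# Toy factors mimicking the imported `ano` and `idade` columns.
year_fct <- factor(c("2025", "2025"))
age_fct <- factor(c("000", "001", "010", "080"))

# Direct conversion returns level codes, which is not what is wanted.
codes <- as.integer(year_fct)
codes
#> [1] 1 1

# Converting to character first recovers the actual values.
year_int <- as.integer(as.character(year_fct))
year_int
#> [1] 2025 2025

age_int <- as.integer(as.character(age_fct))
age_int
#> [1]  0  1 10 80
```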
data <-
  data |>
  mutate(
    across(
      .cols = where(is.factor),
      .fns = \(x) x |> as.character() |> as.integer()
    )
  ) |>
  mutate(
    sex = sex |>
      factor(
        levels = 1:2,
        labels = c("male", "female"),
        ordered = FALSE
      )
  )
data |> glimpse()
#> Rows: 902,502
#> Columns: 5
#> $ municipality_code <int> 1100015, 1100023, 1100031, 1100049, 1100056, 1100…
#> $ year <int> 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2…
#> $ sex <fct> male, male, male, male, male, male, male, male, m…
#> $ age <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ population <int> 148, 732, 31, 640, 112, 86, 48, 103, 221, 362, 36…
Relocate Columns
data <- data |> relocate(year)
data |> glimpse()
#> Rows: 902,502
#> Columns: 5
#> $ year <int> 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2…
#> $ municipality_code <int> 1100015, 1100023, 1100031, 1100049, 1100056, 1100…
#> $ sex <fct> male, male, male, male, male, male, male, male, m…
#> $ age <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ population <int> 148, 732, 31, 640, 112, 86, 48, 103, 221, 362, 36…
Arrange Data
data <-
  data |>
  arrange(
    year,
    municipality_code,
    sex,
    age
  )
data |> glimpse()
#> Rows: 902,502
#> Columns: 5
#> $ year <int> 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2…
#> $ municipality_code <int> 1100015, 1100015, 1100015, 1100015, 1100015, 1100…
#> $ sex <fct> male, male, male, male, male, male, male, male, m…
#> $ age <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
#> $ population <int> 148, 150, 154, 159, 163, 169, 177, 177, 172, 174,…
Create Data Dictionary
Prepare Metadata
metadata <-
  data |>
  `var_label<-`(
    list(
      year = "Year of the population estimate",
      municipality_code = "IBGE municipality code",
      sex = "Sex of the population",
      age = "Age of the population",
      population = "Population estimate"
    )
  ) |>
  generate_dictionary(details = "full") |>
  convert_list_columns_to_character()
Visualize Final Data
metadata |> glimpse()
#> Rows: 5
#> Columns: 14
#> $ pos <int> 1, 2, 3, 4, 5
#> $ variable <chr> "year", "municipality_code", "sex", "age", "populatio…
#> $ label <chr> "Year of the population estimate", "IBGE municipality…
#> $ col_type <chr> "int", "int", "fct", "int", "int"
#> $ missing <int> 0, 0, 0, 0, 0
#> $ levels <chr> "", "", "male; female", "", ""
#> $ value_labels <chr> "", "", "", "", ""
#> $ class <chr> "integer", "integer", "factor", "integer", "integer"
#> $ type <chr> "integer", "integer", "integer", "integer", "integer"
#> $ na_values <chr> "", "", "", "", ""
#> $ na_range <chr> "", "", "", "", ""
#> $ n_na <int> 0, 0, 0, 0, 0
#> $ unique_values <int> 1, 5571, 2, 81, 7982
#> $ range <chr> "2025 - 2025", "1100015 - 5300108", "", "0 - 80", "1 …
metadata
data |> glimpse()
#> Rows: 902,502
#> Columns: 5
#> $ year <int> 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2…
#> $ municipality_code <int> 1100015, 1100015, 1100015, 1100015, 1100015, 1100…
#> $ sex <fct> male, male, male, male, male, male, male, male, m…
#> $ age <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
#> $ population <int> 148, 150, 154, 159, 163, 169, 177, 177, 172, 174,…
data
Save Data
The processed data are available in csv, rds and parquet formats through a dedicated repository on the Open Science Framework (OSF). See the Data Availability section for more information.
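Before writing, the row count can be checked against the data dictionary above, which reports 5,571 distinct municipality codes, 2 sexes, and 81 ages (0 to 80). If every combination appears exactly once, the product of these counts should equal the 902,502 rows shown by `glimpse()`. The sketch below works this out; the helper `is_complete_grid` is hypothetical and not part of the pipeline.

```r
# Counts reported in the data dictionary for the 2025 file.
n_municipalities <- 5571L
n_sexes <- 2L
n_ages <- 81L # Ages 0 through 80.

expected_rows <- n_municipalities * n_sexes * n_ages
expected_rows
#> [1] 902502

# Hypothetical helper: TRUE when every combination of the key columns
# appears exactly once in `x`.
is_complete_grid <- function(x, keys) {
  n_expected <- prod(vapply(x[keys], \(col) length(unique(col)), integer(1)))

  nrow(x) == n_expected && !anyDuplicated(x[keys])
}
```

With the processed table, the check would be `is_complete_grid(data, c("municipality_code", "sex", "age"))`.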
Write Data
valid_file_pattern <- year
data |>
  write_parquet(
    here(data_dir, paste0(valid_file_pattern, ".parquet"))
  )
Write Metadata
metadata_file_pattern <-
  "metadata-" |>
  paste0(year)
metadata |>
  write_parquet(
    here(data_dir, paste0(metadata_file_pattern, ".parquet"))
  )
Visualize Data
Plot Histogram by Municipality
data |>
  summarize(
    population = sum(population, na.rm = TRUE),
    .by = "municipality_code"
  ) |>
  ggplot(aes(x = population)) +
  geom_histogram(
    aes(y = after_stat(density)),
    bins = 30,
    fill = get_brand_color("gray-d25"),
    color = get_brand_color("white")
  ) +
  geom_density(
    color = "red",
    linewidth = 1
  ) +
  scale_x_continuous(
    transform = "log10"
  ) +
  labs(
    title = "Population Estimates by Municipality in Brazil",
    subtitle = paste0("Year: ", year),
    x = "Population Estimate",
    y = "Density",
    caption = "Source: DATASUS/IBGE."
  )
Plot Map by Municipality
Set Shape
shape <-
  read_municipality(
    year = year |>
      closest_geobr_year(type = "municipality"),
    showProgress = FALSE
  ) |>
  st_transform(st_crs(4326))
#> ! The closest map year to 2025 is 2024. Using year 2024 instead.
#> Using year/date 2024
Prepare Plot Data
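The code that builds `plot_data` is not reproduced in this section. A plausible sketch, assuming the population totals are joined to the shape through the municipality code, is shown below with toy tables. The column name `code_muni` follows geobr's municipality output; the toy values and the names `totals_toy`, `shape_toy`, and `plot_data_sketch` are illustrative only.

```r
library(dplyr)

# Toy stand-ins for the processed data and the geobr shape attributes.
totals_toy <- tibble(
  municipality_code = c(1100015L, 1100015L, 1100023L),
  population = c(148L, 150L, 732L)
)

shape_toy <- tibble(
  code_muni = c(1100015, 1100023, 1100031),
  name_muni = c("Municipality A", "Municipality B", "Municipality C")
)

plot_data_sketch <-
  totals_toy |>
  summarize(
    population = sum(population, na.rm = TRUE),
    .by = municipality_code
  ) |>
  # Keep every municipality in the shape, even those without estimates.
  right_join(shape_toy, by = join_by(municipality_code == code_muni))

plot_data_sketch
```

Municipalities present in the shape but absent from the estimates end up with a missing population, which the map's fill scale can render through its `na.value` color.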
Plot Data
brand_div_palette <- function(x) {
  brandr:::make_color_ramp(
    n = x,
    colors = c(
      get_brand_color("dark-red"),
      # get_brand_color("white"),
      get_brand_color_mix(
        position = 950,
        color_1 = "dark-red",
        color_2 = "dark-red-triadic-blue",
        alpha = 0.5
      ),
      get_brand_color("dark-red-triadic-blue")
    )
  )
}
plot_data |>
  st_as_sf() |>
  ggplot(aes(fill = population)) +
  geom_sf(
    color = get_brand_color("gray"),
    linewidth = 0.05
  ) +
  scale_fill_binned(
    palette = brand_div_palette,
    na.value = get_brand_color("gray-d25"),
    transform = "log10"
  ) +
  annotation_scale(
    aes(),
    location = "br",
    style = "tick",
    height = unit(0.5, "lines")
  ) +
  annotation_north_arrow(
    location = "br",
    height = unit(2, "lines"),
    width = unit(2, "lines"),
    pad_x = unit(0.25, "lines"),
    pad_y = unit(1.25, "lines"),
    style = north_arrow_fancy_orienteering
  ) +
  labs(
    title = "Population Estimates by Municipality in Brazil",
    subtitle = paste0("Year: ", year),
    fill = NULL,
    caption = "Source: DATASUS/IBGE."
  )
#> Scale on map varies by more than 10%, scale bar may be inaccurate
Citation
When using this data, you must also cite the original data sources.
To cite this work, please use the following format:
Vartanian, D., & Carvalho, A. M. (2025). A reproducible pipeline for processing DATASUS annual population estimates by municipality, age, and sex in Brazil [Computer software]. Sustentarea Research and Extension Group, University of São Paulo. https://sustentarea.github.io/population-estimates
A BibLaTeX entry for LaTeX users is:
@software{vartanian2025,
title = {A reproducible pipeline for processing DATASUS annual population estimates by municipality, age, and sex in Brazil},
author = {{Daniel Vartanian} and {Aline Martins de Carvalho}},
year = {2025},
address = {São Paulo},
institution = {Sustentarea Research and Extension Group, the University of São Paulo},
langid = {en},
url = {https://sustentarea.github.io/population-estimates}
}
License
The original data sources may be subject to their own licensing terms and conditions.
The code in this repository is licensed under the GNU General Public License Version 3, while the report is available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.
Copyright (C) 2025 Sustentarea Research and Extension Group
The code in this report is free software: you can redistribute it and/or
modify it under the terms of the GNU General Public License as published by the
Free Software Foundation, either version 3 of the License, or (at your option)
any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with
this program. If not, see <https://www.gnu.org/licenses/>.


