library(brandr)
library(cli)
library(dplyr)
library(fs)
library(ggplot2)
library(groomr) # github.com/danielvartan/groomr
library(here)
library(httr2)
library(lubridate)
library(orbis) # github.com/danielvartan/orbis
library(osfr)
library(pal) # gitlab.com/rpkg.dev/pal
library(plotr) # github.com/danielvartan/plotr
library(readr)
library(tidyr)
library(utils)
library(vroom)
A Reproducible Pipeline for Processing SISVAN Microdata on Nutritional Status Monitoring in Brazil (2008-2023)
Overview
This report contains a reproducible pipeline for processing SISVAN microdata on nutritional status monitoring in Brazil (2008–2023). The main goal is to provide an open and reliable workflow for processing these data, supporting research and informed public policy decisions.
This pipeline is still under development and may not be fully functional.
This warning will be removed once the pipeline is complete.
Problem
The Food and Nutrition Surveillance System (SISVAN) is a strategic tool for monitoring the nutritional status of the Brazilian population, particularly those served by Brazil’s Unified Health System (SUS). However, despite its broad scope and importance, the anthropometric data recorded in SISVAN often suffer from quality issues that limit their usefulness for rigorous analyses and evidence-based policymaking (Silva et al., 2023).
Multiple factors contribute to these quality concerns, including the lack of standardized measurement protocols, variability in staff training, inconsistencies in data entry and processing, and incomplete population coverage (Bagni & Barros, 2015; Corsi et al., 2017; Perumal et al., 2020). To assess and improve data quality, several indicators have been proposed and applied, such as population coverage (Mourão et al., 2020; Nascimento et al., 2017), completeness of birth dates and anthropometric measurements (Finaret & Hutchinson, 2018; Nannan et al., 2019), digit preference for age, height, and weight (Bopp & Faeh, 2008; Lyons-Amos & Stones, 2017), the percentage of biologically implausible values (Lawman et al., 2015), and the dispersion and distribution of standardized weight and height measurements (Mei, 2007; Perumal et al., 2020).
In light of this, there is a need for an open and reproducible pipeline for processing SISVAN microdata, aiming to identify, correct, or remove problematic records and ensure greater consistency, completeness, and plausibility of the information for use in research and public policy.
Data Availability
The processed data are available in both csv and rds formats via a dedicated repository on the Open Science Framework (OSF), accessible here. A metadata file is included alongside the validated data. You can also access these files directly from R using the osfr package.
A backup copy of the raw data is also available on OSF. You can access it here.
Methods
Source of Data
The data used in this analysis come from the following sources:
- Brazil’s Food and Nutrition Surveillance System (SISVAN), which provides microdata on nutritional status monitoring in Brazil (Sistema de Vigilância Alimentar e Nutricional et al., n.d.).
- The Brazilian Institute of Geography and Statistics (IBGE), which provides official territorial data for Brazilian municipalities. These data were accessed using the orbis and geobr R packages (Pereira & Goncalves, n.d.; Vartanian, n.d.).
- The Department of Informatics of the Brazilian Unified Health System (DATASUS) platform, which provides annual population estimates for Brazil by municipality, age, and sex for the period 2000-2024 (Comitê de Gestão de Indicadores et al., n.d.). For practicality and better organization, the DATASUS data used in this pipeline are provided through a separate reproducible pipeline, available here (Vartanian & Carvalho, 2025).
For technical information about the raw dataset, see the official technical note (in Portuguese).
Data Munging
The data munging followed the data science workflow outlined by Wickham et al. (2023), as illustrated in Figure 1. All steps were carried out using the Quarto publishing system (Allaire et al., n.d.), the R programming language (R Core Team, n.d.), and several R packages.
Packages from the tidyverse and the rOpenSci peer-reviewed ecosystem, along with other R packages that adhere to the tidy tools manifesto (Wickham, 2023), were prioritized. All steps were designed to keep the results transparent and reproducible.
Source: Reproduced from Wickham et al. (2023).
Data Validation
Different validation techniques were used to ensure data quality and reliability:
- The amount of data imported from the raw files was compared with the amount returned by the SISVAN Online Data Access Tool.
- Duplicates were removed based on distinct combinations of the variables id, age, date (date of the individual’s nutritional assessment), weight, and height.
- The number of nutritional assessments was compared with the estimated number of children in the population.
The quality indicators proposed by Silva et al. (2023) were also used for validation. Refer to the article for more details.
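The duplicate-removal rule listed above can be sketched with dplyr. This is a minimal illustration on made-up records, not necessarily the exact code used by the pipeline; the toy tibble and its values are assumptions, and only the five key variable names come from this report.

```r
library(dplyr)

# Toy records (made-up values). Rows 1 and 2 are exact duplicates on the
# five key variables; row 3 is a later assessment of the same individual.
toy <- tibble(
  id = c("A", "A", "A", "B"),
  age = c(1L, 1L, 1L, 3L),
  date = as.Date(c("2017-01-10", "2017-01-10", "2017-03-02", "2017-01-10")),
  weight = c(10.2, 10.2, 10.9, 14.0),
  height = c(76L, 76L, 78L, 95L),
  sex = factor(c("female", "female", "female", "male"))
)

# Keep the first row of each distinct combination of the key variables,
# retaining all other columns.
deduplicated <- toy |>
  distinct(id, age, date, weight, height, .keep_all = TRUE)

# Number of duplicated rows removed.
nrow(toy) - nrow(deduplicated)
```

Using .keep_all = TRUE keeps the remaining columns (such as sex) from the first matching row instead of dropping them.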
Code Style
The Tidyverse code style guide and design principles were followed to ensure consistency and enhance readability.
Reproducibility
The pipeline is fully reproducible and can be run again at any time. See the README file in the code repository to learn how to run it.
Setting the Environment
Setting the Initial Variables
year <- 2017
age_limits <- c(0, 4)
Click here to access the microdata data dictionary (in Portuguese).
col_selection <- c(
"CO_PESSOA_SISVAN",
"CO_MUNICIPIO_IBGE",
"DT_ACOMPANHAMENTO",
"SG_SEXO",
"NU_IDADE_ANO",
"NU_PESO",
"NU_ALTURA"
)
Downloading the Data
SISVAN microdata files are very large. For practical reasons, some code chunks have eval: false set to prevent downloading the data each time the report is rendered. When running the pipeline in a loop or for full automation, remove these lines to enable automatic downloading.
# Create the data directory at the project root if it does not exist yet.
if (!dir.exists(here::here("data"))) dir.create(here::here("data"))
raw_file_pattern <- paste0("raw-", year)
file <- here::here("data", paste0(raw_file_pattern, ".zip"))
paste0(
"https://s3.sa-east-1.amazonaws.com/ckan.saude.gov.br/SISVAN/",
"estado_nutricional/sisvan_estado_nutricional_",
year,
".zip"
) |>
httr2::request() |>
httr2::req_progress() |>
httr2::req_perform(file)
Unzipping the Data
fs::file_delete(here::here("data", paste0(raw_file_pattern, ".zip")))
Checking Data Dimensions
file |> groomr::peek_csv_file(delim = ";", skip = 0, has_header = TRUE)
#> The file has 34 columns, 28,537,529 rows, and 970,275,986 cells.
Reading and Filtering the Data
We use the vroom R package together with the AWK programming language to handle large datasets efficiently and mitigate memory issues. This approach allows the pipeline to run locally on most machines, though we recommend a minimum of 12 GB of RAM for optimal performance. Alternatively, the pipeline can also be executed on cloud platforms such as Google Colab or Posit Cloud (formerly RStudio Cloud).
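To make the AWK pre-filtering step more concrete, the sketch below builds a filter command of the same kind as the one used in this section and prints it, without running it. The helper build_awk_filter() and the file path are hypothetical names introduced here for illustration; the field position is derived from the column order listed in this report.

```r
# Column order of the raw file (first eight names only, as listed in this
# report). awk addresses fields by position, not by name.
col_names_head <- c(
  "CO_ACOMPANHAMENTO", "CO_PESSOA_SISVAN", "ST_PARTICIPA_ANDI",
  "CO_MUNICIPIO_IBGE", "SG_UF", "NO_MUNICIPIO", "CO_CNES", "NU_IDADE_ANO"
)

# Hypothetical helper: builds an awk command that prints only the rows whose
# age field lies within `age_limits` (inclusive). An awk pattern without an
# action prints the matching line.
build_awk_filter <- function(file, age_limits, age_field, delim = ";") {
  stopifnot(length(age_limits) == 2, age_limits[1] <= age_limits[2])

  paste0(
    "awk -F '", delim, "' '",
    "($", age_field, " >= ", age_limits[1], ") && ",
    "($", age_field, " <= ", age_limits[2], ")' ",
    shQuote(file, type = "sh") # Protects paths that contain spaces
  )
}

# Position of the age column in the raw file.
age_field <- match("NU_IDADE_ANO", col_names_head)

command <- build_awk_filter(
  file = "data/raw-2017.csv", # Hypothetical path
  age_limits = c(0, 4),
  age_field = age_field
)

cat(command, "\n")
```

A command like this can be passed to pipe() so that only the filtered rows reach the reader. Note that such a filter also tests the header line; since its eighth field is text rather than a number in range, it is likely dropped, which would be consistent with the column names being supplied explicitly when reading.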
col_names <- c(
"CO_ACOMPANHAMENTO",
"CO_PESSOA_SISVAN",
"ST_PARTICIPA_ANDI",
"CO_MUNICIPIO_IBGE",
"SG_UF",
"NO_MUNICIPIO",
"CO_CNES",
"NU_IDADE_ANO",
"NU_FASE_VIDA",
"DS_FASE_VIDA",
"SG_SEXO",
"CO_RACA_COR",
"DS_RACA_COR",
"CO_POVO_COMUNIDADE",
"DS_POVO_COMUNIDADE",
"CO_ESCOLARIDADE",
"DS_ESCOLARIDADE",
"DT_ACOMPANHAMENTO",
"NU_COMPETENCIA",
"NU_PESO",
"NU_ALTURA",
"DS_IMC",
"DS_IMC_PRE_GESTACIONAL",
"PESO X IDADE",
"PESO X ALTURA",
"CRI. ALTURA X IDADE",
"CRI. IMC X IDADE",
"ADO. ALTURA X IDADE",
"ADO. IMC X IDADE",
"CO_ESTADO_NUTRI_ADULTO",
"CO_ESTADO_NUTRI_IDOSO",
"CO_ESTADO_NUTRI_IMC_SEMGEST",
"CO_SISTEMA_ORIGEM_ACOMP",
"SISTEMA_ORIGEM_ACOMP"
)
schema <- vroom::cols(
CO_ACOMPANHAMENTO = vroom::col_character(),
CO_PESSOA_SISVAN = vroom::col_character(),
ST_PARTICIPA_ANDI = vroom::col_character(),
CO_MUNICIPIO_IBGE = vroom::col_integer(),
SG_UF = vroom::col_factor(),
NO_MUNICIPIO = vroom::col_character(), # ? vroom::col_factor()
CO_CNES = vroom::col_integer(),
NU_IDADE_ANO = vroom::col_integer(),
NU_FASE_VIDA = vroom::col_character(), # decimal mark = "." (double)
DS_FASE_VIDA = vroom::col_factor(),
SG_SEXO = vroom::col_factor(),
CO_RACA_COR = vroom::col_character(),
DS_RACA_COR = vroom::col_factor(),
CO_POVO_COMUNIDADE = vroom::col_integer(),
DS_POVO_COMUNIDADE = vroom::col_factor(),
CO_ESCOLARIDADE = vroom::col_character(),
DS_ESCOLARIDADE = vroom::col_factor(),
DT_ACOMPANHAMENTO = vroom::col_date(),
NU_COMPETENCIA = vroom::col_integer(),
NU_PESO = vroom::col_double(),
NU_ALTURA = vroom::col_integer(),
DS_IMC = vroom::col_double(),
DS_IMC_PRE_GESTACIONAL = vroom::col_character(), # decimal mark = "." (double)
"PESO X IDADE" = vroom::col_factor(),
"PESO X ALTURA" = vroom::col_factor(),
"CRI. ALTURA X IDADE" = vroom::col_factor(),
"CRI. IMC X IDADE" = vroom::col_factor(),
"ADO. ALTURA X IDADE" = vroom::col_factor(),
"ADO. IMC X IDADE" = vroom::col_factor(),
CO_ESTADO_NUTRI_ADULTO = vroom::col_factor(),
CO_ESTADO_NUTRI_IDOSO = vroom::col_factor(),
CO_ESTADO_NUTRI_IMC_SEMGEST = vroom::col_factor(),
CO_SISTEMA_ORIGEM_ACOMP = vroom::col_integer(),
SISTEMA_ORIGEM_ACOMP = vroom::col_factor()
)
You may see warning messages about failed parsing. These warnings are expected due to minor inconsistencies in the SISVAN raw data and do not affect the overall analysis.
data <-
vroom::vroom(
# Uses `pipe()` and `awk` to filter data to avoid loading the
# entire file into memory.
file = pipe(
paste(
"awk -F ';' '{ if (",
"($8 >= ", age_limits[1], ") && ($8 <= ", age_limits[2], ")",
") { print } }'",
shQuote(file) # Quote the path so that spaces do not break the command
)
),
delim = ";",
col_names = col_names,
col_types = schema,
col_select = dplyr::all_of(col_selection),
id = NULL,
skip = 0,
n_max = Inf,
na = c("", "NA"),
quote = "\"",
comment = "",
skip_empty_rows = TRUE,
trim_ws = TRUE,
escape_double = TRUE,
escape_backslash = FALSE,
locale = vroom::locale(
date_names = "pt",
date_format = "%d/%m/%Y",
time_format = "%H:%M:%S",
decimal_mark = ",",
grouping_mark = ".",
tz = "America/Sao_Paulo",
encoding = readr::guess_encoding(file)$encoding[1]
),
guess_max = 100,
altrep = TRUE,
num_threads = vroom:::vroom_threads(),
progress = vroom::vroom_progress(),
show_col_types = NULL,
.name_repair = "unique"
)
data |> dplyr::glimpse()
#> Rows: 4,775,907
#> Columns: 7
#> $ CO_PESSOA_SISVAN <chr> "B053B4FAD12CF2F95F1C251702606DCCD870A406", "CF43…
#> $ CO_MUNICIPIO_IBGE <int> 230670, 420270, 520520, 251445, 231380, 230710, 2…
#> $ DT_ACOMPANHAMENTO <date> 2017-01-23, 2017-01-11, 2017-01-02, 2017-01-10, …
#> $ SG_SEXO <fct> M, M, F, F, F, F, F, F, M, M, F, F, M, F, F, F, F…
#> $ NU_IDADE_ANO <int> 1, 3, 3, 0, 4, 0, 0, 1, 1, 0, 0, 4, 1, 1, 0, 0, 0…
#> $ NU_PESO <dbl> 11.600, 17.200, 14.000, 12.200, 16.000, 6.400, 8.…
#> $ NU_ALTURA <int> 77, 90, 97, 73, 100, 60, 63, 73, 88, 66, 67, 108,…
Renaming the Data
data <-
data |>
janitor::clean_names() |>
dplyr::rename(
id = co_pessoa_sisvan,
municipality_code = co_municipio_ibge,
date = dt_acompanhamento,
sex = sg_sexo,
age = nu_idade_ano,
weight = nu_peso,
height = nu_altura
)
data |> dplyr::glimpse()
#> Rows: 4,775,907
#> Columns: 7
#> $ id <chr> "B053B4FAD12CF2F95F1C251702606DCCD870A406", "CF43…
#> $ municipality_code <int> 230670, 420270, 520520, 251445, 231380, 230710, 2…
#> $ date <date> 2017-01-23, 2017-01-11, 2017-01-02, 2017-01-10, …
#> $ sex <fct> M, M, F, F, F, F, F, F, M, M, F, F, M, F, F, F, F…
#> $ age <int> 1, 3, 3, 0, 4, 0, 0, 1, 1, 0, 0, 4, 1, 1, 0, 0, 0…
#> $ weight <dbl> 11.600, 17.200, 14.000, 12.200, 16.000, 6.400, 8.…
#> $ height <int> 77, 90, 97, 73, 100, 60, 63, 73, 88, 66, 67, 108,…
Tidying the Data
data <-
data |>
dplyr::mutate(
sex =
sex |>
dplyr::case_match(
"F" ~ "female",
"M" ~ "male"
) |>
factor(
levels = c("male", "female"),
ordered = FALSE
)
) |>
dplyr::relocate(id, date)
data |> dplyr::glimpse()
#> Rows: 4,775,907
#> Columns: 7
#> $ id <chr> "B053B4FAD12CF2F95F1C251702606DCCD870A406", "CF43…
#> $ date <date> 2017-01-23, 2017-01-11, 2017-01-02, 2017-01-10, …
#> $ municipality_code <int> 230670, 420270, 520520, 251445, 231380, 230710, 2…
#> $ sex <fct> male, male, female, female, female, female, femal…
#> $ age <int> 1, 3, 3, 0, 4, 0, 0, 1, 1, 0, 0, 4, 1, 1, 0, 0, 0…
#> $ weight <dbl> 11.600, 17.200, 14.000, 12.200, 16.000, 6.400, 8.…
#> $ height <int> 77, 90, 97, 73, 100, 60, 63, 73, 88, 66, 67, 108,…
Transforming the Data
Adding State and Region Data
brazil_municipalities <- orbis::get_brazil_municipality(
year = plotr:::get_closest_geobr_year(year, type = "municipality")
)
data <-
data |>
dplyr::left_join(
brazil_municipalities |>
dplyr::mutate(
municipality_code =
municipality_code |>
stringr::str_sub(end = -2) |>
as.integer()
),
by = "municipality_code"
) |>
dplyr::relocate(
id,
date,
region_code,
region,
state_code,
state,
federal_unit,
municipality_code,
municipality
)
data |> dplyr::glimpse()
#> Rows: 4,775,907
#> Columns: 13
#> $ id <chr> "B053B4FAD12CF2F95F1C251702606DCCD870A406", "CF43…
#> $ date <date> 2017-01-23, 2017-01-11, 2017-01-02, 2017-01-10, …
#> $ region_code <int> 2, 4, 5, 2, 2, 2, 2, 3, 2, 2, 4, 3, 1, 1, 3, 2, 4…
#> $ region <chr> "Northeast", "South", "Central-West", "Northeast"…
#> $ state_code <int> 23, 42, 52, 25, 23, 23, 29, 31, 23, 26, 43, 35, 1…
#> $ state <chr> "Ceará", "Santa Catarina", "Goiás", "Paraíba", "C…
#> $ federal_unit <chr> "CE", "SC", "GO", "PB", "CE", "CE", "BA", "MG", "…
#> $ municipality_code <int> 230670, 420270, 520520, 251445, 231380, 230710, 2…
#> $ municipality <chr> "Jaguaretama", "Botuverá", "Caturaí", "São José d…
#> $ sex <fct> male, male, female, female, female, female, femal…
#> $ age <int> 1, 3, 3, 0, 4, 0, 0, 1, 1, 0, 0, 4, 1, 1, 0, 0, 0…
#> $ weight <dbl> 11.600, 17.200, 14.000, 12.200, 16.000, 6.400, 8.…
#> $ height <int> 77, 90, 97, 73, 100, 60, 63, 73, 88, 66, 67, 108,…
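The join above drops the last character of the municipality code before matching. The SISVAN records shown in this report carry six-digit codes, whereas full IBGE municipality codes have seven digits, so the trimming presumably aligns the two. The sketch below shows the same operation in base R in two equivalent ways; the two example codes are used only for illustration.

```r
# Seven-digit IBGE municipality codes (examples: Alta Floresta D'Oeste and
# São Paulo).
ibge_codes <- c(1100015L, 3550308L)

# Option 1: string-based trimming, mirroring the use of
# stringr::str_sub(end = -2) in the join.
codes_chr <- as.character(ibge_codes)
trimmed_chr <- as.integer(substr(codes_chr, 1, nchar(codes_chr) - 1))

# Option 2: integer division, which avoids the round trip through
# character vectors.
trimmed_int <- ibge_codes %/% 10L

trimmed_int
```

Both options return the same six-digit codes; integer division is only safe when the codes are stored as whole numbers with exactly one trailing digit to remove.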
Validating the Data
Removing Duplicates
data |> dplyr::glimpse()
#> Rows: 4,770,414
#> Columns: 13
#> $ id <chr> "B1C98CBB3CB83C75B08E2F62056EB10A68DBEBA4", "5B1A…
#> $ date <date> 2017-12-31, 2017-12-31, 2017-12-31, 2017-12-31, …
#> $ region_code <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5…
#> $ region <chr> "Central-West", "Central-West", "Central-West", "…
#> $ state_code <int> 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 5…
#> $ state <chr> "Goiás", "Goiás", "Goiás", "Goiás", "Goiás", "Goi…
#> $ federal_unit <chr> "GO", "GO", "GO", "GO", "GO", "GO", "GO", "GO", "…
#> $ municipality_code <int> 522020, 522045, 522020, 521250, 522020, 522020, 5…
#> $ municipality <chr> "São Miguel do Araguaia", "Senador Canedo", "São …
#> $ sex <fct> female, male, female, male, female, male, female,…
#> $ age <int> 3, 4, 2, 2, 2, 3, 3, 1, 2, 4, 2, 3, 2, 4, 3, 4, 4…
#> $ weight <dbl> 17, NA, NA, NA, 14, 19, 16, 12, 14, 21, NA, 17, 1…
#> $ height <int> 94, 107, 85, 81, 89, 96, 95, 80, 87, 127, 93, 93,…
Arranging the Data
data <-
data |>
dplyr::arrange(
region_code,
state_code,
municipality_code,
date,
sex,
age,
weight,
height
)
data |> dplyr::glimpse()
#> Rows: 4,770,414
#> Columns: 13
#> $ id <chr> "263E905B0395FF94BE2D97E92983F83D0F4D01E6", "B812…
#> $ date <date> 2017-01-04, 2017-01-05, 2017-01-06, 2017-01-09, …
#> $ region_code <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ region <chr> "North", "North", "North", "North", "North", "Nor…
#> $ state_code <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ state <chr> "Rondônia", "Rondônia", "Rondônia", "Rondônia", "…
#> $ federal_unit <chr> "RO", "RO", "RO", "RO", "RO", "RO", "RO", "RO", "…
#> $ municipality_code <int> 110001, 110001, 110001, 110001, 110001, 110001, 1…
#> $ municipality <chr> "Alta Floresta D'Oeste", "Alta Floresta D'Oeste",…
#> $ sex <fct> female, male, female, male, male, female, male, m…
#> $ age <int> 1, 1, 4, 2, 0, 3, 1, 2, 1, 4, 0, 2, 0, 0, 4, 0, 0…
#> $ weight <dbl> 9.000, 9.700, 14.000, 12.700, 6.600, 13.800, 10.0…
#> $ height <int> 81, 75, 95, 89, 58, 98, 83, 80, 76, 111, 53, 95, …
Data Dictionary
metadata <-
data |>
labelled::`var_label<-`(
list(
id = "Unique identifier of the individual",
date = "Date of the individual's nutritional assessment",
region_code = "IBGE region code",
region = "Region name",
state_code = "IBGE state code",
state = "State name",
federal_unit = "Federal unit abbreviation",
municipality_code = "IBGE municipality code",
municipality = "Municipality name",
sex = "Sex of the individual",
age = "Age of the individual in years",
weight = "Weight of the individual in kilograms",
height = "Height of the individual in centimeters"
)
) |>
labelled::generate_dictionary(details = "full") |>
labelled::convert_list_columns_to_character()
metadata
data
Saving the Valid Data
Data
valid_file_pattern <- paste0(
"valid-",
year,
"-age-",
age_limits[1],
"-",
age_limits[2]
)
Metadata
metadata_file_pattern <- paste0(
"metadata-",
year,
"-age-",
age_limits[1],
"-",
age_limits[2]
)
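The file patterns above only define names; the write step itself is not shown in this report. As a rough sketch, assuming the outputs are written with readr in the csv and rds formats mentioned under Data Availability, saving could look like the following. The toy data frame and the use of tempdir() as the output directory are assumptions made so the example runs anywhere.

```r
library(readr)

year <- 2017
age_limits <- c(0, 4)

# Same naming scheme as in this report: "valid-<year>-age-<min>-<max>".
valid_file_pattern <- paste0(
  "valid-", year, "-age-", age_limits[1], "-", age_limits[2]
)

# Toy data standing in for the validated dataset (made-up values).
toy <- data.frame(id = c("A", "B"), weight = c(10.2, 14))

# Write to a temporary directory so the sketch has no side effects.
out_dir <- tempdir()
csv_file <- file.path(out_dir, paste0(valid_file_pattern, ".csv"))
rds_file <- file.path(out_dir, paste0(valid_file_pattern, ".rds"))

write_csv(toy, csv_file)
write_rds(toy, rds_file)

basename(c(csv_file, rds_file))
```

The rds file preserves R column types (factors, dates, integers), while the csv file is the portable option for other tools.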
Checking the Relative Coverage
Transforming the Data
Removing Duplicates by Year
As described in Silva et al. (2023, p. 4), to calculate SISVAN’s total resident population coverage, only the most recent record for each individual within each year is retained for analysis.
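The rule above (keep only each individual's most recent record within a year) can be sketched with dplyr and lubridate. This is a minimal illustration on made-up records, not necessarily the exact code used by the pipeline; the toy tibble and its values are assumptions.

```r
library(dplyr)
library(lubridate)

# Toy records (made-up values): individual "A" was assessed twice in 2017.
toy <- tibble(
  id = c("A", "A", "B"),
  date = as.Date(c("2017-02-01", "2017-09-15", "2017-05-20")),
  weight = c(10.2, 11.5, 14)
)

latest <- toy |>
  # Derive the assessment year from the assessment date.
  mutate(year = year(date), .after = date) |>
  # Within each individual and year, keep the row with the latest date.
  # `with_ties = FALSE` guarantees a single row even if two records share
  # the same date.
  slice_max(date, n = 1, with_ties = FALSE, by = c(id, year))

latest
```

After this step each individual contributes at most one row per year, so counting rows per municipality counts individuals rather than assessments.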
data |> dplyr::glimpse()
#> Rows: 4,622,727
#> Columns: 14
#> $ id <chr> "1B7842AF30A5899C2B6D82688E95EDCD96355BA1", "9A35…
#> $ date <date> 2017-12-31, 2017-12-31, 2017-12-31, 2017-12-31, …
#> $ year <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2…
#> $ region_code <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ region <chr> "North", "North", "North", "North", "North", "Nor…
#> $ state_code <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ state <chr> "Rondônia", "Rondônia", "Rondônia", "Rondônia", "…
#> $ federal_unit <chr> "RO", "RO", "RO", "RO", "RO", "RO", "RO", "RO", "…
#> $ municipality_code <int> 110002, 110002, 110002, 110002, 110002, 110002, 1…
#> $ municipality <chr> "Ariquemes", "Ariquemes", "Ariquemes", "Ariquemes…
#> $ sex <fct> male, male, male, male, male, male, male, male, m…
#> $ age <int> 0, 1, 1, 2, 2, 2, 2, 3, 4, 1, 2, 3, 4, 1, 1, 1, 2…
#> $ weight <dbl> 7, 12, 13, 12, 13, NA, NA, 17, 18, NA, NA, 19, NA…
#> $ height <int> 63, 82, 85, 90, 90, 87, 88, 117, 107, 61, 96, 107…
Summarizing the Data by Year
data |> dplyr::glimpse()
#> Rows: 5,570
#> Columns: 8
#> $ year <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2…
#> $ region_code <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ state_code <int> 11, 11, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13, 1…
#> $ municipality_code <int> 110002, 110005, 110025, 110100, 120025, 120040, 1…
#> $ coverage <int> 1577, 622, 887, 334, 557, 7291, 995, 3741, 2172, …
#> $ mean_age <dbl> 2.581483830, 2.223472669, 2.278466742, 2.46107784…
#> $ mean_weight <dbl> 15.17615741, 13.98031309, 14.23404969, 14.7764350…
#> $ mean_height <dbl> 93.11921370, 91.22508039, 91.10496614, 93.9251497…
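The summary above has one row per municipality and year. A plausible way to obtain columns of this shape is sketched below, assuming that coverage is the number of retained individuals in each municipality and that the means ignore missing values. The toy tibble and its values are made up for illustration.

```r
library(dplyr)

# Toy individual-level records (made-up values), one row per individual.
toy <- tibble(
  year = 2017,
  region_code = 1L,
  state_code = 11L,
  municipality_code = c(110001L, 110001L, 110002L),
  age = c(1L, 3L, 2L),
  weight = c(10, NA, 13),
  height = c(76L, 96L, 88L)
)

summary_data <- toy |>
  summarize(
    coverage = n(), # Individuals recorded in the municipality
    mean_age = mean(age, na.rm = TRUE),
    mean_weight = mean(weight, na.rm = TRUE),
    mean_height = mean(height, na.rm = TRUE),
    .by = c(year, region_code, state_code, municipality_code)
  )

summary_data
```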
Adding Population Estimates
As described in the Methods section, the population estimates were obtained from the DATASUS platform, which provides annual data by municipality, age, and sex for Brazil from 2000 to 2024 (Comitê de Gestão de Indicadores et al., n.d.).
To ensure reproducibility and organization, the DATASUS data used in this pipeline are processed and validated through a separate reproducible pipeline, available here (Vartanian & Carvalho, 2025). The validated datasets are downloaded directly from OSF. For further details, refer to the linked pipeline.
datasus_file_pattern <- paste0("datasus-pop-estimates-", year)
datasus_file <- here::here("data", paste0(datasus_file_pattern, ".rds"))
if (!checkmate::test_file_exists(datasus_file)) {
osf_id <-
paste0("https://osf.io/", "h3pyd") |>
osfr::osf_retrieve_node() |>
osfr::osf_ls_files(
type = "file",
pattern = paste0("valid-", year, ".rds")
)
osfr::osf_download(
x = osf_id,
path = tempdir(),
conflicts = "overwrite"
) |>
dplyr::pull(local_path) |>
fs::file_move(datasus_file)
}
pop_estimates <- datasus_file |> readr::read_rds()
data <-
pop_estimates |>
dplyr::filter(dplyr::between(age, age_limits[1], age_limits[2])) |>
dplyr::summarize(
n = n |> sum(na.rm = TRUE),
.by = c(
"year",
"region_code",
"state_code",
"municipality_code"
)
) |>
dplyr::mutate(
municipality_code =
municipality_code |>
stringr::str_sub(end = -2) |>
as.integer()
) |>
dplyr::right_join(
data,
by = c(
"year",
"region_code",
"state_code",
"municipality_code"
)
) |>
dplyr::rename(population = n) |>
dplyr::relocate(population, .before = coverage)
data |> dplyr::glimpse()
#> Rows: 5,570
#> Columns: 9
#> $ year <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2…
#> $ region_code <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ state_code <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ municipality_code <int> 110001, 110002, 110003, 110004, 110005, 110006, 1…
#> $ population <int> 1808, 8154, 426, 6551, 1258, 1242, 604, 1353, 250…
#> $ coverage <int> 734, 1577, 94, 1334, 622, 374, 223, 584, 702, 107…
#> $ mean_age <dbl> 2.179836512, 2.581483830, 2.053191489, 2.11544227…
#> $ mean_weight <dbl> 13.63854669, 15.17615741, 14.22727273, 13.3930077…
#> $ mean_height <dbl> 90.91256831, 93.11921370, 91.26595745, 89.4924698…
Validating the Data
The population value used here is an estimate. If the SISVAN coverage for a municipality exceeds the estimated population, the population value is adjusted to match the coverage.
Note: At this stage, only the most recent record for each individual is retained.
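The adjustment described above can be expressed as a single conditional replacement. The sketch below is one way to write it with dplyr, not necessarily the pipeline's own code; the first row uses values shown in this report, while the second row is made up to trigger the adjustment.

```r
library(dplyr)

toy <- tibble(
  municipality_code = c(110001L, 110002L),
  population = c(1808L, 90L), # Second value is made up
  coverage = c(734L, 120L)    # Second value is made up
)

adjusted <- toy |>
  mutate(
    # When more individuals were recorded than estimated, raise the
    # population to the coverage so the ratio cannot exceed 100%.
    population = if_else(coverage > population, coverage, population)
  )

adjusted
```

With this rule, the relative coverage of any municipality is capped at 100%.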
data |> dplyr::glimpse()
#> Rows: 5,570
#> Columns: 9
#> $ year <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2…
#> $ region_code <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ state_code <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ municipality_code <int> 110001, 110002, 110003, 110004, 110005, 110006, 1…
#> $ population <int> 1808, 8154, 426, 6551, 1258, 1242, 604, 1353, 250…
#> $ coverage <int> 734, 1577, 94, 1334, 622, 374, 223, 584, 702, 107…
#> $ mean_age <dbl> 2.179836512, 2.581483830, 2.053191489, 2.11544227…
#> $ mean_weight <dbl> 13.63854669, 15.17615741, 14.22727273, 13.3930077…
#> $ mean_height <dbl> 90.91256831, 93.11921370, 91.26595745, 89.4924698…
Calculating Relative Coverage
data |> dplyr::glimpse()
#> Rows: 5,570
#> Columns: 10
#> $ year <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2…
#> $ region_code <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ state_code <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ municipality_code <int> 110001, 110002, 110003, 110004, 110005, 110006, 1…
#> $ population <int> 1808, 8154, 426, 6551, 1258, 1242, 604, 1353, 250…
#> $ coverage <int> 734, 1577, 94, 1334, 622, 374, 223, 584, 702, 107…
#> $ coverage_per <dbl> 40.597345133, 19.340201128, 22.065727700, 20.3633…
#> $ mean_age <dbl> 2.179836512, 2.581483830, 2.053191489, 2.11544227…
#> $ mean_weight <dbl> 13.63854669, 15.17615741, 14.22727273, 13.3930077…
#> $ mean_height <dbl> 90.91256831, 93.11921370, 91.26595745, 89.4924698…
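The coverage_per values shown above are consistent with dividing coverage by population and multiplying by 100, the same formula used in the regional and state tables of this report. A small sketch, using the first two municipalities from the output above:

```r
library(dplyr)

# Population and coverage values taken from the first two rows of the
# output shown in this report.
toy <- tibble(
  municipality_code = c(110001L, 110002L),
  population = c(1808L, 8154L),
  coverage = c(734L, 1577L)
)

relative <- toy |>
  mutate(
    coverage_per = (coverage / population) * 100,
    .after = coverage
  )

# Rounded for display.
round(relative$coverage_per, 2)
```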
Arranging the Data
data <-
data |>
dplyr::arrange(
year,
region_code,
state_code,
municipality_code
)
data |> dplyr::glimpse()
#> Rows: 5,570
#> Columns: 10
#> $ year <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2…
#> $ region_code <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ state_code <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ municipality_code <int> 110001, 110002, 110003, 110004, 110005, 110006, 1…
#> $ population <int> 1808, 8154, 426, 6551, 1258, 1242, 604, 1353, 250…
#> $ coverage <int> 734, 1577, 94, 1334, 622, 374, 223, 584, 702, 107…
#> $ coverage_per <dbl> 40.597345133, 19.340201128, 22.065727700, 20.3633…
#> $ mean_age <dbl> 2.179836512, 2.581483830, 2.053191489, 2.11544227…
#> $ mean_weight <dbl> 13.63854669, 15.17615741, 14.22727273, 13.3930077…
#> $ mean_height <dbl> 90.91256831, 93.11921370, 91.26595745, 89.4924698…
Checking Relative Coverage by Region
The coverage observed here is slightly lower than that reported in Silva et al. (2023, Table 2). This difference may be explained by the use of different data sources (Fundação Oswaldo Cruz (Fiocruz) vs. OpenDataSUS).
data |>
dplyr::mutate(region = orbis::get_brazil_region(region_code)) |>
dplyr::summarize(
population = population |> sum(na.rm = TRUE),
coverage = coverage |> sum(na.rm = TRUE),
.by = "region"
) |>
dplyr::slice(c(1, 2, 5, 3, 4)) |>
dplyr::mutate(coverage_per = (coverage / population) * 100) |>
dplyr::rename(
Region = region,
Population = population,
`SISVAN coverage` = coverage,
`SISVAN coverage (%)` = coverage_per
) |>
pal::pipe_table() |>
pal::cat_lines()
Region | Population | SISVAN coverage | SISVAN coverage (%) |
---|---|---|---|
North | 1592792 | 624150 | 39.18590751 |
Northeast | 4107294 | 1813679 | 44.15751587 |
Central-West | 1208858 | 268745 | 22.23131253 |
Southeast | 5767592 | 1336352 | 23.17001619 |
South | 1975068 | 579801 | 29.35600192 |
Checking Relative Coverage by State
data |>
dplyr::mutate(state = orbis::get_brazil_state(state_code)) |>
dplyr::summarize(
population = population |> sum(na.rm = TRUE),
coverage = coverage |> sum(na.rm = TRUE),
.by = "state"
) |>
dplyr::arrange(state) |>
dplyr::mutate(coverage_per = (coverage / population) * 100) |>
dplyr::rename(
State = state,
Population = population,
`SISVAN coverage` = coverage,
`SISVAN coverage (%)` = coverage_per
) |>
pal::pipe_table() |>
pal::cat_lines()
State | Population | SISVAN coverage | SISVAN coverage (%) |
---|---|---|---|
Acre | 81517 | 36178 | 44.380926678 |
Alagoas | 253571 | 113343 | 44.698723435 |
Amapá | 79072 | 20820 | 26.330433023 |
Amazonas | 403287 | 165140 | 40.948505655 |
Bahia | 1012762 | 444864 | 43.925818702 |
Ceará | 645357 | 281507 | 43.620352766 |
Distrito Federal | 218010 | 18757 | 8.603733774 |
Espírito Santo | 277541 | 61507 | 22.161410386 |
Goiás | 491920 | 103909 | 21.123150106 |
Maranhão | 578369 | 283802 | 49.069365751 |
Mato Grosso | 280532 | 79026 | 28.170048337 |
Mato Grosso do Sul | 218396 | 67053 | 30.702485394 |
Minas Gerais | 1309142 | 620924 | 47.429843363 |
Paraná | 786664 | 253352 | 32.205871884 |
Paraíba | 285820 | 150132 | 52.526765097 |
Pará | 710233 | 290899 | 40.958248913 |
Pernambuco | 692218 | 259659 | 37.511159779 |
Piauí | 234935 | 125698 | 53.503309426 |
Rio Grande do Norte | 237059 | 84896 | 35.812181778 |
Rio Grande do Sul | 707754 | 177919 | 25.138536836 |
Rio de Janeiro | 1122656 | 181772 | 16.191246473 |
Rondônia | 135254 | 33899 | 25.063214397 |
Roraima | 59647 | 16842 | 28.236122521 |
Santa Catarina | 480650 | 148530 | 30.901903672 |
Sergipe | 167203 | 69778 | 41.732504800 |
São Paulo | 3058253 | 472149 | 15.438519965 |
Tocantins | 123782 | 60372 | 48.772842578 |
Visualizing the Relative Coverage
brand_div_palette <- function(x) {
brandr:::make_color_ramp(
n_prop = x,
colors = c(
brandr::get_brand_color("dark-red"),
# brandr::get_brand_color("white"),
brandr::get_brand_color_mix(
position = 950,
color_1 = "dark-red",
color_2 = "dark-red-triadic-blue",
alpha = 0.5
),
brandr::get_brand_color("dark-red-triadic-blue")
)
)
}
data |>
tidyr::drop_na(coverage_per) |>
plotr:::plot_hist(
col = "coverage_per",
density_line_color = brandr::get_brand_color("red"),
x_label = "Coverage (%)",
print = FALSE
) +
ggplot2::labs(
title = "SISVAN Coverage by Municipality (%)",
subtitle = paste0("Year: ", year),
caption = "Source: SISVAN"
)
data |>
tidyr::drop_na(coverage_per, municipality_code) |>
plotr:::plot_brazil_municipality(
col_fill = "coverage_per",
col_code = "municipality_code",
year = plotr:::get_closest_geobr_year(year, type = "municipality"),
comparable_areas = FALSE,
breaks = seq(0, 100, 25),
limits = c(0, 100),
palette = brand_div_palette,
print = FALSE
) +
ggplot2::labs(
title = "SISVAN Coverage by Municipality (%)",
subtitle = paste0("Year: ", year),
caption = "Source: SISVAN"
)
#> Scale on map varies by more than 10%, scale bar may be inaccurate
How to Cite
To cite this work, please use the following format:
Vartanian, D., Schettino, J. P. J., & Carvalho, A. M. (2025). A reproducible pipeline for processing SISVAN microdata on nutritional status monitoring in Brazil (2008-2023) [Report]. Sustentarea Research and Extension Group at the University of São Paulo. https://sustentarea.github.io/sisvan-nutritional-status
A BibTeX entry for LaTeX users is
@techreport{vartanian2025,
title = {A reproducible pipeline for processing SISVAN microdata on nutritional status monitoring in Brazil (2008-2023)},
author = {{Daniel Vartanian} and {João Pedro Junqueira Schettino} and {Aline Martins de Carvalho}},
year = {2025},
address = {São Paulo},
institution = {Sustentarea Research and Extension Group at the University of São Paulo},
langid = {en},
url = {https://sustentarea.github.io/sisvan-nutritional-status}
}
License
The code in this report is licensed under the MIT License, while the documents are available under the Creative Commons Attribution 4.0 International License.
Acknowledgments
This work is part of the Sustentarea Research and Extension Group project: Global syndemic: The impact of anthropogenic climate change on the health and nutrition of children under five years old attended by Brazil’s public health system (SUS).
This work was supported by the Conselho Nacional de Desenvolvimento Científico e Tecnológico - Brazil (CNPq).