library(anthro)
library(brandr)
library(cli)
library(dplyr)
library(forcats)
library(fs)
library(geobr)
library(ggplot2)
library(ggspatial)
library(groomr) # github.com/danielvartan/groomr
library(here)
library(htmltools)
library(httr2)
library(janitor)
library(knitr)
library(labelled)
library(lubridate)
library(magrittr) # provides extract2(), multiply_by(), and divide_by() used below
library(nanoparquet)
library(orbis) # github.com/danielvartan/orbis
library(osfr)
library(pal) # gitlab.com/rpkg.dev/pal
library(parallel)
library(quartabs)
library(readr)
library(rutils) # github.com/danielvartan/rutils
library(scales)
library(sf)
library(stringr)
library(tidyr)
library(utils)
library(vroom)
library(zip)
A Reproducible Pipeline for Processing and Analyzing SISVAN Microdata on Nutritional Status Monitoring in Brazil
Overview
This report provides a reproducible pipeline for processing and analyzing the microdata on nutritional status monitoring in Brazil from the Brazilian Food and Nutrition Surveillance System (SISVAN), focusing on the nutritional status of children aged 0–5 years (i.e., younger than 60 months).
If you are working with other age groups, you will need to adapt the code accordingly. We provide some guidance on how to do this throughout the report.
For instructions on how to run the pipeline, see the repository README.
Click here to see a report with a longitudinal analysis of the processed data.
Problem
The Food and Nutrition Surveillance System (SISVAN) is a strategic tool for monitoring the nutritional status of the Brazilian population, particularly those served by Brazil’s Unified Health System (SUS). However, despite its broad scope and importance, the anthropometric data recorded in SISVAN often suffer from accessibility and quality issues that limit their usefulness for rigorous analyses and evidence-based policymaking (Silva et al., 2023).
Multiple factors contribute to these quality concerns, including the lack of standardized measurement protocols, variability in staff training, inconsistencies in data entry and processing, and incomplete population coverage (Bagni & Barros, 2015; Corsi et al., 2017; Perumal et al., 2020). To assess and improve data quality, several indicators have been proposed and applied, such as population coverage (Mourão et al., 2020; Nascimento et al., 2017), completeness of birth dates and anthropometric measurements (Finaret & Hutchinson, 2018; Nannan et al., 2019), digit preference for age, height, and weight (Bopp & Faeh, 2008; Lyons-Amos & Stones, 2017), the percentage of biologically implausible values (Lawman et al., 2015), and the dispersion and distribution of standardized weight and height measurements (Mei, 2007; Perumal et al., 2020).
In light of these challenges, there is a need for an open and reproducible pipeline to process SISVAN microdata. Such a pipeline should facilitate broader access to the data and systematically identify, correct, and remove problematic records, thereby improving the consistency, completeness, and plausibility of the information for research and policymaking.
Data Availability
The processed data are available in csv, rds, and parquet formats via a dedicated repository on the Open Science Framework (OSF), accessible here. Each dataset is accompanied by a metadata file describing its structure and contents.
You can also retrieve these files directly from R using the osfr package.
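As a minimal sketch of that retrieval, the helper below lists the files of an OSF node whose names match a pattern and downloads them. The function name `download_osf_files()` and its arguments are hypothetical, and the node identifier must be taken from the OSF repository page, since it is not stated here. Only `osf_retrieve_node()`, `osf_ls_files()`, and `osf_download()` are actual osfr functions.

```r
# Hypothetical helper (not part of the pipeline): download every file in an
# OSF node whose name matches `pattern`.
download_osf_files <- function(node_id, pattern, path = ".") {
  node_id |>
    osfr::osf_retrieve_node() |>
    osfr::osf_ls_files(type = "file", pattern = pattern) |>
    osfr::osf_download(path = path, conflicts = "overwrite")
}

# Example call (requires the osfr package, network access, and the real
# node identifier in place of the placeholder):
# download_osf_files("<node-id>", pattern = "2023", path = tempdir())
```

Calls are namespaced with `osfr::` so the helper can be defined without attaching the package first.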
Methods
Source of Data
The data used in this report come from the following sources:
- Brazilian Food and Nutrition Surveillance System (SISVAN):
  - Microdata on nutritional status monitoring in Brazil (Sistema de Vigilância Alimentar e Nutricional et al., n.d.), the primary dataset for this pipeline.
- Brazilian Institute of Geography and Statistics (IBGE):
  - Official codes and metadata for Brazilian municipalities, incorporated via the geobr R package (Pereira & Goncalves, n.d.), used to normalize IBGE municipality codes and enrich the analysis with geographic information.
- Department of Informatics of the Brazilian Unified Health System (DATASUS):
  - Annual population estimates by municipality, age, and sex for Brazil (Comitê de Gestão de Indicadores et al., n.d.), used to calculate SISVAN’s population coverage.
The DATASUS population estimates used in this pipeline are processed through a separate reproducible workflow, available here (Vartanian & Carvalho, 2025).
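The municipality-code normalization mentioned in the list above can be illustrated with a small base R sketch. SISVAN records carry six-digit municipality codes, while the IBGE lookup uses seven-digit codes whose last digit is a check digit. The data frames below are toy examples built from three codes shown later in this report, and the pipeline itself performs this step with a dplyr join.

```r
# Toy IBGE lookup with seven-digit codes.
lookup <- data.frame(
  municipality_code = c(1100015, 1100023, 1100031),
  municipality = c("Alta Floresta D'Oeste", "Ariquemes", "Cabixi")
)

# Dropping the last (check) digit yields the six-digit code used by SISVAN.
lookup$municipality_code_6 <- lookup$municipality_code %/% 10

# Toy SISVAN records; 999999 is a deliberately invalid code.
records <- data.frame(municipality_code_6 = c(110002, 110003, 999999))

# Look up the seven-digit code; unmatched codes become NA.
records$municipality_code <- lookup$municipality_code[
  match(records$municipality_code_6, lookup$municipality_code_6)
]
```

Codes without a match stay missing rather than being dropped, which mirrors the behavior of a left join.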
Data Munging
The data munging follows the data science workflow outlined by Wickham et al. (2023), as illustrated in Figure 1. All steps were carried out using the Quarto publishing system, along with the AWK (Aho et al., 2023) and R (R Core Team, n.d.) programming languages, supported by several R packages.
For data manipulation and workflow, priority was given to packages from the tidyverse, rOpenSci and r-spatial ecosystems, as well as other packages adhering to the tidy tools manifesto (Wickham, 2023).
Source: Reproduced from Wickham et al. (2023).
Data Validation
The validation steps described below are specifically designed for children aged 0–5 years. If you are working with older children or adolescents (ages 5–19 years), you should adapt the code accordingly. For these age groups, we recommend using the WHO’s anthroplus R package (Dirk Schumacher, n.d.-b).
Different validation techniques were used to ensure data quality and reliability:
- Duplicate records were removed based on unique combinations of the SISVAN identifier (id) and assessment date (date). Only the latest record for each individual on a given date was retained.
- Weight and height measurements identified as biologically implausible values (BIVs) according to World Health Organization (WHO) child growth standards (World Health Organization, 2006, 2008) were set to missing. BIVs were detected by calculating z-scores using the anthro_zscores function from the WHO anthro R package (Dirk Schumacher, n.d.-a), based on weight, height, age, and sex. Implausible values were flagged when z-scores exceeded established WHO cutoffs (typically \(|z| > 5\)). For details, see the function documentation.
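The two validation steps above can be sketched in base R on toy data. This is an illustration under stated assumptions, not the pipeline's own code: "latest" is assumed to mean the last row in file order, and the `fwei` column is filled in by hand to stand in for the weight-for-age flag that `anthro_zscores()` returns.

```r
# Toy records: child "A" was entered twice on the same date.
records <- data.frame(
  id = c("A", "A", "B", "C"),
  date = as.Date(c("2023-01-09", "2023-01-09", "2023-01-09", "2023-01-10")),
  weight = c(9.7, 9.9, 11.9, 95.0) # kilograms
)

# Keep only the last row of each (id, date) pair (assumed meaning of
# "latest record" in this sketch).
is_last <- !duplicated(records[c("id", "date")], fromLast = TRUE)
deduplicated <- records[is_last, ]

# Hand-written stand-in for the `fwei` flag (1 = implausible z-score).
deduplicated$fwei <- c(0L, 0L, 1L)

# Set flagged weights to missing instead of dropping the whole record.
deduplicated$weight[deduplicated$fwei == 1L] <- NA_real_
```

Setting a flagged measurement to missing keeps the record available for indicators that do not depend on that measurement.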
Data Categorization
Nutritional status categories are ideally determined using z-scores, as recommended by the WHO child growth standards (World Health Organization, 2006, Section C). However, SISVAN data report age only in years, rather than in days or months as required for accurate z-score calculation. This limitation introduces substantial classification error if z-scores are computed directly. Therefore, we use the nutritional status categories already provided in the SISVAN microdata and set these categories to missing when biologically implausible values (BIVs) were identified.
Code Style
The Tidyverse Tidy Tools Manifesto (Wickham, 2023), code style guide (Wickham, n.d.-a) and design principles (Wickham, n.d.-b) were followed to ensure consistency and enhance readability.
Reproducibility
The pipeline is fully reproducible and can be run again at any time. To ensure consistent results, the renv package (Ushey & Wickham, n.d.) is used to manage and restore the R environment. See the README file in the code repository to learn how to run it.
Set Environment
Load Packages
Set Data Directories
for (i in c(raw_data_dir, data_dir)) {
  if (!dir_exists(i)) dir_create(i, recurse = TRUE)
}
Set Initial Variables
The year variable represents the year of the consolidated SISVAN dataset on nutritional status.
year <- 2023
The age_limits variable defines the age range (in years) of individuals to be included in the analysis.
age_limits <- c(0, 4) # == Less than 5 years
col_selection <- c(
  "CO_PESSOA_SISVAN",
  "DT_ACOMPANHAMENTO",
  "CO_MUNICIPIO_IBGE",
  "CO_CNES",
  "SG_SEXO",
  "NU_IDADE_ANO",
  "CO_RACA_COR",
  "NU_PESO",
  "NU_ALTURA",
  "PESO X IDADE",
  "PESO X ALTURA",
  "CRI. ALTURA X IDADE",
  "CRI. IMC X IDADE"
)
Download and Import IBGE Municipalities Data
See the Source of Data section for more information.
municipalities_data <- brazil_municipality(year = year)
#> ! The closest map year to 2023 is 2022. Using year 2022 instead.
#> Using year/date 2022
municipalities_data |> glimpse()
#> Rows: 5,570
#> Columns: 9
#> $ region_code <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ region <chr> "North", "North", "North", "North", "North", "Nor…
#> $ state_code <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ state <chr> "Rondônia", "Rondônia", "Rondônia", "Rondônia", "…
#> $ federal_unit <chr> "RO", "RO", "RO", "RO", "RO", "RO", "RO", "RO", "…
#> $ municipality_code <dbl> 1100015, 1100023, 1100031, 1100049, 1100056, 1100…
#> $ municipality <chr> "Alta Floresta D'Oeste", "Ariquemes", "Cabixi", "…
#> $ latitude <dbl> -11.935540305, -9.908462867, -13.499763460, -11.4…
#> $ longitude <dbl> -61.99982390, -63.03326928, -60.54431358, -61.442…
Download DATASUS Population Estimates
See the Source of Data section for more information.
List Files
datasus_file_pattern <-
  "datasus-population-estimates-" |>
  paste0(year)
osf_raw_data_id <- "h3pyd"
osf_raw_data_file <-
  osf_raw_data_id |>
  osf_retrieve_node() |>
  osf_ls_files(
    type = "file",
    pattern = paste0(year, ".rds")
  ) |>
  filter(str_detect(name, paste0("^", year, "\\.rds$")))
osf_raw_data_file
Download Data
osf_raw_data_file |>
  osf_download(
    path = raw_data_dir,
    conflicts = "overwrite"
  ) |>
  pull(local_path)
#> [1] "data-raw/2023.rds"
Rename File
if (file_exists(datasus_file)) {
  datasus_file |> file_delete()
}
Import DATASUS Population Estimates
population_estimates_data <- datasus_file |> read_rds()
population_estimates_data |> glimpse()
#> Rows: 902,340
#> Columns: 5
#> $ year <int> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2…
#> $ municipality_code <int> 1100015, 1100015, 1100015, 1100015, 1100015, 1100…
#> $ sex <fct> male, male, male, male, male, male, male, male, m…
#> $ age <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
#> $ population <int> 171, 170, 170, 172, 178, 178, 173, 175, 179, 178,…
Download SISVAN Microdata on Nutritional Status
See the Source of Data section for more information.
The microdata files are very large. For practical reasons, some code chunks have eval: false set to prevent downloading the data each time the report is rendered. When running the pipeline in a loop or for full automation, remove these lines to enable automatic downloading.
Download Data
file <-
  "sisvan_estado_nutricional_" |>
  paste0(year, ".zip")
"https://s3.sa-east-1.amazonaws.com/ckan.saude.gov.br" |>
  path(
    "SISVAN",
    "estado_nutricional",
    file
  ) |>
  request() |>
  req_progress() |>
  req_perform(here(raw_data_dir, file))
Unzip Data
Delete Zip Files
raw_data_dir |>
  dir_ls(type = "file", regexp = "\\.zip$") |>
  file_delete()
Check Data Dimensions
file <- file |> str_replace("\\.zip$", "\\.csv")
raw_data_dir |>
  here(file) |>
  peek_csv_file(
    delim = ";",
    skip = 0,
    has_header = TRUE
  )
#> The file has 34 columns, 53,981,528 rows, and 1,835,371,952 cells.
Import and Filter Data
The vroom R package and the AWK programming language were used together to handle large datasets efficiently and mitigate memory issues. This approach allows the pipeline to run locally on most machines, though we recommend a minimum of 12 GB of RAM for optimal performance. Alternatively, the pipeline can also be executed on cloud platforms such as Google Colab and RStudio Cloud, or using GitHub Actions large runners.
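To make the AWK pre-filtering idea concrete, here is a small sketch that assembles the shell command as a string. The helper `build_awk_filter()` and the example file path are hypothetical, and the pipeline builds an equivalent command inline with `paste0()`. The point is that AWK discards rows outside the age range before R reads anything, so only the filtered rows reach memory.

```r
# Hypothetical helper: build an AWK command that prints only the rows whose
# field `column` lies between `lower` and `upper` (inclusive).
build_awk_filter <- function(file, column, lower, upper, sep = ";") {
  stopifnot(is.numeric(column), is.numeric(lower), is.numeric(upper))
  program <- sprintf(
    "{ if (($%d >= %s) && ($%d <= %s)) { print } }",
    as.integer(column), format(lower),
    as.integer(column), format(upper)
  )
  # shQuote() protects the separator, program, and path from the shell.
  paste("awk", "-F", shQuote(sep), shQuote(program), shQuote(file))
}

# Field 8 is NU_IDADE_ANO (age in years) in the SISVAN file layout.
command <- build_awk_filter(
  file = "data-raw/example.csv", # placeholder path
  column = 8,
  lower = 0,
  upper = 4
)
```

The resulting string can be wrapped in `pipe()` and passed as the `file` argument of a reader. Because the header row does not pass the filter, column names must then be supplied explicitly.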
Define Column Names and Schema
col_names <- c(
"CO_ACOMPANHAMENTO",
"CO_PESSOA_SISVAN",
"ST_PARTICIPA_ANDI",
"CO_MUNICIPIO_IBGE",
"SG_UF",
"NO_MUNICIPIO",
"CO_CNES",
"NU_IDADE_ANO",
"NU_FASE_VIDA",
"DS_FASE_VIDA",
"SG_SEXO",
"CO_RACA_COR",
"DS_RACA_COR",
"CO_POVO_COMUNIDADE",
"DS_POVO_COMUNIDADE",
"CO_ESCOLARIDADE",
"DS_ESCOLARIDADE",
"DT_ACOMPANHAMENTO",
"NU_COMPETENCIA",
"NU_PESO",
"NU_ALTURA",
"DS_IMC",
"DS_IMC_PRE_GESTACIONAL",
"PESO X IDADE",
"PESO X ALTURA",
"CRI. ALTURA X IDADE",
"CRI. IMC X IDADE",
"ADO. ALTURA X IDADE",
"ADO. IMC X IDADE",
"CO_ESTADO_NUTRI_ADULTO",
"CO_ESTADO_NUTRI_IDOSO",
"CO_ESTADO_NUTRI_IMC_SEMGEST",
"CO_SISTEMA_ORIGEM_ACOMP",
"SISTEMA_ORIGEM_ACOMP"
)
schema <- cols(
"CO_ACOMPANHAMENTO" = col_character(),
"CO_PESSOA_SISVAN" = col_character(),
"ST_PARTICIPA_ANDI" = col_character(),
"CO_MUNICIPIO_IBGE" = col_integer(),
"SG_UF" = col_factor(),
"NO_MUNICIPIO" = col_character(),
"CO_CNES" = col_integer(),
"NU_IDADE_ANO" = col_integer(),
"NU_FASE_VIDA" = col_character(), # decimal mark = "." (double)
"DS_FASE_VIDA" = col_factor(),
"SG_SEXO" = col_factor(),
"CO_RACA_COR" = col_character(),
"DS_RACA_COR" = col_factor(),
"CO_POVO_COMUNIDADE" = col_integer(),
"DS_POVO_COMUNIDADE" = col_factor(),
"CO_ESCOLARIDADE" = col_character(),
"DS_ESCOLARIDADE" = col_factor(),
"DT_ACOMPANHAMENTO" = col_date(),
"NU_COMPETENCIA" = col_integer(),
"NU_PESO" = col_double(),
"NU_ALTURA" = col_integer(),
"DS_IMC" = col_double(),
"DS_IMC_PRE_GESTACIONAL" = col_character(), # decimal mark = "." (double)
"PESO X IDADE" = col_factor(),
"PESO X ALTURA" = col_factor(),
"CRI. ALTURA X IDADE" = col_factor(),
"CRI. IMC X IDADE" = col_factor(),
"ADO. ALTURA X IDADE" = col_factor(),
"ADO. IMC X IDADE" = col_factor(),
"CO_ESTADO_NUTRI_ADULTO" = col_factor(),
"CO_ESTADO_NUTRI_IDOSO" = col_factor(),
"CO_ESTADO_NUTRI_IMC_SEMGEST" = col_factor(),
"CO_SISTEMA_ORIGEM_ACOMP" = col_integer(),
"SISTEMA_ORIGEM_ACOMP" = col_factor()
)
Import and Filter Data
You may see warning messages about failed parsing. These warnings are expected due to minor inconsistencies in the SISVAN raw data and do not affect the overall analysis.
data <-
  vroom(
    file = pipe(
      paste0(
        "awk ",
        "-F ", # Field separator
        "';' ",
        "'{", # Program
        "if (",
        "($8 >= ",
        age_limits[1],
        ")",
        " && ",
        "($8 <= ",
        age_limits[2],
        ")",
        ") ",
        "{print}",
        "}' ",
        raw_data_dir |> here(file) # Input file
      )
    ),
    delim = ";",
    col_names = col_names,
    col_types = schema,
    col_select = all_of(col_selection),
    na = c("", "NA"),
    locale = locale(
      date_names = "pt",
      date_format = "%d/%m/%Y",
      time_format = "%H:%M:%S",
      decimal_mark = ",",
      grouping_mark = ".",
      tz = "America/Sao_Paulo",
      encoding = raw_data_dir |>
        here(file) |>
        guess_encoding() |>
        extract2("encoding") |>
        magrittr::extract(1)
    ),
    guess_max = 100,
    num_threads = detectCores() |>
      multiply_by(0.75) |>
      floor(),
    progress = TRUE
  )
data |> glimpse()
#> Rows: 7,290,143
#> Columns: 13
#> $ CO_PESSOA_SISVAN <chr> "E58A0CFF79CFFAE2F4CCFB27996BAE3546A498DD", "…
#> $ DT_ACOMPANHAMENTO <date> 2023-01-09, 2023-01-05, 2023-01-04, 2023-01-…
#> $ CO_MUNICIPIO_IBGE <int> 150295, 431720, 351960, 521040, 353890, 42029…
#> $ CO_CNES <int> 2312670, 2254549, 373885, 2382482, 7260431, 7…
#> $ SG_SEXO <fct> M, M, M, F, F, F, M, M, M, F, F, F, F, M, F, …
#> $ NU_IDADE_ANO <int> 0, 0, 2, 4, 1, 0, 2, 3, 2, 0, 1, 4, 0, 1, 3, …
#> $ CO_RACA_COR <chr> "01", "01", "02", "04", "01", "99", "01", "03…
#> $ NU_PESO <dbl> 9.710, 4.000, 11.900, 34.000, 9.500, 7.639, 1…
#> $ NU_ALTURA <int> 74, 50, 87, 110, 75, 70, 85, 108, 95, NA, 90,…
#> $ `PESO X IDADE` <fct> Peso adequado para idade, Baixo peso para a i…
#> $ `PESO X ALTURA` <fct> Peso Adequado ou Eutrofico, Risco de sobrepes…
#> $ `CRI. ALTURA X IDADE` <fct> Estatura adequada para a idade, Muito baixa e…
#> $ `CRI. IMC X IDADE` <fct> Eutrofia, Eutrofia, Eutrofia, Obesidade, Eutr…
Tidy Data
Rename Columns
data <-
  data |>
  clean_names() |>
  rename(
    id = co_pessoa_sisvan,
    date = dt_acompanhamento,
    municipality_code = co_municipio_ibge,
    cnes = co_cnes,
    sex = sg_sexo,
    age = nu_idade_ano,
    ethnicity = co_raca_cor,
    weight = nu_peso,
    height = nu_altura,
    weight_for_age = peso_x_idade,
    weight_for_height = peso_x_altura,
    height_for_age = cri_altura_x_idade,
    bmi_for_age = cri_imc_x_idade
  )
data |> glimpse()
#> Rows: 7,290,143
#> Columns: 13
#> $ id <chr> "E58A0CFF79CFFAE2F4CCFB27996BAE3546A498DD", "097B…
#> $ date <date> 2023-01-09, 2023-01-05, 2023-01-04, 2023-01-04, …
#> $ municipality_code <int> 150295, 431720, 351960, 521040, 353890, 420290, 4…
#> $ cnes <int> 2312670, 2254549, 373885, 2382482, 7260431, 75694…
#> $ sex <fct> M, M, M, F, F, F, M, M, M, F, F, F, F, M, F, M, M…
#> $ age <int> 0, 0, 2, 4, 1, 0, 2, 3, 2, 0, 1, 4, 0, 1, 3, 0, 2…
#> $ ethnicity <chr> "01", "01", "02", "04", "01", "99", "01", "03", "…
#> $ weight <dbl> 9.710, 4.000, 11.900, 34.000, 9.500, 7.639, 12.20…
#> $ height <int> 74, 50, 87, 110, 75, 70, 85, 108, 95, NA, 90, 105…
#> $ weight_for_age <fct> Peso adequado para idade, Baixo peso para a idade…
#> $ weight_for_height <fct> Peso Adequado ou Eutrofico, Risco de sobrepeso, P…
#> $ height_for_age <fct> Estatura adequada para a idade, Muito baixa estat…
#> $ bmi_for_age <fct> Eutrofia, Eutrofia, Eutrofia, Obesidade, Eutrofia…
Standardize Columns
data <-
data |>
mutate(
sex = sex |>
as.character() |>
case_match(
"F" ~ "Female",
"M" ~ "Male"
) |>
factor(
levels = c("Male", "Female"),
ordered = FALSE
),
ethnicity = ethnicity |>
as.character() |>
case_match(
"01" ~ "White",
"02" ~ "Black",
"03" ~ "Yellow",
"04" ~ "Brown",
"05" ~ "Indigenous"
) |>
factor(
levels = c(
"White",
"Black",
"Yellow",
"Brown",
"Indigenous"
),
ordered = FALSE
),
weight_for_age = weight_for_age |>
as.character() |>
case_match(
"Muito baixo peso para a idade" ~ "Severely underweight",
"Baixo peso para a idade" ~ "Underweight",
"Peso adequado para idade" ~ "Normal",
"Peso elevado para a idade" ~ "High"
) |>
factor(
levels = c(
"Severely underweight",
"Underweight",
"Normal",
"High"
),
ordered = TRUE
),
weight_for_height = weight_for_height |>
as.character() |>
case_match(
"Magreza acentuada" ~ "Severe wasted",
"Magreza" ~ "Wasted",
"Peso Adequado ou Eutrofico" ~ "Normal",
"Risco de sobrepeso" ~ "Possible risk of overweight",
"Sobrepeso" ~ "Overweight",
"Obesidade" ~ "Obese"
) |>
factor(
levels = c(
"Severe wasted",
"Wasted",
"Normal",
"Possible risk of overweight",
"Overweight",
"Obese"
),
ordered = TRUE
),
height_for_age = height_for_age |>
as.character() |>
case_match(
"Muito baixa estatura para idade" ~ "Severely stunted",
"Baixa estatura para idade" ~ "Stunted",
"Estatura adequada para a idade" ~ "Normal"
) |>
factor(
levels = c(
"Severely stunted",
"Stunted",
"Normal"
),
ordered = TRUE
),
bmi_for_age = bmi_for_age |>
as.character() |>
case_match(
"Magreza acentuada" ~ "Severe wasted",
"Magreza" ~ "Wasted",
"Eutrofia" ~ "Normal",
"Risco de sobrepeso" ~ "Possible risk of overweight",
"Sobrepeso" ~ "Overweight",
"Obesidade" ~ "Obese"
) |>
factor(
levels = c(
"Severe wasted",
"Wasted",
"Normal",
"Possible risk of overweight",
"Overweight",
"Obese"
),
ordered = TRUE
)
)
data |> glimpse()
#> Rows: 7,290,143
#> Columns: 13
#> $ id <chr> "E58A0CFF79CFFAE2F4CCFB27996BAE3546A498DD", "097B…
#> $ date <date> 2023-01-09, 2023-01-05, 2023-01-04, 2023-01-04, …
#> $ municipality_code <int> 150295, 431720, 351960, 521040, 353890, 420290, 4…
#> $ cnes <int> 2312670, 2254549, 373885, 2382482, 7260431, 75694…
#> $ sex <fct> Male, Male, Male, Female, Female, Female, Male, M…
#> $ age <int> 0, 0, 2, 4, 1, 0, 2, 3, 2, 0, 1, 4, 0, 1, 3, 0, 2…
#> $ ethnicity <fct> White, White, Black, Brown, White, NA, White, Yel…
#> $ weight <dbl> 9.710, 4.000, 11.900, 34.000, 9.500, 7.639, 12.20…
#> $ height <int> 74, 50, 87, 110, 75, 70, 85, 108, 95, NA, 90, 105…
#> $ weight_for_age <ord> Normal, Underweight, Normal, High, Normal, Normal…
#> $ weight_for_height <ord> Normal, Possible risk of overweight, Normal, Obes…
#> $ height_for_age <ord> Normal, Severely stunted, Normal, Normal, Normal,…
#> $ bmi_for_age <ord> Normal, Normal, Normal, Obese, Normal, Normal, No…
Transform Data
Remove Duplicates
data |> glimpse()
#> Rows: 7,273,731
#> Columns: 13
#> $ id <chr> "49A374D16581329DA3BFB7B6852E8E1BA3F41C34", "E401…
#> $ date <date> 2023-12-31, 2023-12-31, 2023-12-31, 2023-12-31, …
#> $ municipality_code <int> 431175, 521180, 230280, 210350, 261110, 150580, 2…
#> $ cnes <int> 2247887, 5120659, 2478870, 2591200, 214752, 23128…
#> $ sex <fct> Male, Male, Female, Female, Male, Male, Male, Fem…
#> $ age <int> 2, 1, 1, 2, 4, 3, 1, 1, 4, 2, 1, 1, 3, 1, 1, 1, 1…
#> $ ethnicity <fct> White, White, Brown, Brown, White, Yellow, Brown,…
#> $ weight <dbl> 14.0, 11.0, 8.0, 11.5, 16.8, 7.3, 12.0, 8.0, 24.0…
#> $ height <int> 90, 70, 60, 86, 103, 98, 77, 70, 119, NA, 83, 72,…
#> $ weight_for_age <ord> Normal, Normal, Underweight, Normal, Normal, Seve…
#> $ weight_for_height <ord> Possible risk of overweight, Overweight, Overweig…
#> $ height_for_age <ord> Normal, Stunted, Severely stunted, Stunted, Norma…
#> $ bmi_for_age <ord> Normal, Obese, Obese, Normal, Normal, Severe wast…
Remove Biologically Implausible Values (BIVs)
See the Data Validation section for more information.
data <-
  data |>
  mutate(
    z_scores = anthro_zscores(
      # Factor levels are c("Male", "Female"), so this yields 1 and 2.
      sex = as.numeric(sex),
      # Age is only available in whole years, so it is converted to months.
      age = age * 12,
      is_age_in_month = TRUE,
      weight = weight,
      lenhei = height,
      measure = "h"
    ),
    # Flags returned by anthro_zscores(): 1 marks an implausible z-score.
    weight = if_else(
      (z_scores$fwei == 1) | (z_scores$flen != 1 & z_scores$fwfl == 1),
      NA,
      weight
    ),
    height = if_else(
      z_scores$flen == 1,
      NA,
      height
    ),
    # Categories that depend on a removed measurement are set to missing.
    weight_for_age = if_else(
      is.na(weight),
      NA,
      weight_for_age
    ),
    weight_for_height = if_else(
      is.na(weight) | is.na(height),
      NA,
      weight_for_height
    ),
    height_for_age = if_else(
      is.na(height),
      NA,
      height_for_age
    ),
    bmi_for_age = if_else(
      is.na(weight) | is.na(height),
      NA,
      bmi_for_age
    )
  ) |>
  select(-z_scores)
data |> glimpse()
#> Rows: 7,273,731
#> Columns: 13
#> $ id <chr> "49A374D16581329DA3BFB7B6852E8E1BA3F41C34", "E401…
#> $ date <date> 2023-12-31, 2023-12-31, 2023-12-31, 2023-12-31, …
#> $ municipality_code <int> 431175, 521180, 230280, 210350, 261110, 150580, 2…
#> $ cnes <int> 2247887, 5120659, 2478870, 2591200, 214752, 23128…
#> $ sex <fct> Male, Male, Female, Female, Male, Male, Male, Fem…
#> $ age <int> 2, 1, 1, 2, 4, 3, 1, 1, 4, 2, 1, 1, 3, 1, 1, 1, 1…
#> $ ethnicity <fct> White, White, Brown, Brown, White, Yellow, Brown,…
#> $ weight <dbl> 14.0, 11.0, 8.0, 11.5, 16.8, NA, 12.0, 8.0, 24.0,…
#> $ height <int> 90, 70, 60, 86, 103, 98, 77, 70, 119, NA, 83, 72,…
#> $ weight_for_age <ord> Normal, Normal, Underweight, Normal, Normal, NA, …
#> $ weight_for_height <ord> Possible risk of overweight, Overweight, Overweig…
#> $ height_for_age <ord> Normal, Stunted, Severely stunted, Stunted, Norma…
#> $ bmi_for_age <ord> Normal, Obese, Obese, Normal, Normal, NA, Overwei…
Fix Municipality Code
data <-
  data |>
  rename(municipality_code_6 = municipality_code) |>
  left_join(
    municipalities_data |>
      mutate(
        municipality_code_6 = municipality_code |>
          str_sub(1, 6) |>
          as.integer()
      ) |>
      select(municipality_code, municipality_code_6),
    by = join_by(municipality_code_6)
  ) |>
  select(-municipality_code_6) |>
  relocate(municipality_code, .after = date)
data |> glimpse()
#> Rows: 7,273,731
#> Columns: 13
#> $ id <chr> "49A374D16581329DA3BFB7B6852E8E1BA3F41C34", "E401…
#> $ date <date> 2023-12-31, 2023-12-31, 2023-12-31, 2023-12-31, …
#> $ municipality_code <dbl> 4311759, 5211800, 2302800, 2103505, 2611101, 1505…
#> $ cnes <int> 2247887, 5120659, 2478870, 2591200, 214752, 23128…
#> $ sex <fct> Male, Male, Female, Female, Male, Male, Male, Fem…
#> $ age <int> 2, 1, 1, 2, 4, 3, 1, 1, 4, 2, 1, 1, 3, 1, 1, 1, 1…
#> $ ethnicity <fct> White, White, Brown, Brown, White, Yellow, Brown,…
#> $ weight <dbl> 14.0, 11.0, 8.0, 11.5, 16.8, NA, 12.0, 8.0, 24.0,…
#> $ height <int> 90, 70, 60, 86, 103, 98, 77, 70, 119, NA, 83, 72,…
#> $ weight_for_age <ord> Normal, Normal, Underweight, Normal, Normal, NA, …
#> $ weight_for_height <ord> Possible risk of overweight, Overweight, Overweig…
#> $ height_for_age <ord> Normal, Stunted, Severely stunted, Stunted, Norma…
#> $ bmi_for_age <ord> Normal, Obese, Obese, Normal, Normal, NA, Overwei…
Arrange Data
data <-
  data |>
  arrange(
    date,
    municipality_code,
    cnes,
    sex,
    age
  )
data |> glimpse()
#> Rows: 7,273,731
#> Columns: 13
#> $ id <chr> "5A947ABCEA1CAC15C21EEFED4C4C880DF37F49EB", "0C99…
#> $ date <date> 2023-01-01, 2023-01-01, 2023-01-01, 2023-01-01, …
#> $ municipality_code <dbl> 1302603, 1302603, 1302603, 1303700, 1303809, 1303…
#> $ cnes <int> 2011786, 2011786, 2013932, 9970835, NA, NA, NA, 2…
#> $ sex <fct> Male, Female, Male, Female, Male, Female, Male, F…
#> $ age <int> 4, 3, 4, 2, 4, 0, 2, 4, 4, 1, 2, 4, 4, 4, 1, 4, 0…
#> $ ethnicity <fct> Brown, Yellow, Brown, Yellow, Indigenous, Indigen…
#> $ weight <dbl> 22.00, 14.70, 17.00, 11.10, 19.80, NA, 14.00, 16.…
#> $ height <int> 115, 97, 110, 80, 100, NA, 91, 105, 110, 75, 74, …
#> $ weight_for_age <ord> Normal, Normal, Normal, Normal, Normal, NA, Norma…
#> $ weight_for_height <ord> Normal, Normal, Normal, Normal, Overweight, NA, N…
#> $ height_for_age <ord> Normal, Normal, Normal, Severely stunted, Normal,…
#> $ bmi_for_age <ord> Possible risk of overweight, Normal, Normal, Poss…
Create Data Dictionary
Prepare Metadata
metadata <-
data |>
`var_label<-`(
list(
id = "Unique identifier for the individual",
date = "Date of the individual's nutritional assessment",
municipality_code = paste0(
"Brazilian Institute of Geography and Statistics (IBGE) code of the ",
"municipality where the assessment was performed"
),
cnes = paste0(
"National Registry of Health Establishments (CNES) code of the ",
"health facility where the assessment was performed"
),
sex = "Sex of the individual",
age = "Age of the individual in years",
ethnicity = "Self-reported ethnicity/race or color of the individual",
weight = "Weight of the individual in kilograms",
height = "Height of the individual in centimeters",
weight_for_age = paste0(
"Nutritional status classification (children 0–5) based on ",
"weight-for-age"
),
weight_for_height = paste0(
"Nutritional status classification (children 0–5) based on ",
"weight-for-height"
),
height_for_age = paste0(
"Nutritional status classification (children 0–10) based on ",
"height-for-age"
),
bmi_for_age = paste0(
"Nutritional status classification (children 0–10) based on ",
"BMI-for-age"
)
)
) |>
generate_dictionary(details = "full") |>
convert_list_columns_to_character()
Visualize Final Data
metadata |> glimpse()
#> Rows: 13
#> Columns: 14
#> $ pos <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
#> $ variable <chr> "id", "date", "municipality_code", "cnes", "sex", "ag…
#> $ label <chr> "Unique identifier for the individual", "Date of the …
#> $ col_type <chr> "chr", "date", "dbl", "int", "fct", "int", "fct", "db…
#> $ missing <int> 0, 0, 0, 57013, 0, 0, 1142071, 1612199, 1509643, 1612…
#> $ levels <chr> "", "", "", "", "Male; Female", "", "White; Black; Ye…
#> $ value_labels <chr> "", "", "", "", "", "", "", "", "", "", "", "", ""
#> $ class <chr> "character", "Date", "numeric", "integer", "factor", …
#> $ type <chr> "character", "double", "double", "integer", "integer"…
#> $ na_values <chr> "", "", "", "", "", "", "", "", "", "", "", "", ""
#> $ na_range <chr> "", "", "", "", "", "", "", "", "", "", "", "", ""
#> $ n_na <int> 0, 0, 0, 57013, 0, 0, 1142071, 1612199, 1509643, 1612…
#> $ unique_values <int> 7237146, 365, 5569, 43701, 2, 5, 6, 11123, 92, 5, 7, …
#> $ range <chr> "000002CB5BCC53DACE05639B3E1CD4BD1FF6C51A - FFFFFF927…
metadata
data |> glimpse()
#> Rows: 7,273,731
#> Columns: 13
#> $ id <chr> "5A947ABCEA1CAC15C21EEFED4C4C880DF37F49EB", "0C99…
#> $ date <date> 2023-01-01, 2023-01-01, 2023-01-01, 2023-01-01, …
#> $ municipality_code <dbl> 1302603, 1302603, 1302603, 1303700, 1303809, 1303…
#> $ cnes <int> 2011786, 2011786, 2013932, 9970835, NA, NA, NA, 2…
#> $ sex <fct> Male, Female, Male, Female, Male, Female, Male, F…
#> $ age <int> 4, 3, 4, 2, 4, 0, 2, 4, 4, 1, 2, 4, 4, 4, 1, 4, 0…
#> $ ethnicity <fct> Brown, Yellow, Brown, Yellow, Indigenous, Indigen…
#> $ weight <dbl> 22.00, 14.70, 17.00, 11.10, 19.80, NA, 14.00, 16.…
#> $ height <int> 115, 97, 110, 80, 100, NA, 91, 105, 110, 75, 74, …
#> $ weight_for_age <ord> Normal, Normal, Normal, Normal, Normal, NA, Norma…
#> $ weight_for_height <ord> Normal, Normal, Normal, Normal, Overweight, NA, N…
#> $ height_for_age <ord> Normal, Normal, Normal, Severely stunted, Normal,…
#> $ bmi_for_age <ord> Possible risk of overweight, Normal, Normal, Poss…
data
Save Data
The processed data are available in csv, rds and parquet formats through a dedicated repository on the Open Science Framework (OSF). See the Data Availability section for more information.
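The code in this section writes the parquet files. As a hedged illustration of how the same file-name pattern could be reused for the csv and rds formats, the base R sketch below writes a toy data frame to a temporary directory. The object names and the output location are assumptions made for the example and are not taken from the pipeline.

```r
# Same naming pattern as used in this section: "<year>-age-limits-<min>-<max>".
year <- 2023
age_limits <- c(0, 4)
file_stem <- paste0(year, "-age-limits-", age_limits[1], "-", age_limits[2])

# Toy stand-in for the processed data.
toy_data <- data.frame(id = c("A", "B"), weight = c(9.7, 11.9))

# A temporary directory keeps the example free of side effects.
out_dir <- tempdir()
csv_file <- file.path(out_dir, paste0(file_stem, ".csv"))
rds_file <- file.path(out_dir, paste0(file_stem, ".rds"))

write.csv(toy_data, csv_file, row.names = FALSE)
saveRDS(toy_data, rds_file)
```

The rds format preserves R-specific types such as factors and dates, whereas csv stores plain text that has to be parsed again on import.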
Write Data
valid_file_pattern <-
  year |>
  paste0(
    "-age-limits-",
    age_limits[1],
    "-",
    age_limits[2]
  )
data |>
  write_parquet(
    here(data_dir, paste0(valid_file_pattern, ".parquet"))
  )
Write Metadata
metadata_file_pattern <-
  "metadata-" |>
  paste0(
    year,
    "-age-limits-",
    age_limits[1],
    "-",
    age_limits[2]
  )
metadata |>
  write_parquet(
    here(data_dir, paste0(metadata_file_pattern, ".parquet"))
  )
Explore Data
Summarize Frequencies
Some SISVAN data may show discrepancies when compared to official population estimates. For example, the reported number of children under 5 classified as yellow in 2023 appears unusually high. If you notice such inconsistencies, check the SISVAN web reports to confirm before reporting an issue.
vars <- c(
  "sex",
  "age",
  "ethnicity",
  "weight_for_age",
  "weight_for_height",
  "height_for_age",
  "bmi_for_age"
)
panel_tabset_data <- tibble()
for (i in vars) {
  table <-
    data |>
    arrange(desc(.data[[i]])) |>
    distinct(id, .data[[i]]) |>
    group_by(.data[[i]]) |>
    summarize(n = n(), .groups = "drop") |>
    arrange(desc(n)) |>
    mutate(
      !!i := .data[[i]] |>
        as.factor() |>
        fct_na_value_to_level(level = "(NA)") |>
        fct_reorder(n) |>
        fct_relevel("(NA)", after = 0)
    ) |>
    arrange(desc(.data[[i]])) |>
    mutate(
      n_cum = cumsum(n),
      pct = n |>
        divide_by(sum(n)) |>
        multiply_by(100),
      pct_cum = pct |>
        cumsum() |>
        round(3),
      pct = pct |> round(3),
      across(
        .cols = where(is.numeric),
        .fns = \(x) prettyNum(x, big.mark = ",", decimal.mark = ".")
      )
    ) |>
    rename(
      N = n,
      `N (Cumulative)` = n_cum,
      Percent = pct,
      `Percent (Cumulative)` = pct_cum
    ) |>
    kable()
  panel_tabset_data <-
    panel_tabset_data |>
    bind_rows(
      tibble(
        label = paste0("`", i, "`"),
        table = list(table)
      )
    )
}
panel_tabset_data |> render_tabset(label, table)

| age | N | N (Cumulative) | Percent | Percent (Cumulative) |
|---|---|---|---|---|
| 0 | 1,691,665 | 1,691,665 | 23.349 | 23.349 |
| 1 | 1,466,717 | 3,158,382 | 20.244 | 43.593 |
| 4 | 1,409,998 | 4,568,380 | 19.461 | 63.054 |
| 2 | 1,355,180 | 5,923,560 | 18.705 | 81.759 |
| 3 | 1,321,600 | 7,245,160 | 18.241 | 100 |

| bmi_for_age | N | N (Cumulative) | Percent | Percent (Cumulative) |
|---|---|---|---|---|
| Normal | 3,545,481 | 3,545,481 | 48.884 | 48.884 |
| Possible risk of overweight | 991,166 | 4,536,647 | 13.666 | 62.55 |
| Overweight | 434,183 | 4,970,830 | 5.986 | 68.536 |
| Obese | 228,781 | 5,199,611 | 3.154 | 71.69 |
| Wasted | 169,328 | 5,368,939 | 2.335 | 74.025 |
| Severe wasted | 88,902 | 5,457,841 | 1.226 | 75.251 |
| (NA) | 1,795,035 | 7,252,876 | 24.749 | 100 |

| ethnicity | N | N (Cumulative) | Percent | Percent (Cumulative) |
|---|---|---|---|---|
| Yellow | 2,345,345 | 2,345,345 | 32.407 | 32.407 |
| White | 2,038,029 | 4,383,374 | 28.161 | 60.568 |
| Brown | 1,486,639 | 5,870,013 | 20.542 | 81.11 |
| Black | 174,120 | 6,044,133 | 2.406 | 83.515 |
| Indigenous | 53,156 | 6,097,289 | 0.734 | 84.25 |
| (NA) | 1,139,857 | 7,237,146 | 15.75 | 100 |

| height_for_age | N | N (Cumulative) | Percent | Percent (Cumulative) |
|---|---|---|---|---|
| Normal | 5,020,425 | 5,020,425 | 69.265 | 69.265 |
| Stunted | 425,016 | 5,445,441 | 5.864 | 75.129 |
| Severely stunted | 298,757 | 5,744,198 | 4.122 | 79.251 |
| (NA) | 1,503,933 | 7,248,131 | 20.749 | 100 |

| sex | N | N (Cumulative) | Percent | Percent (Cumulative) |
|---|---|---|---|---|
| Male | 3,713,250 | 3,713,250 | 51.308 | 51.308 |
| Female | 3,523,896 | 7,237,146 | 48.692 | 100 |

| weight_for_age | N | N (Cumulative) | Percent | Percent (Cumulative) |
|---|---|---|---|---|
| Normal | 5,106,342 | 5,106,342 | 70.472 | 70.472 |
| High | 305,772 | 5,412,114 | 4.22 | 74.692 |
| Underweight | 166,943 | 5,579,057 | 2.304 | 76.996 |
| Severely underweight | 60,807 | 5,639,864 | 0.839 | 77.835 |
| (NA) | 1,606,032 | 7,245,896 | 22.165 | 100 |

| weight_for_height | N | N (Cumulative) | Percent | Percent (Cumulative) |
|---|---|---|---|---|
| Normal | 3,660,833 | 3,660,833 | 50.483 | 50.483 |
| Possible risk of overweight | 983,190 | 4,644,023 | 13.558 | 64.041 |
| Overweight | 392,527 | 5,036,550 | 5.413 | 69.454 |
| Obese | 203,342 | 5,239,892 | 2.804 | 72.258 |
| Wasted | 143,713 | 5,383,605 | 1.982 | 74.239 |
| Severe wasted | 67,486 | 5,451,091 | 0.931 | 75.17 |
| (NA) | 1,800,585 | 7,251,676 | 24.83 | 100 |

Plot Bar Charts
Some SISVAN data may show discrepancies when compared to official population estimates. For example, the number of children under 5 classified as Yellow in 2023 (2,345,345 records, or 32.4% of the `ethnicity` table above) appears unusually high. If you notice such inconsistencies, check the SISVAN web reports to confirm them before reporting an issue.
Code
panel_tabset_data <- tibble()
for (i in vars) {
plot <-
data |>
mutate(
!!i := as.character(.data[[i]]),
!!i := ifelse(
str_length(.data[[i]]) > 30,
paste0(str_sub(.data[[i]], 1, 27), "..."),
.data[[i]]
) |>
fct_na_value_to_level(level = "(NA)")
) |>
group_by(.data[[i]]) |>
summarize(n = n(), .groups = "drop") |>
mutate(
rel = n |>
magrittr::divide_by(sum(n)) |>
round(2),
!!i := .data[[i]] |>
fct_reorder(n) |>
fct_relevel("(NA)", after = 0)
) |>
ggplot(
aes(
x = .data[[i]],
y = n,
label = percent(rel)
)
) +
geom_col(fill = get_brand_color("green")) +
geom_text(
hjust = -0.25,
size = 3
) +
coord_flip() +
scale_y_continuous(
expand = expansion(mult = c(0.05, 0.15))
) +
labs(x = NULL, y = NULL) +
theme(
axis.text.y = element_text(size = 8)
)
panel_tabset_data <-
panel_tabset_data |>
bind_rows(
tibble(
label = paste0("`", i, "`"),
plot = list(plot)
)
)
}
Code
panel_tabset_data |> render_tabset(label, plot)
Summarize Descriptive Statistics
Code
vars <- c(
"date",
"height",
"weight"
)
Code
panel_tabset_data <- tibble()
for (i in vars) {
table <-
data |>
arrange(desc(.data[[i]])) |>
distinct(id, .data[[i]]) |>
stats_summary(i) |>
mutate(
name = c(
"Class",
"N",
"N (Without Missing)",
"N (Missing)",
"Mean",
"Variance",
"Standard Deviation",
"Minimum",
"1st Quartile (Q1)",
"Median",
"3rd Quartile (Q3)",
"Maximum",
"Interquartile Range (IQR)",
"Range",
"Skewness",
"Kurtosis"
),
value = if_else(
!str_detect(value, "\\d{4}-\\d{2}-\\d{2}"),
value |>
prettyNum(big.mark = ",", decimal.mark = ".") |>
str_trim(),
value
)
) |>
rename(
Name = name,
Value = value
) |>
kable()
panel_tabset_data <-
panel_tabset_data |>
bind_rows(
tibble(
label = paste0("`", i, "`"),
table = list(table)
)
)
}
Code
panel_tabset_data |> render_tabset(label, table)
| Name | Value |
|---|---|
| Class | Date |
| N | 7,273,731 |
| N (Without Missing) | 7,273,731 |
| N (Missing) | 0 |
| Mean | 2023-09-11 |
| Variance | 675,402,587.067,832s (~21.4 years) |
| Standard Deviation | 7,639,030.27,371,018s (~12.63 weeks) |
| Minimum | 2023-01-01 |
| 1st Quartile (Q1) | 2023-07-27 |
| Median | 2023-10-04 |
| 3rd Quartile (Q3) | 2023-11-21 |
| Maximum | 2023-12-31 |
| Interquartile Range (IQR) | 10,108,800s (~16.71 weeks) |
| Range | 31,449,600s (~52 weeks) |
| Skewness | -0.993245646806598 |
| Kurtosis | 3.08772335061643 |

| Name | Value |
|---|---|
| Class | integer |
| N | 7,261,876 |
| N (Without Missing) | 5,758,005 |
| N (Missing) | 1,503,871 |
| Mean | 89.6279690622012 |
| Variance | 246.572889788617 |
| Standard Deviation | 15.702639580294 |
| Minimum | 38 |
| 1st Quartile (Q1) | 80 |
| Median | 92 |
| 3rd Quartile (Q3) | 101 |
| Maximum | 128 |
| Interquartile Range (IQR) | 21 |
| Range | 90 |
| Skewness | -0.699042893863037 |
| Kurtosis | 3.14509557344265 |

| Name | Value |
|---|---|
| Class | numeric |
| N | 7,262,785 |
| N (Without Missing) | 5,656,771 |
| N (Missing) | 1,606,014 |
| Mean | 13.4700866034705 |
| Variance | 18.3701448246864 |
| Standard Deviation | 4.28604069330733 |
| Minimum | 0.945 |
| 1st Quartile (Q1) | 11 |
| Median | 13.5 |
| 3rd Quartile (Q3) | 16 |
| Maximum | 32.51 |
| Interquartile Range (IQR) | 5 |
| Range | 31.565 |
| Skewness | -0.0440857611925257 |
| Kurtosis | 3.61371365104714 |
Plot Histograms
Code
panel_tabset_data <- tibble()
for (i in vars) {
plot <-
data |>
tidyr::drop_na(all_of(i)) |>
ggplot(
aes(
x = .data[[i]],
y = after_stat(count)
)
) +
geom_histogram(
bins = 30,
fill = get_brand_color("gray-d25"),
color = get_brand_color("white")
) +
# geom_density(
# color = "red",
# linewidth = 1
# ) +
labs(
title = ifelse(
i == "date",
paste("Distribution of", str_to_title(i), "of Nutritional Assessments"),
paste("Distribution of", str_to_title(i))
),
x = str_to_title(i),
y = "Frequency"
) +
theme(
axis.text.x = element_text(margin = margin(0, 0, 10, 0)),
axis.text.y = element_text(margin = margin(0, 0, 0, 20))
)
panel_tabset_data <-
panel_tabset_data |>
bind_rows(
tibble(
label = paste0("`", i, "`"),
plot = list(plot)
)
)
}
Code
panel_tabset_data |> render_tabset(label, plot)
Check Relative Coverage
Transform Data
Remove Duplicates by Year
As described in Silva et al. (2023, p. 4), to calculate SISVAN’s total resident population coverage, only the most recent record for each individual within each year is retained for analysis.
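The code for this step is not shown in the rendered report. As a minimal sketch of the rule described above, assuming `id` and `date` identify an individual and an assessment as in the `glimpse()` output, and using a hypothetical toy data frame named `records`, the most recent record per individual and year can be kept with `slice_max()`:

```r
library(dplyr)

# Hypothetical toy records; only the column names mirror the report's data.
records <- tibble(
  id = c("A", "A", "B", "B", "C"),
  date = as.Date(c(
    "2023-03-10", "2023-11-02", "2022-12-30", "2023-01-15", "2023-06-01"
  )),
  weight = c(9.8, 11.2, 13.0, 13.1, 15.4)
)

deduplicated <-
  records |>
  mutate(year = as.numeric(format(date, "%Y"))) |>
  # Keep a single row per individual and year: the latest assessment.
  # `with_ties = FALSE` guarantees one row even when dates repeat.
  slice_max(date, n = 1, with_ties = FALSE, by = c(id, year)) |>
  arrange(id, year)

deduplicated
```

Individual `B` keeps one row for 2022 and one for 2023, because deduplication happens within each year rather than across the whole series. The `by` argument of `slice_max()` requires dplyr 1.1.0 or later.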
Code
data |> glimpse()
#> Rows: 7,237,146
#> Columns: 14
#> $ id <chr> "95978385C9FBD8D908846992EDAB7AD1566C765C", "0732…
#> $ date <date> 2023-12-31, 2023-12-31, 2023-12-31, 2023-12-31, …
#> $ year <dbl> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2…
#> $ municipality_code <dbl> 1100205, 1100452, 1100452, 1100452, 1100452, 1100…
#> $ cnes <int> 5695880, 2806630, 2806630, 2806630, 9277927, 9277…
#> $ sex <fct> Male, Male, Male, Female, Male, Female, Female, M…
#> $ age <int> 2, 2, 3, 2, 3, 1, 4, 3, 1, 1, 3, 3, 3, 1, 3, 4, 4…
#> $ ethnicity <fct> White, Brown, Brown, Yellow, Yellow, White, Brown…
#> $ weight <dbl> 13.0, 14.0, 15.0, 12.0, 19.4, 11.5, 21.0, 17.0, 1…
#> $ height <int> 80, 90, 95, 81, 103, 84, 116, 104, 76, 70, 84, 10…
#> $ weight_for_age <ord> Normal, Normal, Normal, Normal, High, Normal, Nor…
#> $ weight_for_height <ord> Overweight, Possible risk of overweight, Normal, …
#> $ height_for_age <ord> Stunted, Normal, Normal, Normal, Normal, Normal, …
#> $ bmi_for_age <ord> Overweight, Possible risk of overweight, Normal, …
Summarize Data by Year
Code
data |> glimpse()
#> Rows: 5,569
#> Columns: 6
#> $ year <dbl> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2…
#> $ municipality_code <dbl> 1100205, 1100452, 1100601, 1101807, 1200013, 1200…
#> $ coverage <int> 12569, 1945, 246, 366, 1088, 914, 1634, 10810, 36…
#> $ mean_age <dbl> 1.932373299, 1.665295630, 2.130081301, 1.89071038…
#> $ mean_weight <dbl> 13.18117501, 12.74053459, 14.59693299, 13.7829060…
#> $ mean_height <dbl> 88.81422925, 87.31281317, 95.27319588, 90.6430976…
Add Population Estimates
Code
data <-
population_estimates_data |>
filter(between(age, age_limits[1], age_limits[2])) |>
summarize(
n = population |> sum(na.rm = TRUE),
.by = c(
"year",
"municipality_code"
)
) |>
right_join(
data,
by = c(
"year",
"municipality_code"
)
) |>
rename(population = n) |>
relocate(population, .before = coverage)
Code
data |> glimpse()
#> Rows: 5,569
#> Columns: 7
#> $ year <dbl> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2…
#> $ municipality_code <dbl> 1100015, 1100023, 1100031, 1100049, 1100056, 1100…
#> $ population <int> 1683, 7831, 356, 6540, 1256, 1048, 556, 1118, 238…
#> $ coverage <int> 1036, 2554, 243, 3278, 780, 687, 418, 779, 1048, …
#> $ mean_age <dbl> 1.694015444, 1.779169930, 1.547325103, 1.68700427…
#> $ mean_weight <dbl> 12.69762531, 12.01254692, 12.92741573, 12.6632511…
#> $ mean_height <dbl> 88.59817945, 84.05950266, 87.34444444, 86.8941267…
Add Municipality Data
Code
data |> glimpse()
#> Rows: 5,569
#> Columns: 13
#> $ year <dbl> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2…
#> $ region_code <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ region <chr> "North", "North", "North", "North", "North", "Nor…
#> $ state_code <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ state <chr> "Rondônia", "Rondônia", "Rondônia", "Rondônia", "…
#> $ federal_unit <chr> "RO", "RO", "RO", "RO", "RO", "RO", "RO", "RO", "…
#> $ municipality_code <dbl> 1100015, 1100023, 1100031, 1100049, 1100056, 1100…
#> $ municipality <chr> "Alta Floresta D'Oeste", "Ariquemes", "Cabixi", "…
#> $ population <int> 1683, 7831, 356, 6540, 1256, 1048, 556, 1118, 238…
#> $ coverage <int> 1036, 2554, 243, 3278, 780, 687, 418, 779, 1048, …
#> $ mean_age <dbl> 1.694015444, 1.779169930, 1.547325103, 1.68700427…
#> $ mean_weight <dbl> 12.69762531, 12.01254692, 12.92741573, 12.6632511…
#> $ mean_height <dbl> 88.59817945, 84.05950266, 87.34444444, 86.8941267…
Validate Data
The population value used here is an estimate, so the SISVAN coverage for a municipality can exceed it. When that happens, the population value is raised to match the coverage, which caps relative coverage at 100%.
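The report does not show the code for this adjustment. As an illustration only, assuming the `population` and `coverage` columns shown in the `glimpse()` output and a hypothetical toy data frame named `coverage_data`, the rule can be expressed with an element-wise maximum:

```r
library(dplyr)

# Hypothetical toy values; the municipality codes are placeholders.
coverage_data <- tibble(
  municipality_code = c(1, 2, 3),
  population = c(1000L, 200L, 500L),
  coverage = c(600L, 250L, 500L)
)

validated <-
  coverage_data |>
  mutate(
    # Raise the estimate to the observed coverage when coverage is larger.
    population = pmax(population, coverage),
    # Relative coverage in percent, which can no longer exceed 100.
    coverage_pct = coverage / population * 100
  )

validated
```

In this sketch, the second municipality has its population raised from 200 to 250, so its relative coverage becomes 100% instead of 125%. Note that `pmax()` returns `NA` when either value is missing, so municipalities without a population estimate stay missing.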
Code
data |> glimpse()
#> Rows: 5,569
#> Columns: 13
#> $ year <dbl> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2…
#> $ region_code <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ region <chr> "North", "North", "North", "North", "North", "Nor…
#> $ state_code <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ state <chr> "Rondônia", "Rondônia", "Rondônia", "Rondônia", "…
#> $ federal_unit <chr> "RO", "RO", "RO", "RO", "RO", "RO", "RO", "RO", "…
#> $ municipality_code <dbl> 1100015, 1100023, 1100031, 1100049, 1100056, 1100…
#> $ municipality <chr> "Alta Floresta D'Oeste", "Ariquemes", "Cabixi", "…
#> $ population <int> 1683, 7831, 356, 6540, 1256, 1048, 556, 1118, 238…
#> $ coverage <int> 1036, 2554, 243, 3278, 780, 687, 418, 779, 1048, …
#> $ mean_age <dbl> 1.694015444, 1.779169930, 1.547325103, 1.68700427…
#> $ mean_weight <dbl> 12.69762531, 12.01254692, 12.92741573, 12.6632511…
#> $ mean_height <dbl> 88.59817945, 84.05950266, 87.34444444, 86.8941267…
Calculate Relative Coverage
Code
data |> glimpse()
#> Rows: 5,569
#> Columns: 14
#> $ year <dbl> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2…
#> $ region_code <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ region <chr> "North", "North", "North", "North", "North", "Nor…
#> $ state_code <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ state <chr> "Rondônia", "Rondônia", "Rondônia", "Rondônia", "…
#> $ federal_unit <chr> "RO", "RO", "RO", "RO", "RO", "RO", "RO", "RO", "…
#> $ municipality_code <dbl> 1100015, 1100023, 1100031, 1100049, 1100056, 1100…
#> $ municipality <chr> "Alta Floresta D'Oeste", "Ariquemes", "Cabixi", "…
#> $ population <int> 1683, 7831, 356, 6540, 1256, 1048, 556, 1118, 238…
#> $ coverage <int> 1036, 2554, 243, 3278, 780, 687, 418, 779, 1048, …
#> $ coverage_pct <dbl> 61.55674391, 32.61397012, 68.25842697, 50.1223241…
#> $ mean_age <dbl> 1.694015444, 1.779169930, 1.547325103, 1.68700427…
#> $ mean_weight <dbl> 12.69762531, 12.01254692, 12.92741573, 12.6632511…
#> $ mean_height <dbl> 88.59817945, 84.05950266, 87.34444444, 86.8941267…
Arrange Data
Code
data <-
data |>
arrange(
year,
municipality_code
)
Code
data |> glimpse()
#> Rows: 5,569
#> Columns: 14
#> $ year <dbl> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2…
#> $ region_code <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ region <chr> "North", "North", "North", "North", "North", "Nor…
#> $ state_code <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ state <chr> "Rondônia", "Rondônia", "Rondônia", "Rondônia", "…
#> $ federal_unit <chr> "RO", "RO", "RO", "RO", "RO", "RO", "RO", "RO", "…
#> $ municipality_code <dbl> 1100015, 1100023, 1100031, 1100049, 1100056, 1100…
#> $ municipality <chr> "Alta Floresta D'Oeste", "Ariquemes", "Cabixi", "…
#> $ population <int> 1683, 7831, 356, 6540, 1256, 1048, 556, 1118, 238…
#> $ coverage <int> 1036, 2554, 243, 3278, 780, 687, 418, 779, 1048, …
#> $ coverage_pct <dbl> 61.55674391, 32.61397012, 68.25842697, 50.1223241…
#> $ mean_age <dbl> 1.694015444, 1.779169930, 1.547325103, 1.68700427…
#> $ mean_weight <dbl> 12.69762531, 12.01254692, 12.92741573, 12.6632511…
#> $ mean_height <dbl> 88.59817945, 84.05950266, 87.34444444, 86.8941267…
Tabulate Relative Coverage by Region
Code
data |>
mutate(
region = brazil_region(region_code)
) |>
summarize(
population = population |> sum(na.rm = TRUE),
coverage = coverage |> sum(na.rm = TRUE),
.by = "region"
) |>
slice(c(1, 2, 5, 3, 4)) |>
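mutate(
coverage_pct = coverage |>
magrittr::divide_by(population) |>
magrittr::multiply_by(100) |>
round(3)
) |>
# Sort while `coverage_pct` is still numeric; the formatting below turns it into text.
arrange(desc(coverage_pct)) |>
mutate(
across(
.cols = where(is.numeric),
.fns = \(x) prettyNum(x, big.mark = ",", decimal.mark = ".")
)
) |>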
rename(
Region = region,
Population = population,
`SISVAN Coverage` = coverage,
`SISVAN Coverage (%)` = coverage_pct
) |>
pipe_table() |>
cat_lines()
| Region | Population | SISVAN Coverage | SISVAN Coverage (%) |
|---|---|---|---|
| Northeast | 3,766,882 | 2,367,312 | 62.845 |
| North | 1,491,017 | 936,061 | 62.78 |
| South | 1,866,192 | 980,048 | 52.516 |
| Central-West | 1,163,709 | 581,890 | 50.003 |
| Southeast | 5,144,676 | 2,371,835 | 46.103 |
Tabulate Relative Coverage by State
Code
data |>
mutate(
state = brazil_state(state_code)
) |>
summarize(
population = population |> sum(na.rm = TRUE),
coverage = coverage |> sum(na.rm = TRUE),
.by = "state"
) |>
arrange(state) |>
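mutate(
coverage_pct = coverage |>
magrittr::divide_by(population) |>
magrittr::multiply_by(100) |>
round(3)
) |>
# Sort while `coverage_pct` is still numeric; the formatting below turns it into text.
arrange(desc(coverage_pct)) |>
mutate(
across(
.cols = where(is.numeric),
.fns = \(x) prettyNum(x, big.mark = ",", decimal.mark = ".")
)
) |>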
rename(
State = state,
Population = population,
`SISVAN Coverage` = coverage,
`SISVAN Coverage (%)` = coverage_pct
) |>
pipe_table() |>
cat_lines()
| State | Population | SISVAN Coverage | SISVAN Coverage (%) |
|---|---|---|---|
| Amazonas | 372,626 | 258,619 | 69.404 |
| Piauí | 223,295 | 152,225 | 68.172 |
| Ceará | 603,297 | 409,420 | 67.864 |
| Tocantins | 118,363 | 79,276 | 66.977 |
| Maranhão | 521,647 | 339,351 | 65.054 |
| Alagoas | 235,822 | 153,311 | 65.011 |
| Pará | 655,706 | 420,326 | 64.103 |
| Bahia | 917,506 | 579,347 | 63.144 |
| Paraíba | 273,869 | 172,150 | 62.859 |
| Sergipe | 151,907 | 91,052 | 59.939 |
| Acre | 74,909 | 43,935 | 58.651 |
| Pernambuco | 625,346 | 355,420 | 56.836 |
| Minas Gerais | 1,223,545 | 687,147 | 56.16 |
| Santa Catarina | 504,318 | 278,068 | 55.137 |
| Paraná | 730,202 | 393,975 | 53.954 |
| Rio Grande do Norte | 214,193 | 115,036 | 53.707 |
| Mato Grosso | 295,406 | 158,514 | 53.66 |
| Mato Grosso do Sul | 210,411 | 110,848 | 52.682 |
| Amapá | 72,115 | 37,543 | 52.06 |
| Rondônia | 126,022 | 62,427 | 49.537 |
| Rio Grande do Sul | 631,672 | 308,005 | 48.76 |
| Goiás | 469,363 | 227,488 | 48.467 |
| Roraima | 71,276 | 33,935 | 47.611 |
| Espírito Santo | 266,283 | 125,030 | 46.954 |
| Rio de Janeiro | 958,493 | 444,301 | 46.354 |
| Distrito Federal | 188,529 | 85,040 | 45.107 |
| São Paulo | 2,696,355 | 1,115,357 | 41.365 |
Plot Histogram by Municipality
Code
data |>
tidyr::drop_na(coverage_pct) |>
ggplot(aes(x = coverage_pct)) +
geom_histogram(
aes(y = after_stat(density)),
bins = 30,
fill = get_brand_color("gray-d25"),
color = get_brand_color("white")
) +
geom_density(
color = "red",
linewidth = 1
) +
labs(
title = "SISVAN Coverage by Municipality (%) (Ages 0-5)",
subtitle = paste0("Year: ", year),
x = "Coverage (%)",
y = "Density",
caption = "Source: SISVAN."
)
Plot Map by Municipality
Set Shape
Code
shape <-
read_municipality(
year = year |>
closest_geobr_year(type = "municipality"),
showProgress = FALSE
) |>
st_transform(st_crs(4326))
#> ! The closest map year to 2023 is 2022. Using year 2022 instead.
#> Using year/date 2022
Prepare Plot Data
Plot Data
Code
brand_div_palette <- function(x) {
brandr:::make_color_ramp(
n = x,
colors = c(
get_brand_color("dark-red"),
# get_brand_color("white"),
get_brand_color_mix(
position = 950,
color_1 = "dark-red",
color_2 = "dark-red-triadic-blue",
alpha = 0.5
),
get_brand_color("dark-red-triadic-blue")
)
)
}
Code
plot_data |>
st_as_sf() |>
ggplot(aes(fill = coverage_pct)) +
geom_sf(
color = get_brand_color("gray"),
linewidth = 0.05
) +
scale_fill_binned(
breaks = seq(0, 100, 25),
limits = c(0, 100),
palette = brand_div_palette,
na.value = get_brand_color("gray-d25")
) +
annotation_scale(
aes(),
location = "br",
style = "tick",
height = unit(0.5, "lines")
) +
annotation_north_arrow(
location = "br",
height = unit(2, "lines"),
width = unit(2, "lines"),
pad_x = unit(0.25, "lines"),
pad_y = unit(1.25, "lines"),
style = north_arrow_fancy_orienteering
) +
labs(
title = "SISVAN Coverage by Municipality (%) (Ages 0-5)",
subtitle = paste0("Year: ", year),
fill = NULL,
caption = "Source: SISVAN."
)
#> Scale on map varies by more than 10%, scale bar may be inaccurate
Citation
When using this data, you must also cite the original data sources.
To cite this work, please use the following format:
Vartanian, D., Schettino, J. P. J., & Carvalho, A. M. (2025). A reproducible pipeline for processing and analyzing SISVAN microdata on nutritional status monitoring in Brazil [Computer software]. Sustentarea Research and Extension Group, University of São Paulo. https://sustentarea.github.io/nutritional-status
A BibLaTeX entry for LaTeX users is:
@software{vartanian2025,
title = {A reproducible pipeline for processing and analyzing SISVAN microdata on nutritional status monitoring in Brazil},
author = {{Daniel Vartanian} and {João Pedro Junqueira Schettino} and {Aline Martins de Carvalho}},
year = {2025},
address = {São Paulo},
institution = {Sustentarea Research and Extension Group, University of São Paulo},
langid = {en},
url = {https://sustentarea.github.io/nutritional-status}
}
License
The original data sources may be subject to their own licensing terms and conditions.
The code in this repository is licensed under the GNU General Public License Version 3, while the report is available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International.
Copyright (C) 2025 Sustentarea Research and Extension Group
The code in this report is free software: you can redistribute it and/or
modify it under the terms of the GNU General Public License as published by the
Free Software Foundation, either version 3 of the License, or (at your option)
any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with
this program. If not, see <https://www.gnu.org/licenses/>.