A Reproducible Pipeline for Processing and Analyzing SISVAN Microdata on Nutritional Status Monitoring in Brazil

Daniel Vartanian, João Pedro J. Schettino, & Aline M. de Carvalho

Published

December 15, 2025

Overview

This report provides a reproducible pipeline for processing and analyzing the microdata on nutritional status monitoring in Brazil from the Brazilian Food and Nutrition Surveillance System (SISVAN), focusing on the nutritional status of children aged 0–5 years (i.e., younger than 60 months).

If you are working with other age groups, you will need to adapt the code accordingly. We provide some guidance on how to do this throughout the report.

For instructions on how to run the pipeline, see the repository README.

Click here to see a report with a longitudinal analysis of the processed data.

Problem

The Food and Nutrition Surveillance System (SISVAN) is a strategic tool for monitoring the nutritional status of the Brazilian population, particularly those served by Brazil’s Unified Health System (SUS). However, despite its broad scope and importance, the anthropometric data recorded in SISVAN often suffer from accessibility and quality issues that limit their usefulness for rigorous analyses and evidence-based policymaking (Silva et al., 2023).

Multiple factors contribute to these quality concerns, including the lack of standardized measurement protocols, variability in staff training, inconsistencies in data entry and processing, and incomplete population coverage (Bagni & Barros, 2015; Corsi et al., 2017; Perumal et al., 2020). To assess and improve data quality, several indicators have been proposed and applied, such as population coverage (Mourão et al., 2020; Nascimento et al., 2017), completeness of birth dates and anthropometric measurements (Finaret & Hutchinson, 2018; Nannan et al., 2019), digit preference for age, height, and weight (Bopp & Faeh, 2008; Lyons-Amos & Stones, 2017), the percentage of biologically implausible values (Lawman et al., 2015), and the dispersion and distribution of standardized weight and height measurements (Mei, 2007; Perumal et al., 2020).

In light of these challenges, there is a need for an open and reproducible pipeline to process SISVAN microdata. Such a pipeline should facilitate broader access to the data and systematically identify, correct, and remove problematic records, thereby improving the consistency, completeness, and plausibility of the information for research and policymaking.

Data Availability

The processed data are available in csv, rds, and parquet formats via a dedicated repository on the Open Science Framework (OSF), accessible here. Each dataset is accompanied by a metadata file describing its structure and contents.

You can also retrieve these files directly from R using the osfr package.
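A minimal sketch of such a retrieval, using the same osfr calls that appear later in this report (`osf_retrieve_node()`, `osf_ls_files()`, and `osf_download()`). The helper name `download_processed_files()` and its arguments are hypothetical, and the OSF node ID of the processed-data repository is not given here, so it is left as a placeholder.

```r
# Hypothetical helper (not part of the pipeline): lists the files of an OSF
# node whose names match `pattern` and downloads them to `dest`.
download_processed_files <- function(node_id, pattern, dest = tempdir()) {
  node_id |>
    osfr::osf_retrieve_node() |>
    osfr::osf_ls_files(type = "file", pattern = pattern) |>
    osfr::osf_download(path = dest, conflicts = "overwrite")
}

# Usage (requires internet access and the real node ID, which is an
# assumption to be filled in by the reader):
# download_processed_files("<osf-node-id>", pattern = "2023")
```

The function is only defined, not called, so nothing is downloaded until a valid node ID is supplied.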

Methods

Source of Data

The data used in this report come from the following sources:

Brazilian Food and Nutrition Surveillance System (SISVAN):
- Microdata on nutritional status monitoring in Brazil (Sistema de Vigilância Alimentar e Nutricional et al., n.d.), the primary dataset for this pipeline.
Brazilian Institute of Geography and Statistics (IBGE):
- Official codes and metadata for Brazilian municipalities, incorporated via the geobr R package (Pereira & Goncalves, n.d.), used to normalize IBGE municipality codes and enrich the analysis with geographic information.
Department of Informatics of the Brazilian Unified Health System (DATASUS):
- Annual population estimates by municipality, age, and sex for Brazil (Comitê de Gestão de Indicadores et al., n.d.), used to calculate SISVAN’s population coverage.

The DATASUS population estimates used in this pipeline are processed through a separate reproducible workflow, available here (Vartanian & Carvalho, 2025).
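To illustrate how population estimates can feed a coverage indicator, the sketch below assumes that coverage is the number of distinct individuals monitored divided by the estimated population of the same municipality and age range. Both this definition and the toy numbers are assumptions made for illustration; they are not taken from the pipeline.

```r
# Toy population estimates by municipality and age (illustrative values).
population <- data.frame(
  municipality_code = c(1100015L, 1100015L, 1100023L, 1100023L),
  age = c(0L, 1L, 0L, 1L),
  population = c(300L, 310L, 1200L, 1250L)
)

# Toy monitoring records; individual "A" was assessed twice.
monitored <- data.frame(
  id = c("A", "A", "B", "C", "D"),
  municipality_code = c(1100015L, 1100015L, 1100015L, 1100023L, 1100023L)
)

# Total population per municipality.
pop_total <- tapply(population$population, population$municipality_code, sum)

# Count each individual once per municipality.
unique_ids <- unique(monitored[c("id", "municipality_code")])
n_monitored <- tapply(unique_ids$id, unique_ids$municipality_code, length)

coverage <- data.frame(
  municipality_code = as.integer(names(pop_total)),
  population = as.vector(pop_total),
  n_monitored = as.vector(n_monitored[names(pop_total)])
)

# Assumed definition: distinct individuals / estimated population.
coverage$coverage_pct <- 100 * coverage$n_monitored / coverage$population

print(coverage)
```

Counting distinct identifiers rather than records matters here, because one child can have several assessments in a year.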

Data Munging

The data munging follows the data science workflow outlined by Wickham et al. (2023), as illustrated in Figure 1. All processing was carried out using the Quarto publishing system, along with the AWK (Aho et al., 2023) and R (R Core Team, n.d.) programming languages, supported by several R packages.

For data manipulation and workflow, priority was given to packages from the tidyverse, rOpenSci and r-spatial ecosystems, as well as other packages adhering to the tidy tools manifesto (Wickham, 2023).

Figure 1: Data science workflow created by Wickham, Çetinkaya-Rundel, and Grolemund.

Data Validation

The validation steps described below are specifically designed for children aged 0–5 years. If you are working with older children or adolescents (ages 5–19 years), you should adapt the code accordingly. For these age groups, we recommend using the WHO’s anthroplus R package (Dirk Schumacher, n.d.-b).

Different validation techniques were used to ensure data quality and reliability:

- Duplicate records were removed based on unique combinations of the SISVAN identifier (id) and assessment date (date). Only the latest record for each individual on a given date was retained.
- Weight and height measurements identified as biologically implausible values (BIVs) according to World Health Organization (WHO) child growth standards (World Health Organization, 2006, 2008) were set to missing. BIVs were detected by calculating z-scores using the anthro_zscores function from the WHO anthro R package (Dirk Schumacher, n.d.-a), based on weight, height, age, and sex. Implausible values were flagged when z-scores exceeded established WHO cutoffs (typically \(|z| > 5\)). For details, see the function documentation.
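The flagging rule can be sketched on a toy table of z-scores. The column names (`zlen`, `zwei`, `zwfl`, `flen`, `fwei`, `fwfl`) follow the anthro naming used later in this report. The numeric limits below are the WHO flag ranges as we understand them (beyond ±6 for length/height-for-age, below −6 or above 5 for weight-for-age, beyond ±5 for weight-for-length/height); treat them as an assumption and check the function documentation. The z-scores themselves are made up.

```r
# Toy z-scores (illustrative values, not real data).
z <- data.frame(
  zlen = c(-0.5, -6.8, 0.2, 1.0), # length/height-for-age
  zwei = c(0.3, -1.0, 5.6, NA), # weight-for-age
  zwfl = c(0.1, 0.4, 3.0, -5.5) # weight-for-length/height
)

# Assumed WHO flag limits (see the note above).
flags <- data.frame(
  flen = as.integer(abs(z$zlen) > 6),
  fwei = as.integer(z$zwei < -6 | z$zwei > 5),
  fwfl = as.integer(abs(z$zwfl) > 5)
)

# Same logic as the BIV step of this pipeline: weight is implausible if
# weight-for-age is flagged, or if weight-for-length/height is flagged
# while height itself is not; height is implausible if `flen` is flagged.
weight_is_biv <- (flags$fwei == 1) | (flags$flen != 1 & flags$fwfl == 1)
height_is_biv <- flags$flen == 1

print(cbind(z, flags, weight_is_biv, height_is_biv))
```

In the fourth row the weight-for-age flag is missing, but the weight is still marked as implausible because `NA | TRUE` evaluates to `TRUE` in R.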

Data Categorization

Nutritional status categories are ideally determined using z-scores, as recommended by the WHO child growth standards (World Health Organization, 2006, Section C). However, SISVAN data report age only in years, rather than in days or months as required for accurate z-score calculation. This limitation introduces substantial classification error if z-scores are computed directly. Therefore, we use the nutritional status categories already provided in the SISVAN microdata and set these categories to missing when biologically implausible values (BIVs) are identified.

Code Style

The Tidyverse Tidy Tools Manifesto (Wickham, 2023), code style guide (Wickham, n.d.-a) and design principles (Wickham, n.d.-b) were followed to ensure consistency and enhance readability.

Reproducibility

The pipeline is fully reproducible and can be run again at any time. To ensure consistent results, the renv package (Ushey & Wickham, n.d.) is used to manage and restore the R environment. See the README file in the code repository to learn how to run it.

Set Environment

Load Packages

library(anthro)
library(brandr)
library(cli)
library(dplyr)
library(forcats)
library(fs)
library(geobr)
library(ggplot2)
library(ggspatial)
library(groomr) # github.com/danielvartan/groomr
library(here)
library(htmltools)
library(httr2)
library(janitor)
library(knitr)
library(labelled)
library(lubridate)
library(magrittr) # extract2(), multiply_by(), divide_by()
library(nanoparquet)
library(orbis) # github.com/danielvartan/orbis
library(osfr)
library(pal) # gitlab.com/rpkg.dev/pal
library(parallel)
library(quartabs)
library(readr)
library(rutils) # github.com/danielvartan/rutils
library(scales)
library(sf)
library(stringr)
library(tidyr)
library(utils)
library(vroom)
library(zip)

Set Data Directories

raw_data_dir <- here("data-raw")
data_dir <- here("data")

for (i in c(raw_data_dir, data_dir)) {
  if (!dir_exists(i)) dir_create(i, recurse = TRUE)
}

Set Initial Variables

The year variable represents the year of the consolidated SISVAN dataset on nutritional status.

year <- 2023

The age_limits variable defines the age range (in years) of individuals to be included in the analysis.

age_limits <- c(0, 4) # == Less than 5 years

The col_selection variable specifies the columns to be imported from the raw SISVAN microdata files.

Click here to access the microdata data dictionary (in Portuguese).

col_selection <- c(
  "CO_PESSOA_SISVAN",
  "DT_ACOMPANHAMENTO",
  "CO_MUNICIPIO_IBGE",
  "CO_CNES",
  "SG_SEXO",
  "NU_IDADE_ANO",
  "CO_RACA_COR",
  "NU_PESO",
  "NU_ALTURA",
  "PESO X IDADE",
  "PESO X ALTURA",
  "CRI. ALTURA X IDADE",
  "CRI. IMC X IDADE"
)

Download and Import IBGE Municipalities Data

See the Source of Data section for more information.

municipalities_data <- brazil_municipality(year = year)
#> ! The closest map year to 2023 is 2022. Using year 2022 instead.
#> Using year/date 2022

municipalities_data |> glimpse()
#> Rows: 5,570
#> Columns: 9
#> $ region_code       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ region            <chr> "North", "North", "North", "North", "North", "Nor…
#> $ state_code        <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ state             <chr> "Rondônia", "Rondônia", "Rondônia", "Rondônia", "…
#> $ federal_unit      <chr> "RO", "RO", "RO", "RO", "RO", "RO", "RO", "RO", "…
#> $ municipality_code <dbl> 1100015, 1100023, 1100031, 1100049, 1100056, 1100…
#> $ municipality      <chr> "Alta Floresta D'Oeste", "Ariquemes", "Cabixi", "…
#> $ latitude          <dbl> -11.935540305, -9.908462867, -13.499763460, -11.4…
#> $ longitude         <dbl> -61.99982390, -63.03326928, -60.54431358, -61.442…

Download DATASUS Population Estimates

See the Source of Data section for more information.

List Files

datasus_file_pattern <-
  "datasus-population-estimates-" |>
  paste0(year)

datasus_file <-
  raw_data_dir |>
  here(paste0(datasus_file_pattern, ".rds"))

osf_raw_data_id <- "h3pyd"

osf_raw_data_file <-
  osf_raw_data_id |>
  osf_retrieve_node() |>
  osf_ls_files(
    type = "file",
    pattern = paste0(year, ".rds")
  ) |>
  filter(str_detect(name, paste0("^", year, "\\.rds$")))

osf_raw_data_file

Download Data

osf_raw_data_file |>
  osf_download(
    path = raw_data_dir,
    conflicts = "overwrite"
  ) |>
  pull(local_path)
#> [1] "data-raw/2023.rds"

Rename File

if (file_exists(datasus_file)) {
  datasus_file |> file_delete()
}

raw_data_dir |>
  dir_ls(
    type = "file",
    regexp = paste0(year, "\\.rds$")
  ) |>
  file_move(datasus_file)

Import DATASUS Population Estimates

population_estimates_data <- datasus_file |> read_rds()

population_estimates_data |> glimpse()
#> Rows: 902,340
#> Columns: 5
#> $ year              <int> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2…
#> $ municipality_code <int> 1100015, 1100015, 1100015, 1100015, 1100015, 1100…
#> $ sex               <fct> male, male, male, male, male, male, male, male, m…
#> $ age               <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
#> $ population        <int> 171, 170, 170, 172, 178, 178, 173, 175, 179, 178,…

Download SISVAN Microdata on Nutritional Status

See the Source of Data section for more information.

The microdata files are very large. For practical reasons, some code chunks have eval: false set to prevent downloading the data each time the report is rendered. When running the pipeline in a loop or for full automation, remove these lines to enable automatic downloading.

Download Data

file <-
  "sisvan_estado_nutricional_" |>
  paste0(year, ".zip")

"https://s3.sa-east-1.amazonaws.com/ckan.saude.gov.br" |>
  path(
    "SISVAN",
    "estado_nutricional",
    file
  ) |>
  request() |>
  req_progress() |>
  req_perform(here(raw_data_dir, file))

Unzip Data

here(raw_data_dir, file) |>
  unzip(exdir = raw_data_dir)

Delete Zip Files

raw_data_dir |>
  dir_ls(type = "file", regexp = "\\.zip$") |>
  file_delete()

Check Data Dimensions

file <- file |> str_replace("\\.zip$", ".csv")

raw_data_dir |>
  here(file) |>
  peek_csv_file(
    delim = ";",
    skip = 0,
    has_header = TRUE
  )
#> The file has 34 columns, 53,981,528 rows, and 1,835,371,952 cells.

Import and Filter Data

The vroom R package, together with the AWK programming language, was used to efficiently handle large datasets and mitigate memory issues. This approach allows the pipeline to run locally on most machines, though we recommend a minimum of 12 GB of RAM for optimal performance. Alternatively, the pipeline can also be executed on cloud platforms such as Google Colab and RStudio Cloud, or using GitHub Actions larger runners.
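The idea behind the AWK pre-filter is that rows outside the age range are discarded before R ever reads them. The sketch below builds such a command with a hypothetical helper, `build_awk_filter()`, and tries it on a small temporary file when `awk` is available. The helper, the toy file, and its column layout (age in field 8) are assumptions for illustration; the pipeline assembles an equivalent command with `paste0()`.

```r
# Hypothetical helper: returns an AWK command that prints only the rows
# whose field `age_col` lies within `age_limits` (inclusive).
build_awk_filter <- function(file, age_limits, age_col = 8, sep = ";") {
  sprintf(
    "awk -F '%s' '{ if (($%d >= %d) && ($%d <= %d)) { print } }' %s",
    sep,
    age_col, age_limits[1],
    age_col, age_limits[2],
    shQuote(file, type = "sh")
  )
}

cmd <- build_awk_filter("data-raw/example.csv", age_limits = c(0, 4))
cat(cmd, "\n")

# Optional check on a toy file with the age in the eighth field.
tmp <- tempfile(fileext = ".csv")
writeLines(
  c(
    "id;a;b;c;d;e;f;age",
    "1;x;x;x;x;x;x;3",
    "2;x;x;x;x;x;x;7"
  ),
  tmp
)

if (nzchar(Sys.which("awk"))) {
  con <- pipe(build_awk_filter(tmp, age_limits = c(0, 4)))
  kept <- readLines(con)
  close(con)
  print(kept)
}
```

Because the header row fails the numeric comparison in AWK, it is filtered out along with the out-of-range rows, which is consistent with column names being supplied explicitly when the filtered stream is read.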

Define Column Names and Schema

col_names <- c(
  "CO_ACOMPANHAMENTO",
  "CO_PESSOA_SISVAN",
  "ST_PARTICIPA_ANDI",
  "CO_MUNICIPIO_IBGE",
  "SG_UF",
  "NO_MUNICIPIO",
  "CO_CNES",
  "NU_IDADE_ANO",
  "NU_FASE_VIDA",
  "DS_FASE_VIDA",
  "SG_SEXO",
  "CO_RACA_COR",
  "DS_RACA_COR",
  "CO_POVO_COMUNIDADE",
  "DS_POVO_COMUNIDADE",
  "CO_ESCOLARIDADE",
  "DS_ESCOLARIDADE",
  "DT_ACOMPANHAMENTO",
  "NU_COMPETENCIA",
  "NU_PESO",
  "NU_ALTURA",
  "DS_IMC",
  "DS_IMC_PRE_GESTACIONAL",
  "PESO X IDADE",
  "PESO X ALTURA",
  "CRI. ALTURA X IDADE",
  "CRI. IMC X IDADE",
  "ADO. ALTURA X IDADE",
  "ADO. IMC X IDADE",
  "CO_ESTADO_NUTRI_ADULTO",
  "CO_ESTADO_NUTRI_IDOSO",
  "CO_ESTADO_NUTRI_IMC_SEMGEST",
  "CO_SISTEMA_ORIGEM_ACOMP",
  "SISTEMA_ORIGEM_ACOMP"
)

schema <- cols(
  "CO_ACOMPANHAMENTO" = col_character(),
  "CO_PESSOA_SISVAN" = col_character(),
  "ST_PARTICIPA_ANDI" = col_character(),
  "CO_MUNICIPIO_IBGE" = col_integer(),
  "SG_UF" = col_factor(),
  "NO_MUNICIPIO" = col_character(),
  "CO_CNES" = col_integer(),
  "NU_IDADE_ANO" = col_integer(),
  "NU_FASE_VIDA" = col_character(), # decimal mark = "." (double)
  "DS_FASE_VIDA" = col_factor(),
  "SG_SEXO" = col_factor(),
  "CO_RACA_COR" = col_character(),
  "DS_RACA_COR" = col_factor(),
  "CO_POVO_COMUNIDADE" = col_integer(),
  "DS_POVO_COMUNIDADE" = col_factor(),
  "CO_ESCOLARIDADE" = col_character(),
  "DS_ESCOLARIDADE" = col_factor(),
  "DT_ACOMPANHAMENTO" = col_date(),
  "NU_COMPETENCIA" = col_integer(),
  "NU_PESO" = col_double(),
  "NU_ALTURA" = col_integer(),
  "DS_IMC" = col_double(),
  "DS_IMC_PRE_GESTACIONAL" = col_character(), # decimal mark = "." (double)
  "PESO X IDADE" = col_factor(),
  "PESO X ALTURA" = col_factor(),
  "CRI. ALTURA X IDADE" = col_factor(),
  "CRI. IMC X IDADE" = col_factor(),
  "ADO. ALTURA X IDADE" = col_factor(),
  "ADO. IMC X IDADE" = col_factor(),
  "CO_ESTADO_NUTRI_ADULTO" = col_factor(),
  "CO_ESTADO_NUTRI_IDOSO" = col_factor(),
  "CO_ESTADO_NUTRI_IMC_SEMGEST" = col_factor(),
  "CO_SISTEMA_ORIGEM_ACOMP" = col_integer(),
  "SISTEMA_ORIGEM_ACOMP" = col_factor()
)

Import and Filter Data

You may see warning messages about failed parsing. These warnings are expected due to minor inconsistencies in the SISVAN raw data and do not affect the overall analysis.

data <-
  vroom(
    file = pipe(
      paste0(
        "awk ",
        "-F ", # Field separator
        "';' ",
        "'{", # Program
        "if (",
        "($8 >= ",
        age_limits[1],
        ")",
        " && ",
        "($8 <= ",
        age_limits[2],
        ")",
        ") ",
        "{print}",
        "}' ",
        raw_data_dir |> here(file) # file
      )
    ),
    delim = ";",
    col_names = col_names,
    col_types = schema,
    col_select = all_of(col_selection),
    na = c("", "NA"),
    locale = locale(
      date_names = "pt",
      date_format = "%d/%m/%Y",
      time_format = "%H:%M:%S",
      decimal_mark = ",",
      grouping_mark = ".",
      tz = "America/Sao_Paulo",
      encoding = raw_data_dir |>
        here(file) |>
        guess_encoding() |>
        extract2("encoding") |>
        magrittr::extract(1)
    ),
    guess_max = 100,
    num_threads = detectCores() |>
      multiply_by(0.75) |>
      floor(),
    progress = TRUE
  )

data |> glimpse()
#> Rows: 7,290,143
#> Columns: 13
#> $ CO_PESSOA_SISVAN      <chr> "E58A0CFF79CFFAE2F4CCFB27996BAE3546A498DD", "…
#> $ DT_ACOMPANHAMENTO     <date> 2023-01-09, 2023-01-05, 2023-01-04, 2023-01-…
#> $ CO_MUNICIPIO_IBGE     <int> 150295, 431720, 351960, 521040, 353890, 42029…
#> $ CO_CNES               <int> 2312670, 2254549, 373885, 2382482, 7260431, 7…
#> $ SG_SEXO               <fct> M, M, M, F, F, F, M, M, M, F, F, F, F, M, F, …
#> $ NU_IDADE_ANO          <int> 0, 0, 2, 4, 1, 0, 2, 3, 2, 0, 1, 4, 0, 1, 3, …
#> $ CO_RACA_COR           <chr> "01", "01", "02", "04", "01", "99", "01", "03…
#> $ NU_PESO               <dbl> 9.710, 4.000, 11.900, 34.000, 9.500, 7.639, 1…
#> $ NU_ALTURA             <int> 74, 50, 87, 110, 75, 70, 85, 108, 95, NA, 90,…
#> $ `PESO X IDADE`        <fct> Peso adequado para idade, Baixo peso para a i…
#> $ `PESO X ALTURA`       <fct> Peso Adequado ou Eutrofico, Risco de sobrepes…
#> $ `CRI. ALTURA X IDADE` <fct> Estatura adequada para a idade, Muito baixa e…
#> $ `CRI. IMC X IDADE`    <fct> Eutrofia, Eutrofia, Eutrofia, Obesidade, Eutr…

Tidy Data

Rename Columns

data <-
  data |>
  clean_names() |>
  rename(
    id = co_pessoa_sisvan,
    date = dt_acompanhamento,
    municipality_code = co_municipio_ibge,
    cnes = co_cnes,
    sex = sg_sexo,
    age = nu_idade_ano,
    ethnicity = co_raca_cor,
    weight = nu_peso,
    height = nu_altura,
    weight_for_age = peso_x_idade,
    weight_for_height = peso_x_altura,
    height_for_age = cri_altura_x_idade,
    bmi_for_age = cri_imc_x_idade
  )

data |> glimpse()
#> Rows: 7,290,143
#> Columns: 13
#> $ id                <chr> "E58A0CFF79CFFAE2F4CCFB27996BAE3546A498DD", "097B…
#> $ date              <date> 2023-01-09, 2023-01-05, 2023-01-04, 2023-01-04, …
#> $ municipality_code <int> 150295, 431720, 351960, 521040, 353890, 420290, 4…
#> $ cnes              <int> 2312670, 2254549, 373885, 2382482, 7260431, 75694…
#> $ sex               <fct> M, M, M, F, F, F, M, M, M, F, F, F, F, M, F, M, M…
#> $ age               <int> 0, 0, 2, 4, 1, 0, 2, 3, 2, 0, 1, 4, 0, 1, 3, 0, 2…
#> $ ethnicity         <chr> "01", "01", "02", "04", "01", "99", "01", "03", "…
#> $ weight            <dbl> 9.710, 4.000, 11.900, 34.000, 9.500, 7.639, 12.20…
#> $ height            <int> 74, 50, 87, 110, 75, 70, 85, 108, 95, NA, 90, 105…
#> $ weight_for_age    <fct> Peso adequado para idade, Baixo peso para a idade…
#> $ weight_for_height <fct> Peso Adequado ou Eutrofico, Risco de sobrepeso, P…
#> $ height_for_age    <fct> Estatura adequada para a idade, Muito baixa estat…
#> $ bmi_for_age       <fct> Eutrofia, Eutrofia, Eutrofia, Obesidade, Eutrofia…

Standardize Columns

data <-
  data |>
  mutate(
    sex = sex |>
      as.character() |>
      case_match(
        "F" ~ "Female",
        "M" ~ "Male"
      ) |>
      factor(
        levels = c("Male", "Female"),
        ordered = FALSE
      ),
    ethnicity = ethnicity |>
      as.character() |>
      case_match(
        "01" ~ "White",
        "02" ~ "Black",
        "03" ~ "Yellow",
        "04" ~ "Brown",
        "05" ~ "Indigenous"
      ) |>
      factor(
        levels = c(
          "White",
          "Black",
          "Yellow",
          "Brown",
          "Indigenous"
        ),
        ordered = FALSE
      ),
    weight_for_age = weight_for_age |>
      as.character() |>
      case_match(
        "Muito baixo peso para a idade" ~ "Severely underweight",
        "Baixo peso para a idade" ~ "Underweight",
        "Peso adequado para idade" ~ "Normal",
        "Peso elevado para a idade" ~ "High"
      ) |>
      factor(
        levels = c(
          "Severely underweight",
          "Underweight",
          "Normal",
          "High"
        ),
        ordered = TRUE
      ),
    weight_for_height = weight_for_height |>
      as.character() |>
      case_match(
        "Magreza acentuada" ~ "Severe wasted",
        "Magreza" ~ "Wasted",
        "Peso Adequado ou Eutrofico" ~ "Normal",
        "Risco de sobrepeso" ~ "Possible risk of overweight",
        "Sobrepeso" ~ "Overweight",
        "Obesidade" ~ "Obese"
      ) |>
      factor(
        levels = c(
          "Severe wasted",
          "Wasted",
          "Normal",
          "Possible risk of overweight",
          "Overweight",
          "Obese"
        ),
        ordered = TRUE
      ),
    height_for_age = height_for_age |>
      as.character() |>
      case_match(
        "Muito baixa estatura para idade" ~ "Severely stunted",
        "Baixa estatura para idade" ~ "Stunted",
        "Estatura adequada para a idade" ~ "Normal"
      ) |>
      factor(
        levels = c(
          "Severely stunted",
          "Stunted",
          "Normal"
        ),
        ordered = TRUE
      ),
    bmi_for_age = bmi_for_age |>
      as.character() |>
      case_match(
        "Magreza acentuada" ~ "Severe wasted",
        "Magreza" ~ "Wasted",
        "Eutrofia" ~ "Normal",
        "Risco de sobrepeso" ~ "Possible risk of overweight",
        "Sobrepeso" ~ "Overweight",
        "Obesidade" ~ "Obese"
      ) |>
      factor(
        levels = c(
          "Severe wasted",
          "Wasted",
          "Normal",
          "Possible risk of overweight",
          "Overweight",
          "Obese"
        ),
        ordered = TRUE
      )
  )

data |> glimpse()
#> Rows: 7,290,143
#> Columns: 13
#> $ id                <chr> "E58A0CFF79CFFAE2F4CCFB27996BAE3546A498DD", "097B…
#> $ date              <date> 2023-01-09, 2023-01-05, 2023-01-04, 2023-01-04, …
#> $ municipality_code <int> 150295, 431720, 351960, 521040, 353890, 420290, 4…
#> $ cnes              <int> 2312670, 2254549, 373885, 2382482, 7260431, 75694…
#> $ sex               <fct> Male, Male, Male, Female, Female, Female, Male, M…
#> $ age               <int> 0, 0, 2, 4, 1, 0, 2, 3, 2, 0, 1, 4, 0, 1, 3, 0, 2…
#> $ ethnicity         <fct> White, White, Black, Brown, White, NA, White, Yel…
#> $ weight            <dbl> 9.710, 4.000, 11.900, 34.000, 9.500, 7.639, 12.20…
#> $ height            <int> 74, 50, 87, 110, 75, 70, 85, 108, 95, NA, 90, 105…
#> $ weight_for_age    <ord> Normal, Underweight, Normal, High, Normal, Normal…
#> $ weight_for_height <ord> Normal, Possible risk of overweight, Normal, Obes…
#> $ height_for_age    <ord> Normal, Severely stunted, Normal, Normal, Normal,…
#> $ bmi_for_age       <ord> Normal, Normal, Normal, Obese, Normal, Normal, No…

Transform Data

Remove Duplicates

data <-
  data |>
  arrange(desc(date)) |>
  distinct(id, date, .keep_all = TRUE)

data |> glimpse()
#> Rows: 7,273,731
#> Columns: 13
#> $ id                <chr> "49A374D16581329DA3BFB7B6852E8E1BA3F41C34", "E401…
#> $ date              <date> 2023-12-31, 2023-12-31, 2023-12-31, 2023-12-31, …
#> $ municipality_code <int> 431175, 521180, 230280, 210350, 261110, 150580, 2…
#> $ cnes              <int> 2247887, 5120659, 2478870, 2591200, 214752, 23128…
#> $ sex               <fct> Male, Male, Female, Female, Male, Male, Male, Fem…
#> $ age               <int> 2, 1, 1, 2, 4, 3, 1, 1, 4, 2, 1, 1, 3, 1, 1, 1, 1…
#> $ ethnicity         <fct> White, White, Brown, Brown, White, Yellow, Brown,…
#> $ weight            <dbl> 14.0, 11.0, 8.0, 11.5, 16.8, 7.3, 12.0, 8.0, 24.0…
#> $ height            <int> 90, 70, 60, 86, 103, 98, 77, 70, 119, NA, 83, 72,…
#> $ weight_for_age    <ord> Normal, Normal, Underweight, Normal, Normal, Seve…
#> $ weight_for_height <ord> Possible risk of overweight, Overweight, Overweig…
#> $ height_for_age    <ord> Normal, Stunted, Severely stunted, Stunted, Norma…
#> $ bmi_for_age       <ord> Normal, Obese, Obese, Normal, Normal, Severe wast…
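The deduplication step can be illustrated on a toy table using base R equivalents of `arrange()` and `distinct()`. The data are made up. Note that rows sharing the same id and date are tied when sorting by date alone, so which of them is kept depends on their order after sorting (here, the first one in the original row order).

```r
# Toy assessments: individual "A" has two records on 2023-03-01.
visits <- data.frame(
  id = c("A", "A", "B", "A"),
  date = as.Date(c("2023-03-01", "2023-03-01", "2023-03-01", "2023-05-10")),
  weight = c(10.2, 10.4, 12.0, 11.0)
)

# Sort by date, most recent first (base R counterpart of
# `arrange(desc(date))`).
sorted <- visits[order(visits$date, decreasing = TRUE), ]

# Keep the first row of each (id, date) pair (base R counterpart of
# `distinct(id, date, .keep_all = TRUE)`).
deduped <- sorted[!duplicated(sorted[c("id", "date")]), ]

print(deduped)
```

If a specific record should win within a duplicated pair, an explicit tie-breaker column would need to be added to the sort; that would be a design choice beyond what this report specifies.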

Remove Biologically Implausible Values (BIVs)

See the Data Validation section for more information.

data <-
  data |>
  mutate(
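    # `anthro_zscores()` returns a data frame of z-scores and flags;
    # a flag equal to 1 marks a biologically implausible value.
    z_scores = anthro_zscores(
      sex = as.numeric(sex), # 1 = Male, 2 = Female (factor level order)
      age = age * 12, # Age is only available in completed years
      is_age_in_month = TRUE,
      weight = weight,
      lenhei = height,
      measure = "h" # Measurements treated as standing height
    ),
    # Weight is set to missing when weight-for-age is flagged, or when
    # weight-for-length/height is flagged while height itself is not.
    weight = if_else(
      (z_scores$fwei == 1) | (z_scores$flen != 1 & z_scores$fwfl == 1),
      NA,
      weight
    ),
    # Height is set to missing when length/height-for-age is flagged.
    height = if_else(
      z_scores$flen == 1,
      NA,
      height
    ),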
    weight_for_age = if_else(
      is.na(weight),
      NA,
      weight_for_age
    ),
    weight_for_height = if_else(
      is.na(weight) | is.na(height),
      NA,
      weight_for_height
    ),
    height_for_age = if_else(
      is.na(height),
      NA,
      height_for_age
    ),
    bmi_for_age = if_else(
      is.na(weight) | is.na(height),
      NA,
      bmi_for_age
    )
  ) |>
  select(-z_scores)

data |> glimpse()
#> Rows: 7,273,731
#> Columns: 13
#> $ id                <chr> "49A374D16581329DA3BFB7B6852E8E1BA3F41C34", "E401…
#> $ date              <date> 2023-12-31, 2023-12-31, 2023-12-31, 2023-12-31, …
#> $ municipality_code <int> 431175, 521180, 230280, 210350, 261110, 150580, 2…
#> $ cnes              <int> 2247887, 5120659, 2478870, 2591200, 214752, 23128…
#> $ sex               <fct> Male, Male, Female, Female, Male, Male, Male, Fem…
#> $ age               <int> 2, 1, 1, 2, 4, 3, 1, 1, 4, 2, 1, 1, 3, 1, 1, 1, 1…
#> $ ethnicity         <fct> White, White, Brown, Brown, White, Yellow, Brown,…
#> $ weight            <dbl> 14.0, 11.0, 8.0, 11.5, 16.8, NA, 12.0, 8.0, 24.0,…
#> $ height            <int> 90, 70, 60, 86, 103, 98, 77, 70, 119, NA, 83, 72,…
#> $ weight_for_age    <ord> Normal, Normal, Underweight, Normal, Normal, NA, …
#> $ weight_for_height <ord> Possible risk of overweight, Overweight, Overweig…
#> $ height_for_age    <ord> Normal, Stunted, Severely stunted, Stunted, Norma…
#> $ bmi_for_age       <ord> Normal, Obese, Obese, Normal, Normal, NA, Overwei…

Fix Municipality Code

data <-
  data |>
  rename(municipality_code_6 = municipality_code) |>
  left_join(
    municipalities_data |>
      mutate(
        municipality_code_6 = municipality_code |>
          str_sub(1, 6) |>
          as.integer()
      ) |>
      select(municipality_code, municipality_code_6),
    by = join_by(municipality_code_6)
  ) |>
  select(-municipality_code_6) |>
  relocate(municipality_code, .after = date)

data |> glimpse()
#> Rows: 7,273,731
#> Columns: 13
#> $ id                <chr> "49A374D16581329DA3BFB7B6852E8E1BA3F41C34", "E401…
#> $ date              <date> 2023-12-31, 2023-12-31, 2023-12-31, 2023-12-31, …
#> $ municipality_code <dbl> 4311759, 5211800, 2302800, 2103505, 2611101, 1505…
#> $ cnes              <int> 2247887, 5120659, 2478870, 2591200, 214752, 23128…
#> $ sex               <fct> Male, Male, Female, Female, Male, Male, Male, Fem…
#> $ age               <int> 2, 1, 1, 2, 4, 3, 1, 1, 4, 2, 1, 1, 3, 1, 1, 1, 1…
#> $ ethnicity         <fct> White, White, Brown, Brown, White, Yellow, Brown,…
#> $ weight            <dbl> 14.0, 11.0, 8.0, 11.5, 16.8, NA, 12.0, 8.0, 24.0,…
#> $ height            <int> 90, 70, 60, 86, 103, 98, 77, 70, 119, NA, 83, 72,…
#> $ weight_for_age    <ord> Normal, Normal, Underweight, Normal, Normal, NA, …
#> $ weight_for_height <ord> Possible risk of overweight, Overweight, Overweig…
#> $ height_for_age    <ord> Normal, Stunted, Severely stunted, Stunted, Norma…
#> $ bmi_for_age       <ord> Normal, Obese, Obese, Normal, Normal, NA, Overwei…
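The code fix works because SISVAN records carry 6-digit municipality codes, while the IBGE reference table carries the full 7-digit codes; truncating the latter to six digits gives a join key. A base R sketch on a toy table follows. The three code pairs are the ones visible in the outputs above; the unmatched code `999999` is made up to show what happens when no match exists.

```r
# Reference table with full 7-digit IBGE codes (stored as integers here to
# keep the character conversion free of scientific notation).
municipalities <- data.frame(
  municipality_code = c(4311759L, 5211800L, 2302800L)
)

# Derive the 6-digit key by dropping the last digit.
municipalities$municipality_code_6 <-
  municipalities$municipality_code |>
  as.character() |>
  substr(1, 6) |>
  as.integer()

# Toy records with 6-digit codes; the last one has no match.
records <- data.frame(
  id = c("a", "b", "c"),
  municipality_code_6 = c(521180L, 431175L, 999999L)
)

# Look up the 7-digit code (base R counterpart of the `left_join()`).
records$municipality_code <- municipalities$municipality_code[
  match(records$municipality_code_6, municipalities$municipality_code_6)
]

print(records)
```

Records whose 6-digit code is absent from the reference table end up with a missing `municipality_code`, which is also how a left join behaves.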

Arrange Data

data <-
  data |>
  arrange(
    date,
    municipality_code,
    cnes,
    sex,
    age
  )

data |> glimpse()
#> Rows: 7,273,731
#> Columns: 13
#> $ id                <chr> "5A947ABCEA1CAC15C21EEFED4C4C880DF37F49EB", "0C99…
#> $ date              <date> 2023-01-01, 2023-01-01, 2023-01-01, 2023-01-01, …
#> $ municipality_code <dbl> 1302603, 1302603, 1302603, 1303700, 1303809, 1303…
#> $ cnes              <int> 2011786, 2011786, 2013932, 9970835, NA, NA, NA, 2…
#> $ sex               <fct> Male, Female, Male, Female, Male, Female, Male, F…
#> $ age               <int> 4, 3, 4, 2, 4, 0, 2, 4, 4, 1, 2, 4, 4, 4, 1, 4, 0…
#> $ ethnicity         <fct> Brown, Yellow, Brown, Yellow, Indigenous, Indigen…
#> $ weight            <dbl> 22.00, 14.70, 17.00, 11.10, 19.80, NA, 14.00, 16.…
#> $ height            <int> 115, 97, 110, 80, 100, NA, 91, 105, 110, 75, 74, …
#> $ weight_for_age    <ord> Normal, Normal, Normal, Normal, Normal, NA, Norma…
#> $ weight_for_height <ord> Normal, Normal, Normal, Normal, Overweight, NA, N…
#> $ height_for_age    <ord> Normal, Normal, Normal, Severely stunted, Normal,…
#> $ bmi_for_age       <ord> Possible risk of overweight, Normal, Normal, Poss…

Create Data Dictionary

Prepare Metadata

metadata <-
  data |>
  `var_label<-`(
    list(
      id = "Unique identifier for the individual",
      date = "Date of the individual's nutritional assessment",
      municipality_code = paste0(
        "Institute of Geography and Statistics (IBGE) code of the ",
        "municipality where the assessment was performed"
      ),
      cnes = paste0(
        "National Registry of Health Establishments (CNES) code of the ",
        "health facility where the assessment was performed"
      ),
      sex = "Sex of the individual",
      age = "Age of the individual in years",
      ethnicity = "Self-reported ethnicity/race or color of the individual",
      weight = "Weight of the individual in kilograms",
      height = "Height of the individual in centimeters",
      weight_for_age = paste0(
        "Nutritional status classification (children 0–5) based on ",
        "weight-for-age"
      ),
      weight_for_height = paste0(
        "Nutritional status classification (children 0–5) based on ",
        "weight-for-height"
      ),
      height_for_age = paste0(
        "Nutritional status classification (children 0–10) based on ",
        "height-for-age"
      ),
      bmi_for_age = paste0(
        "Nutritional status classification (children 0–10) based on ",
        "BMI-for-age"
      )
    )
  ) |>
  generate_dictionary(details = "full") |>
  convert_list_columns_to_character()

Visualize Final Data

metadata |> glimpse()
#> Rows: 13
#> Columns: 14
#> $ pos           <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
#> $ variable      <chr> "id", "date", "municipality_code", "cnes", "sex", "ag…
#> $ label         <chr> "Unique identifier for the individual", "Date of the …
#> $ col_type      <chr> "chr", "date", "dbl", "int", "fct", "int", "fct", "db…
#> $ missing       <int> 0, 0, 0, 57013, 0, 0, 1142071, 1612199, 1509643, 1612…
#> $ levels        <chr> "", "", "", "", "Male; Female", "", "White; Black; Ye…
#> $ value_labels  <chr> "", "", "", "", "", "", "", "", "", "", "", "", ""
#> $ class         <chr> "character", "Date", "numeric", "integer", "factor", …
#> $ type          <chr> "character", "double", "double", "integer", "integer"…
#> $ na_values     <chr> "", "", "", "", "", "", "", "", "", "", "", "", ""
#> $ na_range      <chr> "", "", "", "", "", "", "", "", "", "", "", "", ""
#> $ n_na          <int> 0, 0, 0, 57013, 0, 0, 1142071, 1612199, 1509643, 1612…
#> $ unique_values <int> 7237146, 365, 5569, 43701, 2, 5, 6, 11123, 92, 5, 7, …
#> $ range         <chr> "000002CB5BCC53DACE05639B3E1CD4BD1FF6C51A - FFFFFF927…

metadata

data |> glimpse()
#> Rows: 7,273,731
#> Columns: 13
#> $ id                <chr> "5A947ABCEA1CAC15C21EEFED4C4C880DF37F49EB", "0C99…
#> $ date              <date> 2023-01-01, 2023-01-01, 2023-01-01, 2023-01-01, …
#> $ municipality_code <dbl> 1302603, 1302603, 1302603, 1303700, 1303809, 1303…
#> $ cnes              <int> 2011786, 2011786, 2013932, 9970835, NA, NA, NA, 2…
#> $ sex               <fct> Male, Female, Male, Female, Male, Female, Male, F…
#> $ age               <int> 4, 3, 4, 2, 4, 0, 2, 4, 4, 1, 2, 4, 4, 4, 1, 4, 0…
#> $ ethnicity         <fct> Brown, Yellow, Brown, Yellow, Indigenous, Indigen…
#> $ weight            <dbl> 22.00, 14.70, 17.00, 11.10, 19.80, NA, 14.00, 16.…
#> $ height            <int> 115, 97, 110, 80, 100, NA, 91, 105, 110, 75, 74, …
#> $ weight_for_age    <ord> Normal, Normal, Normal, Normal, Normal, NA, Norma…
#> $ weight_for_height <ord> Normal, Normal, Normal, Normal, Overweight, NA, N…
#> $ height_for_age    <ord> Normal, Normal, Normal, Severely stunted, Normal,…
#> $ bmi_for_age       <ord> Possible risk of overweight, Normal, Normal, Poss…

data

Save Data

The processed data are available in csv, rds, and parquet formats through a dedicated repository on the Open Science Framework (OSF). See the Data Availability section for more information.

Write Data

valid_file_pattern <-
  year |>
  paste0(
    "-age-limits-",
    age_limits[1],
    "-",
    age_limits[2]
  )

data |>
  write_csv(
    here(data_dir, paste0(valid_file_pattern, ".csv"))
  )

data |>
  write_rds(
    here(data_dir, paste0(valid_file_pattern, ".rds"))
  )

data |>
  write_parquet(
    here(data_dir, paste0(valid_file_pattern, ".parquet"))
  )
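As a minimal sketch of how these file names are composed, the base R snippet below rebuilds the pattern and checks that an RDS round trip returns an identical object. The values of `year` and `age_limits` and the toy data frame are assumptions for illustration only; `saveRDS()` and `readRDS()` stand in for the `write_rds()` call used above.

```r
# Hypothetical values standing in for the pipeline's `year` and `age_limits`.
year <- 2023
age_limits <- c(0, 4)

# Same composition as `valid_file_pattern` above.
file_pattern <- paste0(year, "-age-limits-", age_limits[1], "-", age_limits[2])

# Toy data (invented) written to a temporary directory and read back.
toy <- data.frame(id = c("A", "B"), weight = c(13.5, 11.0))
rds_file <- file.path(tempdir(), paste0(file_pattern, ".rds"))
saveRDS(toy, rds_file)
restored <- readRDS(rds_file)

print(file_pattern)
print(identical(toy, restored))
```

RDS preserves column types (factors, dates, ordered factors), which is one reason to keep it alongside the more portable CSV and Parquet files.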

Write Metadata

metadata_file_pattern <-
  "metadata-" |>
  paste0(
    year,
    "-age-limits-",
    age_limits[1],
    "-",
    age_limits[2]
  )

metadata |>
  write_csv(
    here(data_dir, paste0(metadata_file_pattern, ".csv"))
  )

metadata |>
  write_rds(
    here(data_dir, paste0(metadata_file_pattern, ".rds"))
  )

metadata |>
  write_parquet(
    here(data_dir, paste0(metadata_file_pattern, ".parquet"))
  )

Explore Data

Summarize Frequencies

Some SISVAN figures diverge from official population estimates. For example, the number of children under 5 recorded as Yellow (the official category for people of East Asian descent) in 2023 appears unusually high, at about 32% of individuals in the `ethnicity` table. If you notice such inconsistencies, check the SISVAN web reports to confirm them before reporting an issue.

vars <- c(
  "sex",
  "age",
  "ethnicity",
  "weight_for_age",
  "weight_for_height",
  "height_for_age",
  "bmi_for_age"
)

panel_tabset_data <- tibble()

for (i in vars) {
  table <-
    data |>
    arrange(desc(.data[[i]])) |>
    distinct(id, .data[[i]]) |>
    group_by(.data[[i]]) |>
    summarize(n = n(), .groups = "drop") |>
    arrange(desc(n)) |>
    mutate(
      !!i := .data[[i]] |>
        as.factor() |>
        fct_na_value_to_level(level = "(NA)") |>
        fct_reorder(n) |>
        fct_relevel("(NA)", after = 0)
    ) |>
    arrange(desc(.data[[i]])) |>
    mutate(
      n_cum = cumsum(n),
      pct = n |>
        divide_by(sum(n)) |>
        multiply_by(100),
      pct_cum = pct |>
        cumsum() |>
        round(3),
      pct = pct |> round(3),
      across(
        .cols = where(is.numeric),
        .fns = \(x) prettyNum(x, big.mark = ",", decimal.mark = ".")
      )
    ) |>
    rename(
      N = n,
      `N (Cumulative)` = n_cum,
      Percent = pct,
      `Percent (Cumulative)` = pct_cum
    ) |>
    kable()

  panel_tabset_data <-
    panel_tabset_data |>
    bind_rows(
      tibble(
        label = paste0("`", i, "`"),
        table = list(table)
      )
    )
}
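The cumulative columns built in the loop above can be reproduced by hand. The following base R sketch, which uses the counts from the `sex` table of this report, is only an illustration of the arithmetic and not the pipeline's own code.

```r
# Counts copied from the `sex` frequency table in this report.
freq <- data.frame(
  sex = c("Male", "Female"),
  n = c(3713250, 3523896)
)

# Sort by descending frequency, then derive the cumulative columns.
freq <- freq[order(freq$n, decreasing = TRUE), ]
freq$n_cum <- cumsum(freq$n)
freq$pct <- round(freq$n / sum(freq$n) * 100, 3)
# Accumulate the unrounded percentages so rounding errors do not add up.
freq$pct_cum <- round(cumsum(freq$n / sum(freq$n) * 100), 3)

print(freq)
```

Rounding only after accumulating mirrors the order of operations in the loop, where `pct_cum` is computed before `pct` is rounded.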

panel_tabset_data |> render_tabset(label, table)

age	N	N (Cumulative)	Percent	Percent (Cumulative)
0	1,691,665	1,691,665	23.349	23.349
1	1,466,717	3,158,382	20.244	43.593
4	1,409,998	4,568,380	19.461	63.054
2	1,355,180	5,923,560	18.705	81.759
3	1,321,600	7,245,160	18.241	100

bmi_for_age	N	N (Cumulative)	Percent	Percent (Cumulative)
Normal	3,545,481	3,545,481	48.884	48.884
Possible risk of overweight	991,166	4,536,647	13.666	62.55
Overweight	434,183	4,970,830	5.986	68.536
Obese	228,781	5,199,611	3.154	71.69
Wasted	169,328	5,368,939	2.335	74.025
Severe wasted	88,902	5,457,841	1.226	75.251
(NA)	1,795,035	7,252,876	24.749	100

ethnicity	N	N (Cumulative)	Percent	Percent (Cumulative)
Yellow	2,345,345	2,345,345	32.407	32.407
White	2,038,029	4,383,374	28.161	60.568
Brown	1,486,639	5,870,013	20.542	81.11
Black	174,120	6,044,133	2.406	83.515
Indigenous	53,156	6,097,289	0.734	84.25
(NA)	1,139,857	7,237,146	15.75	100

height_for_age	N	N (Cumulative)	Percent	Percent (Cumulative)
Normal	5,020,425	5,020,425	69.265	69.265
Stunted	425,016	5,445,441	5.864	75.129
Severely stunted	298,757	5,744,198	4.122	79.251
(NA)	1,503,933	7,248,131	20.749	100

sex	N	N (Cumulative)	Percent	Percent (Cumulative)
Male	3,713,250	3,713,250	51.308	51.308
Female	3,523,896	7,237,146	48.692	100

weight_for_age	N	N (Cumulative)	Percent	Percent (Cumulative)
Normal	5,106,342	5,106,342	70.472	70.472
High	305,772	5,412,114	4.22	74.692
Underweight	166,943	5,579,057	2.304	76.996
Severely underweight	60,807	5,639,864	0.839	77.835
(NA)	1,606,032	7,245,896	22.165	100

weight_for_height	N	N (Cumulative)	Percent	Percent (Cumulative)
Normal	3,660,833	3,660,833	50.483	50.483
Possible risk of overweight	983,190	4,644,023	13.558	64.041
Overweight	392,527	5,036,550	5.413	69.454
Obese	203,342	5,239,892	2.804	72.258
Wasted	143,713	5,383,605	1.982	74.239
Severe wasted	67,486	5,451,091	0.931	75.17
(NA)	1,800,585	7,251,676	24.83	100
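The cumulative totals differ slightly between the tables above (for example, 7,237,146 for `sex` versus 7,252,876 for `bmi_for_age`). One plausible explanation, given that the counting code keeps distinct `id` and value pairs, is that an individual whose classification changed between assessments contributes once per distinct value. The base R sketch below illustrates this with invented records; it is not the authors' code.

```r
# Invented records: individual "A" changes category between assessments.
records <- data.frame(
  id = c("A", "A", "B", "B", "C"),
  bmi_for_age = c("Normal", "Overweight", "Normal", "Normal", NA)
)

# Keep one row per (id, value) pair, as `distinct(id, value)` would.
pairs <- unique(records[c("id", "bmi_for_age")])

n_pairs <- nrow(pairs)               # rows that enter the frequency table
n_ids <- length(unique(records$id))  # distinct individuals
freq <- table(pairs$bmi_for_age, useNA = "ifany")

print(c(pairs = n_pairs, individuals = n_ids))
print(freq)
```

Under this reading, the table totals count individual-category pairs, so they can exceed the number of distinct individuals for variables that vary over time.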

Plot Bar Charts

panel_tabset_data <- tibble()

for (i in vars) {
  plot <-
    data |>
    mutate(
      !!i := as.character(.data[[i]]),
      !!i := ifelse(
        str_length(.data[[i]]) > 30,
        paste0(str_sub(.data[[i]], 1, 27), "..."),
        .data[[i]]
      ) |>
        fct_na_value_to_level(level = "(NA)")
    ) |>
    group_by(.data[[i]]) |>
    summarize(n = n(), .groups = "drop") |>
    mutate(
      rel = n |>
        divide_by(sum(n)) |>
        round(2),
      !!i := .data[[i]] |>
        fct_reorder(n) |>
        fct_relevel("(NA)", after = 0)
    ) |>
    ggplot(
      aes(
        x = .data[[i]],
        y = n,
        label = percent(rel)
      )
    ) +
    geom_col(fill = get_brand_color("green")) +
    geom_text(
      hjust = -0.25,
      size = 3
    ) +
    coord_flip() +
    scale_y_continuous(
      expand = expansion(mult = c(0.05, 0.15))
    ) +
    labs(x = NULL, y = NULL) +
    theme(
      axis.text.y = element_text(size = 8)
    )

  panel_tabset_data <-
    panel_tabset_data |>
    bind_rows(
      tibble(
        label = paste0("`", i, "`"),
        plot = list(plot)
      )
    )
}
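The label handling in the loop above shortens any category name longer than 30 characters to its first 27 characters plus an ellipsis. A base R sketch of the same rule follows; `truncate_label()` is a hypothetical helper and the second label is invented.

```r
# Hypothetical helper: cap labels at `max_chars`, ending with "...".
truncate_label <- function(x, max_chars = 30) {
  ifelse(
    nchar(x) > max_chars,
    paste0(substr(x, 1, max_chars - 3), "..."),
    x
  )
}

labels <- c(
  "Possible risk of overweight",                   # 27 characters, kept as is
  "A hypothetical label that is clearly too long"  # invented, gets truncated
)

out <- truncate_label(labels)
print(out)
print(nchar(out))
```

All category names shown in this report fit within the limit, so the rule mainly guards against longer labels in other age groups or variables.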

panel_tabset_data |> render_tabset(label, plot)

Summarize Descriptive Statistics

vars <- c(
  "date",
  "weight",
  "height"
)

panel_tabset_data <- tibble()

for (i in vars) {
  table <-
    data |>
    arrange(desc(.data[[i]])) |>
    distinct(id, .data[[i]]) |>
    stats_summary(i) |>
    mutate(
      name = c(
        "Class",
        "N",
        "N (Without Missing)",
        "N (Missing)",
        "Mean",
        "Variance",
        "Standard Deviation",
        "Minimum",
        "1st Quartile (Q1)",
        "Median",
        "3rd Quartile (Q3)",
        "Maximum",
        "Interquartile Range (IQR)",
        "Range",
        "Skewness",
        "Kurtosis"
      ),
      value = if_else(
        !str_detect(value, "\\d{4}-\\d{2}-\\d{2}"),
        value |>
          prettyNum(big.mark = ",", decimal.mark = ".") |>
          str_trim(),
        value
      )
    ) |>
    rename(
      Name = name,
      Value = value
    ) |>
    kable()

  panel_tabset_data <-
    panel_tabset_data |>
    bind_rows(
      tibble(
        label = paste0("`", i, "`"),
        table = list(table)
      )
    )
}

panel_tabset_data |> render_tabset(label, table)

date

Name	Value
Class	Date
N	7,273,731
N (Without Missing)	7,273,731
N (Missing)	0
Mean	2023-09-11
Variance	675,402,587.067,832s (~21.4 years)
Standard Deviation	7,639,030.27,371,018s (~12.63 weeks)
Minimum	2023-01-01
1st Quartile (Q1)	2023-07-27
Median	2023-10-04
3rd Quartile (Q3)	2023-11-21
Maximum	2023-12-31
Interquartile Range (IQR)	10,108,800s (~16.71 weeks)
Range	31,449,600s (~52 weeks)
Skewness	-0.993245646806598
Kurtosis	3.08772335061643

height

Name	Value
Class	integer
N	7,261,876
N (Without Missing)	5,758,005
N (Missing)	1,503,871
Mean	89.6279690622012
Variance	246.572889788617
Standard Deviation	15.702639580294
Minimum	38
1st Quartile (Q1)	80
Median	92
3rd Quartile (Q3)	101
Maximum	128
Interquartile Range (IQR)	21
Range	90
Skewness	-0.699042893863037
Kurtosis	3.14509557344265

weight

Name	Value
Class	numeric
N	7,262,785
N (Without Missing)	5,656,771
N (Missing)	1,606,014
Mean	13.4700866034705
Variance	18.3701448246864
Standard Deviation	4.28604069330733
Minimum	0.945
1st Quartile (Q1)	11
Median	13.5
3rd Quartile (Q3)	16
Maximum	32.51
Interquartile Range (IQR)	5
Range	31.565
Skewness	-0.0440857611925257
Kurtosis	3.61371365104714
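The kurtosis values reported above lie close to 3, which is what a normal distribution gives under the non-excess (Pearson) definition. The sketch below computes moment-based skewness and kurtosis in base R under that assumption. `moment_stats()` is a hypothetical helper; the estimators actually used by `stats_summary()` are not shown in this report and may differ.

```r
# Hypothetical helper: moment-based skewness and (non-excess) kurtosis.
moment_stats <- function(x) {
  x <- x[!is.na(x)]   # drop missing values, as the summary tables do
  d <- x - mean(x)    # deviations from the mean
  m2 <- mean(d^2)     # second central moment (population variance)
  m3 <- mean(d^3)     # third central moment
  m4 <- mean(d^4)     # fourth central moment
  c(skewness = m3 / m2^1.5, kurtosis = m4 / m2^2)
}

# Symmetric toy sample with one missing value.
stats <- moment_stats(c(1, 2, 3, 4, 5, NA))
print(stats)
```

A negative skewness, as reported for `height`, indicates a longer tail towards smaller values.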

Plot Histograms

panel_tabset_data <- tibble()

for (i in vars) {
  plot <-
    data |>
    tidyr::drop_na(all_of(i)) |>
    ggplot(
      aes(
        x = .data[[i]],
        y = after_stat(count)
      )
    ) +
    geom_histogram(
      bins = 30,
      fill = get_brand_color("gray-d25"),
      color = get_brand_color("white")
    ) +
    # geom_density(
    #   color = "red",
    #   linewidth = 1
    # ) +
    labs(
      title = ifelse(
        i == "date",
        paste("Distribution of", str_to_title(i), "of Nutritional Assessments"),
        paste("Distribution of", str_to_title(i))
      ),
      x = str_to_title(i),
      y = "Frequency"
    ) +
    theme(
      axis.text.x = element_text(margin = margin(0, 0, 10, 0)),
      axis.text.y = element_text(margin = margin(0, 0, 0, 20))
    )

  panel_tabset_data <-
    panel_tabset_data |>
    bind_rows(
      tibble(
        label = paste0("`", i, "`"),
        plot = list(plot)
      )
    )
}

panel_tabset_data |> render_tabset(label, plot)

Check Relative Coverage

Transform Data

Remove Duplicates by Year

Following Silva et al. (2023, p. 4), SISVAN’s total resident population coverage is calculated from only the most recent record of each individual within each year, so all earlier records are dropped.

data <-
  data |>
  mutate(year = year(date)) |>
  arrange(desc(date)) |>
  distinct(id, year, .keep_all = TRUE) |>
  relocate(year, .after = date)
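The step above relies on sorting by descending date before keeping the first row per `id` and `year`. A base R sketch of the same idea, using invented records, is shown below as an illustration only.

```r
# Invented assessment records for three individuals.
visits <- data.frame(
  id = c("A", "A", "B", "B", "C"),
  date = as.Date(c(
    "2023-03-10", "2023-11-02", "2023-05-01", "2022-12-15", "2023-07-20"
  )),
  weight = c(10.2, 12.1, 14.0, 13.1, 9.8)
)

visits$year <- as.integer(format(visits$date, "%Y"))

# Most recent records first, then keep the first row per (id, year).
visits <- visits[order(visits$date, decreasing = TRUE), ]
latest <- visits[!duplicated(visits[c("id", "year")]), ]

print(latest)
```

Individual "B" keeps two rows because the records fall in different years, while "A" keeps only the November 2023 assessment.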

data |> glimpse()
#> Rows: 7,237,146
#> Columns: 14
#> $ id                <chr> "95978385C9FBD8D908846992EDAB7AD1566C765C", "0732…
#> $ date              <date> 2023-12-31, 2023-12-31, 2023-12-31, 2023-12-31, …
#> $ year              <dbl> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2…
#> $ municipality_code <dbl> 1100205, 1100452, 1100452, 1100452, 1100452, 1100…
#> $ cnes              <int> 5695880, 2806630, 2806630, 2806630, 9277927, 9277…
#> $ sex               <fct> Male, Male, Male, Female, Male, Female, Female, M…
#> $ age               <int> 2, 2, 3, 2, 3, 1, 4, 3, 1, 1, 3, 3, 3, 1, 3, 4, 4…
#> $ ethnicity         <fct> White, Brown, Brown, Yellow, Yellow, White, Brown…
#> $ weight            <dbl> 13.0, 14.0, 15.0, 12.0, 19.4, 11.5, 21.0, 17.0, 1…
#> $ height            <int> 80, 90, 95, 81, 103, 84, 116, 104, 76, 70, 84, 10…
#> $ weight_for_age    <ord> Normal, Normal, Normal, Normal, High, Normal, Nor…
#> $ weight_for_height <ord> Overweight, Possible risk of overweight, Normal, …
#> $ height_for_age    <ord> Stunted, Normal, Normal, Normal, Normal, Normal, …
#> $ bmi_for_age       <ord> Overweight, Possible risk of overweight, Normal, …

Summarize Data by Year

data <-
  data |>
  summarize(
    coverage = n(),
    mean_age = age |> mean(na.rm = TRUE),
    mean_weight = weight |> mean(na.rm = TRUE),
    mean_height = height |> mean(na.rm = TRUE),
    .by = c(
      "year",
      "municipality_code"
    )
  )

data |> glimpse()
#> Rows: 5,569
#> Columns: 6
#> $ year              <dbl> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2…
#> $ municipality_code <dbl> 1100205, 1100452, 1100601, 1101807, 1200013, 1200…
#> $ coverage          <int> 12569, 1945, 246, 366, 1088, 914, 1634, 10810, 36…
#> $ mean_age          <dbl> 1.932373299, 1.665295630, 2.130081301, 1.89071038…
#> $ mean_weight       <dbl> 13.18117501, 12.74053459, 14.59693299, 13.7829060…
#> $ mean_height       <dbl> 88.81422925, 87.31281317, 95.27319588, 90.6430976…

Add Population Estimates

data <-
  population_estimates_data |>
  filter(between(age, age_limits[1], age_limits[2])) |>
  summarize(
    n = population |> sum(na.rm = TRUE),
    .by = c(
      "year",
      "municipality_code"
    )
  ) |>
  right_join(
    data,
    by = c(
      "year",
      "municipality_code"
    )
  ) |>
  rename(population = n) |>
  relocate(population, .before = coverage)
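Because the join above is a right join onto the SISVAN summary, every municipality with SISVAN records is kept, and any municipality without a population estimate receives a missing `population`. The base R sketch below shows the same behavior with `merge(..., all.y = TRUE)`. The first two municipalities reuse values shown in this report; codes 8888888 and 9999999 are hypothetical.

```r
# Population estimates (8888888 is a hypothetical code with no SISVAN records).
population <- data.frame(
  year = 2023,
  municipality_code = c(1100015, 1100023, 8888888),
  population = c(1683L, 7831L, 500L)
)

# SISVAN summary (9999999 is a hypothetical code with no population estimate).
coverage <- data.frame(
  year = 2023,
  municipality_code = c(1100015, 1100023, 9999999),
  coverage = c(1036L, 2554L, 10L)
)

# `all.y = TRUE` keeps every row of `coverage`, like `right_join()`.
joined <- merge(
  population,
  coverage,
  by = c("year", "municipality_code"),
  all.y = TRUE
)

print(joined)
```

Rows with a missing `population` propagate to a missing relative coverage later on, so they are worth inspecting before interpreting the results.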

data |> glimpse()
#> Rows: 5,569
#> Columns: 7
#> $ year              <dbl> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2…
#> $ municipality_code <dbl> 1100015, 1100023, 1100031, 1100049, 1100056, 1100…
#> $ population        <int> 1683, 7831, 356, 6540, 1256, 1048, 556, 1118, 238…
#> $ coverage          <int> 1036, 2554, 243, 3278, 780, 687, 418, 779, 1048, …
#> $ mean_age          <dbl> 1.694015444, 1.779169930, 1.547325103, 1.68700427…
#> $ mean_weight       <dbl> 12.69762531, 12.01254692, 12.92741573, 12.6632511…
#> $ mean_height       <dbl> 88.59817945, 84.05950266, 87.34444444, 86.8941267…

Add Municipality Data

data <-
  data |>
  left_join(
    municipalities_data,
    by = join_by(municipality_code),
    suffix = c(".x", "")
  ) |>
  select(-latitude, -longitude) |>
  relocate(region_code:municipality, .after = year) |>
  relocate(municipality, .after = municipality_code)

data |> glimpse()
#> Rows: 5,569
#> Columns: 13
#> $ year              <dbl> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2…
#> $ region_code       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ region            <chr> "North", "North", "North", "North", "North", "Nor…
#> $ state_code        <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ state             <chr> "Rondônia", "Rondônia", "Rondônia", "Rondônia", "…
#> $ federal_unit      <chr> "RO", "RO", "RO", "RO", "RO", "RO", "RO", "RO", "…
#> $ municipality_code <dbl> 1100015, 1100023, 1100031, 1100049, 1100056, 1100…
#> $ municipality      <chr> "Alta Floresta D'Oeste", "Ariquemes", "Cabixi", "…
#> $ population        <int> 1683, 7831, 356, 6540, 1256, 1048, 556, 1118, 238…
#> $ coverage          <int> 1036, 2554, 243, 3278, 780, 687, 418, 779, 1048, …
#> $ mean_age          <dbl> 1.694015444, 1.779169930, 1.547325103, 1.68700427…
#> $ mean_weight       <dbl> 12.69762531, 12.01254692, 12.92741573, 12.6632511…
#> $ mean_height       <dbl> 88.59817945, 84.05950266, 87.34444444, 86.8941267…

Validate Data

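The population value used here is an estimate, so it can fall below the number of individuals recorded in SISVAN. If the SISVAN coverage for a municipality exceeds the estimated population, the population value is raised to match the coverage, which caps relative coverage at 100%.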

data <-
  data |>
  mutate(
    population = case_when(
      coverage > population ~ coverage,
      TRUE ~ population
    )
  )

data |> glimpse()
#> Rows: 5,569
#> Columns: 13
#> $ year              <dbl> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2…
#> $ region_code       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ region            <chr> "North", "North", "North", "North", "North", "Nor…
#> $ state_code        <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ state             <chr> "Rondônia", "Rondônia", "Rondônia", "Rondônia", "…
#> $ federal_unit      <chr> "RO", "RO", "RO", "RO", "RO", "RO", "RO", "RO", "…
#> $ municipality_code <dbl> 1100015, 1100023, 1100031, 1100049, 1100056, 1100…
#> $ municipality      <chr> "Alta Floresta D'Oeste", "Ariquemes", "Cabixi", "…
#> $ population        <int> 1683, 7831, 356, 6540, 1256, 1048, 556, 1118, 238…
#> $ coverage          <int> 1036, 2554, 243, 3278, 780, 687, 418, 779, 1048, …
#> $ mean_age          <dbl> 1.694015444, 1.779169930, 1.547325103, 1.68700427…
#> $ mean_weight       <dbl> 12.69762531, 12.01254692, 12.92741573, 12.6632511…
#> $ mean_height       <dbl> 88.59817945, 84.05950266, 87.34444444, 86.8941267…

Calculate Relative Coverage

data <-
  data |>
  mutate(
    coverage_pct = coverage |>
      divide_by(population) |>
      multiply_by(100)
  ) |>
  relocate(coverage_pct, .after = coverage)
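The validation and coverage steps can be checked by hand. The base R sketch below uses the first three municipalities shown in this report plus one invented row in which coverage exceeds the estimated population; `pmax()` stands in for the `case_when()` call and is not the authors' code.

```r
toy <- data.frame(
  population = c(1683, 7831, 356, 100),  # last row is invented
  coverage = c(1036, 2554, 243, 120)     # last row exceeds its population
)

# Raise the population to the coverage where the estimate is too low.
toy$population <- pmax(toy$population, toy$coverage)

# Relative coverage as a percentage of the (adjusted) population.
toy$coverage_pct <- toy$coverage / toy$population * 100

print(round(toy$coverage_pct, 3))
```

The first three values agree with the `coverage_pct` column in the output of this section, and the invented row is capped at 100%.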

data |> glimpse()
#> Rows: 5,569
#> Columns: 14
#> $ year              <dbl> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2…
#> $ region_code       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ region            <chr> "North", "North", "North", "North", "North", "Nor…
#> $ state_code        <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ state             <chr> "Rondônia", "Rondônia", "Rondônia", "Rondônia", "…
#> $ federal_unit      <chr> "RO", "RO", "RO", "RO", "RO", "RO", "RO", "RO", "…
#> $ municipality_code <dbl> 1100015, 1100023, 1100031, 1100049, 1100056, 1100…
#> $ municipality      <chr> "Alta Floresta D'Oeste", "Ariquemes", "Cabixi", "…
#> $ population        <int> 1683, 7831, 356, 6540, 1256, 1048, 556, 1118, 238…
#> $ coverage          <int> 1036, 2554, 243, 3278, 780, 687, 418, 779, 1048, …
#> $ coverage_pct      <dbl> 61.55674391, 32.61397012, 68.25842697, 50.1223241…
#> $ mean_age          <dbl> 1.694015444, 1.779169930, 1.547325103, 1.68700427…
#> $ mean_weight       <dbl> 12.69762531, 12.01254692, 12.92741573, 12.6632511…
#> $ mean_height       <dbl> 88.59817945, 84.05950266, 87.34444444, 86.8941267…

Arrange Data

data <-
  data |>
  arrange(
    year,
    municipality_code
  )

data |> glimpse()
#> Rows: 5,569
#> Columns: 14
#> $ year              <dbl> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2…
#> $ region_code       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ region            <chr> "North", "North", "North", "North", "North", "Nor…
#> $ state_code        <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ state             <chr> "Rondônia", "Rondônia", "Rondônia", "Rondônia", "…
#> $ federal_unit      <chr> "RO", "RO", "RO", "RO", "RO", "RO", "RO", "RO", "…
#> $ municipality_code <dbl> 1100015, 1100023, 1100031, 1100049, 1100056, 1100…
#> $ municipality      <chr> "Alta Floresta D'Oeste", "Ariquemes", "Cabixi", "…
#> $ population        <int> 1683, 7831, 356, 6540, 1256, 1048, 556, 1118, 238…
#> $ coverage          <int> 1036, 2554, 243, 3278, 780, 687, 418, 779, 1048, …
#> $ coverage_pct      <dbl> 61.55674391, 32.61397012, 68.25842697, 50.1223241…
#> $ mean_age          <dbl> 1.694015444, 1.779169930, 1.547325103, 1.68700427…
#> $ mean_weight       <dbl> 12.69762531, 12.01254692, 12.92741573, 12.6632511…
#> $ mean_height       <dbl> 88.59817945, 84.05950266, 87.34444444, 86.8941267…

Tabulate Relative Coverage by Region

data |>
  mutate(
    region = brazil_region(region_code)
  ) |>
  summarize(
    population = population |> sum(na.rm = TRUE),
    coverage = coverage |> sum(na.rm = TRUE),
    .by = "region"
  ) |>
  mutate(
    coverage_pct = coverage |>
      divide_by(population) |>
      multiply_by(100) |>
      round(3)
  ) |>
  # Sort while `coverage_pct` is still numeric. After `prettyNum()` the
  # columns are character and would be sorted lexicographically.
  arrange(desc(coverage_pct)) |>
  mutate(
    across(
      .cols = where(is.numeric),
      .fns = \(x) prettyNum(x, big.mark = ",", decimal.mark = ".")
    )
  ) |>
  rename(
    Region = region,
    Population = population,
    `SISVAN Coverage` = coverage,
    `SISVAN Coverage (%)` = coverage_pct
  ) |>
  pipe_table() |>
  cat_lines()

Region	Population	SISVAN Coverage	SISVAN Coverage (%)
Northeast	3,766,882	2,367,312	62.845
North	1,491,017	936,061	62.78
South	1,866,192	980,048	52.516
Central-West	1,163,709	581,890	50.003
Southeast	5,144,676	2,371,835	46.103
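Regional coverage is a ratio of sums (total coverage over total population), not an average of municipal percentages. The base R sketch below recomputes the regional figures from the table above and, as an extra illustration not reported by the authors, derives the implied national figure and contrasts it with the unweighted mean of the regional percentages.

```r
# Values copied from the regional table in this report.
regions <- data.frame(
  region = c("Northeast", "North", "South", "Central-West", "Southeast"),
  population = c(3766882, 1491017, 1866192, 1163709, 5144676),
  coverage = c(2367312, 936061, 980048, 581890, 2371835)
)

# Ratio of sums within each region.
regions$coverage_pct <- round(regions$coverage / regions$population * 100, 3)

# Implied national coverage (ratio of sums) versus the unweighted mean.
national_pct <- round(sum(regions$coverage) / sum(regions$population) * 100, 3)
unweighted_mean <- round(mean(regions$coverage_pct), 3)

print(regions)
print(c(national = national_pct, unweighted_mean = unweighted_mean))
```

The two summary figures differ because the Southeast, which has the largest population, also has the lowest coverage. Note that the populations used here already include the upward adjustment applied in the validation step.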

Tabulate Relative Coverage by State

data |>
  mutate(
    state = brazil_state(state_code)
  ) |>
  summarize(
    population = population |> sum(na.rm = TRUE),
    coverage = coverage |> sum(na.rm = TRUE),
    .by = "state"
  ) |>
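  mutate(
    coverage_pct = coverage |>
      divide_by(population) |>
      multiply_by(100) |>
      round(3)
  ) |>
  # Sort while `coverage_pct` is still numeric. After `prettyNum()` the
  # columns are character and would be sorted lexicographically.
  arrange(desc(coverage_pct), state) |>
  mutate(
    across(
      .cols = where(is.numeric),
      .fns = \(x) prettyNum(x, big.mark = ",", decimal.mark = ".")
    )
  ) |>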
  rename(
    State = state,
    Population = population,
    `SISVAN Coverage` = coverage,
    `SISVAN Coverage (%)` = coverage_pct
  ) |>
  pipe_table() |>
  cat_lines()

State	Population	SISVAN Coverage	SISVAN Coverage (%)
Amazonas	372,626	258,619	69.404
Piauí	223,295	152,225	68.172
Ceará	603,297	409,420	67.864
Tocantins	118,363	79,276	66.977
Maranhão	521,647	339,351	65.054
Alagoas	235,822	153,311	65.011
Pará	655,706	420,326	64.103
Bahia	917,506	579,347	63.144
Paraíba	273,869	172,150	62.859
Sergipe	151,907	91,052	59.939
Acre	74,909	43,935	58.651
Pernambuco	625,346	355,420	56.836
Minas Gerais	1,223,545	687,147	56.16
Santa Catarina	504,318	278,068	55.137
Paraná	730,202	393,975	53.954
Rio Grande do Norte	214,193	115,036	53.707
Mato Grosso	295,406	158,514	53.66
Mato Grosso do Sul	210,411	110,848	52.682
Amapá	72,115	37,543	52.06
Rondônia	126,022	62,427	49.537
Rio Grande do Sul	631,672	308,005	48.76
Goiás	469,363	227,488	48.467
Roraima	71,276	33,935	47.611
Espírito Santo	266,283	125,030	46.954
Rio de Janeiro	958,493	444,301	46.354
Distrito Federal	188,529	85,040	45.107
São Paulo	2,696,355	1,115,357	41.365
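A caveat when tabulating formatted numbers: `prettyNum()` returns character vectors, and character vectors sort lexicographically rather than numerically. All percentages in the tables above have two integer digits, so the ordering is unaffected here, but the base R sketch below shows with invented values how the two orderings can diverge.

```r
# Invented percentages with different numbers of integer digits.
pct <- c(9.5, 62.845, 100)

# Formatting turns the numbers into text.
formatted <- prettyNum(pct, big.mark = ",", decimal.mark = ".")

# Descending sort on numbers versus on text ("radix" is locale-independent).
by_number <- sort(pct, decreasing = TRUE)
by_text <- sort(formatted, decreasing = TRUE, method = "radix")

print(by_number)
print(by_text)
```

Sorting before formatting avoids the problem, which matters if the pipeline is adapted to data in which some coverage values fall below 10% or reach 100%.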

Plot Histogram by Municipality

data |>
  tidyr::drop_na(coverage_pct) |>
  ggplot(aes(x = coverage_pct)) +
  geom_histogram(
    aes(y = after_stat(density)),
    bins = 30,
    fill = get_brand_color("gray-d25"),
    color = get_brand_color("white")
  ) +
  geom_density(
    color = "red",
    linewidth = 1
  ) +
  labs(
    title = "SISVAN Coverage by Municipality (%) (Ages 0-5)",
    subtitle = paste0("Year: ", year),
    x = "Coverage (%)",
    y = "Density",
    caption = "Source: SISVAN."
  )

Plot Map by Municipality

Set Shape

shape <-
  read_municipality(
    year = year |>
      closest_geobr_year(type = "municipality"),
    showProgress = FALSE
  ) |>
  st_transform(st_crs(4326))
#> ! The closest map year to 2023 is 2022. Using year 2022 instead.
#> Using year/date 2022

Prepare Plot Data

plot_data <-
  data |>
  left_join(
    shape,
    by = join_by(municipality_code == code_muni)
  ) |>
  rename(geometry = geom) |>
  select(
    municipality_code,
    coverage_pct,
    geometry
  ) |>
  tidyr::drop_na(coverage_pct)

Plot Data

brand_div_palette <- function(x) {
  brandr:::make_color_ramp(
    n = x,
    colors = c(
      get_brand_color("dark-red"),
      # get_brand_color("white"),
      get_brand_color_mix(
        position = 950,
        color_1 = "dark-red",
        color_2 = "dark-red-triadic-blue",
        alpha = 0.5
      ),
      get_brand_color("dark-red-triadic-blue")
    )
  )
}

plot_data |>
  st_as_sf() |>
  ggplot(aes(fill = coverage_pct)) +
  geom_sf(
    color = get_brand_color("gray"),
    linewidth = 0.05
  ) +
  scale_fill_binned(
    breaks = seq(0, 100, 25),
    limits = c(0, 100),
    palette = brand_div_palette,
    na.value = get_brand_color("gray-d25")
  ) +
  annotation_scale(
    aes(),
    location = "br",
    style = "tick",
    height = unit(0.5, "lines")
  ) +
  annotation_north_arrow(
    location = "br",
    height = unit(2, "lines"),
    width = unit(2, "lines"),
    pad_x = unit(0.25, "lines"),
    pad_y = unit(1.25, "lines"),
    style = north_arrow_fancy_orienteering
  ) +
  labs(
    title = "SISVAN Coverage by Municipality (%) (Ages 0-5)",
    subtitle = paste0("Year: ", year),
    fill = NULL,
    caption = "Source: SISVAN."
  )
#> Scale on map varies by more than 10%, scale bar may be inaccurate
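The map groups relative coverage into four classes using breaks at 0, 25, 50, 75, and 100. As a rough sketch of that classification, the base R snippet below bins the first four `coverage_pct` values shown earlier (rounded to three decimals) with `cut()`. This is an approximation for illustration: how `scale_fill_binned()` treats values lying exactly on a break may differ.

```r
# First four coverage values from this report, rounded to three decimals.
coverage_pct <- c(61.557, 32.614, 68.258, 50.122)

# Four classes of 25 percentage points each; 0 is included in the first one.
bins <- cut(coverage_pct, breaks = seq(0, 100, 25), include.lowest = TRUE)

print(table(bins))
```

Counting municipalities per class in this way gives a quick numerical companion to the map.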

Citation

When using these data, you must also cite the original data sources.

To cite this work, please use the following format:

Vartanian, D., Schettino, J. P. J., & Carvalho, A. M. (2025). A reproducible pipeline for processing and analyzing SISVAN microdata on nutritional status monitoring in Brazil [Computer software]. Sustentarea Research and Extension Group, University of São Paulo. https://sustentarea.github.io/nutritional-status

A BibLaTeX entry for LaTeX users is:

@software{vartanian2025,
  title = {A reproducible pipeline for processing and analyzing SISVAN microdata on nutritional status monitoring in Brazil},
  author = {{Daniel Vartanian} and {João Pedro Junqueira Schettino} and {Aline Martins de Carvalho}},
  year = {2025},
  address = {São Paulo},
  institution = {Sustentarea Research and Extension Group, University of São Paulo},
  langid = {en},
  url = {https://sustentarea.github.io/nutritional-status}
}

License

The original data sources may be subject to their own licensing terms and conditions.

The code in this repository is licensed under the GNU General Public License Version 3, while the report is available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.

Copyright (C) 2025 Sustentarea Research and Extension Group

The code in this report is free software: you can redistribute it and/or
modify it under the terms of the GNU General Public License as published by the
Free Software Foundation, either version 3 of the License, or (at your option)
any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with
this program. If not, see <https://www.gnu.org/licenses/>.

Acknowledgments

This work is part of a research project by the Sustentarea Research and Extension Group of the University of São Paulo (USP) titled "Global syndemic: The impact of anthropogenic climate change on the health and nutrition of children under five years old attended by Brazil's public health system".

This work was supported by the Department of Science and Technology of the Secretariat of Science, Technology, and Innovation and of the Health Economic-Industrial Complex (SECTICS) of the Ministry of Health of Brazil, and the National Council for Scientific and Technological Development (CNPq) (grant no. 444588/2023-0).

References

Aho, A., Kernighan, B., & Weinberger, P. (2023). The AWK programming language. Addison-Wesley Professional. https://www.awk.dev

Bagni, U. V., & Barros, D. C. D. (2015). Erro em antropometria aplicada à avaliação nutricional nos serviços de saúde: Causas, consequências e métodos de mensuração. Nutrire, 40(2), 226–236. https://doi.org/10.4322/2316-7874.18613

Bopp, M., & Faeh, D. (2008). End-digits preference for self-reported height depends on language. BMC Public Health, 8(1), 342. https://doi.org/10.1186/1471-2458-8-342

Comitê de Gestão de Indicadores, Rede Interagencial de Informações para a Saúde, Coordenação-Geral de Informações e Análises Epidemiológicas, Secretaria de Vigilância em Saúde e Ambiente, Ministério da Saúde, & Instituto Brasileiro de Geografia e Estatística. (n.d.). População residente – Estudo de estimativas populacionais por município, idade e sexo 2000-2024 – Brasil [Resident population – Study of population estimates by municipality, age, and sex, 2000–2024 – Brazil] [Data set]. DATASUS - Tabnet. Retrieved November 16, 2023, from http://tabnet.datasus.gov.br/cgi/deftohtm.exe?ibge/cnv/popsvs2024br.def

Corsi, D. J., Perkins, J. M., & Subramanian, S. V. (2017). Child anthropometry data quality from Demographic and Health Surveys, Multiple Indicator Cluster Surveys, and National Nutrition Surveys in the West Central Africa region: Are we comparing apples and oranges? Global Health Action, 10(1), 1328185. https://doi.org/10.1080/16549716.2017.1328185

Schumacher, D. (n.d.-a). anthro: Computation of the WHO child growth standards [Computer software]. https://doi.org/10.32614/CRAN.package.anthro

Schumacher, D. (n.d.-b). anthroplus: Computation of the WHO 2007 references for school-age children and adolescents (5 to 19 years) [Computer software]. https://doi.org/10.32614/CRAN.package.anthroplus

Finaret, A. B., & Hutchinson, M. (2018). Missingness of height data from the demographic and health surveys in Africa between 1991 and 2016 was not random but is unlikely to have major implications for biases in estimating stunting prevalence or the determinants of child height. The Journal of Nutrition, 148(5), 781–789. https://doi.org/10.1093/jn/nxy037

Lawman, H. G., Ogden, C. L., Hassink, S., Mallya, G., Vander Veur, S., & Foster, G. D. (2015). Comparing methods for identifying biologically implausible values in height, weight, and body mass index among youth. American Journal of Epidemiology, 182(4), 359–365. https://doi.org/10.1093/aje/kwv057

Lyons-Amos, M., & Stones, T. (2017). Trends in demographic and health survey data quality: An analysis of age heaping over time in 34 countries in sub saharan Africa between 1987 and 2015. BMC Research Notes, 10(1), 760. https://doi.org/10.1186/s13104-017-3091-x

Mei, Z. (2007). Standard deviation of anthropometric Z-scores as a data quality assessment tool using the 2006 WHO growth standards: A cross country analysis. Bulletin of the World Health Organization, 85(6), 441–448. https://doi.org/10.2471/BLT.06.034421

Mourão, E., Gallo, C. D. O., Nascimento, F. A. D., & Jaime, P. C. (2020). Tendência temporal da cobertura do Sistema de Vigilância Alimentar e Nutricional entre crianças menores de 5 anos da região Norte do Brasil, 2008-2017. Epidemiologia e Serviços de Saúde, 29(2). https://doi.org/10.5123/S1679-49742020000200026

Nannan, N., Dorrington, R., & Bradshaw, D. (2019). Estimating completeness of birth registration in South Africa, 1996 – 2011. Bulletin of the World Health Organization, 97(7), 468–476. https://doi.org/10.2471/BLT.18.222620

Nascimento, F. A. D., Silva, S. A. D., & Jaime, P. C. (2017). Cobertura da avaliação do estado nutricional no Sistema de Vigilância Alimentar e Nutricional brasileiro: 2008 a 2013. Cadernos de Saúde Pública, 33(12). https://doi.org/10.1590/0102-311x00161516

Pereira, R. H. M., & Goncalves, C. N. (n.d.). geobr: Download official spatial data sets of Brazil [Computer software]. https://doi.org/10.32614/CRAN.package.geobr

Perumal, N., Namaste, S., Qamar, H., Aimone, A., Bassani, D. G., & Roth, D. E. (2020). Anthropometric data quality assessment in multisurvey studies of child growth. The American Journal of Clinical Nutrition, 112, 806S–815S. https://doi.org/10.1093/ajcn/nqaa162

R Core Team. (n.d.). R: A language and environment for statistical computing [Computer software]. R Foundation for Statistical Computing. https://www.R-project.org

Silva, N. de J., Silva, J. F. de M. e, Carrilho, T. R. B., Pinto, E. de J., Andrade, R. da C. S. de, Silva, S. A., Pedroso, J., Spaniol, A. M., Bortolini, G. A., Fagundes, A., Nilson, E. A. F., Fiaccone, R. L., Kac, G., Barreto, M. L., & Ribeiro-Silva, R. de C. (2023). Qualidade dos dados antropométricos infantis do Sisvan, Brasil, 2008-2017. Revista de Saúde Pública, 57(1), 62. https://doi.org/10.11606/s1518-8787.2023057004655

Sistema de Vigilância Alimentar e Nutricional, Coordenação-Geral de Alimentação e Nutrição, Departamento de Promoção da Saúde, Coordenação Setorial de Tecnologia da Informação, Secretaria de Atenção Primária à Saúde, & Ministério da Saúde. (n.d.). Microdados dos acompanhamentos de estado nutricional [Microdata on nutritional status monitoring] [Data set]. openDataSUS. Retrieved November 16, 2023, from https://opendatasus.saude.gov.br/dataset/sisvan-estado-nutricional

Ushey, K., & Wickham, H. (n.d.). renv: Project environments [Computer software]. https://doi.org/10.32614/CRAN.package.renv

Vartanian, D., & Carvalho, A. M. de. (2025). A reproducible pipeline for processing WorldClim 2.1 Historical Monthly Weather Data in Brazil [Computer software]. Sustentarea Research and Extension Center at the University of São Paulo. https://sustentarea.github.io/brazil-historical-climate

Wickham, H. (n.d.-a). The tidyverse style guide. Retrieved July 17, 2023, from https://style.tidyverse.org

Wickham, H. (n.d.-b). Tidy design principles. https://design.tidyverse.org

Wickham, H. (2023). The tidy tools manifesto. Tidyverse. https://tidyverse.tidyverse.org/articles/manifesto.html

Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for data science: Import, tidy, transform, visualize, and model data (2nd ed.). O’Reilly Media. https://r4ds.hadley.nz

World Health Organization. (2006). WHO child growth standards: Length/height-for-age, weight-for-age, weight-for-length, weight-for-height and body mass index-for-age: Methods and development. WHO Press. https://www.who.int/tools/child-growth-standards/standards

World Health Organization. (2008). Training course on child growth assessment. WHO Press. https://www.who.int/publications/i/item/9789241595070