A Reproducible Pipeline for Processing and Analyzing SISVAN Microdata on Nutritional Status Monitoring in Brazil

Authors

Daniel Vartanian, João Pedro J. Schettino, & Aline M. de Carvalho

Published

December 15, 2025

Project status: Active. The project has reached a stable, usable state and is being actively developed. Licenses: GPLv3 and CC BY-NC-SA 4.0.

Overview

This report provides a reproducible pipeline for processing and analyzing the microdata on nutritional status monitoring in Brazil from the Brazilian Food and Nutrition Surveillance System (SISVAN), focusing on the nutritional status of children aged 0–5 years (i.e., younger than 60 months).

If you are working with other age groups, you will need to adapt the code accordingly. We provide some guidance on how to do this throughout the report.

For instructions on how to run the pipeline, see the repository README.

Click here to see a report with a longitudinal analysis of the processed data.

Problem

The Food and Nutrition Surveillance System (SISVAN) is a strategic tool for monitoring the nutritional status of the Brazilian population, particularly those served by Brazil’s Unified Health System (SUS). However, despite its broad scope and importance, the anthropometric data recorded in SISVAN often suffer from accessibility and quality issues that limit their usefulness for rigorous analyses and evidence-based policymaking (Silva et al., 2023).

Multiple factors contribute to these quality concerns, including the lack of standardized measurement protocols, variability in staff training, inconsistencies in data entry and processing, and incomplete population coverage (Bagni & Barros, 2015; Corsi et al., 2017; Perumal et al., 2020). To assess and improve data quality, several indicators have been proposed and applied, such as population coverage (Mourão et al., 2020; Nascimento et al., 2017), completeness of birth dates and anthropometric measurements (Finaret & Hutchinson, 2018; Nannan et al., 2019), digit preference for age, height, and weight (Bopp & Faeh, 2008; Lyons-Amos & Stones, 2017), the percentage of biologically implausible values (Lawman et al., 2015), and the dispersion and distribution of standardized weight and height measurements (Mei, 2007; Perumal et al., 2020).

In light of these challenges, there is a need for an open and reproducible pipeline to process SISVAN microdata. Such a pipeline should facilitate broader access to the data and systematically identify, correct, and remove problematic records, thereby improving the consistency, completeness, and plausibility of the information for research and policymaking.

Data Availability

The processed data are available in csv, rds, and parquet formats via a dedicated repository on the Open Science Framework (OSF), accessible here. Each dataset is accompanied by a metadata file describing its structure and contents.

You can also retrieve these files directly from R using the osfr package.

Methods

Source of Data

The data used in this report come from the following sources:

The DATASUS population estimates used in this pipeline are processed through a separate reproducible workflow, available here (Vartanian & Carvalho, 2025).

Data Munging

The data munging follows the data science workflow outlined by Wickham et al. (2023), as illustrated in Figure 1. All processing was carried out using the Quarto publishing system, along with the AWK (Aho et al., 2023) and R (R Core Team, n.d.) programming languages, supported by several R packages.

For data manipulation and workflow, priority was given to packages from the tidyverse, rOpenSci and r-spatial ecosystems, as well as other packages adhering to the tidy tools manifesto (Wickham, 2023).

Figure 1: Data science workflow created by Wickham, Çetinkaya-Rundel, and Grolemund.

Source: Reproduced from Wickham et al. (2023).

Data Validation

The validation steps described below are specifically designed for children aged 0–5 years. If you are working with older children or adolescents (ages 5–19 years), you should adapt the code accordingly. For these age groups, we recommend using the WHO’s anthroplus R package (Dirk Schumacher, n.d.-b).

Different validation techniques were used to ensure data quality and reliability:

  • Duplicate records were removed based on unique combinations of the SISVAN identifier (id) and assessment date (date). Only the latest record for each individual on a given date was retained.
  • Weight and height measurements identified as biologically implausible values (BIVs) according to World Health Organization (WHO) child growth standards (World Health Organization, 2006, 2008) were set to missing. BIVs were detected by calculating z-scores using the anthro_zscores function from the WHO anthro R package (Dirk Schumacher, n.d.-a), based on weight, height, age, and sex. Implausible values were flagged when z-scores exceeded established WHO cutoffs (typically \(|z| > 5\)). For details, see the function documentation.
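As a rough illustration of the flag-and-blank step described above, the sketch below uses base R and made-up z-scores rather than the output of anthro_zscores. The helper flag_biv and the single cutoff of 5 are assumptions for illustration only; the anthro documentation lists the exact cutoffs used for each indicator.

```r
# Hypothetical helper: flag a z-score as implausible when it is
# available and its absolute value exceeds the cutoff.
flag_biv <- function(z, cutoff = 5) {
  as.integer(!is.na(z) & abs(z) > cutoff)
}

# Toy records with illustrative (not real) z-scores.
measurements <- data.frame(
  weight = c(9.7, 34.0, 11.9, NA),
  z_weight = c(0.3, 7.8, -0.5, NA)
)

# Flag implausible values, then set the flagged measurements to missing.
measurements$flag <- flag_biv(measurements$z_weight)
measurements$weight[measurements$flag == 1] <- NA

measurements
```

In this sketch a missing z-score is never flagged, so records that were already missing stay missing and valid records are left untouched.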

Data Categorization

Nutritional status categories are ideally determined using z-scores, as recommended by the WHO child growth standards (World Health Organization, 2006, Section C). However, SISVAN data report age only in years, rather than in days or months as required for accurate z-score calculation. This limitation introduces substantial classification error if z-scores are computed directly. Therefore, we use the nutritional status categories already provided in the SISVAN microdata and set these categories to missing when biologically implausible values (BIVs) are identified.
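A small arithmetic sketch may help show why whole years are too coarse for z-scores. Assuming the reported age is in completed years, a child recorded with a given age can be anywhere within a 12-month window, so converting years to months by multiplying by 12 (as is done later for BIV detection) always yields the lower bound of that window. The object name month_range is illustrative.

```r
# Ages in completed years, as covered by this pipeline.
age_years <- 0:4

# Range of ages in months consistent with each reported year.
month_range <- data.frame(
  age_years = age_years,
  min_months = age_years * 12,
  max_months = age_years * 12 + 11
)

# Largest possible gap between the true age and age_years * 12.
month_range$max_error_months <- month_range$max_months - month_range$min_months

month_range
```

The last row also shows that an upper limit of 4 completed years corresponds to children younger than 60 months.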

Code Style

The Tidyverse Tidy Tools Manifesto (Wickham, 2023), code style guide (Wickham, n.d.-a) and design principles (Wickham, n.d.-b) were followed to ensure consistency and enhance readability.

Reproducibility

The pipeline is fully reproducible and can be run again at any time. To ensure consistent results, the renv package (Ushey & Wickham, n.d.) is used to manage and restore the R environment. See the README file in the code repository to learn how to run it.

Set Environment

Load Packages

Set Data Directories

raw_data_dir <- here("data-raw")
data_dir <- here("data")
for (i in c(raw_data_dir, data_dir)) {
  if (!dir_exists(i)) dir_create(i, recurse = TRUE)
}

Set Initial Variables

The year variable represents the year of the consolidated SISVAN dataset on nutritional status.

year <- 2023

The age_limits variable defines the age range (in years) of individuals to be included in the analysis.

age_limits <- c(0, 4) # == Less than 5 years

The col_selection variable specifies the columns to be imported from the raw SISVAN microdata files.

Click here to access the microdata data dictionary (in Portuguese).

col_selection <- c(
  "CO_PESSOA_SISVAN",
  "DT_ACOMPANHAMENTO",
  "CO_MUNICIPIO_IBGE",
  "CO_CNES",
  "SG_SEXO",
  "NU_IDADE_ANO",
  "CO_RACA_COR",
  "NU_PESO",
  "NU_ALTURA",
  "PESO X IDADE",
  "PESO X ALTURA",
  "CRI. ALTURA X IDADE",
  "CRI. IMC X IDADE"
)

Download and Import IBGE Municipalities Data

See the Source of Data section for more information.

municipalities_data <- brazil_municipality(year = year)
#> ! The closest map year to 2023 is 2022. Using year 2022 instead.
#> Using year/date 2022
municipalities_data |> glimpse()
#> Rows: 5,570
#> Columns: 9
#> $ region_code       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ region            <chr> "North", "North", "North", "North", "North", "Nor…
#> $ state_code        <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ state             <chr> "Rondônia", "Rondônia", "Rondônia", "Rondônia", "…
#> $ federal_unit      <chr> "RO", "RO", "RO", "RO", "RO", "RO", "RO", "RO", "…
#> $ municipality_code <dbl> 1100015, 1100023, 1100031, 1100049, 1100056, 1100…
#> $ municipality      <chr> "Alta Floresta D'Oeste", "Ariquemes", "Cabixi", "…
#> $ latitude          <dbl> -11.935540305, -9.908462867, -13.499763460, -11.4…
#> $ longitude         <dbl> -61.99982390, -63.03326928, -60.54431358, -61.442…

Download DATASUS Population Estimates

See the Source of Data section for more information.

List Files

datasus_file_pattern <-
  "datasus-population-estimates-" |>
  paste0(year)
datasus_file <-
  raw_data_dir |>
  here(paste0(datasus_file_pattern, ".rds"))
osf_raw_data_id <- "h3pyd"
osf_raw_data_file <-
  osf_raw_data_id |>
  osf_retrieve_node() |>
  osf_ls_files(
    type = "file",
    pattern = paste0(year, ".rds")
  ) |>
  filter(str_detect(name, paste0("^", year, "\\.rds$")))
osf_raw_data_file

Download Data

osf_raw_data_file |>
  osf_download(
    path = raw_data_dir,
    conflicts = "overwrite"
  ) |>
  pull(local_path)
#> [1] "data-raw/2023.rds"

Rename File

if (file_exists(datasus_file)) {
  datasus_file |> file_delete()
}
raw_data_dir |>
  dir_ls(
    type = "file",
    regexp = paste0(year, "\\.rds$")
  ) |>
  file_move(datasus_file)

Import DATASUS Population Estimates

population_estimates_data <- datasus_file |> read_rds()
population_estimates_data |> glimpse()
#> Rows: 902,340
#> Columns: 5
#> $ year              <int> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2…
#> $ municipality_code <int> 1100015, 1100015, 1100015, 1100015, 1100015, 1100…
#> $ sex               <fct> male, male, male, male, male, male, male, male, m…
#> $ age               <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
#> $ population        <int> 171, 170, 170, 172, 178, 178, 173, 175, 179, 178,…

Download SISVAN Microdata on Nutritional Status

See the Source of Data section for more information.

The microdata files are very large. For practical reasons, some code chunks have eval: false set to prevent downloading the data each time the report is rendered. When running the pipeline in a loop or for full automation, remove these lines to enable automatic downloading.

Download Data

file <-
  "sisvan_estado_nutricional_" |>
  paste0(year, ".zip")
"https://s3.sa-east-1.amazonaws.com/ckan.saude.gov.br" |>
  path(
    "SISVAN",
    "estado_nutricional",
    file
  ) |>
  request() |>
  req_progress() |>
  req_perform(here(raw_data_dir, file))

Unzip Data

here(raw_data_dir, file) |>
  unzip(exdir = raw_data_dir)

Delete Zip Files

raw_data_dir |>
  dir_ls(type = "file", regexp = "\\.zip$") |>
  file_delete()

Check Data Dimensions

file <- file |> str_replace("\\.zip$", ".csv")
raw_data_dir |>
  here(file) |>
  peek_csv_file(
    delim = ";",
    skip = 0,
    has_header = TRUE
  )
#> The file has 34 columns, 53,981,528 rows, and 1,835,371,952 cells.

Import and Filter Data

The vroom R package, together with the AWK programming language, was used to efficiently handle the large dataset and mitigate memory issues. This approach allows the pipeline to run locally on most machines, though we recommend a minimum of 12 GB of RAM for optimal performance. Alternatively, the pipeline can also be executed on cloud platforms such as Google Colab and RStudio Cloud, or using GitHub Actions large runners.
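To make the pre-filtering idea concrete, the sketch below assembles the kind of AWK command that keeps only rows whose age column falls within the chosen limits, so that R never has to load the full file. The helper build_awk_filter and the file path are hypothetical; the column position (8, for NU_IDADE_ANO) and the semicolon separator follow the raw file layout used in this report.

```r
# Hypothetical helper: build an AWK command that prints only the rows
# whose age column lies within the given limits (inclusive).
build_awk_filter <- function(
  file,
  age_col = 8L,
  age_limits = c(0L, 4L),
  sep = ";"
) {
  program <- sprintf(
    "{ if (($%d >= %d) && ($%d <= %d)) { print } }",
    age_col, age_limits[1], age_col, age_limits[2]
  )

  # Quote every argument so separators and paths survive the shell.
  paste(
    "awk",
    "-F", shQuote(sep, type = "sh"),
    shQuote(program, type = "sh"),
    shQuote(file, type = "sh")
  )
}

cmd <- build_awk_filter("data-raw/example.csv")
cat(cmd, "\n")
```

A command string like this can be wrapped in pipe() and handed to a delimited-file reader, which then receives only the filtered rows.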

Define Column Names and Schema

col_names <- c(
  "CO_ACOMPANHAMENTO",
  "CO_PESSOA_SISVAN",
  "ST_PARTICIPA_ANDI",
  "CO_MUNICIPIO_IBGE",
  "SG_UF",
  "NO_MUNICIPIO",
  "CO_CNES",
  "NU_IDADE_ANO",
  "NU_FASE_VIDA",
  "DS_FASE_VIDA",
  "SG_SEXO",
  "CO_RACA_COR",
  "DS_RACA_COR",
  "CO_POVO_COMUNIDADE",
  "DS_POVO_COMUNIDADE",
  "CO_ESCOLARIDADE",
  "DS_ESCOLARIDADE",
  "DT_ACOMPANHAMENTO",
  "NU_COMPETENCIA",
  "NU_PESO",
  "NU_ALTURA",
  "DS_IMC",
  "DS_IMC_PRE_GESTACIONAL",
  "PESO X IDADE",
  "PESO X ALTURA",
  "CRI. ALTURA X IDADE",
  "CRI. IMC X IDADE",
  "ADO. ALTURA X IDADE",
  "ADO. IMC X IDADE",
  "CO_ESTADO_NUTRI_ADULTO",
  "CO_ESTADO_NUTRI_IDOSO",
  "CO_ESTADO_NUTRI_IMC_SEMGEST",
  "CO_SISTEMA_ORIGEM_ACOMP",
  "SISTEMA_ORIGEM_ACOMP"
)
schema <- cols(
  "CO_ACOMPANHAMENTO" = col_character(),
  "CO_PESSOA_SISVAN" = col_character(),
  "ST_PARTICIPA_ANDI" = col_character(),
  "CO_MUNICIPIO_IBGE" = col_integer(),
  "SG_UF" = col_factor(),
  "NO_MUNICIPIO" = col_character(),
  "CO_CNES" = col_integer(),
  "NU_IDADE_ANO" = col_integer(),
  "NU_FASE_VIDA" = col_character(), # decimal mark = "." (double)
  "DS_FASE_VIDA" = col_factor(),
  "SG_SEXO" = col_factor(),
  "CO_RACA_COR" = col_character(),
  "DS_RACA_COR" = col_factor(),
  "CO_POVO_COMUNIDADE" = col_integer(),
  "DS_POVO_COMUNIDADE" = col_factor(),
  "CO_ESCOLARIDADE" = col_character(),
  "DS_ESCOLARIDADE" = col_factor(),
  "DT_ACOMPANHAMENTO" = col_date(),
  "NU_COMPETENCIA" = col_integer(),
  "NU_PESO" = col_double(),
  "NU_ALTURA" = col_integer(),
  "DS_IMC" = col_double(),
  "DS_IMC_PRE_GESTACIONAL" = col_character(), # decimal mark = "." (double)
  "PESO X IDADE" = col_factor(),
  "PESO X ALTURA" = col_factor(),
  "CRI. ALTURA X IDADE" = col_factor(),
  "CRI. IMC X IDADE" = col_factor(),
  "ADO. ALTURA X IDADE" = col_factor(),
  "ADO. IMC X IDADE" = col_factor(),
  "CO_ESTADO_NUTRI_ADULTO" = col_factor(),
  "CO_ESTADO_NUTRI_IDOSO" = col_factor(),
  "CO_ESTADO_NUTRI_IMC_SEMGEST" = col_factor(),
  "CO_SISTEMA_ORIGEM_ACOMP" = col_integer(),
  "SISTEMA_ORIGEM_ACOMP" = col_factor()
)

Import and Filter Data

You may see warning messages about failed parsing. These warnings are expected due to minor inconsistencies in the SISVAN raw data and do not affect the overall analysis.

data <-
  vroom(
    file = pipe(
      paste0(
        "awk ",
        "-F ", # Field separator
        "';' ",
        "'{", # Program
        "if (",
        "($8 >= ",
        age_limits[1],
        ")",
        " && ",
        "($8 <= ",
        age_limits[2],
        ")",
        ") ",
        "{print}",
        "}' ",
        raw_data_dir |> here(file) |> shQuote() # Input file, quoted for the shell
      )
    ),
    delim = ";",
    col_names = col_names,
    col_types = schema,
    col_select = all_of(col_selection),
    na = c("", "NA"),
    locale = locale(
      date_names = "pt",
      date_format = "%d/%m/%Y",
      time_format = "%H:%M:%S",
      decimal_mark = ",",
      grouping_mark = ".",
      tz = "America/Sao_Paulo",
      encoding = raw_data_dir |>
        here(file) |>
        guess_encoding() |>
        extract2("encoding") |>
        magrittr::extract(1)
    ),
    guess_max = 100,
    num_threads = detectCores() |>
      multiply_by(0.75) |>
      floor(),
    progress = TRUE
  )
data |> glimpse()
#> Rows: 7,290,143
#> Columns: 13
#> $ CO_PESSOA_SISVAN      <chr> "E58A0CFF79CFFAE2F4CCFB27996BAE3546A498DD", "…
#> $ DT_ACOMPANHAMENTO     <date> 2023-01-09, 2023-01-05, 2023-01-04, 2023-01-…
#> $ CO_MUNICIPIO_IBGE     <int> 150295, 431720, 351960, 521040, 353890, 42029…
#> $ CO_CNES               <int> 2312670, 2254549, 373885, 2382482, 7260431, 7…
#> $ SG_SEXO               <fct> M, M, M, F, F, F, M, M, M, F, F, F, F, M, F, …
#> $ NU_IDADE_ANO          <int> 0, 0, 2, 4, 1, 0, 2, 3, 2, 0, 1, 4, 0, 1, 3, …
#> $ CO_RACA_COR           <chr> "01", "01", "02", "04", "01", "99", "01", "03…
#> $ NU_PESO               <dbl> 9.710, 4.000, 11.900, 34.000, 9.500, 7.639, 1…
#> $ NU_ALTURA             <int> 74, 50, 87, 110, 75, 70, 85, 108, 95, NA, 90,…
#> $ `PESO X IDADE`        <fct> Peso adequado para idade, Baixo peso para a i…
#> $ `PESO X ALTURA`       <fct> Peso Adequado ou Eutrofico, Risco de sobrepes…
#> $ `CRI. ALTURA X IDADE` <fct> Estatura adequada para a idade, Muito baixa e…
#> $ `CRI. IMC X IDADE`    <fct> Eutrofia, Eutrofia, Eutrofia, Obesidade, Eutr…

Tidy Data

Rename Columns

data <-
  data |>
  clean_names() |>
  rename(
    id = co_pessoa_sisvan,
    date = dt_acompanhamento,
    municipality_code = co_municipio_ibge,
    cnes = co_cnes,
    sex = sg_sexo,
    age = nu_idade_ano,
    ethnicity = co_raca_cor,
    weight = nu_peso,
    height = nu_altura,
    weight_for_age = peso_x_idade,
    weight_for_height = peso_x_altura,
    height_for_age = cri_altura_x_idade,
    bmi_for_age = cri_imc_x_idade
  )
data |> glimpse()
#> Rows: 7,290,143
#> Columns: 13
#> $ id                <chr> "E58A0CFF79CFFAE2F4CCFB27996BAE3546A498DD", "097B…
#> $ date              <date> 2023-01-09, 2023-01-05, 2023-01-04, 2023-01-04, …
#> $ municipality_code <int> 150295, 431720, 351960, 521040, 353890, 420290, 4…
#> $ cnes              <int> 2312670, 2254549, 373885, 2382482, 7260431, 75694…
#> $ sex               <fct> M, M, M, F, F, F, M, M, M, F, F, F, F, M, F, M, M…
#> $ age               <int> 0, 0, 2, 4, 1, 0, 2, 3, 2, 0, 1, 4, 0, 1, 3, 0, 2…
#> $ ethnicity         <chr> "01", "01", "02", "04", "01", "99", "01", "03", "…
#> $ weight            <dbl> 9.710, 4.000, 11.900, 34.000, 9.500, 7.639, 12.20…
#> $ height            <int> 74, 50, 87, 110, 75, 70, 85, 108, 95, NA, 90, 105…
#> $ weight_for_age    <fct> Peso adequado para idade, Baixo peso para a idade…
#> $ weight_for_height <fct> Peso Adequado ou Eutrofico, Risco de sobrepeso, P…
#> $ height_for_age    <fct> Estatura adequada para a idade, Muito baixa estat…
#> $ bmi_for_age       <fct> Eutrofia, Eutrofia, Eutrofia, Obesidade, Eutrofia…

Standardize Columns

data <-
  data |>
  mutate(
    sex = sex |>
      as.character() |>
      case_match(
        "F" ~ "Female",
        "M" ~ "Male"
      ) |>
      factor(
        levels = c("Male", "Female"),
        ordered = FALSE
      ),
    ethnicity = ethnicity |>
      as.character() |>
      case_match(
        "01" ~ "White",
        "02" ~ "Black",
        "03" ~ "Yellow",
        "04" ~ "Brown",
        "05" ~ "Indigenous"
      ) |>
      factor(
        levels = c(
          "White",
          "Black",
          "Yellow",
          "Brown",
          "Indigenous"
        ),
        ordered = FALSE
      ),
    weight_for_age = weight_for_age |>
      as.character() |>
      case_match(
        "Muito baixo peso para a idade" ~ "Severely underweight",
        "Baixo peso para a idade" ~ "Underweight",
        "Peso adequado para idade" ~ "Normal",
        "Peso elevado para a idade" ~ "High"
      ) |>
      factor(
        levels = c(
          "Severely underweight",
          "Underweight",
          "Normal",
          "High"
        ),
        ordered = TRUE
      ),
    weight_for_height = weight_for_height |>
      as.character() |>
      case_match(
        "Magreza acentuada" ~ "Severe wasted",
        "Magreza" ~ "Wasted",
        "Peso Adequado ou Eutrofico" ~ "Normal",
        "Risco de sobrepeso" ~ "Possible risk of overweight",
        "Sobrepeso" ~ "Overweight",
        "Obesidade" ~ "Obese"
      ) |>
      factor(
        levels = c(
          "Severe wasted",
          "Wasted",
          "Normal",
          "Possible risk of overweight",
          "Overweight",
          "Obese"
        ),
        ordered = TRUE
      ),
    height_for_age = height_for_age |>
      as.character() |>
      case_match(
        "Muito baixa estatura para idade" ~ "Severely stunted",
        "Baixa estatura para idade" ~ "Stunted",
        "Estatura adequada para a idade" ~ "Normal"
      ) |>
      factor(
        levels = c(
          "Severely stunted",
          "Stunted",
          "Normal"
        ),
        ordered = TRUE
      ),
    bmi_for_age = bmi_for_age |>
      as.character() |>
      case_match(
        "Magreza acentuada" ~ "Severe wasted",
        "Magreza" ~ "Wasted",
        "Eutrofia" ~ "Normal",
        "Risco de sobrepeso" ~ "Possible risk of overweight",
        "Sobrepeso" ~ "Overweight",
        "Obesidade" ~ "Obese"
      ) |>
      factor(
        levels = c(
          "Severe wasted",
          "Wasted",
          "Normal",
          "Possible risk of overweight",
          "Overweight",
          "Obese"
        ),
        ordered = TRUE
      )
  )
data |> glimpse()
#> Rows: 7,290,143
#> Columns: 13
#> $ id                <chr> "E58A0CFF79CFFAE2F4CCFB27996BAE3546A498DD", "097B…
#> $ date              <date> 2023-01-09, 2023-01-05, 2023-01-04, 2023-01-04, …
#> $ municipality_code <int> 150295, 431720, 351960, 521040, 353890, 420290, 4…
#> $ cnes              <int> 2312670, 2254549, 373885, 2382482, 7260431, 75694…
#> $ sex               <fct> Male, Male, Male, Female, Female, Female, Male, M…
#> $ age               <int> 0, 0, 2, 4, 1, 0, 2, 3, 2, 0, 1, 4, 0, 1, 3, 0, 2…
#> $ ethnicity         <fct> White, White, Black, Brown, White, NA, White, Yel…
#> $ weight            <dbl> 9.710, 4.000, 11.900, 34.000, 9.500, 7.639, 12.20…
#> $ height            <int> 74, 50, 87, 110, 75, 70, 85, 108, 95, NA, 90, 105…
#> $ weight_for_age    <ord> Normal, Underweight, Normal, High, Normal, Normal…
#> $ weight_for_height <ord> Normal, Possible risk of overweight, Normal, Obes…
#> $ height_for_age    <ord> Normal, Severely stunted, Normal, Normal, Normal,…
#> $ bmi_for_age       <ord> Normal, Normal, Normal, Obese, Normal, Normal, No…

Transform Data

Remove Duplicates

data <-
  data |>
  arrange(desc(date)) |>
  distinct(id, date, .keep_all = TRUE)
data |> glimpse()
#> Rows: 7,273,731
#> Columns: 13
#> $ id                <chr> "49A374D16581329DA3BFB7B6852E8E1BA3F41C34", "E401…
#> $ date              <date> 2023-12-31, 2023-12-31, 2023-12-31, 2023-12-31, …
#> $ municipality_code <int> 431175, 521180, 230280, 210350, 261110, 150580, 2…
#> $ cnes              <int> 2247887, 5120659, 2478870, 2591200, 214752, 23128…
#> $ sex               <fct> Male, Male, Female, Female, Male, Male, Male, Fem…
#> $ age               <int> 2, 1, 1, 2, 4, 3, 1, 1, 4, 2, 1, 1, 3, 1, 1, 1, 1…
#> $ ethnicity         <fct> White, White, Brown, Brown, White, Yellow, Brown,…
#> $ weight            <dbl> 14.0, 11.0, 8.0, 11.5, 16.8, 7.3, 12.0, 8.0, 24.0…
#> $ height            <int> 90, 70, 60, 86, 103, 98, 77, 70, 119, NA, 83, 72,…
#> $ weight_for_age    <ord> Normal, Normal, Underweight, Normal, Normal, Seve…
#> $ weight_for_height <ord> Possible risk of overweight, Overweight, Overweig…
#> $ height_for_age    <ord> Normal, Stunted, Severely stunted, Stunted, Norma…
#> $ bmi_for_age       <ord> Normal, Obese, Obese, Normal, Normal, Severe wast…
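The deduplication rule can be illustrated on a toy table. The sketch below uses base R instead of the dplyr verbs above, and the records are made up: rows are sorted by descending date, and only the first row of each (id, date) combination is kept.

```r
# Toy records: individual "A" has two entries on the same date.
records <- data.frame(
  id = c("A", "A", "B", "B"),
  date = as.Date(c("2023-03-01", "2023-03-01", "2023-03-01", "2023-05-10")),
  weight = c(10.2, 10.4, 12.0, 12.5)
)

# Sort by descending date; ties keep their original relative order.
records <- records[order(-as.numeric(records$date)), ]

# Keep the first row of each (id, date) combination.
deduplicated <- records[!duplicated(records[c("id", "date")]), ]

deduplicated
```

Because the date carries no time of day, which of two same-day records survives in this sketch depends only on their order in the input.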

Remove Biologically Implausible Values (BIVs)

See the Data Validation section for more information.

data <-
  data |>
  mutate(
    z_scores = anthro_zscores(
      sex = as.numeric(sex),
      age = age * 12,
      is_age_in_month = TRUE,
      weight = weight,
      lenhei = height,
      measure = "h"
    ),
    weight = if_else(
      (z_scores$fwei == 1) | (z_scores$flen != 1 & z_scores$fwfl == 1),
      NA,
      weight
    ),
    height = if_else(
      z_scores$flen == 1,
      NA,
      height
    ),
    weight_for_age = if_else(
      is.na(weight),
      NA,
      weight_for_age
    ),
    weight_for_height = if_else(
      is.na(weight) | is.na(height),
      NA,
      weight_for_height
    ),
    height_for_age = if_else(
      is.na(height),
      NA,
      height_for_age
    ),
    bmi_for_age = if_else(
      is.na(weight) | is.na(height),
      NA,
      bmi_for_age
    )
  ) |>
  select(-z_scores)
data |> glimpse()
#> Rows: 7,273,731
#> Columns: 13
#> $ id                <chr> "49A374D16581329DA3BFB7B6852E8E1BA3F41C34", "E401…
#> $ date              <date> 2023-12-31, 2023-12-31, 2023-12-31, 2023-12-31, …
#> $ municipality_code <int> 431175, 521180, 230280, 210350, 261110, 150580, 2…
#> $ cnes              <int> 2247887, 5120659, 2478870, 2591200, 214752, 23128…
#> $ sex               <fct> Male, Male, Female, Female, Male, Male, Male, Fem…
#> $ age               <int> 2, 1, 1, 2, 4, 3, 1, 1, 4, 2, 1, 1, 3, 1, 1, 1, 1…
#> $ ethnicity         <fct> White, White, Brown, Brown, White, Yellow, Brown,…
#> $ weight            <dbl> 14.0, 11.0, 8.0, 11.5, 16.8, NA, 12.0, 8.0, 24.0,…
#> $ height            <int> 90, 70, 60, 86, 103, 98, 77, 70, 119, NA, 83, 72,…
#> $ weight_for_age    <ord> Normal, Normal, Underweight, Normal, Normal, NA, …
#> $ weight_for_height <ord> Possible risk of overweight, Overweight, Overweig…
#> $ height_for_age    <ord> Normal, Stunted, Severely stunted, Stunted, Norma…
#> $ bmi_for_age       <ord> Normal, Obese, Obese, Normal, Normal, NA, Overwei…

Fix Municipality Code

data <-
  data |>
  rename(municipality_code_6 = municipality_code) |>
  left_join(
    municipalities_data |>
      mutate(
        municipality_code_6 = municipality_code |>
          str_sub(1, 6) |>
          as.integer()
      ) |>
      select(municipality_code, municipality_code_6),
    by = join_by(municipality_code_6)
  ) |>
  select(-municipality_code_6) |>
  relocate(municipality_code, .after = date)
data |> glimpse()
#> Rows: 7,273,731
#> Columns: 13
#> $ id                <chr> "49A374D16581329DA3BFB7B6852E8E1BA3F41C34", "E401…
#> $ date              <date> 2023-12-31, 2023-12-31, 2023-12-31, 2023-12-31, …
#> $ municipality_code <dbl> 4311759, 5211800, 2302800, 2103505, 2611101, 1505…
#> $ cnes              <int> 2247887, 5120659, 2478870, 2591200, 214752, 23128…
#> $ sex               <fct> Male, Male, Female, Female, Male, Male, Male, Fem…
#> $ age               <int> 2, 1, 1, 2, 4, 3, 1, 1, 4, 2, 1, 1, 3, 1, 1, 1, 1…
#> $ ethnicity         <fct> White, White, Brown, Brown, White, Yellow, Brown,…
#> $ weight            <dbl> 14.0, 11.0, 8.0, 11.5, 16.8, NA, 12.0, 8.0, 24.0,…
#> $ height            <int> 90, 70, 60, 86, 103, 98, 77, 70, 119, NA, 83, 72,…
#> $ weight_for_age    <ord> Normal, Normal, Underweight, Normal, Normal, NA, …
#> $ weight_for_height <ord> Possible risk of overweight, Overweight, Overweig…
#> $ height_for_age    <ord> Normal, Stunted, Severely stunted, Stunted, Norma…
#> $ bmi_for_age       <ord> Normal, Obese, Obese, Normal, Normal, NA, Overwei…
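The join above relies on the SISVAN municipality code matching the first six digits of the seven-digit IBGE code. The base R sketch below reproduces that lookup on a few codes taken from the outputs in this report; the object names and the unmatched code 999999 are illustrative.

```r
# Seven-digit IBGE codes (reference table).
municipalities <- data.frame(
  municipality_code = c(4311759L, 5211800L, 2302800L)
)

# Derive the six-digit key by dropping the last digit.
municipalities$municipality_code_6 <- as.integer(
  substr(as.character(municipalities$municipality_code), 1, 6)
)

# Six-digit codes as found in the microdata, plus one unknown code.
sisvan <- data.frame(
  id = c("a", "b", "c", "d"),
  municipality_code_6 = c(431175L, 521180L, 230280L, 999999L)
)

# Look up the seven-digit code; unmatched keys become NA.
idx <- match(sisvan$municipality_code_6, municipalities$municipality_code_6)
sisvan$municipality_code <- municipalities$municipality_code[idx]

sisvan
```

As with a left join, records whose six-digit code has no counterpart in the reference table are kept, with a missing seven-digit code.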

Arrange Data

data <-
  data |>
  arrange(
    date,
    municipality_code,
    cnes,
    sex,
    age
  )
data |> glimpse()
#> Rows: 7,273,731
#> Columns: 13
#> $ id                <chr> "5A947ABCEA1CAC15C21EEFED4C4C880DF37F49EB", "0C99…
#> $ date              <date> 2023-01-01, 2023-01-01, 2023-01-01, 2023-01-01, …
#> $ municipality_code <dbl> 1302603, 1302603, 1302603, 1303700, 1303809, 1303…
#> $ cnes              <int> 2011786, 2011786, 2013932, 9970835, NA, NA, NA, 2…
#> $ sex               <fct> Male, Female, Male, Female, Male, Female, Male, F…
#> $ age               <int> 4, 3, 4, 2, 4, 0, 2, 4, 4, 1, 2, 4, 4, 4, 1, 4, 0…
#> $ ethnicity         <fct> Brown, Yellow, Brown, Yellow, Indigenous, Indigen…
#> $ weight            <dbl> 22.00, 14.70, 17.00, 11.10, 19.80, NA, 14.00, 16.…
#> $ height            <int> 115, 97, 110, 80, 100, NA, 91, 105, 110, 75, 74, …
#> $ weight_for_age    <ord> Normal, Normal, Normal, Normal, Normal, NA, Norma…
#> $ weight_for_height <ord> Normal, Normal, Normal, Normal, Overweight, NA, N…
#> $ height_for_age    <ord> Normal, Normal, Normal, Severely stunted, Normal,…
#> $ bmi_for_age       <ord> Possible risk of overweight, Normal, Normal, Poss…

Create Data Dictionary

Prepare Metadata

metadata <-
  data |>
  `var_label<-`(
    list(
      id = "Unique identifier for the individual",
      date = "Date of the individual's nutritional assessment",
      municipality_code = paste0(
        "Institute of Geography and Statistics (IBGE) code of the ",
        "municipality where the assessment was performed"
      ),
      cnes = paste0(
        "National Registry of Health Establishments (CNES) code of the ",
        "health facility where the assessment was performed"
      ),
      sex = "Sex of the individual",
      age = "Age of the individual in years",
      ethnicity = "Self-reported ethnicity/race or color of the individual",
      weight = "Weight of the individual in kilograms",
      height = "Height of the individual in centimeters",
      weight_for_age = paste0(
        "Nutritional status classification (children 0–5) based on ",
        "weight-for-age"
      ),
      weight_for_height = paste0(
        "Nutritional status classification (children 0–5) based on ",
        "weight-for-height"
      ),
      height_for_age = paste0(
        "Nutritional status classification (children 0–10) based on ",
        "height-for-age"
      ),
      bmi_for_age = paste0(
        "Nutritional status classification (children 0–10) based on ",
        "BMI-for-age"
      )
    )
  ) |>
  generate_dictionary(details = "full") |>
  convert_list_columns_to_character()

Visualize Final Data

metadata |> glimpse()
#> Rows: 13
#> Columns: 14
#> $ pos           <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
#> $ variable      <chr> "id", "date", "municipality_code", "cnes", "sex", "ag…
#> $ label         <chr> "Unique identifier for the individual", "Date of the …
#> $ col_type      <chr> "chr", "date", "dbl", "int", "fct", "int", "fct", "db…
#> $ missing       <int> 0, 0, 0, 57013, 0, 0, 1142071, 1612199, 1509643, 1612…
#> $ levels        <chr> "", "", "", "", "Male; Female", "", "White; Black; Ye…
#> $ value_labels  <chr> "", "", "", "", "", "", "", "", "", "", "", "", ""
#> $ class         <chr> "character", "Date", "numeric", "integer", "factor", …
#> $ type          <chr> "character", "double", "double", "integer", "integer"…
#> $ na_values     <chr> "", "", "", "", "", "", "", "", "", "", "", "", ""
#> $ na_range      <chr> "", "", "", "", "", "", "", "", "", "", "", "", ""
#> $ n_na          <int> 0, 0, 0, 57013, 0, 0, 1142071, 1612199, 1509643, 1612…
#> $ unique_values <int> 7237146, 365, 5569, 43701, 2, 5, 6, 11123, 92, 5, 7, …
#> $ range         <chr> "000002CB5BCC53DACE05639B3E1CD4BD1FF6C51A - FFFFFF927…
metadata
data |> glimpse()
#> Rows: 7,273,731
#> Columns: 13
#> $ id                <chr> "5A947ABCEA1CAC15C21EEFED4C4C880DF37F49EB", "0C99…
#> $ date              <date> 2023-01-01, 2023-01-01, 2023-01-01, 2023-01-01, …
#> $ municipality_code <dbl> 1302603, 1302603, 1302603, 1303700, 1303809, 1303…
#> $ cnes              <int> 2011786, 2011786, 2013932, 9970835, NA, NA, NA, 2…
#> $ sex               <fct> Male, Female, Male, Female, Male, Female, Male, F…
#> $ age               <int> 4, 3, 4, 2, 4, 0, 2, 4, 4, 1, 2, 4, 4, 4, 1, 4, 0…
#> $ ethnicity         <fct> Brown, Yellow, Brown, Yellow, Indigenous, Indigen…
#> $ weight            <dbl> 22.00, 14.70, 17.00, 11.10, 19.80, NA, 14.00, 16.…
#> $ height            <int> 115, 97, 110, 80, 100, NA, 91, 105, 110, 75, 74, …
#> $ weight_for_age    <ord> Normal, Normal, Normal, Normal, Normal, NA, Norma…
#> $ weight_for_height <ord> Normal, Normal, Normal, Normal, Overweight, NA, N…
#> $ height_for_age    <ord> Normal, Normal, Normal, Severely stunted, Normal,…
#> $ bmi_for_age       <ord> Possible risk of overweight, Normal, Normal, Poss…
data

Save Data

The processed data are available in csv, rds and parquet formats through a dedicated repository on the Open Science Framework (OSF). See the Data Availability section for more information.

Write Data

valid_file_pattern <-
  year |>
  paste0(
    "-age-limits-",
    age_limits[1],
    "-",
    age_limits[2]
  )
data |>
  write_csv(
    here(data_dir, paste0(valid_file_pattern, ".csv"))
  )
data |>
  write_rds(
    here(data_dir, paste0(valid_file_pattern, ".rds"))
  )
data |>
  write_parquet(
    here(data_dir, paste0(valid_file_pattern, ".parquet"))
  )
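One practical difference between the formats is how column types survive a round trip. The sketch below uses base R readers and writers (not the readr and arrow functions used in the pipeline) and a toy data frame written to a temporary directory. It suggests that the rds file keeps ordered factors intact, while the csv file returns them as plain text that must be re-leveled after import.

```r
# Toy data with an ordered factor, mimicking a nutritional status column.
toy <- data.frame(
  id = c("A", "B"),
  height_for_age = factor(
    c("Stunted", "Normal"),
    levels = c("Severely stunted", "Stunted", "Normal"),
    ordered = TRUE
  )
)

# Same naming scheme as the pipeline: <year>-age-limits-<min>-<max>.
base_name <- paste0(2023, "-age-limits-", 0, "-", 4)
csv_file <- file.path(tempdir(), paste0(base_name, ".csv"))
rds_file <- file.path(tempdir(), paste0(base_name, ".rds"))

write.csv(toy, csv_file, row.names = FALSE)
saveRDS(toy, rds_file)

from_csv <- read.csv(csv_file)
from_rds <- readRDS(rds_file)

is.ordered(from_csv$height_for_age) # Level order is lost in csv
is.ordered(from_rds$height_for_age) # Level order is preserved in rds
```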

Write Metadata

metadata_file_pattern <-
  "metadata-" |>
  paste0(
    year,
    "-age-limits-",
    age_limits[1],
    "-",
    age_limits[2]
  )
metadata |>
  write_csv(
    here(data_dir, paste0(metadata_file_pattern, ".csv"))
  )
metadata |>
  write_rds(
    here(data_dir, paste0(metadata_file_pattern, ".rds"))
  )
metadata |>
  write_parquet(
    here(data_dir, paste0(metadata_file_pattern, ".parquet"))
  )

Explore Data

Summarize Frequencies

Some SISVAN data may show discrepancies when compared to official population estimates. For example, the reported number of children under 5 classified as yellow in 2023 appears unusually high. If you notice such inconsistencies, check the SISVAN web reports to confirm before reporting an issue.
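When reading the frequency tables in this section, it may help to know how individuals are counted. The base R sketch below mimics the counting logic on made-up records: each distinct combination of individual and category is counted once, so an individual who appears under two categories during the year (for example, two ages) contributes to both rows. Object names are illustrative.

```r
# Toy records: individual "A" is assessed three times, at ages 0, 0, and 1.
records <- data.frame(
  id = c("A", "A", "A", "B", "C"),
  age = c(0L, 0L, 1L, 1L, 2L)
)

# Keep one row per (id, age) combination.
unique_pairs <- unique(records[c("id", "age")])

# Count combinations per category and sort by descending frequency.
freq <- as.data.frame(table(age = unique_pairs$age), responseName = "n")
freq <- freq[order(-freq$n), ]

# Percentages and cumulative percentages.
freq$pct <- 100 * freq$n / sum(freq$n)
freq$pct_cum <- cumsum(freq$pct)

freq
```

Here the counts sum to 4 although there are only 3 individuals, which is one reason table totals can differ from one variable to another.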

vars <- c(
  "sex",
  "age",
  "ethnicity",
  "weight_for_age",
  "weight_for_height",
  "height_for_age",
  "bmi_for_age"
)
panel_tabset_data <- tibble()

for (i in vars) {
  table <-
    data |>
    arrange(desc(.data[[i]])) |>
    distinct(id, .data[[i]]) |>
    group_by(.data[[i]]) |>
    summarize(n = n(), .groups = "drop") |>
    arrange(desc(n)) |>
    mutate(
      !!i := .data[[i]] |>
        as.factor() |>
        fct_na_value_to_level(level = "(NA)") |>
        fct_reorder(n) |>
        fct_relevel("(NA)", after = 0)
    ) |>
    arrange(desc(.data[[i]])) |>
    mutate(
      n_cum = cumsum(n),
      pct = n |>
        divide_by(sum(n)) |>
        multiply_by(100),
      pct_cum = pct |>
        cumsum() |>
        round(3),
      pct = pct |> round(3),
      across(
        .cols = where(is.numeric),
        .fns = \(x) prettyNum(x, big.mark = ",", decimal.mark = ".")
      )
    ) |>
    rename(
      N = n,
      `N (Cumulative)` = n_cum,
      Percent = pct,
      `Percent (Cumulative)` = pct_cum
    ) |>
    kable()

  panel_tabset_data <-
    panel_tabset_data |>
    bind_rows(
      tibble(
        label = paste0("`", i, "`"),
        table = list(table)
      )
    )
}
Code
panel_tabset_data |> render_tabset(label, table)
| age | N | N (Cumulative) | Percent | Percent (Cumulative) |
|---|---|---|---|---|
| 0 | 1,691,665 | 1,691,665 | 23.349 | 23.349 |
| 1 | 1,466,717 | 3,158,382 | 20.244 | 43.593 |
| 4 | 1,409,998 | 4,568,380 | 19.461 | 63.054 |
| 2 | 1,355,180 | 5,923,560 | 18.705 | 81.759 |
| 3 | 1,321,600 | 7,245,160 | 18.241 | 100 |

| bmi_for_age | N | N (Cumulative) | Percent | Percent (Cumulative) |
|---|---|---|---|---|
| Normal | 3,545,481 | 3,545,481 | 48.884 | 48.884 |
| Possible risk of overweight | 991,166 | 4,536,647 | 13.666 | 62.55 |
| Overweight | 434,183 | 4,970,830 | 5.986 | 68.536 |
| Obese | 228,781 | 5,199,611 | 3.154 | 71.69 |
| Wasted | 169,328 | 5,368,939 | 2.335 | 74.025 |
| Severe wasted | 88,902 | 5,457,841 | 1.226 | 75.251 |
| (NA) | 1,795,035 | 7,252,876 | 24.749 | 100 |

| ethnicity | N | N (Cumulative) | Percent | Percent (Cumulative) |
|---|---|---|---|---|
| Yellow | 2,345,345 | 2,345,345 | 32.407 | 32.407 |
| White | 2,038,029 | 4,383,374 | 28.161 | 60.568 |
| Brown | 1,486,639 | 5,870,013 | 20.542 | 81.11 |
| Black | 174,120 | 6,044,133 | 2.406 | 83.515 |
| Indigenous | 53,156 | 6,097,289 | 0.734 | 84.25 |
| (NA) | 1,139,857 | 7,237,146 | 15.75 | 100 |

| height_for_age | N | N (Cumulative) | Percent | Percent (Cumulative) |
|---|---|---|---|---|
| Normal | 5,020,425 | 5,020,425 | 69.265 | 69.265 |
| Stunted | 425,016 | 5,445,441 | 5.864 | 75.129 |
| Severely stunted | 298,757 | 5,744,198 | 4.122 | 79.251 |
| (NA) | 1,503,933 | 7,248,131 | 20.749 | 100 |

| sex | N | N (Cumulative) | Percent | Percent (Cumulative) |
|---|---|---|---|---|
| Male | 3,713,250 | 3,713,250 | 51.308 | 51.308 |
| Female | 3,523,896 | 7,237,146 | 48.692 | 100 |

| weight_for_age | N | N (Cumulative) | Percent | Percent (Cumulative) |
|---|---|---|---|---|
| Normal | 5,106,342 | 5,106,342 | 70.472 | 70.472 |
| High | 305,772 | 5,412,114 | 4.22 | 74.692 |
| Underweight | 166,943 | 5,579,057 | 2.304 | 76.996 |
| Severely underweight | 60,807 | 5,639,864 | 0.839 | 77.835 |
| (NA) | 1,606,032 | 7,245,896 | 22.165 | 100 |

| weight_for_height | N | N (Cumulative) | Percent | Percent (Cumulative) |
|---|---|---|---|---|
| Normal | 3,660,833 | 3,660,833 | 50.483 | 50.483 |
| Possible risk of overweight | 983,190 | 4,644,023 | 13.558 | 64.041 |
| Overweight | 392,527 | 5,036,550 | 5.413 | 69.454 |
| Obese | 203,342 | 5,239,892 | 2.804 | 72.258 |
| Wasted | 143,713 | 5,383,605 | 1.982 | 74.239 |
| Severe wasted | 67,486 | 5,451,091 | 0.931 | 75.17 |
| (NA) | 1,800,585 | 7,251,676 | 24.83 | 100 |
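
The totals in the tables above differ slightly from one variable to another because each table counts distinct pairs of id and value: an individual recorded with two different values of a variable during the year is counted once per value. The base R sketch below illustrates this with made-up records, using unique() as a stand-in for distinct().

```r
# Made-up records: child "a" was assessed twice, at ages 1 and 2.
records <- data.frame(
  id  = c("a", "a", "b", "c"),
  sex = c("Male", "Male", "Female", "Female"),
  age = c(1, 2, 3, 3)
)

# One row per distinct id-value pair, as in `distinct(id, .data[[i]])`.
n_sex <- nrow(unique(records[, c("id", "sex")]))
n_age <- nrow(unique(records[, c("id", "age")]))
n_ids <- length(unique(records$id))

n_sex # 3: sex is constant within each child
n_age # 4: child "a" is counted under both ages
n_ids # 3: number of distinct children
```

For this reason the percentages in each table are relative to that table's own total rather than to the number of distinct individuals.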

Plot Bar Charts

Code
panel_tabset_data <- tibble()

for (i in vars) {
  plot <-
    data |>
    mutate(
      !!i := as.character(.data[[i]]),
      !!i := ifelse(
        str_length(.data[[i]]) > 30,
        paste0(str_sub(.data[[i]], 1, 27), "..."),
        .data[[i]]
      ) |>
        fct_na_value_to_level(level = "(NA)")
    ) |>
    group_by(.data[[i]]) |>
    summarize(n = n(), .groups = "drop") |>
    mutate(
      rel = n |>
        divide_by(sum(n)) |>
        round(2),
      !!i := .data[[i]] |>
        fct_reorder(n) |>
        fct_relevel("(NA)", after = 0)
    ) |>
    ggplot(
      aes(
        x = .data[[i]],
        y = n,
        label = percent(rel)
      )
    ) +
    geom_col(fill = get_brand_color("green")) +
    geom_text(
      hjust = -0.25,
      size = 3
    ) +
    coord_flip() +
    scale_y_continuous(
      expand = expansion(mult = c(0.05, 0.15))
    ) +
    labs(x = NULL, y = NULL) +
    theme(
      axis.text.y = element_text(size = 8)
    )

  panel_tabset_data <-
    panel_tabset_data |>
    bind_rows(
      tibble(
        label = paste0("`", i, "`"),
        plot = list(plot)
      )
    )
}
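
The label-shortening step in the loop above (labels longer than 30 characters are cut to 27 characters plus an ellipsis) can be tried in isolation. The sketch below is a base R equivalent; truncate_label() is a hypothetical helper written for illustration and is not part of the pipeline.

```r
# Shorten labels longer than `max_length` characters, keeping the
# first `max_length - 3` characters and appending "...".
truncate_label <- function(x, max_length = 30) {
  ifelse(
    !is.na(x) & nchar(x) > max_length,
    paste0(substr(x, 1, max_length - 3), "..."),
    x
  )
}

# Made-up labels, including a missing value.
labels <- c(
  "Normal",
  "A deliberately long category label used for testing",
  NA
)

shortened <- truncate_label(labels)
shortened
```

Short labels and missing values pass through unchanged, so the later conversion of missing values to an explicit "(NA)" level still applies.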

Summarize Descriptive Statistics

Code
vars <- c(
  "date",
  "weight",
  "height"
)
Code
panel_tabset_data <- tibble()

for (i in vars) {
  table <-
    data |>
    arrange(desc(.data[[i]])) |>
    distinct(id, .data[[i]]) |>
    stats_summary(i) |>
    mutate(
      name = c(
        "Class",
        "N",
        "N (Without Missing)",
        "N (Missing)",
        "Mean",
        "Variance",
        "Standard Deviation",
        "Minimum",
        "1st Quartile (Q1)",
        "Median",
        "3rd Quartile (Q3)",
        "Maximum",
        "Interquartile Range (IQR)",
        "Range",
        "Skewness",
        "Kurtosis"
      ),
      value = if_else(
        !str_detect(value, "\\d{4}-\\d{2}-\\d{2}"),
        value |>
          prettyNum(big.mark = ",", decimal.mark = ".") |>
          str_trim(),
        value
      )
    ) |>
    rename(
      Name = name,
      Value = value
    ) |>
    kable()

  panel_tabset_data <-
    panel_tabset_data |>
    bind_rows(
      tibble(
        label = paste0("`", i, "`"),
        table = list(table)
      )
    )
}
Code
panel_tabset_data |> render_tabset(label, table)
date

| Name | Value |
|---|---|
| Class | Date |
| N | 7,273,731 |
| N (Without Missing) | 7,273,731 |
| N (Missing) | 0 |
| Mean | 2023-09-11 |
| Variance | 675,402,587.067,832s (~21.4 years) |
| Standard Deviation | 7,639,030.27,371,018s (~12.63 weeks) |
| Minimum | 2023-01-01 |
| 1st Quartile (Q1) | 2023-07-27 |
| Median | 2023-10-04 |
| 3rd Quartile (Q3) | 2023-11-21 |
| Maximum | 2023-12-31 |
| Interquartile Range (IQR) | 10,108,800s (~16.71 weeks) |
| Range | 31,449,600s (~52 weeks) |
| Skewness | -0.993245646806598 |
| Kurtosis | 3.08772335061643 |

height

| Name | Value |
|---|---|
| Class | integer |
| N | 7,261,876 |
| N (Without Missing) | 5,758,005 |
| N (Missing) | 1,503,871 |
| Mean | 89.6279690622012 |
| Variance | 246.572889788617 |
| Standard Deviation | 15.702639580294 |
| Minimum | 38 |
| 1st Quartile (Q1) | 80 |
| Median | 92 |
| 3rd Quartile (Q3) | 101 |
| Maximum | 128 |
| Interquartile Range (IQR) | 21 |
| Range | 90 |
| Skewness | -0.699042893863037 |
| Kurtosis | 3.14509557344265 |

weight

| Name | Value |
|---|---|
| Class | numeric |
| N | 7,262,785 |
| N (Without Missing) | 5,656,771 |
| N (Missing) | 1,606,014 |
| Mean | 13.4700866034705 |
| Variance | 18.3701448246864 |
| Standard Deviation | 4.28604069330733 |
| Minimum | 0.945 |
| 1st Quartile (Q1) | 11 |
| Median | 13.5 |
| 3rd Quartile (Q3) | 16 |
| Maximum | 32.51 |
| Interquartile Range (IQR) | 5 |
| Range | 31.565 |
| Skewness | -0.0440857611925257 |
| Kurtosis | 3.61371365104714 |
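
The internals of stats_summary() are not shown in this report. As a rough guide to reading the last two rows of each table, the sketch below computes moment-based skewness and kurtosis in base R; whether stats_summary() uses exactly these estimators is an assumption. Under this definition a normal distribution has a kurtosis of 3, which is the scale the values above appear to be on.

```r
# Moment-based skewness; missing values are dropped.
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^(3 / 2)
}

# Moment-based (non-excess) kurtosis; missing values are dropped.
moment_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2
}

# Made-up heights (cm), including a missing value.
heights <- c(70, 80, 85, 90, 92, 95, 101, NA)

moment_skewness(heights) # negative: the left tail is longer
moment_kurtosis(heights)

# A symmetric sample has zero skewness.
moment_skewness(c(1, 2, 3, 4, 5))
```

A negative skewness, as reported for date and height, indicates a longer left tail.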

Plot Histograms

Code
panel_tabset_data <- tibble()

for (i in vars) {
  plot <-
    data |>
    tidyr::drop_na(all_of(i)) |>
    ggplot(
      aes(
        x = .data[[i]],
        y = after_stat(count)
      )
    ) +
    geom_histogram(
      bins = 30,
      fill = get_brand_color("gray-d25"),
      color = get_brand_color("white")
    ) +
    # geom_density(
    #   color = "red",
    #   linewidth = 1
    # ) +
    labs(
      title = ifelse(
        i == "date",
        paste("Distribution of", str_to_title(i), "of Nutritional Assessments"),
        paste("Distribution of", str_to_title(i))
      ),
      x = str_to_title(i),
      y = "Frequency"
    ) +
    theme(
      axis.text.x = element_text(margin = margin(0, 0, 10, 0)),
      axis.text.y = element_text(margin = margin(0, 0, 0, 20))
    )

  panel_tabset_data <-
    panel_tabset_data |>
    bind_rows(
      tibble(
        label = paste0("`", i, "`"),
        plot = list(plot)
      )
    )
}
Code
panel_tabset_data |> render_tabset(label, plot)

Check Relative Coverage

Transform Data

Remove Duplicates by Year

As described in Silva et al. (2023, p. 4), to calculate SISVAN’s total resident population coverage, only the most recent record for each individual within each year is retained for analysis.

Code
data <-
  data |>
  mutate(year = year(date)) |>
  arrange(desc(date)) |>
  distinct(id, year, .keep_all = TRUE) |>
  relocate(year, .after = date)
Code
data |> glimpse()
#> Rows: 7,237,146
#> Columns: 14
#> $ id                <chr> "95978385C9FBD8D908846992EDAB7AD1566C765C", "0732…
#> $ date              <date> 2023-12-31, 2023-12-31, 2023-12-31, 2023-12-31, …
#> $ year              <dbl> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2…
#> $ municipality_code <dbl> 1100205, 1100452, 1100452, 1100452, 1100452, 1100…
#> $ cnes              <int> 5695880, 2806630, 2806630, 2806630, 9277927, 9277…
#> $ sex               <fct> Male, Male, Male, Female, Male, Female, Female, M…
#> $ age               <int> 2, 2, 3, 2, 3, 1, 4, 3, 1, 1, 3, 3, 3, 1, 3, 4, 4…
#> $ ethnicity         <fct> White, Brown, Brown, Yellow, Yellow, White, Brown…
#> $ weight            <dbl> 13.0, 14.0, 15.0, 12.0, 19.4, 11.5, 21.0, 17.0, 1…
#> $ height            <int> 80, 90, 95, 81, 103, 84, 116, 104, 76, 70, 84, 10…
#> $ weight_for_age    <ord> Normal, Normal, Normal, Normal, High, Normal, Nor…
#> $ weight_for_height <ord> Overweight, Possible risk of overweight, Normal, …
#> $ height_for_age    <ord> Stunted, Normal, Normal, Normal, Normal, Normal, …
#> $ bmi_for_age       <ord> Overweight, Possible risk of overweight, Normal, …
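
The deduplication rule can be checked on a small example. The sketch below reproduces the same logic in base R (sort from newest to oldest, then keep the first row per id and year) on made-up records; order() and duplicated() stand in for arrange() and distinct().

```r
# Made-up assessments: child "a" has two records in 2023.
records <- data.frame(
  id = c("a", "a", "b"),
  date = as.Date(c("2023-03-10", "2023-11-05", "2023-06-20")),
  weight = c(11.2, 13.0, 14.1)
)
records$year <- as.integer(format(records$date, "%Y"))

# Sort from newest to oldest, then keep the first row per id-year pair.
records <- records[order(records$date, decreasing = TRUE), ]
latest <- records[!duplicated(records[, c("id", "year")]), ]

latest
```

Only the November record of child "a" survives, so each individual contributes at most one row per year to the coverage count.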

Summarize Data by Year

Code
data <-
  data |>
  summarize(
    coverage = n(),
    mean_age = age |> mean(na.rm = TRUE),
    mean_weight = weight |> mean(na.rm = TRUE),
    mean_height = height |> mean(na.rm = TRUE),
    .by = c(
      "year",
      "municipality_code"
    )
  )
Code
data |> glimpse()
#> Rows: 5,569
#> Columns: 6
#> $ year              <dbl> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2…
#> $ municipality_code <dbl> 1100205, 1100452, 1100601, 1101807, 1200013, 1200…
#> $ coverage          <int> 12569, 1945, 246, 366, 1088, 914, 1634, 10810, 36…
#> $ mean_age          <dbl> 1.932373299, 1.665295630, 2.130081301, 1.89071038…
#> $ mean_weight       <dbl> 13.18117501, 12.74053459, 14.59693299, 13.7829060…
#> $ mean_height       <dbl> 88.81422925, 87.31281317, 95.27319588, 90.6430976…

Add Population Estimates

Code
data <-
  population_estimates_data |>
  filter(between(age, age_limits[1], age_limits[2])) |>
  summarize(
    n = population |> sum(na.rm = TRUE),
    .by = c(
      "year",
      "municipality_code"
    )
  ) |>
  right_join(
    data,
    by = c(
      "year",
      "municipality_code"
    )
  ) |>
  rename(population = n) |>
  relocate(population, .before = coverage)
Code
data |> glimpse()
#> Rows: 5,569
#> Columns: 7
#> $ year              <dbl> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2…
#> $ municipality_code <dbl> 1100015, 1100023, 1100031, 1100049, 1100056, 1100…
#> $ population        <int> 1683, 7831, 356, 6540, 1256, 1048, 556, 1118, 238…
#> $ coverage          <int> 1036, 2554, 243, 3278, 780, 687, 418, 779, 1048, …
#> $ mean_age          <dbl> 1.694015444, 1.779169930, 1.547325103, 1.68700427…
#> $ mean_weight       <dbl> 12.69762531, 12.01254692, 12.92741573, 12.6632511…
#> $ mean_height       <dbl> 88.59817945, 84.05950266, 87.34444444, 86.8941267…

Add Municipality Data

Code
data <-
  data |>
  left_join(
    municipalities_data,
    by = join_by(municipality_code),
    suffix = c(".x", "")
  ) |>
  select(-latitude, -longitude) |>
  relocate(region_code:municipality, .after = year) |>
  relocate(municipality, .after = municipality_code)
Code
data |> glimpse()
#> Rows: 5,569
#> Columns: 13
#> $ year              <dbl> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2…
#> $ region_code       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ region            <chr> "North", "North", "North", "North", "North", "Nor…
#> $ state_code        <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ state             <chr> "Rondônia", "Rondônia", "Rondônia", "Rondônia", "…
#> $ federal_unit      <chr> "RO", "RO", "RO", "RO", "RO", "RO", "RO", "RO", "…
#> $ municipality_code <dbl> 1100015, 1100023, 1100031, 1100049, 1100056, 1100…
#> $ municipality      <chr> "Alta Floresta D'Oeste", "Ariquemes", "Cabixi", "…
#> $ population        <int> 1683, 7831, 356, 6540, 1256, 1048, 556, 1118, 238…
#> $ coverage          <int> 1036, 2554, 243, 3278, 780, 687, 418, 779, 1048, …
#> $ mean_age          <dbl> 1.694015444, 1.779169930, 1.547325103, 1.68700427…
#> $ mean_weight       <dbl> 12.69762531, 12.01254692, 12.92741573, 12.6632511…
#> $ mean_height       <dbl> 88.59817945, 84.05950266, 87.34444444, 86.8941267…

Validate Data

The population value used here is an estimate. If the SISVAN coverage for a municipality exceeds the estimated population, the population value is raised to match the coverage, which caps the relative coverage at 100%.

Code
data <-
  data |>
  mutate(
    population = case_when(
      coverage > population ~ coverage,
      TRUE ~ population
    )
  )
Code
data |> glimpse()
#> Rows: 5,569
#> Columns: 13
#> $ year              <dbl> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2…
#> $ region_code       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ region            <chr> "North", "North", "North", "North", "North", "Nor…
#> $ state_code        <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ state             <chr> "Rondônia", "Rondônia", "Rondônia", "Rondônia", "…
#> $ federal_unit      <chr> "RO", "RO", "RO", "RO", "RO", "RO", "RO", "RO", "…
#> $ municipality_code <dbl> 1100015, 1100023, 1100031, 1100049, 1100056, 1100…
#> $ municipality      <chr> "Alta Floresta D'Oeste", "Ariquemes", "Cabixi", "…
#> $ population        <int> 1683, 7831, 356, 6540, 1256, 1048, 556, 1118, 238…
#> $ coverage          <int> 1036, 2554, 243, 3278, 780, 687, 418, 779, 1048, …
#> $ mean_age          <dbl> 1.694015444, 1.779169930, 1.547325103, 1.68700427…
#> $ mean_weight       <dbl> 12.69762531, 12.01254692, 12.92741573, 12.6632511…
#> $ mean_height       <dbl> 88.59817945, 84.05950266, 87.34444444, 86.8941267…
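
The effect of this adjustment can be seen on a few made-up values. The sketch below uses pmax(), which for non-missing values gives the same result as the case_when() call above.

```r
# Made-up municipalities: the second has more SISVAN records than
# estimated residents.
population <- c(1683L, 356L, 1256L)
coverage <- c(1036L, 400L, 780L)

# Raise the population to the coverage wherever coverage exceeds it.
adjusted_population <- pmax(population, coverage)

# Relative coverage before and after the adjustment.
raw_pct <- coverage / population * 100
adjusted_pct <- coverage / adjusted_population * 100

round(raw_pct, 1)
round(adjusted_pct, 1)
```

After the adjustment no municipality can show a relative coverage above 100%, at the cost of overriding the population estimate in those municipalities.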

Calculate Relative Coverage

Code
data <-
  data |>
  mutate(
    coverage_pct = coverage |>
      divide_by(population) |>
      multiply_by(100)
  ) |>
  relocate(coverage_pct, .after = coverage)
Code
data |> glimpse()
#> Rows: 5,569
#> Columns: 14
#> $ year              <dbl> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2…
#> $ region_code       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ region            <chr> "North", "North", "North", "North", "North", "Nor…
#> $ state_code        <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ state             <chr> "Rondônia", "Rondônia", "Rondônia", "Rondônia", "…
#> $ federal_unit      <chr> "RO", "RO", "RO", "RO", "RO", "RO", "RO", "RO", "…
#> $ municipality_code <dbl> 1100015, 1100023, 1100031, 1100049, 1100056, 1100…
#> $ municipality      <chr> "Alta Floresta D'Oeste", "Ariquemes", "Cabixi", "…
#> $ population        <int> 1683, 7831, 356, 6540, 1256, 1048, 556, 1118, 238…
#> $ coverage          <int> 1036, 2554, 243, 3278, 780, 687, 418, 779, 1048, …
#> $ coverage_pct      <dbl> 61.55674391, 32.61397012, 68.25842697, 50.1223241…
#> $ mean_age          <dbl> 1.694015444, 1.779169930, 1.547325103, 1.68700427…
#> $ mean_weight       <dbl> 12.69762531, 12.01254692, 12.92741573, 12.6632511…
#> $ mean_height       <dbl> 88.59817945, 84.05950266, 87.34444444, 86.8941267…

Arrange Data

Code
data <-
  data |>
  arrange(
    year,
    municipality_code
  )
Code
data |> glimpse()
#> Rows: 5,569
#> Columns: 14
#> $ year              <dbl> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2…
#> $ region_code       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ region            <chr> "North", "North", "North", "North", "North", "Nor…
#> $ state_code        <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ state             <chr> "Rondônia", "Rondônia", "Rondônia", "Rondônia", "…
#> $ federal_unit      <chr> "RO", "RO", "RO", "RO", "RO", "RO", "RO", "RO", "…
#> $ municipality_code <dbl> 1100015, 1100023, 1100031, 1100049, 1100056, 1100…
#> $ municipality      <chr> "Alta Floresta D'Oeste", "Ariquemes", "Cabixi", "…
#> $ population        <int> 1683, 7831, 356, 6540, 1256, 1048, 556, 1118, 238…
#> $ coverage          <int> 1036, 2554, 243, 3278, 780, 687, 418, 779, 1048, …
#> $ coverage_pct      <dbl> 61.55674391, 32.61397012, 68.25842697, 50.1223241…
#> $ mean_age          <dbl> 1.694015444, 1.779169930, 1.547325103, 1.68700427…
#> $ mean_weight       <dbl> 12.69762531, 12.01254692, 12.92741573, 12.6632511…
#> $ mean_height       <dbl> 88.59817945, 84.05950266, 87.34444444, 86.8941267…

Tabulate Relative Coverage by Region

Code
data |>
  mutate(
    region = brazil_region(region_code)
  ) |>
  summarize(
    population = population |> sum(na.rm = TRUE),
    coverage = coverage |> sum(na.rm = TRUE),
    .by = "region"
  ) |>
  mutate(
    coverage_pct = coverage |>
      divide_by(population) |>
      multiply_by(100) |>
      round(3)
  ) |>
  # Sort while `coverage_pct` is still numeric; `prettyNum()` below
  # converts the numeric columns to character.
  arrange(desc(coverage_pct)) |>
  mutate(
    across(
      .cols = where(is.numeric),
      .fns = \(x) prettyNum(x, big.mark = ",", decimal.mark = ".")
    )
  ) |>
  rename(
    Region = region,
    Population = population,
    `SISVAN Coverage` = coverage,
    `SISVAN Coverage (%)` = coverage_pct
  ) |>
  pipe_table() |>
  cat_lines()
| Region | Population | SISVAN Coverage | SISVAN Coverage (%) |
|---|---|---|---|
| Northeast | 3,766,882 | 2,367,312 | 62.845 |
| North | 1,491,017 | 936,061 | 62.78 |
| South | 1,866,192 | 980,048 | 52.516 |
| Central-West | 1,163,709 | 581,890 | 50.003 |
| Southeast | 5,144,676 | 2,371,835 | 46.103 |
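
Note that the regional figures are computed from summed counts (total coverage divided by total population), which weights each municipality by its population. The sketch below, with made-up numbers, shows how this differs from averaging the municipal percentages.

```r
# Made-up municipalities within one region.
population <- c(10000, 500)
coverage <- c(4000, 450)

# Pooled estimate, as in the table code: sum first, then divide.
pooled_pct <- sum(coverage) / sum(population) * 100

# Unweighted mean of the municipal percentages, for comparison.
mean_pct <- mean(coverage / population * 100)

round(pooled_pct, 3)
round(mean_pct, 3)
```

The pooled figure is dominated by the larger municipality, whereas the unweighted mean gives both municipalities the same influence.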

Tabulate Relative Coverage by State

Code
data |>
  mutate(
    state = brazil_state(state_code)
  ) |>
  summarize(
    population = population |> sum(na.rm = TRUE),
    coverage = coverage |> sum(na.rm = TRUE),
    .by = "state"
  ) |>
  arrange(state) |>
  mutate(
    coverage_pct = coverage |>
      divide_by(population) |>
      multiply_by(100) |>
      round(3)
  ) |>
  # Sort while `coverage_pct` is still numeric; `prettyNum()` below
  # converts the numeric columns to character.
  arrange(desc(coverage_pct)) |>
  mutate(
    across(
      .cols = where(is.numeric),
      .fns = \(x) prettyNum(x, big.mark = ",", decimal.mark = ".")
    )
  ) |>
  rename(
    State = state,
    Population = population,
    `SISVAN Coverage` = coverage,
    `SISVAN Coverage (%)` = coverage_pct
  ) |>
  pipe_table() |>
  cat_lines()
| State | Population | SISVAN Coverage | SISVAN Coverage (%) |
|---|---|---|---|
| Amazonas | 372,626 | 258,619 | 69.404 |
| Piauí | 223,295 | 152,225 | 68.172 |
| Ceará | 603,297 | 409,420 | 67.864 |
| Tocantins | 118,363 | 79,276 | 66.977 |
| Maranhão | 521,647 | 339,351 | 65.054 |
| Alagoas | 235,822 | 153,311 | 65.011 |
| Pará | 655,706 | 420,326 | 64.103 |
| Bahia | 917,506 | 579,347 | 63.144 |
| Paraíba | 273,869 | 172,150 | 62.859 |
| Sergipe | 151,907 | 91,052 | 59.939 |
| Acre | 74,909 | 43,935 | 58.651 |
| Pernambuco | 625,346 | 355,420 | 56.836 |
| Minas Gerais | 1,223,545 | 687,147 | 56.16 |
| Santa Catarina | 504,318 | 278,068 | 55.137 |
| Paraná | 730,202 | 393,975 | 53.954 |
| Rio Grande do Norte | 214,193 | 115,036 | 53.707 |
| Mato Grosso | 295,406 | 158,514 | 53.66 |
| Mato Grosso do Sul | 210,411 | 110,848 | 52.682 |
| Amapá | 72,115 | 37,543 | 52.06 |
| Rondônia | 126,022 | 62,427 | 49.537 |
| Rio Grande do Sul | 631,672 | 308,005 | 48.76 |
| Goiás | 469,363 | 227,488 | 48.467 |
| Roraima | 71,276 | 33,935 | 47.611 |
| Espírito Santo | 266,283 | 125,030 | 46.954 |
| Rio de Janeiro | 958,493 | 444,301 | 46.354 |
| Distrito Federal | 188,529 | 85,040 | 45.107 |
| São Paulo | 2,696,355 | 1,115,357 | 41.365 |
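
A caveat when adapting the table code: prettyNum() returns character vectors, so a column formatted with it sorts as text rather than as numbers. Text order happens to agree with numeric order when every percentage has two integer digits, as in the table above, but not in general, so sorting is best done on the numeric values before formatting. A minimal base R illustration with made-up values:

```r
# Made-up coverage percentages, including one value of 100.
coverage_pct <- c(41.365, 100, 69.404, 9.5)

# Formatting turns the numbers into text.
formatted <- prettyNum(coverage_pct, big.mark = ",", decimal.mark = ".")

# Sorting the text in descending order puts "100" last.
text_sorted <- sort(formatted, decreasing = TRUE, method = "radix")
text_sorted

# Sorting the numbers first gives the intended order.
numeric_sorted <- sort(coverage_pct, decreasing = TRUE)
prettyNum(numeric_sorted, big.mark = ",", decimal.mark = ".")
```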

Plot Histogram by Municipality

Code
data |>
  tidyr::drop_na(coverage_pct) |>
  ggplot(aes(x = coverage_pct)) +
  geom_histogram(
    aes(y = after_stat(density)),
    bins = 30,
    fill = get_brand_color("gray-d25"),
    color = get_brand_color("white")
  ) +
  geom_density(
    color = "red",
    linewidth = 1
  ) +
  labs(
    title = "SISVAN Coverage by Municipality (%) (Ages 0-5)",
    subtitle = paste0("Year: ", year),
    x = "Coverage (%)",
    y = "Density",
    caption = "Source: SISVAN."
  )

Plot Map by Municipality

Set Shape

Code
shape <-
  read_municipality(
    year = year |>
      closest_geobr_year(type = "municipality"),
    showProgress = FALSE
  ) |>
  st_transform(st_crs(4326))
#> ! The closest map year to 2023 is 2022. Using year 2022 instead.
#> Using year/date 2022

Prepare Plot Data

Code
plot_data <-
  data |>
  left_join(
    shape,
    by = join_by(municipality_code == code_muni)
  ) |>
  rename(geometry = geom) |>
  select(
    municipality_code,
    coverage_pct,
    geometry
  ) |>
  tidyr::drop_na(coverage_pct)

Plot Data

Code
brand_div_palette <- function(x) {
  brandr:::make_color_ramp(
    n = x,
    colors = c(
      get_brand_color("dark-red"),
      # get_brand_color("white"),
      get_brand_color_mix(
        position = 950,
        color_1 = "dark-red",
        color_2 = "dark-red-triadic-blue",
        alpha = 0.5
      ),
      get_brand_color("dark-red-triadic-blue")
    )
  )
}
Code
plot_data |>
  st_as_sf() |>
  ggplot(aes(fill = coverage_pct)) +
  geom_sf(
    color = get_brand_color("gray"),
    linewidth = 0.05
  ) +
  scale_fill_binned(
    breaks = seq(0, 100, 25),
    limits = c(0, 100),
    palette = brand_div_palette,
    na.value = get_brand_color("gray-d25")
  ) +
  annotation_scale(
    aes(),
    location = "br",
    style = "tick",
    height = unit(0.5, "lines")
  ) +
  annotation_north_arrow(
    location = "br",
    height = unit(2, "lines"),
    width = unit(2, "lines"),
    pad_x = unit(0.25, "lines"),
    pad_y = unit(1.25, "lines"),
    style = north_arrow_fancy_orienteering
  ) +
  labs(
    title = "SISVAN Coverage by Municipality (%) (Ages 0-5)",
    subtitle = paste0("Year: ", year),
    fill = NULL,
    caption = "Source: SISVAN."
  )
#> Scale on map varies by more than 10%, scale bar may be inaccurate

Citation

When using these data, you must also cite the original data sources.

To cite this work, please use the following format:

Vartanian, D., Schettino, J. P. J., & Carvalho, A. M. (2025). A reproducible pipeline for processing and analyzing SISVAN microdata on nutritional status monitoring in Brazil [Computer software]. Sustentarea Research and Extension Group, University of São Paulo. https://sustentarea.github.io/nutritional-status

A BibLaTeX entry for LaTeX users is:

@software{vartanian2025,
  title = {A reproducible pipeline for processing and analyzing SISVAN microdata on nutritional status monitoring in Brazil},
  author = {{Daniel Vartanian} and {João Pedro Junqueira Schettino} and {Aline Martins de Carvalho}},
  year = {2025},
  address = {São Paulo},
  institution = {Sustentarea Research and Extension Group, University of São Paulo},
  langid = {en},
  url = {https://sustentarea.github.io/nutritional-status}
}

License

The original data sources may be subject to their own licensing terms and conditions.

The code in this repository is licensed under the GNU General Public License Version 3, while the report is available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Copyright (C) 2025 Sustentarea Research and Extension Group

The code in this report is free software: you can redistribute it and/or
modify it under the terms of the GNU General Public License as published by the
Free Software Foundation, either version 3 of the License, or (at your option)
any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with
this program. If not, see <https://www.gnu.org/licenses/>.

Acknowledgments


This work is part of a research project by the Sustentarea Research and Extension Group of the University of São Paulo (USP) titled: Global syndemic: The impact of anthropogenic climate change on the health and nutrition of children under five years old attended by Brazil's public health system.

This work was supported by the Department of Science and Technology of the Secretariat of Science, Technology, and Innovation and of the Health Economic-Industrial Complex (SECTICS) of the Ministry of Health of Brazil, and the National Council for Scientific and Technological Development (CNPq) (grant no. 444588/2023-0).

References

Aho, A., Kernighan, B., & Weinberger, P. (2023). The AWK programming language. Addison-Wesley Professional. https://www.awk.dev
Bagni, U. V., & Barros, D. C. D. (2015). Erro em antropometria aplicada à avaliação nutricional nos serviços de saúde: Causas, consequências e métodos de mensuração. Nutrire, 40(2), 226–236. https://doi.org/10.4322/2316-7874.18613
Bopp, M., & Faeh, D. (2008). End-digits preference for self-reported height depends on language. BMC Public Health, 8(1), 342. https://doi.org/10.1186/1471-2458-8-342
Comitê de Gestão de Indicadores, Rede Interagencial de Informações para a Saúde, Coordenação-Geral de Informações e Análises Epidemiológicas, Secretaria de Vigilância em Saúde e Ambiente, Ministério da Saúde, & Instituto Brasileiro de Geografia e Estatística. (n.d.). População residente – Estudo de estimativas populacionais por município, idade e sexo 2000-2024 – Brasil [Resident population – Study of population estimates by municipality, age, and sex, 2000–2024 – Brazil] [Data set]. DATASUS - Tabnet. Retrieved November 16, 2023, from http://tabnet.datasus.gov.br/cgi/deftohtm.exe?ibge/cnv/popsvs2024br.def
Corsi, D. J., Perkins, J. M., & Subramanian, S. V. (2017). Child anthropometry data quality from Demographic and Health Surveys, Multiple Indicator Cluster Surveys, and National Nutrition Surveys in the West Central Africa region: Are we comparing apples and oranges? Global Health Action, 10(1), 1328185. https://doi.org/10.1080/16549716.2017.1328185
Finaret, A. B., & Hutchinson, M. (2018). Missingness of height data from the demographic and health surveys in Africa between 1991 and 2016 was not random but is unlikely to have major implications for biases in estimating stunting prevalence or the determinants of child height. The Journal of Nutrition, 148(5), 781–789. https://doi.org/10.1093/jn/nxy037
Lawman, H. G., Ogden, C. L., Hassink, S., Mallya, G., Vander Veur, S., & Foster, G. D. (2015). Comparing methods for identifying biologically implausible values in height, weight, and body mass index among youth. American Journal of Epidemiology, 182(4), 359–365. https://doi.org/10.1093/aje/kwv057
Lyons-Amos, M., & Stones, T. (2017). Trends in demographic and health survey data quality: An analysis of age heaping over time in 34 countries in sub saharan Africa between 1987 and 2015. BMC Research Notes, 10(1), 760. https://doi.org/10.1186/s13104-017-3091-x
Mei, Z. (2007). Standard deviation of anthropometric Z-scores as a data quality assessment tool using the 2006 WHO growth standards: A cross country analysis. Bulletin of the World Health Organization, 85(6), 441–448. https://doi.org/10.2471/BLT.06.034421
Mourão, E., Gallo, C. D. O., Nascimento, F. A. D., & Jaime, P. C. (2020). Tendência temporal da cobertura do Sistema de Vigilância Alimentar e Nutricional entre crianças menores de 5 anos da região Norte do Brasil, 2008-2017. Epidemiologia e Serviços de Saúde, 29(2). https://doi.org/10.5123/S1679-49742020000200026
Nannan, N., Dorrington, R., & Bradshaw, D. (2019). Estimating completeness of birth registration in South Africa, 1996 – 2011. Bulletin of the World Health Organization, 97(7), 468–476. https://doi.org/10.2471/BLT.18.222620
Nascimento, F. A. D., Silva, S. A. D., & Jaime, P. C. (2017). Cobertura da avaliação do estado nutricional no Sistema de Vigilância Alimentar e Nutricional brasileiro: 2008 a 2013. Cadernos de Saúde Pública, 33(12). https://doi.org/10.1590/0102-311x00161516
Pereira, R. H. M., & Goncalves, C. N. (n.d.). geobr: Download official spatial data sets of Brazil [Computer software]. https://doi.org/10.32614/CRAN.package.geobr
Perumal, N., Namaste, S., Qamar, H., Aimone, A., Bassani, D. G., & Roth, D. E. (2020). Anthropometric data quality assessment in multisurvey studies of child growth. The American Journal of Clinical Nutrition, 112, 806S–815S. https://doi.org/10.1093/ajcn/nqaa162
R Core Team. (n.d.). R: A language and environment for statistical computing [Computer software]. R Foundation for Statistical Computing. https://www.R-project.org
Schumacher, D. (n.d.-a). anthro: Computation of the WHO child growth standards [Computer software]. https://doi.org/10.32614/CRAN.package.anthro
Schumacher, D. (n.d.-b). anthroplus: Computation of the WHO 2007 references for school-age children and adolescents (5 to 19 years) [Computer software]. https://doi.org/10.32614/CRAN.package.anthroplus
Silva, N. de J., Silva, J. F. de M. e, Carrilho, T. R. B., Pinto, E. de J., Andrade, R. da C. S. de, Silva, S. A., Pedroso, J., Spaniol, A. M., Bortolini, G. A., Fagundes, A., Nilson, E. A. F., Fiaccone, R. L., Kac, G., Barreto, M. L., & Ribeiro-Silva, R. de C. (2023). Qualidade dos dados antropométricos infantis do Sisvan, Brasil, 2008-2017. Revista de Saúde Pública, 57(1), 62. https://doi.org/10.11606/s1518-8787.2023057004655
Sistema de Vigilância Alimentar e Nutricional, Coordenação-Geral de Alimentação e Nutrição, Departamento de Promoção da Saúde, Coordenação Setorial de Tecnologia da Informação, Secretaria de Atenção Primária à Saúde, & Ministério da Saúde. (n.d.). Microdados dos acompanhamentos de estado nutricional [Microdata on nutritional status monitoring] [Data set]. openDataSUS. Retrieved November 16, 2023, from https://opendatasus.saude.gov.br/dataset/sisvan-estado-nutricional
Ushey, K., & Wickham, H. (n.d.). renv: Project environments [Computer software]. https://doi.org/10.32614/CRAN.package.renv
Vartanian, D., & Carvalho, A. M. de. (2025). A reproducible pipeline for processing WorldClim 2.1 Historical Monthly Weather Data in Brazil [Computer software]. Sustentarea Research and Extension Center at the University of São Paulo. https://sustentarea.github.io/brazil-historical-climate
Wickham, H. (n.d.-a). The tidyverse style guide. Retrieved July 17, 2023, from https://style.tidyverse.org
Wickham, H. (n.d.-b). Tidy design principles. https://design.tidyverse.org
Wickham, H. (2023). The tidy tools manifesto. Tidyverse. https://tidyverse.tidyverse.org/articles/manifesto.html
Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for data science: Import, tidy, transform, visualize, and model data (2nd ed.). O’Reilly Media. https://r4ds.hadley.nz
World Health Organization. (2006). WHO child growth standards: Length/height-for-age, weight-for-age, weight-for-length, weight-for-height and body mass index-for-age: Methods and development. WHO Press. https://www.who.int/tools/child-growth-standards/standards
World Health Organization. (2008). Training course on child growth assessment. WHO Press. https://www.who.int/publications/i/item/9789241595070