A Reproducible Pipeline for Processing SISVAN Microdata on Nutritional Status Monitoring in Brazil (2008-2023)

Daniel Vartanian, João Pedro J. Schettino, & Aline M. de Carvalho

Published

2025-05-05

Overview

This report contains a reproducible pipeline for processing SISVAN microdata on nutritional status monitoring in Brazil (2008–2023). The main goal is to provide an open and reliable workflow for processing these data, supporting research and informed public policy decisions.

This pipeline is still under development and may not be fully functional.

This warning will be removed once the pipeline is complete.

Problem

The Food and Nutrition Surveillance System (SISVAN) is a strategic tool for monitoring the nutritional status of the Brazilian population, particularly those served by Brazil’s Unified Health System (SUS). However, despite its broad scope and importance, the anthropometric data recorded in SISVAN often suffer from quality issues that limit their usefulness for rigorous analyses and evidence-based policymaking (Silva et al., 2023).

Multiple factors contribute to these quality concerns, including the lack of standardized measurement protocols, variability in staff training, inconsistencies in data entry and processing, and incomplete population coverage (Bagni & Barros, 2015; Corsi et al., 2017; Perumal et al., 2020). To assess and improve data quality, several indicators have been proposed and applied, such as population coverage (Mourão et al., 2020; Nascimento et al., 2017), completeness of birth dates and anthropometric measurements (Finaret & Hutchinson, 2018; Nannan et al., 2019), digit preference for age, height, and weight (Bopp & Faeh, 2008; Lyons-Amos & Stones, 2017), the percentage of biologically implausible values (Lawman et al., 2015), and the dispersion and distribution of standardized weight and height measurements (Mei, 2007; Perumal et al., 2020).

In light of this, there is a need for an open and reproducible pipeline for processing SISVAN microdata, aiming to identify, correct, or remove problematic records and ensure greater consistency, completeness, and plausibility of the information for use in research and public policy.

Data Availability

The processed data are available in both csv and rds formats via a dedicated repository on the Open Science Framework (OSF), accessible here. A metadata file is included alongside the validated data. You can also access these files directly from R using the osfr package.

A backup copy of the raw data is also available in OSF. You can access it here.
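
As a minimal sketch of the osfr access mentioned above, the function below lists and downloads one processed file from an OSF component. The function name `download_sisvan_valid` and its `node_id` argument are hypothetical (the actual component ID is the one behind the repository link), and the sketch assumes the files on OSF follow the `valid-<year>-age-<min>-<max>.rds` naming used when the data are saved later in this report.

```r
# Sketch only: `node_id` is a placeholder argument. Pass the ID of the
# OSF component linked in this report.
download_sisvan_valid <- function(
  node_id,
  year = 2017,
  age_limits = c(0, 4),
  dir = tempdir()
) {
  # Assumed file name, mirroring the pattern used when saving the data.
  file_name <- paste0(
    "valid-", year, "-age-", age_limits[1], "-", age_limits[2], ".rds"
  )

  # Locate the file inside the OSF component.
  osf_file <-
    node_id |>
    osfr::osf_retrieve_node() |>
    osfr::osf_ls_files(type = "file", pattern = file_name)

  # Download it and read it into memory.
  osf_file |>
    osfr::osf_download(path = dir, conflicts = "overwrite") |>
    dplyr::pull(local_path) |>
    readr::read_rds()
}
```

Calling the function requires an internet connection and the osfr, dplyr, and readr packages.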

Methods

Source of Data

The data used in this analysis come from the following sources:

Brazil’s Food and Nutrition Surveillance System (SISVAN), which provides microdata on nutritional status monitoring in Brazil (Sistema de Vigilância Alimentar e Nutricional et al., n.d.).
The Brazilian Institute of Geography and Statistics (IBGE), which provides official territorial data for Brazilian municipalities. These data were accessed using the orbis and geobr R packages (Pereira & Goncalves, n.d.; Vartanian, n.d.).
The Department of Informatics of the Brazilian Unified Health System (DATASUS) platform, which provides annual population estimates for Brazil by municipality, age, and sex for the period 2000-2024 (Comitê de Gestão de Indicadores et al., n.d.). For practicality and better organization, the DATASUS data used in this pipeline are provided through a separate reproducible pipeline, available here (Vartanian & Carvalho, 2025).

For technical information about the raw dataset, see the official technical note (in Portuguese).

Data Munging

The data munging followed the data science workflow outlined by Wickham et al. (2023), as illustrated in Figure 1. All processing was carried out using the Quarto publishing system (Allaire et al., n.d.), the R programming language (R Core Team, n.d.), and several R packages.

Packages from the tidyverse and the rOpenSci peer-reviewed ecosystem, along with other R packages that adhere to the tidy tools manifesto (Wickham, 2023), were prioritized. All steps were designed to make the results transparent and reproducible.

Figure 1: Data science workflow created by Wickham, Çetinkaya-Rundel, and Grolemund.

Data Validation

Different validation techniques were used to ensure data quality and reliability:

The number of records imported from the raw files was compared to the number returned by the SISVAN Online Data Access Tool.
Duplicates were removed based on distinct combinations of the variables id, age, date (date of the individual’s nutritional assessment), weight, and height.
The number of nutritional assessments was compared to the estimated number of children in the population.

The quality indicators described by Silva et al. (2023) were also used for validation. Refer to the article for more details.
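
To give a feel for what such indicators measure, the sketch below computes two of the indicator types named in the Problem section, completeness and terminal-digit preference, on a handful of toy weights. It is an illustration only: the values are invented, and the definitions are simplified assumptions rather than the exact formulas used by Silva et al. (2023).

```r
# Toy weights in kilograms (illustrative values, not SISVAN records).
weight <- c(11.6, 17.2, 14.0, 12.2, 16.0, 6.4, NA)

# Completeness: share of non-missing measurements.
completeness <- mean(!is.na(weight))

# Terminal digit: first decimal place of each recorded weight.
terminal_digit <- round(weight[!is.na(weight)] * 10) %% 10

# Share of each terminal digit. Without digit preference, each digit
# would be expected in roughly 10% of the records.
digit_share <- prop.table(table(factor(terminal_digit, levels = 0:9)))

print(completeness)
print(digit_share)
```

A strong excess of the digits 0 or 5 in `digit_share` is the usual sign of rounding during measurement or data entry.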

Code Style

The Tidyverse code style guide and design principles were followed to ensure consistency and enhance readability.

Reproducibility

The pipeline is fully reproducible and can be run again at any time. See the README file in the code repository to learn how to run it.

Setting the Environment

library(brandr)
library(cli)
library(dplyr)
library(fs)
library(ggplot2)
library(groomr) # github.com/danielvartan/groomr
library(here)
library(httr2)
library(lubridate)
library(orbis) # github.com/danielvartan/orbis
library(osfr)
library(pal) # gitlab.com/rpkg.dev/pal
library(plotr) # github.com/danielvartan/plotr
library(readr)
library(tidyr)
library(utils)
library(vroom)

Setting the Initial Variables

year <- 2017

age_limits <- c(0, 4)

Click here to access the microdata data dictionary (in Portuguese).

col_selection <- c(
  "CO_PESSOA_SISVAN",
  "CO_MUNICIPIO_IBGE",
  "DT_ACOMPANHAMENTO",
  "SG_SEXO",
  "NU_IDADE_ANO",
  "NU_PESO",
  "NU_ALTURA"
)

Downloading the Data

SISVAN microdata files are very large. For practical reasons, some code chunks have eval: false set to prevent downloading the data each time the report is rendered. When running the pipeline in a loop or for full automation, remove these lines to enable automatic downloading.

if (!dir.exists(here::here("data"))) dir.create(here::here("data"))

raw_file_pattern <- paste0("raw-", year)

file <- here::here("data", paste0(raw_file_pattern, ".zip"))

paste0(
  "https://s3.sa-east-1.amazonaws.com/ckan.saude.gov.br/SISVAN/",
  "estado_nutricional/sisvan_estado_nutricional_",
  year,
  ".zip"
) |>
  httr2::request() |>
  httr2::req_progress() |>
  httr2::req_perform(path = file)
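
When the pipeline is run in a loop, as mentioned above, the download URL can be generated for each year. The sketch below wraps the URL pattern from the download step in a helper; the name `sisvan_url` is hypothetical, and the sketch assumes that every year from 2008 to 2023 is published under the same pattern, which should be checked before a full run.

```r
# Hypothetical helper: builds the download URL for one or more years,
# using the same pattern as the download step.
sisvan_url <- function(year) {
  paste0(
    "https://s3.sa-east-1.amazonaws.com/ckan.saude.gov.br/SISVAN/",
    "estado_nutricional/sisvan_estado_nutricional_",
    year,
    ".zip"
  )
}

# Assumption: all years in the report's range share the URL pattern.
years <- 2008:2023
urls <- sisvan_url(years)

print(urls[years == 2017])
```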

Unzipping the Data

file <-
  file |>
  utils::unzip(exdir = here::here("data"), overwrite = TRUE)

file <-
  file |>
  fs::file_move(here::here("data", paste0(raw_file_pattern, ".csv")))

fs::file_delete(here::here("data", paste0(raw_file_pattern, ".zip")))

Checking Data Dimensions

file |> groomr::peek_csv_file(delim = ";", skip = 0, has_header = TRUE)
#> The file has 34 columns, 28,537,529 rows, and 970,275,986 cells.

Reading and Filtering the Data

We use the vroom R package together with the AWK programming language to efficiently handle large datasets and mitigate memory issues. This approach allows the pipeline to run locally on most machines, though we recommend a minimum of 12 GB of RAM for optimal performance. Alternatively, the pipeline can also be executed on cloud platforms such as Google Colab or RStudio Cloud.
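
The filtering idea can be tried on a toy file before running it on the full microdata. The sketch below builds an AWK command that keeps only the rows whose age field falls within given limits and reads the result through `pipe()`. The helper name `build_awk_filter` and the three-column toy file are assumptions made for illustration; in the real file the age column is the eighth field.

```r
# Hypothetical helper: builds an awk command that prints only the rows
# whose field number `col` lies within `limits` (inclusive).
build_awk_filter <- function(file, col, limits) {
  sprintf(
    "awk -F ';' '{ if (($%d >= %s) && ($%d <= %s)) { print } }' %s",
    as.integer(col), limits[1], as.integer(col), limits[2], shQuote(file)
  )
}

# Toy semicolon-delimited file with the age in the third field.
toy_file <- tempfile(fileext = ".csv")
writeLines(c("id;sex;age", "a;F;0", "b;M;5", "c;F;4"), toy_file)

cmd <- build_awk_filter(toy_file, col = 3, limits = c(0, 4))

# Run the filter only when awk is available on the system.
if (nzchar(Sys.which("awk"))) {
  kept <- readLines(pipe(cmd))
  print(kept)
}
```

Because the rows are discarded by AWK before R parses anything, memory use depends on the size of the filtered subset rather than on the size of the raw file. Quoting the path with `shQuote()` keeps the command valid when the path contains spaces.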

col_names <- c(
  "CO_ACOMPANHAMENTO",
  "CO_PESSOA_SISVAN",
  "ST_PARTICIPA_ANDI",
  "CO_MUNICIPIO_IBGE",
  "SG_UF",
  "NO_MUNICIPIO",
  "CO_CNES",
  "NU_IDADE_ANO",
  "NU_FASE_VIDA",
  "DS_FASE_VIDA",
  "SG_SEXO",
  "CO_RACA_COR",
  "DS_RACA_COR",
  "CO_POVO_COMUNIDADE",
  "DS_POVO_COMUNIDADE",
  "CO_ESCOLARIDADE",
  "DS_ESCOLARIDADE",
  "DT_ACOMPANHAMENTO",
  "NU_COMPETENCIA",
  "NU_PESO",
  "NU_ALTURA",
  "DS_IMC",
  "DS_IMC_PRE_GESTACIONAL",
  "PESO X IDADE",
  "PESO X ALTURA",
  "CRI. ALTURA X IDADE",
  "CRI. IMC X IDADE",
  "ADO. ALTURA X IDADE",
  "ADO. IMC X IDADE",
  "CO_ESTADO_NUTRI_ADULTO",
  "CO_ESTADO_NUTRI_IDOSO",
  "CO_ESTADO_NUTRI_IMC_SEMGEST",
  "CO_SISTEMA_ORIGEM_ACOMP",
  "SISTEMA_ORIGEM_ACOMP"
)

schema <- vroom::cols(
  CO_ACOMPANHAMENTO = vroom::col_character(),
  CO_PESSOA_SISVAN = vroom::col_character(),
  ST_PARTICIPA_ANDI = vroom::col_character(),
  CO_MUNICIPIO_IBGE = vroom::col_integer(),
  SG_UF = vroom::col_factor(),
  NO_MUNICIPIO = vroom::col_character(), # ? vroom::col_factor()
  CO_CNES = vroom::col_integer(),
  NU_IDADE_ANO = vroom::col_integer(),
  NU_FASE_VIDA = vroom::col_character(), # decimal mark = "." (double)
  DS_FASE_VIDA = vroom::col_factor(),
  SG_SEXO = vroom::col_factor(),
  CO_RACA_COR = vroom::col_character(),
  DS_RACA_COR = vroom::col_factor(),
  CO_POVO_COMUNIDADE = vroom::col_integer(),
  DS_POVO_COMUNIDADE = vroom::col_factor(),
  CO_ESCOLARIDADE = vroom::col_character(),
  DS_ESCOLARIDADE = vroom::col_factor(),
  DT_ACOMPANHAMENTO = vroom::col_date(),
  NU_COMPETENCIA = vroom::col_integer(),
  NU_PESO = vroom::col_double(),
  NU_ALTURA = vroom::col_integer(),
  DS_IMC = vroom::col_double(),
  DS_IMC_PRE_GESTACIONAL = vroom::col_character(), # decimal mark = "." (double)
  "PESO X IDADE" = vroom::col_factor(),
  "PESO X ALTURA" = vroom::col_factor(),
  "CRI. ALTURA X IDADE" = vroom::col_factor(),
  "CRI. IMC X IDADE" = vroom::col_factor(),
  "ADO. ALTURA X IDADE" = vroom::col_factor(),
  "ADO. IMC X IDADE" = vroom::col_factor(),
  CO_ESTADO_NUTRI_ADULTO = vroom::col_factor(),
  CO_ESTADO_NUTRI_IDOSO = vroom::col_factor(),
  CO_ESTADO_NUTRI_IMC_SEMGEST = vroom::col_factor(),
  CO_SISTEMA_ORIGEM_ACOMP = vroom::col_integer(),
  SISTEMA_ORIGEM_ACOMP = vroom::col_factor()
)

You may see warning messages about failed parsing. These warnings are expected due to minor inconsistencies in the SISVAN raw data and do not affect the overall analysis.

data <-
  vroom::vroom(
    # Use `pipe()` and `awk` to filter rows by age (field 8,
    # `NU_IDADE_ANO`) before parsing, so the entire file is never
    # loaded into memory. `shQuote()` protects paths with spaces.
    file = pipe(
      paste(
        "awk -F ';' '{ if (",
        "($8 >= ", age_limits[1], ") && ($8 <= ", age_limits[2], ")",
        ") { print } }'",
        shQuote(file)
      )
    ),
    delim = ";",
    col_names = col_names,
    col_types = schema,
    col_select = dplyr::all_of(col_selection),
    id = NULL,
    skip = 0,
    n_max = Inf,
    na = c("", "NA"),
    quote = "\"",
    comment = "",
    skip_empty_rows = TRUE,
    trim_ws = TRUE,
    escape_double = TRUE,
    escape_backslash = FALSE,
    locale = vroom::locale(
      date_names = "pt",
      date_format = "%d/%m/%Y",
      time_format = "%H:%M:%S",
      decimal_mark = ",",
      grouping_mark = ".",
      tz = "America/Sao_Paulo",
      encoding = readr::guess_encoding(file)$encoding[1]
    ),
    guess_max = 100,
    altrep = TRUE,
    num_threads = vroom:::vroom_threads(),
    progress = vroom::vroom_progress(),
    show_col_types = NULL,
    .name_repair = "unique"
  )

data |> dplyr::glimpse()
#> Rows: 4,775,907
#> Columns: 7
#> $ CO_PESSOA_SISVAN  <chr> "B053B4FAD12CF2F95F1C251702606DCCD870A406", "CF43…
#> $ CO_MUNICIPIO_IBGE <int> 230670, 420270, 520520, 251445, 231380, 230710, 2…
#> $ DT_ACOMPANHAMENTO <date> 2017-01-23, 2017-01-11, 2017-01-02, 2017-01-10, …
#> $ SG_SEXO           <fct> M, M, F, F, F, F, F, F, M, M, F, F, M, F, F, F, F…
#> $ NU_IDADE_ANO      <int> 1, 3, 3, 0, 4, 0, 0, 1, 1, 0, 0, 4, 1, 1, 0, 0, 0…
#> $ NU_PESO           <dbl> 11.600, 17.200, 14.000, 12.200, 16.000, 6.400, 8.…
#> $ NU_ALTURA         <int> 77, 90, 97, 73, 100, 60, 63, 73, 88, 66, 67, 108,…

Renaming the Data

data <-
  data |>
  janitor::clean_names() |>
  dplyr::rename(
    id = co_pessoa_sisvan,
    municipality_code = co_municipio_ibge,
    date = dt_acompanhamento,
    sex = sg_sexo,
    age = nu_idade_ano,
    weight = nu_peso,
    height = nu_altura
  )

data |> dplyr::glimpse()
#> Rows: 4,775,907
#> Columns: 7
#> $ id                <chr> "B053B4FAD12CF2F95F1C251702606DCCD870A406", "CF43…
#> $ municipality_code <int> 230670, 420270, 520520, 251445, 231380, 230710, 2…
#> $ date              <date> 2017-01-23, 2017-01-11, 2017-01-02, 2017-01-10, …
#> $ sex               <fct> M, M, F, F, F, F, F, F, M, M, F, F, M, F, F, F, F…
#> $ age               <int> 1, 3, 3, 0, 4, 0, 0, 1, 1, 0, 0, 4, 1, 1, 0, 0, 0…
#> $ weight            <dbl> 11.600, 17.200, 14.000, 12.200, 16.000, 6.400, 8.…
#> $ height            <int> 77, 90, 97, 73, 100, 60, 63, 73, 88, 66, 67, 108,…

Tidying the Data

data <-
  data |>
  dplyr::mutate(
    sex =
      sex |>
      dplyr::case_match(
        "F" ~ "female",
        "M" ~ "male"
      ) |>
      factor(
        levels = c("male", "female"),
        ordered = FALSE
      )
  ) |>
  dplyr::relocate(id, date)

data |> dplyr::glimpse()
#> Rows: 4,775,907
#> Columns: 7
#> $ id                <chr> "B053B4FAD12CF2F95F1C251702606DCCD870A406", "CF43…
#> $ date              <date> 2017-01-23, 2017-01-11, 2017-01-02, 2017-01-10, …
#> $ municipality_code <int> 230670, 420270, 520520, 251445, 231380, 230710, 2…
#> $ sex               <fct> male, male, female, female, female, female, femal…
#> $ age               <int> 1, 3, 3, 0, 4, 0, 0, 1, 1, 0, 0, 4, 1, 1, 0, 0, 0…
#> $ weight            <dbl> 11.600, 17.200, 14.000, 12.200, 16.000, 6.400, 8.…
#> $ height            <int> 77, 90, 97, 73, 100, 60, 63, 73, 88, 66, 67, 108,…

Transforming the Data

Adding State and Region Data

The orbis R package retrieves state and region information using the geobr package, developed by the Brazilian Institute for Applied Economic Research (IPEA). The geobr package, in turn, is based on official data from the Brazilian Institute of Geography and Statistics (IBGE).

brazil_municipalities <- orbis::get_brazil_municipality(
  year = plotr:::get_closest_geobr_year(year, type = "municipality")
)

data <-
  data |>
  dplyr::left_join(
    brazil_municipalities |>
      dplyr::mutate(
        municipality_code =
          municipality_code |>
          stringr::str_sub(end = -2) |>
          as.integer()
      ),
    by = "municipality_code"
  ) |>
  dplyr::relocate(
    id,
    date,
    region_code,
    region,
    state_code,
    state,
    federal_unit,
    municipality_code,
    municipality
  )

data |> dplyr::glimpse()
#> Rows: 4,775,907
#> Columns: 13
#> $ id                <chr> "B053B4FAD12CF2F95F1C251702606DCCD870A406", "CF43…
#> $ date              <date> 2017-01-23, 2017-01-11, 2017-01-02, 2017-01-10, …
#> $ region_code       <int> 2, 4, 5, 2, 2, 2, 2, 3, 2, 2, 4, 3, 1, 1, 3, 2, 4…
#> $ region            <chr> "Northeast", "South", "Central-West", "Northeast"…
#> $ state_code        <int> 23, 42, 52, 25, 23, 23, 29, 31, 23, 26, 43, 35, 1…
#> $ state             <chr> "Ceará", "Santa Catarina", "Goiás", "Paraíba", "C…
#> $ federal_unit      <chr> "CE", "SC", "GO", "PB", "CE", "CE", "BA", "MG", "…
#> $ municipality_code <int> 230670, 420270, 520520, 251445, 231380, 230710, 2…
#> $ municipality      <chr> "Jaguaretama", "Botuverá", "Caturaí", "São José d…
#> $ sex               <fct> male, male, female, female, female, female, femal…
#> $ age               <int> 1, 3, 3, 0, 4, 0, 0, 1, 1, 0, 0, 4, 1, 1, 0, 0, 0…
#> $ weight            <dbl> 11.600, 17.200, 14.000, 12.200, 16.000, 6.400, 8.…
#> $ height            <int> 77, 90, 97, 73, 100, 60, 63, 73, 88, 66, 67, 108,…
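
The join above drops the last digit of the municipality code returned by orbis, because the SISVAN files store a 6-digit code while IBGE publishes a 7-digit one (the seventh digit is generally described as a check digit). The sketch below shows the conversion on two well-known IBGE codes, São Paulo (3550308) and Rio de Janeiro (3304557), and a simple way to spot SISVAN codes that would not match; the code 999999 is an invented invalid value.

```r
# 7-digit IBGE codes for São Paulo and Rio de Janeiro.
ibge_code <- c(3550308L, 3304557L)

# Keep the first six digits to obtain the form used in the SISVAN files.
sisvan_code <- as.integer(substr(as.character(ibge_code), 1, 6))

# Toy SISVAN records; 999999 is a made-up code with no IBGE match.
sisvan_records <- data.frame(
  municipality_code = c(355030L, 330455L, 999999L)
)

# Codes that would produce missing state and region values after the join.
unmatched <- setdiff(sisvan_records$municipality_code, sisvan_code)

print(sisvan_code)
print(unmatched)
```

Checking for unmatched codes before the join helps separate genuine data-entry errors from municipalities missing in the reference year.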

Validating the Data

Removing Duplicates

data <-
  data |>
  dplyr::arrange(dplyr::desc(date)) |>
  dplyr::distinct(
    id,
    age,
    date,
    weight,
    height,
    .keep_all = TRUE
  )
#> Warning: One or more parsing issues, call `problems()` on your data frame for
#> details, e.g.:
#>   dat <- vroom(...)
#>   problems(dat)

data |> dplyr::glimpse()
#> Rows: 4,770,414
#> Columns: 13
#> $ id                <chr> "B1C98CBB3CB83C75B08E2F62056EB10A68DBEBA4", "5B1A…
#> $ date              <date> 2017-12-31, 2017-12-31, 2017-12-31, 2017-12-31, …
#> $ region_code       <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5…
#> $ region            <chr> "Central-West", "Central-West", "Central-West", "…
#> $ state_code        <int> 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 5…
#> $ state             <chr> "Goiás", "Goiás", "Goiás", "Goiás", "Goiás", "Goi…
#> $ federal_unit      <chr> "GO", "GO", "GO", "GO", "GO", "GO", "GO", "GO", "…
#> $ municipality_code <int> 522020, 522045, 522020, 521250, 522020, 522020, 5…
#> $ municipality      <chr> "São Miguel do Araguaia", "Senador Canedo", "São …
#> $ sex               <fct> female, male, female, male, female, male, female,…
#> $ age               <int> 3, 4, 2, 2, 2, 3, 3, 1, 2, 4, 2, 3, 2, 4, 3, 4, 4…
#> $ weight            <dbl> 17, NA, NA, NA, 14, 19, 16, 12, 14, 21, NA, 17, 1…
#> $ height            <int> 94, 107, 85, 81, 89, 96, 95, 80, 87, 127, 93, 93,…

Arranging the Data

data <-
  data |>
  dplyr::arrange(
    region_code,
    state_code,
    municipality_code,
    date,
    sex,
    age,
    weight,
    height
  )

data |> dplyr::glimpse()
#> Rows: 4,770,414
#> Columns: 13
#> $ id                <chr> "263E905B0395FF94BE2D97E92983F83D0F4D01E6", "B812…
#> $ date              <date> 2017-01-04, 2017-01-05, 2017-01-06, 2017-01-09, …
#> $ region_code       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ region            <chr> "North", "North", "North", "North", "North", "Nor…
#> $ state_code        <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ state             <chr> "Rondônia", "Rondônia", "Rondônia", "Rondônia", "…
#> $ federal_unit      <chr> "RO", "RO", "RO", "RO", "RO", "RO", "RO", "RO", "…
#> $ municipality_code <int> 110001, 110001, 110001, 110001, 110001, 110001, 1…
#> $ municipality      <chr> "Alta Floresta D'Oeste", "Alta Floresta D'Oeste",…
#> $ sex               <fct> female, male, female, male, male, female, male, m…
#> $ age               <int> 1, 1, 4, 2, 0, 3, 1, 2, 1, 4, 0, 2, 0, 0, 4, 0, 0…
#> $ weight            <dbl> 9.000, 9.700, 14.000, 12.700, 6.600, 13.800, 10.0…
#> $ height            <int> 81, 75, 95, 89, 58, 98, 83, 80, 76, 111, 53, 95, …

Data Dictionary

metadata <-
  data |>
  labelled::`var_label<-`(
    list(
      id = "Unique identifier of the individual",
      date = "Date of the individual's nutritional assessment",
      region_code = "IBGE region code",
      region = "Region name",
      state_code = "IBGE state code",
      state = "State name",
      federal_unit = "Federal unit name",
      municipality_code = "IBGE municipality code",
      municipality = "Municipality name",
      sex = "Sex of the individual",
      age = "Age of the individual in years",
      weight = "Weight of the individual in kilograms",
      height = "Height of the individual in centimeters"
    )
  ) |>
  labelled::generate_dictionary(details = "full") |>
  labelled::convert_list_columns_to_character()

metadata

data

Saving the Valid Data

Data

valid_file_pattern <- paste0(
  "valid-",
  year,
  "-age-",
  age_limits[1],
  "-",
  age_limits[2]
)

data |>
  readr::write_csv(
    here::here("data", paste0(valid_file_pattern, ".csv"))
  )

data |>
  readr::write_rds(
    here::here("data", paste0(valid_file_pattern, ".rds"))
  )

Metadata

metadata_file_pattern <- paste0(
  "metadata-",
  year,
  "-age-",
  age_limits[1],
  "-",
  age_limits[2]
)

metadata |>
  readr::write_csv(
    here::here("data", paste0(metadata_file_pattern, ".csv"))
  )

metadata |>
  readr::write_rds(
    here::here("data", paste0(metadata_file_pattern, ".rds"))
  )

Checking the Relative Coverage

Transforming the Data

Removing Duplicates by Year

As described in Silva et al. (2023, p. 4), to calculate SISVAN’s total resident population coverage, only the most recent record for each individual within each year is retained for analysis.

data <-
  data |>
  dplyr::mutate(year = lubridate::year(date)) |>
  dplyr::arrange(dplyr::desc(date)) |>
  dplyr::distinct(id, year, .keep_all = TRUE) |>
  dplyr::relocate(year, .after = date)

data |> dplyr::glimpse()
#> Rows: 4,622,727
#> Columns: 14
#> $ id                <chr> "1B7842AF30A5899C2B6D82688E95EDCD96355BA1", "9A35…
#> $ date              <date> 2017-12-31, 2017-12-31, 2017-12-31, 2017-12-31, …
#> $ year              <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2…
#> $ region_code       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ region            <chr> "North", "North", "North", "North", "North", "Nor…
#> $ state_code        <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ state             <chr> "Rondônia", "Rondônia", "Rondônia", "Rondônia", "…
#> $ federal_unit      <chr> "RO", "RO", "RO", "RO", "RO", "RO", "RO", "RO", "…
#> $ municipality_code <int> 110002, 110002, 110002, 110002, 110002, 110002, 1…
#> $ municipality      <chr> "Ariquemes", "Ariquemes", "Ariquemes", "Ariquemes…
#> $ sex               <fct> male, male, male, male, male, male, male, male, m…
#> $ age               <int> 0, 1, 1, 2, 2, 2, 2, 3, 4, 1, 2, 3, 4, 1, 1, 1, 2…
#> $ weight            <dbl> 7, 12, 13, 12, 13, NA, NA, 17, 18, NA, NA, 19, NA…
#> $ height            <int> 63, 82, 85, 90, 90, 87, 88, 117, 107, 61, 96, 107…
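
The behavior of this step is easier to see on a small example. The sketch below applies the same pattern (sort by date in descending order, then keep the first row per individual and year) to an invented table of visits; the identifiers, dates, and weights are illustrative only.

```r
library(dplyr)

# Toy visits: individual A has two records in 2017 and one in 2018.
visits <- data.frame(
  id = c("A", "A", "B", "A"),
  date = as.Date(c("2017-03-01", "2017-11-15", "2017-06-10", "2018-01-20")),
  weight = c(9.0, 11.5, 14.0, 12.0)
)

latest_per_year <-
  visits |>
  mutate(year = as.integer(format(date, "%Y"))) |>
  # Most recent records first, so `distinct()` keeps the latest one.
  arrange(desc(date)) |>
  distinct(id, year, .keep_all = TRUE) |>
  arrange(id, year)

print(latest_per_year)
```

Because `distinct()` keeps the first row it encounters for each combination, the descending sort is what guarantees that the retained record is the most recent one.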

Summarizing the Data by Year

data <-
  data |>
  dplyr::summarize(
    coverage = dplyr::n(),
    mean_age = age |> mean(na.rm = TRUE),
    mean_weight = weight |> mean(na.rm = TRUE),
    mean_height = height |> mean(na.rm = TRUE),
    .by = c(
      "year",
      "region_code",
      "state_code",
      "municipality_code"
    )
  )

data |> dplyr::glimpse()
#> Rows: 5,570
#> Columns: 8
#> $ year              <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2…
#> $ region_code       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ state_code        <int> 11, 11, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13, 1…
#> $ municipality_code <int> 110002, 110005, 110025, 110100, 120025, 120040, 1…
#> $ coverage          <int> 1577, 622, 887, 334, 557, 7291, 995, 3741, 2172, …
#> $ mean_age          <dbl> 2.581483830, 2.223472669, 2.278466742, 2.46107784…
#> $ mean_weight       <dbl> 15.17615741, 13.98031309, 14.23404969, 14.7764350…
#> $ mean_height       <dbl> 93.11921370, 91.22508039, 91.10496614, 93.9251497…

Adding Population Estimates

As described in the Methods section, the population estimates were obtained from the DATASUS platform, which provides annual data by municipality, age, and sex for Brazil from 2000 to 2024 (Comitê de Gestão de Indicadores et al., n.d.).

To ensure reproducibility and organization, the DATASUS data used in this pipeline are processed and validated through a separate reproducible pipeline, available here (Vartanian & Carvalho, 2025). The validated datasets are downloaded directly from OSF. For further details, refer to the linked pipeline.

datasus_file_pattern <- paste0("datasus-pop-estimates-", year)

datasus_file <- here::here("data", paste0(datasus_file_pattern, ".rds"))

if (!checkmate::test_file_exists(datasus_file)) {
  osf_id <-
    paste0("https://osf.io/", "h3pyd") |>
    osfr::osf_retrieve_node() |>
    osfr::osf_ls_files(
      type = "file",
      pattern = paste0("valid-", year, ".rds")
    )

  osfr::osf_download(
    x = osf_id,
    path = tempdir(),
    conflicts = "overwrite"
  ) |>
    dplyr::pull(local_path) |>
    fs::file_move(datasus_file)
}

pop_estimates <- datasus_file |> readr::read_rds()

data <-
  pop_estimates |>
  dplyr::filter(dplyr::between(age, age_limits[1], age_limits[2])) |>
  dplyr::summarize(
    n = n |> sum(na.rm = TRUE),
    .by = c(
      "year",
      "region_code",
      "state_code",
      "municipality_code"
    )
  ) |>
  dplyr::mutate(
    municipality_code =
      municipality_code |>
      stringr::str_sub(end = -2) |>
      as.integer()
  ) |>
  dplyr::right_join(
    data,
    by = c(
      "year",
      "region_code",
      "state_code",
      "municipality_code"
    )
  ) |>
  dplyr::rename(population = n) |>
  dplyr::relocate(population, .before = coverage)

data |> dplyr::glimpse()
#> Rows: 5,570
#> Columns: 9
#> $ year              <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2…
#> $ region_code       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ state_code        <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ municipality_code <int> 110001, 110002, 110003, 110004, 110005, 110006, 1…
#> $ population        <int> 1808, 8154, 426, 6551, 1258, 1242, 604, 1353, 250…
#> $ coverage          <int> 734, 1577, 94, 1334, 622, 374, 223, 584, 702, 107…
#> $ mean_age          <dbl> 2.179836512, 2.581483830, 2.053191489, 2.11544227…
#> $ mean_weight       <dbl> 13.63854669, 15.17615741, 14.22727273, 13.3930077…
#> $ mean_height       <dbl> 90.91256831, 93.11921370, 91.26595745, 89.4924698…

Validating the Data

The population value used here is an estimate. If the SISVAN coverage for a municipality exceeds the estimated population, the population value is adjusted to match the coverage.

Note: At this stage, only the most recent record for each individual within the year is retained.

data <-
  data |>
  dplyr::mutate(
    population = dplyr::case_when(
      coverage > population ~ coverage,
      TRUE ~ population
    )
  )

data |> dplyr::glimpse()
#> Rows: 5,570
#> Columns: 9
#> $ year              <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2…
#> $ region_code       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ state_code        <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ municipality_code <int> 110001, 110002, 110003, 110004, 110005, 110006, 1…
#> $ population        <int> 1808, 8154, 426, 6551, 1258, 1242, 604, 1353, 250…
#> $ coverage          <int> 734, 1577, 94, 1334, 622, 374, 223, 584, 702, 107…
#> $ mean_age          <dbl> 2.179836512, 2.581483830, 2.053191489, 2.11544227…
#> $ mean_weight       <dbl> 13.63854669, 15.17615741, 14.22727273, 13.3930077…
#> $ mean_height       <dbl> 90.91256831, 93.11921370, 91.26595745, 89.4924698…

Calculating Relative Coverage

data <-
  data |>
  dplyr::mutate(coverage_per = (coverage / population) * 100) |>
  dplyr::relocate(coverage_per, .after = coverage)

data |> dplyr::glimpse()
#> Rows: 5,570
#> Columns: 10
#> $ year              <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2…
#> $ region_code       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ state_code        <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ municipality_code <int> 110001, 110002, 110003, 110004, 110005, 110006, 1…
#> $ population        <int> 1808, 8154, 426, 6551, 1258, 1242, 604, 1353, 250…
#> $ coverage          <int> 734, 1577, 94, 1334, 622, 374, 223, 584, 702, 107…
#> $ coverage_per      <dbl> 40.597345133, 19.340201128, 22.065727700, 20.3633…
#> $ mean_age          <dbl> 2.179836512, 2.581483830, 2.053191489, 2.11544227…
#> $ mean_weight       <dbl> 13.63854669, 15.17615741, 14.22727273, 13.3930077…
#> $ mean_height       <dbl> 90.91256831, 93.11921370, 91.26595745, 89.4924698…
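
The two steps above (capping the population at the observed coverage, then computing the percentage) can be checked by hand on a few rows. In the sketch below, the first two rows reuse the population and coverage values shown in the output for municipalities 110001 and 110002; the third row is an invented case in which coverage exceeds the estimated population.

```r
coverage_tbl <- data.frame(
  municipality_code = c(110001L, 110002L, 999999L), # 999999 is made up.
  population = c(1808L, 8154L, 100L),
  coverage = c(734L, 1577L, 120L)
)

# If coverage exceeds the estimate, raise the population to the coverage.
coverage_tbl$population <- pmax(
  coverage_tbl$population,
  coverage_tbl$coverage
)

# Relative coverage in percent.
coverage_tbl$coverage_per <-
  coverage_tbl$coverage / coverage_tbl$population * 100

print(coverage_tbl)
```

With this adjustment, relative coverage can never exceed 100%, which keeps the map scale and the histogram within a fixed range.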

Arranging the Data

data <-
  data |>
  dplyr::arrange(
    year,
    region_code,
    state_code,
    municipality_code
  )

data |> dplyr::glimpse()
#> Rows: 5,570
#> Columns: 10
#> $ year              <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2…
#> $ region_code       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ state_code        <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ municipality_code <int> 110001, 110002, 110003, 110004, 110005, 110006, 1…
#> $ population        <int> 1808, 8154, 426, 6551, 1258, 1242, 604, 1353, 250…
#> $ coverage          <int> 734, 1577, 94, 1334, 622, 374, 223, 584, 702, 107…
#> $ coverage_per      <dbl> 40.597345133, 19.340201128, 22.065727700, 20.3633…
#> $ mean_age          <dbl> 2.179836512, 2.581483830, 2.053191489, 2.11544227…
#> $ mean_weight       <dbl> 13.63854669, 15.17615741, 14.22727273, 13.3930077…
#> $ mean_height       <dbl> 90.91256831, 93.11921370, 91.26595745, 89.4924698…

Checking Relative Coverage by Region

The coverage observed here is slightly lower than that reported in Silva et al. (2023, Table 2). This difference may be explained by the use of different data sources: Fundação Oswaldo Cruz (Fiocruz) versus OpenDataSUS.

data |>
  dplyr::mutate(region = orbis::get_brazil_region(region_code)) |>
  dplyr::summarize(
    population = population |> sum(na.rm = TRUE),
    coverage = coverage |> sum(na.rm = TRUE),
    .by = "region"
  ) |>
  dplyr::slice(c(1, 2, 5, 3, 4)) |>
  dplyr::mutate(coverage_per = (coverage / population) * 100) |>
  dplyr::rename(
    Region = region,
    Population = population,
    `SISVAN coverage` = coverage,
    `SISVAN coverage (%)` = coverage_per
  ) |>
  pal::pipe_table() |>
  pal::cat_lines()

Region	Population	SISVAN coverage	SISVAN coverage (%)
North	1592792	624150	39.18590751
Northeast	4107294	1813679	44.15751587
Central-West	1208858	268745	22.23131253
Southeast	5767592	1336352	23.17001619
South	1975068	579801	29.35600192
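
The regional figures above are pooled: coverage and population are summed across municipalities before the percentage is taken, which is not the same as averaging the municipal percentages. The sketch below contrasts the two approaches using the population and coverage values of the first three municipalities shown in the earlier output; it is meant only to illustrate the design choice.

```r
# Values for municipalities 110001, 110002, and 110003 from the output.
population <- c(1808, 8154, 426)
coverage <- c(734, 1577, 94)

# Pooled coverage: sum first, then divide (as in the regional table).
pooled <- sum(coverage) / sum(population) * 100

# Unweighted alternative: average of the municipal percentages.
unweighted <- mean(coverage / population * 100)

print(c(pooled = pooled, unweighted = unweighted))
```

The pooled figure weights each municipality by its population, so large municipalities with low coverage pull the regional value down more than the unweighted mean would suggest.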

Checking Relative Coverage by State

data |>
  dplyr::mutate(state = orbis::get_brazil_state(state_code)) |>
  dplyr::summarize(
    population = population |> sum(na.rm = TRUE),
    coverage = coverage |> sum(na.rm = TRUE),
    .by = "state"
  ) |>
  dplyr::arrange(state) |>
  dplyr::mutate(coverage_per = (coverage / population) * 100) |>
  dplyr::rename(
    State = state,
    Population = population,
    `SISVAN coverage` = coverage,
    `SISVAN coverage (%)` = coverage_per
  ) |>
  pal::pipe_table() |>
  pal::cat_lines()

State	Population	SISVAN coverage	SISVAN coverage (%)
Acre	81517	36178	44.380926678
Alagoas	253571	113343	44.698723435
Amapá	79072	20820	26.330433023
Amazonas	403287	165140	40.948505655
Bahia	1012762	444864	43.925818702
Ceará	645357	281507	43.620352766
Distrito Federal	218010	18757	8.603733774
Espírito Santo	277541	61507	22.161410386
Goiás	491920	103909	21.123150106
Maranhão	578369	283802	49.069365751
Mato Grosso	280532	79026	28.170048337
Mato Grosso do Sul	218396	67053	30.702485394
Minas Gerais	1309142	620924	47.429843363
Paraná	786664	253352	32.205871884
Paraíba	285820	150132	52.526765097
Pará	710233	290899	40.958248913
Pernambuco	692218	259659	37.511159779
Piauí	234935	125698	53.503309426
Rio Grande do Norte	237059	84896	35.812181778
Rio Grande do Sul	707754	177919	25.138536836
Rio de Janeiro	1122656	181772	16.191246473
Rondônia	135254	33899	25.063214397
Roraima	59647	16842	28.236122521
Santa Catarina	480650	148530	30.901903672
Sergipe	167203	69778	41.732504800
São Paulo	3058253	472149	15.438519965
Tocantins	123782	60372	48.772842578

Visualizing the Relative Coverage

brand_div_palette <- function(x) {
  brandr:::make_color_ramp(
    n_prop = x,
    colors = c(
      brandr::get_brand_color("dark-red"),
      # brandr::get_brand_color("white"),
      brandr::get_brand_color_mix(
        position = 950,
        color_1 = "dark-red",
        color_2 = "dark-red-triadic-blue",
        alpha = 0.5
      ),
      brandr::get_brand_color("dark-red-triadic-blue")
    )
  )
}

data |>
  tidyr::drop_na(coverage_per) |>
  plotr:::plot_hist(
    col = "coverage_per",
    density_line_color = brandr::get_brand_color("red"),
    x_label = "Coverage (%)",
    print = FALSE
  ) +
  ggplot2::labs(
    title = "SISVAN Coverage by Municipality (%)",
    subtitle = paste0("Year: ", year),
    caption = "Source: SISVAN"
  )

data |>
  tidyr::drop_na(coverage_per, municipality_code) |>
  plotr:::plot_brazil_municipality(
    col_fill = "coverage_per",
    col_code = "municipality_code",
    year = plotr:::get_closest_geobr_year(year, type = "municipality"),
    comparable_areas = FALSE,
    breaks = seq(0, 100, 25),
    limits = c(0, 100),
    palette = brand_div_palette,
    print = FALSE
  ) +
  ggplot2::labs(
    title = "SISVAN Coverage by Municipality (%)",
    subtitle = paste0("Year: ", year),
    caption = "Source: SISVAN"
  )
#> Scale on map varies by more than 10%, scale bar may be inaccurate

How to Cite

To cite this work, please use the following format:

Vartanian, D., Schettino, J. P. J., & Carvalho, A. M. (2025). A reproducible pipeline for processing SISVAN microdata on nutritional status monitoring in Brazil (2008-2023) [Report]. Sustentarea Research and Extension Group at the University of São Paulo. https://sustentarea.github.io/sisvan-nutritional-status

A BibTeX entry for LaTeX users is:

@techreport{vartanian2025,
  title = {A reproducible pipeline for processing SISVAN microdata on nutritional status monitoring in Brazil (2008-2023)},
  author = {{Daniel Vartanian} and {João Pedro Junqueira Schettino} and {Aline Martins de Carvalho}},
  year = {2025},
  address = {São Paulo},
  institution = {Sustentarea Research and Extension Group at the University of São Paulo},
  langid = {en},
  url = {https://sustentarea.github.io/sisvan-nutritional-status}
}

License

The code in this report is licensed under the MIT License, while the documents are available under the Creative Commons Attribution 4.0 International License.

Acknowledgments

This work is part of the Sustentarea Research and Extension Group project “Global syndemic: The impact of anthropogenic climate change on the health and nutrition of children under five years old attended by Brazil’s public health system (SUS)”.

This work was supported by the Conselho Nacional de Desenvolvimento Científico e Tecnológico - Brazil (CNPq).

References

Allaire, J. J., Teague, C., Xie, Y., & Dervieux, C. (n.d.). Quarto [Computer software]. Zenodo. https://doi.org/10.5281/ZENODO.5960048

Bagni, U. V., & Barros, D. C. D. (2015). Erro em antropometria aplicada à avaliação nutricional nos serviços de saúde: Causas, consequências e métodos de mensuração. Nutrire, 40(2), 226–236. https://doi.org/10.4322/2316-7874.18613

Bopp, M., & Faeh, D. (2008). End-digits preference for self-reported height depends on language. BMC Public Health, 8(1), 342. https://doi.org/10.1186/1471-2458-8-342

Comitê de Gestão de Indicadores, Rede Interagencial de Informações para a Saúde, Coordenação-Geral de Informações e Análises Epidemiológicas, Secretaria de Vigilância em Saúde e Ambiente, Ministério da Saúde, & Instituto Brasileiro de Geografia e Estatística. (n.d.). População residente – Estudo de estimativas populacionais por município, idade e sexo 2000-2024 – Brasil [Resident population – Study of population estimates by municipality, age, and sex, 2000–2024 – Brazil] [Data set]. DATASUS - Tabnet. Retrieved November 16, 2023, from http://tabnet.datasus.gov.br/cgi/deftohtm.exe?ibge/cnv/popsvs2024br.def

Corsi, D. J., Perkins, J. M., & Subramanian, S. V. (2017). Child anthropometry data quality from Demographic and Health Surveys, Multiple Indicator Cluster Surveys, and National Nutrition Surveys in the West Central Africa region: Are we comparing apples and oranges? Global Health Action, 10(1), 1328185. https://doi.org/10.1080/16549716.2017.1328185

Finaret, A. B., & Hutchinson, M. (2018). Missingness of height data from the demographic and health surveys in Africa between 1991 and 2016 was not random but is unlikely to have major implications for biases in estimating stunting prevalence or the determinants of child height. The Journal of Nutrition, 148(5), 781–789. https://doi.org/10.1093/jn/nxy037

Lawman, H. G., Ogden, C. L., Hassink, S., Mallya, G., Vander Veur, S., & Foster, G. D. (2015). Comparing methods for identifying biologically implausible values in height, weight, and body mass index among youth. American Journal of Epidemiology, 182(4), 359–365. https://doi.org/10.1093/aje/kwv057

Lyons-Amos, M., & Stones, T. (2017). Trends in demographic and health survey data quality: An analysis of age heaping over time in 34 countries in sub-Saharan Africa between 1987 and 2015. BMC Research Notes, 10(1), 760. https://doi.org/10.1186/s13104-017-3091-x

Mei, Z. (2007). Standard deviation of anthropometric Z-scores as a data quality assessment tool using the 2006 WHO growth standards: A cross country analysis. Bulletin of the World Health Organization, 85(6), 441–448. https://doi.org/10.2471/BLT.06.034421

Mourão, E., Gallo, C. D. O., Nascimento, F. A. D., & Jaime, P. C. (2020). Tendência temporal da cobertura do Sistema de Vigilância Alimentar e Nutricional entre crianças menores de 5 anos da região Norte do Brasil, 2008-2017. Epidemiologia e Serviços de Saúde, 29(2). https://doi.org/10.5123/S1679-49742020000200026

Nannan, N., Dorrington, R., & Bradshaw, D. (2019). Estimating completeness of birth registration in South Africa, 1996–2011. Bulletin of the World Health Organization, 97(7), 468–476. https://doi.org/10.2471/BLT.18.222620

Nascimento, F. A. D., Silva, S. A. D., & Jaime, P. C. (2017). Cobertura da avaliação do estado nutricional no Sistema de Vigilância Alimentar e Nutricional brasileiro: 2008 a 2013. Cadernos de Saúde Pública, 33(12). https://doi.org/10.1590/0102-311x00161516

Pereira, R. H. M., & Goncalves, C. N. (n.d.). geobr: Download official spatial data sets of Brazil [Computer software]. https://doi.org/10.32614/CRAN.package.geobr

Perumal, N., Namaste, S., Qamar, H., Aimone, A., Bassani, D. G., & Roth, D. E. (2020). Anthropometric data quality assessment in multisurvey studies of child growth. The American Journal of Clinical Nutrition, 112, 806S–815S. https://doi.org/10.1093/ajcn/nqaa162

R Core Team. (n.d.). R: A language and environment for statistical computing [Computer software]. R Foundation for Statistical Computing. https://www.R-project.org

Silva, N. de J., Silva, J. F. de M. e, Carrilho, T. R. B., Pinto, E. de J., Andrade, R. da C. S. de, Silva, S. A., Pedroso, J., Spaniol, A. M., Bortolini, G. A., Fagundes, A., Nilson, E. A. F., Fiaccone, R. L., Kac, G., Barreto, M. L., & Ribeiro-Silva, R. de C. (2023). Qualidade dos dados antropométricos infantis do Sisvan, Brasil, 2008-2017. Revista de Saúde Pública, 57(1), 62. https://doi.org/10.11606/s1518-8787.2023057004655

Sistema de Vigilância Alimentar e Nutricional, Coordenação-Geral de Alimentação e Nutrição, Departamento de Promoção da Saúde, Coordenação Setorial de Tecnologia da Informação, Secretaria de Atenção Primária à Saúde, & Ministério da Saúde. (n.d.). Microdados dos acompanhamentos de estado nutricional [Microdata on nutritional status monitoring] [Data set]. openDataSUS. Retrieved November 16, 2023, from https://opendatasus.saude.gov.br/dataset/sisvan-estado-nutricional

Vartanian, D. (n.d.). {orbis}: Spatial data analysis tools [Computer software]. https://danielvartan.github.io/orbis/

Vartanian, D., & Carvalho, A. M. de. (2025). A reproducible pipeline for processing DATASUS annual population estimates by municipality, age, and sex in Brazil (2000-2024). Sustentarea Research and Extension Group at the University of São Paulo. https://sustentarea.github.io/datasus-pop-estimates

Wickham, H. (2023). The tidy tools manifesto. Tidyverse. https://tidyverse.tidyverse.org/articles/manifesto.html

Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for data science: Import, tidy, transform, visualize, and model data (2nd ed.). O’Reilly Media. https://r4ds.hadley.nz