A Reproducible Pipeline for Processing SISVAN Microdata on Nutritional Status Monitoring in Brazil (2008-2023)

Authors

Daniel Vartanian, João Pedro J. Schettino, & Aline M. de Carvalho

Published

2025-05-05

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

Overview

This report contains a reproducible pipeline for processing SISVAN microdata on nutritional status monitoring in Brazil (2008–2023). The main goal is to provide an open and reliable workflow for processing these data, supporting research and informed public policy decisions.

This pipeline is still under development and may not be fully functional.

This warning will be removed once the pipeline is complete.

Problem

The Food and Nutrition Surveillance System (SISVAN) is a strategic tool for monitoring the nutritional status of the Brazilian population, particularly those served by Brazil’s Unified Health System (SUS). However, despite its broad scope and importance, the anthropometric data recorded in SISVAN often suffer from quality issues that limit their usefulness for rigorous analyses and evidence-based policymaking (Silva et al., 2023).

Multiple factors contribute to these quality concerns, including the lack of standardized measurement protocols, variability in staff training, inconsistencies in data entry and processing, and incomplete population coverage (Bagni & Barros, 2015; Corsi et al., 2017; Perumal et al., 2020). To assess and improve data quality, several indicators have been proposed and applied, such as population coverage (Mourão et al., 2020; Nascimento et al., 2017), completeness of birth dates and anthropometric measurements (Finaret & Hutchinson, 2018; Nannan et al., 2019), digit preference for age, height, and weight (Bopp & Faeh, 2008; Lyons-Amos & Stones, 2017), the percentage of biologically implausible values (Lawman et al., 2015), and the dispersion and distribution of standardized weight and height measurements (Mei, 2007; Perumal et al., 2020).

In light of this, there is a need for an open and reproducible pipeline for processing SISVAN microdata, aiming to identify, correct, or remove problematic records and ensure greater consistency, completeness, and plausibility of the information for use in research and public policy.

Data Availability

The processed data are available in both csv and rds formats via a dedicated repository on the Open Science Framework (OSF), accessible here. A metadata file is included alongside the validated data. You can also access these files directly from R using the osfr package.

A backup copy of the raw data is also available on OSF. You can access it here.

Methods

Source of Data

The data used in this analysis come from the following sources:

For technical information about the raw dataset, see the official technical note (in Portuguese).

Data Munging

The data munging followed the data science workflow outlined by Wickham et al. (2023), as illustrated in Figure 1. All steps were carried out using the Quarto publishing system (Allaire et al., n.d.), the R programming language (R Core Team, n.d.), and several R packages.

Packages from the tidyverse and the rOpenSci peer-reviewed ecosystem, along with other R packages that adhere to the tidy tools manifesto (Wickham, 2023), were prioritized. All steps were designed to make the results transparent and reproducible.

Figure 1: Data science workflow created by Wickham, Çetinkaya-Rundel, and Grolemund.

Source: Reproduced from Wickham et al. (2023).

Data Validation

Different validation techniques were used to ensure data quality and reliability:

  • The number of records imported from the raw files was compared with the number returned by the SISVAN Online Data Access Tool.
  • Duplicates were removed based on distinct combinations of the variables id, age, date (date of the individual’s nutritional assessment), weight, and height.
  • The number of nutritional assessments was compared with the estimated number of children in the population.

The quality indicators proposed by Silva et al. (2023) were also used for validation. Refer to the article for more details.

Code Style

The Tidyverse code style guide and design principles were followed to ensure consistency and enhance readability.

Reproducibility

The pipeline is fully reproducible and can be run again at any time. See the README file in the code repository to learn how to run it.

Setting the Environment

library(brandr)
library(cli)
library(dplyr)
library(fs)
library(ggplot2)
library(groomr) # github.com/danielvartan/groomr
library(here)
library(httr2)
library(lubridate)
library(orbis) # github.com/danielvartan/orbis
library(osfr)
library(pal) # gitlab.com/rpkg.dev/pal
library(plotr) # github.com/danielvartan/plotr
library(readr)
library(tidyr)
library(utils)
library(vroom)

Setting the Initial Variables

year <- 2017
age_limits <- c(0, 4)

Click here to access the microdata data dictionary (in Portuguese).

col_selection <- c(
  "CO_PESSOA_SISVAN",
  "CO_MUNICIPIO_IBGE",
  "DT_ACOMPANHAMENTO",
  "SG_SEXO",
  "NU_IDADE_ANO",
  "NU_PESO",
  "NU_ALTURA"
)

Downloading the Data

SISVAN microdata files are very large. For practical reasons, some code chunks have eval: false set to prevent downloading the data each time the report is rendered. When running the pipeline in a loop or for full automation, remove these lines to enable automatic downloading.

if (!dir.exists(here::here("data"))) dir.create(here::here("data"))
raw_file_pattern <- paste0("raw-", year)
file <- here::here("data", paste0(raw_file_pattern, ".zip"))

paste0(
    "https://s3.sa-east-1.amazonaws.com/ckan.saude.gov.br/SISVAN/",
    "estado_nutricional/sisvan_estado_nutricional_",
    year,
    ".zip"
  ) |>
  httr2::request() |>
  httr2::req_progress() |>
  httr2::req_perform(file)
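
For looped or automated runs, the URL construction above can be wrapped in a small helper. The following is a minimal sketch: the function name `sisvan_url()` is hypothetical and not part of the pipeline, and it assumes that the URL pattern used in the chunk above holds for every year in the series.

```r
# Hypothetical helper (not part of the pipeline): builds the download
# URL for a given year, following the pattern used in the chunk above.
sisvan_url <- function(year) {
  stopifnot(is.numeric(year), length(year) >= 1)

  paste0(
    "https://s3.sa-east-1.amazonaws.com/ckan.saude.gov.br/SISVAN/",
    "estado_nutricional/sisvan_estado_nutricional_",
    year,
    ".zip"
  )
}

# `paste0()` is vectorized, so several years can be handled at once.
urls <- sisvan_url(2016:2017)
print(urls)
```

Each element of `urls` could then be passed to `httr2::request()` in the same way as in the chunk above.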

Unzipping the Data

file <-
  file |>
  utils::unzip(exdir = here::here("data"), overwrite = TRUE)
file <-
  file |>
  fs::file_move(here::here("data", paste0(raw_file_pattern, ".csv")))
fs::file_delete(here::here("data", paste0(raw_file_pattern, ".zip")))

Checking Data Dimensions

file |> groomr::peek_csv_file(delim = ";", skip = 0, has_header = TRUE)
#> The file has 34 columns, 28,537,529 rows, and 970,275,986 cells.

Reading and Filtering the Data

We use the vroom R package together with the AWK programming language to efficiently handle large datasets and mitigate memory issues. This approach allows the pipeline to run locally on most machines, though we recommend a minimum of 12 GB of RAM for optimal performance. Alternatively, the pipeline can also be executed on cloud platforms such as Google Colab or RStudio Cloud.
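
To make the filtering idea concrete, here is a minimal sketch that runs on a toy file rather than the SISVAN data. The helper `build_awk_filter()` and the toy file are hypothetical, and base R's `readLines()` stands in for vroom so the example has no package dependencies. It assumes that `awk` is installed and on the system path.

```r
# Hypothetical helper: builds an AWK command that keeps only the rows
# whose field `col_index` lies within [lower, upper].
build_awk_filter <- function(file, col_index, lower, upper) {
  paste0(
    "awk -F ';' '($", col_index, " >= ", lower, ") && ",
    "($", col_index, " <= ", upper, ")' ",
    shQuote(file) # Quote the path so that spaces do not break the command.
  )
}

# Toy semicolon-delimited file standing in for the raw microdata.
toy_file <- tempfile(fileext = ".csv")
writeLines(c("id;sex;age", "a;F;1", "b;M;7", "c;F;4"), toy_file)

cmd <- build_awk_filter(toy_file, col_index = 3, lower = 0, upper = 4)

# Run the filter only if `awk` is available on this machine.
if (nzchar(Sys.which("awk"))) {
  con <- pipe(cmd)
  filtered <- readLines(con)
  close(con)
  print(filtered)
}
```

Because the filter runs before R reads anything, only the matching rows reach memory. A non-numeric header field should fail the numeric range test and be dropped along with the out-of-range rows, which is consistent with column names being supplied explicitly when the filtered stream is read.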

col_names <- c(
  "CO_ACOMPANHAMENTO",
  "CO_PESSOA_SISVAN",
  "ST_PARTICIPA_ANDI",
  "CO_MUNICIPIO_IBGE",
  "SG_UF",
  "NO_MUNICIPIO",
  "CO_CNES",
  "NU_IDADE_ANO",
  "NU_FASE_VIDA",
  "DS_FASE_VIDA",
  "SG_SEXO",
  "CO_RACA_COR",
  "DS_RACA_COR",
  "CO_POVO_COMUNIDADE",
  "DS_POVO_COMUNIDADE",
  "CO_ESCOLARIDADE",
  "DS_ESCOLARIDADE",
  "DT_ACOMPANHAMENTO",
  "NU_COMPETENCIA",
  "NU_PESO",
  "NU_ALTURA",
  "DS_IMC",
  "DS_IMC_PRE_GESTACIONAL",
  "PESO X IDADE",
  "PESO X ALTURA",
  "CRI. ALTURA X IDADE",
  "CRI. IMC X IDADE",
  "ADO. ALTURA X IDADE",
  "ADO. IMC X IDADE",
  "CO_ESTADO_NUTRI_ADULTO",
  "CO_ESTADO_NUTRI_IDOSO",
  "CO_ESTADO_NUTRI_IMC_SEMGEST",
  "CO_SISTEMA_ORIGEM_ACOMP",
  "SISTEMA_ORIGEM_ACOMP"
)
schema <- vroom::cols(
  CO_ACOMPANHAMENTO = vroom::col_character(),
  CO_PESSOA_SISVAN = vroom::col_character(),
  ST_PARTICIPA_ANDI = vroom::col_character(),
  CO_MUNICIPIO_IBGE = vroom::col_integer(),
  SG_UF = vroom::col_factor(),
  NO_MUNICIPIO = vroom::col_character(), # ? vroom::col_factor()
  CO_CNES = vroom::col_integer(),
  NU_IDADE_ANO = vroom::col_integer(),
  NU_FASE_VIDA = vroom::col_character(), # decimal mark = "." (double)
  DS_FASE_VIDA = vroom::col_factor(),
  SG_SEXO = vroom::col_factor(),
  CO_RACA_COR = vroom::col_character(),
  DS_RACA_COR = vroom::col_factor(),
  CO_POVO_COMUNIDADE = vroom::col_integer(),
  DS_POVO_COMUNIDADE = vroom::col_factor(),
  CO_ESCOLARIDADE = vroom::col_character(),
  DS_ESCOLARIDADE = vroom::col_factor(),
  DT_ACOMPANHAMENTO = vroom::col_date(),
  NU_COMPETENCIA = vroom::col_integer(),
  NU_PESO = vroom::col_double(),
  NU_ALTURA = vroom::col_integer(),
  DS_IMC = vroom::col_double(),
  DS_IMC_PRE_GESTACIONAL = vroom::col_character(), # decimal mark = "." (double)
  "PESO X IDADE" = vroom::col_factor(),
  "PESO X ALTURA" = vroom::col_factor(),
  "CRI. ALTURA X IDADE" = vroom::col_factor(),
  "CRI. IMC X IDADE" = vroom::col_factor(),
  "ADO. ALTURA X IDADE" = vroom::col_factor(),
  "ADO. IMC X IDADE" = vroom::col_factor(),
  CO_ESTADO_NUTRI_ADULTO = vroom::col_factor(),
  CO_ESTADO_NUTRI_IDOSO = vroom::col_factor(),
  CO_ESTADO_NUTRI_IMC_SEMGEST = vroom::col_factor(),
  CO_SISTEMA_ORIGEM_ACOMP = vroom::col_integer(),
  SISTEMA_ORIGEM_ACOMP = vroom::col_factor()
)

You may see warning messages about failed parsing. These warnings are expected due to minor inconsistencies in the SISVAN raw data and do not affect the overall analysis.

data <-
  vroom::vroom(
     # Uses `pipe()` and `awk` to filter data to avoid loading the
     # entire file into memory.
    file = pipe(
      paste(
        "awk -F ';' '{ if (",
        "($8 >= ", age_limits[1], ") && ($8 <= ", age_limits[2], ")",
        ") { print } }'",
        file
      )
    ),
    delim = ";",
    col_names = col_names,
    col_types = schema,
    col_select = dplyr::all_of(col_selection),
    id = NULL,
    skip = 0,
    n_max = Inf,
    na = c("", "NA"),
    quote = "\"",
    comment = "",
    skip_empty_rows = TRUE,
    trim_ws = TRUE,
    escape_double = TRUE,
    escape_backslash = FALSE,
    locale = vroom::locale(
      date_names = "pt",
      date_format = "%d/%m/%Y",
      time_format = "%H:%M:%S",
      decimal_mark = ",",
      grouping_mark = ".",
      tz = "America/Sao_Paulo",
      encoding = readr::guess_encoding(file)$encoding[1]
    ),
    guess_max = 100,
    altrep = TRUE,
    num_threads = vroom:::vroom_threads(),
    progress = vroom::vroom_progress(),
    show_col_types = NULL,
    .name_repair = "unique"
  )
data |> dplyr::glimpse()
#> Rows: 4,775,907
#> Columns: 7
#> $ CO_PESSOA_SISVAN  <chr> "B053B4FAD12CF2F95F1C251702606DCCD870A406", "CF43…
#> $ CO_MUNICIPIO_IBGE <int> 230670, 420270, 520520, 251445, 231380, 230710, 2…
#> $ DT_ACOMPANHAMENTO <date> 2017-01-23, 2017-01-11, 2017-01-02, 2017-01-10, …
#> $ SG_SEXO           <fct> M, M, F, F, F, F, F, F, M, M, F, F, M, F, F, F, F…
#> $ NU_IDADE_ANO      <int> 1, 3, 3, 0, 4, 0, 0, 1, 1, 0, 0, 4, 1, 1, 0, 0, 0…
#> $ NU_PESO           <dbl> 11.600, 17.200, 14.000, 12.200, 16.000, 6.400, 8.…
#> $ NU_ALTURA         <int> 77, 90, 97, 73, 100, 60, 63, 73, 88, 66, 67, 108,…
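
The locale passed to `vroom::vroom()` above controls how decimal commas and day/month/year dates are interpreted. As a small illustration, the sketch below applies similar settings with readr's parsers to two made-up strings. The input strings are assumptions chosen for the example, not values copied from the raw file.

```r
library(readr)

# Locale with the same date and number settings used in the call above.
pt_locale <- readr::locale(
  date_names = "pt",
  date_format = "%d/%m/%Y",
  decimal_mark = ",",
  grouping_mark = "."
)

# A decimal comma is read as a decimal point.
weight <- readr::parse_double("11,6", locale = pt_locale)

# With no explicit format, the locale's `date_format` is used.
date <- readr::parse_date("23/01/2017", locale = pt_locale)

print(weight)
print(date)
```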

Renaming the Data

data <-
  data |>
  janitor::clean_names() |>
  dplyr::rename(
    id = co_pessoa_sisvan,
    municipality_code = co_municipio_ibge,
    date = dt_acompanhamento,
    sex = sg_sexo,
    age = nu_idade_ano,
    weight = nu_peso,
    height = nu_altura
  )
data |> dplyr::glimpse()
#> Rows: 4,775,907
#> Columns: 7
#> $ id                <chr> "B053B4FAD12CF2F95F1C251702606DCCD870A406", "CF43…
#> $ municipality_code <int> 230670, 420270, 520520, 251445, 231380, 230710, 2…
#> $ date              <date> 2017-01-23, 2017-01-11, 2017-01-02, 2017-01-10, …
#> $ sex               <fct> M, M, F, F, F, F, F, F, M, M, F, F, M, F, F, F, F…
#> $ age               <int> 1, 3, 3, 0, 4, 0, 0, 1, 1, 0, 0, 4, 1, 1, 0, 0, 0…
#> $ weight            <dbl> 11.600, 17.200, 14.000, 12.200, 16.000, 6.400, 8.…
#> $ height            <int> 77, 90, 97, 73, 100, 60, 63, 73, 88, 66, 67, 108,…

Tidying the Data

data <-
  data |>
  dplyr::mutate(
    sex =
      sex |>
      dplyr::case_match(
        "F" ~ "female",
        "M" ~ "male"
      ) |>
      factor(
        levels = c("male", "female"),
        ordered = FALSE
      )
  ) |>
  dplyr::relocate(id, date)
data |> dplyr::glimpse()
#> Rows: 4,775,907
#> Columns: 7
#> $ id                <chr> "B053B4FAD12CF2F95F1C251702606DCCD870A406", "CF43…
#> $ date              <date> 2017-01-23, 2017-01-11, 2017-01-02, 2017-01-10, …
#> $ municipality_code <int> 230670, 420270, 520520, 251445, 231380, 230710, 2…
#> $ sex               <fct> male, male, female, female, female, female, femal…
#> $ age               <int> 1, 3, 3, 0, 4, 0, 0, 1, 1, 0, 0, 4, 1, 1, 0, 0, 0…
#> $ weight            <dbl> 11.600, 17.200, 14.000, 12.200, 16.000, 6.400, 8.…
#> $ height            <int> 77, 90, 97, 73, 100, 60, 63, 73, 88, 66, 67, 108,…

Transforming the Data

Adding State and Region Data

The orbis R package retrieves state and region information using the geobr package, developed by the Brazilian Institute for Applied Economic Research (IPEA). The geobr package, in turn, is based on official data from the Brazilian Institute of Geography and Statistics (IBGE).

brazil_municipalities <- orbis::get_brazil_municipality(
  year = plotr:::get_closest_geobr_year(year, type = "municipality")
)
data <-
  data |>
  dplyr::left_join(
    brazil_municipalities |>
      dplyr::mutate(
        municipality_code =
          municipality_code |>
          stringr::str_sub(end = -2) |>
          as.integer()
      ),
    by = "municipality_code"
  ) |>
  dplyr::relocate(
    id,
    date,
    region_code,
    region,
    state_code,
    state,
    federal_unit,
    municipality_code,
    municipality
  )
data |> dplyr::glimpse()
#> Rows: 4,775,907
#> Columns: 13
#> $ id                <chr> "B053B4FAD12CF2F95F1C251702606DCCD870A406", "CF43…
#> $ date              <date> 2017-01-23, 2017-01-11, 2017-01-02, 2017-01-10, …
#> $ region_code       <int> 2, 4, 5, 2, 2, 2, 2, 3, 2, 2, 4, 3, 1, 1, 3, 2, 4…
#> $ region            <chr> "Northeast", "South", "Central-West", "Northeast"…
#> $ state_code        <int> 23, 42, 52, 25, 23, 23, 29, 31, 23, 26, 43, 35, 1…
#> $ state             <chr> "Ceará", "Santa Catarina", "Goiás", "Paraíba", "C…
#> $ federal_unit      <chr> "CE", "SC", "GO", "PB", "CE", "CE", "BA", "MG", "…
#> $ municipality_code <int> 230670, 420270, 520520, 251445, 231380, 230710, 2…
#> $ municipality      <chr> "Jaguaretama", "Botuverá", "Caturaí", "São José d…
#> $ sex               <fct> male, male, female, female, female, female, femal…
#> $ age               <int> 1, 3, 3, 0, 4, 0, 0, 1, 1, 0, 0, 4, 1, 1, 0, 0, 0…
#> $ weight            <dbl> 11.600, 17.200, 14.000, 12.200, 16.000, 6.400, 8.…
#> $ height            <int> 77, 90, 97, 73, 100, 60, 63, 73, 88, 66, 67, 108,…
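
The join above drops the last character of the municipality code in the reference table. This suggests that the reference table carries 7-digit IBGE codes (whose last digit is a check digit), while the SISVAN records use the 6-digit form. A minimal sketch of an equivalent transformation using integer division is shown below. The helper name `to_six_digit_code()` is hypothetical, and the two codes are used only as illustrative 7-digit inputs.

```r
# Hypothetical helper: drops the last (check) digit of a 7-digit code
# using integer division instead of string manipulation.
to_six_digit_code <- function(code) {
  as.integer(code) %/% 10L
}

# Illustrative 7-digit municipality codes.
six_digit <- to_six_digit_code(c(3550308L, 3304557L))
print(six_digit)
```

For positive integer codes this gives the same result as removing the last character and converting back to integer, without the round trip through character strings.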

Validating the Data

Removing Duplicates

data <-
  data |>
  dplyr::arrange(dplyr::desc(date)) |>
  dplyr::distinct(
    id,
    age,
    date,
    weight,
    height,
    .keep_all = TRUE
  )
#> Warning: One or more parsing issues, call `problems()` on your data frame for
#> details, e.g.:
#>   dat <- vroom(...)
#>   problems(dat)
data |> dplyr::glimpse()
#> Rows: 4,770,414
#> Columns: 13
#> $ id                <chr> "B1C98CBB3CB83C75B08E2F62056EB10A68DBEBA4", "5B1A…
#> $ date              <date> 2017-12-31, 2017-12-31, 2017-12-31, 2017-12-31, …
#> $ region_code       <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5…
#> $ region            <chr> "Central-West", "Central-West", "Central-West", "…
#> $ state_code        <int> 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 5…
#> $ state             <chr> "Goiás", "Goiás", "Goiás", "Goiás", "Goiás", "Goi…
#> $ federal_unit      <chr> "GO", "GO", "GO", "GO", "GO", "GO", "GO", "GO", "…
#> $ municipality_code <int> 522020, 522045, 522020, 521250, 522020, 522020, 5…
#> $ municipality      <chr> "São Miguel do Araguaia", "Senador Canedo", "São …
#> $ sex               <fct> female, male, female, male, female, male, female,…
#> $ age               <int> 3, 4, 2, 2, 2, 3, 3, 1, 2, 4, 2, 3, 2, 4, 3, 4, 4…
#> $ weight            <dbl> 17, NA, NA, NA, 14, 19, 16, 12, 14, 21, NA, 17, 1…
#> $ height            <int> 94, 107, 85, 81, 89, 96, 95, 80, 87, 127, 93, 93,…
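
As a quick check on the effect of this step, the row counts printed before and after deduplication can be compared. The sketch below uses only the two counts shown in the `glimpse()` outputs above.

```r
n_before <- 4775907 # Rows before `dplyr::distinct()`.
n_after <- 4770414 # Rows after `dplyr::distinct()`.

n_removed <- n_before - n_after
share_removed <- n_removed / n_before * 100

print(n_removed)
print(round(share_removed, 3))
```

For this year and age range, exact duplicates account for a very small share of the filtered records.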

Arranging the Data

data <-
  data |>
  dplyr::arrange(
    region_code,
    state_code,
    municipality_code,
    date,
    sex,
    age,
    weight,
    height
  )
data |> dplyr::glimpse()
#> Rows: 4,770,414
#> Columns: 13
#> $ id                <chr> "263E905B0395FF94BE2D97E92983F83D0F4D01E6", "B812…
#> $ date              <date> 2017-01-04, 2017-01-05, 2017-01-06, 2017-01-09, …
#> $ region_code       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ region            <chr> "North", "North", "North", "North", "North", "Nor…
#> $ state_code        <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ state             <chr> "Rondônia", "Rondônia", "Rondônia", "Rondônia", "…
#> $ federal_unit      <chr> "RO", "RO", "RO", "RO", "RO", "RO", "RO", "RO", "…
#> $ municipality_code <int> 110001, 110001, 110001, 110001, 110001, 110001, 1…
#> $ municipality      <chr> "Alta Floresta D'Oeste", "Alta Floresta D'Oeste",…
#> $ sex               <fct> female, male, female, male, male, female, male, m…
#> $ age               <int> 1, 1, 4, 2, 0, 3, 1, 2, 1, 4, 0, 2, 0, 0, 4, 0, 0…
#> $ weight            <dbl> 9.000, 9.700, 14.000, 12.700, 6.600, 13.800, 10.0…
#> $ height            <int> 81, 75, 95, 89, 58, 98, 83, 80, 76, 111, 53, 95, …

Data Dictionary

metadata <-
  data |>
  labelled::`var_label<-`(
    list(
      id = "Unique identifier of the individual",
      date = "Date of the individual's nutritional assessment",
      region_code = "IBGE region code",
      region = "Region name",
      state_code = "IBGE state code",
      state = "State name",
      federal_unit = "Federal unit name",
      municipality_code = "IBGE municipality code",
      municipality = "Municipality name",
      sex = "Sex of the individual",
      age = "Age of the individual in years",
      weight = "Weight of the individual in kilograms",
      height = "Height of the individual in centimeters"
    )
  ) |>
  labelled::generate_dictionary(details = "full") |>
  labelled::convert_list_columns_to_character()
metadata
data

Saving the Valid Data

Data

valid_file_pattern <- paste0(
  "valid-",
  year,
  "-age-",
  age_limits[1],
  "-",
  age_limits[2]
)
data |>
  readr::write_csv(
    here::here("data", paste0(valid_file_pattern, ".csv"))
  )
data |>
  readr::write_rds(
    here::here("data", paste0(valid_file_pattern, ".rds"))
  )

Metadata

metadata_file_pattern <- paste0(
  "metadata-",
  year,
  "-age-",
  age_limits[1],
  "-",
  age_limits[2]
)
metadata |>
  readr::write_csv(
    here::here("data", paste0(metadata_file_pattern, ".csv"))
  )
metadata |>
  readr::write_rds(
    here::here("data", paste0(metadata_file_pattern, ".rds"))
  )

Checking the Relative Coverage

Transforming the Data

Removing Duplicates by Year

As described in Silva et al. (2023, p. 4), to calculate SISVAN’s total resident population coverage, only the most recent record for each individual within each year is retained for analysis.

data <-
  data |>
    dplyr::mutate(year = lubridate::year(date)) |>
    dplyr::arrange(dplyr::desc(date)) |>
    dplyr::distinct(id, year, .keep_all = TRUE) |>
    dplyr::relocate(year, .after = date)
data |> dplyr::glimpse()
#> Rows: 4,622,727
#> Columns: 14
#> $ id                <chr> "1B7842AF30A5899C2B6D82688E95EDCD96355BA1", "9A35…
#> $ date              <date> 2017-12-31, 2017-12-31, 2017-12-31, 2017-12-31, …
#> $ year              <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2…
#> $ region_code       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ region            <chr> "North", "North", "North", "North", "North", "Nor…
#> $ state_code        <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ state             <chr> "Rondônia", "Rondônia", "Rondônia", "Rondônia", "…
#> $ federal_unit      <chr> "RO", "RO", "RO", "RO", "RO", "RO", "RO", "RO", "…
#> $ municipality_code <int> 110002, 110002, 110002, 110002, 110002, 110002, 1…
#> $ municipality      <chr> "Ariquemes", "Ariquemes", "Ariquemes", "Ariquemes…
#> $ sex               <fct> male, male, male, male, male, male, male, male, m…
#> $ age               <int> 0, 1, 1, 2, 2, 2, 2, 3, 4, 1, 2, 3, 4, 1, 1, 1, 2…
#> $ weight            <dbl> 7, 12, 13, 12, 13, NA, NA, 17, 18, NA, NA, 19, NA…
#> $ height            <int> 63, 82, 85, 90, 90, 87, 88, 117, 107, 61, 96, 107…
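
The step above relies on `dplyr::distinct()` keeping the first row of each combination, so sorting by descending date beforehand is what makes the retained record the most recent one. The toy example below illustrates this behavior. The data are made up for the illustration.

```r
library(dplyr)

# Made-up records: two individuals with several assessments in one year.
toy <- dplyr::tibble(
  id = c("a", "a", "b", "b", "b"),
  date = as.Date(c(
    "2017-03-01", "2017-09-15", "2017-01-10", "2017-06-20", "2017-02-05"
  )),
  weight = c(10.1, 11.4, 8.0, 9.2, 8.3)
)

latest <- toy |>
  dplyr::mutate(year = lubridate::year(date)) |>
  # Sort newest first so that `distinct()` keeps the latest record.
  dplyr::arrange(dplyr::desc(date)) |>
  dplyr::distinct(id, year, .keep_all = TRUE) |>
  dplyr::arrange(id)

print(latest)
```

Individual `a` keeps the assessment from 2017-09-15 and individual `b` keeps the one from 2017-06-20.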

Summarizing the Data by Year

data <-
  data |>
    dplyr::summarize(
      coverage = dplyr::n(),
      mean_age = age |> mean(na.rm = TRUE),
      mean_weight = weight |> mean(na.rm = TRUE),
      mean_height = height |> mean(na.rm = TRUE),
      .by = c(
        "year",
        "region_code",
        "state_code",
        "municipality_code"
      )
    )
data |> dplyr::glimpse()
#> Rows: 5,570
#> Columns: 8
#> $ year              <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2…
#> $ region_code       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ state_code        <int> 11, 11, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13, 1…
#> $ municipality_code <int> 110002, 110005, 110025, 110100, 120025, 120040, 1…
#> $ coverage          <int> 1577, 622, 887, 334, 557, 7291, 995, 3741, 2172, …
#> $ mean_age          <dbl> 2.581483830, 2.223472669, 2.278466742, 2.46107784…
#> $ mean_weight       <dbl> 15.17615741, 13.98031309, 14.23404969, 14.7764350…
#> $ mean_height       <dbl> 93.11921370, 91.22508039, 91.10496614, 93.9251497…

Adding Population Estimates

As described in the Methods section, the population estimates were obtained from the DATASUS platform, which provides annual data by municipality, age, and sex for Brazil from 2000 to 2024 (Comitê de Gestão de Indicadores et al., n.d.).

To ensure reproducibility and organization, the DATASUS data used in this pipeline are processed and validated through a separate reproducible pipeline, available here (Vartanian & Carvalho, 2025). The validated datasets are downloaded directly from OSF. For further details, refer to the linked pipeline.

datasus_file_pattern <- paste0("datasus-pop-estimates-", year)
datasus_file <- here::here("data", paste0(datasus_file_pattern, ".rds"))

if (!checkmate::test_file_exists(datasus_file)) {
  osf_id <-
    paste0("https://osf.io/", "h3pyd") |>
    osfr::osf_retrieve_node() |>
    osfr::osf_ls_files(
      type = "file",
      pattern = paste0("valid-", year, ".rds")
    )

  osfr::osf_download(
    x = osf_id,
    path = tempdir(),
    conflicts = "overwrite"
  ) |>
    dplyr::pull(local_path) |>
    fs::file_move(datasus_file)
}

pop_estimates <- datasus_file |> readr::read_rds()
data <-
  pop_estimates |>
  dplyr::filter(dplyr::between(age, age_limits[1], age_limits[2])) |>
  dplyr::summarize(
    n = n |> sum(na.rm = TRUE),
    .by = c(
      "year",
      "region_code",
      "state_code",
      "municipality_code"
    )
  ) |>
  dplyr::mutate(
    municipality_code =
      municipality_code |>
      stringr::str_sub(end = -2) |>
      as.integer()
  ) |>
  dplyr::right_join(
    data,
    by = c(
      "year",
      "region_code",
      "state_code",
      "municipality_code"
    )
  ) |>
  dplyr::rename(population = n) |>
  dplyr::relocate(population, .before = coverage)
data |> dplyr::glimpse()
#> Rows: 5,570
#> Columns: 9
#> $ year              <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2…
#> $ region_code       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ state_code        <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ municipality_code <int> 110001, 110002, 110003, 110004, 110005, 110006, 1…
#> $ population        <int> 1808, 8154, 426, 6551, 1258, 1242, 604, 1353, 250…
#> $ coverage          <int> 734, 1577, 94, 1334, 622, 374, 223, 584, 702, 107…
#> $ mean_age          <dbl> 2.179836512, 2.581483830, 2.053191489, 2.11544227…
#> $ mean_weight       <dbl> 13.63854669, 15.17615741, 14.22727273, 13.3930077…
#> $ mean_height       <dbl> 90.91256831, 93.11921370, 91.26595745, 89.4924698…

Validating the Data

The population value used here is an estimate. If the SISVAN coverage for a municipality exceeds the estimated population, the population value is adjusted to match the coverage.

Note: At this stage, only the most recent record for each individual is retained.

data <-
  data |>
  dplyr::mutate(
    population = dplyr::case_when(
      coverage > population ~ coverage,
      TRUE ~ population
    )
  )
data |> dplyr::glimpse()
#> Rows: 5,570
#> Columns: 9
#> $ year              <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2…
#> $ region_code       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ state_code        <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ municipality_code <int> 110001, 110002, 110003, 110004, 110005, 110006, 1…
#> $ population        <int> 1808, 8154, 426, 6551, 1258, 1242, 604, 1353, 250…
#> $ coverage          <int> 734, 1577, 94, 1334, 622, 374, 223, 584, 702, 107…
#> $ mean_age          <dbl> 2.179836512, 2.581483830, 2.053191489, 2.11544227…
#> $ mean_weight       <dbl> 13.63854669, 15.17615741, 14.22727273, 13.3930077…
#> $ mean_height       <dbl> 90.91256831, 93.11921370, 91.26595745, 89.4924698…
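
The adjustment above amounts to taking the elementwise maximum of the two columns. A minimal base R sketch of the same idea with `pmax()` follows. This is an alternative formulation for illustration, not the pipeline's code, and all values except the first pair are made up.

```r
population <- c(1808L, 426L, NA)
coverage <- c(734L, 500L, 120L)

# Elementwise maximum: raises `population` to `coverage` where needed.
# A missing population stays missing, as in the `case_when()` version.
adjusted <- pmax(population, coverage)
print(adjusted)
```

The second value is raised from 426 to 500, which caps the relative coverage of that municipality at 100%.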

Calculating Relative Coverage

data <-
  data |>
  dplyr::mutate(coverage_per = (coverage / population) * 100) |>
  dplyr::relocate(coverage_per, .after = coverage)
data |> dplyr::glimpse()
#> Rows: 5,570
#> Columns: 10
#> $ year              <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2…
#> $ region_code       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ state_code        <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ municipality_code <int> 110001, 110002, 110003, 110004, 110005, 110006, 1…
#> $ population        <int> 1808, 8154, 426, 6551, 1258, 1242, 604, 1353, 250…
#> $ coverage          <int> 734, 1577, 94, 1334, 622, 374, 223, 584, 702, 107…
#> $ coverage_per      <dbl> 40.597345133, 19.340201128, 22.065727700, 20.3633…
#> $ mean_age          <dbl> 2.179836512, 2.581483830, 2.053191489, 2.11544227…
#> $ mean_weight       <dbl> 13.63854669, 15.17615741, 14.22727273, 13.3930077…
#> $ mean_height       <dbl> 90.91256831, 93.11921370, 91.26595745, 89.4924698…
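
As a worked check, the relative coverage of the first three municipalities can be recomputed by hand from the `population` and `coverage` values printed above.

```r
# Values taken from the first three rows of the output above.
population <- c(1808, 8154, 426)
coverage <- c(734, 1577, 94)

coverage_per <- coverage / population * 100
print(coverage_per)
```

The results agree with the `coverage_per` values shown in the output (40.597345133, 19.340201128, and 22.065727700).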

Arranging the Data

data <-
  data |>
  dplyr::arrange(
    year,
    region_code,
    state_code,
    municipality_code
  )
data |> dplyr::glimpse()
#> Rows: 5,570
#> Columns: 10
#> $ year              <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2…
#> $ region_code       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ state_code        <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ municipality_code <int> 110001, 110002, 110003, 110004, 110005, 110006, 1…
#> $ population        <int> 1808, 8154, 426, 6551, 1258, 1242, 604, 1353, 250…
#> $ coverage          <int> 734, 1577, 94, 1334, 622, 374, 223, 584, 702, 107…
#> $ coverage_per      <dbl> 40.597345133, 19.340201128, 22.065727700, 20.3633…
#> $ mean_age          <dbl> 2.179836512, 2.581483830, 2.053191489, 2.11544227…
#> $ mean_weight       <dbl> 13.63854669, 15.17615741, 14.22727273, 13.3930077…
#> $ mean_height       <dbl> 90.91256831, 93.11921370, 91.26595745, 89.4924698…

Checking Relative Coverage by Region

The coverage observed here is slightly lower than that reported in Silva et al. (2023, Table 2). This difference may be explained by the use of different data sources (Fundação Oswaldo Cruz (Fiocruz) vs. OpenDataSUS).

data |>
  dplyr::mutate(region = orbis::get_brazil_region(region_code)) |>
  dplyr::summarize(
    population = population |> sum(na.rm = TRUE),
    coverage = coverage |> sum(na.rm = TRUE),
    .by = "region"
  ) |>
  dplyr::slice(c(1, 2, 5, 3, 4)) |>
  dplyr::mutate(coverage_per = (coverage / population) * 100) |>
  dplyr::rename(
    Region = region,
    Population = population,
    `SISVAN coverage` = coverage,
    `SISVAN coverage (%)` = coverage_per
  ) |>
  pal::pipe_table() |>
  pal::cat_lines()
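| Region | Population | SISVAN coverage | SISVAN coverage (%) |
|:-------------|-----------:|----------------:|--------------------:|
| North | 1592792 | 624150 | 39.18590751 |
| Northeast | 4107294 | 1813679 | 44.15751587 |
| Central-West | 1208858 | 268745 | 22.23131253 |
| Southeast | 5767592 | 1336352 | 23.17001619 |
| South | 1975068 | 579801 | 29.35600192 |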
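
The regional figures can also serve as a consistency check. In the sketch below, which uses only the numbers reported in the table above, the regional coverage counts are summed and compared with the number of records retained after keeping one record per individual per year (4,622,727 rows).

```r
# Values copied from the regional table above.
regions <- data.frame(
  region = c("North", "Northeast", "Central-West", "Southeast", "South"),
  population = c(1592792, 4107294, 1208858, 5767592, 1975068),
  coverage = c(624150, 1813679, 268745, 1336352, 579801)
)

total_coverage <- sum(regions$coverage)
total_population <- sum(regions$population)
national_per <- total_coverage / total_population * 100

print(total_coverage)
print(round(national_per, 1))
```

The total matches the deduplicated row count, which indicates that no records were lost in the summarizing and joining steps. The overall percentage is computed from the summed counts rather than by averaging the regional percentages, so larger regions weigh proportionally more.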

Checking Relative Coverage by State

data |>
  dplyr::mutate(state = orbis::get_brazil_state(state_code)) |>
  dplyr::summarize(
    population = population |> sum(na.rm = TRUE),
    coverage = coverage |> sum(na.rm = TRUE),
    .by = "state"
  ) |>
  dplyr::arrange(state) |>
  dplyr::mutate(coverage_per = (coverage / population) * 100) |>
  dplyr::rename(
    State = state,
    Population = population,
    `SISVAN coverage` = coverage,
    `SISVAN coverage (%)` = coverage_per
  ) |>
  pal::pipe_table() |>
  pal::cat_lines()
| State | Population | SISVAN coverage | SISVAN coverage (%) |
|:--------------------|-----------:|----------------:|--------------------:|
| Acre | 81517 | 36178 | 44.380926678 |
| Alagoas | 253571 | 113343 | 44.698723435 |
| Amapá | 79072 | 20820 | 26.330433023 |
| Amazonas | 403287 | 165140 | 40.948505655 |
| Bahia | 1012762 | 444864 | 43.925818702 |
| Ceará | 645357 | 281507 | 43.620352766 |
| Distrito Federal | 218010 | 18757 | 8.603733774 |
| Espírito Santo | 277541 | 61507 | 22.161410386 |
| Goiás | 491920 | 103909 | 21.123150106 |
| Maranhão | 578369 | 283802 | 49.069365751 |
| Mato Grosso | 280532 | 79026 | 28.170048337 |
| Mato Grosso do Sul | 218396 | 67053 | 30.702485394 |
| Minas Gerais | 1309142 | 620924 | 47.429843363 |
| Paraná | 786664 | 253352 | 32.205871884 |
| Paraíba | 285820 | 150132 | 52.526765097 |
| Pará | 710233 | 290899 | 40.958248913 |
| Pernambuco | 692218 | 259659 | 37.511159779 |
| Piauí | 234935 | 125698 | 53.503309426 |
| Rio Grande do Norte | 237059 | 84896 | 35.812181778 |
| Rio Grande do Sul | 707754 | 177919 | 25.138536836 |
| Rio de Janeiro | 1122656 | 181772 | 16.191246473 |
| Rondônia | 135254 | 33899 | 25.063214397 |
| Roraima | 59647 | 16842 | 28.236122521 |
| Santa Catarina | 480650 | 148530 | 30.901903672 |
| Sergipe | 167203 | 69778 | 41.732504800 |
| São Paulo | 3058253 | 472149 | 15.438519965 |
| Tocantins | 123782 | 60372 | 48.772842578 |

Visualizing the Relative Coverage

brand_div_palette <- function(x) {
  brandr:::make_color_ramp(
    n_prop = x,
    colors = c(
      brandr::get_brand_color("dark-red"),
      # brandr::get_brand_color("white"),
      brandr::get_brand_color_mix(
        position = 950,
        color_1 = "dark-red",
        color_2 = "dark-red-triadic-blue",
        alpha = 0.5
      ),
      brandr::get_brand_color("dark-red-triadic-blue")
    )
  )
}
data |>
  tidyr::drop_na(coverage_per) |>
  plotr:::plot_hist(
    col = "coverage_per",
    density_line_color = brandr::get_brand_color("red"),
    x_label = "Coverage (%)",
    print = FALSE
  ) +
  ggplot2::labs(
    title = "SISVAN Coverage by Municipality (%)",
    subtitle = paste0("Year: ", year),
    caption = "Source: SISVAN"
  )

data |>
  tidyr::drop_na(coverage_per, municipality_code) |>
  plotr:::plot_brazil_municipality(
    col_fill = "coverage_per",
    col_code = "municipality_code",
    year = plotr:::get_closest_geobr_year(year, type = "municipality"),
    comparable_areas = FALSE,
    breaks = seq(0, 100, 25),
    limits = c(0, 100),
    palette = brand_div_palette,
    print = FALSE
  ) +
  ggplot2::labs(
    title = "SISVAN Coverage by Municipality (%)",
    subtitle = paste0("Year: ", year),
    caption = "Source: SISVAN"
  )
#> Scale on map varies by more than 10%, scale bar may be inaccurate

How to Cite

To cite this work, please use the following format:

Vartanian, D., Schettino, J. P. J., & Carvalho, A. M. (2025). A reproducible pipeline for processing SISVAN microdata on nutritional status monitoring in Brazil (2008-2023) [Report]. Sustentarea Research and Extension Group at the University of São Paulo. https://sustentarea.github.io/sisvan-nutritional-status

A BibTeX entry for LaTeX users is

@techreport{vartanian2025,
  title = {A reproducible pipeline for processing SISVAN microdata on nutritional status monitoring in Brazil (2008-2023)},
  author = {{Daniel Vartanian} and {João Pedro Junqueira Schettino} and {Aline Martins de Carvalho}},
  year = {2025},
  address = {São Paulo},
  institution = {Sustentarea Research and Extension Group at the University of São Paulo},
  langid = {en},
  url = {https://sustentarea.github.io/sisvan-nutritional-status}
}

License

The code in this report is licensed under the MIT License, while the documents are available under the Creative Commons Attribution 4.0 International License.

Acknowledgments

This work is part of the Sustentarea Research and Extension Group project “Global syndemic: The impact of anthropogenic climate change on the health and nutrition of children under five years old attended by Brazil’s public health system (SUS)”.

This work was supported by the Conselho Nacional de Desenvolvimento Científico e Tecnológico - Brazil (CNPq).

References

Allaire, J. J., Teague, C., Xie, Y., & Dervieux, C. (n.d.). Quarto [Computer software]. Zenodo. https://doi.org/10.5281/ZENODO.5960048
Bagni, U. V., & Barros, D. C. D. (2015). Erro em antropometria aplicada à avaliação nutricional nos serviços de saúde: Causas, consequências e métodos de mensuração. Nutrire, 40(2), 226–236. https://doi.org/10.4322/2316-7874.18613
Bopp, M., & Faeh, D. (2008). End-digits preference for self-reported height depends on language. BMC Public Health, 8(1), 342. https://doi.org/10.1186/1471-2458-8-342
Comitê de Gestão de Indicadores, Rede Interagencial de Informações para a Saúde, Coordenação-Geral de Informações e Análises Epidemiológicas, Secretaria de Vigilância em Saúde e Ambiente, Ministério da Saúde, & Instituto Brasileiro de Geografia e Estatística. (n.d.). População residente – Estudo de estimativas populacionais por município, idade e sexo 2000-2024 – Brasil [Resident population – Study of population estimates by municipality, age, and sex, 2000–2024 – Brazil] [Data set]. DATASUS - Tabnet. Retrieved November 16, 2023, from http://tabnet.datasus.gov.br/cgi/deftohtm.exe?ibge/cnv/popsvs2024br.def
Corsi, D. J., Perkins, J. M., & Subramanian, S. V. (2017). Child anthropometry data quality from Demographic and Health Surveys, Multiple Indicator Cluster Surveys, and National Nutrition Surveys in the West Central Africa region: Are we comparing apples and oranges? Global Health Action, 10(1), 1328185. https://doi.org/10.1080/16549716.2017.1328185
Finaret, A. B., & Hutchinson, M. (2018). Missingness of height data from the demographic and health surveys in Africa between 1991 and 2016 was not random but is unlikely to have major implications for biases in estimating stunting prevalence or the determinants of child height. The Journal of Nutrition, 148(5), 781–789. https://doi.org/10.1093/jn/nxy037
Lawman, H. G., Ogden, C. L., Hassink, S., Mallya, G., Vander Veur, S., & Foster, G. D. (2015). Comparing methods for identifying biologically implausible values in height, weight, and body mass index among youth. American Journal of Epidemiology, 182(4), 359–365. https://doi.org/10.1093/aje/kwv057
Lyons-Amos, M., & Stones, T. (2017). Trends in demographic and health survey data quality: An analysis of age heaping over time in 34 countries in sub-Saharan Africa between 1987 and 2015. BMC Research Notes, 10(1), 760. https://doi.org/10.1186/s13104-017-3091-x
Mei, Z. (2007). Standard deviation of anthropometric Z-scores as a data quality assessment tool using the 2006 WHO growth standards: A cross country analysis. Bulletin of the World Health Organization, 85(6), 441–448. https://doi.org/10.2471/BLT.06.034421
Mourão, E., Gallo, C. D. O., Nascimento, F. A. D., & Jaime, P. C. (2020). Tendência temporal da cobertura do Sistema de Vigilância Alimentar e Nutricional entre crianças menores de 5 anos da região Norte do Brasil, 2008-2017. Epidemiologia e Serviços de Saúde, 29(2). https://doi.org/10.5123/S1679-49742020000200026
Nannan, N., Dorrington, R., & Bradshaw, D. (2019). Estimating completeness of birth registration in South Africa, 1996–2011. Bulletin of the World Health Organization, 97(7), 468–476. https://doi.org/10.2471/BLT.18.222620
Nascimento, F. A. D., Silva, S. A. D., & Jaime, P. C. (2017). Cobertura da avaliação do estado nutricional no Sistema de Vigilância Alimentar e Nutricional brasileiro: 2008 a 2013. Cadernos de Saúde Pública, 33(12). https://doi.org/10.1590/0102-311x00161516
Pereira, R. H. M., & Goncalves, C. N. (n.d.). geobr: Download official spatial data sets of Brazil [Computer software]. https://doi.org/10.32614/CRAN.package.geobr
Perumal, N., Namaste, S., Qamar, H., Aimone, A., Bassani, D. G., & Roth, D. E. (2020). Anthropometric data quality assessment in multisurvey studies of child growth. The American Journal of Clinical Nutrition, 112, 806S–815S. https://doi.org/10.1093/ajcn/nqaa162
R Core Team. (n.d.). R: A language and environment for statistical computing [Computer software]. R Foundation for Statistical Computing. https://www.R-project.org
Silva, N. de J., Silva, J. F. de M. e, Carrilho, T. R. B., Pinto, E. de J., Andrade, R. da C. S. de, Silva, S. A., Pedroso, J., Spaniol, A. M., Bortolini, G. A., Fagundes, A., Nilson, E. A. F., Fiaccone, R. L., Kac, G., Barreto, M. L., & Ribeiro-Silva, R. de C. (2023). Qualidade dos dados antropométricos infantis do Sisvan, Brasil, 2008-2017. Revista de Saúde Pública, 57(1), 62. https://doi.org/10.11606/s1518-8787.2023057004655
Sistema de Vigilância Alimentar e Nutricional, Coordenação-Geral de Alimentação e Nutrição, Departamento de Promoção da Saúde, Coordenação Setorial de Tecnologia da Informação, Secretaria de Atenção Primária à Saúde, & Ministério da Saúde. (n.d.). Microdados dos acompanhamentos de estado nutricional [Microdata on nutritional status monitoring] [Data set]. openDataSUS. Retrieved November 16, 2023, from https://opendatasus.saude.gov.br/dataset/sisvan-estado-nutricional
Vartanian, D. (n.d.). orbis: Spatial data analysis tools [Computer software]. https://danielvartan.github.io/orbis/
Vartanian, D., & Carvalho, A. M. de. (2025). A reproducible pipeline for processing DATASUS annual population estimates by municipality, age, and sex in Brazil (2000-2024). Sustentarea Research and Extension Group at the University of São Paulo. https://sustentarea.github.io/datasus-pop-estimates
Wickham, H. (2023). The tidy tools manifesto. Tidyverse. https://tidyverse.tidyverse.org/articles/manifesto.html
Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for data science: Import, tidy, transform, visualize, and model data (2nd ed.). O’Reilly Media. https://r4ds.hadley.nz