A Reproducible Pipeline for Processing DATASUS Data on the Gini Index of Per Capita Household Income by Brazilian Municipality for the Years 1991, 2000, and 2010

Daniel Vartanian & Aline Martins de Carvalho

Author

Daniel Vartanian & Aline Martins de Carvalho

Published

2025-05-05

Overview

This report provides a reproducible pipeline for processing DATASUS data on the Gini Index of per capita household income by Brazilian municipality for the years 1991, 2000, and 2010. The main goal is to provide an open and reliable workflow for processing these data, supporting research and informed public policy decisions.

Data Availability

The processed data are available in both csv and rds formats via a dedicated repository on the Open Science Framework (OSF), accessible here. A metadata file is included alongside the validated data. You can also access these files directly from R using the osfr package.

A backup copy of the raw data is also available in OSF. You can access it here.
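
As a minimal sketch of the osfr route, the helper below lists the files in an OSF node and downloads them to a local folder. The function name `download_osf_files` and its `node_id` argument are illustrative; the repository's OSF identifier is not reproduced in this report, so it has to be supplied by the reader.

```r
# Hypothetical helper: list the files in an OSF node and download them.
# `node_id` is a placeholder for the repository's OSF identifier.
download_osf_files <- function(node_id, path = tempdir()) {
  node_id |>
    osfr::osf_retrieve_node() |>
    osfr::osf_ls_files(n_max = Inf) |> # List all files, not just the first 10.
    osfr::osf_download(path = path, conflicts = "skip")
}
```

Files that already exist in `path` are left untouched because of `conflicts = "skip"`.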

Methods

Source of Data

The data used in this report were sourced from the Department of Informatics of the Brazilian Unified Health System (DATASUS) platform, which provides data on the Gini Index of per capita household income by Brazilian municipality for the years 1991, 2000, and 2010 (Instituto de Pesquisas Econômicas e Aplicadas & Instituto Brasileiro de Geografia e Estatística, n.d.). These data are derived from the Population Census conducted by the Brazilian Institute of Geography and Statistics (IBGE).

For technical information about the raw dataset, see the official technical note (in Portuguese).

Data Munging

The data munging followed the data science workflow outlined by Wickham et al. (2023), as illustrated in Figure 1. All steps were carried out using the Quarto publishing system (Allaire et al., n.d.), the R programming language (R Core Team, n.d.), and several R packages.

Packages from the tidyverse and the rOpenSci peer-reviewed ecosystem, along with other R packages that adhere to the tidy tools manifesto (Wickham, 2023), were prioritized. Every step was designed to keep the results transparent and reproducible.

Figure 1: Data science workflow created by Wickham, Çetinkaya-Rundel, and Grolemund.

Code Style

The Tidyverse code style guide and design principles were followed to ensure consistency and enhance readability.

Reproducibility

The pipeline is fully reproducible and can be run again at any time. See the README file in the code repository to learn how to run it.

Setting the Environment

library(brandr)
library(dplyr)
library(ggplot2)
library(here)
library(httr2)
library(janitor)
library(labelled)
library(orbis) # github.com/danielvartan/orbis
library(osfr)
library(plotr) # github.com/danielvartan/plotr
library(readr)
library(stringr)
library(tidyr)

Downloading the Data

if (!dir.exists(here::here("data"))) dir.create(here::here("data"))

file <- here::here("data", "raw.csv")

"http://tabnet.datasus.gov.br/cgi/ibge/censo/bases/ginibr.csv" |>
  httr2::request() |>
  httr2::req_progress() |>
  httr2::req_perform(path = file)
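
When the pipeline is rerun, it may be preferable not to download the file again if a local copy already exists. The helper below is a sketch of that idea, not part of the original pipeline; the name `download_if_missing` is illustrative.

```r
# Hypothetical helper: download `url` to `path` only when the file is missing.
download_if_missing <- function(url, path) {
  if (file.exists(path)) return(invisible(path))

  # Create the destination folder if needed.
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)

  url |>
    httr2::request() |>
    httr2::req_retry(max_tries = 3) |> # Retry transient failures.
    httr2::req_perform(path = path)

  invisible(path)
}

# Usage (not run):
# download_if_missing(
#   "http://tabnet.datasus.gov.br/cgi/ibge/censo/bases/ginibr.csv",
#   here::here("data", "raw.csv")
# )
```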

Reading the Data

data <-
  file |>
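  readr::read_delim(
    delim = ";",
    col_names = FALSE,
    col_types = readr::cols(.default = "c"), # Read everything as text.
    trim_ws = TRUE,
    skip = 3 # Skip the lines that precede the data rows.
  ) |>
  dplyr::slice(1:5565) |> # Keep the 5,565 municipality rows.
  dplyr::mutate(
    # Convert the text from Latin-1 to UTF-8.
    dplyr::across(
      .cols = dplyr::everything(),
      .fns = ~ iconv(.x, from = "latin1", to = "UTF-8")
    )
  )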

data |> dplyr::glimpse()
#> Rows: 5,565
#> Columns: 4
#> $ X1 <chr> "110001 Alta Floresta D'Oeste", "110037 Alto Alegre dos Parecis"…
#> $ X2 <chr> "0,5983", "...", "...", "0,569", "0,5827", "...", "0,6527", "...…
#> $ X3 <chr> "0,5868", "0,508", "0,6256", "0,6534", "0,5927", "0,6474", "0,58…
#> $ X4 <chr> "0,5893", "0,5491", "0,5417", "0,5355", "0,5496", "0,5017", "0,5…
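
An alternative worth considering is to let readr handle the decimal comma and the `...` missing-value marker at read time. The sketch below uses two literal lines that mimic the output above rather than the real file; the column names and column types are assumptions for illustration.

```r
# Toy input mimicking two rows of the raw file (not the real download).
toy_lines <- paste(
  "110001 Alta Floresta D'Oeste;0,5983;0,5868;0,5893",
  "110037 Alto Alegre dos Parecis;...;0,508;0,5491",
  sep = "\n"
)

toy <- readr::read_delim(
  I(toy_lines),
  delim = ";",
  col_names = c("municipality", "x1991", "x2000", "x2010"),
  col_types = "cddd",
  locale = readr::locale(decimal_mark = ","), # Parse "0,5983" as 0.5983.
  na = "...", # Treat "..." as missing.
  trim_ws = TRUE
)
```

For the real file, `readr::locale()` also accepts an `encoding` argument, which might replace the `iconv()` step; this has not been verified against the DATASUS file.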

Renaming the Data

data <-
  data |>
  janitor::clean_names() |>
  dplyr::rename(
    municipality = x1,
    x1991 = x2,
    x2000 = x3,
    x2010 = x4
  )

data |> dplyr::glimpse()
#> Rows: 5,565
#> Columns: 4
#> $ municipality <chr> "110001 Alta Floresta D'Oeste", "110037 Alto Alegre do…
#> $ x1991        <chr> "0,5983", "...", "...", "0,569", "0,5827", "...", "0,6…
#> $ x2000        <chr> "0,5868", "0,508", "0,6256", "0,6534", "0,5927", "0,64…
#> $ x2010        <chr> "0,5893", "0,5491", "0,5417", "0,5355", "0,5496", "0,5…

Tidying the Data

data <-
  data |>
  dplyr::mutate(
    dplyr::across(
      .cols = dplyr::starts_with("x"),
      .fns = ~ dplyr::case_when(
        .x == "..." ~ NA_character_, # "..." marks a missing value.
        TRUE ~ .x |> stringr::str_replace_all(",", ".") # Decimal comma.
      )
    ),
    dplyr::across(
      .cols = dplyr::starts_with("x"),
      .fns = as.numeric
    )
  ) |>
  tidyr::pivot_longer(
    cols = dplyr::starts_with("x"),
    names_to = "year",
    values_to = "gini_index"
  ) |>
  dplyr::mutate(
    year =
      year |>
      stringr::str_remove("x") |>
      as.integer(),
    municipality_code =
      municipality |>
      stringr::str_extract("^\\d+") |> # Leading digits are the code.
      as.integer(),
    municipality =
      municipality |>
      stringr::str_remove("^\\d+") |>
      stringr::str_trim()
  ) |>
  dplyr::relocate(
    year,
    municipality_code,
    .before = municipality
  )

data |> dplyr::glimpse()
#> Rows: 16,695
#> Columns: 4
#> $ year              <int> 1991, 2000, 2010, 1991, 2000, 2010, 1991, 2000, 2…
#> $ municipality_code <int> 110001, 110001, 110001, 110037, 110037, 110037, 1…
#> $ municipality      <chr> "Alta Floresta D'Oeste", "Alta Floresta D'Oeste",…
#> $ gini_index        <dbl> 0.5983, 0.5868, 0.5893, NA, 0.5080, 0.5491, NA, 0…
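
After pivoting, each municipality should contribute exactly one row per census year (5,565 municipalities × 3 years = 16,695 rows). A minimal sketch of such a check, shown on toy values rather than the pipeline's `data` object; the helper name `check_tidy_shape` is illustrative.

```r
# Hypothetical helper: TRUE when every municipality code appears exactly once
# per year and every municipality has a row for every year.
check_tidy_shape <- function(x) {
  keys <- x[, c("year", "municipality_code")]
  n_years <- length(unique(x$year))
  n_codes <- length(unique(x$municipality_code))

  !anyDuplicated(keys) && nrow(x) == n_years * n_codes
}

toy <- data.frame(
  year = rep(c(1991L, 2000L, 2010L), times = 2),
  municipality_code = rep(c(110001L, 110037L), each = 3),
  gini_index = c(0.5983, 0.5868, 0.5893, NA, 0.5080, 0.5491)
)

check_tidy_shape(toy)
```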

Adding State and Region Data

brazil_municipalities <- orbis::get_brazil_municipality(
  year = plotr:::get_closest_geobr_year(2000, type = "municipality")
)

data <-
  data |>
  dplyr::select(-municipality) |>
  dplyr::left_join(
    brazil_municipalities |>
      dplyr::mutate(
        municipality_code =
          municipality_code |>
          # Drop the last digit so the codes match the six-digit
          # codes used in the DATASUS file.
          stringr::str_sub(end = -2) |>
          as.integer()
      ),
    by = "municipality_code"
  ) |>
  dplyr::relocate(
    year,
    region_code,
    region,
    state_code,
    state,
    federal_unit,
    municipality_code,
    municipality
  )

data |> dplyr::glimpse()
#> Rows: 16,695
#> Columns: 9
#> $ year              <int> 1991, 2000, 2010, 1991, 2000, 2010, 1991, 2000, 2…
#> $ region_code       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ region            <chr> "North", "North", "North", "North", "North", "Nor…
#> $ state_code        <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ state             <chr> "Rondônia", "Rondônia", "Rondônia", "Rondônia", "…
#> $ federal_unit      <chr> "RO", "RO", "RO", "RO", "RO", "RO", "RO", "RO", "…
#> $ municipality_code <int> 110001, 110001, 110001, 110037, 110037, 110037, 1…
#> $ municipality      <chr> "Alta Floresta D'Oeste", "Alta Floresta D'Oeste",…
#> $ gini_index        <dbl> 0.5983, 0.5868, 0.5893, NA, 0.5080, 0.5491, NA, 0…
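
The join above matches on six-digit municipality codes, so the lookup table's codes are shortened by one character first. The sketch below illustrates that step, plus a follow-up check for rows without a match, using toy tables. The seven-digit values and the table names are assumptions for illustration, not output from `orbis`.

```r
# Toy tables standing in for the DATASUS data and the lookup table.
toy_data <- dplyr::tibble(
  municipality_code = c(110001L, 110037L, 999999L),
  gini_index = c(0.5893, 0.5491, 0.5)
)

toy_lookup <- dplyr::tibble(
  municipality_code = c(1100015L, 1100379L), # Assumed seven-digit codes.
  federal_unit = c("RO", "RO")
)

toy_lookup <-
  toy_lookup |>
  dplyr::mutate(
    municipality_code =
      municipality_code |>
      stringr::str_sub(end = -2) |> # Drop the last digit.
      as.integer()
  )

# Rows in the data with no counterpart in the lookup table.
unmatched <- toy_data |> dplyr::anti_join(toy_lookup, by = "municipality_code")
```

An empty `unmatched` table would indicate that every code found a match; here the made-up code `999999` is the only row left over.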

Validating the Data

data <-
  data |>
  dplyr::mutate(
    gini_index = dplyr::case_when(
      !dplyr::between(gini_index, 0, 1) ~ NA,
      TRUE ~ gini_index
    )
  )

data |> dplyr::glimpse()
#> Rows: 16,695
#> Columns: 9
#> $ year              <int> 1991, 2000, 2010, 1991, 2000, 2010, 1991, 2000, 2…
#> $ region_code       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ region            <chr> "North", "North", "North", "North", "North", "Nor…
#> $ state_code        <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ state             <chr> "Rondônia", "Rondônia", "Rondônia", "Rondônia", "…
#> $ federal_unit      <chr> "RO", "RO", "RO", "RO", "RO", "RO", "RO", "RO", "…
#> $ municipality_code <int> 110001, 110001, 110001, 110037, 110037, 110037, 1…
#> $ municipality      <chr> "Alta Floresta D'Oeste", "Alta Floresta D'Oeste",…
#> $ gini_index        <dbl> 0.5983, 0.5868, 0.5893, NA, 0.5080, 0.5491, NA, 0…
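
Before out-of-range values are replaced with `NA`, it can be useful to count how many there are. A minimal sketch on a toy vector (the values are made up for illustration):

```r
toy <- dplyr::tibble(gini_index = c(0.5983, 1.2, NA, -0.1, 0))

# Count values outside [0, 1]; missing values are not counted.
n_invalid <- sum(!dplyr::between(toy$gini_index, 0, 1), na.rm = TRUE)

# Same rule as in the pipeline: out-of-range values become NA.
toy <-
  toy |>
  dplyr::mutate(
    gini_index = dplyr::case_when(
      !dplyr::between(gini_index, 0, 1) ~ NA_real_,
      TRUE ~ gini_index
    )
  )
```

Values that were already missing stay missing, because the comparison returns `NA` for them and `case_when()` falls through to the last branch.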

Arranging the Data

data <-
  data |>
  dplyr::arrange(
    year,
    region_code,
    state_code,
    municipality_code
  )

data |> dplyr::glimpse()
#> Rows: 16,695
#> Columns: 9
#> $ year              <int> 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1…
#> $ region_code       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ region            <chr> "North", "North", "North", "North", "North", "Nor…
#> $ state_code        <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ state             <chr> "Rondônia", "Rondônia", "Rondônia", "Rondônia", "…
#> $ federal_unit      <chr> "RO", "RO", "RO", "RO", "RO", "RO", "RO", "RO", "…
#> $ municipality_code <int> 110001, 110002, 110003, 110004, 110005, 110006, 1…
#> $ municipality      <chr> "Alta Floresta D'Oeste", "Ariquemes", "Cabixi", "…
#> $ gini_index        <dbl> 0.5983, 0.5827, 0.6527, 0.6800, 0.5958, 0.6229, N…

Data Dictionary

metadata <-
  data |>
  labelled::`var_label<-`(
    list(
      year = "Census year",
      region_code = "IBGE region code",
      region = "Region name",
      state_code = "IBGE state code",
      state = "State name",
      federal_unit = "Federal unit abbreviation",
      municipality_code = "IBGE municipality code",
      municipality = "Municipality name",
      gini_index = "Gini Index of per capita household income"
    )
  ) |>
  labelled::generate_dictionary(details = "full") |>
  labelled::convert_list_columns_to_character()

metadata

data

Saving the Valid Data

Data

data |> readr::write_csv(here::here("data", "valid.csv"))

data |> readr::write_rds(here::here("data", "valid.rds"))

Metadata

metadata |> readr::write_csv(here::here("data", "metadata.csv"))

metadata |> readr::write_rds(here::here("data", "metadata.rds"))
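
The two formats differ in what they preserve. As a sketch on a toy tibble written to temporary files, reading the csv back with readr's default type guessing returns `year` as a double, while the rds file keeps it as an integer.

```r
toy <- dplyr::tibble(
  year = c(1991L, 2000L, 2010L),
  gini_index = c(0.5983, 0.5868, 0.5893)
)

csv_file <- tempfile(fileext = ".csv")
rds_file <- tempfile(fileext = ".rds")

toy |> readr::write_csv(csv_file)
toy |> readr::write_rds(rds_file)

from_csv <- readr::read_csv(csv_file, show_col_types = FALSE)
from_rds <- readr::read_rds(rds_file)

class(from_csv$year) # Column type is guessed from the text.
class(from_rds$year) # Column type is stored in the file.
```

Passing `col_types` to `readr::read_csv()` restores the intended types when working from the csv file.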

Visualizing the Data

The Gini Index is a measure of income inequality within a population, ranging from 0 to 1. A value of 0 indicates perfect equality (everyone has the same income), while a value of 1 indicates maximal inequality (all income is held by a single person).
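
To make the scale concrete, the sketch below computes a Gini index for a small income vector as the mean absolute difference between all pairs of incomes divided by twice the mean income. This is a textbook formulation for illustration; it is not necessarily the procedure used to produce the DATASUS figures, and the function name `gini` is illustrative.

```r
# Illustrative Gini index: sum of |x_i - x_j| over all pairs,
# divided by 2 * n^2 * mean(x).
gini <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)

  sum(abs(outer(x, x, "-"))) / (2 * n^2 * mean(x))
}

equal_incomes <- gini(c(10, 10, 10, 10)) # Everyone earns the same.
one_holds_all <- gini(c(0, 0, 0, 40)) # One person holds all income.
```

With identical incomes the result is 0. With four people and all income held by one of them, this formulation gives 0.75, that is, (n - 1) / n, which approaches 1 as the population grows.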

brand_div_palette <- function(x) {
  brandr:::make_color_ramp(
    n_prop = x,
    colors = c(
      brandr::get_brand_color("dark-red-triadic-blue"),
      # brandr::get_brand_color("white"),
      brandr::get_brand_color_mix(
        position = 950,
        color_1 = "dark-red",
        color_2 = "dark-red-triadic-blue",
        alpha = 0.5
      ),
      brandr::get_brand_color("dark-red")
    )
  )
}

data |>
  dplyr::filter(year == 2010) |>
  tidyr::drop_na(gini_index) |>
  plotr:::plot_hist(
    col = "gini_index",
    density_line_color = "red",
    x_label = "Gini Index",
    print = FALSE
  ) +
  ggplot2::xlim(0, 1) +
  ggplot2::labs(
    title = paste0(
      "Gini Index of Per Capita Household Income by Brazilian Municipality"
    ),
    subtitle = "Year: 2010",
    caption = "Source: DATASUS/IPEA/IBGE"
  )
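
`plotr:::plot_hist()` is an internal function of a package installed from GitHub. For readers without plotr, a roughly comparable figure can be sketched with ggplot2 alone. The version below uses simulated values in place of the pipeline's `data` and makes no attempt to match plotr's styling.

```r
set.seed(1)

# Simulated values standing in for the 2010 Gini indices.
toy <- data.frame(gini_index = stats::rbeta(500, shape1 = 20, shape2 = 20))

toy_plot <-
  toy |>
  ggplot2::ggplot(ggplot2::aes(x = gini_index)) +
  ggplot2::geom_histogram(
    ggplot2::aes(y = ggplot2::after_stat(density)), # Density scale.
    bins = 30
  ) +
  ggplot2::geom_density(color = "red") +
  ggplot2::xlim(0, 1) +
  ggplot2::labs(x = "Gini Index", y = "Density", subtitle = "Simulated data")
```

Printing `toy_plot` draws the histogram with a red density line over it.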

data |>
  dplyr::filter(year == 2010) |>
  tidyr::drop_na(gini_index) |>
  plotr:::plot_brazil_municipality(
    col_fill = "gini_index",
    col_code = "municipality_code",
    year = plotr:::get_closest_geobr_year(2010, type = "municipality"),
    comparable_areas = FALSE,
    reverse = TRUE,
    limits = c(0, 1),
    breaks = seq(0, 1, 0.25),
    palette = brand_div_palette,
    print = FALSE
  ) +
  ggplot2::labs(
    title = paste0(
      "Gini Index of Per Capita Household Income by Brazilian Municipality"
    ),
    subtitle = "Year: 2010",
    caption = "Source: DATASUS/IPEA/IBGE"
  )

How to Cite

To cite this work, please use the following format:

Vartanian, D., & Carvalho, A. M. (2025). A reproducible pipeline for processing DATASUS data on the Gini Index of per capita household income by Brazilian municipality for the years 1991, 2000, and 2010 [Report]. Sustentarea Research and Extension Group at the University of São Paulo. https://sustentarea.github.io/datasus-gini-index

A BibTeX entry for LaTeX users is:

@techreport{vartanian2025,
  title = {A reproducible pipeline for processing DATASUS data on the Gini Index of per capita household income by Brazilian municipality for the years 1991, 2000, and 2010},
  author = {Vartanian, Daniel and Carvalho, Aline Martins de},
  year = {2025},
  address = {São Paulo},
  institution = {Sustentarea Research and Extension Group at the University of São Paulo},
  langid = {en},
  url = {https://sustentarea.github.io/datasus-gini-index}
}

License

This content is licensed under CC0 1.0 Universal, placing these materials in the public domain. You may freely copy, modify, distribute, and use this work, even for commercial purposes, without permission or attribution.

Acknowledgments

This work is part of the Sustentarea Research and Extension Group project: Global syndemic: The impact of anthropogenic climate change on the health and nutrition of children under five years old attended by Brazil’s public health system (SUS).

This work was supported by the Conselho Nacional de Desenvolvimento Científico e Tecnológico - Brazil (CNPq).

References

Allaire, J. J., Teague, C., Xie, Y., & Dervieux, C. (n.d.). Quarto [Computer software]. Zenodo. https://doi.org/10.5281/ZENODO.5960048

Instituto de Pesquisas Econômicas e Aplicadas, & Instituto Brasileiro de Geografia e Estatística. (n.d.). Índice de Gini da renda domiciliar per capita segundo município – Período: 1991, 2000 e 2010 [Gini Index of per capita household income by municipality – Period: 1991, 2000, and 2010] [Data set]. DATASUS - Tabnet. Retrieved November 16, 2023, from http://tabnet.datasus.gov.br/cgi/ibge/censo/cnv/ginibr.def

R Core Team. (n.d.). R: A language and environment for statistical computing [Computer software]. R Foundation for Statistical Computing. https://www.R-project.org

Wickham, H. (2023). The tidy tools manifesto. Tidyverse. https://tidyverse.tidyverse.org/articles/manifesto.html

Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for data science: Import, tidy, transform, visualize, and model data (2nd ed.). O’Reilly Media. https://r4ds.hadley.nz