A Reproducible Pipeline for Processing DATASUS Data on the Gini Index of Per Capita Household Income by Brazilian Municipality for the Years 1991, 2000, and 2010

Authors

Daniel Vartanian & Aline Martins de Carvalho

Published

2025-05-05

Project Status: Inactive. The project has reached a stable, usable state but is no longer being actively developed; support and maintenance will be provided as time allows. License: CC0-1.0

Overview

This report provides a reproducible pipeline for processing DATASUS data on the Gini Index of per capita household income by Brazilian municipality for the years 1991, 2000, and 2010. The main goal is to provide an open and reliable workflow for processing these data, supporting research and informed public policy decisions.

Data Availability

The processed data are available in both CSV and RDS formats in a dedicated repository on the Open Science Framework (OSF). A metadata file is included alongside the validated data. You can also access these files directly from R using the osfr package.

A backup copy of the raw data is also available on OSF.
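As a rough sketch of the osfr route, the helper below lists the files of an OSF node and downloads them to a local folder. The function name `download_osf_files`, the default `"data"` destination, and the node identifier passed to it are placeholders introduced for illustration; the identifier of the actual repository is not stated here and must be supplied by the reader.

```r
# Hypothetical helper: download the files stored at the top level of an OSF node.
# `node_id` is a placeholder argument; this report does not state the real ID.
download_osf_files <- function(node_id, dest_dir = "data") {
  if (!requireNamespace("osfr", quietly = TRUE)) {
    stop("The 'osfr' package is required: install.packages('osfr')")
  }

  # Create the destination folder if it does not exist yet.
  if (!dir.exists(dest_dir)) dir.create(dest_dir, recursive = TRUE)

  node_id |>
    osfr::osf_retrieve_node() |>
    osfr::osf_ls_files(n_max = Inf) |>
    osfr::osf_download(path = dest_dir, conflicts = "skip")
}
```

Calling `download_osf_files("<node id>")` with the real identifier would then place the files in `data/`. Public OSF nodes can usually be read without authentication.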

Methods

Source of Data

The data used in this report were sourced from the Department of Informatics of the Brazilian Unified Health System (DATASUS) platform, which provides data on the Gini Index of per capita household income by Brazilian municipality for the years 1991, 2000, and 2010 (Instituto de Pesquisas Econômicas e Aplicadas & Instituto Brasileiro de Geografia e Estatística, n.d.). These data are derived from the Population Census conducted by the Brazilian Institute of Geography and Statistics (IBGE).

For technical information about the raw dataset, see the official technical note (in Portuguese).

Data Munging

The data munging followed the data science workflow outlined by Wickham et al. (2023), as illustrated in Figure 1. All processing was carried out using the Quarto publishing system (Allaire et al., n.d.), the R programming language (R Core Team, n.d.), and several R packages.

Packages from the tidyverse and the rOpenSci peer-reviewed ecosystem, along with other R packages that adhere to the tidy tools manifesto (Wickham, 2023), were prioritized. Every step was designed to keep the results transparent and reproducible.

Figure 1: Data science workflow created by Wickham, Çetinkaya-Rundel, and Grolemund.

Source: Reproduced from Wickham et al. (2023).

Code Style

The Tidyverse code style guide and design principles were followed to ensure consistency and enhance readability.

Reproducibility

The pipeline is fully reproducible and can be run again at any time. See the README file in the code repository to learn how to run it.

Setting the Environment

Downloading the Data

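# Create the data directory at the project root if it does not exist yet.
if (!dir.exists(here::here("data"))) dir.create(here::here("data"))
file <- here::here("data", "raw.csv")

# Download the raw CSV file from DATASUS and save it to `file`.
"http://tabnet.datasus.gov.br/cgi/ibge/censo/bases/ginibr.csv" |>
  httr2::request() |>
  httr2::req_progress() |>
  httr2::req_perform(path = file)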

Reading the Data

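data <-
  file |>
  readr::read_delim(
    delim = ";",
    col_names = FALSE,
    col_types = readr::cols(.default = "c"), # Read every column as character.
    trim_ws = TRUE,
    skip = 3 # Skip the first three lines of the raw file.
  ) |>
  # Keep only the first 5,565 rows (one per municipality).
  dplyr::slice(1:5565) |>
  # Convert the text from Latin-1 to UTF-8.
  dplyr::mutate(
    dplyr::across(
      .cols = dplyr::everything(),
      .fns = ~ iconv(.x, from = "latin1", to = "UTF-8")
    )
  )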
data |> dplyr::glimpse()
#> Rows: 5,565
#> Columns: 4
#> $ X1 <chr> "110001 Alta Floresta D'Oeste", "110037 Alto Alegre dos Parecis"…
#> $ X2 <chr> "0,5983", "...", "...", "0,569", "0,5827", "...", "0,6527", "...…
#> $ X3 <chr> "0,5868", "0,508", "0,6256", "0,6534", "0,5927", "0,6474", "0,58…
#> $ X4 <chr> "0,5893", "0,5491", "0,5417", "0,5355", "0,5496", "0,5017", "0,5…
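The re-encoding step above can be illustrated on a single string. This is a minimal sketch in base R; the example name and the variable names `latin1_name` and `utf8_name` are chosen for illustration and are not taken from the raw file.

```r
# A Latin-1 encoded byte string: 0xE3 is the letter "a" with a tilde in Latin-1.
latin1_name <- "S\xe3o Paulo"

# Re-encode to UTF-8, as done for every column in the step above.
utf8_name <- iconv(latin1_name, from = "latin1", to = "UTF-8")

# The result is valid UTF-8 and counts as nine characters.
validUTF8(utf8_name)
nchar(utf8_name)
```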

Renaming the Data

data <-
  data |>
  janitor::clean_names() |>
  dplyr::rename(
    municipality = x1,
    x1991 = x2,
    x2000 = x3,
    x2010 = x4
  )
data |> dplyr::glimpse()
#> Rows: 5,565
#> Columns: 4
#> $ municipality <chr> "110001 Alta Floresta D'Oeste", "110037 Alto Alegre do…
#> $ x1991        <chr> "0,5983", "...", "...", "0,569", "0,5827", "...", "0,6…
#> $ x2000        <chr> "0,5868", "0,508", "0,6256", "0,6534", "0,5927", "0,64…
#> $ x2010        <chr> "0,5893", "0,5491", "0,5417", "0,5355", "0,5496", "0,5…

Tidying the Data

data <-
  data |>
  dplyr::mutate(
    # Replace the missing-value marker ("...") with NA and decimal commas
    # with decimal points.
    dplyr::across(
      .cols = dplyr::starts_with("x"),
      .fns = ~ dplyr::case_when(
        .x == "..." ~ NA,
        TRUE ~ .x |> stringr::str_replace_all(",", ".")
      )
    ),
    # Convert the value columns to numeric.
    dplyr::across(
      .cols = dplyr::starts_with("x"),
      .fns = as.numeric
    )
  ) |>
  # Reshape to long format: one row per municipality and year.
  tidyr::pivot_longer(
    cols = dplyr::starts_with("x"),
    names_to = "year",
    values_to = "gini_index"
  ) |>
  # Extract the year, the municipality code, and the municipality name.
  dplyr::mutate(
    year =
      year |>
      stringr::str_remove("x") |>
      as.integer(),
    municipality_code =
      municipality |>
      stringr::str_extract("\\d*") |>
      as.integer(),
    municipality =
      municipality |>
      stringr::str_remove("\\d*") |>
      stringr::str_trim()
  ) |>
  dplyr::relocate(
    year,
    municipality_code,
    .before = municipality
  )
data |> dplyr::glimpse()
#> Rows: 16,695
#> Columns: 4
#> $ year              <int> 1991, 2000, 2010, 1991, 2000, 2010, 1991, 2000, 2…
#> $ municipality_code <int> 110001, 110001, 110001, 110037, 110037, 110037, 1…
#> $ municipality      <chr> "Alta Floresta D'Oeste", "Alta Floresta D'Oeste",…
#> $ gini_index        <dbl> 0.5983, 0.5868, 0.5893, NA, 0.5080, 0.5491, NA, 0…
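To make the parsing rules in this step easier to follow, here is a minimal sketch in base R applied to a few toy values shaped like the raw columns. The helper `parse_gini` and the variable names are hypothetical and are not part of the pipeline, which uses dplyr, tidyr, and stringr as shown above.

```r
# Toy values mimicking the raw file: decimal commas and "..." for missing data.
raw <- c("0,5983", "...", "0,569")

# Hypothetical helper: turn the raw strings into numeric values.
parse_gini <- function(x) {
  x[x == "..."] <- NA_character_ # "..." marks an unavailable value.
  as.numeric(gsub(",", ".", x, fixed = TRUE))
}

parsed <- parse_gini(raw)

# Split a "code name" label into its two parts.
label <- "110001 Alta Floresta D'Oeste"
code <- as.integer(sub("^([0-9]+).*$", "\\1", label))
name <- trimws(sub("^[0-9]+", "", label))
```

Here `parsed` holds `0.5983`, `NA`, and `0.569`, while `code` and `name` hold the integer `110001` and the string `"Alta Floresta D'Oeste"`.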

Adding State and Region Data

brazil_municipalities <- orbis::get_brazil_municipality(
  year = plotr:::get_closest_geobr_year(2000, type = "municipality")
)
data <-
  data |>
  dplyr::select(-municipality) |>
  dplyr::left_join(
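    brazil_municipalities |>
      # Drop the last character of the code so that it matches the 6-digit
      # codes used in the DATASUS file.
      dplyr::mutate(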
        municipality_code =
          municipality_code |>
          stringr::str_sub(end = -2) |>
          as.integer()
      ),
    by = "municipality_code"
  ) |>
  dplyr::relocate(
    year,
    region_code,
    region,
    state_code,
    state,
    federal_unit,
    municipality_code,
    municipality
  )
data |> dplyr::glimpse()
#> Rows: 16,695
#> Columns: 9
#> $ year              <int> 1991, 2000, 2010, 1991, 2000, 2010, 1991, 2000, 2…
#> $ region_code       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ region            <chr> "North", "North", "North", "North", "North", "Nor…
#> $ state_code        <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ state             <chr> "Rondônia", "Rondônia", "Rondônia", "Rondônia", "…
#> $ federal_unit      <chr> "RO", "RO", "RO", "RO", "RO", "RO", "RO", "RO", "…
#> $ municipality_code <int> 110001, 110001, 110001, 110037, 110037, 110037, 1…
#> $ municipality      <chr> "Alta Floresta D'Oeste", "Alta Floresta D'Oeste",…
#> $ gini_index        <dbl> 0.5983, 0.5868, 0.5893, NA, 0.5080, 0.5491, NA, 0…
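The join above relies on shortening the municipality codes of the lookup table before matching. A minimal sketch of that idea in base R follows, together with a simple check for codes that find no match. The 7-digit codes, the `999999` code, and all object names are assumptions made for illustration; only the 6-digit prefixes `110001` and `110037` appear in the data shown above.

```r
# Illustrative 7-digit codes (the trailing digits are assumed for this example).
ibge_codes <- c(1100015L, 1100379L)

# Integer division by 10 drops the last digit, yielding the 6-digit form.
datasus_codes <- ibge_codes %/% 10L

# Hypothetical lookup table and data to be joined.
lookup <- data.frame(
  municipality_code = datasus_codes,
  federal_unit = c("RO", "RO")
)
gini_data <- data.frame(municipality_code = c(110001L, 110037L, 999999L))

# Codes present in the data but missing from the lookup table.
unmatched <- setdiff(gini_data$municipality_code, lookup$municipality_code)
```

A non-empty `unmatched` vector after a real join would point to municipalities whose state and region columns end up as `NA`, which can happen when the reference year of the lookup table differs from the census year.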

Validating the Data

# Set Gini Index values outside the [0, 1] range to NA.
data <-
  data |>
  dplyr::mutate(
    gini_index = dplyr::case_when(
      !dplyr::between(gini_index, 0, 1) ~ NA,
      TRUE ~ gini_index
    )
  )
data |> dplyr::glimpse()
#> Rows: 16,695
#> Columns: 9
#> $ year              <int> 1991, 2000, 2010, 1991, 2000, 2010, 1991, 2000, 2…
#> $ region_code       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ region            <chr> "North", "North", "North", "North", "North", "Nor…
#> $ state_code        <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ state             <chr> "Rondônia", "Rondônia", "Rondônia", "Rondônia", "…
#> $ federal_unit      <chr> "RO", "RO", "RO", "RO", "RO", "RO", "RO", "RO", "…
#> $ municipality_code <int> 110001, 110001, 110001, 110037, 110037, 110037, 1…
#> $ municipality      <chr> "Alta Floresta D'Oeste", "Alta Floresta D'Oeste",…
#> $ gini_index        <dbl> 0.5983, 0.5868, 0.5893, NA, 0.5080, 0.5491, NA, 0…
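The range check in this step can be sketched in base R on a toy vector. The values and object names below are invented for illustration; the point is that out-of-range values become `NA` while values that were already missing stay missing.

```r
# Toy values: one valid, one missing, two out of range, one at the boundary.
gini_index <- c(0.5983, NA, 1.2, -0.1, 0)

# TRUE only for non-missing values inside the closed interval [0, 1].
in_range <- !is.na(gini_index) & gini_index >= 0 & gini_index <= 1

# Keep valid values and replace everything else with NA.
validated <- ifelse(in_range, gini_index, NA_real_)

# Number of non-missing values that were rejected by the check.
n_rejected <- sum(!in_range & !is.na(gini_index))
```

Counting rejected values, as `n_rejected` does here, is one way to confirm how much the validation step actually changed.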

Arranging the Data

data <-
  data |>
  dplyr::arrange(
    year,
    region_code,
    state_code,
    municipality_code
  )
data |> dplyr::glimpse()
#> Rows: 16,695
#> Columns: 9
#> $ year              <int> 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1…
#> $ region_code       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ region            <chr> "North", "North", "North", "North", "North", "Nor…
#> $ state_code        <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ state             <chr> "Rondônia", "Rondônia", "Rondônia", "Rondônia", "…
#> $ federal_unit      <chr> "RO", "RO", "RO", "RO", "RO", "RO", "RO", "RO", "…
#> $ municipality_code <int> 110001, 110002, 110003, 110004, 110005, 110006, 1…
#> $ municipality      <chr> "Alta Floresta D'Oeste", "Ariquemes", "Cabixi", "…
#> $ gini_index        <dbl> 0.5983, 0.5827, 0.6527, 0.6800, 0.5958, 0.6229, N…

Data Dictionary

metadata <-
  data |>
  labelled::`var_label<-`(
    list(
      year = "Census year",
      region_code = "IBGE region code",
      region = "Region name",
      state_code = "IBGE state code",
      state = "State name",
      federal_unit = "Federal unit name",
      municipality_code = "IBGE municipality code",
      municipality = "Municipality name",
      gini_index = "Gini Index of per capita household income"
    )
  ) |>
  labelled::generate_dictionary(details = "full") |>
  labelled::convert_list_columns_to_character()
metadata
data

Saving the Valid Data

Data

data |> readr::write_csv(here::here("data", "valid.csv"))
data |> readr::write_rds(here::here("data", "valid.rds"))

Metadata

metadata |> readr::write_csv(here::here("data", "metadata.csv"))
metadata |> readr::write_rds(here::here("data", "metadata.rds"))

Visualizing the Data

The Gini Index is a measure of income inequality within a population, ranging from 0 to 1. A value of 0 indicates perfect equality (everyone has the same income), while a value of 1 indicates maximal inequality (all income is held by a single person).
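For intuition, the index can be computed from a vector of individual incomes. The function below is a minimal sketch using a standard rank-based formula; the name `gini` and the toy incomes are illustrative, and this is not the procedure used to produce the values in this dataset, which are taken as published by DATASUS.

```r
# Gini Index from a vector of non-negative incomes, using
# G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n,
# where x is sorted in increasing order and i is the rank.
gini <- function(x) {
  x <- sort(x[!is.na(x)])
  n <- length(x)
  if (n == 0 || sum(x) == 0) return(NA_real_)
  2 * sum(seq_len(n) * x) / (n * sum(x)) - (n + 1) / n
}

equal_incomes <- gini(c(10, 10, 10, 10)) # Everyone earns the same.
one_earner <- gini(c(0, 0, 0, 40))       # One person earns everything.
```

With these inputs `equal_incomes` is 0 and `one_earner` is 0.75. For a finite population of size n, this formula gives at most (n - 1) / n, which approaches 1 as n grows.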

brand_div_palette <- function(x) {
  brandr:::make_color_ramp(
    n_prop = x,
    colors = c(
      brandr::get_brand_color("dark-red-triadic-blue"),
      # brandr::get_brand_color("white"),
      brandr::get_brand_color_mix(
        position = 950,
        color_1 = "dark-red",
        color_2 = "dark-red-triadic-blue",
        alpha = 0.5
      ),
      brandr::get_brand_color("dark-red")
    )
  )
}
data |>
  dplyr::filter(year == 2010) |>
  tidyr::drop_na(gini_index) |>
  plotr:::plot_hist(
    col = "gini_index",
    density_line_color = "red",
    x_label = "Gini Index",
    print = FALSE
  ) +
  ggplot2::xlim(0, 1) +
  ggplot2::labs(
    title = paste0(
      "Gini Index of Per Capita Household Income by Brazilian Municipality"
    ),
    subtitle = "Year: 2010",
    caption = "Source: DATASUS/IPEA/IBGE"
  )

data |>
  dplyr::filter(year == 2010) |>
  tidyr::drop_na(gini_index) |>
  plotr:::plot_brazil_municipality(
    col_fill = "gini_index",
    col_code = "municipality_code",
    year = plotr:::get_closest_geobr_year(2010, type = "municipality"),
    comparable_areas = FALSE,
    reverse = TRUE,
    limits = c(0, 1),
    breaks = seq(0, 1, 0.25),
    palette = brand_div_palette,
    print = FALSE
  ) +
  ggplot2::labs(
    title = paste0(
      "Gini Index of Per Capita Household Income by Brazilian Municipality"
    ),
    subtitle = "Year: 2010",
    caption = "Source: DATASUS/IPEA/IBGE"
  )

How to Cite

To cite this work, please use the following format:

Vartanian, D., & Carvalho, A. M. (2025). A reproducible pipeline for processing DATASUS data on the Gini Index of per capita household income by Brazilian municipality for the years 1991, 2000, and 2010 [Report]. Sustentarea Research and Extension Group at the University of São Paulo. https://sustentarea.github.io/datasus-gini-index

A BibTeX entry for LaTeX users is:

@techreport{vartanian2025,
  title = {A reproducible pipeline for processing DATASUS data on the Gini Index of per capita household income by Brazilian municipality for the years 1991, 2000, and 2010},
  author = {{Daniel Vartanian} and {Aline Martins de Carvalho}},
  year = {2025},
  address = {São Paulo},
  institution = {Sustentarea Research and Extension Group at the University of São Paulo},
  langid = {en},
  url = {https://sustentarea.github.io/datasus-gini-index}
}

License

This content is licensed under CC0 1.0 Universal, placing these materials in the public domain. You may freely copy, modify, distribute, and use this work, even for commercial purposes, without permission or attribution.

Acknowledgments

This work is part of the Sustentarea Research and Extension Group project: Global syndemic: The impact of anthropogenic climate change on the health and nutrition of children under five years old attended by Brazil’s public health system (SUS).

This work was supported by the Conselho Nacional de Desenvolvimento Científico e Tecnológico - Brazil (CNPq).

References

Allaire, J. J., Teague, C., Xie, Y., & Dervieux, C. (n.d.). Quarto [Computer software]. Zenodo. https://doi.org/10.5281/ZENODO.5960048
Instituto de Pesquisas Econômicas e Aplicadas, & Instituto Brasileiro de Geografia e Estatística. (n.d.). Índice de Gini da renda domiciliar per capita segundo município – Período: 1991, 2000 e 2010 [Gini Index of per capita household income by municipality – Period: 1991, 2000, and 2010] [Data set]. DATASUS - Tabnet. Retrieved November 16, 2023, from http://tabnet.datasus.gov.br/cgi/ibge/censo/cnv/ginibr.def
R Core Team. (n.d.). R: A language and environment for statistical computing [Computer software]. R Foundation for Statistical Computing. https://www.R-project.org
Wickham, H. (2023). The tidy tools manifesto. Tidyverse. https://tidyverse.tidyverse.org/articles/manifesto.html
Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for data science: Import, tidy, transform, visualize, and model data (2nd ed.). O’Reilly Media. https://r4ds.hadley.nz