A Reproducible Pipeline for Processing DATASUS Data on the Gini Index of Per Capita Household Income by Brazilian Municipality for the Years 1991, 2000, and 2010
Overview
This report provides a reproducible pipeline for processing DATASUS data on the Gini Index of per capita household income by Brazilian municipality for the years 1991, 2000, and 2010. The main goal is to provide an open and reliable workflow for processing these data, supporting research and informed public policy decisions.
Data Availability
The processed data are available in both csv and rds formats via a dedicated repository on the Open Science Framework (OSF), accessible here. A metadata file is included alongside the validated data. You can also access these files directly from R using the osfr package.
A backup copy of the raw data is also available on OSF. You can access it here.
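As a rough illustration of the osfr route, the sketch below wraps the download in a small helper. The function name `download_osf_files()` and its arguments are hypothetical, and the OSF identifier is deliberately left as a parameter: it is the short ID found in the repository URL, which is not reproduced here. The sketch assumes the repository is public, so no authentication token is required.

```r
# Hypothetical helper: download the top-level entries of a public OSF node.
# `node_id` is the short OSF identifier taken from the repository URL
# (an assumption here; it is not given in this report's text).
download_osf_files <- function(node_id, dest = tempdir()) {
  node <- osfr::osf_retrieve_node(node_id)

  # List every top-level file and folder stored in the node.
  files <- osfr::osf_ls_files(node, n_max = Inf)

  # `conflicts = "overwrite"` replaces local copies from earlier downloads.
  osfr::osf_download(files, path = dest, conflicts = "overwrite")
}
```

Called with the repository's identifier, the helper should return a table describing the downloaded files, including their local paths.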
Methods
Source of Data
The data used in this report were sourced from the Department of Informatics of the Brazilian Unified Health System (DATASUS) platform, which provides data on the Gini Index of per capita household income by Brazilian municipality for the years 1991, 2000, and 2010 (Instituto de Pesquisas Econômicas e Aplicadas & Instituto Brasileiro de Geografia e Estatística, n.d.). These data are derived from the Population Census conducted by the Brazilian Institute of Geography and Statistics (IBGE).
For technical information about the raw dataset, see the official technical note (in Portuguese).
Data Munging
The data munging followed the data science workflow outlined by Wickham et al. (2023), as illustrated in Figure 1. All processing was done with the Quarto publishing system (Allaire et al., n.d.), the R programming language (R Core Team, n.d.), and several R packages.
Priority was given to the tidyverse and rOpenSci peer-reviewed package ecosystems, along with other R packages that adhere to the tidy tools manifesto (Wickham, 2023). Every step was designed to keep the results transparent and reproducible.
Figure 1: The data science workflow. Source: Reproduced from Wickham et al. (2023).
Code Style
The Tidyverse code style guide and design principles were followed to ensure consistency and enhance readability.
Reproducibility
The pipeline is fully reproducible and can be run again at any time. See the README file in the code repository to learn how to run it.
Setting the Environment
Downloading the Data
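# Create the data directory at the project root if it does not exist yet.
if (!dir.exists(here::here("data"))) dir.create(here::here("data"))
# Path where the raw DATASUS file will be stored.
file <- here::here("data", "raw.csv")
# Download the raw file and write the response body to `file`.
"http://tabnet.datasus.gov.br/cgi/ibge/censo/bases/ginibr.csv" |>
  httr2::request() |>
  httr2::req_progress() |>
  httr2::req_perform(path = file)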
Reading the Data
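data <-
  file |>
  readr::read_delim(
    delim = ";",
    col_names = FALSE,
    col_types = readr::cols(.default = "c"), # Read every column as character.
    trim_ws = TRUE,
    skip = 3 # Skip the three lines that precede the data rows.
  ) |>
  dplyr::slice(1:5565) |> # Keep the 5,565 rows, one per municipality.
  dplyr::mutate(
    dplyr::across(
      .cols = dplyr::everything(),
      .fns = ~ iconv(.x, from = "latin1", to = "UTF-8") # Latin-1 to UTF-8.
    )
  )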
data |> dplyr::glimpse()
#> Rows: 5,565
#> Columns: 4
#> $ X1 <chr> "110001 Alta Floresta D'Oeste", "110037 Alto Alegre dos Parecis"…
#> $ X2 <chr> "0,5983", "...", "...", "0,569", "0,5827", "...", "0,6527", "...…
#> $ X3 <chr> "0,5868", "0,508", "0,6256", "0,6534", "0,5927", "0,6474", "0,58…
#> $ X4 <chr> "0,5893", "0,5491", "0,5417", "0,5355", "0,5496", "0,5017", "0,5…
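An alternative worth noting, though not what this pipeline does, is to let readr handle the Latin-1 encoding, the decimal comma, and the "..." placeholder at read time through `readr::locale()` and the `na` argument. The sketch below runs on a small temporary file. The first row is taken from the output above; the second row (code 999999, "São Exemplo") and the column names are made up for illustration.

```r
# Toy file mimicking the raw layout. The second row is hypothetical.
toy_lines <- c(
  "110001 Alta Floresta D'Oeste;0,5983;0,5868;0,5893",
  "999999 S\u00e3o Exemplo;...;0,5;0,6"
)

# Write the lines as Latin-1 bytes, the encoding assumed for the raw file.
toy_file <- tempfile(fileext = ".csv")
writeLines(
  iconv(toy_lines, from = "UTF-8", to = "latin1"),
  toy_file,
  useBytes = TRUE
)

toy <- readr::read_delim(
  toy_file,
  delim = ";",
  col_names = c("municipality", "x1991", "x2000", "x2010"),
  col_types = "cddd", # One character column, then three doubles.
  na = "...", # Treat the placeholder as missing.
  locale = readr::locale(encoding = "latin1", decimal_mark = ","),
  trim_ws = TRUE
)
```

With this approach the index columns arrive as numeric values, so the later text replacement and `as.numeric()` steps would not be needed. Whether that is preferable depends on how much of the raw text one wants to inspect before parsing.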
Renaming the Data
data <-
  data |>
  janitor::clean_names() |>
  dplyr::rename(
    municipality = x1,
    x1991 = x2,
    x2000 = x3,
    x2010 = x4
  )
data |> dplyr::glimpse()
#> Rows: 5,565
#> Columns: 4
#> $ municipality <chr> "110001 Alta Floresta D'Oeste", "110037 Alto Alegre do…
#> $ x1991 <chr> "0,5983", "...", "...", "0,569", "0,5827", "...", "0,6…
#> $ x2000 <chr> "0,5868", "0,508", "0,6256", "0,6534", "0,5927", "0,64…
#> $ x2010 <chr> "0,5893", "0,5491", "0,5417", "0,5355", "0,5496", "0,5…
Tidying the Data
data <-
  data |>
  dplyr::mutate(
    # Treat "..." as missing and switch the decimal mark from comma to dot.
    dplyr::across(
      .cols = dplyr::starts_with("x"),
      .fns = ~ dplyr::case_when(
        .x == "..." ~ NA,
        TRUE ~ .x |> stringr::str_replace_all(",", ".")
      )
    ),
    dplyr::across(
      .cols = dplyr::starts_with("x"),
      .fns = as.numeric
    )
  ) |>
  # Reshape to one row per municipality and year.
  tidyr::pivot_longer(
    cols = dplyr::starts_with("x"),
    names_to = "year",
    values_to = "gini_index"
  ) |>
  dplyr::mutate(
    year =
      year |>
      stringr::str_remove("x") |>
      as.integer(),
    # Split the leading numeric code from the municipality name.
    municipality_code =
      municipality |>
      stringr::str_extract("^\\d+") |>
      as.integer(),
    municipality =
      municipality |>
      stringr::str_remove("^\\d+") |>
      stringr::str_trim()
  ) |>
  dplyr::relocate(
    year,
    municipality_code,
    .before = municipality
  )
data |> dplyr::glimpse()
#> Rows: 16,695
#> Columns: 4
#> $ year <int> 1991, 2000, 2010, 1991, 2000, 2010, 1991, 2000, 2…
#> $ municipality_code <int> 110001, 110001, 110001, 110037, 110037, 110037, 1…
#> $ municipality <chr> "Alta Floresta D'Oeste", "Alta Floresta D'Oeste",…
#> $ gini_index <dbl> 0.5983, 0.5868, 0.5893, NA, 0.5080, 0.5491, NA, 0…
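The code and name split can be checked on two of the raw labels shown earlier. This sketch uses the anchored pattern `^\\d+`, which only matches a run of digits at the very start of the label, so digits appearing later in a name would be left alone. The object names are illustrative only.

```r
raw_labels <- c(
  "110001 Alta Floresta D'Oeste",
  "110037 Alto Alegre dos Parecis"
)

# `^\\d+` matches the run of digits at the start of each label.
codes <-
  raw_labels |>
  stringr::str_extract("^\\d+") |>
  as.integer()

# Removing that run and trimming whitespace leaves the municipality name.
names_only <-
  raw_labels |>
  stringr::str_remove("^\\d+") |>
  stringr::str_trim()
```

Here `codes` holds the integers 110001 and 110037, and `names_only` holds the two municipality names without their codes.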
Adding State and Region Data
brazil_municipalities <- orbis::get_brazil_municipality(
  year = plotr:::get_closest_geobr_year(2000, type = "municipality")
)
data <-
  data |>
  dplyr::select(-municipality) |>
  dplyr::left_join(
    brazil_municipalities |>
      dplyr::mutate(
        # Drop the last digit so the codes match the 6-digit DATASUS codes.
        municipality_code =
          municipality_code |>
          stringr::str_sub(end = -2) |>
          as.integer()
      ),
    by = "municipality_code"
  ) |>
  dplyr::relocate(
    year,
    region_code,
    region,
    state_code,
    state,
    federal_unit,
    municipality_code,
    municipality
  )
data |> dplyr::glimpse()
#> Rows: 16,695
#> Columns: 9
#> $ year <int> 1991, 2000, 2010, 1991, 2000, 2010, 1991, 2000, 2…
#> $ region_code <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ region <chr> "North", "North", "North", "North", "North", "Nor…
#> $ state_code <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ state <chr> "Rondônia", "Rondônia", "Rondônia", "Rondônia", "…
#> $ federal_unit <chr> "RO", "RO", "RO", "RO", "RO", "RO", "RO", "RO", "…
#> $ municipality_code <int> 110001, 110001, 110001, 110037, 110037, 110037, 1…
#> $ municipality <chr> "Alta Floresta D'Oeste", "Alta Floresta D'Oeste",…
#> $ gini_index <dbl> 0.5983, 0.5868, 0.5893, NA, 0.5080, 0.5491, NA, 0…
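The join depends on the municipality codes lining up once the last digit of the longer code is dropped. One way to confirm that no municipality is left without state and region data is an anti-join, sketched below on toy data. The sketch assumes the lookup table carries 7-digit IBGE codes whose last digit is a check digit; 1100015 is the IBGE code of Alta Floresta D'Oeste, while the tibbles themselves and the code 999999 are made up for illustration.

```r
gini_toy <- dplyr::tibble(
  municipality_code = c(110001L, 999999L),
  gini_index = c(0.5983, 0.5)
)

lookup_toy <-
  dplyr::tibble(
    municipality_code = 1100015L,
    municipality = "Alta Floresta D'Oeste"
  ) |>
  dplyr::mutate(
    # Drop the last digit: 1100015 becomes 110001.
    municipality_code =
      municipality_code |>
      as.character() |>
      stringr::str_sub(end = -2) |>
      as.integer()
  )

# Rows of `gini_toy` that have no match in the lookup table.
unmatched <- dplyr::anti_join(gini_toy, lookup_toy, by = "municipality_code")
```

In this toy case `unmatched` contains only the made-up code 999999. On the real data, an empty result would indicate that every municipality found its state and region.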
Validating the Data
data |> dplyr::glimpse()
#> Rows: 16,695
#> Columns: 9
#> $ year <int> 1991, 2000, 2010, 1991, 2000, 2010, 1991, 2000, 2…
#> $ region_code <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ region <chr> "North", "North", "North", "North", "North", "Nor…
#> $ state_code <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ state <chr> "Rondônia", "Rondônia", "Rondônia", "Rondônia", "…
#> $ federal_unit <chr> "RO", "RO", "RO", "RO", "RO", "RO", "RO", "RO", "…
#> $ municipality_code <int> 110001, 110001, 110001, 110037, 110037, 110037, 1…
#> $ municipality <chr> "Alta Floresta D'Oeste", "Alta Floresta D'Oeste",…
#> $ gini_index <dbl> 0.5983, 0.5868, 0.5893, NA, 0.5080, 0.5491, NA, 0…
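The validation code itself is not shown in this section. As a hedged sketch of the kind of checks that suit this dataset, and not necessarily the checks the pipeline runs, the hypothetical helper below asserts that the years are census years, that the index stays within 0 and 1, and that each municipality appears at most once per year.

```r
# Hypothetical helper; not part of the original pipeline.
validate_gini <- function(data) {
  stopifnot(
    # Only the three census years are expected.
    all(data$year %in% c(1991L, 2000L, 2010L)),
    # The Gini Index is bounded by 0 and 1; missing values are allowed.
    all(data$gini_index >= 0 & data$gini_index <= 1, na.rm = TRUE),
    # One row per municipality and year.
    !any(duplicated(data[c("year", "municipality_code")]))
  )

  # Return the input invisibly so the helper can sit inside a pipeline.
  invisible(data)
}

toy_valid <- data.frame(
  year = c(1991L, 2000L, 2010L),
  municipality_code = c(110001L, 110001L, 110001L),
  gini_index = c(0.5983, 0.5868, 0.5893)
)

validate_gini(toy_valid)
```

The helper stops with an error at the first failed check and otherwise returns the data unchanged, so it could be placed between the joining and arranging steps.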
Arranging the Data
data <-
  data |>
  dplyr::arrange(
    year,
    region_code,
    state_code,
    municipality_code
  )
data |> dplyr::glimpse()
#> Rows: 16,695
#> Columns: 9
#> $ year <int> 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1…
#> $ region_code <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ region <chr> "North", "North", "North", "North", "North", "Nor…
#> $ state_code <int> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 1…
#> $ state <chr> "Rondônia", "Rondônia", "Rondônia", "Rondônia", "…
#> $ federal_unit <chr> "RO", "RO", "RO", "RO", "RO", "RO", "RO", "RO", "…
#> $ municipality_code <int> 110001, 110002, 110003, 110004, 110005, 110006, 1…
#> $ municipality <chr> "Alta Floresta D'Oeste", "Ariquemes", "Cabixi", "…
#> $ gini_index <dbl> 0.5983, 0.5827, 0.6527, 0.6800, 0.5958, 0.6229, N…
Data Dictionary
metadata <-
  data |>
  labelled::`var_label<-`(
    list(
      year = "Census year",
      region_code = "IBGE region code",
      region = "Region name",
      state_code = "IBGE state code",
      state = "State name",
      federal_unit = "Federal unit name",
      municipality_code = "IBGE municipality code",
      municipality = "Municipality name",
      gini_index = "Gini Index of per capita household income"
    )
  ) |>
  labelled::generate_dictionary(details = "full") |>
  labelled::convert_list_columns_to_character()
metadata
data
Saving the Valid Data
Data
Metadata
Visualizing the Data
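The Gini Index is a measure of income inequality within a population, ranging from 0 to 1. A value of 0 indicates perfect equality (everyone has the same income), while a value of 1 indicates maximal inequality (all income is held by a single person). Values in between indicate intermediate levels of inequality, with higher values meaning a more unequal distribution.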
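For intuition, one common formulation of the Gini Index is half the mean absolute difference between all pairs of incomes, divided by the mean income. The sketch below implements that formulation for a small vector of incomes. It is only an illustration: the function name is hypothetical, and the estimator behind the DATASUS figures is the one described in the technical note, which may differ.

```r
# Hypothetical helper: Gini Index from the mean absolute difference.
gini <- function(x) {
  n <- length(x)

  # Sum of |x_i - x_j| over all ordered pairs, scaled by 2 * n^2 * mean(x).
  sum(abs(outer(x, x, "-"))) / (2 * n^2 * mean(x))
}

gini(c(10, 10, 10, 10)) # Everyone earns the same.
gini(c(0, 0, 0, 10)) # One person holds all the income.
```

The first call returns 0. The second returns 0.75 rather than 1 because, under this formulation, the maximum for a population of size n is (n - 1) / n, which approaches 1 as the population grows.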
brand_div_palette <- function(x) {
  brandr:::make_color_ramp(
    n_prop = x,
    colors = c(
      brandr::get_brand_color("dark-red-triadic-blue"),
      # brandr::get_brand_color("white"),
      brandr::get_brand_color_mix(
        position = 950,
        color_1 = "dark-red",
        color_2 = "dark-red-triadic-blue",
        alpha = 0.5
      ),
      brandr::get_brand_color("dark-red")
    )
  )
}
data |>
  dplyr::filter(year == 2010) |>
  tidyr::drop_na(gini_index) |>
  plotr:::plot_hist(
    col = "gini_index",
    density_line_color = "red",
    x_label = "Gini Index",
    print = FALSE
  ) +
  ggplot2::xlim(0, 1) +
  ggplot2::labs(
    title = paste0(
      "Gini Index of Per Capita Household Income by Brazilian Municipality"
    ),
    subtitle = "Year: 2010",
    caption = "Source: DATASUS/IPEA/IBGE"
  )
data |>
  dplyr::filter(year == 2010) |>
  tidyr::drop_na(gini_index) |>
  plotr:::plot_brazil_municipality(
    col_fill = "gini_index",
    col_code = "municipality_code",
    year = plotr:::get_closest_geobr_year(2010, type = "municipality"),
    comparable_areas = FALSE,
    reverse = TRUE,
    limits = c(0, 1),
    breaks = seq(0, 1, 0.25),
    palette = brand_div_palette,
    print = FALSE
  ) +
  ggplot2::labs(
    title = paste0(
      "Gini Index of Per Capita Household Income by Brazilian Municipality"
    ),
    subtitle = "Year: 2010",
    caption = "Source: DATASUS/IPEA/IBGE"
  )
How to Cite
To cite this work, please use the following format:
Vartanian, D., & Carvalho, A. M. (2025). A reproducible pipeline for processing DATASUS data on the Gini Index of per capita household income by Brazilian municipality for the years 1991, 2000, and 2010 [Report]. Sustentarea Research and Extension Group at the University of São Paulo. https://sustentarea.github.io/datasus-gini-index
A BibTeX entry for LaTeX users is
@techreport{vartanian2025,
  title = {A reproducible pipeline for processing DATASUS data on the Gini Index of per capita household income by Brazilian municipality for the years 1991, 2000, and 2010},
  author = {{Daniel Vartanian} and {Aline Martins de Carvalho}},
  year = {2025},
  address = {São Paulo},
  institution = {Sustentarea Research and Extension Group at the University of São Paulo},
  langid = {en},
  url = {https://sustentarea.github.io/datasus-gini-index}
}
License
This content is licensed under CC0 1.0 Universal, placing these materials in the public domain. You may freely copy, modify, distribute, and use this work, even for commercial purposes, without permission or attribution.
Acknowledgments
This work is part of the Sustentarea Research and Extension Group project: Global syndemic: The impact of anthropogenic climate change on the health and nutrition of children under five years old attended by Brazil’s public health system (SUS).
This work was supported by the Conselho Nacional de Desenvolvimento Científico e Tecnológico - Brazil (CNPq).