# Note
This article records some of my random thoughts before and during my work on REAM and its toolchain. It probably doesn't belong in the official documentation, and will be moved to my personal website in the future.
The first section lists some of the problems I experience when using Microsoft Excel and Word to collect and document social science data. The second section explains the motivation behind creating REAM. The third section explains some of the design choices for the toolchain, and how these designs makes REAM an appealing choice for social scientists.
# Problems
The project was inspired by my work at the Cline Center. I work on the Composition of Religious and Ethnic Groups (CREG) Project, and my task is to merge several existing datasets on ethnic groups in one master dataset, with a region focus on Sub-Saharan Africa. For every match I make between datasets, I have to update the master dataset, record the rationale in the documentation, as well as the sources that support the match. That being said, we have documentation for (almost) every data point in the master dataset.
We use Microsoft Excel to store the data, and Microsoft Word to document the matches. However, there are several things I don't like about the workflow.
# Dataset and documentation are separate
For every match I make, I first edit the master dataset in Excel, then edit the documentation in Word. Recording the same decision is two different applications is tedious and error prone. The potential inconsistency between the dataset and the documentation not only make the dataset harder to read but also hard to update in the future.
My ideal solution should merge the dataset and its documentation in one single file.
# Large spreadsheets are hard to edit
Spreadsheets are handy tools to organize data when everything fit in my screen. I can see all the columns and rows, and have a full picture of of my dataset. The What You See Is What You Get (WYSIWYG) design is what makes Excel and other spreadsheet applications popular at the first place.
But when the dataset spans beyond my screen, it's most appealing design diminishes. I have to scroll horizontally and vertically to navigate between cells. I have to filter and sort and zoom out to squeeze all relevant information into the screen. I argue that if you need a hyper-scrolling mouse to move between cells comfortably, you are probably using the wrong tool.
My ideal solution should work well with large dataset natively. One way to do so is to modularize the dataset.
# Spreadsheet's simple structure limits the way we store data
The spreadsheet seems like a perfect format to store social science datasets: each row represents a data point, each column represents a variable, and the entire spreadsheet is essentially one gigantic matrix.
Now consider collecting data to test the following empirical model adapted from Fearon and Laitin (2003) to test whether mountainous countries are more prone to civil war, with GDP as a control variable:
If stored in the form of spreadsheet, your dataset should look something like:
1 | |||||
2 | |||||
3 | |||||
4 | |||||
5 | |||||
6 |
Since the spreadsheet is small and well-ordered, its easy for users to update the data.
Say we want to update
One solution is to update the relevant cells with a scripting language, such as VBA, Python or R.
So you would do something like data$Mountainous[data$Country == 'x',] = new_value
in R.
This requires users to be familiar with programming.
Also using any languages other than VBA meaning exporting and re-importing the spreadsheet.
Another solution is to change the way we store data.
Observe that any data point with the same
{
"country": [
{
"name": "x",
"Mountainous": "m_x"
"year": [
{ "name": 1, "CivilWar": "w_x1", "GDP": "g_x1" }
{ "name": 2, "CivilWar": "w_x2", "GDP": "g_x2" }
]
},
{
"name": "y",
"Mountainous": "m_y"
"year": [
{ "name": 1, "CivilWar": "w_y1", "GDP": "g_y1" }
{ "name": 2, "CivilWar": "w_y2", "GDP": "g_y2" }
]
},
{
"name": "z",
"Mountainous": "m_z"
"year": [
{ "name": 1, "CivilWar": "w_z1", "GDP": "g_z1" }
{ "name": 2, "CivilWar": "w_z2", "GDP": "g_z2" }
]
}
]
}
Now there is only one
[
{"country_name": "x", "country_Mountainous": "m_x", "year_name": 1, "year_CivilWar": "w_x1", "year_GDP": "g_x1"},
{"country_name": "x", "country_Mountainous": "m_x", "year_name": 2, "year_CivilWar": "w_x2", "year_GDP": "g_x2"},
{"country_name": "y", "country_Mountainous": "m_y", "year_name": 1, "year_CivilWar": "w_y1", "year_GDP": "g_y1"},
{"country_name": "y", "country_Mountainous": "m_y", "year_name": 2, "year_CivilWar": "w_y2", "year_GDP": "g_y2"},
{"countrz_name": "z", "countrz_Mountainous": "m_z", "year_name": 1, "year_CivilWar": "w_z1", "year_GDP": "g_z1"},
{"countrz_name": "z", "countrz_Mountainous": "m_z", "year_name": 2, "year_CivilWar": "w_z2", "year_GDP": "g_z2"}
]
then convert this into CSV.
But JSON is difficult to edit and read, and it doesn't support comments natively. My ideal solution should use a human-friendly data serialization standard that works well with hierarchal data.
# I don't like Microsoft Office Suite
I use Linux, and Microsoft Office doesn't run natively on Linux. So I had to run a virtual Windows 10 on top of Linux - that is 2 CPU cores, 2.5GB of memory, 200MB of video memory, and 50GB of storage minimum - just to run a spreadsheet application and a word application. My ideal solution should be cross-platform.
Microsoft controls my data. If I purchase Microsoft 365 service, I am buying the subscription, not the actual programmes. Once I stop renewing my subscription, my datasets and documentations may not render or work properly. My ideal solution should be free and open-source.
Microsoft Excel comes with many wonderful functionalities, most of which I do not use. I had worked on the same project so long that I wrote custom R and Python scripts to perform commonly used operations more efficiently. Now I find myself using Excel only as a CSV editor with color support. That being said, my ideal solution should have a minimalistic and high-performing front-end for basic editing, and provide the option to connect to a kernel of my choosing when advance functionalities are required, like Vim and Jupyter Notebook. (Fun fact: There are 140+ kernels available for Jupyter Notebook, not just ipython)
# Conclusion
To reiterate, I want a tool/language that
- saves data and its documentation in one file,
- is able to handle large datasets natively,
- works well with hierarchal data,
- is human-readable,
- is cross-platform,
- is free and open-source,
- provides or works well with a minimalistic editor.
In additional, I also want the solution to have version control support, or work well with an existing version control tool.
# Why REAM
I did not initially write REAM with all 7 features in mind.
When collecting my data for my undergraduate thesis, I wanted to edit my dataset in my favourite editor Neovim. I tried a plugin that handles CSV files in Vim, but found some of its core functionalities to be inconsistent. This is when I start exploring CSV alternatives that are easy to edit in text editors. Meanwhile, I started reflecting on my past experience with data and see how I cam improve my workflow with my new solution.
# YAML
My first choice was YAML since it's a human-friendly data serialization standard I was already familiar with. It also support comments so I can add inline documentation. I can rewrite the previous example in YAML:
country:
- name: x
Mountainous: m_x
year:
- name: 1
CivilWar: w_x1
GDP: g_x1
- name: 2
CivilWar: w_x2
GDP: g_x2
- name: y
Mountainous: m_y
year:
- name: 1
CivilWar: w_y1
GDP: g_y1
- name: 2
CivilWar: w_y2
GDP: g_y2
- name: z
Mountainous: m_z
year:
- name: 1
CivilWar: w_z1
GDP: g_z1
- name: 2
CivilWar: w_z2
GDP: g_z2
It's so much more readable than JSON.
But that's not all. YAML's anchors and aliases allow easy reference to existing data and make the dataset modular. If I adjust the way data is organized:
country:
- &country_x
country_name: x
country_Mountainous: mx
- &country_y
country_name: y
country_Mountainous: my
- &country_z
countrz_name: z
countrz_Mountainous: mz
year:
- <<: *country_x
year_name: 1
year_CivilWar: wx1
year_GDP: gx1
- <<: *country_x
year_name: 2
year_CivilWar: wx2
year_GDP: gx2
- <<: *country_y
year_name: 1
year_CivilWar: wy1
year_GDP: gy1
- <<: *country_y
year_name: 2
year_CivilWar: wy2
year_GDP: gy2
- <<: *country_z
year_name: 1
year_CivilWar: wz1
year_GDP: gz1
- <<: *country_z
year_name: 2
year_CivilWar: wz2
year_GDP: gz2
With Python's pyyaml
library doing most of the heavy work, all I have to do is:
import yaml
# read yaml file
with open("test3.yaml", "r") as f:
dat = yaml.safe_load(f)
# extract year entries
years = [ list(entry.values()) for entry in dat['year'] ]
# save as csv
with open("test3.csv", "w") as f:
for year in years:
lines = ",".join([ str(item) for item in year ])
f.write(line)
f.write("\n")
to generate the CSV file.
But this breaks the hierarchal structure.
To prevent recursive referencing, the data structure is now flat; both country
and year
entires are now placed at the same level instead of being nested in one another.
Since variables with identical keys will be overwritten when referencing, I have to rename variables to make them unique, hence the country_
and year_
prefixes.
And since comments in YAML are simply ignored, I have to modify the parser to extract my inline documentation.
Modifying the parser to parse comments is not hard, but fixing other problems requires creating a superset of the language. If that's the case, I see no reason why I must stick to YAML.
# TOML
TOML is a another config file standard that emphasize human readability, and I quite like the syntax for Array of Tables. I can rewrite the previous example in TOML:
[[country]]
name = "x"
Mountainous = "mx"
[[country.year]]
name = 1
CivilWar = "wx1"
GDP = "gx1"
[[country.year]]
name = 2
CivilWar = "wx2"
GDP = "gx2"
[[country]]
name = "y"
Mountainous = "my"
[[country.year]]
name = 1
CivilWar = "wy1"
GDP = "gy1"
[[country.year]]
name = 2
CivilWar = "wy2"
GDP = "gy2"
[[country]]
name = "z"
Mountainous = "mz"
[[country.year]]
name = 1
CivilWar = "wz1"
GDP = "gz1"
[[country.year]]
name = 2
CivilWar = "wz2"
GDP = "gz2"
I prefer TOML's syntax over YAML's when it comes to nested data as it doesn't rely on indentation, but it lack referencing. In fact, the author explicitly rejected the idea during early development (Issue #13 and Issue #77). To implement reference in TOML implies creating a superset of TOML.
In conclusion, I want a language that is able to reference and reuse existing data (like the rejected proposal in Issue #13, but with better syntax) and has readable syntax for "list of objects" (or "array of tables" in TOML).
I found none, so I wrote my own.
# SQL
You might wonder why I didn't consider SQL databases, and the answer is quite simple: I did not know SQL back then.
Nevertheless, if I want REAM to be more than a personal project, I have to consider what exactly does REAM provide that SQL does not, instead of reinventing an inferior relational database. Why would social scientists choose REAM over a 40+ year technology that is proven to be efficient and reliable?
The most fundamental design differences between the two is that SQL databases are dynamic, while REAM store everything in static files:
# SQL databases are harder to perform CRUD operations on
CRUD are four basic data manipulation operations: create, read, update and delete. There are three methods that I can think of.
(1) manually type the INSERT
, SELECT
, UPDATE
, DELECT
commands into the SQL prompt.
(2) manually edit INSERT
commands in a .sql
file, then \i
and SELECT
whenever you want to read it.
(3) hire a full-stack engineer to write a front-end CRUD application and maintain the back-end server, or do them yourself.
The third method is too expensive for a personal project. As to the first two methods... I don't think SQL is designed to be used like that.
REAM, on the other side, can be edited in any text editor. Since it is designed to do so, it is much pleasant to the eyes in plain text.
SQL:
INSERT INTO person (first_name, last_name, gender, email)
values ('John', 'Doe', 'Male', 'john.doe@example.com');
REAM:
## Person
- first_name: John
- last_name: Doe
- gender: Male
- email: john.doe@example.com
An editor for REAM is also under development. I'm working on a module that generates a graphical CRUD application based on the schema provided.
# SQL databases are harder to version control
There are many solutions for version controlling SQL databases. Some involving commercial softwares, others advocate "best practices." This is a very complicated topic and I do not have the knowledge to elaborate on each of the solutions. In summary, there is no native way to version control SQL databases.
REAM, on the other side, save everything in text files.
You can use any version control tools to track changes.
I use Git, and designed the syntax to work well with git-diff
.
# What if I want to build a website to present my data?
This is what modern databases are primarily used for. If this is a hard requirement, you should use SQL databases.
But this is not to say you can't use REAM to build websites. With the rise of static site generators, it is now possible to write a website entirely in Markdown. Since REAM is a strict subset of Markdown, you may be able to figure out a way to use REAM with the static site generator of your choosing.
(I have been searching for a static Wiki generator that (1) uses Markdown, (2) uses Git for version control, and (3) runs entirely on the client side. The only tool that satisfy all three requirements is MDwiki, which is no longer maintained.)
In conclusion, SQL is a powerful tool. You may be able to adapt your workflow to work with SQL databases, but it requires significant amount of skill and/or money to make it work. REAM aims to provide an easy workflow to create, edit, distribute and reuse dataset, which will be further explained in the next section.
# Toolchain Design
The subsection on SQL briefly explain why REAM may be a better option compared to SQL databases. This section further explain some of the features that makes REAM and it toolchain an appealing tool for social scientists.
# REAM-Editor
Although the language is designed to be editor-friendly, probably not many people are willing edit datasets in text editors. An easy-to-use editor to edit REAM file is needed.
# Web-based
When designing REAM and its toolchain, I have collaborative work in mind. In order to make contribution easy, the editor should require no local installation or third party dependencies. To edit a REAM file, all you need is to visit the online editor, and drag and drop the file to the browser. Pull request can be done through GitHub's website, and GitHub integration might be added to the editor. No programming or command line is required for basic operations.
# Real-Time Preview
Since CSV/spreadsheet is REAM's primary compile target, the editor should provide real-time preview, just like StackEdit (This is actually where the inspiration come from)
# Graphical CRUD Application
After REAM Schema is implemented, REAM-Editor should be able to generate HTML forms based on the schema provided for basic CRUD operation. Synchronize type checking and error handling should also be implemented.
# Hackable
The Editor is written in Vue, so it should be easy to incorporate custom Vue components so as to edit and render the dataset however you desire.
# Parser
The parser is the most important part of the toolchain. It compiles REAM datasets to analysis-ready formats, such as CSV and JSON. More formats might be supported in the future.
Even though the parser can be written in any programming languages, there is only a few to choose from if we want the parser to be included in the web-based editor. The current implementation is written in JavaScript.
However, I still want the parser to be able to work as a command line tool to provide more advanced functionalities. To distribute such CLI tool, there are two options:
Publish to npm, but I suspect few of the potential users have node and npm/yarn installed.
Distribute native binaries, but it might be huge in size.
Both are possible, but neither is ideal.
I am currently experimenting with a Rust implementation. If successful, the parser would be compiled to binary executables for the CLI tool, and WASM modules for the web-based editor.
# Package Manager
If I want to plot fancy graphs in R, I do not need write a plotting library.
I can use ggplot2
:
- Download
ggplot2
install.packages("ggplot2")
2. Import `ggplot2`
```R
library(ggplot2)
- Use
ggolot2
plot_1 = ggplot(dat = data) + geom_point(aes(x = x, y = y))
If I want to do matrix calculations in Python, I do not need to write a new linear algebra library.
I can use numpy
- Download
numpy
pip install --user numpy
- Import
numpy
import numpy as np
import numpy.linalg as la
- Use
numpy
mat_A_inv = la.inv(mat_A)
If I want to add GDP variables to my dataset, I do not need to calculate GDP for each country. I can use GDP data from World Bank.
- Download the dataset
(1) Google World Bank GDP
and click on the first result
(2) Download the data in CSV
(3) Unzip API_NY.GDP.MKTP.CD_DS2_en_csv_v2_1678496.zip
- Import dataset
(1) Read the CSV
wb = read.csv("./API_NY.GDP.MKTP.CD_DS2_en_csv_v2_1678496.csv")
(2) See error messages
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
more columns than column names)
(3) Open the CSV file in a text editor. Discover the dataset header is on line 5.
(4) Read CSV again
wb = read.csv("./API_NY.GDP.MKTP.CD_DS2_en_csv_v2_1678496.csv", skip = 4, header = T)
- Use the dataset
(1) Write a function to extract GDP:
get_gdp = function(country, year) {
col_i = grep(paste0('X', year), names(wb))
gdp = wb[wb$Country.Name == country,][col_i]
return(gdp[1,1])
}
(2) Apply the function on each row in your dataset
my_data$GDP = apply(my_data, 1, function(row) get_gdp(row['country'], row['year']))
(3) Discover Cote d'Ivoire has NA
GDP. Oh, it's called "Sierra Leone" in the World Banks' dataset.
(4) Figure out all the name differences between the two datasets and write a "dictionary" to translate the names.
country_dict = list(
"Brunei" = "Brunei Darussalam",
"Dominican Republic" = "Dominica",
"Cote d'Ivoire" = "Sierra Leone"
# You get the idea
)
(5) Modify get_gdp
get_gdp = function(my_country, year) {
col_i = grep(paste0('X', year), names(wb))
wb_country = country_dict[[my_country]]
gdp = wb[wb$Country.Name == wb_country,][col_i]
return(gdp[1,1])
}
(6) Apply the function on each row in your dataset
my_data$GDP = apply(my_data, 1, function(row) get_gdp(row['country'], row['year']))
There should be a easier way to import and add existing datasets to new datasets.
In fact, it's not unheard of to download datasets with package managers.
Beside the build-in datasets in R, you can download quite a few datasets from CRAN
using install.pacakges
, including example datasets in libraries (diamonds
in ggplot2
), datasets as libraries (titanic
), or wrappers of APIs (censusapi
).
I think it would be awesome if we have a package and project manager for REAM datasets. Writing a package manager that people don't hate is a notoriously difficult task, so I don't expect this to be officially released in the near future. I'll work on a naive implementation (essentially wrappers of git commands) as a proof of concept.