In this recipe we show how to convert a CLDF dataset from long (or narrow) format to wide format.
CLDF follows best practices in data management by mandating a long format for its ValueTable. This complies with the recommendation that
Each observation must have its own row. Tidy Data
However, Linguists are often used to data in a wide format, e.g. a matrix (or spreadsheet) where each language has a row and each typological feature or semantic concept has a column (with feature values and lexemes as cells, respectively).
Fortunately, it is always possible to convert the observations of a CLDF dataset to such a matrix. As an example, we'll convert the 2020 version of WALS Online to the matrix format provided as download for previous versions.
- Download and unzip the data from .
You should now have a local folder
cldf-datasets-wals-014143f/
on your disk. - Larger CLDF datasets are best handled using SQLite, so we load the data into a SQLite database running
pycldf
'screatedb
command:Using this db to extract the feature values for a single WALS feature is straightforward:$ cldf createdb cldf-datasets-wals-014143f/cldf/StructureDataset-metadata.json wals2020.sqlite
SELECT l.cldf_name as 'Language', c.cldf_name as 'Value' FROM LanguageTable as l, ValueTable as v, CodeTable as c WHERE v.cldf_parameterReference = '1A' and l.cldf_id = v.cldf_languageReference and v.cldf_codeReference = c.cldf_id ;
Retrieving values for all features, and labeling the columns with feature names is a bit more complicated, though. Actually, retrieving all values is not the problem:
SELECT
l.cldf_id as wals_code
FROM
LanguageTable as l
JOIN ValueTable as v ON l.cldf_id = v.cldf_languageReference
JOIN CodeTable as c ON v.cldf_codeReference = c.cldf_id
GROUP BY
l.cldf_id
;
But how do we tease out the individual values and sort them into columns? We can exploit the fact that WALS has at most one value per language and feature, so the following construct works:
SELECT
l.cldf_id as wals_code,
max(case when v.cldf_parameterReference = '1A' then c.Number || ' ' || c.cldf_name end) as '1A Consonant Inventories'
FROM
LanguageTable as l
JOIN ValueTable as v ON l.cldf_id = v.cldf_languageReference
JOIN CodeTable as c ON v.cldf_codeReference = c.cldf_id
GROUP BY
l.cldf_id
;
But typing all max(...) as ...
for 192 features is tedious.
So we'll use a two-step process to cobble together a suitable SQL query.
- First, we run a query to retrieve the data which is needed to assemble the result set specification in the final query:
SELECT 'max(case when v.cldf_parameterReference = ''' || p.cldf_id || ''' then c.Number || '' '' || c.cldf_name end) as ''' || p.cldf_id || ' ' || replace(p.cldf_name, '''', '"') || ''',' FROM ParameterTable as p ORDER BY p.cldf_id ;
- Then we plug the result (stripping the final comma) into the full query:
SELECT l.cldf_id as wals_code, l.cldf_iso639P3code as iso_code, l.cldf_glottocode as glottocode, l.cldf_name as Name, l.cldf_latitude as latitude, l.cldf_longitude as longitude, l.Genus as genus, l.Family as family, l.cldf_macroarea as macroarea, max(case when v.cldf_parameterReference = '1A' then c.Number || ' ' || c.cldf_name end) as '1A Consonant Inventories', [...] FROM LanguageTable as l JOIN ValueTable as v ON l.cldf_id = v.cldf_languageReference JOIN CodeTable as c ON v.cldf_codeReference = c.cldf_id GROUP BY l.cldf_id, l.cldf_iso639P3code, l.cldf_glottocode, l.cldf_name, l.cldf_latitude, l.cldf_longitude, l.Genus, l.Family, l.cldf_macroarea ORDER BY l.cldf_id ;
The full query can be run against the SQLite database as follows
sqlite3 -header -csv wals2020.sqlite < matrix.sql > matrix.csv
to create a CSV table matrix.csv which can be loaded into spreadsheet programs.