greshape

greshape is a fast alternative to reshape that additionally implements the equivalents to R's spread and gather from tidyr. It also allows an arbitrary number of grouping by variables (i()) and keys (j()).

Important

Run gtools, upgrade to update gtools to the latest stable version.

Syntax

In R parlance, j() is called keys(), which is frankly much clearer to understand than i and j. While regular Stata syntax is also supported, greshape provides the aliases by() and keys() for i() and j() respectively (note spread and wide accept multiple j keys).

wide -> long
------------

greshape long stubnames, by(varlist) [options]

greshape gather varlist, keys(varlist) values(varname) [options]

long -> wide
------------

greshape wide stubnames, by(varlist) keys(varname) [options]

greshape spread varlist, keys(varname) [options]

I think the above syntax is clearer (happy to receive feedback otherwise) but greshape also accepts the traditional Stata i, j syntax:

greshape long stubnames, i(varlist) [options]
greshape wide stubnames, i(varlist) j(varname) [options]

Syntax Details

The stubnames are a list of variable prefixes. The suffixes are either saved or taken from keys(), depending on the shape of the data. Remember this picture:

       long
    +------------+                   wide
    | i  j  stub |                  +----------------+
    |------------|                  | i  stub1 stub2 |
    | 1  1   4.1 |     greshape     |----------------|
    | 1  2   4.5 |   <---------->   | 1    4.1   4.5 |
    | 2  1   3.3 |                  | 2    3.3   3.0 |
    | 2  2   3.0 |                  +----------------+
    +------------+

    To go from long to wide:

                                            j existing variable
                                           /
            greshape wide stub, by(i) keys(j)

    To go from wide to long:

            greshape long stub, by(i) keys(j)
                                           \
                                            j new variable(s)

Additionally, the user can reshape in the style of R's tidyr package. To go from long to wide:

    greshape spread varlist, keys(j)

Note that spread (and gather) both require variable names, not prefixes. Further, all variables not specified in the reshape are assumed to be part of by() and the new variables are simply named after the values of keys(). From wide to long:

    greshape gather varlist, keys(j) values(values)

This does not check for duplicates or sorts the data. Variables not named are assumed to be part of by()). The values of the variables in varlist are saved in values(), with their names saved in keys().

reshape's extended syntax is not supported; that is, greshape does not implement "reshape mode" where a user can type reshape long or reshape wide after the first reshape. This syntax is cumbersome to support and prone to errors given the degree to which greshape had to rewrite the base code. This also means the "advanced" commands are not supported, including: clear, error, query, i, j, xij, and xi.

Options

Long and Wide

Long only

  • by(varlist) (Required) Use varlist as the ID variables (alias i()).
  • keys(varname) Wide to long: varname, new variable to store stub suffixes (default _j; alias j()).
  • string Whether to allow for string matches to each stub
  • match(str) Where to match levels of keys() in stub (default @). Use match(regex) for complex matches (see examples below for details).
  • dropmiss Drop missing observations for reshaped variables, including extended missing values. With multiple stubs, it only drops the observation if every output variable will be missing for that row.

Wide only

  • by(varlist) (Required) Use varlist as the ID variables (alias i()).
  • keys(varlist) (Required) Long to wide: varlist, existing variable with stub suffixes (alias j()).
  • colsepparate(str) Column separator when multiple variables are passed to j().
  • match(str) Where to replace the levels of keys() in stub (default @).
  • labelformat(str) Custom label formatting using placeholders; only with one key()/j() variable. Default is #keyvalue# #stublabel#, replaced with the key() variable value corresponding to the newly created variable, and the stub variable label. Other placeholders available are #stubname#, for the source variable name, #keyname# for the key() variable name, #keylabel# for the key() variable lavel, and #keyvaluelabel# to use the value label, if available, instead of the value itself.
  • prefix(str) Custom renaming of reshaped variables. One rename per stub; {opt @} syntax allowed. For example, with two stubs you can specify prefix(#stub# foo@bar) and the first stub's variables will be named normally (#stub# is replaced with the stub name) while the second will be named foo@bar with @ replaced by the j() variable's values.

Wide and long

  • fast Do not wrap the reshape in preserve/restore pairs.
  • unsorted Leave the data unsorted (faster). Original sort order is not preserved.
  • nodupcheck Wide to long, allow duplicate by() values (faster).
  • nomisscheck Long to wide, allow missing values and/or leading blanks in keys() (faster).
  • nochecks This is equivalent to all 4 of the above options (fastest).
  • xi(drop) Drop variables not in the reshape, by(), or keys().

Gather and Spread

Gather only

  • values(varname) (Required) Store values in varname.
  • keys(varname) Wide to long: varname, new variable to store variable names (default _key).
  • uselabels[(str)] Store variable labels instead of their names. Optionally specify a varlist with the variables to do this for (or uselabels(varlist, exclude) to specify the variables not to do this for).
  • dropmiss Drop missing observations for reshaped variables, including extended missing values.

Spread only

  • keys(varlist) (Required) Long to wide: varlist, existing variable with variable names.
  • labelformat(str) Custom label formatting using placeholders; only with one key()/j() variable. Default is #keyvalue# #stublabel#, replaced with the key() variable value corresponding to the newly created variable, and the stub variable label. Other placeholders available are #stubname#, for the source variable name, #keyname# for the key() variable name, #keylabel# for the key() variable lavel, and #keyvaluelabel# to use the value label, if available, instead of the value itself.
  • prefix(str) Custom renaming of reshaped variables. One common rename; {opt @} syntax allowed. For example, with two stubs you can specify prefix(foo@bar) and the output variables will be named foo@bar with @ replaced by the key() variable's values.

Gather and Spread

  • by(varlist) check varlist are the ID variables. Throws an error otherwise.
  • xi(drop) Drop variables not in the reshape or in by(). That is, if by() is specified then drop variables that have not been explicitly named.
  • fast Do not wrap the reshape in preserve/restore pairs.

Gtools options

(Note: These are common to every gtools command.)

  • compress Try to compress strL to str#. The Stata Plugin Interface has only limited support for strL variables. In Stata 13 and earlier (version 2.0) there is no support, and in Stata 14 and later (version 3.0) there is read-only support. The user can try to compress strL variables using this option.

  • forcestrl Skip binary variable check and force gtools to read strL variables (14 and above only). Gtools gives incorrect results when there is binary data in strL variables. This option was included because on some windows systems Stata detects binary data even when there is none. Only use this option if you are sure you do not have binary data in your strL variables.

  • verbose prints some useful debugging info to the console.

  • benchmark or bench(level) prints how long in seconds various parts of the program take to execute. Level 1 is the same as benchmark. Levels 2 and 3 additionally prints benchmarks for internal plugin steps.

  • hashmethod(str) Hash method to use. default automagically chooses the algorithm. biject tries to biject the inputs into the natural numbers. spooky hashes the data and then uses the hash.

  • oncollision(str) How to handle collisions. A collision should never happen but just in case it does gtools will try to use native commands. The user can specify it throw an error instead by passing oncollision(error).

Remarks

greshape converts data from wide to long form and vice versa. It is a fast alternative to reshape, and it additionally implements greshape spread and greshape gather, both of which are marginally faster and in the style of the equivalent R commands from tidyr.

It is well-known that reshape is a slow command, and there are several alternatives that I have encountered to speed up reshape, including: fastreshape, parallel, sreshape, and various custom solutions (e.g. here). While these alternatives are slower than greshape, the speed gains are not uniform.

If j() is numeric, the data is already sorted by i(), and there are not too may variables to reshape, then fastreshape comes closest to achieving comparable speeds to greshape. Under most circumstances, however, greshape is typically 20-60% faster than fastreshape on sorted data, and up to 90% faster if j() has string values or if the data is unsorted (by default greshape will output the data in the correct sort order). In other words, greshape's speed gains are very robust, while other solutions' are not.

Note

greshape relies on temporary files written to your disk storage to reshape the data in memory. While this might deteriorate performance for particularly large reshapes, the speed gains are large enough that greshape should still be faster than its Stata counterpart.

Examples

You can download the raw code for the examples below here

Basic usage

Syntax is largely analogous to reshape

webuse reshape1, clear
list
greshape long inc ue, i(id) j(year)
list, sepby(id)
greshape wide inc ue, i(id) j(year)

However, the preferred greshape parlance is by for i and keys for j, which I think is clearer.

webuse reshape1, clear
list
greshape long inc ue, by(id) keys(year)
list, sepby(id)
greshape wide inc ue, by(id) keys(year)

Allow string values in j; the option string is not necessary for long to wide:

webuse reshape4, clear
list
greshape long inc, by(id) keys(sex) string
list, sepby(id)
greshape wide inc, by(id) keys(sex)

Multiple j values:

webuse reshape5, clear
list
greshape wide inc, by(hid) keys(year sex) cols(_)
l

Complex stub matches

@ syntax is supported and can be modified via match()

webuse reshape3, clear
list
greshape long inc@r ue, by(id) keys(year)
list, sepby(id)
greshape wide inc@r ue, by(id) keys(year)
list

webuse reshape3, clear
list
greshape long inc[year]r ue, by(id) keys(year) match([year])
list, sepby(id)
greshape wide inc[year]r ue, by(id) keys(year) match([year])
list

Output variables can be renamed using user-specified patterns.

webuse reshape3, clear
qui greshape long inc@r ue, by(id) keys(year)
qui greshape wide inc[hi]r ue, by(id) keys(year) prefix(year[hi]income #stub#) match([hi])
desc, full

webuse reshape3, clear
qui greshape gather inc*r ue*, values(values) key(variable)
qui greshape spread values, key(variable) prefix(foo@bar_#stub#)
desc, full

Note that stata variable syntax is only supported for long to wide, and cannot be combined with @ syntax. For complex pattern matching from wide to long, use match(regex) or match(ustrregex). With regex, the first group is taken to be the group to match (this can be modified via /#, e.g. r(e)g(e)x/2 matches the 2nd group). With ustrregex, every part of the match outside lookarounds is matched (e.g. (?<=(foo|bar)[0-9]{0,2}stub)([0-9]+)(?=alice|bob) matches ([0-9]+)).

webuse reshape3, clear
greshape long inc([0-9]+).+ (ue)(.+)/2, by(id) keys(year) match(regex)
greshape wide inc@r u?, by(id) keys(year)

Note for ustrregex (Stata 14+ only), Stata does not support matches of indeterminate length inside lookarounds (this is a limitation that is not uncommon across several regex implementations).

Custom Labels

This was motivated by the labeling convention of separate. To mimic its labeling, you can do

sysuse auto, clear
local labelformat labelformat(#stubname#, #keyname# == #keyvaluelabel#)
greshape wide mpg, by(make) key(foreign) `labelformat'
desc mpg*

However, you can use any combination of placeholders. For instance,

sysuse auto, clear
local labelformat labelf(#stublabel#; #keylabel# == #keyvaluelabel#)
greshape wide mpg, by(make) key(foreign) `labelformat'
desc mpg*

If no label of value labels are available, the program falls back to variable name and values

sysuse auto, clear
label drop origin
label var mpg ""
label var foreign ""
local labelformat labelf(#stublabel#; #keylabel# == #keyvaluelabel#)
greshape wide mpg, by(make) key(foreign) `labelformat'
desc mpg*

Gather and Spread

webuse reshape1, clear
greshape gather inc* ue*, values(values) keys(varaible)
greshape spread values, keys(varaible)

Drop missing observations

Often it is desireable to drop missing observations when reshaping long. For example

clear
set obs 10
gen i = _n
expand i
bys i: gen j = _n
gen x = _n
gen y = -_n
greshape wide x y, by(i) key(j)

When reshaping this data back into long, we would normally get 100 observations, with 45 of them missing. However, we can dispense with the additional missing values via dropmiss:

greshape long x y, by(i) key(j) dropmiss
assert _N == 55
assert x  == _n
assert y  == -_n

Fine-grain control over error checks

By default, greshape throws an error with problematic observations, but this can be ignored if so desired.

webuse reshape2, clear
list
cap noi greshape long inc, by(id) keys(year)
preserve
cap noi greshape long inc, by(id) keys(year) nodupcheck
restore

gen j = string(_n) + " "
cap noi greshape wide sex inc*, by(id) keys(j)
preserve
cap noi greshape wide sex inc*, by(id) keys(j) nomisscheck
restore

drop j
gen j = _n
replace j = . in 1
cap noi greshape wide sex inc*, by(id) keys(j)
preserve
cap noi greshape wide sex inc*, by(id) keys(j) nomisscheck
restore

Not all errors are solvable, however. For example, xi variables must be unique by i, and j vannot define duplicate values.

cap noi greshape wide inc*, by(id) keys(j) nochecks

drop j
gen j = string(_n) + " "
replace j = "1." in 2
cap noi greshape wide inc*, by(id) keys(j) nochecks

There is no fix for j defining non-unique names, since variable names must be unique. In this case you must manually clean your data before reshaping. However, greshape allows the user to specify that xi variables can be dropped (i.e. nonly explicitly named variables are kept):

drop j
gen j = _n
cap noi greshape wide inc*, by(id) keys(j) xi(drop) nochecks