gcollapse
Efficiently make dataset of summary statistics using C.
gcollapse converts the dataset in memory into a dataset of means, sums, medians, etc., similar to collapse. Unlike collapse, however, first, last, firstnm, lastnm for string variables are not supported.
Important
Run gtools, upgrade
to update gtools
to the latest stable version.
Note
Stata 17+, MP version, introduced significant speed improvements
to the native collapse
command, specially with many cores. Depending
on the collapse, it can be up to twice as fast than gcollapse
; however,
it remained slower for some use cases. YMMV.
Syntax
This is a fast option to Stata's collapse, with several additions.
gcollapse clist [if] [in] [weight] [, options ]
where clist is either
[(stat)] varlist [ [(stat)] ... ]
[(stat)] target_var=varname [target_var=varname ...] [ [(stat)] ...]
or any combination of the varlist
or target_var
forms, and stat is one of
Stat | Description |
---|---|
mean | means (default) |
geomean | geometric means (missing if var has any negative values) |
count | number of nonmissing observations |
nmissing | number of missing observations |
nunique | counts unique elements |
median | medians |
p#.# | arbitrary quantiles (#.# must be strictly between 0, 100) |
p1 | 1st percentile |
p2 | 2nd percentile |
... | 3rd-49th percentiles |
p50 | 50th percentile (same as median) |
... | 51st-97th percentiles |
p98 | 98th percentile |
p99 | 99th percentile |
iqr | interquartile range |
sum | sums |
rawsum | sums, ignoring optionally specified weight except observations with a weight of zero are excluded |
nansum | sum; returns . instead of 0 if all entries are missing |
rawnansum | rawsum; returns . instead of 0 if all entries are missing |
sd | standard deviation |
variance | variance |
cv | coefficient of variation (sd/mean ) |
semean | standard error of the mean (sd/sqrt(n)) |
sebinomial | standard error of the mean, binomial (sqrt(p(1-p)/n)) (missing if source not 0, 1) |
sepoisson | standard error of the mean, Poisson (sqrt(mean / n)) (missing if negative; result rounded to nearest integer) |
skewness | Skewness |
kurtosis | Kurtosis |
percent | percentage of nonmissing observations |
max | maximums |
min | minimums |
select# | # th smallest non-missing |
select-# | # th largest non-missing |
rawselect# | # th smallest non-missing, ignoring weights |
rawselect-# | # th largest non-missing, ignoring weights |
range | range (max - min ) |
first | first value |
last | last value |
firstnm | first nonmissing value |
lastnm | last nonmissing value |
gini | computes the Gini coefficient (negative values are truncated to 0) |
gini dropneg | computes the Gini coefficient (negative values are dropped) |
gini keepneg | computes the Gini coefficient (negative values are Kept; the user is responsible for the interpretation of the gini coefficient in this case) |
Weights
aweight, fweight, iweight, and pweight are allowed and mimic collapse
(see help weight
and the weights section in help collapse
).
pweights may not be used with sd, semean, sebinomial, or sepoisson. iweights may not be used with semean, sebinomial, or sepoisson. aweights may not be used with sebinomial or sepoisson.
Options
-
by(varlist)
specifies the groups over which the means, etc., are to be calculated. It can contain any mix of string or numeric variables. -
cw
specifies casewise deletion. If cw is not specified, all possible observations are used for each calculated statistic. -
fast
specifies that gcollapse not restore the original dataset should the user press Break.
Extras
-
rawstat(varlist)
Sequence of target names for which to ignore weights, except observations with a weight of zero or missing, which are excluded. This is a generalization of rawsum, but it is specified for each individual target (if no target is specified, the source variable name is what we call target). -
merge
merges the collapsed data back to the original data set. Note that if you want to replace the source or target variable(s) then you need to specifyreplace
. -
wildparse
specifies that the function call should be parsed assuming targets are named using rename-stle syntax. For example,gcollapse (sum) s_x* = x*, wildparse
-
replace
Replace allows replacing existing variables with merge. -
freq(varname)
stores the group frequency count in freq. It differs from count because it merely stores the number of occurrences of the group in the data, rather than the non-missing count. Hence it is equivalent to summing a dummy variable equal to 1 everywhere. -
labelformat(str)
Specifies the label format of the output. #stat# is replaced with the statistic: #Stat# for titlecase, #STAT# for uppercase, #stat:pretty# for a custom replacement; #sourcelabel# for the source label and #sourcelabel:start:nchars# to extract a substring from the source label. The default is (#stat#) #sourcelabel#. #stat# palceholders in the source label are also replaced. -
labelprogram(str)
Specifies the program to use with #stat:pretty#. This is an rclass that must set prettystat as a return value. The program must specify a value for each summary stat or return #default# to use the default engine. The programm is passed the requested stat by gcollapse. -
unsorted
Do not sort resulting data set. Saves speed.
Switches
-
forceio
By default, when there are more than 3 additional targets (i.e. the number of targets is greater than the number of source variables plus 3) the function tries to be smart about whether adding empty variables in Stata before the collapse is faster or slower than collapsing the data to disk and reading them back after keeping only the first J observations (assuming J is the number of groups). For J small relative to N, collapsing to disk will be faster. This check involves some overhead, however, so if J is known to be smallforceio
will be faster. -
forcemem
The opposite offorceio
. The check for whether to use memory or disk check involves some overhead, so if J is known to be large forcemem will be faster. -
double
stores data in double precision. -
sumcheck
Check whether byte, int, or long sum will overflow. By default sum targets are double; in this case, sum targets check the smallest integer type that will be suitable and only assigns a double if the sum would overflow.
Gtools options
(Note: These are common to every gtools command.)
-
compress
Try to compress strL to str#. The Stata Plugin Interface has only limited support for strL variables. In Stata 13 and earlier (version 2.0) there is no support, and in Stata 14 and later (version 3.0) there is read-only support. The user can try to compress strL variables using this option. -
forcestrl
Skip binary variable check and force gtools to read strL variables (14 and above only). Gtools gives incorrect results when there is binary data in strL variables. This option was included because on some windows systems Stata detects binary data even when there is none. Only use this option if you are sure you do not have binary data in your strL variables. -
verbose
prints some useful debugging info to the console. -
benchmark
orbench(level)
prints how long in seconds various parts of the program take to execute. Level 1 is the same asbenchmark
. Levels 2 and 3 additionally prints benchmarks for internal plugin steps. -
hashmethod(str)
Hash method to use.default
automagically chooses the algorithm.biject
tries to biject the inputs into the natural numbers.spooky
hashes the data and then uses the hash. -
oncollision(str)
How to handle collisions. A collision should never happen but just in case it doesgtools
will try to use native commands. The user can specify it throw an error instead by passingoncollision(error)
.
Out of memory
(See also Stata's own discussion in help memory
.)
There are many reasons for why an OS may run out of memory. The best-case scenario is that your system is running some other memory-intensive program. This is specially likely if you are running your program on a server, where memory is shared across all users. In this case, you should attempt to re-run gcollapse once other memory-intensive programs finish.
If no memory-intensive programs were running concurrently, the second best-case scenario is that your user has a memory cap that your programs can use. Again, this is specially likely on a server, and even more likely on a computing grid. If you are on a grid, see if you can increase the amount of memory your programs can use (there is typically a setting for this). If your cap was set by a system administrator, consider contacting them and asking for a higher memory cap.
If you have no memory cap imposed on your user, the likely scenario is that your system cannot allocate enough memory for gcollapse. At this point you have two options: One option is to try fcollapse or collapse, which are slower but using either should require a trivial one-letter change to the code; another option is to re-write the code to collapse the data in segments (the easiest way to do this would be to collapse a portion of all variables at a time and perform a series of 1:1 merges at the end).
Replacing gcollapse with fcollapse or plain collapse is an option because gcollapse often uses more memory. This is a consequence of Stata's inability to create variables via C plugins. This forces gcollapse to create variables before collapsing, meaning that if there are J groups and N observations, gcollapse uses N - J more rows than the ideal collapse program, per variable.
gcollapse was written with this limitation in mind and tries to save memory in various ways (for example, if J is small relative to N, gcollapse will use free disk space instead of memory, which not only saves memory but is also much faster). Nevertheless, it is possible that your system will allocate enough memory for fcollapse or collapse in situations where it cannot allocate enough memory for gcollapse.
Stored results
gcollapse stores the following in r():
r(N) number of non-missing observations r(J) number of groups r(minJ) largest group size r(maxJ) smallest group size
Examples
You can download the raw code for the examples below here
Basic usage
The syntax for its basic use is the same as collapse:
sysuse auto, clear gcollapse (sum) price mpg (mean) m1 = price m2 = mpg if !mi(rep78), by(foreign) l +---------------------------------------------+ | foreign price mpg m1 m2 | |---------------------------------------------| 1. | Domestic 296604 938 6,179.3 19.5417 | 2. | Foreign 127473 531 6,070.1 25.2857 | +---------------------------------------------+
You can call multiple names per statistic in any order, optionally specifying the target name. Further, weights can be selectively applied to each target.
sysuse auto, clear gcollapse (mean) price praw = price [fw = rep78], by(foreign) rawstat(praw) l +------------------------------+ | foreign price praw | |------------------------------| 1. | Domestic 6,162.5 6,179.3 | 2. | Foreign 6,133.8 6,070.1 | +------------------------------+
Note, however, that rows with missing or 0 values of rep78 are excluded regardless when selectively applying weights.
Unique counts
sysuse auto, clear gcollapse (nunique) rep78 mpg turn, by(foreign) l +-------------------------------+ | foreign rep78 mpg turn | |-------------------------------| 1. | Domestic 6 17 17 | 2. | Foreign 4 13 7 | +-------------------------------+
Wild Parsing
clear set obs 10 gen x1 = _n gen x2 = _n^2 gen x3 = _n^3 gcollapse mean_x* = x*, wildparse l +-----------------------------+ | mean_x1 mean_x2 mean_x3 | |-----------------------------| 1. | 5.5 38.5 302.5 | +-----------------------------+
Quantiles
gcollapse allows the user to specify arbitrary quantiles:
sysuse auto, clear gcollapse (p2.5) p2_5 = price (p97.5) p97_5 = price, by(foreign) l +---------------------------+ | foreign p2_5 p97_5 | |---------------------------| 1. | Domestic 3,299 14,500 | 2. | Foreign 3,748 12,990 | +---------------------------+
This is useful if you have a large number of observations per group:
clear set obs 1000 gen long id = _n gcollapse /// (p2) p2 = id /// (p2.5) p2_5 = id /// (p3) p3 = id /// (p96) p96 = id /// (p97.5) p97_5 = id /// (p98) p98 = id l +--------------------------------------------+ | p2 p2_5 p3 p96 p97_5 p98 | |--------------------------------------------| 1. | 20.5 25.5 30.5 960.5 975.5 980.5 | +--------------------------------------------+
Label outputs
The default label for collapsed stats is "(stat) source label". I find this format ugly, so I have implemented a very basic engine to label outputs:
sysuse auto, clear gcollapse (mean) price, labelformat(#stat#: #sourcelabel#) disp _n(1) "`:var label price'"
mean: Price
The following placeholder options are available in the engine:
-
#stat#
,#Stat#
, and#STAT#
are replaced with the lower-, title-, and upper-case name of the summary stat. -
#sourcelabel#
,#sourcelabel:start:numchars#
are replaced with the source label, optionally extractingnumchars
characters fromstart
(numchars
can be.
to denote all characters fromstart
). -
#stat:pretty#
replces each stat name with a nicer version (mean to Mean, sd to St Dev., and so on). The user can specify a their own custom pretty program vialabelprogram()
. The program MUST be an rclass program and returnprettystat
. For example
capture program drop my_pretty_stat program my_pretty_stat, rclass if ( `"`0'"' == "sum" ) local prettystat "Total" else if ( `"`0'"' == "mean" ) local prettystat "Average" else { local prettystat "#default#" } return local prettystat = `"`prettystat'"' end sysuse auto, clear gcollapse /// (mean) mean = price /// (sum) sum = price /// (sd) sd = price, /// freq(obs) /// labelformat(#stat:pretty# #sourcelabel#) labelp(my_pretty_stat) disp _n(1) "`:var label mean'" /// _n(1) "`:var label sum'" /// _n(1) "`:var label sd'" /// _n(1) "`:var label obs'"
Average Price Total Price St Dev. Price Group size
We can see that mean
and sum
were set to the custom label, while sd
was
set to the default. You can also specify a different label format for each
variable if you put the stat palceholder in the variable label.
sysuse auto, clear gen mean = price gen sum = price label var mean "Price (#stat#)" label var sum "Price #stat:pretty#" label var price "`:var label price' #stat:pretty#" gcollapse /// (mean) mean /// (sum) sum /// (sd) sd = price, /// labelformat(#sourcelabel#) labelp(my_pretty_stat) disp _n(1) "`:var label mean'" /// _n(1) "`:var label sum'" /// _n(1) "`:var label sd'"
Price (mean) Price Sum Price St Dev.
Merge
You can merge summary stats back to the main data with gcollapse. This is
equivalent to a sequence of egen
statements or to collapse
followed by
merge. That is, if you want to create bulk summary statistics, you might want
to do:
sysuse auto, clear qui { preserve collapse (mean) m_pr = price (sum) s_gr = gear_ratio, by(rep78) tempfile bulk save `bulk' restore merge m:1 rep78 using `bulk', assert(3) nogen }
But with gcollapse this is simplified to
sysuse auto, clear gcollapse (mean) m_pr = price (sum) s_gr = gear_ratio, by(rep78) merge
If you wish to replace the source variables, you can do
sysuse auto, clear gcollapse (mean) price (sum) gear_ratio, by(rep78) merge replace
Using I/O vs memory
gcollapse tries to determine whether using memory or using your disk's temporary drive is better. For example:
sysuse auto, clear gen long id = _n * 1000 expand id replace id = _n tempfile io save `io' local call (sum) s1 = id /// (mean) s2 = id /// (max) s3 = id /// (min) s4 = id /// (sd) s5 = id gcollapse `call', by(foreign) v
Bijection OK with all integers (i.e. no extended miss val)? Yes. Counting sort on hash; min = 1, max = 2 N = 2,775,000; 2 unbalanced groups of sizes 1,378,000 to 1,397,000 Will write 4 extra targets to disk (full data = 84.7 MiB; collapsed data = 6.1e-05 MiB). Adding targets before collapse estimated to take 0.00027 seconds. Adding targets after collapse estimated to take 1.9e-10 seconds. Writing/reading targets to/from disk estimated to take 1.4e-07 seconds. Will write to disk and read back later to save time.
Foreign has 2 levels, and we can see that gcollapse
determines that collapsing to disk would save time.
However, we can skip this check if we know a variable
has few levels:
use `io', clear gcollapse `call', by(foreign) verbose forceio bench
Parsed by variables, sources, and targets; .013 seconds Recast source variables to save memory; .06 seconds Parsed by variables; .001 seconds Bijection OK with all integers (i.e. no extended miss val)? Yes. Counting sort on hash; min = 1, max = 2 N = 2,775,000; 2 unbalanced groups of sizes 1,378,000 to 1,397,000 Plugin runtime; .21 seconds Total runtime (internals); .213 seconds Added extra targets after collapse; 0 seconds Read extra targets from disk; .004 seconds Program exit executed; 0 seconds
We can see that gcollapse
skipped the check but that it read the collapsed
targets from disk after the collapse. We can also force gcollapse
to use memory:
use `io', clear gcollapse `call', by(foreign) verbose forcemem bench
Parsed by variables, sources, and targets; .006 seconds Recast source variables to save memory; .044 seconds Dropped superfluous variables; .061 seconds Generated additional targets; 0 seconds Parsed by variables; .002 seconds Bijection OK with all integers (i.e. no extended miss val)? Yes. Counting sort on hash; min = 1, max = 2 N = 2,775,000; 2 unbalanced groups of sizes 1,378,000 to 1,397,000 Plugin runtime; .222 seconds Total runtime (internals); .224 seconds Program exit executed; .001 seconds
Again, it skipped the check but this time but we can see it generated the targets before the collapse.