gegen
Efficient implementation of byable egen functions using C.
Important
Run gtools, upgrade
to update gtools
to the latest stable version.
Syntax
gegen [type] newvar = fcn(arguments) [if] [in] [weight] [, ///
replace fcn_options gtools_options ]
Gtools options

compress
Try to compress strL tostr#
. 
forcestrl
Skip binary variable check and force gtools to read strL variables. 
verbose
prints some useful debugging info to the console. 
benchmark
prints how long in seconds various parts of the program take to execute. 
benchmarklevel(int)
depth of benchmark. 
hashmethod(str)
For debugging: default, biject, or spooky. 
oncollision(str)
For debugging: fallback or error. 
gtools_capture(str)
The above options are captured and not passed to egen in case the requested function is not internally supported by gtools. You can pass extra arguments here if their names conflict with captured gtools options.
Weights
Weights are only allowed for internallyimplemented functions. In particular they only affect: total, sum, mean, sd, count, median, iqr, percent, semean, sebinomial, sepoisson, percentiles, skewness, kurtosis. They are ignored by: tag, group, nunique, max, min, first, last, firstnm, lastnm. All other functions do not allow weights.
aweight, fweight, iweight, and pweight are allowed for the functions
listed below and mimic collapse
(see help weight
and the weights
section in help collapse
).
pweights may not be used with sd, semean, sebinomial, or sepoisson. iweights may not be used with semean, sebinomial, or sepoisson. aweights may not be used with sebinomial or sepoisson.
Compiled functions
The following are simply wrappers for other gtools functions. Consult
each command's corresponding help files for details. (Note that gstats
transform
in particular allows embedding options in the statistic call
rather than program arguments; while this is technically also possible
to do through gegen
, I do not recommend it. Instead, use window()
with
moving_stat
, interval()
with range_stat
, cumby()
with cumsum
, and
so on.) In the table, stat
can be replaced with any stat available to
gcollapse
except percent, nunique
:
function > calls  xtile(exp) > fasterxtile standardize(varname) > gstats transform normalize(varname) > gstats transform demean(varname) > gstats transform demedian(varname) > gstats transform moving_stat(varname) > gstats transform range_stat(varname) > gstats transform cumsum(varname) > gstats transform shift(varname) > gstats transform rank(varname) > gstats transform winsor(varname) > gstats winsor winsorize(varname) > gstats winsor
The functions listed here have been compiled and hence will run very quickly. Functions not listed here hash the data and then call egen with by(varlist) set to the hash, which is often faster than calling egen directly, but not always. The functions here should always be faster, however.
Generate IDs
group(varlist) [, missing counts(newvarname) fill(real)] may not be combined with by. It creates one variable taking on values 1, 2, ... for the groups formed by varlist. varlist may contain numeric variables, string variables, or a combination of the two. The default order of the groups is the sort order of the varlist. However, the user can specify: [+] varname [[+] varname ...] And the order will be inverted for variables that have  prepended. missing indicates that missing values in varlist (either . or "") are to be treated like any other value when assigning groups, instead of as missing values being assigned to the group missing. You can also specify counts() to generate a new variable with the number of observations per group; by default all observations within a group are filled with the count, but via fill() the user can specify the value the variable will take after the first observation that appears within a group. The user can also specify fill(data) to fill the first Jth observations with the count per group (in the sorted group order) or fill(group) to keep the default behavior.
Tag groups
tag(varlist) [, missing] may not be combined with by. It tags just 1 observation in each distinct group defined by varlist. When all observations in a group have the same value for a summary variable calculated for the group, it will be sufficient to use just one value for many purposes. The result will be 1 if the observation is tagged and never missing, and 0 otherwise. Note values for any observations excluded by either if or in are set to 0 (not missing). Hence, if tag is the variable produced by egen tag = tag(varlist), the idiom if tag is always safe. missing specifies that missing values of varlist may be included.
Summary stats
All the functions listed here allow by(varlist)
. If this is not specified,
then operations are performed by row. exp
must be a valid Stata espression
or a list of variables.
firstlastfirstnmlastnm(exp) creates a constant (within varlist) containing the first, last, first nonmissing, and last nonmissing observation. The functions are analogous to those in collapse and not to those in egenmore. count(exp) creates a constant (within varlist) containing the number of nonmissing observations of exp. nunique(exp) creates a constant (within varlist) containing the number of unique observations of exp. iqr(exp) creates a constant (within varlist) containing the interquartile range of exp. Also see pctile(). max(exp) creates a constant (within varlist) containing the maximum value of exp. mean(exp) creates a constant (within varlist) containing the mean of exp. geomean(exp) creates a constant (within varlist) containing the geometric mean of exp. If exp has any negative values, the function returns missing (`.`). If it has any zeros, the function returns zero. median(exp) creates a constant (within varlist) containing the median of exp. Also see pctile(). min(exp) creates a constant (within varlist) containing the minimum value of exp. range(exp) creates a constant (within varlist) containing the range of exp. pctile(exp) [, p(#)] creates a constant (within varlist) containing the #th percentile of exp. If p(#) is not specified, 50 is assumed, meaning medians. Also see median(). select(exp) , n(##) creates a constant (within varlist) containing the `#`th smallest nonmissing value of exp. If `#` is specified, the `#`th _largest_ nonmissing value is output instead. Note if there are any nonmissing values then `n(1)` and `n(1)` will output the same value as `min` and `max`, respectively. sd(exp) creates a constant (within varlist) containing the standard deviation of exp. Also see mean(). variance(exp) creates a constant (within varlist) containing the variance of exp. Also see sd(). cv(exp) creates a constant (within varlist) containing the coefficient of variation of exp. Also see sd() adn mean(). percent(exp) creates a constant (within varlist) containing the percent of nonmissing observations in the group relative to the sample. semean(exp) creates a constant (within varlist) containing the standard error of the mean (sd/sqrt(n)) sebinomial(exp) creates a constant (within varlist) containing the standard error of the mean, binomial (sqrt(p(1p)/n)) (missing if not 0, 1) sepoisson(exp) creates a constant (within varlist) containing the standard error of the mean, Poisson (sqrt(mean / n)) (missing if negative; result rounded to nearest integer) skewness(exp) creates a constant (within varlist) containing the skewness kurtosis(exp) creates a constant (within varlist) containing the kurtosis total(exp) [, missing] sum(exp) [, missing] creates a constant (within varlist) containing the sum of exp treating missing as 0. If missing is specified and all values in exp are missing, newvar is set to missing. Also see mean(). gini(exp) ginidropneg(exp) ginikeepneg(exp) creates a constant (within varlist) containing the Gini coefficient of exp, truncating negative values to 0. `ginidropneg` drops negative values, and `ginikeepneg` keeps negative values as is (the user is responsible for the interpretation of the Gini coefficient in this case).
Description
gegen creates newvar of the optionally specified storage type equal to fcn(arguments). Here fcn() is either one of the internally supported commands above or a byable function written for egen, as documented above. Only egen functions or internally supported functions may be used with egen. If you want to generate multiple summary statistics from a single variable it may be faster to use gcollapse with the merge option.
Depending on fcn(), arguments, if present, refers to an expression, varlist, or a numlist, and the options are similarly fcn dependent.
Out of memory
(See also Stata's own discussion: help memory.)
There are many reasons for why an OS may run out of memory. The bestcase scenario is that your system is running some other memoryintensive program. This is specially likely if you are running your program on a server, where memory is shared across all users. In this case, you should attempt to rerun gegen once other memoryintensive programs finish.
If no memoryintensive programs were running concurrently, the second bestcase scenario is that your user has a memory cap that your programs can use. Again, this is specially likely on a server, and even more likely on a computing grid. If you are on a grid, see if you can increase the amount of memory your programs can use (there is typically a setting for this). If your cap was set by a system administrator, consider contacting them and asking for a higher memory cap.
If you have no memory cap imposed on your user, the likely scenario is that your system cannot allocate enough memory for gegen. At this point you can try fegen or egen, which are slower but using either should require a trivial oneletter change to the code. Note, however, that replacing gegen with fegen or plain egen is not guaranteed to use less memory. I have not benchmarked memory use very extensively, so gegen might use less memory (I doubt that is the case in most scenarios, but it is possible).
You can also try to process the data by segments. However, if you are doing group operations you would need to first sort the data and make sure you are not splitting groups apart.
Examples
You can download the raw code for the examples below here
. sysuse auto, clear . gegen id = group(foreign) . gegen tag = group(foreign) . gegen sum = sum(mpg), by(foreign) . gegen sum2 = sum(mpg rep78), by(foreign) . gegen p5 = pctile(mpg rep78), p(5) by(foreign) . gegen nuniq = nunique(mpg), by(foreign)
The function can be any of the supported functions above. It can also be any function supported by egen:
. webuse egenxmpl4, clear . gegen hsum = rowtotal(a b c) rowtotal() is not a gtools function and no by(); falling back on egen . sysuse auto, clear (1978 Automobile Data) . gegen seq = seq(), by(foreign) seq() is not a gtools function; will hash and use egen