gunique
Efficiently calculate unique values of a variable or group of variables.
gunique is a faster alternative to unique. It reports the number of unique
values for the varlist. At the moment, its main difference from distinct is
that it always considers the variables jointly. It also has slighly different
options. For example, this supports the by(varlist) option that also appears
in the unique command, but does not support computing the number of unique
values for variables individually.
Important
Run gtools, upgrade to update gtools to the latest stable version.
Syntax
gunique varlist [if] [in] [, detail]
Options
-
by(varlist)counts unique values within levels of varlist and stores them in a new variable named _Unique. The user can specify the name of the new variable via the option generate(varname). -
generate(varname)supplies an alternative name for the new variable created by by. -
replaceReplaces _Unique (or the variable specified by generate()) if it exists. -
detailrequest summary statistics on the number of records which are present for unique values of the varlist. Withby(), it also prints the levels with the most unique values.
Gtools options
(Note: These are common to every gtools command.)
-
compressTry to compress strL to str#. The Stata Plugin Interface has only limited support for strL variables. In Stata 13 and earlier (version 2.0) there is no support, and in Stata 14 and later (version 3.0) there is read-only support. The user can try to compress strL variables using this option. -
forcestrlSkip binary variable check and force gtools to read strL variables (14 and above only). Gtools gives incorrect results when there is binary data in strL variables. This option was included because on some windows systems Stata detects binary data even when there is none. Only use this option if you are sure you do not have binary data in your strL variables. -
verboseprints some useful debugging info to the console. -
benchmarkorbench(level)prints how long in seconds various parts of the program take to execute. Level 1 is the same asbenchmark. Levels 2 and 3 additionally prints benchmarks for internal plugin steps. -
hashmethod(str)Hash method to use.defaultautomagically chooses the algorithm.bijecttries to biject the inputs into the natural numbers.spookyhashes the data and then uses the hash. -
oncollision(str)How to handle collisions. A collision should never happen but just in case it doesgtoolswill try to use native commands. The user can specify it throw an error instead by passingoncollision(error).
Stored results
gunique stores the following in r():
Scalars
r(unique) number of groups (last variable or joint)
r(N) number of non-missing observations
r(J) number of groups
r(minJ) largest group size
r(maxJ) smallest group size
Examples
You can download the raw code for the examples below
here ![]()
. sysuse auto, clear (1978 Automobile Data) . gunique * N = 69; 69 balanced groups of size 1 . gunique *, miss N = 74; 74 balanced groups of size 1 . gunique make-headroom N = 69; 69 balanced groups of size 1 . gunique rep78, d __000000 ------------------------------------------------------------- Percentiles Smallest 1% 2 2 5% 2 8 10% 2 11 Obs 5 25% 8 18 Sum of Wgt. 5 50% 11 Mean 13.8 Largest Std. Dev. 10.73313 75% 18 8 90% 30 11 Variance 115.2 95% 30 18 Skewness .5573459 99% 30 30 Kurtosis 2.113785 . gunique rep78, by(foreign) 'rep78' had 5 unique values in 69 observations. Variable _Unique has the number of unique values of 'rep78' by 'foreign'.
Acknowledgements
gunique was written to mimic the functionality of the community-contributed command unique, the latter written by
-
Michael Hills, retired
-
Tony Brady, Sealed Envelope Ltd, UK tony@sealedenvelope.com