gdistinct
Efficiently report number(s) of distinct observations or values.
gdistinct is a faster alternative to distinct. It displays the number of distinct observations with respect to the variables in varlist. By default, each variable is considered separately (excluding missing values), so the number of distinct observations is reported for each variable; in this case the results are also stored in a matrix.
The number of distinct observations is the same as the number of distinct values. Optionally, variables can be considered jointly so that the number of distinct groups defined by the values of variables in varlist is reported.
Important
Run gtools, upgrade to update gtools to the latest stable version.
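A minimal sketch of the update step. The `ssc install` line is an assumption (it presumes gtools is obtained from SSC, which this page does not state); `gtools, upgrade` is the command named above.

```stata
* Assumption: gtools is distributed via SSC; install it only if it is missing.
capture which gtools
if _rc ssc install gtools

* Update to the latest stable version, as described above.
gtools, upgrade
```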
Syntax
This is a fast alternative to the user-written command distinct; it additionally stores the results in a matrix.
gdistinct [varlist] [if] [in] [, ///
missing abbrev(#) joint minimum(#) maximum(#) ]
Options

missing
specifies that missing values are to be included in counting distinct observations. 
abbrev(#)
specifies that variable names are to be displayed abbreviated to at most # characters. This option has no effect with joint. 
joint
specifies that distinctness is to be determined jointly for the variables in varlist. 
minimum(#)
specifies that numbers of distinct values are to be displayed only if they are equal to or greater than a specified minimum. 
maximum(#)
specifies that numbers of distinct values are to be displayed only if they are less than or equal to a specified maximum. 
sort(order)
specifies the sort order of the output. May be alpha (alphabetical by variable name), distinct (number of distinct values), or total (number of nonmissing values, unless option missing is specified). Optionally prepend a negative sign to sort in descending order. Ties are broken arbitrarily. This is ignored with option joint.
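The options above can be combined. The following is a sketch using the auto dataset shipped with Stata; the particular bounds (5 and 40) are arbitrary values chosen for illustration, and it assumes minimum() and maximum() may be given together, as in distinct.

```stata
sysuse auto, clear

* Report only variables with between 5 and 40 distinct values,
* listed alphabetically by variable name.
gdistinct, minimum(5) maximum(40) sort(alpha)

* Same bounds, but count missing values as well and sort by the
* number of distinct values in descending order.
gdistinct, missing minimum(5) maximum(40) sort(-distinct)
```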
Gtools options
(Note: These are common to every gtools command.)

compress
Try to compress strL to str#. The Stata Plugin Interface has only limited support for strL variables: in Stata 13 and earlier (plugin interface version 2.0) there is no support, and in Stata 14 and later (version 3.0) there is read-only support. The user can try to compress strL variables using this option. 
forcestrl
Skip the binary variable check and force gtools to read strL variables (Stata 14 and above only). Gtools gives incorrect results when there is binary data in strL variables. This option was included because on some Windows systems Stata detects binary data even when there is none. Only use this option if you are sure you do not have binary data in your strL variables. 
verbose
prints some useful debugging info to the console. 
benchmark or bench(level)
prints how long in seconds various parts of the program take to execute. Level 1 is the same as benchmark. Levels 2 and 3 additionally print benchmarks for internal plugin steps. 
hashmethod(str)
Hash method to use. default automagically chooses the algorithm. biject tries to biject the inputs into the natural numbers. spooky hashes the data and then uses the hash. 
oncollision(str)
How to handle collisions. A collision should never happen, but just in case it does, gtools will try to use native commands. The user can request that it throw an error instead by passing oncollision(error).
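As a rough illustration of how these common options attach to a gdistinct call, the sketch below uses only the option names listed above on the auto dataset; the combinations shown are illustrative choices, not recommendations.

```stata
sysuse auto, clear

* Time the main steps of a joint count.
gdistinct foreign rep78, joint benchmark

* Finer-grained timings that include internal plugin steps.
gdistinct foreign rep78, joint bench(2)

* Request the spooky hash method, and throw an error on a collision
* instead of falling back on native commands.
gdistinct foreign rep78, joint hashmethod(spooky) oncollision(error) verbose
```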
Stored results
gdistinct stores the following in r():
Scalars
    r(ndistinct)    number of groups (last variable or joint)
    r(N)            number of nonmissing observations
    r(J)            number of groups
    r(minJ)         smallest group size
    r(maxJ)         largest group size
Matrices
    r(distinct)     number of nonmissing observations and number of distinct values; one row per variable (default) or per varlist (with option joint)
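A short sketch of reading these results after a call. The matrix name r(distinct) follows the `matrix list r(distinct)` call in the Examples section; the local matrix name D is a hypothetical name chosen for illustration.

```stata
sysuse auto, clear

* Joint count: the scalars are left in r().
gdistinct foreign rep78, joint
display "distinct groups: " r(ndistinct)
display "nonmissing obs:  " r(N)
return list

* Per-variable counts: copy the results matrix before another
* r-class command overwrites r().
gdistinct
matrix D = r(distinct)
matrix list D
```

Copying the matrix first is a defensive habit: any later r-class command replaces the contents of r().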
Examples
gdistinct can function as a drop-in replacement for distinct.
. sysuse auto, clear
. gdistinct

              |        Observations
              |      total   distinct
--------------+----------------------
         make |         74         74
        price |         74         74
          mpg |         74         21
        rep78 |         69          5
     headroom |         74          8
        trunk |         74         18
       weight |         74         64
       length |         74         47
         turn |         74         18
 displacement |         74         31
   gear_ratio |         74         36
      foreign |         74          2

. matrix list r(distinct)

r(distinct)[12,2]
                     N  Distinct
        make        74        74
       price        74        74
         mpg        74        21
       rep78        69         5
    headroom        74         8
       trunk        74        18
      weight        74        64
      length        74        47
        turn        74        18
displacement        74        31
  gear_ratio        74        36
     foreign        74         2

. gdistinct, sort(-distinct)

              |        Observations
              |      total   distinct
--------------+----------------------
        price |         74         74
         make |         74         74
       weight |         74         64
       length |         74         47
   gear_ratio |         74         36
 displacement |         74         31
          mpg |         74         21
         turn |         74         18
        trunk |         74         18
     headroom |         74          8
        rep78 |         69          5
      foreign |         74          2

. gdistinct, max(10)

              |        Observations
              |      total   distinct
--------------+----------------------
        rep78 |         69          5
     headroom |         74          8
      foreign |         74          2

. gdistinct make-headroom

              |        Observations
              |      total   distinct
--------------+----------------------
         make |         74         74
        price |         74         74
          mpg |         74         21
        rep78 |         69          5
     headroom |         74          8

. gdistinct make-headroom, missing abbrev(6)

              |        Observations
              |      total   distinct
--------------+----------------------
         make |         74         74
        price |         74         74
          mpg |         74         21
        rep78 |         74          6
       head~m |         74          8

. gdistinct foreign rep78, joint

        Observations
      total   distinct
         69          8

. gdistinct foreign rep78, joint missing

        Observations
      total   distinct
         74         10
Acknowledgements
gdistinct was written to mimic the community-contributed command distinct, the latter written by:

Gary Longton, Fred Hutchinson Cancer Research Center, USA. glongton@fhcrc.org

Nicholas J. Cox, Durham University, UK. n.j.cox@durham.ac.uk