gdistinct
Efficiently report number(s) of distinct observations or values.
gdistinct is a faster alternative to distinct. It displays the number of distinct observations with respect to the variables in varlist. By default, each variable is considered separately (excluding missing values), so the number of distinct observations is reported for each variable; in this case the results are also stored in a matrix.
The number of distinct observations is the same as the number of distinct values. Optionally, variables can be considered jointly so that the number of distinct groups defined by the values of variables in varlist is reported.
Important
To update gtools to the latest stable version, run: gtools, upgrade
Syntax
This is a fast alternative to the user-written command distinct that additionally stores the results in a matrix.

    gdistinct [varlist] [if] [in] [, ///
        missing abbrev(#) joint minimum(#) maximum(#) sort(order) ]
Options
- missing specifies that missing values are to be included in counting distinct observations.
- abbrev(#) specifies that variable names are to be displayed abbreviated to at most # characters. This option has no effect with joint.
- joint specifies that distinctness is to be determined jointly for the variables in varlist.
- minimum(#) specifies that numbers of distinct values are to be displayed only if they are equal to or greater than a specified minimum.
- maximum(#) specifies that numbers of distinct values are to be displayed only if they are less than or equal to a specified maximum.
- sort(order) specifies the sort order of the output. May be alpha (alphabetical by variable name), distinct (number of distinct values), or total (number of non-missing values, unless option missing is specified). Optionally prepend a negative sign to sort in descending order. Tie-breaks are resolved arbitrarily. This is ignored with option joint.
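The display options above can be combined. The following is a minimal sketch using the auto dataset that ships with Stata; it assumes that minimum(), maximum(), and sort() may be specified together, which the text above does not state explicitly.

```stata
sysuse auto, clear

* Show only variables with between 5 and 20 distinct values,
* sorted by descending number of distinct values.
* (Assumption: these three options can be used in one call.)
gdistinct, minimum(5) maximum(20) sort(-distinct)

* Count missing values when determining distinctness and
* abbreviate variable names to at most 6 characters.
gdistinct make-headroom, missing abbrev(6)
```

Because sort() and abbrev() are documented as having no effect with joint, they are only used here in the per-variable form.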
Gtools options
(Note: These are common to every gtools command.)
- compress: Try to compress strL to str#. The Stata Plugin Interface has only limited support for strL variables. In Stata 13 and earlier (version 2.0) there is no support, and in Stata 14 and later (version 3.0) there is read-only support. The user can try to compress strL variables using this option.
- forcestrl: Skip binary variable check and force gtools to read strL variables (14 and above only). Gtools gives incorrect results when there is binary data in strL variables. This option was included because on some Windows systems Stata detects binary data even when there is none. Only use this option if you are sure you do not have binary data in your strL variables.
- verbose prints some useful debugging info to the console.
- benchmark or bench(level) prints how long in seconds various parts of the program take to execute. Level 1 is the same as benchmark. Levels 2 and 3 additionally print benchmarks for internal plugin steps.
- hashmethod(str): Hash method to use. default automagically chooses the algorithm. biject tries to biject the inputs into the natural numbers. spooky hashes the data and then uses the hash.
- oncollision(str): How to handle collisions. A collision should never happen, but just in case it does, gtools will try to use native commands. The user can specify that it throw an error instead by passing oncollision(error).
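As an illustration of the gtools options listed above, here is a short sketch on the auto dataset. It uses only option names and values given in the list; combining them in a single call is an assumption.

```stata
sysuse auto, clear

* Print timings for the main steps (documented as equivalent to bench(1)).
gdistinct make-headroom, benchmark

* Also time the internal plugin steps and print debugging information.
gdistinct make-headroom, bench(2) verbose

* Use the spooky hash and stop with an error on a collision
* rather than falling back on native commands.
gdistinct foreign rep78, joint hashmethod(spooky) oncollision(error)
```

The auto dataset has no strL variables, so compress and forcestrl are not exercised here.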
Stored results
gdistinct stores the following in r():
Scalars
    r(ndistinct)  number of groups (last variable or joint)
    r(N)          number of non-missing observations
    r(J)          number of groups
    r(minJ)       smallest group size
    r(maxJ)       largest group size
Matrices
    r(distinct)   number of non-missing observations and number of
                  distinct values; one row per variable (default)
                  or per varlist (with option joint)
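A minimal sketch of how the stored results might be used after a call. The local and matrix names (ngroups, nobs, D) are illustrative only, and the matrix is read as r(distinct), the name that appears in the output of the Examples section.

```stata
sysuse auto, clear

* Joint distinctness: a single count for the whole varlist.
gdistinct foreign rep78, joint

* Copy the r() scalars into locals before another r-class
* command overwrites them.
local ngroups = r(ndistinct)
local nobs    = r(N)
display "groups: `ngroups', observations: `nobs'"

* Per-variable counts: keep the result matrix for later use.
gdistinct make-headroom
matrix D = r(distinct)
matrix list D
```

Copying results out of r() straight away is the usual precaution in Stata, because the next r-class command replaces them.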
Examples
gdistinct can function as a drop-in replacement for distinct.
    . sysuse auto, clear

    . gdistinct

                  |        Observations
                  |      total   distinct
    --------------+----------------------
             make |         74         74
            price |         74         74
              mpg |         74         21
            rep78 |         69          5
         headroom |         74          8
            trunk |         74         18
           weight |         74         64
           length |         74         47
             turn |         74         18
     displacement |         74         31
       gear_ratio |         74         36
          foreign |         74          2

    . matrix list r(distinct)

    r(distinct)[12,2]
                         N  Distinct
            make        74        74
           price        74        74
             mpg        74        21
           rep78        69         5
        headroom        74         8
           trunk        74        18
          weight        74        64
          length        74        47
            turn        74        18
    displacement        74        31
      gear_ratio        74        36
         foreign        74         2

    . gdistinct, sort(-distinct)

                  |        Observations
                  |      total   distinct
    --------------+----------------------
            price |         74         74
             make |         74         74
           weight |         74         64
           length |         74         47
       gear_ratio |         74         36
     displacement |         74         31
              mpg |         74         21
             turn |         74         18
            trunk |         74         18
         headroom |         74          8
            rep78 |         69          5
          foreign |         74          2

    . gdistinct, max(10)

                  |        Observations
                  |      total   distinct
    --------------+----------------------
            rep78 |         69          5
         headroom |         74          8
          foreign |         74          2

    . gdistinct make-headroom

              |        Observations
              |      total   distinct
    ----------+----------------------
         make |         74         74
        price |         74         74
          mpg |         74         21
        rep78 |         69          5
     headroom |         74          8

    . gdistinct make-headroom, missing abbrev(6)

            |        Observations
            |      total   distinct
    --------+----------------------
       make |         74         74
      price |         74         74
        mpg |         74         21
      rep78 |         74          6
     head~m |         74          8

    . gdistinct foreign rep78, joint

            Observations
          total   distinct
             69          8

    . gdistinct foreign rep78, joint missing

            Observations
          total   distinct
             74         10
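One way to check the drop-in claim on your own data is to run both commands and compare their stored counts. This sketch assumes the community-contributed distinct is already installed (for example via ssc install distinct) and that it stores its count in r(ndistinct); the local names are illustrative.

```stata
sysuse auto, clear

* Reference result from the community-contributed command.
* (Assumption: distinct is installed and returns r(ndistinct).)
distinct foreign rep78, joint
local nd_old = r(ndistinct)

* Same call with gdistinct.
gdistinct foreign rep78, joint
local nd_new = r(ndistinct)

* Stop with an error if the two commands disagree.
assert `nd_old' == `nd_new'
```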
Acknowledgements
gdistinct was written to mimic the community-contributed command distinct, the latter written by
- Gary Longton, Fred Hutchinson Cancer Research Center, USA. glongton@fhcrc.org
- Nicholas J. Cox, Durham University, UK. n.j.cox@durham.ac.uk