gquantiles
Efficiently compute percentiles, quantiles, categories, and frequency counts.
gquantiles is a by-able replacement for xtile, pctile, and _pctile that offers several additional features, like computing arbitrary quantiles (and an arbitrary number), frequency counts, and more (see the examples below).
gquantiles is also faster than the user-written fastxtile, so an alias, fasterxtile, is also provided.
Important
Run gtools, upgrade
to update gtools
to the latest stable version.
Syntax
The full syntax is
gquantiles [newvar =] exp [if] [in] [weight], {pctile | xtile | _ pctile} quantiles_method [ gquantiles_options ]
This function accepts by()
with xtile
and pctile
. However, you
can simply use it as a replacement for native Stata commands.
Equivaent to pctile (store the quantiles of exp
in newvar
):
gquantiles newvar = exp [if] [in] [weight], pctile [nquantiles(#) genp(newvarname) altdef]
Equivaent to xtile (store the categories of exp
in newvar
):
gquantiles newvar = exp [if] [in] [weight], xtile [nquantiles(#) cutpoints(numlist) altdef]
fasterxtile newvar = exp [if] [in] [weight], [nquantiles(#) cutpoints(numlist) altdef]
Equivaent to _pctile (return the percentiles of exp
):
gquantiles exp [if] [in] [weight], _pctile [nquantiles(#) percentiles(numlist) altdef]
The options and behavior of the above largely mimic that of the Stata native
commands. You only need to read the rest of the documentation if you wish to
use some of the additional features that gquantiles provides.
Weights
aweight, fweight, and pweight are allowed and mimic the weights in
pctile
, xtile
, or _pctile
(see help weight
and the weights
section in help pctile
). Weights are not allowed with altdef
.
Options
Quantiles method
gquantiles offers 4 ways of specifying quantiles and 3 ways of specifying
cutoffs. The behavior of each differs slightly when specifying pctile
,
xtile
, and _pctile
(see stored results for details).
-
nquantiles(#)
Number of quantiles to copmpute. gquantiles computes percentiles (quantiles) corresponding to 100 * k / m for k = 1, 2, ..., m - 1, where m = #. For example,nquantiles(10)
requests that the 10th, 20th, ..., 90th percentiles be computed. The default isnquantiles(2)
, the median. -
cutpoints(varname)
(xtile
orpctile
only). Requests that the values ofvarname
be used instead of the quantiles of the source variable. Like the native equivalent, all values of varname are used, regardless of anyif
orin
restriction (the user can pass the optioncutifin
to restrictcutpoints
to theif in
range or optioncutby
to use different cutpoints in each group).
Note that without any additional options, in the case ofpctile
all this does is sortvarname
and store it in the target variable. Further, unlike the native equivalent, this variable does not need to be sorted or unique. Missings are excluded by default and the variable is sorted internally. The user can request duplicates be excluded via the optiondedup
. -
quantiles(numlist)
orpercentiles(numlist)
. Requests percentiles (quantiles) corresponding to those specified bynumlist
. For example,percentiles(10(20)90)
requests that the 10th, 30th, 50th, 70th, and 90th percentiles be computed.
Unlike the native equivalent, this can be specified withpctile
andxtile
as well as_pctile
. Withpctile
the results are stored in the target variable. Withxtile
the rank is computed usingnumlist
. With_pctile
the return values are stored in r(r1), r(r2), ..., etc. Note that the number of return values that can be set is artificially capped at 1001 because they take a really long time to set. For more than 1001 return values consider usingcutquantiles()
withpctile
or see the computing many quantiles section in the examples below. -
quantmatrix(matrix)
. Requests percentiles (quantiles) corresponding to the entries of the matrix. This must be a column vector or a row vector. The behavior of gquantiles using this option is otherwise equivalent to its behavior when passingquantiles()
. -
cutquantiles(varname)
(xtile
orpctile
only). Requests that the values ofvarname
be used as the percentiles (quantiles) to compute.varname
must have all its values between 0 and 100. By default all values are read, regardless ofif in
restrictions (use optioncutifin
to restrict the range orcutby
to use different cut quantiles per group). Missing values are dropped and duplicates are allowed (use optiondedup
to drop duplicate quantiles).
This option is included to allow the user to compute an arbitrary number of quantiles in a reasonable amount of time. While the user can force more than 1001 return values with the_pctile
option, this is ill-advised since performance will suffer greatly. -
cutoffs(numlist)
(Only withxtile
,pctile
, or_pctile
withbincount
). This option is similar tocutpoints
, except that instead of using the values of a variable the function uses the values ofnumlist
as the cutoffs.
With option_pctile
this option is only allowed withbincount
, which requests a count of the number of instances of the source variable within the bins defined by the cutoffs. -
cutmatrix(matrix)
. Requests cutoffs corresponding to the entries of the matrix. This must be a column vector or a row vector. The behavior of gquantiles using this option is otherwise equivalent to its behavior when passingcutoffs()
.
gquantiles options
Standard Options
-
altdef
. Use an alternative definition for quantiles. When you have a finite number of values, as is the case with all data with N observations, there are at most N - 1 exact quantiles (assuming no duplicates). To compute arbitrary quantiles we need to make additional assumptions.
Let x(i) denote i-th smallest value of the relevant variable such that i > qN / 100. If i - 1 < qN / 100 then the quantile is x(i). However, if i - 1 = qN / 100 the convention is to use the average of x(i) and x(i - 1). Optionaltdef
defines i <= q(N + 1) / 100 and h = q(N + 1) / 100 - i. Then the quantile is (1 - h) * x(i) + h * x(i + 1) instead. -
genp(newvar)
specifies a new variable to be generated containing the percentages corresponding to the percentiles (quantiles) be created.
Extras
-
by(varlist)
Compute quantiles by group (pctile or xtile only).pctile[()]
requires optionstrict
, which has the effect of ignoring groups where the number of quantiles requested is larger than the number of non-missing observations within the group.by()
is most useful with optiongroupid(varname)
. -
groupid(varname)
Store group ID invarname
. -
_pctile
(Not with by.) does the computation in the style of the native command_pctile
. It stores return values in r(1), r(2), and so on, as wll as a matrix calledr(quantiles_used)
orr(cutoffs_used)
in case quantiles or cutoffs are requested. See stored results for details. -
pctile
orpctile(newvar)
Computes the quantiles of the source variable (i.e. callsgquantiles
in the style of the native commandpctile
). Thepctile(newvar)
syntax is provided to allow combiningpctile
with other options. See examples. -
xtile
orxtile(newvar)
Computes the category of the source variable using the categories defined by the cutoffs or implied by the quantiles (i.e. callsgquantiles
in the style of the native commandxtile
). Thextile(newvar)
syntax is provided to allow combiningxtile
with other options. See examples. -
binfreq
(Not with by.) orbinfreq(newvar)
Stores the frequency counts of the source variable within the bins defined by the quantiles or the cuoffs. When weights are specified, this stores the sum of the weights within that category. Ifnewvar
is passed then the frequency counts are stored innewvar
; otherwise they are stored in the matrixr(quantiles_bincount)
orr(cutoffs_bincount)
.
Switches
-
method(#)
(Not with by.) If you have many duplicates or are computing many quantiles, you should specifymethod(1)
. If you have few duplicates or are computing few quantiles you should specifymethod(2)
. By default,gquantiles
tries to guess which method will run faster. See computation methods in the examples section below. -
dedup
By default all quantiles and cutoffs are used in computations, regardless of duplicate values. For instance, if the user asks for quantiles 1, 90, 10, 10, and 1, then quantiles 1, 1, 10, 10, and 90 are used. With this option only 1, 10, and 90 would be used. -
cutifin
By default all values of the variable requested viacutpoints
orcutquantiles
are used. The reason Stata behaves in this way is thatcutpoints
was written to take in the output ofpctile
(this is howxtile
works internally). Here, bothcutpoints
andcutquantiles
are generic options, so the user can read all values or just valuesif in
. -
cutby
By default all values of the variable requested viacutpoints
orcutquantiles
are used. With this option, each group uses a different set of quantiles or cutoffs (note this automatically sets optioncutifin
). -
returnlimit(#)
Default is 1001. When the user runs gquantiles in the style of_pctile
return values r(1), r(2), and so on are set. The limit of quantiles that can be computed is the number of observations in the data, which can be millions or billions. Setting that many return values is computationally infeasible! The user can turn off this check viareturnlimit(.)
. This is ill-advised. To compute that many quantiles it is better to use thepctile
option. -
strict
Withoutby()
, exit with error if the number of quantiles requested is above the number of non-missing values. In the case of xtile, this also restricts the number of quantiles to the number of observations in the data. Withby()
, skip groups where this happens. Note that Stata limitsnquantiles()
withxtile
because it creates a temporary variable with the quantiles. Hence from the point of view of gquantiles this is an artificial limitation. -
minmax
(Not with by.) Additionally storer(min)
andr(max)
. Percentiles (quantiles) are required to be strictly between 0 and 100. However, the min and max are relatively trivial to compute given the internals of the command, so the user can request them here. -
replace
Replace targets, should they exist.
Gtools options
(Note: These are common to every gtools command.)
-
compress
Try to compress strL to str#. The Stata Plugin Interface has only limited support for strL variables. In Stata 13 and earlier (version 2.0) there is no support, and in Stata 14 and later (version 3.0) there is read-only support. The user can try to compress strL variables using this option. -
forcestrl
Skip binary variable check and force gtools to read strL variables (14 and above only). Gtools gives incorrect results when there is binary data in strL variables. This option was included because on some windows systems Stata detects binary data even when there is none. Only use this option if you are sure you do not have binary data in your strL variables. -
verbose
prints some useful debugging info to the console. -
benchmark
orbench(level)
prints how long in seconds various parts of the program take to execute. Level 1 is the same asbenchmark
. Levels 2 and 3 additionally prints benchmarks for internal plugin steps. -
hashmethod(str)
Hash method to use.default
automagically chooses the algorithm.biject
tries to biject the inputs into the natural numbers.spooky
hashes the data and then uses the hash. -
oncollision(str)
How to handle collisions. A collision should never happen but just in case it doesgtools
will try to use native commands. The user can specify it throw an error instead by passingoncollision(error)
.
Stored results
All calls store the following results
Scalars r(N) Number of observations r(min) Min (only if minmax was requested) r(max) Max (only if minmax was requested) r(nqused) Number of quantiles/cutoffs r(method_ratio) Rule used to decide between methods 1 and 2 r(nquantiles) Number of quantiles (only w nquantiles()) r(ncutpoints) Number of cutpoints (only w cutpoints()) r(nquantiles_used) Number of quantiles (only w quantiles()) r(nquantpoints) Number of quantiles (only w cutquantiles()) r(ncutoffs_used) Number of cutoffs (only w cutoffs()) r(r#) The #th quantile requested (only w _pctile) Macros r(quantiles) Quantiles used (only w percentiles() or quantiles()) r(cutoffs) Cutoffs used (only w option cutoffs()) Matrices r(quantiles_used) With _pctile or with quantiles() r(quantiles_binfreq) With option binfreq and any quantiles requested r(cutoffs_used) With _pctile or with cutoffs() r(cutoffs_binfreq) With option binfreq and any cutoffs requested
Methods and Formulas
The behavior of gquantiles
follows the behavior specified in
Stata's documentation
(in particular see the "Methods and formulas" section). We do not
repeat those formulas here.
One interesting thing to note is that Stata artificially limits the way in
which xtile
can be categorized. That is, the program defines categories
using the quantiles or the cutoffs as the right-inclusive endpoints. For
instance, nquantiles(NQ)
gives q(1), q(2), ..., q(NQ - 1) quantiles. (For
cutpoints(varname)
, q(i) is the i-th smallest value of varname
, and so
on for each way to specify quantiles and cutoffs). Now we have
| Category | Range | | -------- | ---------------------- | | 1 | (-Inf, q(1)] | | 2 | (q(1), q(2)] | | ... | ... | | NQ - 1 | (q(NQ - 2), q(NQ - 1)] | | NQ | (q(NQ - 1), Inf) |
In theory there is no reason to limit NQ. For example, the question "Where do these 4 values fit in these 1000 categories?" is a well-defined question. Even if there will be at least 996 categories contain no values, there is no reason to limit the number of categories to 4 (of course, since the 1000 categories are created from those 4 values, in practice this might not be adviseable).
So why does the limit exist in xtile
? It is actually a limit in pctile
,
which is used internally. Since pctile
stores the percentiles in a variable
in the data, this limit is actually very reasonable in this case (the same
limit is in gquantiles, pctile
). However, the user can request an arbitrary
number of quantiles via gquantiles, xtile nq(#)
. If the user does not like
this behavior, they can pass the option strict
.
Examples
You can download the raw code for the examples below here
Using by
One major feature that gquantile adds is by()
. It should be
internally consistent if the user specifies the option strict
.
For example,
clear set obs 1000000 gen group = int(runiform() * 100) gen x = runiform() local popts pctile strict by(group) cutby gquantiles pctile = x, `popts' nq(10) genp(perc) groupid(id) local xopts xtile strict by(group) cutby gquantiles xtile = x, `xopts' nq(10) gquantiles xtile2 = x, `xopts' cutpoints(pctile) gquantiles xtile3 = x, `xopts' cutquantiles(perc) assert xtile == xtile2 assert xtile == xtile3
However, there is no requirement for the user to do so:
sysuse auto, clear gquantiles xtile = price, xtile by(foreign) nq(50)
Computing many quantiles
Stata's _pctile caps the number of quantiles to 1001. pctile uses _pctile internally, so to compute more than 1001 percentiles it needs to loop over various runs of _pctile in a very inefficient way. This inefficiency carries over to xtile because that command uses pctile internally. (Presumably this is the reason for the limit in the user-written fastxtile).
The following executes with no errors in a reasonable amount of time
clear set obs 1000000 gen x = runiform() _pctile x, nq(1001) pctile p1 = x, nq(1001) gquantiles p2 = x, pctile nq(1001)
However, if you increase nq
the runtimes become excessive:
drop p* timer clear timer on 90 pctile p1 = x, nq(5001) timer off 90 timer on 10 gquantiles p2 = x, pctile nq(5001) timer off 10 assert p1 == p2 timer list
10: 0.42 / 1 = 0.4200 90: 61.14 / 1 = 61.1440
61 seconds for only 1,00,000 observations! This is in Stata/MP with 8 cores. gquantiles scales nicely, by contrast:
drop p* timer clear timer on 10 gquantiles p2 = x, pctile nq(`=_N + 1') timer off 10 clear set obs 100000000 gen x = runiform() * 100 timer on 20 gquantiles p2 = x, pctile nq(`=_N + 1') timer off 20 timer list
10: 0.44 / 1 = 0.4430 20: 36.53 / 1 = 36.5270 90: 61.14 / 1 = 61.1440
That's right, gquantiles computed 100M quantiles for 100M observations in 36 seconds, faster than pctile could compute 5,000 quantiles for 1M observations. As a side-note, using mata can afford a massive speedup, obviating the need to call C in case gquantiles does not do something you want. Consider:
clear all timer clear mata: void function mata_pctile (string scalar newvar, string scalar sourcevar, real scalar nq) { real scalar N real colvector X, quantiles, qpositions, qties, qtiesix, Q, Qties X = st_data(., sourcevar) N = rows(X) _sort(X, 1) quantiles = ((1::(nq - 1)) * N / nq) qpositions = ceil(quantiles) qties = (qpositions :== quantiles) Q = X[qpositions] if ( any(qties) ) { qtiesix = selectindex(qties) Qties = X[qpositions[qtiesix] :+ 1] Q[qtiesix] = (Q[qtiesix] + Qties) / 2 } st_addvar("`:set type'", newvar) st_store((1::(nq - 1)), newvar, Q) } end set obs 1000000 gen x = runiform() timer on 80 mata: mata_pctile("p0", "x", 5001) timer off 80 timer on 90 pctile p1 = x, nq(5001) timer off 90 timer on 10 gquantiles p2 = x, pctile nq(5001) timer off 10 assert p0 == p1 assert p1 == p2 timer list
10: 0.48 / 1 = 0.4770 80: 1.23 / 1 = 1.2300 90: 56.97 / 1 = 56.9720
Just by using mata we speeded up Stata 50 times! The mata solution does not scale as well as gquantiles, however:
clear timer clear set obs 10000000 gen x = runiform() timer on 80 mata: mata_pctile("p0", "x", 5001) timer off 80 timer on 10 gquantiles p2 = x, pctile nq(5001) timer off 10 timer list
10: 2.97 / 1 = 2.9720 80: 14.40 / 1 = 14.4030
With just 10M observations, gquantiles is still a 5x improvement over mata when computing many quantiles.
Computation methods
Computing quantiles involves selecting elements from an unordered array in one of two ways: Using a selection algorithm on the unsorted variable or sorting and then selecting elements of the sorted varaible.
The internal selection algorithm of gquantiles is very fast and on average will run in linear O(N) time (see quickselect). The sorting algorithm runs in O(N log(N)) time (see quicksort). Clearly, with few quantiles we can see the selection algorithm will be faster. However, with a large number of quantiles running multiple iterations of the selection algorithm is clearly slower than doing a single sort.
clear timer clear set obs 10000000 gen x = rnormal() * 100 timer on 10 gquantiles p1 = x, pctile nq(2) method(1) timer off 10 timer on 20 gquantiles p2 = x, pctile nq(2) method(2) timer off 20 assert p1 == p2 timer list
10: 3.39 / 1 = 3.3870 20: 1.10 / 1 = 1.1020
We can see that method 2 was more than 3 times faster for a single quantile.
timer clear timer on 10 gquantiles p1 = x, pctile nq(10) method(1) replace timer off 10 timer on 20 gquantiles p2 = x, pctile nq(10) method(2) replace timer off 20 timer list
10: 3.43 / 1 = 3.4290 20: 2.28 / 1 = 2.2840
While method 2 was still faster, computing 10 quantiles took twice the time it took to compute 1. By contrast, method 1 took essentially the same time. This is because after sorting the data, selecting elements is nearly instantaneous.
timer clear timer on 10 gquantiles p1 = x, pctile nq(100) method(1) replace timer off 10 timer on 20 gquantiles p2 = x, pctile nq(100) method(2) replace timer off 20 timer list
10: 3.22 / 1 = 3.2210 20: 7.04 / 1 = 7.0370
With 100 quantiles we can see that the performance of method 2 is now much worse than method 1. Internally, gquantiles will try to switch between the two methods based on the number of observations and the number of quantiles. You might be tempted to always specify method 2 for few quantiles, but there is a second way in which it is slower than sorting:
timer clear replace x = int(x) timer on 10 gquantiles p1 = x, pctile nq(10) method(1) replace timer off 10 timer on 20 gquantiles p2 = x, pctile nq(10) method(2) replace timer off 20 timer list
10: 1.29 / 1 = 1.2940 20: 1.48 / 1 = 1.4760
What happened? While both commands are faster, now method 1 is faster than method 2, whereas before it was 50% slower. This is because the specific sorting algorithm I use handles duplicates better than the selection algorithm.
timer clear timer on 10 gquantiles p1 = x, pctile nq(100) method(1) replace timer off 10 timer on 20 gquantiles p2 = x, pctile nq(100) method(2) replace timer off 20 timer list
10: 1.54 / 1 = 1.5390 20: 4.31 / 1 = 4.3080
Again, both are faster with duplicates, but method 1 is much faster.
Multiple subcommands
gquantiles allows the user to compute several things at once:
sysuse auto, clear gquantiles price, _pctile xtile(x1) pctile(p1) binfreq nq(10) matrix list r(quantiles_binfreq) l price x1 p1 in 1/10
r(quantiles_binfreq)[9,1] c1 r1 8 r2 7 r3 8 r4 7 r5 7 r6 8 r7 7 r8 8 r9 7 +----------------------+ | price x1 p1 | |----------------------| 1. | 4,099 2 3895 | 2. | 4,749 5 4099 | 3. | 3,799 1 4425 | 4. | 4,816 5 4647 | 5. | 7,827 8 5006.5 | |----------------------| 6. | 5,788 7 5705 | 7. | 4,453 4 6165 | 8. | 5,189 6 7827 | 9. | 10,372 9 11385 | 10. | 4,082 2 . | +----------------------+
Specifying quantiles and cutoffs
gquantiles allows for several ways to specify cutoffs
sysuse auto, clear gquantiles price, _pctile p(10(10)99) matrix p0 = r(quantiles_used) gquantiles p1 = price, pctile nq(10) genp(g1) xtile(x1) gquantiles x2 = price, xtile cutpoints(p1) gquantiles x3 = price, xtile cutquantiles(g1) qui glevelsof p1 gquantiles x4 = price, xtile cutoffs(`r(levels)') qui glevelsof g1 gquantiles x5 = price, xtile quantiles(`r(levels)') matrix list p0 l p1 g1 x? in 1/10
p0[9,1] c1 r1 3895 r2 4099 r3 4425 r4 4647 r5 5006.5 r6 5705 r7 6165 r8 7827 r9 11385 . l p1 g1 x? in 1/10 +--------------------------------------+ | p1 g1 x1 x2 x3 x4 x5 | |--------------------------------------| 1. | 3895 10 2 2 2 2 2 | 2. | 4099 20 5 5 5 5 5 | 3. | 4425 30 1 1 1 1 1 | 4. | 4647 40 5 5 5 5 5 | 5. | 5006.5 50 8 8 8 8 8 | |--------------------------------------| 6. | 5705 60 7 7 7 7 7 | 7. | 6165 70 4 4 4 4 4 | 8. | 7827 80 6 6 6 6 6 | 9. | 11385 90 9 9 9 9 9 | 10. | . . 2 2 2 2 2 | +--------------------------------------+