Benchmarks
Note
Stata touts massive speed improvements to sort and collapse
as of version 17. I do not have access to Stata 17, so I cannot
test this myself, but please be aware that the benchmarks below
are presumably outdated for gcollapse and hashsort.
Hardware
Stata/MP benchmarks were run on a Linux server with 8 cores.
    Program:   Stata/MP 15.2 (8 cores)
    OS:        x86_64 GNU/Linux
    Processor: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.50GHz
    Cores:     2 sockets with 6 cores per socket and 2 virtual threads per core.
    Memory:    141GiB
    Swap:      325GiB
Stata/IC benchmarks favor gtools more sharply, as do benchmarks on Stata 14 and earlier.
Summary
Versus native equivalents
Versus ftools
Note
Updated benchmarks against ftools are forthcoming. ftools is already a
very good speed improvement, and if you are working largely in Mata
I heartily recommend its API. However, it is still slower than gtools by
a factor of 2 to 10 (that is, gtools takes 50% to 90% less time).
The commands here are also faster than the commands provided by ftools;
further, gtools commands accept a mix of string and numeric variables,
which is a limitation of ftools.
Versus sort
I have implemented a hash-based sorting command, hashsort. This is
not an official part of gtools because it is not always faster
than regular sort. It has its uses, however. Namely, in Stata/IC it will
usually be faster than regular sort, and in both Stata/IC and
Stata/MP it will also be faster than Stata's own gsort:
Function | Versus | Speedup (IC) | Speedup (MP) |
---|---|---|---|
hashsort | sort | 2.5 to 4 | 0.7 to 0.9 |
hashsort | gsort | 2 to 18 | 1 to 6 |
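A comparison like the one in the table could be timed roughly as follows. This is only a sketch, not the script used to produce the numbers above: it assumes gtools is installed, that the benchmark data are in memory, and that "speedup" means the time taken by sort divided by the time taken by hashsort. The choice of sort variables is arbitrary.

```stata
* Sketch only: times sort against hashsort on the same unsorted data.
* Assumes gtools is installed and int1, double1 exist in memory.
timer clear

preserve
    timer on 1
    sort int1 double1
    timer off 1
restore

* restore brings back the original order, so both commands start
* from the same unsorted data.
preserve
    timer on 2
    hashsort int1 double1
    timer off 2
restore

timer list
* Assumed definition of speedup: time(sort) / time(hashsort)
display "speedup: " r(t1) / r(t2)
```

A ratio above 1 means hashsort was faster; a ratio below 1, as in the Stata/MP column against sort, means it was slower.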
Random data used
We create a data set with the number of groups we want and expand it to
the number of observations we want. For instance, we create a data set
with 10 observations and then expand it to 10M (via expand 1000000).
Each variable name indicates what the variable contains (int1 is an
integer, double1 is a double, and so on).
String variables were concatenated from a mix of arbitrary ASCII
characters and random strings from the ralpha package. All variables
include missing values and all strings include some blanks.
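The group-then-expand construction described above can be sketched in a few lines. This is a minimal illustration, not the actual generating script: the seed, the 1% missing rate, and the way double1 is drawn are assumptions, and the string variables built with ralpha are left out.

```stata
* Sketch only: 10 groups expanded to 10M observations.
* The seed, missing rate, and distribution are illustrative assumptions.
clear
set seed 42
set obs 10                       // one observation per group
gen long int1 = _n               // group identifier with 10 distinct values
expand 1000000                   // 1,000,000 copies of each row: 10M observations
gen double double1 = rnormal()   // a value that varies within group
replace double1 = . if runiform() < 0.01   // add some missing values
```

Changing the argument to set obs changes the number of groups, and changing the argument to expand changes the number of observations per group.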
    Contains data
      obs:    10,000,000
     vars:            19
     size: 1,460,000,000
    ------------------------------------------------------------
                     storage   display    value
    variable name    type      format     label    variable label
    ------------------------------------------------------------
    str_long         str5      %9s
    str_mid          str3      %9s
    str_short        str3      %9s
    str_4            str11     %11s
    str_12           str12     %12s
    str_32           str32     %32s
    int1             long      %12.0g
    int2             double    %10.0g
    int3             long      %12.0g
    double1          double    %10.0g
    double2          double    %10.0g
    double3          double    %10.0g
    runif_small_flt  float     %9.0g
    runif_small_dbl  double    %10.0g
    rnorm_small_flt  float     %9.0g
    rnorm_small_dbl  double    %10.0g
    runif_big_flt    float     %9.0g
    runif_big_dbl    double    %10.0g
    rnorm_big_flt    float     %9.0g
    ------------------------------------------------------------
    Sorted by:
Stata/MP Benchmarks
gcollapse
Simple
Complex
greshape
gcontract
gduplicates drop
gegen group
gisid
Non-unique
Unique
glevelsof
Note
levelsof received a significant speed improvement for numeric levels in
Stata 15, which is great. However, glevelsof is still at least twice
as fast for numeric levels, and orders of magnitude faster for string
levels. Furthermore, glevelsof takes multiple variables and can handle
a very large number of groups more efficiently (it can also bypass the
maximum macro length limit via the gen() and matasave options; see
here).
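As a rough illustration of the options mentioned in this note, the calls below use variables from the benchmark data. This is a sketch based on my reading of the glevelsof help file; the prefix uniq_ and the macro name int_levels are arbitrary, and the exact option syntax should be checked against help glevelsof for the installed version.

```stata
* Sketch only: option syntax assumed from the glevelsof help file.

* Levels of a numeric variable, stored in a local macro as with levelsof.
glevelsof int1, local(int_levels)

* Levels of several variables at once (a mix of string and numeric).
glevelsof str_short int1

* Many string levels: write them to a new variable instead of a macro,
* which avoids the maximum macro length limit. The prefix uniq_ is arbitrary.
glevelsof str_32, gen(uniq_) nolocal

* Alternatively, keep the levels in a Mata object.
glevelsof str_32 int1, matasave
```

The gen() and matasave forms are the ones to reach for when the number of groups is very large.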
gquantiles
pctile
xtile
with by (vs astile)
gstats summarize
gstats tab
gstats winsor
Without by
With by
gunique
gduplicates
hashsort
VS sort
VS gsort