Benchmarks

Note

Stata touts massive speed improvements to sort and collapse as of version 17. I do not have access to Stata 17, so I cannot test this myself; please be aware that the benchmarks below are likely outdated for gcollapse and hashsort.

Hardware

Stata/MP benchmarks were run on a Linux server with 8 cores.

Program:   Stata/MP 15.2 (8 cores)
OS:        x86_64 GNU/Linux
Processor: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.50GHz
Cores:     2 sockets with 6 cores per socket and 2 virtual threads per core.
Memory:    141GiB
Swap:      325GiB

Stata/IC benchmarks favor gtools more sharply, as do benchmarks on Stata 14 and earlier.

Summary

Versus native equivalents

Versus ftools

Note

Updated benchmarks against ftools are forthcoming. ftools is a very good speed improvement already, and if you work largely in Mata I heartily recommend its API. However, it is still slower than gtools by a factor of 2 to 10 (that is, gtools takes 50% to 90% less time).
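The conversion between a speedup factor and a percentage can be written out as follows. The symbols are notation introduced here for illustration only: s is the speedup factor, and t_ftools and t_gtools are the run times of the two packages on the same task.

```latex
\[
s = \frac{t_{\text{ftools}}}{t_{\text{gtools}}},
\qquad
\text{time saved} = 1 - \frac{t_{\text{gtools}}}{t_{\text{ftools}}} = 1 - \frac{1}{s}
\]
\[
s = 2 \;\Rightarrow\; 1 - \tfrac{1}{2} = 50\%,
\qquad
s = 10 \;\Rightarrow\; 1 - \tfrac{1}{10} = 90\%
\]
```

By the same formula, a speedup below 1 (such as the 0.7 to 0.9 reported for hashsort against sort in Stata/MP) means the command is slower than its comparison.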

The commands here are also faster than the commands provided by ftools; further, gtools commands accept a mix of string and numeric variables, which ftools does not support.

Versus sort

I have implemented a hash-based sorting command, hashsort. It is not an official part of gtools because it is not always faster than regular sort. It has its uses, however: in Stata/IC it will usually be faster than regular sort, and in both Stata/IC and Stata/MP it will also be faster than Stata's own gsort:

Function	Versus	Speedup (IC)	Speedup (MP)
hashsort	sort	2.5 to 4	0.7 to 0.9
	gsort	2 to 18	1 to 6
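A speedup like those in the table is the ratio of two run times. The following is a minimal sketch of how such a ratio might be measured with Stata's built-in timer. It is not the script used to produce the numbers above. The toy dataset, seed, and variable choices are assumptions for illustration, and the sketch assumes gtools is already installed (for example via ssc install gtools).

```stata
* Illustrative timing sketch; requires gtools to be installed.
clear
set seed 1729
set obs 1000000
gen long   int1    = ceil(runiform() * 1000)
gen double double1 = rnormal()

timer clear

* Timer 1: native sort, run on a preserved copy so the data
* is restored to its unsorted state afterwards
preserve
timer on 1
sort int1 double1
timer off 1
restore

* Timer 2: hashsort on the same unsorted data
preserve
timer on 2
hashsort int1 double1
timer off 2
restore

* timer list stores the elapsed seconds in r(t1) and r(t2)
timer list
display "speedup (sort / hashsort): " r(t1) / r(t2)
```

A ratio above 1 means hashsort was faster on that run. Single timings are noisy, so repeating the comparison several times and on data of realistic size gives a more trustworthy figure.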

Random data used

We create a dataset with the number of groups we want and expand it to the number of observations we want. For instance, we create a dataset with 10 observations and then expand it to 10M (via expand 1000000). Each variable name indicates what the variable contains (int1 is an integer, double1 is a double, and so on).

String variables were concatenated from a mix of arbitrary ASCII characters and random strings from the ralpha package. All variables include missing values and all strings include some blanks.

Contains data
  obs:    10,000,000
 vars:            19
 size: 1,460,000,000
------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------
str_long        str5    %9s
str_mid         str3    %9s
str_short       str3    %9s
str_4           str11   %11s
str_12          str12   %12s
str_32          str32   %32s
int1            long    %12.0g
int2            double  %10.0g
int3            long    %12.0g
double1         double  %10.0g
double2         double  %10.0g
double3         double  %10.0g
runif_small_flt float   %9.0g
runif_small_dbl double  %10.0g
rnorm_small_flt float   %9.0g
rnorm_small_dbl double  %10.0g
runif_big_flt   float   %9.0g
runif_big_dbl   double  %10.0g
rnorm_big_flt   float   %9.0g
------------------------------------------------------------
Sorted by:
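The construction described above (a small set of group rows expanded to the target size) can be sketched as follows. This is an illustration only, not the script that produced the benchmark data. The seed, the value ranges, the string pattern, and the choice to draw runif_small_flt after expanding are all assumptions, and only a few of the 19 variables are shown.

```stata
* Illustrative sketch: 10 group rows expanded to 10M observations.
clear
set seed 1729

* One row per group
set obs 10
gen long   int1     = ceil(runiform() * 1000)
gen double double1  = rnormal()
gen str5   str_long = "g" + string(_n, "%04.0f")

* Include a missing numeric value and a blank string
replace int1     = . in 1
replace str_long = "" in 2

* Each row becomes 1,000,000 copies: 10 rows -> 10,000,000 rows
expand 1000000

* Assumption: observation-level noise is drawn after expanding
gen float runif_small_flt = runiform()

describe
```

Changing the argument to set obs varies the number of groups, and changing the argument to expand varies the number of observations per group.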

Stata/MP Benchmarks

gcollapse

Simple

Complex

greshape

gcontract

gduplicates drop

gegen group

gisid

Non-unique

Unique

glevelsof

Note

Note that levelsof saw a significant speed improvement for numeric levels in Stata 15, which is great. However, glevelsof is still at least twice as fast for numeric levels, and orders of magnitude faster for string levels. Furthermore, glevelsof takes multiple variables and can handle a very large number of groups more efficiently (it can also bypass the maximum macro length limit via the gen() and matasave options).

gquantiles

pctile

xtile

with by (vs astile)

gstats summarize

gstats tab

gstats winsor

Without by

With by

gunique

gduplicates

hashsort

VS sort

VS gsort