sort and gsort using hashes and C-plugins
While hashsort should always be faster than gsort, it might not always be faster than regular sort. In testing, hashsort was always faster in Stata/IC but not always in Stata/MP. If there are lots of duplicates then hashsort might be faster; if the sort variables are unique then hashsort will probably be slower.
gtools, upgrade to update
gtools to the latest stable version.
hashsort [+|-] varname [[+|-] varname ...] [, options ]
hashsort uses C-plugins to implement a hash-based sort that is always faster than sort for sorting groups and faster than gsort in general. hashsort hashes the data and sorts the hash, and then it sorts one observation per group. The fewer the number of gorups relative to the number of observations, the larger the speed gain.
If the sort is expected to be unique or if the number of groups is large, then this comes at a potentially large memory penalty and it may not be faster than sort (the exception is when the sorting variables are all integers).
Each varname can be numeric or a string. The observations are placed in ascending order of varname if + or nothing is typed in front of the name and are placed in descending order if - is typed. hashsort always produces a stable sort.
gen(varname)Store group ID in generate.
sortgenSet data sortby variable to
generateexits, it is replaced.
skipcheckSkip internal is sorted check.
(Note: These are common to every gtools command.)
compressTry to compress strL to str#. The Stata Plugin Interface has only limited support for strL variables. In Stata 13 and earlier (version 2.0) there is no support, and in Stata 14 and later (version 3.0) there is read-only support. The user can try to compress strL variables using this option.
forcestrlSkip binary variable check and force gtools to read strL variables (14 and above only). Gtools gives incorrect results when there is binary data in strL variables. This option was included because on some windows systems Stata detects binary data even when there is none. Only use this option if you are sure you do not have binary data in your strL variables.
verboseprints some useful debugging info to the console.
bench(level)prints how long in seconds various parts of the program take to execute. Level 1 is the same as
benchmark. Levels 2 and 3 additionally prints benchmarks for internal plugin steps.
hashmethod(str)Hash method to use.
defaultautomagically chooses the algorithm.
bijecttries to biject the inputs into the natural numbers.
spookyhashes the data and then uses the hash.
oncollision(str)How to handle collisions. A collision should never happen but just in case it does
gtoolswill try to use native commands. The user can specify it throw an error instead by passing
You can download the raw code for the examples below here
. sysuse auto, clear . hashsort price . hashsort +price . hashsort rep78 -price . hashsort make . hashsort foreign -make
One thing that is useful is that hashsort can encode a set of variables and set the encoded variable as the sorting variable:
. sysuse auto, clear (1978 Automobile Data) . hashsort foreign -rep78, gen(id) sortgen (note: missing values will be sorted last) . disp "`: sortedby'" id . tab id id | Freq. Percent Cum. ------------+----------------------------------- 1 | 4 5.41 5.41 2 | 2 2.70 8.11 3 | 9 12.16 20.27 4 | 27 36.49 56.76 5 | 8 10.81 67.57 6 | 2 2.70 70.27 7 | 1 1.35 71.62 8 | 9 12.16 83.78 9 | 9 12.16 95.95 10 | 3 4.05 100.00 ------------+----------------------------------- Total | 74 100.00