FAQs
Have weights been implemented yet?
Yes! Several gtools commands accept weights:

gcollapse and gcontract, which can match all the weight options in collapse and contract, respectively.
gegen, which can do weights for internally implemented functions (egen does not take weights, so functions that are not internally implemented cannot do weights either).
gquantiles and fasterxtile (which fix some possible issues with the weights implementation in pctile and xtile)
gtop and gtoplevelsof
Why do I get an error with strL variables?
strL variables in Stata allow storing up to 2GB of data in each entry
(note that this is not quite the same as 2GiB and not quite the same as a
string of that length).
This is great, but the Stata Plugin Interface has limited support for
them. Version 2.0, which is used in Stata 13 and earlier, does not
support strL variables at all, and version 3.0, which is used in Stata
14 and later, only supports reading from strL variables.
This means that gtools can only support strL variables in Stata 14
and later, and only for some commands. In particular, gcollapse and
gcontract do not support strL variables because those commands have to
write values to Stata, and that is not possible for strL variables using
plugins. Further, strL variables can store binary data, and supporting
that would require rewriting various portions of gtools; binary data
support is planned for a future release but does not have an ETA.
Why do I get error r(608)?
Some users have reported that when gtools tries to create temporary
files it gives error code r(608) (file cannot be modified; likely
because the directory is read-only or the file became read-only). The
user can set the temporary directory manually via
global GTOOLS_TEMPDIR .
where . is the current directory. It can be any path to an existing
directory as long as the user has read-write permission.
My computer has a 32-bit CPU
I have only compiled gtools for 64-bit CPUs. gtools uses 128-bit hashes split into two 64-bit parts and the machines I have access to use 64-bit CPUs, so this seemed quite natural to me.
However, my understanding is that it should be possible to compile gtools to support 64-bit math on a 32-bit processor. Unfortunately, I do not have access to 32-bit machines, and I will not post a plugin as part of the package without adequate testing.
If you need to run gtools on a 32-bit machine, consider trying to compile the plugin yourself on the machine in question.
Why use platform-dependent plugins?
C is fast! When optimizing Stata, there are three options:
 Mata (already implemented)
 Java plugins (I don't like Java)
 C and C++ plugins
Sergio Correia's ftools tests the limits of Mata and achieves excellent
results, but Mata cannot compare to the raw speed a low-level language like
C affords. The only question is whether the speed gain compensates for the
overhead of reading and writing data to and from C, and in this case it does.
Why no multithreading?
Multithreading is really difficult to support, especially because I could not figure out a cross-platform way to implement it. Perhaps if I had access to physical Windows and OSX hardware I would be able to do it, but I only have access to Linux hardware. And even then the multithreading implementation that worked on my machine broke the plugin on older systems.
Basically my version of OpenMP, which is what I'd normally use, does not play nice with Stata's plugin interface or with older Linux versions. Perhaps I will come back to multithreading in the future, but for now only the single-threaded version is available, and that is already a massive speedup!
How can this be faster?
As I understand it, many of Stata's underpinnings are already compiled C code. However, there are two reasons why this is faster than Stata's native commands:

Hashing: I hash the data using a 128-bit hash and sort on this hash using a radix sort (a counting sort that sorts large integers X bits at a time; I choose X to be 16). Sorting on a single integer is much faster than sorting on a collection of variables with arbitrary data. With a 128-bit hash you shouldn't have to worry about collisions (unless you're working with groups in the quintillions, that is, 10^18). Hashing here is also faster than hashing in Sergio Correia's ftools, which uses a 32-bit hash and will run into collisions just with levels in the thousands, so he has to resolve collisions.
Efficiency: While Stata's built-in commands are not necessarily inefficient, the fact is many of its commands are ado files written in an ad hoc manner. For instance, collapse loops through each statistic, computing them in turn. This amounts to one individual call to by per statistic, which is slow. Similar inefficiencies are found in egen, isid, levelsof, contract, and so on. While they are fast enough for even modestly-sized data, when there are several million rows they begin to falter.
I think an interesting encapsulation of this second point is found
in the help file for the community-contributed egenmore. One of the
entries reads:
nmiss(exp) [ , by(byvarlist) ] returns the number of missing values in exp (...) Remark: Why this was written is a mystery. The one-line command egen nmiss = sum(missing(exp)) (...) shows that it is unnecessary.
I independently wrote a very similar command, namely gegen nmissing,
so I obviously disagree that the one-liner makes it unnecessary. The author
of egen nmiss documents concern with functionality and speed of
coding, not speed of execution. For small data, that is actually the
correct trade-off. There is no reason to spend a large portion of your
time optimizing nmiss if you will only gain a tenth of a second.
However, gtools is meant to be used on large datasets, in which case
the inefficiencies add up: The one-liner proposed creates a temporary
variable and then sums it by group. gtools, on the other hand, simply
counts the number missing by group as it iterates through the variable
in the first place, which is a faster algorithm.
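To make the contrast concrete, here is a minimal Python sketch of the two approaches. It is not gtools or egen code: the function names are hypothetical, and missing values are represented as None purely for illustration.

```python
from collections import defaultdict


def nmiss_two_pass(groups, values):
    """Illustrative stand-in for the one-liner: temporary indicator, then a grouped sum."""
    # Pass 1: materialize a temporary 0/1 indicator with one entry per row.
    is_missing = [1 if v is None else 0 for v in values]
    # Pass 2: sum the temporary indicator within each group.
    totals = defaultdict(int)
    for g, flag in zip(groups, is_missing):
        totals[g] += flag
    return dict(totals)


def nmiss_single_pass(groups, values):
    """Illustrative stand-in for the direct approach: count while reading the data once."""
    totals = defaultdict(int)
    for g, v in zip(groups, values):
        totals[g] += v is None  # a bool adds as 0 or 1; no temporary column is allocated
    return dict(totals)
```

Both functions return the same counts; the second skips the extra column of N entries and the second pass over the data.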
How does hashing work?
The point of using a hash is straightforward: Sorting a single integer
variable is much faster than sorting multiple variables with arbitrary
data. In particular I use a counting sort, which asymptotically performs
in O(n) time compared to O(n log n) for the fastest general-purpose
sorting algorithms. (Note that with a 128-bit hash a single counting sort
is prohibitively expensive; gtools commands do 4 passes of a counting
sort, each sorting 16 bits at a time; if the groups are not unique after
sorting on the first 64 bits we sort on the full 128 bits.)
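The multi-pass counting sort described above can be sketched in Python as follows. This is an illustrative least-significant-digit radix sort over 64-bit keys, not the C code in gtools; the function name and the choice to return a permutation are assumptions made for the example.

```python
def radix_sort_order(keys, bits_per_pass=16, key_bits=64):
    """Return the indices that sort unsigned integer keys (stable LSD radix sort)."""
    n_buckets = 1 << bits_per_pass
    mask = n_buckets - 1
    order = list(range(len(keys)))
    # One counting-sort pass per digit, starting with the lowest 16 bits.
    for shift in range(0, key_bits, bits_per_pass):
        counts = [0] * (n_buckets + 1)
        for i in order:
            counts[((keys[i] >> shift) & mask) + 1] += 1
        # Prefix sums turn digit counts into the starting slot of each digit.
        for d in range(n_buckets):
            counts[d + 1] += counts[d]
        out = [0] * len(order)
        for i in order:
            d = (keys[i] >> shift) & mask
            out[counts[d]] = i
            counts[d] += 1
        order = out
    return order
```

With 16 bits per pass a 64-bit key takes 4 passes, each linear in the number of keys. Because every pass is stable, equal keys keep their original relative order.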
Given K by variables, by_1 to by_K, where by_k belongs to the set B_k,
the general problem is to devise a function f such that
f: B_1 x ... x B_K -> N, where N is the set of natural (whole) numbers.
Given B_k can be integers, floats, and strings, the natural way of doing
this is to use a hash: A function that takes an arbitrary sequence of
data and outputs data of fixed size.
In particular I use the Spooky Hash devised by Bob Jenkins, which is a 128-bit hash. Stata caps observations at 20 billion or so, meaning a 128-bit hash collision is de facto impossible. Nevertheless, the code is written to fall back on native commands should it encounter a collision.
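As a rough check on that claim, the standard union (birthday) bound on the chance of any collision among n hashes can be computed directly. The sketch below assumes hash values are uniformly distributed, which is an idealization, and the function name is hypothetical.

```python
def collision_bound(n, bits=128):
    """Upper bound on P(any collision) among n uniform hashes: n(n-1)/2 pairs over 2^bits values."""
    return n * (n - 1) / 2 / 2**bits


# Using the "20 billion or so" figure from the text as n.
bound = collision_bound(20_000_000_000)
```

Under that assumption the bound for 20 billion observations is below 10^-18.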
An internal mechanism for resolving potential collisions is in the works. See issue 2 for a discussion.
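The mapping f from a row of by variables to a single integer can be sketched in Python as below. SpookyHash is not in the Python standard library, so this example swaps in BLAKE2b truncated to 128 bits; the encoding of each value (a type tag, plus a length prefix for strings) and both function names are assumptions made for illustration, not how gtools lays out its bytes.

```python
import hashlib
import struct


def hash_row(row):
    """Map a tuple of ints, floats, and strings to one 128-bit integer."""
    h = hashlib.blake2b(digest_size=16)  # stand-in for SpookyHash (assumption)
    for value in row:
        if isinstance(value, str):
            data = value.encode("utf-8")
            # The length prefix keeps ("ab", "c") distinct from ("a", "bc").
            h.update(b"s" + struct.pack("<Q", len(data)) + data)
        elif isinstance(value, int):
            h.update(b"i" + struct.pack("<q", value))  # assumes a signed 64-bit range
        else:
            h.update(b"f" + struct.pack("<d", value))
    return int.from_bytes(h.digest(), "little")


def group_ids(rows):
    """Assign consecutive group ids in order of first appearance, keyed on the hash."""
    ids = {}
    return [ids.setdefault(hash_row(r), len(ids)) for r in rows]
```

Once every row is reduced to one fixed-size integer, grouping and sorting only ever touch that integer, regardless of how many by variables there are or what types they hold.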
Memory management with gcollapse
C plugins cannot create or drop variables. This creates a problem for
gcollapse (and possibly gcontract) when N is large and the number
of groups J is small. For example, N = 100M means about 800MiB per
variable and J = 1,000 means barely 8KiB per variable. Adding variables
after the collapse is trivial, whereas before the collapse it may take
several seconds.
The function tries to be smart about this: Variables are only created if the source variable cannot be replaced with the target. This conserves memory and speeds up execution time. (However, the function currently recasts unsuitably typed source variables, which saves memory but may slow down execution time.)
If there are more targets than sources, however, there are two options:
 Create the extra target variables in Stata before collapsing.
 Write the extra targets, collapsed, to disk and read them back later.
Ideally I could create the variables in Stata after collapsing and read them back from memory, but that is not possible. Hence we must choose one of the two options above, and it is not always obvious which will be faster.
Clearly for very large N and very small J, option 2 is faster. However, as J grows relative to N the trade-off is not obvious. First, variables still have to be created in Stata. So disk operations have to be faster than (N - J) / N of the time it takes for Stata to create the variables. In our example, disk operations on 8KiB per variable should be instantaneous and will almost surely be faster than operations on roughly 800MiB per variable in memory.
But what if J is 10M? Is operating on ~80MiB on disk faster than ~720MiB in memory? The answer may well be no. What if J = 50M? Then the answer is almost surely no. For this reason, the code tries to benchmark how long it will take to collapse to disk and read back the data from disk versus creating the variables in memory and simply collapsing to memory.
This has a small overhead, so gcollapse will only try the switch when
there are at least 4 additional targets to create. In testing, the
overhead has been ~10% of the total runtime. If the user expects J to be
large, they can turn off this check via forcemem. If the user expects
J to be small, they can force collapsing to disk via forceio.
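The arithmetic behind this trade-off can be sketched in Python. This is not the benchmark gcollapse runs: the function names and the timing inputs are hypothetical, and 8 bytes per value is an assumption consistent with the per-variable sizes quoted in this section.

```python
BYTES_PER_VALUE = 8  # assumption: each value is stored as an 8-byte double


def bytes_per_variable(n_rows):
    """Approximate storage for one variable with n_rows entries."""
    return n_rows * BYTES_PER_VALUE


def disk_is_faster(n_obs, n_groups, create_seconds, disk_seconds):
    """Rule from the text: disk wins if it beats (N - J) / N of the in-memory creation time."""
    return disk_seconds < (n_obs - n_groups) / n_obs * create_seconds


# N = 100M rows versus J = 1,000 groups, as in the example above.
full_size = bytes_per_variable(100_000_000)   # 800,000,000 bytes
collapsed_size = bytes_per_variable(1_000)    # 8,000 bytes
```

With made-up timings, disk_is_faster(100_000_000, 1_000, 2.0, 0.01) is True, while disk_is_faster(100_000_000, 50_000_000, 2.0, 1.5) is False because disk I/O would have to finish in under half of the creation time.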