Have weights been implemented yet?

Yes! Several gtools commands accept weights:

  • gcollapse and gcontract, which can match all the weight options in collapse and contract, respectively.

  • gegen, which can do weights for internally implemented functions (egen does not take weights, so functions that are not internally implemented cannot do weights either).

  • gquantiles and fasterxtile (which fix some possible issues with the weights implementation in pctile and xtile)

  • gtop and gtoplevelsof

Why do I get an error with strL variables?

strL variables in Stata allow storing up to 2GB of data in each entry (note that GB is not quite the same as GiB, and that 2GB of data is not quite the same as a string of that length).

This is great, but the Stata Plugin Interface has limited support for them. Version 2.0, which is used in Stata 13 and earlier, does not support strL variables at all, and version 3.0, which is used in Stata 14 and later, only supports reading from strL variables.

This means that gtools can only support strL variables in Stata 14 and later, and only for some commands. In particular, gcollapse and gcontract do not support strL variables because those commands have to write values to Stata, and that is not possible for strL variables using plugins. Further, strL variables can store binary data, and supporting that would require rewriting various portions of gtools; binary data support is planned for a future release but does not have an ETA.

Why do I get error r(608)?

Some users have reported that when gtools tries to create temporary files it gives an error code r(608) (file cannot be modified; likely because the directory is read-only or the file became read-only). The user can set the temporary directory manually via


where . will be the current directory. It can be any path to any existing directory as long as the user has read-write permission.

My computer has a 32-bit CPU

I have only compiled gtools for 64-bit CPUs. gtools uses 128-bit hashes split into 2 64-bit parts and the machines I have access to use 64-bit CPUs, so this seemed quite natural to me.

However, my understanding is that it should be possible to compile gtools to support 64-bit math on a 32-bit processor. Unfortunately, I do not have access to 32-bit machines, and I will not post a plugin as part of the package without adequate testing.

If you need to run gtools on a 32-bit machine, consider trying to compile the plugin yourself on the machine in question.

Why use platform-dependent plugins?

C is fast! When optimizing Stata, there are three options:

  • Mata (already implemented)
  • Java plugins (I don't like Java)
  • C and C++ plugins

Sergio Correia's ftools tests the limits of Mata and achieves excellent results, but Mata cannot compare to the raw speed a low-level language like C would afford. The only question is whether the speed gain compensates for the overhead of reading and writing data to and from C, and in this case it does.

Why no multi-threading?

Multi-threading is really difficult to support, especially because I could not figure out a cross-platform way to implement it. Perhaps if I had access to physical Windows and OSX hardware I would be able to do it, but I only have access to Linux hardware. And even then the multi-threading implementation that worked on my machine broke the plugin on older systems.

Basically my version of OpenMP, which is what I'd normally use, does not play nice with Stata's plugin interface or with older Linux versions. Perhaps I will come back to multi-threading in the future, but for now only the single-threaded version is available, and that is already a massive speedup!

How can this be faster?

As I understand it, many of Stata's underpinnings are already compiled C code. However, there are two explanations why this is faster than Stata's native commands:

  1. Hashing: I hash the data using a 128-bit hash and sort on this hash using a radix sort (a counting sort that sorts large integers X-bits at a time; I choose X to be 16). Sorting on a single integer is much faster than sorting on a collection of variables with arbitrary data. With a 128-bit hash you shouldn't have to worry about collisions (unless you're working with groups in the quintillions—that's 10^18). Hashing here is also faster than hashing in Sergio Correia's ftools, which uses a 32-bit hash and will run into collisions just with levels in the thousands, so he has to resolve collisions.

  2. Efficiency: While Stata's built-in commands are not necessarily inefficient, the fact is many of its commands are ado files written in an ad hoc manner. For instance, collapse loops through each statistic, computing them in turn. This amounts to one individual call to by per statistic, which is slow. Similar inefficiencies are found in egen, isid, levelsof, contract, and so on. While they are fast enough for even modestly sized data, when there are several million rows they begin to falter.

I think an interesting encapsulation of this second point is found in the helpfile for the community-contributed egenmore. One of the entries reads:

nmiss(exp) [ , by(byvarlist) ] returns the number of missing values in exp (...)
Remark: Why this was written is a mystery. The one-line command

    egen nmiss = sum(missing(exp)) (...) shows that it is unnecessary.

I independently wrote a very similar command, namely gegen nmissing, so I obviously disagree that the one-liner makes it unnecessary. The author of egen nmiss documents a concern with functionality and speed of coding, not speed of execution. For small data, that is actually the correct trade-off. There is no reason to spend a large portion of your time optimizing nmiss if you will only gain a tenth of a second. However, gtools is meant to be used on large datasets, in which case the inefficiencies add up: The proposed one-liner creates a temporary variable and then sums it by group. gtools, on the other hand, simply counts the number missing by group as it iterates through the variable in the first place, which is a faster algorithm.
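To make the contrast concrete, here is a minimal Python sketch of the two approaches. It is not the gtools C code: the function names are made up for illustration, and it assumes missing values are represented as NaN, which is not how Stata encodes them.

```python
import math
from collections import defaultdict


def nmissing_two_step(groups, values):
    """Mimic the one-liner: build an indicator column, then sum it by group."""
    # Step 1: materialize a temporary 0/1 variable as long as the data.
    indicator = [1 if math.isnan(v) else 0 for v in values]
    # Step 2: a second pass sums the temporary variable by group.
    totals = defaultdict(int)
    for g, flag in zip(groups, indicator):
        totals[g] += flag
    return dict(totals)


def nmissing_one_pass(groups, values):
    """Count missing values per group while reading the data once."""
    counts = defaultdict(int)
    for g, v in zip(groups, values):
        counts[g] += math.isnan(v)  # True counts as 1, False as 0
    return dict(counts)
```

Both functions return the same counts; the second simply avoids allocating and re-reading the temporary column, which is where the savings come from once the data has millions of rows.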

How does hashing work?

The point of using a hash is straightforward: Sorting a single integer variable is much faster than sorting multiple variables with arbitrary data. In particular I use a counting sort, which asymptotically performs in O(n) time compared to O(n log n) for the fastest general-purpose sorting algorithms. (Note that a single counting sort over 128-bit keys would be prohibitively expensive; instead, gtools commands do 4 passes of a counting sort, each sorting 16 bits at a time. If the groups are not unique after sorting on the first 64 bits, we sort on the full 128 bits.)
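The multi-pass idea can be sketched in a few lines of Python. This is only an illustration of a least-significant-digit radix sort on 64-bit keys with 16-bit digits, under the assumption that the keys are non-negative integers below 2^64; it is not the gtools implementation, and radix_sort_u64 is a hypothetical name.

```python
def radix_sort_u64(keys, bits_per_pass=16):
    """Sort unsigned 64-bit integers with 64 / bits_per_pass counting-sort passes."""
    buckets = 1 << bits_per_pass
    mask = buckets - 1
    for shift in range(0, 64, bits_per_pass):
        # Counting sort on the current digit: first tally each digit value.
        counts = [0] * buckets
        for k in keys:
            counts[(k >> shift) & mask] += 1
        # Turn the tallies into the first output slot for each digit value.
        position = 0
        for d in range(buckets):
            counts[d], position = position, position + counts[d]
        # Place keys in order; equal digits keep their input order (stable).
        out = [0] * len(keys)
        for k in keys:
            d = (k >> shift) & mask
            out[counts[d]] = k
            counts[d] += 1
        keys = out
    return keys
```

Each pass touches every key a constant number of times and needs only 2^16 counters, which is why four cheap passes are feasible where a single counting sort over the full key range is not.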

Given K by variables, by_1 to by_K, where by_k belongs to the set B_k, the general problem is to devise a function f such that f: B_1 x ... x B_K -> N, where N is the set of natural (whole) numbers. Given that B_k can contain integers, floats, or strings, the natural way of doing this is to use a hash: A function that takes an arbitrary sequence of data and outputs data of fixed size.
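As a rough illustration of such a function f, the Python sketch below serializes one row of by variables to bytes and hashes it to a single 128-bit integer. Everything here is an assumption made for the example: the byte encoding, the helper names encode_value and hash_row, and the use of BLAKE2b from Python's standard library as a stand-in 128-bit hash.

```python
import hashlib
import struct


def encode_value(value):
    """Serialize one by-variable value to bytes, tagged by its type."""
    if isinstance(value, str):
        data = value.encode("utf-8")
        # The length prefix keeps ("ab", "c") distinct from ("a", "bc").
        return b"s" + struct.pack("<I", len(data)) + data
    if isinstance(value, int):
        return b"i" + struct.pack("<q", value)  # assumes a 64-bit integer
    return b"f" + struct.pack("<d", float(value))


def hash_row(row):
    """Map a tuple of by-variable values to one 128-bit integer."""
    payload = b"".join(encode_value(v) for v in row)
    digest = hashlib.blake2b(payload, digest_size=16).digest()  # 16 bytes
    return int.from_bytes(digest, "little")
```

Rows with identical values always map to the same integer, so grouping reduces to sorting or comparing one fixed-size number per observation.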

In particular I use the Spooky Hash devised by Bob Jenkins, which is a 128-bit hash. Stata caps observations at 20 billion or so, meaning a 128-bit hash collision is de facto impossible. Nevertheless, the code is written to fall back on native commands should it encounter a collision.

An internal mechanism for resolving potential collisions is in the works. See issue 2 for a discussion.
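To put a rough number on "de facto impossible", the standard birthday-bound approximation can be evaluated directly. The sketch below assumes the hash behaves like a uniformly random function, which is an idealization, and collision_probability is a name chosen for this example.

```python
import math


def collision_probability(n_groups, hash_bits):
    """Approximate chance of any collision: 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = -n_groups * (n_groups - 1) / 2.0 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)
```

Under that assumption, collision_probability(20_000_000_000, 128) comes out below 10^-18, even if every one of 20 billion observations were its own group.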

Memory management with gcollapse

A C plugin cannot create or drop variables. This creates a problem for gcollapse (and possibly gcontract) when N is large and the number of groups J is small. For example, N = 100M means about 800MiB per variable and J = 1,000 means barely 8KiB per variable. Adding variables after the collapse is trivial, but adding them before the collapse may take several seconds.

The function tries to be smart about this: Variables are only created if the source variable cannot be replaced with the target. This conserves memory and speeds up execution time. (However, the function currently recasts unsuitably typed source variables, which saves memory but may slow down execution time.)

If there are more targets than sources, however, there are two options:

  1. Create the extra target variables in Stata before collapsing.
  2. Write the extra targets, collapsed, to disk and read them back later.

Ideally I could create the variables in Stata after collapsing and read them back from memory, but that is not possible. Hence we must choose one of the two options above, and it is not always obvious which will be faster.

Clearly for very large N and very small J, option 2 is faster. However, as J grows relative to N the trade-off is not obvious. First, variables still have to be created in Stata. So disk operations have to be faster than (N - J) / N of the time it takes for Stata to create the variables. In our example, disk operations on 8KiB per variable should be instantaneous and will almost surely be faster than operations on roughly 800MiB per variable in memory.

But what if J is 10M? Is operating on ~80MiB on disk faster than operating on ~720MiB in memory? The answer may well be no. What if J = 50M? Then the answer is almost surely no. For this reason, the code tries to benchmark how long it will take to collapse to disk and read back the data from disk versus creating the variables in memory and simply collapsing to memory.
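The sizes quoted above follow from simple arithmetic, sketched here in Python. The sketch assumes every value is stored as an 8-byte double and reports sizes in units of 10^6 bytes, which is how the round figures in the text come out; compare_options is a hypothetical helper, not part of gtools.

```python
BYTES_PER_VALUE = 8  # assumption: each value is an 8-byte double


def variable_size(n_obs):
    """Size of one variable with n_obs observations, in units of 10^6 bytes."""
    return n_obs * BYTES_PER_VALUE / 1_000_000


def compare_options(n, j):
    """Return (size written to disk, extra size held in memory) per extra target."""
    on_disk = variable_size(j)        # option 2: only the J collapsed rows
    in_memory = variable_size(n - j)  # option 1: the N - J rows beyond those J
    return on_disk, in_memory
```

For N = 100M this gives about 0.008 versus 800 for J = 1,000, 80 versus 720 for J = 10M, and 400 versus 400 for J = 50M, which is why the choice stops being obvious as J grows.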

This has a small overhead, so gcollapse will only try the switch when there are at least 4 additional targets to create. In testing, the overhead has been ~10% of the total runtime. If the user expects J to be large, they can turn off this check via forcemem. If the user expects J to be small, they can force collapsing to disk via forceio.