glevelsof
Efficiently get levels of variable using C plugins
glevelsof displays a sorted list of the distinct values of varlist. It is meant to be a fast replacement of levelsof. Unlike levelsof, it can take a single variable or multiple variables.
Important
Run gtools, upgrade to update gtools to the latest stable version.
Syntax
glevelsof varlist [if] [in] [, options ]
Instead of varlist, it is possible to specify
[+|-] varname [[+|-] varname ...]
This will not affect the levels recovered but it will affect the sort order in which they are stored and printed.
Options
-
cleandisplays string values without compound double quotes. By default, each distinct string value is displayed within compound double quotes, as these are the most general delimiters. If you know that the string values in varlist do not include embedded spaces or embedded quotes, this is an appropriate option. clean does not affect the display of values from numeric variables. -
local(macname)inserts the list of values in local macro macname within the calling program's space. Hence, that macro will be accessible after glevelsof has finished. This is helpful for subsequent use, especially with foreach. -
missingspecifies that missing values of varlist should be included in the calculation. The default is to exclude them. -
separate(separator)specifies a separator to serve as punctuation for the values of the returned list. The default is a space. A useful alternative is a comma.
Extras
-
nolocalDo not storevarlistlevels in a local macro. This is specially useful withgen() -
silentDo not display the levels of varlist. Mainly for use withgen()andmatasave. Withmatasave, the levels are not sepparately stored as a string matrix, but the raw levels are kept. -
matasave[(str)]Save results in mata object (default name is GtoolsByLevels). SeeGtoolsByLevels.desc()for more. This object contains the raw variable levels innumxandcharx(since mata does not allow matrices of mixed-type). The levels are saved as a string inprinted(with value labels correctly applied) unless optionsilentis also specified. -
gen([prefix], [replace])Store the unique levels ofvarlistin a new varlist prefixed byprefixorreplacethevarlistwith its unique levels. The two optoins are mutually exclusive. -
colseparate(separator)specifies a separator to serve as punctuation for the columns of the returned list. The default is a pipe. Specifying a varlist instead of a varname is only useful for double loops or for use with gettoken. -
numfmt(format)Number format for printing. By default numbers are printed to 16 digits of precision, but the user can specify the number format here. By default, only "%.#g|f" and "%#.#g|f" are accepted since this is formated internally in C. However, with optionmatasavethis is formated in mata and has to be a mata format. -
unsortedDo not sort levels. This option is experimental and only affects the output when the input is not an integer (for integers, the levels are sorted internally regardless). While not sorting the levels is faster,glevelsofis typically used when the number of levels is small (10s, 100s, 1000s) and thus speed savings will be minimal.
Gtools options
(Note: These are common to every gtools command.)
-
compressTry to compress strL to str#. The Stata Plugin Interface has only limited support for strL variables. In Stata 13 and earlier (version 2.0) there is no support, and in Stata 14 and later (version 3.0) there is read-only support. The user can try to compress strL variables using this option. -
forcestrlSkip binary variable check and force gtools to read strL variables (14 and above only). Gtools gives incorrect results when there is binary data in strL variables. This option was included because on some windows systems Stata detects binary data even when there is none. Only use this option if you are sure you do not have binary data in your strL variables. -
verboseprints some useful debugging info to the console. -
benchmarkorbench(level)prints how long in seconds various parts of the program take to execute. Level 1 is the same asbenchmark. Levels 2 and 3 additionally prints benchmarks for internal plugin steps. -
hashmethod(str)Hash method to use.defaultautomagically chooses the algorithm.bijecttries to biject the inputs into the natural numbers.spookyhashes the data and then uses the hash. -
oncollision(str)How to handle collisions. A collision should never happen but just in case it doesgtoolswill try to use native commands. The user can specify it throw an error instead by passingoncollision(error).
Stored results
glevelsof stores the following in r():
Macros
r(levels) list of distinct values
r(sep) Row separator
r(colsep) Column sezparator
Scalars
r(N) number of non-missing observations
r(J) number of groups
r(minJ) largest group size
r(maxJ) smallest group size
When matasave is passed, the following data is stored in GtoolsByLevels:
real scalar anyvars
1: any by variables; 0: no by variables
real scalar anychar
1: any string by variables; 0: all numeric by variables
real scalar anynum
1: any numeric by variables; 0: all string by variables
string rowvector byvars
by variable names
real scalar kby
number of by variables
real scalar rowbytes
number of bytes in one row of the internal by variable matrix
real scalar J
number of levels
real matrix numx
numeric by variables
string matrix charx
string by variables
real scalar knum
number of numeric by variables
real scalar kchar
number of string by variables
real rowvector lens
> 0: length of string by variables; <= 0: internal code for numeric variables
real rowvector map
map from index to numx and charx
real rowvector charpos
position of kth character variable
string matrix printed
formatted (printf-ed) variable levels (not with option -silent-)
Remarks
glevelsof serves two different functions. First, it gives a compact display of the distinct values of varlist. More commonly, it is useful when you desire to cycle through the distinct values of varlist with (say) foreach; see [P] foreach. glevelsof leaves behind a list in r(levels) that may be used in a subsequent command.
glevelsof may hit the limits imposed by your Stata. However, it is typically used when the number of distinct values of varlist is modest. If you have many levels in varlist then an alternative may be gtoplevelsof, which shows the largest or smallest levels of a varlist by their frequency count.
Examples
You can download the raw code for the examples below
here ![]()
. sysuse auto (1978 Automobile Data) . glevelsof rep78 1 2 3 4 5 . qui glevelsof rep78, miss local(mylevs) . display "`mylevs'" 1 2 3 4 5 . . glevelsof rep78, sep(,) 1,2,3,4,5
De-duplicating a variable list
glevelsof can store the unique levels of a varlist. This is specially
useful when the user wants to obtain the unique levels but runs up
against the stata macro variable limit.
. set seed 42 . clear . set obs 100000 obs was 0, now 100000 . gen x = "a long string appeared" + string(mod(_n, 10000)) . gen y = int(10 * runiform()) . glevelsof x macro substitution results in line that is too long The line resulting from substituting macros would be longer than allowed. The maximum allowed length is 165,216 characters, which is calculated on the basis of set maxvar. You can change that in Stata/SE and Stata/MP. What follows is relevant only if you are using Stata/SE or Stata/MP. The maximum line length is defined as 16 more than the maximum macro length, which is currently 165,200 characters. Each unit increase in set maxvar increases the length maximums by 33. The maximum value of set maxvar is 32,767. Thus, the maximum line length may be set up to 1,081,527 characters if you set maxvar to its largest value. try gen(prefix) nolocal or mata(name) nolocal; see help glevelsof for details r(920); . glevelsof x, gen(uniq_) nolocal . gisid uniq_* in 1 / `r(J)'
If the user prefers to work with mata, simply pass the option
matasave[(name)]. With mixed-types, numbers and strings are
stored in separate matrices as well as a single printed matrix,
but the latter can be suppressed to save memory.
. glevelsof x y, mata(xy) nolocal (note: raw levels saved in xy; see mata xy.desc()) . glevelsof x, mata(x) nolocal silent (note: raw levels saved in x; see mata x.desc()) . mata xy.desc() xy is a class object with group levels | object | value | description | | ------- | ---------------- | ------------------------------------- | | J | 64958 | number of levels | | knum | 1 | # numeric by variables | | numx | 64958 x 1 matrix | numeric by var levels | | kchar | 1 | # of string by variables | | charx | 64958 x 1 matrix | character by var levels | | map | 1 x 2 vector | map by vars index to numx and charx | | lens | 1 x 2 vector | if string, > 0; if numeric, <= 0 | | charpos | 1 x 1 vector | position of kth character variable | | printed | 64958 x 2 vector | formatted (printf-ed) variable levels | . mata x.desc() x is a class object with group levels | object | value | description | | ------- | ---------------- | ------------------------------------- | | J | 10000 | number of levels | | knum | 0 | # numeric by variables | | numx | [empty] | numeric by var levels | | kchar | 1 | # of string by variables | | charx | 10000 x 1 matrix | character by var levels | | map | 1 x 1 vector | map by vars index to numx and charx | | lens | 1 x 1 vector | if string, > 0; if numeric, <= 0 | | charpos | 1 x 1 vector | position of kth character variable | | printed | [empty] | formatted (printf-ed) variable levels |
Last, the user can replace the source variables if need be. This is faster and saves memory, but it dispenses with the original variables.
. glevelsof x, gen(uniq_) nolocal . glevelsof x y, gen(, replace) nolocal . l in `r(J)' +-----------------------------------------+ | x y uniq_x | |-----------------------------------------| 64958. | a long string appeared9999 8 | +-----------------------------------------+ . l in `=_N' +----------------+ | x y uniq_x | |----------------| 100000. | . | +----------------+
Number format
levelsof by default shows many significant digits for numerical variables.
. sysuse auto, clear . replace headroom = headroom + 0.1 . glevelsof headroom 1.600000023841858 2.099999904632568 2.599999904632568 3.099999904632568 3.599999904632568 4.099999904632568 4.599999904632568 5.099999904632568 . levelsof headroom 1.600000023841858 2.099999904632568 2.599999904632568 3.099999904632568 3.599999904632568 4.099999904632568 4.599999904632568 5.099999904632568
This is cumbersome. You can specify a number format to compress this:
. glevelsof headroom, numfmt(%.3g) 1.6 2.1 2.6 3.1 3.6 4.1 4.6 5.1
Multiple variables
glevelsof can parse multiple variables:
. local varlist foreign rep78
. glevelsof `varlist', sep("|") colsep(", ")
`"0, 1"'|`"0, 2"'|`"0, 3"'|`"0, 4"'|`"0, 5"'|`"1, 3"'|`"1, 4"'|`"1, 5"'
If you know a bit of mata, you can parse this string!
mata: string scalar function unquote_str(string scalar quoted_str) { if ( substr(quoted_str, 1, 1) == `"""' ) { quoted_str = substr(quoted_str, 2, strlen(quoted_str) - 2) } else if (substr(quoted_str, 1, 2) == "`" + `"""') { quoted_str = substr(quoted_str, 3, strlen(quoted_str) - 4) } return (quoted_str); } t = tokeninit(`"`r(sep)'"', (""), (`""""', `"`""'"'), 1) tokenset(t, `"`r(levels)'"') rows = tokengetall(t) for (i = 1; i <= cols(rows); i++) { rows[i] = unquote_str(rows[i]); } levels = J(cols(rows), `:list sizeof varlist', "") t = tokeninit(`"`r(colsep)'"', (""), (`""""', `"`""'"'), 1) for (i = 1; i <= cols(rows); i++) { tokenset(t, rows[i]) levels[i, .] = tokengetall(t) for (k = 1; k <= `:list sizeof varlist'; k++) { levels[i, k] = unquote_str(levels[i, k]) } } levels end
And now we have the leves in a mata matrix:
1 2
+---------+
1 | 0 1 |
2 | 0 2 |
3 | 0 3 |
4 | 0 4 |
5 | 0 5 |
6 | 1 3 |
7 | 1 4 |
8 | 1 5 |
+---------+
While this looks cumbersome, this mechanism is used internally by
gtoplevelsof to display its results.