glevelsof
Efficiently get levels of variable using C plugins
glevelsof displays a sorted list of the distinct values of varlist. It is meant to be a fast replacement of levelsof. Unlike levelsof, it can take a single variable or multiple variables.
Important
Run gtools, upgrade
to update gtools
to the latest stable version.
Syntax
glevelsof varlist [if] [in] [, options ]
Instead of varlist, it is possible to specify
[+|-] varname [[+|-] varname ...]
This will not affect the levels recovered but it will affect the sort order in which they are stored and printed.
Options
-
clean
displays string values without compound double quotes. By default, each distinct string value is displayed within compound double quotes, as these are the most general delimiters. If you know that the string values in varlist do not include embedded spaces or embedded quotes, this is an appropriate option. clean does not affect the display of values from numeric variables. -
local(macname)
inserts the list of values in local macro macname within the calling program's space. Hence, that macro will be accessible after glevelsof has finished. This is helpful for subsequent use, especially with foreach. -
missing
specifies that missing values of varlist should be included in the calculation. The default is to exclude them. -
separate(separator)
specifies a separator to serve as punctuation for the values of the returned list. The default is a space. A useful alternative is a comma.
Extras
-
nolocal
Do not storevarlist
levels in a local macro. This is specially useful withgen()
-
silent
Do not display the levels of varlist. Mainly for use withgen()
andmatasave
. Withmatasave
, the levels are not sepparately stored as a string matrix, but the raw levels are kept. -
matasave[(str)]
Save results in mata object (default name is GtoolsByLevels). SeeGtoolsByLevels.desc()
for more. This object contains the raw variable levels innumx
andcharx
(since mata does not allow matrices of mixed-type). The levels are saved as a string inprinted
(with value labels correctly applied) unless optionsilent
is also specified. -
gen([prefix], [replace])
Store the unique levels ofvarlist
in a new varlist prefixed byprefix
orreplace
thevarlist
with its unique levels. The two optoins are mutually exclusive. -
colseparate(separator)
specifies a separator to serve as punctuation for the columns of the returned list. The default is a pipe. Specifying a varlist instead of a varname is only useful for double loops or for use with gettoken. -
numfmt(format)
Number format for printing. By default numbers are printed to 16 digits of precision, but the user can specify the number format here. By default, only "%.#g|f" and "%#.#g|f" are accepted since this is formated internally in C. However, with optionmatasave
this is formated in mata and has to be a mata format. -
unsorted
Do not sort levels. This option is experimental and only affects the output when the input is not an integer (for integers, the levels are sorted internally regardless). While not sorting the levels is faster,glevelsof
is typically used when the number of levels is small (10s, 100s, 1000s) and thus speed savings will be minimal.
Gtools options
(Note: These are common to every gtools command.)
-
compress
Try to compress strL to str#. The Stata Plugin Interface has only limited support for strL variables. In Stata 13 and earlier (version 2.0) there is no support, and in Stata 14 and later (version 3.0) there is read-only support. The user can try to compress strL variables using this option. -
forcestrl
Skip binary variable check and force gtools to read strL variables (14 and above only). Gtools gives incorrect results when there is binary data in strL variables. This option was included because on some windows systems Stata detects binary data even when there is none. Only use this option if you are sure you do not have binary data in your strL variables. -
verbose
prints some useful debugging info to the console. -
benchmark
orbench(level)
prints how long in seconds various parts of the program take to execute. Level 1 is the same asbenchmark
. Levels 2 and 3 additionally prints benchmarks for internal plugin steps. -
hashmethod(str)
Hash method to use.default
automagically chooses the algorithm.biject
tries to biject the inputs into the natural numbers.spooky
hashes the data and then uses the hash. -
oncollision(str)
How to handle collisions. A collision should never happen but just in case it doesgtools
will try to use native commands. The user can specify it throw an error instead by passingoncollision(error)
.
Stored results
glevelsof stores the following in r():
Macros r(levels) list of distinct values r(sep) Row separator r(colsep) Column sezparator Scalars r(N) number of non-missing observations r(J) number of groups r(minJ) largest group size r(maxJ) smallest group size
When matasave
is passed, the following data is stored in GtoolsByLevels
:
real scalar anyvars 1: any by variables; 0: no by variables real scalar anychar 1: any string by variables; 0: all numeric by variables real scalar anynum 1: any numeric by variables; 0: all string by variables string rowvector byvars by variable names real scalar kby number of by variables real scalar rowbytes number of bytes in one row of the internal by variable matrix real scalar J number of levels real matrix numx numeric by variables string matrix charx string by variables real scalar knum number of numeric by variables real scalar kchar number of string by variables real rowvector lens > 0: length of string by variables; <= 0: internal code for numeric variables real rowvector map map from index to numx and charx real rowvector charpos position of kth character variable string matrix printed formatted (printf-ed) variable levels (not with option -silent-)
Remarks
glevelsof serves two different functions. First, it gives a compact display of the distinct values of varlist. More commonly, it is useful when you desire to cycle through the distinct values of varlist with (say) foreach; see [P] foreach. glevelsof leaves behind a list in r(levels) that may be used in a subsequent command.
glevelsof may hit the limits imposed by your Stata. However, it is typically used when the number of distinct values of varlist is modest. If you have many levels in varlist then an alternative may be gtoplevelsof, which shows the largest or smallest levels of a varlist by their frequency count.
Examples
You can download the raw code for the examples below here
. sysuse auto (1978 Automobile Data) . glevelsof rep78 1 2 3 4 5 . qui glevelsof rep78, miss local(mylevs) . display "`mylevs'" 1 2 3 4 5 . . glevelsof rep78, sep(,) 1,2,3,4,5
De-duplicating a variable list
glevelsof
can store the unique levels of a varlist. This is specially
useful when the user wants to obtain the unique levels but runs up
against the stata macro variable limit.
. set seed 42 . clear . set obs 100000 obs was 0, now 100000 . gen x = "a long string appeared" + string(mod(_n, 10000)) . gen y = int(10 * runiform()) . glevelsof x macro substitution results in line that is too long The line resulting from substituting macros would be longer than allowed. The maximum allowed length is 165,216 characters, which is calculated on the basis of set maxvar. You can change that in Stata/SE and Stata/MP. What follows is relevant only if you are using Stata/SE or Stata/MP. The maximum line length is defined as 16 more than the maximum macro length, which is currently 165,200 characters. Each unit increase in set maxvar increases the length maximums by 33. The maximum value of set maxvar is 32,767. Thus, the maximum line length may be set up to 1,081,527 characters if you set maxvar to its largest value. try gen(prefix) nolocal or mata(name) nolocal; see help glevelsof for details r(920); . glevelsof x, gen(uniq_) nolocal . gisid uniq_* in 1 / `r(J)'
If the user prefers to work with mata, simply pass the option
matasave[(name)]
. With mixed-types, numbers and strings are
stored in separate matrices as well as a single printed matrix,
but the latter can be suppressed to save memory.
. glevelsof x y, mata(xy) nolocal (note: raw levels saved in xy; see mata xy.desc()) . glevelsof x, mata(x) nolocal silent (note: raw levels saved in x; see mata x.desc()) . mata xy.desc() xy is a class object with group levels | object | value | description | | ------- | ---------------- | ------------------------------------- | | J | 64958 | number of levels | | knum | 1 | # numeric by variables | | numx | 64958 x 1 matrix | numeric by var levels | | kchar | 1 | # of string by variables | | charx | 64958 x 1 matrix | character by var levels | | map | 1 x 2 vector | map by vars index to numx and charx | | lens | 1 x 2 vector | if string, > 0; if numeric, <= 0 | | charpos | 1 x 1 vector | position of kth character variable | | printed | 64958 x 2 vector | formatted (printf-ed) variable levels | . mata x.desc() x is a class object with group levels | object | value | description | | ------- | ---------------- | ------------------------------------- | | J | 10000 | number of levels | | knum | 0 | # numeric by variables | | numx | [empty] | numeric by var levels | | kchar | 1 | # of string by variables | | charx | 10000 x 1 matrix | character by var levels | | map | 1 x 1 vector | map by vars index to numx and charx | | lens | 1 x 1 vector | if string, > 0; if numeric, <= 0 | | charpos | 1 x 1 vector | position of kth character variable | | printed | [empty] | formatted (printf-ed) variable levels |
Last, the user can replace the source variables if need be. This is faster and saves memory, but it dispenses with the original variables.
. glevelsof x, gen(uniq_) nolocal . glevelsof x y, gen(, replace) nolocal . l in `r(J)' +-----------------------------------------+ | x y uniq_x | |-----------------------------------------| 64958. | a long string appeared9999 8 | +-----------------------------------------+ . l in `=_N' +----------------+ | x y uniq_x | |----------------| 100000. | . | +----------------+
Number format
levelsof
by default shows many significant digits for numerical variables.
. sysuse auto, clear . replace headroom = headroom + 0.1 . glevelsof headroom 1.600000023841858 2.099999904632568 2.599999904632568 3.099999904632568 3.599999904632568 4.099999904632568 4.599999904632568 5.099999904632568 . levelsof headroom 1.600000023841858 2.099999904632568 2.599999904632568 3.099999904632568 3.599999904632568 4.099999904632568 4.599999904632568 5.099999904632568
This is cumbersome. You can specify a number format to compress this:
. glevelsof headroom, numfmt(%.3g) 1.6 2.1 2.6 3.1 3.6 4.1 4.6 5.1
Multiple variables
glevelsof
can parse multiple variables:
. local varlist foreign rep78 . glevelsof `varlist', sep("|") colsep(", ") `"0, 1"'|`"0, 2"'|`"0, 3"'|`"0, 4"'|`"0, 5"'|`"1, 3"'|`"1, 4"'|`"1, 5"'
If you know a bit of mata, you can parse this string!
mata: string scalar function unquote_str(string scalar quoted_str) { if ( substr(quoted_str, 1, 1) == `"""' ) { quoted_str = substr(quoted_str, 2, strlen(quoted_str) - 2) } else if (substr(quoted_str, 1, 2) == "`" + `"""') { quoted_str = substr(quoted_str, 3, strlen(quoted_str) - 4) } return (quoted_str); } t = tokeninit(`"`r(sep)'"', (""), (`""""', `"`""'"'), 1) tokenset(t, `"`r(levels)'"') rows = tokengetall(t) for (i = 1; i <= cols(rows); i++) { rows[i] = unquote_str(rows[i]); } levels = J(cols(rows), `:list sizeof varlist', "") t = tokeninit(`"`r(colsep)'"', (""), (`""""', `"`""'"'), 1) for (i = 1; i <= cols(rows); i++) { tokenset(t, rows[i]) levels[i, .] = tokengetall(t) for (k = 1; k <= `:list sizeof varlist'; k++) { levels[i, k] = unquote_str(levels[i, k]) } } levels end
And now we have the leves in a mata matrix:
1 2 +---------+ 1 | 0 1 | 2 | 0 2 | 3 | 0 3 | 4 | 0 4 | 5 | 0 5 | 6 | 1 3 | 7 | 1 4 | 8 | 1 5 | +---------+
While this looks cumbersome, this mechanism is used internally by
gtoplevelsof
to display its results.