gstats winsor
Efficiently winsorize a list of variables, optionally specifying weights.
Important
Run gtools, upgrade to update gtools to the latest stable version.
Syntax
gstats winsor varlist [if] [in] [weight] [, by(varlist) options]
Options
- prefix(str): Generate targets named prefix + source (default empty).
- suffix(str): Generate targets named source + suffix (default _w with cut and _tr with trim).
- generate(namelist): Named targets to generate; one per source.
- cuts(#.# #.#): Cut points (default 1.0 and 99.0, for the 1st and 99th percentiles).
- trim: Trim instead of Winsorize (i.e. replace outliers with missing values).
- label: Add Winsorized/trimming note to target labels.
- replace: Replace targets if they exist.
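As a minimal sketch of how the naming options combine, assuming gtools is installed and using the nlsw88 data that ships with Stata (the target names wage_p1p99 and hours_p1p99 are illustrative, not part of the command):

```stata
sysuse nlsw88, clear

* generate(): name each target explicitly, one per source variable
gstats winsor wage hours, generate(wage_p1p99 hours_p1p99) cuts(1 99)

* trim: set values outside the cut points to missing instead;
* with no naming option, the documented default suffix is _tr
gstats winsor wage hours, trim

* compare the originals with the new variables
summarize wage wage_p1p99 wage_tr hours hours_p1p99 hours_tr
```

The winsorized copies should keep the same number of observations as the sources, while the trimmed copies should have fewer non-missing values.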
Gtools options
(Note: These are common to every gtools command.)
- compress: Try to compress strL to str#. The Stata Plugin Interface has only limited support for strL variables. In Stata 13 and earlier (version 2.0) there is no support, and in Stata 14 and later (version 3.0) there is read-only support. The user can try to compress strL variables using this option.
- forcestrl: Skip binary variable check and force gtools to read strL variables (Stata 14 and above only). Gtools gives incorrect results when there is binary data in strL variables. This option was included because on some Windows systems Stata detects binary data even when there is none. Only use this option if you are sure you do not have binary data in your strL variables.
- verbose: Prints some useful debugging info to the console.
- benchmark or bench(level): Prints how long, in seconds, various parts of the program take to execute. Level 1 is the same as benchmark. Levels 2 and 3 additionally print benchmarks for internal plugin steps.
- hashmethod(str): Hash method to use. default automagically chooses the algorithm. biject tries to biject the inputs into the natural numbers. spooky hashes the data and then uses the hash.
- oncollision(str): How to handle collisions. A collision should never happen, but just in case it does, gtools will try to use native commands. The user can request an error instead by passing oncollision(error).
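A short sketch of passing these options to gstats winsor, assuming gtools is installed; the particular combination is only illustrative, and the exact console output depends on the gtools version:

```stata
sysuse nlsw88, clear

* bench(2): report timings, including internal plugin steps
* oncollision(error): stop with an error rather than fall back to native commands
gstats winsor wage hours, by(industry) bench(2) oncollision(error)
```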
Remarks
gstats winsor winsorizes or trims (if the trim option is specified)
the variables in varlist at the percentiles specified by option
cuts(#1 #2). By default, new variables are generated with the
suffix "_w" or "_tr", respectively. The user can control this via the
suffix() option. The replace option instead overwrites the variables with their
winsorized or trimmed versions.
Winsorizing vs trimming
Note
This discussion is nearly verbatim from the equivalent help section from winsor2.
Winsorizing is not equivalent to simply excluding data, which is
a simpler procedure, called trimming or truncation. In a trimmed
estimator, the extreme values are discarded; in a Winsorized estimator,
the extreme values are instead replaced by certain percentiles,
specified by option cuts(# #). For details, see help winsor (if
installed) and help trimmean (if installed).
For example, the following commands report the 1st and 99th
percentiles of the variable wage, 1.930993 and 38.70926:
sysuse nlsw88, clear
sum wage, detail
By default, gstats winsor winsorizes wage at the 1st and 99th percentiles,
gstats winsor wage, replace cuts(1 99)
which can be done by hand:
replace wage = 1.930993 if wage < 1.930993
replace wage = 38.70926 if wage > 38.70926
Note that values smaller than the 1st percentile are replaced by that
value, and similarly for values above the 99th percentile. When the
trim option is specified, those values are set to missing instead
(missing values are discarded by most commands):
gstats winsor wage, replace cuts(1 99) trim
which can also be done by hand:
replace wage = . if wage < 1.930993
replace wage = . if wage > 38.70926
In this case, we discard values smaller than the 1st percentile or greater than the 99th percentile. This is trimming.
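The by-hand steps can also be sketched without hard-coding the cut points, by reading them from the results that summarize stores. This is only an illustration in plain Stata, not how gstats winsor is implemented; the names p_lo, p_hi, wage_w_hand, and wage_tr_hand are made up for the example. The !missing() guard is added because Stata treats missing values as larger than any number, so an unguarded "if wage > cutoff" would also match missing values.

```stata
sysuse nlsw88, clear

* store the 1st and 99th percentiles from summarize's returned results
quietly summarize wage, detail
scalar p_lo = r(p1)
scalar p_hi = r(p99)

* winsorize by hand: clamp values outside [p_lo, p_hi] to the cut points
generate double wage_w_hand = wage
replace wage_w_hand = p_lo if wage < p_lo
replace wage_w_hand = p_hi if wage > p_hi & !missing(wage)

* trim by hand: set the same values to missing
generate double wage_tr_hand = wage
replace wage_tr_hand = . if (wage < p_lo | wage > p_hi) & !missing(wage)

* the winsorized copy keeps every observation; the trimmed copy loses the tails
summarize wage wage_w_hand wage_tr_hand
```

Working on copies leaves the original wage variable untouched, which makes it easy to compare the two treatments side by side.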
Examples
Note
These examples are nearly verbatim from the equivalent help section of winsor2.
Winsor at (p1 p99); get new variable wage_w
sysuse nlsw88, clear
gstats winsor wage
Winsor 3 variables at 0.5th and 99.5th percentiles, and overwrite the old variables
gstats winsor wage age hours, cuts(0.5 99.5) replace
Winsor 3 variables at (p1 p99), generate new variables with suffix _win, and add variable labels
gstats winsor wage age hours, suffix(_win) label
Left-winsorizing only, at 1st percentile
cap noi gstats winsor wage, cuts(1 100)
gstats winsor wage, cuts(1 100) s(_w2)
Right-trimming only, at 99th percentile
gstats winsor wage, cuts(0 99) trim
Winsor variables at (p1 p99) by (industry), overwrite the old variables
gstats winsor wage hours, replace by(industry)
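To illustrate what by-group cut points mean, here is a rough plain-Stata sketch that computes the percentiles within each industry and clamps wages to their own group's bounds. It assumes by() computes cut points separately per group, as the example above describes; it is not a claim about gstats winsor's internals, and edge cases such as missing group values may be handled differently. The names wage_p1, wage_p99, and wage_by_hand are illustrative.

```stata
sysuse nlsw88, clear

* cut points computed separately within each industry
bysort industry: egen double wage_p1  = pctile(wage), p(1)
bysort industry: egen double wage_p99 = pctile(wage), p(99)

* clamp each wage to its own group's cut points
generate double wage_by_hand = wage
replace wage_by_hand = wage_p1  if wage < wage_p1
replace wage_by_hand = wage_p99 if wage > wage_p99 & !missing(wage)

* inspect the group-specific bounds before and after
tabstat wage wage_by_hand, by(industry) statistics(min max)
```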
Acknowledgements
gstats winsor was written largely to mimic the functionality of the community-contributed command winsor2, the latter written by
- Yujun, Lian (Arlion) Department of Finance, Lingnan College, Sun Yat-Sen University.
- E-mail: arlionn@163.com
- Blog: http://blog.cnfol.com/arlion
- Homepage: http://www.lingnan.sysu.edu.cn/lnshizi/faculty_vch.asp?name=lianyj
In turn, winsor2 had incorporated some code from winsor, by
- Nicholas J. Cox, Durham University, U.K. n.j.cox@durham.ac.uk
and winsorizeJ.ado, by
- Judson Caskey