gduplicates
Efficiently report, tag, or drop duplicate observations using C plugins. This is a faster alternative to duplicates. It can replicate every sub-command of duplicates; that is, it reports, displays, lists, tags, or drops duplicate observations, depending on the subcommand. Duplicates are observations with identical values either on all variables if no varlist is specified or on a specified varlist.
Note that for sub-commands examples
and list
the output is NOT
sorted by default. To mimic duplicates entirely, pass option sorted
when using those sub-commands.
Important
Run gtools, upgrade
to update gtools
to the latest stable version.
Syntax
Any observations that do not satisfy specified if and/or in conditions are ignored when you use report, examples, list, or drop. The variable created by tag will have missing values for such observations.
Further, option sorted
is required to fully mimic duplicates examples
and duplicates list; otherwise, gduplicates will not sort the list of
examples or the full list of duplicates. This default behavior improves
performance but may be harder to read.
Report duplicates
gduplicates report [varlist] [if] [in]
Print a table showing observations that occur as one or more copies and indicating how many observations are "surplus" in the sense that they are the second (third, ...) copy of the first of each group of duplicates.
List one example for each group of duplicates
gduplicates examples [varlist] [if] [in] [, sorted options]
List one example for each group of duplicated observations. Each example represents the first occurrence of each group in the dataset.
List all duplicates
gduplicates list [varlist] [if] [in] [, sorted options]
List all duplicated observations.
Tag duplicates
gduplicates tag [varlist] [if] [in] , generate(newvar)
Generate a variable representing the number of duplicates for each observation. This will be 0 for all unique observations.
Drop duplicates
gduplicates drop [if] [in] gduplicates drop varlist [if] [in] , force
Drop all but the first occurrence of each group of duplicated observations. The word drop may not be abbreviated.
Options
Unlike other gtools
commands, gdistinct
extra arguments are
captured. See help list
for the full options available with examples
and list (both call the list command internally).
To pass gtools
options use gtools(str)
.
Examples
You can download the raw code for the examples below
here ;
however, note this merely mimics the examples in help duplicates
.
sysuse auto keep make price mpg rep78 foreign expand 2 in 1/2
Report duplicates
gduplicates report Duplicates in terms of all variables -------------------------------------- copies | observations surplus ----------+--------------------------- 1 | 72 0 2 | 4 2 --------------------------------------
List one example for each group of duplicated observations
sort mpg gduplicates examples Duplicates in terms of all variables +----------------------------------------------------------------------+ | group: # e.g. obs make price mpg rep78 foreign | |----------------------------------------------------------------------| | 2 2 2 AMC Pacer 4,749 17 3 Domestic | | 1 2 1 AMC Concord 4,099 22 3 Domestic | +----------------------------------------------------------------------+ WARNING: examples left unsorted to improve performance; use option sort to mimic duplicates
gduplicates examples, sorted Duplicates in terms of all variables +----------------------------------------------------------------------+ | group: # e.g. obs make price mpg rep78 foreign | |----------------------------------------------------------------------| | 1 2 1 AMC Concord 4,099 22 3 Domestic | | 2 2 2 AMC Pacer 4,749 17 3 Domestic | +----------------------------------------------------------------------+
List all duplicated observations
gduplicates list Duplicates in terms of all variables +--------------------------------------------------------------+ | group: obs: make price mpg rep78 foreign | |--------------------------------------------------------------| | 2 18 AMC Pacer 4,749 17 3 Domestic | | 2 19 AMC Pacer 4,749 17 3 Domestic | | 1 45 AMC Concord 4,099 22 3 Domestic | | 1 50 AMC Concord 4,099 22 3 Domestic | +--------------------------------------------------------------+ WARNING: list left unsorted to improve performance; use option sort to mimic duplicates
Create variable dup containing the number of duplicates (0 if observation is unique)
gduplicates tag, generate(dup)
List the duplicated observations
list if dup == 1 +----------------------------------------------------+ | make price mpg rep78 foreign dup | |----------------------------------------------------| 18. | AMC Pacer 4,749 17 3 Domestic 1 | 19. | AMC Pacer 4,749 17 3 Domestic 1 | 45. | AMC Concord 4,099 22 3 Domestic 1 | 50. | AMC Concord 4,099 22 3 Domestic 1 | +----------------------------------------------------+
Drop all but the first occurrence of each group of duplicated observations
gduplicates drop Duplicates in terms of all variables (2 observations deleted)
List all duplicated observations
gduplicates list Duplicates in terms of all variables (0 observations are duplicates)