5 Programming & Advanced Features
Stata features the ability to create user-written commands. These can range from simple data manipulation commands to completely new statistical models. This is an advanced feature that not many users will need.
However, there are several components of the programming capabilities which are very useful even without writing your own commands. Here we’ll discuss several.
Let’s open a fresh version of “auto”:
. sysuse auto, clear
(1978 automobile data)
5.1 Macros
While variables stored as strings aren’t of much use to us, strings stored as other strings can be quite useful. Imagine the following scenario: You have a collection of 5 variables that you want to perform several different operations on. You might have code like this:
list var1 var2 var3 var4 var5 in 1/5
summarize var1 var2 var3 var4 var5
label define mylab 0 "No" 1 "Yes"
label values var1 var2 var3 var4 var5 mylab
duplicates list var1 var2 var3 var4 var5
This can get extremely tedious as the number of variables and commands increases. You could copy and paste a lot, but even that takes a lot of effort.
Instead, we can store the list of variables (strictly speaking, the string which contains the list of variables) in a shorter key string, and refer to that instead!
local vars = "var1 var2 var3 var4 var5"
list `vars' in 1/5
summarize `vars'
label define mylab 0 "No" 1 "Yes"
label values `vars' mylab
duplicates list `vars'
The first command, local, defines what is known as a “local macro”. Whenever it is referred to, wrapped in a backtick (to the left of the 1 key at the top-left of the keyboard) and a single quote, Stata replaces it with the original text. So when you enter
list `vars' in 1/5
Stata immediately replaces `vars' with var1 var2 var3 var4 var5, then executes
list var1 var2 var3 var4 var5 in 1/5
Important: Local macros are deleted as soon as code finishes executing! That means that you must use them in a do-file, and you must run all lines which create and access the macro at the same time, by highlighting them all.
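As a minimal sketch of the same idea on data that actually exists, the lines below use three variables from the auto data set. The macro name vars is arbitrary, and the whole block is meant to be run together from a do-file so the local macro is still defined when it is used.

```stata
* Run all of these lines together; the local disappears once execution ends.
sysuse auto, clear

* Store the variable list once...
local vars price mpg weight

* ...and reuse it wherever a list of variables is expected.
list `vars' in 1/5
summarize `vars'
```

If the summarize line is run on its own afterwards, `vars' expands to nothing and Stata summarizes every variable instead.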
If your macro contains text that should be quoted, you still need to quote it when accessing. For example, if you had
label variable price1 "Price (in dollars) at Time Point 1"
label variable price2 "Price (in dollars) at Time Point 2"
you could instead write
local pricelab = "Price (in dollars) at Time Point"
label variable price1 "`pricelab' 1"
label variable price2 "`pricelab' 2"
You can use display to print the content of macros to the output to preview them.
. local test = "abc"

. display "`test'"
abc
You may occasionally see code that excludes the = in defining a macro (e.g. local vars "var1 var2"). This matters when working with numeric macros. Using an = forces Stata to evaluate the expression before storing it; excluding it stores the text exactly as typed. For example,
. local x 1 + 3

. display "`x'"
1 + 3

. local y = 1 + 3

. display "`y'"
4
Note the use of quotation marks there. display would evaluate numerics regardless of whether = was used, so by treating the macro as a string, we can see the difference.
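To make the last point concrete, here is a small sketch (the macro names x and y are the ones used above) that displays the same macros with and without quotation marks. Run the lines together from a do-file.

```stata
* x holds the text "1 + 3"; y holds the evaluated result.
local x 1 + 3
local y = 1 + 3

* Quoted: display prints the stored text as is.
display "`x'"

* Unquoted: display receives the expression 1 + 3 and evaluates it.
display `x'

* y already holds a number, so it can be used directly in arithmetic.
display `y' * 2
```

The first display prints 1 + 3, the second prints 4, and the third prints 8.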
5.1.1 Global macros
local defines a “local macro”; global defines a “global macro”. Global macros persist between runs: whereas a local macro is removed after the code finishes executing, the global stays around.
Global macros are accessed with $:
. global dog "Black Lab"

. display "$dog"
Black Lab
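Because a global stays around until it is removed, it can be worth cleaning up explicitly. The sketch below assumes the global dog from above and uses macro drop to delete it; afterwards the reference expands to an empty string.

```stata
* Define the global, use it, then remove it.
global dog "Black Lab"
display "$dog"

* macro drop deletes the global; $dog now expands to nothing.
macro drop dog
display "$dog"
```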
5.1.2 Class and Return
Every command in Stata is of a particular type. One major aspect of the type is what the command “returns”. Some commands are n-class, which means they don’t return anything. Some are c-class, which are only used by programmers and rarely useful elsewhere. The two common ones are e-class and r-class. The distinction between the two is inconsequential, besides that they store their “returns” in different places.
Here, summarize is an r-class command, so it stores its returns in “return”. We can see them all with return list. On the other hand, mean (which we haven’t discussed, but basically displays summary statistics similar to summarize but provides some additional functionality) is an e-class command, storing its results in ereturn:
. summarize price

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       price |         74    6165.257    2949.496       3291      15906

. return list

scalars:
                  r(N) =  74
              r(sum_w) =  74
               r(mean) =  6165.256756756757
                r(Var) =  8699525.974268788
                 r(sd) =  2949.495884768919
                r(min) =  3291
                r(max) =  15906
                r(sum) =  456229

. mean price

Mean estimation                             Number of obs = 74

--------------------------------------------------------------
             |       Mean   Std. err.     [95% conf. interval]
-------------+------------------------------------------------
       price |   6165.257   342.8719      5481.914      6848.6
--------------------------------------------------------------

. ereturn list

scalars:
               e(df_r) =  73
             e(N_over) =  1
                  e(N) =  74
               e(k_eq) =  1
               e(rank) =  1

macros:
            e(cmdline) : "mean price"
                e(cmd) : "mean"
                e(vce) : "analytic"
              e(title) : "Mean estimation"
          e(estat_cmd) : "estat_vce_only"
            e(varlist) : "price"
       e(marginsnotok) : "_ALL"
         e(properties) : "b V"

matrices:
                  e(b) :  1 x 1
                  e(V) :  1 x 1
                 e(sd) :  1 x 1
                 e(_N) :  1 x 1
              e(error) :  1 x 1

functions:
             e(sample)
Rather than try and keep track of what gets stored where, if you look at the very bottom of any help file, it will say something like “summarize stores the following in r():” or “mean stores the following in e():”, corresponding to return and ereturn respectively.
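Stored results can be used directly in expressions. The following is a minimal sketch, assuming the auto data; the macro name range is made up for illustration. It computes the price range from the r() results of summarize and then reads two e() results left behind by mean.

```stata
sysuse auto, clear

* r-class: use the results right away, before another command replaces them.
quietly summarize price
local range = r(max) - r(min)
display "Price range: `range'"

* e-class: scalars are read like r() scalars; matrices need matrix list.
quietly mean price
display e(N)
matrix list e(b)
```

The range is saved in a macro immediately after summarize because later commands may overwrite what is stored in r().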
Along with the One Data principle, Stata also follows the One _-class principle, meaning you can only view the return or ereturn for the most recent command of that class. So if you run a summarize command, then do a bunch of n-class calls (gsort for example), the return list call will still give you the returns for that first summarize. However, as soon as you run another r-class command, you lose access to the first one. You can save any piece of it using a macro. For example, to calculate the average difference in price between foreign and domestic cars¹:
. summarize price if foreign == 1

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       price |         22    6384.682    2621.915       3748      12990

. local fprice = r(mean)

. summarize price if foreign == 0

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       price |         52    6072.423    3097.104       3291      15906

. local dprice = r(mean)

. display `dprice' - `fprice'
-312.25874
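As an aside, the same calculation can be sketched with Stata scalars instead of local macros. This is an alternative and not the approach used above; the scalar names fprice, dprice and diff are chosen only for illustration. Unlike locals, scalars remain available after the code finishes executing, until they are dropped.

```stata
sysuse auto, clear

* Save each group mean in a named scalar before r() is overwritten.
quietly summarize price if foreign == 1
scalar fprice = r(mean)

quietly summarize price if foreign == 0
scalar dprice = r(mean)

* scalar() makes explicit that these names refer to scalars, not variables.
scalar diff = scalar(dprice) - scalar(fprice)
display scalar(diff)
```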
5.2 Variable Lists
Introduced in Stata 16, variable lists formalize a common technique used in earlier versions of Stata: defining a global containing a list of variables to be used later in the document. For example, you might see something like this at the top of a do-file:
global predictors x1 x2 x3 x4
then further down the document something like
regress y $predictors
logit z $predictors
Stata has formalized this concept with the addition of the vl command (variable list). It works similarly to the use of globals: lists of variables are defined, then later referenced via the $name syntax. However, using vl has the benefits of improved organization, customizations unique to variable lists, error checking, and overall convenience.
5.2.1 Initialization of Variable Lists
To begin using variable lists, vl set must be run.
. sysuse auto
(1978 automobile data)

. vl set

-------------------------------------------------------------------------------
                  |                      Macro's contents
                  |------------------------------------------------------------
Macro             |  # Vars   Description
------------------+------------------------------------------------------------
System            |
  $vlcategorical  |       2   categorical variables
  $vlcontinuous   |       2   continuous variables
  $vluncertain    |       7   perhaps continuous, perhaps categorical variables
  $vlother        |       0   all missing or constant variables
-------------------------------------------------------------------------------
Notes

      1. Review contents of vlcategorical and vlcontinuous to ensure they are
         correct.  Type vl list vlcategorical and type vl list vlcontinuous.

      2. If there are any variables in vluncertain, you can reallocate them
         to vlcategorical, vlcontinuous, or vlother.  Type
         vl list vluncertain.

      3. Use vl move to move variables among classifications.  For example,
         type vl move (x50 x80) vlcontinuous to move variables x50 and x80 to
         the continuous classification.

      4. vlnames are global macros.  Type the vlname without the leading
         dollar sign ($) when using vl commands.  Example: vlcategorical not
         $vlcategorical.  Type the dollar sign with other Stata commands to
         get a varlist.
This produces a surprisingly large amount of output. When you initialize the use of variable lists, Stata will automatically create four variable lists, called the “System variable lists”. Every numeric variable in the current data set is automatically placed into one of these four lists:
vlcategorical: Variables which Stata thinks are categorical. These generally have to be non-negative, integer-valued variables with fewer than 10 unique values.
vlcontinuous: Variables which Stata thinks are continuous. These generally are variables which have negative values, have non-integer values, or are non-negative integers with more than 100 unique values.
vluncertain: Variables which Stata cannot confidently classify as continuous or categorical. These generally are non-negative, integer-valued variables with between 10 and 100 unique values.
vlother: Any numeric variables that aren’t really useful, being either all missing or constant.
There is a potential fifth system variable list, vldummy, which is created when option dummy is passed. Unsurprisingly, this will take variables containing only values 0 and 1 out of vlcategorical and into this list.
The “Notes” given below the output are generic; they appear regardless of how well Stata was able to categorize the variables. They can be suppressed with the nonotes option to vl set.
The two thresholds given above, 10 and 100, can be adjusted by the categorical and uncertain options. For example,
vl set, categorical(20) uncertain(50)
Running vl set on an already vl-set data set will result in an error, unless the clear option is given, which will re-generate the lists.
. vl set, dummy nonotes
one or more already classified variables specified
    You requested that variables be added to vl's system classifications, but
    you specified 11 variables that were already classified.
r(110);

. vl set, dummy nonotes clear

-------------------------------------------------------------------------------
                  |                      Macro's contents
                  |------------------------------------------------------------
Macro             |  # Vars   Description
------------------+------------------------------------------------------------
System            |
  $vldummy        |       1   0/1 variable
  $vlcategorical  |       1   categorical variable
  $vlcontinuous   |       2   continuous variables
  $vluncertain    |       7   perhaps continuous, perhaps categorical variables
  $vlother        |       0   all missing or constant variables
-------------------------------------------------------------------------------
In the above, we changed our minds and wanted to include the vldummy list, but since we’d already vl-set, we had to clear the existing set.
5.2.2 Viewing lists
When initializing the variable lists, we’re treated to a nice table of all defined lists. We can replay it via
. vl dir

-------------------------------------------------------------------------------
                  |                      Macro's contents
                  |------------------------------------------------------------
Macro             |  # Vars   Description
------------------+------------------------------------------------------------
System            |
  $vldummy        |       1   0/1 variable
  $vlcategorical  |       1   categorical variable
  $vlcontinuous   |       2   continuous variables
  $vluncertain    |       7   perhaps continuous, perhaps categorical variables
  $vlother        |       0   all missing or constant variables
-------------------------------------------------------------------------------
To see the actual contents of the variable lists, we’ll need to use vl list.
. vl list

----------------------------------------------------
    Variable | Macro           Values         Levels
-------------+--------------------------------------
     foreign | $vldummy        0 and 1             2
       rep78 | $vlcategorical  integers >=0        5
    headroom | $vlcontinuous   noninteger
  gear_ratio | $vlcontinuous   noninteger
       price | $vluncertain    integers >=0       74
         mpg | $vluncertain    integers >=0       21
       trunk | $vluncertain    integers >=0       18
      weight | $vluncertain    integers >=0       64
      length | $vluncertain    integers >=0       47
        turn | $vluncertain    integers >=0       18
displacement | $vluncertain    integers >=0       31
----------------------------------------------------
This output produces one row for each variable in each variable list it is in. We haven’t used this yet, but variables can be in multiple lists.
We can list only specific lists:
. vl list vlcategorical

------------------------------------------------
Variable | Macro           Values         Levels
---------+--------------------------------------
   rep78 | $vlcategorical  integers >=0        5
------------------------------------------------
or specific variables
. vl list (turn weight)

------------------------------------------------
Variable | Macro           Values         Levels
---------+--------------------------------------
    turn | $vluncertain    integers >=0       18
  weight | $vluncertain    integers >=0       64
------------------------------------------------
If turn was in multiple variable lists, each would appear as a row in this output.
There’s a bit of odd notation which can be used to sort the output by variable name, which makes it easier to identify variables which appear in multiple lists.
. vl list (_all), sort

----------------------------------------------------
    Variable | Macro           Values         Levels
-------------+--------------------------------------
displacement | $vluncertain    integers >=0       31
     foreign | $vldummy        0 and 1             2
  gear_ratio | $vlcontinuous   noninteger
    headroom | $vlcontinuous   noninteger
      length | $vluncertain    integers >=0       47
         mpg | $vluncertain    integers >=0       21
       price | $vluncertain    integers >=0       74
       rep78 | $vlcategorical  integers >=0        5
       trunk | $vluncertain    integers >=0       18
        turn | $vluncertain    integers >=0       18
      weight | $vluncertain    integers >=0       64
----------------------------------------------------
The (_all) tells Stata to report on all variables, and sorting (when you specify at least one variable) orders by variable name rather than variable list name.
This will also list any numeric variables which are not found in any list.
5.2.2.1 Moving variables in system lists
After initializing the variable lists, if you plan on using the system lists, you may need to move variables around (e.g. classifying the vluncertain variables into their proper lists). This can be done via vl move, which has the syntax
vl move (<variables to move>) <destination list>
For example, all the variables in vluncertain are actually continuous:
. vl list vluncertain

----------------------------------------------------
    Variable | Macro           Values         Levels
-------------+--------------------------------------
       price | $vluncertain    integers >=0       74
         mpg | $vluncertain    integers >=0       21
       trunk | $vluncertain    integers >=0       18
      weight | $vluncertain    integers >=0       64
      length | $vluncertain    integers >=0       47
        turn | $vluncertain    integers >=0       18
displacement | $vluncertain    integers >=0       31
----------------------------------------------------

. vl move (price mpg trunk weight length turn displacement) vlcontinuous
note: 7 variables specified and 7 variables moved.

------------------------------
Macro          # Added/Removed
------------------------------
$vldummy                     0
$vlcategorical               0
$vlcontinuous                7
$vluncertain                -7
$vlother                     0
------------------------------

. vl dir

-------------------------------------------------------------------------------
                  |                      Macro's contents
                  |------------------------------------------------------------
Macro             |  # Vars   Description
------------------+------------------------------------------------------------
System            |
  $vldummy        |       1   0/1 variable
  $vlcategorical  |       1   categorical variable
  $vlcontinuous   |       9   continuous variables
  $vluncertain    |       0   perhaps continuous, perhaps categorical variables
  $vlother        |       0   all missing or constant variables
-------------------------------------------------------------------------------
Alternatively, since we’re moving all variables in vluncertain, we can see our first use of the variable list!
. vl set, dummy nonotes clear

-------------------------------------------------------------------------------
                  |                      Macro's contents
                  |------------------------------------------------------------
Macro             |  # Vars   Description
------------------+------------------------------------------------------------
System            |
  $vldummy        |       1   0/1 variable
  $vlcategorical  |       1   categorical variable
  $vlcontinuous   |       2   continuous variables
  $vluncertain    |       7   perhaps continuous, perhaps categorical variables
  $vlother        |       0   all missing or constant variables
-------------------------------------------------------------------------------

. vl move ($vluncertain) vlcontinuous
note: 7 variables specified and 7 variables moved.

------------------------------
Macro          # Added/Removed
------------------------------
$vldummy                     0
$vlcategorical               0
$vlcontinuous                7
$vluncertain                -7
$vlother                     0
------------------------------
Note that variable lists are essentially just global macros, so they can be referred to via $name. Note, however, that the $ is only used when we want to actually use the variable list as a macro; in this case, we wanted to expand vluncertain into its list of variables. When we’re referring to a variable list in the vl commands, we do not use the $.
5.2.3 User Variable Lists
In addition to the System variable lists, you can define your own User variable lists, which I imagine will be used far more often. These are easy to create with vl create:
. vl create mylist1 = (weight mpg)
note: $mylist1 initialized with 2 variables.

. vl create mylist2 = (weight length trunk)
note: $mylist2 initialized with 3 variables.

. vl dir, user

-------------------------------------------------------------------------------
                  |                      Macro's contents
                  |------------------------------------------------------------
Macro             |  # Vars   Description
------------------+------------------------------------------------------------
User              |
  $mylist1        |       2   variables
  $mylist2        |       3   variables
-------------------------------------------------------------------------------

. vl list, user

------------------------------------------------
Variable | Macro           Values         Levels
---------+--------------------------------------
  weight | $mylist1        integers >=0       64
     mpg | $mylist1        integers >=0       21
  weight | $mylist2        integers >=0       64
  length | $mylist2        integers >=0       47
   trunk | $mylist2        integers >=0       18
------------------------------------------------
Note the addition of the user option to vl list and vl dir to show only User variable lists and suppress the System variable lists. We can also demonstrate the odd sorting syntax here:
. vl list (_all), sort user

----------------------------------------------------
    Variable | Macro           Values         Levels
-------------+--------------------------------------
displacement | not in vluser                      31
     foreign | not in vluser                       2
  gear_ratio | not in vluser
    headroom | not in vluser
      length | $mylist2        integers >=0       47
         mpg | $mylist1        integers >=0       21
       price | not in vluser                      74
       rep78 | not in vluser                       5
       trunk | $mylist2        integers >=0       18
        turn | not in vluser                      18
      weight | $mylist1        integers >=0       64
      weight | $mylist2        integers >=0       64
----------------------------------------------------
You can refer to variable lists in all the usual shortcut ways:
vl create mylist = (q x1-x100 z*)
We can add labels to variable lists:
. vl label mylist1 "Related to gas consumption"

. vl label mylist2 "Related to size"

. vl dir, user

-------------------------------------------------------------------------------
                  |                      Macro's contents
                  |------------------------------------------------------------
Macro             |  # Vars   Description
------------------+------------------------------------------------------------
User              |
  $mylist1        |       2   Related to gas consumption
  $mylist2        |       3   Related to size
-------------------------------------------------------------------------------
5.2.3.1 Modifying User Variable Lists
First, note that with User Variable Lists, the vl move command does not work. It only works with system variable lists.
We can create new user variable lists which build off old lists with vl create. To add a new variable:
. vl create mylist3 = mylist2 + (gear_ratio)
note: $mylist3 initialized with 4 variables.

. vl list, user

--------------------------------------------------
  Variable | Macro           Values         Levels
-----------+--------------------------------------
    weight | $mylist1        integers >=0       64
       mpg | $mylist1        integers >=0       21
    weight | $mylist2        integers >=0       64
    length | $mylist2        integers >=0       47
     trunk | $mylist2        integers >=0       18
    weight | $mylist3        integers >=0       64
    length | $mylist3        integers >=0       47
     trunk | $mylist3        integers >=0       18
gear_ratio | $mylist3        noninteger
--------------------------------------------------
Variables are removed the same way with -. Note that turn is not in mylist2, so the new list below still contains all 3 variables:
. vl create mylist4 = mylist2 - (turn)
note: $mylist4 initialized with 3 variables.

. vl list, user

--------------------------------------------------
  Variable | Macro           Values         Levels
-----------+--------------------------------------
    weight | $mylist1        integers >=0       64
       mpg | $mylist1        integers >=0       21
    weight | $mylist2        integers >=0       64
    length | $mylist2        integers >=0       47
     trunk | $mylist2        integers >=0       18
    weight | $mylist3        integers >=0       64
    length | $mylist3        integers >=0       47
     trunk | $mylist3        integers >=0       18
gear_ratio | $mylist3        noninteger
    weight | $mylist4        integers >=0       64
    length | $mylist4        integers >=0       47
     trunk | $mylist4        integers >=0       18
--------------------------------------------------
Instead of adding (or removing) single variables at a time, we can instead add or remove lists. Keeping with the comment above, you do not use $ here to refer to the list.
. vl create mylist5 = mylist2 - mylist1
note: $mylist5 initialized with 2 variables.

. vl list mylist5

------------------------------------------------
Variable | Macro           Values         Levels
---------+--------------------------------------
  length | $mylist5        integers >=0       47
   trunk | $mylist5        integers >=0       18
------------------------------------------------
However, if we want to simply modify an existing list, a better approach would be the vl modify command. vl create and vl modify are similar to generate and replace; the former creates a new variable list while the latter changes an existing variable list, but the syntax right of the = is the same.
. vl modify mylist3 = mylist3 + (headroom)
note: 1 variable added to $mylist3.

. vl modify mylist3 = mylist3 - (weight)
note: 1 variable removed from $mylist3.
5.2.4 Dropping variable list
Variable lists can be dropped via vl drop.
. vl dir, user

-------------------------------------------------------------------------------
                  |                      Macro's contents
                  |------------------------------------------------------------
Macro             |  # Vars   Description
------------------+------------------------------------------------------------
User              |
  $mylist1        |       2   Related to gas consumption
  $mylist2        |       3   Related to size
  $mylist3        |       4   variables
  $mylist4        |       3   variables
  $mylist5        |       2   variables
-------------------------------------------------------------------------------

. vl drop mylist4 mylist5

. vl dir, user

-------------------------------------------------------------------------------
                  |                      Macro's contents
                  |------------------------------------------------------------
Macro             |  # Vars   Description
------------------+------------------------------------------------------------
User              |
  $mylist1        |       2   Related to gas consumption
  $mylist2        |       3   Related to size
  $mylist3        |       4   variables
-------------------------------------------------------------------------------
System lists cannot be dropped; if you run vl drop vlcontinuous it just removes all the variables from it.
5.2.5 Using Variable Lists
To be explicit, we can use variable lists in any command which would take the variables in that list. For example,
. describe $mylist3

Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
length          int     %8.0g                 Length (in.)
trunk           int     %8.0g                 Trunk space (cu. ft.)
gear_ratio      float   %6.2f                 Gear ratio
headroom        float   %6.1f                 Headroom (in.)

. describe $vlcategorical

Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
rep78           int     %8.0g                 Repair record 1978
We can also use them in a modeling setting.
. regress mpg $mylist3

      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(4, 69)        =     30.77
       Model |  1565.65298         4  391.413244   Prob > F        =    0.0000
    Residual |  877.806484        69  12.7218331   R-squared       =    0.6408
-------------+----------------------------------   Adj R-squared   =    0.6199
       Total |  2443.45946        73  33.4720474   Root MSE        =    3.5668

------------------------------------------------------------------------------
         mpg | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      length |  -.1837962   .0327629    -5.61   0.000    -.2491564   -.1184361
       trunk |  -.0103867   .1627025    -0.06   0.949    -.3349693    .3141959
  gear_ratio |   1.526952    1.27546     1.20   0.235    -1.017521    4.071426
    headroom |   .0136375   .6602514     0.02   0.984    -1.303528    1.330803
       _cons |   51.33708   8.300888     6.18   0.000     34.77727     67.8969
------------------------------------------------------------------------------
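The main payoff is that one list can feed several commands. The following is a minimal, self-contained sketch on a fresh copy of the auto data; the list name sizevars is hypothetical and is not one of the lists defined above.

```stata
sysuse auto, clear
vl set, nonotes

* Define the list once...
vl create sizevars = (weight length trunk)

* ...then reuse it, with the $ prefix, in ordinary Stata commands.
summarize $sizevars
correlate $sizevars
regress mpg $sizevars
```

If sizevars is later changed with vl modify, rerunning the last three lines picks up the new contents without any further editing.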
However, we’ll run into an issue here: how do we specify categorical variables or interactions? The vl substitute command creates “factor-variable lists” that can include factor variable indicators (i.), continuous variable indicators (c.), and interactions (# or ##). (The name “factor-variable list” is a slight misnomer; you could create a “factor-variable list” that includes no actual factors, for example, if you wanted to interact two continuous variables.)
Creating a factor-variable list via vl substitute can be done by specifying variables or variable lists.
. vl substitute sublist1 = mpg mylist3

. display "$sublist1"
mpg length trunk gear_ratio headroom

. vl dir

-------------------------------------------------------------------------------
                  |                      Macro's contents
                  |------------------------------------------------------------
Macro             |  # Vars   Description
------------------+------------------------------------------------------------
System            |
  $vldummy        |       1   0/1 variable
  $vlcategorical  |       1   categorical variable
  $vlcontinuous   |       9   continuous variables
  $vluncertain    |       0   perhaps continuous, perhaps categorical variables
  $vlother        |       0   all missing or constant variables
User              |
  $mylist1        |       2   Related to gas consumption
  $mylist2        |       3   Related to size
  $mylist3        |       4   variables
  $sublist1       |           factor-variable list
-------------------------------------------------------------------------------
Note the use of display "$listname" instead of vl list. Factor-variable lists are not just lists of variables; they can also include the features above, so they must be displayed. Note that in the vl dir output, “sublist1” has no number of variables listed, making it stand apart.
We can make this more interesting by actually including continuous/factor indicators and/or interactions.
. vl substitute sublist2 = c.mylist1##i.vldummy

. display "$sublist2"
weight mpg i.foreign i.foreign#c.weight i.foreign#c.mpg
Note the need to specify that mylist1 is continuous (with c.). It follows the normal convention that Stata assumes predictors in a model are continuous by default, unless they’re involved in an interaction, in which case it assumes they are factors by default.
. regress price $sublist2

      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(5, 68)        =     16.82
       Model |   351163805         5  70232760.9   Prob > F        =    0.0000
    Residual |   283901591        68   4175023.4   R-squared       =    0.5530
-------------+----------------------------------   Adj R-squared   =    0.5201
       Total |   635065396        73  8699525.97   Root MSE        =    2043.3

------------------------------------------------------------------------------
       price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      weight |   4.415037   .8529259     5.18   0.000      2.71305    6.117024
         mpg |    237.691   125.0383     1.90   0.062    -11.81907     487.201
             |
     foreign |
    Foreign  |   8219.603   7265.713     1.13   0.262    -6278.902    22718.11
             |
     foreign#|
    c.weight |
    Foreign  |   .7408054   1.647504     0.45   0.654    -2.546738    4.028348
             |
     foreign#|
       c.mpg |
    Foreign  |  -257.4683    155.426    -1.66   0.102     -567.616    52.67938
             |
       _cons |  -13285.44   5149.648    -2.58   0.012    -23561.41   -3009.481
------------------------------------------------------------------------------
5.2.5.1 Updating factor-variable Lists
Factor-variable lists cannot be directly modified.
. display "$sublist1"
mpg length trunk gear_ratio headroom

. vl modify sublist1 = sublist1 - mpg
sublist1 not allowed
    vlusernames containing factor variables not allowed in this context
r(198);
However, if you create a factor-variable list using only other variable lists, if those lists get updated, so does the factor-variable list!
. vl create continuous = (turn trunk)
note: $continuous initialized with 2 variables.

. vl create categorical = (rep78 foreign)
note: $categorical initialized with 2 variables.

. vl substitute predictors = c.continuous##i.categorical

. display "$predictors"
turn trunk i.rep78 i.foreign i.rep78#c.turn i.foreign#c.turn i.rep78#c.trunk i.
> foreign#c.trunk

. vl modify continuous = continuous - (trunk)
note: 1 variable removed from $continuous.

. quiet vl rebuild

. display "$predictors"
turn i.rep78 i.foreign i.rep78#c.turn i.foreign#c.turn
Note the call to vl rebuild. Among other things, it will re-generate the factor-variable lists. (It produces a vl dir output without an option to suppress it, hence the use of quiet.)
5.2.6 Stored Statistics
You may have noticed that certain characteristics of the variable are reported.
. vl list mylist3

--------------------------------------------------
  Variable | Macro           Values         Levels
-----------+--------------------------------------
  headroom | $mylist3        noninteger
     trunk | $mylist3        integers >=0       18
    length | $mylist3        integers >=0       47
gear_ratio | $mylist3        noninteger
--------------------------------------------------
This reports some characteristics of the variables (integer, whether it’s non-negative) and the number of unique values. We can also see some other statistics:
. vl list mylist3, min max obs

-------------------------------------------------------------------------------
Variable | Macro           Values         Levels       Min       Max        Obs
---------+---------------------------------------------------------------------
headroom | $mylist3        noninteger                  1.5         5         74
   trunk | $mylist3        integers >=0       18         5        23         74
  length | $mylist3        integers >=0       47       142       233         74
gear_r~o | $mylist3        noninteger                 2.19      3.89         74
-------------------------------------------------------------------------------
This is similar to codebook except faster; these characteristics are saved at the time the variable list is created or modified and are not updated automatically when the data change.
. drop if weight < 3000
(35 observations deleted)

. summarize weight

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      weight |         39    3653.846    423.5788       3170       4840

. vl list (weight), min max obs

-------------------------------------------------------------------------------
Variable | Macro           Values         Levels       Min       Max        Obs
---------+---------------------------------------------------------------------
  weight | $vlcontinuous   integers >=0       64      1760      4840         74
  weight | $mylist1        integers >=0       64      1760      4840         74
  weight | $mylist2        integers >=0       64      1760      4840         74
-------------------------------------------------------------------------------
To re-generate these stored statistics, we call vl set
again, with the update
option.
. vl set, update

-------------------------------------------------------------------------------
                  |                      Macro's contents
                  |------------------------------------------------------------
Macro             |  # Vars   Description
------------------+------------------------------------------------------------
System            |
  $vldummy        |       1   0/1 variable
  $vlcategorical  |       1   categorical variable
  $vlcontinuous   |       9   continuous variables
  $vluncertain    |       0   perhaps continuous, perhaps categorical variables
  $vlother        |       0   all missing or constant variables
-------------------------------------------------------------------------------

. vl list (weight), min max obs

-------------------------------------------------------------------------------
Variable | Macro           Values         Levels       Min       Max       Obs
---------+---------------------------------------------------------------------
  weight | $vlcontinuous   integers >=0       34      3170      4840        39
  weight | $mylist1        integers >=0       34      3170      4840        39
  weight | $mylist2        integers >=0       34      3170      4840        39
-------------------------------------------------------------------------------
When the update option is passed, the variable lists themselves are not affected; only the stored statistics are updated.
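The whole stale-then-refresh cycle can be sketched in a few lines. This is a minimal illustration, assuming a fresh Stata session with no variable lists defined yet; the list name sizevars is hypothetical and chosen only for this example.

```stata
sysuse auto, clear
vl set                                 // build the system-defined lists
vl create sizevars = (weight length)   // "sizevars" is a hypothetical list name
vl list sizevars, min max obs          // statistics as of list creation (74 obs)

drop if weight < 3000                  // the data changes ...
vl list sizevars, min max obs          // ... but the stored statistics do not

vl set, update                         // refresh stored statistics only
vl list sizevars, min max obs          // now reflects the remaining rows
```

The membership of sizevars is the same before and after the update; only the reported levels, minimum, maximum and observation count change.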
5.3 Linking data sets
In addition to allowing multiple data sets to be open at a time, we can link frames together such that rows of data in each frame are connected to each other and can interoperate. This requires a linking variable in each data set which will connect the rows. The two data sets can be at the same level or at different levels.
For example, we might have data sets collected from multiple waves of surveys and follow-ups during which the same people (modulo some non-responses) are contained in each data set. Then the person ID variable in the data sets would be the linking variable.
Another example might be one file at the person level, and another file at the city level. The linking variable would be city name, which would be unique in the city file, but could potentially be repeated in the person level file.
The command to link files is frlink, which requires specifying both the linking variable(s) and the frame to link to.
frlink 1:1 linkvar, frame(otherframe)
Let’s load some data from NHANES. Each file contains a row per subject.
. frame reset

. frame rename default demographics

. frame create diet

. frame create bp

. import sasxport5 "https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.XPT", clear

. frame diet: import sasxport5 "https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DR1TOT_I.XPT", clear

. frame bp: import sasxport5 "https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BPX_I.XPT", clear

. frame dir
* bp            9544 x 21
* demographics  9971 x 47
* diet          9544 x 168

Note: Frames marked with * contain unsaved data.
The current frame is the “demographics” frame, and the other two frames contain diet and blood pressure information. The variable seqn records the person ID.
. frlink 1:1 seqn, frame(bp)
(427 observations in frame demographics unmatched)

. frlink 1:1 seqn, frame(diet)
(427 observations in frame demographics unmatched)
The 1:1 subcommand specifies that it is a 1-to-1 link: each person has no more than one row of data in each file. An alternative is m:1, which allows multiple rows in the main file to be linked to a single row in the second frame. 1:m is not allowed at this point in time.
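To make the m:1 case concrete, here is a small self-contained sketch of the person/city situation described earlier. All frame names, variable names and values are made up for illustration, and the sketch starts by resetting frames, so run it only in a scratch session.

```stata
frames reset
frame rename default person
frame create city

frame change city
input str10 cityname population
"Lansing" 112000
"Detroit" 639000
end

frame change person
input id str10 cityname
1 "Detroit"
2 "Detroit"
3 "Lansing"
4 "Flint"
end

// Several people may share a city, but each city appears once in frame city
frlink m:1 cityname, frame(city)
list id cityname city
count if missing(city)
```

Both Detroit residents point at the same row of the city frame, while the person from Flint, which has no row in the city frame, gets a missing value in the link variable.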
These commands created two new variables, bp and diet (the same names as the linked frames), which indicate which row of the linked frame is connected with the given row.
. list bp diet in 25/29

     +-----------+
     | bp   diet |
     |-----------|
 25. | 25     25 |
 26. | 26     26 |
 27. |  .      . |
 28. | 27     27 |
 29. | 28     28 |
     +-----------+
Here we see that row 27 in the demographics file was not found in either “bp” or “diet” and thus has no entry in the bp or diet variables.
Links are tracked by these variables. We can see the current status of a link via frlink describe:
. frlink describe diet

  History:
  -----------------------------------------------------------------------------
  Link variable diet created on 18 Aug 2023 by

    . frlink 1:1 seqn, frame(diet)

  Frame diet contained an unnamed dataset
  -----------------------------------------------------------------------------

  Verifying linkage ...
  Linkage is up to date.
We can see all links from the current frame via frlink dir:
. frlink dir
  (2 frlink variables found)
  -----------------------------------------------------------------------------
  bp created by frlink 1:1 seqn, frame(bp)
  -----------------------------------------------------------------------------
  diet created by frlink 1:1 seqn, frame(diet)
  -----------------------------------------------------------------------------
  Note: Type "frlink describe varname" to find out more, including whether
        the variable is still valid.
To unlink frames, simply drop the variable.
. drop diet
Finally, the names of the created variables can be modified via the generate option to frlink:
. frlink 1:1 seqn, frame(diet) generate(linkdiet)
(427 observations in frame demographics unmatched)

. frlink dir
  (2 frlink variables found)
  -----------------------------------------------------------------------------
  bp created by frlink 1:1 seqn, frame(bp)
  -----------------------------------------------------------------------------
  linkdiet created by frlink 1:1 seqn, frame(diet) generate(linkdiet)
  -----------------------------------------------------------------------------
  Note: Type "frlink describe varname" to find out more, including whether
        the variable is still valid.
5.3.1 Working with linked frames
Once we have linked frames, we can use variables in the linked frame in analyses on the main frame.
The frget command can copy variables from the linked frame into the primary frame.
. summarize bpxchr
variable bpxchr not found
r(111);

. frget bpxchr, from(bp)
(8,033 missing values generated)
(1 variable copied from linked frame)

. summarize bpxchr

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      bpxchr |      1,938    106.5614    21.75754         58        190
With either a 1:1 or an m:1 link, frget matches rows through the link variable, so each copied value is associated with the right observation.
Alternatively, when using generate, we can reference a variable in another frame with the frval() function.
. generate nonsense = frval(linkdiet, dr1tcalc)/frval(bp, bpxpls) + dmdhrage
(3,158 missing values generated)
Note that this calculation used variables from all three frames. A less nonsensical example might be computing the share of a state's population located in a given county. Imagine the primary frame holds county-level data, and a separate frame “state”, linked through a link variable of the same name, holds state-level information.
generate percentpopulation = population/frval(state, population)
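A minimal, self-contained sketch of this county/state idea follows. The population figures are made up, and the variable names countyname, st and statepop are hypothetical; only frlink, frval() and frget come from the discussion above. The sketch resets frames, so run it only in a scratch session.

```stata
frames reset
frame rename default county
frame create state

frame change state
input str2 st population
"MI" 1000
"OH" 2000
end

frame change county
input str10 countyname str2 st population
"Alpha" "MI" 250
"Beta"  "MI" 750
"Gamma" "OH" 500
end

// Many counties per state, so this is an m:1 link; the link variable is named state
frlink m:1 st, frame(state)

// Use the state-level value directly, without copying it into this frame
generate percentpopulation = population/frval(state, population)

// Equivalent route: copy the state total under a new name, then divide
frget statepop = population, from(state)
list countyname st population statepop percentpopulation
```

frval() is convenient for a one-off calculation, while frget is the better fit when the copied variable will be reused several times. As in the text, the result is a proportion; multiply by 100 for a percentage.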
5.4 Loops
Using macros can simplify code if you have to use the same string repeatedly, but what if you want to perform the same command repeatedly with different variables? Here we can use a foreach loop. This is easiest to see with examples.
. sysuse pop2000, clear
(2000 U.S. Census population by age and sex)
The “pop2000” data contains data from the 2000 census, broken down by age and gender. The values are total counts; let's say we instead want percentages by gender. For example, what percentage of Asians aged 25-29 are male? We could generate these manually.
. generate maletotalperc = maletotal/total

. generate femtotalperc = femtotal/total

. generate malewhiteperc = malewhite/white
This gets tedious fast as we need a total of 12 lines! Notice, however, that each line has a predictable pattern:
generate <gender><race>perc = <gender><race>/<race>
We can exploit this by creating a foreach loop over the racial categories, so that each pattern only needs to be written once.
. drop *perc

. foreach race of varlist total-island {
      generate male`race'perc = male`race'/`race'
      generate fem`race'perc = fem`race'/`race'
  }

. list *perc in 1, abbreviate(100)

     +---------------------------------------------------------------+
  1. | maletotalperc | femtotalperc |  malewhiteperc |  femwhiteperc |
     |      .5116206 |     .4883794 |       .5130497 |      .4869503 |
     |---------------+--------------+----------------+---------------|
     | maleblackperc | femblackperc | maleindianperc | femindianperc |
     |      .5078017 |     .4921983 |       .5100116 |      .4899884 |
     |---------------+--------------+----------------+---------------|
     | maleasianperc | femasianperc | maleislandperc | femislandperc |
     |      .5029027 |     .4970973 |       .5167261 |      .4832739 |
     +---------------------------------------------------------------+
Let’s break down each piece of the command. The command syntax for foreach is
foreach <new macroname> of varlist <list of variables>
The loop will create a macro that you name (in the example above, it was named “race”), and repeatedly set it to each subsequent entry in the list of variables. So in the code above, first “race” is set to “total”, then the two generate
commands are run. Next, “race” is set to “white”, then the two commands are run. Etc.
Within each of the generate commands, we use the backtick-quote notation just like with macros.
Finally, we end the foreach line with an open curly brace, {, and the line after the last command within the loop has the matching close curly brace, }.
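As a second, smaller illustration of the same syntax, here is a sketch that loops over three variables of the “auto” data and prints each mean. The macro name v is arbitrary, and the choice of variables is only for illustration.

```stata
sysuse auto, clear

// `v' is set to price, then mpg, then weight
foreach v of varlist price mpg weight {
    quietly summarize `v'
    display "`v': mean = " r(mean)
}
```

Note that the macro is referenced with the backtick-quote notation both as a variable name and inside the displayed string.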
We can also nest these loops. Notice that both generate statements above are identical except for “male” vs “fem”. Let’s add an inner loop:
. drop *perc

. foreach race of varlist total-island {
      foreach gender in male fem {
          generate `gender'`race'perc = `gender'`race'/`race'
      }
  }

. list *perc in 1, ab(100)

     +---------------------------------------------------------------+
  1. | maletotalperc | femtotalperc |  malewhiteperc |  femwhiteperc |
     |      .5116206 |     .4883794 |       .5130497 |      .4869503 |
     |---------------+--------------+----------------+---------------|
     | maleblackperc | femblackperc | maleindianperc | femindianperc |
     |      .5078017 |     .4921983 |       .5100116 |      .4899884 |
     |---------------+--------------+----------------+---------------|
     | maleasianperc | femasianperc | maleislandperc | femislandperc |
     |      .5029027 |     .4970973 |       .5167261 |      .4832739 |
     +---------------------------------------------------------------+
Each time “race” gets set to a new variable, we enter another loop where “gender” gets set first to “male” then to “fem”. To help visualize it, here is what “race” and “gender” are set to each time the generate command is run:
generate command | “race” | “gender” |
---|---|---|
1 | total | male |
2 | total | fem |
3 | white | male |
4 | white | fem |
5 | black | male |
6 | black | fem |
7 | indian | male |
8 | indian | fem |
9 | asian | male |
10 | asian | fem |
11 | island | male |
12 | island | fem |
Notice that the syntax of the two foreach loops above differs slightly:
foreach <macro name> of varlist <variables>
foreach <macro name> in <list of strings>
It’s a bit annoying, but Stata handles “of” and “in” slightly differently. With “in”, Stata takes the strings on the right literally, meaning that if the above loop over race were
foreach race in total-island
then Stata would set “race” to “total-island” and the loop body would run only once! By using “of varlist”, you are telling Stata to expand the varlist, using rules such as * and -, before it sets “race” to anything.
There is also
foreach <macro name> of numlist <list of numbers>
The benefit of the “of numlist” form is that numlists support things like 1/4 representing 1, 2, 3, 4. So
foreach num of numlist 1 3/5
loops over 1, 3, 4, 5, whereas
foreach num in 1 3/5
loops over just “1” and “3/5”.
The “in” form is for when you need to loop over strings that are neither numbers nor variables (such as “male” and “fem” from above).
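The difference between the two forms is easy to check by counting iterations. In this sketch the counter macros n_numlist and n_in are hypothetical names introduced only for the illustration.

```stata
local n_numlist = 0
foreach num of numlist 1 3/5 {
    display "of numlist item: `num'"
    local ++n_numlist
}

local n_in = 0
foreach num in 1 3/5 {
    display "in item: `num'"
    local ++n_in
}

// The numlist is expanded to 1 3 4 5; the "in" list is taken literally
display "of numlist ran `n_numlist' times, in ran `n_in' times"
```

Because local macros disappear when the code finishes executing, run all of these lines together from a do-file.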
5.5 Suppressing output and errors
There are two useful command prefixes that can be handy while writing more elaborate Do-files.
5.5.1 Capturing an error
Imagine the following scenario. You want to write a Do-file that generates a new variable. However, you may need to re-run chunks of the Do-file repeatedly, so the generate statement is hit repeatedly. After the first generate, we can’t call it again and need to use replace instead. However, if we used replace, it wouldn’t work the first time! One solution is to drop the variable before we generate it:
. sysuse auto, clear
(1978 automobile data)

. drop newvar
variable newvar not found
r(111);

. generate newvar = 1
That error is awfully annoying, and when the code is run as a Do-file it stops execution at that line. However, if we prefix the command with capture, the error (and all output from the command) is “captured” and hidden.
. list price in 1/5

     +-------+
     | price |
     |-------|
  1. | 4,099 |
  2. | 4,749 |
  3. | 3,799 |
  4. | 4,816 |
  5. | 7,827 |
     +-------+

. capture list price in 1/5

. list abcd
variable abcd not found
r(111);

. capture list abcd
Therefore, the best way to generate our new variable is
. capture drop newvar

. generate newvar = 1
5.5.1.1 Return Code
When you capture a command that errors, Stata saves the error code in the built-in scalar _rc.
. list abc
variable abc not found
r(111);

. capture list abc

. display _rc
111
If the command does not error, _rc contains 0.
. capture list price

. display _rc
0
This can be used to run additional code if an error occurs:
capture <some command>
if _rc > 0 {
  ...
}
else {
  ...
}
If the command inside the capture errors, _rc is greater than 0 and the if block will run. If the command runs without error, _rc is 0 and the else block will run.
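A concrete, runnable version of this pattern is sketched below. The variable name notavariable is deliberately one that does not exist in the “auto” data, and copying _rc into a local macro named rc is a precaution assumed here, since any later capture would overwrite _rc.

```stata
sysuse auto, clear

capture summarize notavariable
local rc = _rc               // keep the return code before it can be overwritten

if `rc' > 0 {
    // Runs when the captured command failed
    display "summarize failed with return code `rc'"
}
else {
    // Runs when the captured command succeeded
    display "mean = " r(mean)
}
```

Replacing notavariable with price sends execution through the else branch instead.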
Say you wanted to rename a variable if it exists, and if it doesn’t exist, create it. (For example, you have to process a large number of files, and in some files, this variable may be missing for all rows and thus not reported.) You could run the following:
capture rename oldvar newvar
if _rc > 0 {
generate newvar = .
}
5.5.2 Quieting the output
quietly suppresses a command's output just as capture does, except that it does not hide errors. It is especially useful in combination with stored results:
. quietly summarize price

. display r(mean)
6165.2568
This will come in very handy when you start running statistical models, where the output can be longer than a single screen but you only want a small piece of it.
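As a hedged sketch of that use, the following fits a regression silently and then prints only two pieces of it. The model itself (price on mpg and weight) is an arbitrary choice for illustration.

```stata
sysuse auto, clear

// Fit the model without printing the full regression table
quietly regress price mpg weight

// Pull out only the pieces of interest from the stored results
display "Observations:       " e(N)
display "R-squared:          " e(r2)
display "Coefficient on mpg: " _b[mpg]
```

If the model were misspecified (say, a misspelled variable name), quietly would still show the error, which is usually what you want here.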
Just to make the difference between capture and quietly clear:
. list price in 1/5

     +-------+
     | price |
     |-------|
  1. | 4,099 |
  2. | 4,749 |
  3. | 3,799 |
  4. | 4,816 |
  5. | 7,827 |
     +-------+

. quietly list price in 1/5

. capture list price in 1/5

. list abcd in 1/5
variable abcd not found
r(111);

. quietly list abcd in 1/5
variable abcd not found
r(111);

. capture list abcd in 1/5
With a command that doesn’t error (listing price), both quietly and capture perform the same. However, with a command that does error, quietly still errors, whereas capture just ignores it!
There are obviously other ways to compute this, but this gives a flavor of the use.